Concept | Distinct recipe#

Watch the video or read the summary below.

The Distinct recipe can either enforce uniqueness in a dataset or extract unique values or entities for further analysis or downstream usage.

Use case#

The most basic purpose of the Distinct recipe is deduplication: removing duplicate rows from a dataset. However, the recipe also allows you to define uniqueness by a subset of columns instead of an entire row. Therefore, you can use the Distinct recipe to validate assumptions about data uniqueness, create reference datasets, and analyze how frequently values occur.

Use case

Example

Deduplicate rows (enforce uniqueness)

Remove duplicate transaction records so each row appears only once.

Validate uniqueness assumptions

Verify that each transaction_id appears only once. Detect whether a join or data ingestion step introduced unexpected duplicate rows.

Create a unique set of entities or values

Generate a unique list of customers or merchants for use in a lookup table, filter list, or downstream processing step.

Profile value frequency

Compute how many times each merchant or customer appears to understand distribution, identify repeated values, or assess data skew using duplicate counts.

Two concrete examples may better illustrate these kinds of transformation. The first is for pure deduplication. The dataset on the right is the deduplicated version of the dataset on the left.

Raw transactions (before)#

transaction_id

customer

amount

1

Alice

10

2

Bob

20

2

Bob

20

3

Alice

15

3

Alice

15

After (full-row deduplication)#

transaction_id

customer

amount

1

Alice

10

2

Bob

20

3

Alice

15

The Distinct recipe can also define uniqueness with respect to a subset of columns and, optionally, compute the number of occurrences for each combination.

Raw transactions (before)#

transaction_id

customer

amount

1

Alice

10

2

Bob

20

3

Alice

15

4

Bob

20

5

Alice

10

Distinct values + counts (customer, amount)#

customer

amount

count

Alice

10

2

Alice

15

1

Bob

20

2

Deduplication with other technologies#

Deduplication is a common need across many analytical tools. While syntax differs, the underlying operation is largely the same.

Technology

Typical expression

Excel

Use Remove Duplicates on selected columns, or use UNIQUE() to return a deduplicated list of values or rows.

SQL

SELECT DISTINCT ... to retrieve unique rows or GROUP BY ... when computing frequency counts.

Python (pandas)

drop_duplicates() or unique() / nunique() for distinct values in a column.

R (tidyverse)

distinct() to keep unique rows, optionally specifying columns to define uniqueness.

You could perform the same SQL, Python, or R operation with a code recipe. However, in Dataiku, you’ll often want to reach for the visual Distinct recipe for these kinds of data transformations.

Distinct recipe configuration#

To configure the Distinct recipe, you need to understand two fields. The example column refers to the tables above.

Distinct step field

Purpose

Operation mode (1)

Whether to remove duplicate rows considering all columns or find distinct values of a subset of columns.

Settings (2)

Whether to compute the count of each deduplicated row

Dataiku screenshot of the configuration of a Distinct recipe.

Tip

Like many other visual recipes, the Distinct recipe has pre-filter and post-filter steps. When computing counts, the Post-filter step can be particularly useful. For example, to find only records that appear more than once.

Next steps#

Continue learning about this recipe by working through the Tutorial | Distinct recipe article.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Visual Recipes. When ready, challenge yourself to earn a certification!