Concept | Distinct recipe#

Watch the video

The Distinct recipe allows you to filter a dataset in order to remove some of its duplicate rows.

Use case#

In the following example, we have a simple dataset that gives us information about products, their categories, and their prices.

Use case overview.

We can see that there are duplicate rows for two products: the Computer Mouse and the Refrigerator.

One use case of the Distinct Recipe might be to remove these duplicate rows. After applying the Distinct recipe, the output dataset will only contain unique rows.

Recipe configuration#

Operation modes#

The Distinct Recipe has two operation modes.

Operation mode

Description

Remove duplicates

This lets you identify all rows that have the exact same values on all columns and keep only one of them.

Find distinct values of a subset of all columns

This allows you to choose a sample of the available columns in order to compute the distinct rows related to their combinations. For example, if you select two columns, the resulting dataset will have exactly two columns.

Additional option#

In addition to setting the operation mode, you can configure Dataiku so that it computes the count of original rows for each deduplicated output row. This adds a count column to the output dataset.

What’s next?#

Continue learning about this recipe by working through the Tutorial | Distinct recipe article.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Visual Recipes. When ready, challenge yourself to earn a certification!