Tutorial | Distinct recipe#

Let’s practice using the Distinct recipe to remove duplicate rows in Dataiku!

Objectives#

In this tutorial, you will:

  • Remove duplicates on all columns.

  • Check that each row in your dataset is unique.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Access to an instance of Dataiku 12+.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

  • You may also want to review this tutorial’s associated concept article.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Distinct recipe.

  2. From the project homepage, click Go to Flow.

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Create the Distinct recipe#

We’ll create a Distinct recipe from the tx_prepared dataset.

  1. From the Flow, select the tx_prepared dataset and click on the Distinct recipe from the Actions tab.

  2. Change the output name to tx_distinct.

  3. Click Create Recipe.

Screenshot of the "New distinct recipe" dialogue window.

Remove duplicates on all columns#

We’ll start with the default settings to understand what happens when you remove duplicates on all columns.

  1. Under Operation mode, keep the default selection Remove duplicates (on all columns).

  2. Make sure Compute count of original rows for each deduplicated output row is checked under Settings.

  3. Run the recipe and then open the output dataset.

A screenshot of Distinct recipe's default settings.

Here you can see a new column: count. Quickly scroll and see that all counts = 1 in this sample. Let’s confirm that this is true and that there were no duplicates in the dataset.

Set the post-filter#

  1. From the output dataset, click Parent Recipe.

  2. Navigate to the Post-filter step of the recipe.

  3. Toggle On the Filter tile.

  4. Add a condition where count > 1.

  5. Run the recipe and then open the output dataset.

A screenshot of the Post-filter step for which the count is greater than one.

Because there are no records in the output dataset, we are now certain that there were no entirely duplicate rows in the dataset. Next we will try to find duplicates on a subset of columns!

Find distinct values of a subset of all columns#

Let’s find out if there are any duplicate combinations of purchase_date and card_id. This will tell us if any card was used more than once in one day.

  1. Reopen the Distinct recipe.

  2. For the Operation mode, choose Find distinct values of a subset of all columns.

  3. Add or keep purchase_date and card_id in the Selected columns box.

  4. Check the box Compute count of original rows for each deduplicated output row.

  5. Make sure the post-filter is still active.

  6. Run the recipe and then open the output dataset.

A screenshot of the column selection available under Settings in the Distinct recipe.

In the new tx_distinct dataset, you should see all records that have one or more duplicates across these two columns. In other words, this output shows all instances where a customer made multiple purchases using the same card on the same day.

Note

Finding duplicates is particularly useful when preparing to use the Window recipe, as it helps to confirm whether or not rows will have a unique order.

What’s next?#

You just practiced using the Distinct recipe to find duplicates in your data.

If you want, you can try out another recipe in Tutorial | Window recipe!