Tutorial | Top N Recipe (Advanced Designer part 5)¶
Recall that the Explore tab of a dataset only includes a sample of the actual dataset. Accordingly, sorting a column in that view only sorts the rows in the current sample. The true minimum or maximum value for a column might not be included.
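To make this concrete, here is a small pandas sketch (a toy illustration, not part of the tutorial project) of how a sample's maximum can differ from the full dataset's maximum:

```python
import pandas as pd

# A toy column of purchase amounts; the true maximum is 100.
df = pd.DataFrame({"purchase_amount": [5, 99, 3, 42, 7, 100]})

# The Explore tab works on a sample -- here, the first four rows.
sample = df.head(4)

print(sample["purchase_amount"].max())  # 99 -- the sample's maximum
print(df["purchase_amount"].max())      # 100 -- the true maximum
```

Sorting the sample would put 99 on top, even though the dataset's true maximum is 100.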
The Top N recipe allows you to retrieve records from a dataset based on the top and bottom values of a given column, or groups within a column. This can be especially helpful when publishing results to a dashboard.
Let’s get started!¶
Using the same credit card transactions project from other Advanced Designer tutorials, you’ll learn how to:
retrieve records of a dataset based on the top and/or bottom values of a column using the visual Top N recipe
create new columns with the Computed Columns step of the Top N recipe.
Advanced Designer Prerequisites
This lesson assumes that you have basic knowledge of working with Dataiku datasets and recipes.
If not already on the Advanced Designer learning path, completing the Core Designer Certificate is recommended.
To complete the Advanced Designer learning path, you’ll need access to an instance of Dataiku (version 8.0 or above) with the following plugins installed:
Census USA (minimum version 0.3)
These plugins are available through the Dataiku Plugin store (or for Dataiku Cloud users, the Plugins panel of the Launchpad), and you can find the instructions for installing plugins in the reference documentation. To check whether the plugin is already installed on your instance, go to the Installed tab in the Plugin Store to see a list of all installed plugins.
If your goal is to complete only the tutorials in Visual Recipes 102, the Census USA plugin is not required.
The following lessons explain the concepts you’ll be working with in this hands-on lesson:
The final Flow, after adding the Top N recipe, is shown below.
Create your project¶
Click +New Project > DSS Tutorials > Advanced Designer > Visual Recipes & Plugins (Tutorial).
If you’ve already completed the Advanced Formula & Regex hands-on tutorials, you can use the same project.
You can also download the starter project from this website and import it as a zip file.
Change dataset connections (optional)
Aside from the input datasets, all of the others are empty managed filesystem datasets.
You are welcome to leave the storage connection of these datasets in place, but you can also use another storage system depending on the infrastructure available to you.
To use another connection, such as a SQL database, follow these steps:
Select the empty datasets from the Flow. (On a Mac, hold Shift to select multiple datasets).
Click Change connection in the “Other actions” section of the Actions sidebar.
Use the dropdown menu to select the new connection.
For a dataset that is already built, changing to a new connection clears the dataset, so it will need to be rebuilt.
Another way to select datasets is from the Datasets page (G+D). There are also programmatic ways of doing operations like this that you’ll learn about in the Developer learning path.
The screenshots below demonstrate using a PostgreSQL database.
Whether starting from an existing or fresh project, ensure that the dataset transactions_known_prepared is built, and its schema includes the columns created in the Window recipe.
See build details here if necessary
From the Flow, select the end dataset required for this tutorial: transactions_known_prepared
Choose Build from the Actions sidebar.
Choose Recursive > Smart reconstruction.
Click Build to start the job, or click Preview to view the suggested job.
If previewing, in the Jobs tab, you can see all the activities that Dataiku will perform.
Click Run, and observe how Dataiku progresses through the list of activities.
See schema propagation details here if necessary
If the transactions_known_prepared dataset does not include columns like card_purchase_amount_min, then you need to propagate the schema changes downstream. If you completed the Advanced Formula & Regex tutorial, this should already be done.
Enter the compute_transactions_known_prepared recipe.
Click Run from inside the recipe editor.
Accept the schema change update, dropping and recreating the output.
Confirm the output dataset includes the Window-generated columns.
See the reference documentation on schema propagation to learn more.
Retrieve top and bottom values¶
Let’s find the largest and smallest purchases for every card_id.
From the Actions menu of the transactions_known_prepared dataset, choose Top N.
Name the output dataset top_purchase_amt_by_card, and click Create Recipe.
The first step is to determine how many top and/or bottom rows to return, and according to which column(s).
In the Top N step, specify the number of top and bottom rows to retrieve.
Select purchase_amount as the column to sort by.
Click the icon to the right of the selected column to sort in descending order.
By default, Dataiku retrieves and sorts the top and bottom values from the whole dataset.
Change this behavior in the “from” section by selecting each group of rows identified by… and specifying card_id as the column to use as key.
In addition, for each row, choose to compute the count of rows in its group and the rank of each row within its group.
The Top N recipe provides the option of returning all or a selection of columns in the output dataset.
On the Retrieve columns step, choose to retrieve “A selection of columns.”
Select the columns you want to retrieve, and then Run the recipe.
For every card_id, the resulting dataset displays the top and bottom purchases according to the sort column (purchase_amount), as well as the other retrieved columns.
In addition, the _rank column shows how each transaction ranks from highest to lowest within its group, and the _duplicate_count column shows the total number of transactions made with a given card. Applying a filter to a single card_id makes this easier to see.
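Outside Dataiku, the same per-group logic can be sketched in pandas. This is a hypothetical equivalent of the recipe, assuming columns named card_id and purchase_amount, with N set to 1 for brevity; the helper column names _rank and _count are illustrative, not the recipe's exact output names:

```python
import pandas as pd

df = pd.DataFrame({
    "card_id": ["A", "A", "A", "A", "B", "B", "B"],
    "purchase_amount": [10.0, 75.0, 33.0, 5.0, 120.0, 60.0, 90.0],
})

N = 1  # number of top and bottom rows to keep per group

# Rank rows within each card_id, highest purchase first,
# and count the number of rows in each group.
df["_rank"] = (df.groupby("card_id")["purchase_amount"]
                 .rank(method="first", ascending=False)
                 .astype(int))
df["_count"] = df.groupby("card_id")["purchase_amount"].transform("count")

# Keep the top N and bottom N rows of each group.
top_bottom = df[(df["_rank"] <= N) | (df["_rank"] > df["_count"] - N)]
top_bottom = top_bottom.sort_values(["card_id", "_rank"]).reset_index(drop=True)
print(top_bottom)
```

For card A, this keeps the largest purchase (75.0) and the smallest (5.0); for card B, it keeps 120.0 and 60.0.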
Compute additional columns¶
Certain visual recipes also offer the option of computing additional columns within the recipe itself, instead of requiring a separate Prepare recipe.
Let’s compute the range of purchase_amount for the rows and groups specified in the recipe.
On the Computed columns step of the Top N recipe, click +Add a Computed Column.
Name the new computed column card_purchase_amount_range.
In the Mode dropdown menu, you can choose between DSS formula and SQL Expression. Keep the default selection, DSS formula.
To compute the difference between the min and max of the card purchase amount, type the following Dataiku formula expression into the formula editor:
(card_purchase_amount_max - card_purchase_amount_min)
The correct storage type in this case (double) should already be specified.
Run the recipe again, updating the schema.
The output dataset contains the newly computed column card_purchase_amount_range.
Instead of using the Computed columns step in the Top N recipe, we could also have used the Formula processor in a Prepare recipe. However, the Computed columns step offers flexibility in how and when this column is calculated.
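For comparison, the same computed column can be expressed in pandas (a sketch assuming the Window-generated columns card_purchase_amount_min and card_purchase_amount_max already exist, with toy values):

```python
import pandas as pd

df = pd.DataFrame({
    "card_id": ["A", "B"],
    "card_purchase_amount_min": [5.0, 60.0],
    "card_purchase_amount_max": [75.0, 120.0],
})

# Same expression as the DSS formula: max minus min, evaluated per row.
df["card_purchase_amount_range"] = (
    df["card_purchase_amount_max"] - df["card_purchase_amount_min"]
)
print(df["card_purchase_amount_range"].tolist())  # [70.0, 60.0]
```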
In this lesson, we used the Top N recipe in Dataiku to filter a dataset based on the top and bottom values of some of its rows.
We also learned to display aggregated row statistics and to create additional columns in a dataset using the Computed columns step in the Top N recipe.
Now you can take your advanced data preparation skills to the next level with other Academy courses, such as the Plugin Store course.