Concept | Top N recipe#

Watch the video or read the summary below.

The Top N recipe allows you to extract the highest or lowest rows of a dataset based on one or more sort columns, including with respect to groups.

Use case#

Throughout an analytics process, there are many potential situations where you may need to extract rows that represent the largest or smallest records according to some feature. You might also need to do this with respect to groups within a population.

If you imagine a dataset of credit card transactions, where each row is a unique transaction, you could find the Top N recipe helpful in any of the following situations:

Common use case

Example

Identify the most significant records.

Rank and retain the N highest value transactions to understand where the largest purchases are occurring.

Find the top performers within each group.

Select the N largest transactions for each cardholder or merchant.

Validate data quality and investigate outliers.

Review the N largest and smallest transaction amounts to identify unexpected values or potential data issues.

Prioritize records for review or action.

Send the N highest risk transactions for manual review.

Partition a dataset into a target population and the remainder.

Select the top N customers for a marketing campaign, while writing all remaining customers to a separate dataset.

Reduce data volume before downstream processing.

Keep only the N most recent transactions before performing a computationally expensive analysis.

A concrete example may better illustrate this kind of transformation. Compare the two datasets below. The dataset on the right has extracted the top two transactions per customer by amount and assigned a rank within each customer.

Transactions#

date

customer

amount

2020-07-12

Carol

2

2020-07-10

Carol

3

2020-01-25

Alice

1

2020-01-05

Bob

4

2020-07-21

Alice

6

2020-10-01

Carol

5

Top 2 amounts per customer#

customer

amount

_rank

Alice

6

1

Alice

1

2

Bob

4

1

Carol

5

1

Carol

3

2

Top N with other technologies#

The Top N pattern is common across many analytical tools. While syntax differs, the underlying operation is the same: sort a dataset by a metric and retain only the highest (or lowest) ranked rows.

Technology

Typical expression

Excel

Sort a column and take the first N rows, or use formulas like LARGE / SORT + FILTER to extract the top values.

SQL

ORDER BY ... + LIMIT N or ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)

Python (pandas)

sort_values() + optional groupby() + head(n)

R (tidyverse)

arrange() + optional group_by() + slice_head(n)

You could perform the same SQL, Python, or R operation with a code recipe. However, in Dataiku, you’ll often want to reach for the visual Top N recipe for these kinds of data transformations. This recipe unifies multiple transformations into a single reusable operation with built-in ranking, grouping, and dataset splitting.

Top N configuration#

To configure the Top N recipe, you need to understand the following five fields found on the Top N step. The example column refers to the tables above.

Top N step field

Purpose

Example

Retrieve the (1)

The number of rows to return from the top or bottom of the dataset

2, 0

sorted by (2)

The column(s) and the direction by which to arrange the rows

amount (descending)

from (3)

Whole dataset or group subsets

customer

In addition, compute for each row (4)

Optional count or rank columns

rank of row within its group

Remaining rows (5)

Option to send non-qualifying rows to another dataset

remaining_transactions

Screenshot of the Top N recipe in Dataiku.

Tip

When defining a sort column(s), think about whether the resulting order is deterministic. In this example, amount, on its own, may result in a number of ties. Different recipe engines may handle ties differently. You’ll want to think about possible edge cases when configuring the recipe.

Next steps#

Continue learning about this recipe by working through the Tutorial | Top N recipe article.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Visual Recipes. When ready, challenge yourself to earn a certification!