Concept | Sort recipe

The Sort recipe allows you to sort the rows of an input dataset by the values of one or more columns in the dataset.

Use case

Let’s say that our dataset provides customer information and includes a revenue prediction column.

../../_images/sort-before.png

Our goal is to output a dataset sorted by country and, within each country, by predicted revenue.

Sort configuration

Here’s how we can configure the Sort recipe:

../../_images/sort-recipe.png

By default, the Sort recipe sorts rows by each selected column in ascending order. To meet our business goal, we’ll change the sort option so that revenue predictions sort in descending order.

Additionally, we need to pay careful attention to the order of the sort columns, because it determines their priority. Here, we make sure that the data is sorted first by ip_country and then by prediction, so that predictions are ordered within each country. You can change this order by dragging and dropping the column fields.
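The effect of column priority and per-column direction can be sketched outside Dataiku. The following is a minimal pandas analogy, not the recipe's implementation: the column names ip_country and prediction come from the example above, while customer_id and all data values are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; only the column names ip_country and
# prediction are taken from the example dataset.
customers = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d", "e"],
    "ip_country": ["US", "FR", "US", "FR", "DE"],
    "prediction": [120.0, 80.0, 300.0, 95.0, 50.0],
})

# Sort first by country (ascending), then by prediction (descending).
# The order of the names in `by` sets the priority of each sort column.
sorted_customers = customers.sort_values(
    by=["ip_country", "prediction"],
    ascending=[True, False],
).reset_index(drop=True)

print(sorted_customers)
```

Swapping the two names in `by` would instead order the whole dataset by prediction and use the country only to break ties, which is why the column order matters.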

Finally, we can choose to compute additional columns for each row: the row number, the rank, and the dense rank. We’ll select all three options so we can compare their outputs.

Output

After running the recipe, our output dataset contains rows sorted by the customer’s country and then by predicted revenue. In addition, Dataiku has appended three computed columns, which are explained below.

../../_images/sort-output.png

Let’s understand the three new columns:

  • _row_number contains each row’s position in the sorted output.

  • _rank contains a row’s ranking based on its value in the sorting column(s). Rows with identical sort values share the same rank, and the next rank skips ahead by the number of tied rows (for example, 1, 2, 2, 4).

  • _dense_rank contains the dense rank of each row. It works like _rank, except that no ranks are skipped after a tie, so rankings stay consecutive (for example, 1, 2, 2, 3).
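To make the difference between the three columns concrete, here is a small pandas sketch that reproduces the same kinds of values on a single sort column with one tie. It is an illustration under assumptions, not Dataiku's implementation: the data is hypothetical, and the output column names simply mirror the ones listed above.

```python
import pandas as pd

# Hypothetical predictions containing one tie (two rows at 120.0).
scores = pd.DataFrame({"prediction": [300.0, 120.0, 120.0, 95.0]})

# Sort in descending order, as in the example recipe configuration.
ordered = scores.sort_values("prediction", ascending=False).reset_index(drop=True)

# Row number: position in the sorted output, starting at 1.
ordered["_row_number"] = range(1, len(ordered) + 1)

# Rank: tied rows share a rank, and the following rank is skipped.
ordered["_rank"] = (
    ordered["prediction"].rank(method="min", ascending=False).astype(int)
)

# Dense rank: tied rows share a rank, and no rank is skipped.
ordered["_dense_rank"] = (
    ordered["prediction"].rank(method="dense", ascending=False).astype(int)
)

print(ordered)
```

In this sketch the two tied rows receive row numbers 2 and 3 but share rank 2; the last row then gets rank 4 and dense rank 3.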

You’ll be able to choose the appropriate computations depending on your own use case.