Tutorial | Top N Recipe (Advanced Designer part 5)¶
Recall that the Explore tab of a dataset only includes a sample of the actual dataset. Accordingly, sorting a column in that view only sorts the rows in the current sample. The true minimum or maximum value for a column might not be included.
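To make this concrete, here is a small pandas sketch (a toy illustration, not part of the tutorial project) of how a sample's maximum can differ from the full dataset's maximum:

```python
import pandas as pd

# A toy column of purchase amounts; the true maximum is 100.
df = pd.DataFrame({"purchase_amount": [5, 99, 3, 42, 7, 100]})

# The Explore tab works on a sample -- here, the first four rows.
sample = df.head(4)

print(sample["purchase_amount"].max())  # 99 -- the sample's maximum
print(df["purchase_amount"].max())      # 100 -- the true maximum
```

Sorting the sample would put 99 on top, even though the dataset's true maximum is 100.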
The Top N recipe allows you to retrieve records from a dataset based on the top and bottom values of a given column, or groups within a column. This can be especially helpful when publishing results to a dashboard.
Let’s get started!¶
Using the same credit card transactions project from other Advanced Designer tutorials, you’ll learn how to:
retrieve records of a dataset based on the top and/or bottom values of a column using the visual Top N recipe
create new columns with the Computed Columns step of the Top N recipe.
Advanced Designer Prerequisites
This lesson assumes that you have basic knowledge of working with Dataiku datasets and recipes.
If not already on the Advanced Designer learning path, completing the Core Designer Certificate is recommended.
To complete the Advanced Designer learning path, you’ll need access to an instance of Dataiku (version 8.0 or above) with the following plugins installed:
Census USA (minimum version 0.3)
These plugins are available through the Dataiku Plugin store (or for Dataiku Cloud users, the Plugins panel of the Launchpad), and you can find the instructions for installing plugins in the reference documentation. To check whether the plugin is already installed on your instance, go to the Installed tab in the Plugin Store to see a list of all installed plugins.
If your goal is to complete only the tutorials in Visual Recipes 102, the Census USA plugin is not required.
The following lessons explain the concepts you’ll be working with in this hands-on lesson:
The final Flow, after adding the Top N recipe, is shown below.
Create your project¶
Click +New Project > DSS Tutorials > Advanced Designer > Visual Recipes & Plugins (Tutorial).
If you’ve already completed the Advanced Formula & Regex hands-on tutorials, you can use the same project.
You can also download the starter project from this website and import it as a zip file.
Change dataset connections (optional)
Aside from the input datasets, all of the others are empty managed filesystem datasets.
You are welcome to leave the storage connection of these datasets in place, but you can also use another storage system depending on the infrastructure available to you.
To use another connection, such as a SQL database, follow these steps:
Select the empty datasets from the Flow. (On a Mac, hold Shift to select multiple datasets).
Click Change connection in the “Other actions” section of the Actions sidebar.
Use the dropdown menu to select the new connection.
For a dataset that is already built, changing to a new connection clears the dataset, so it will need to be rebuilt.
Another way to select datasets is from the Datasets page (G+D). There are also programmatic ways of doing operations like this that you’ll learn about in the Developer learning path.
The screenshots below demonstrate using a PostgreSQL database.
Whether starting from an existing or fresh project, ensure that the dataset transactions_known_prepared is built, and its schema includes the columns created in the Window recipe.
See build details here if necessary
From the Flow, select the end dataset required for this tutorial: transactions_known_prepared
Choose Build from the Actions sidebar.
Choose Recursive > Smart reconstruction.
Click Build to start the job, or click Preview to view the suggested job.
If previewing, in the Jobs tab, you can see all the activities that Dataiku will perform.
Click Run, and observe how Dataiku progresses through the list of activities.
See schema propagation details here if necessary
If the transactions_known_prepared dataset does not include columns like card_purchase_amount_min, then you need to propagate the schema changes downstream. If you completed the Advanced Formula & Regex tutorial, this should already be done.
Enter the compute_transactions_known_prepared recipe.
Click Run from inside the recipe editor.
Accept the schema change update, dropping and recreating the output.
Confirm the output dataset includes the Window-generated columns.
See the reference documentation on schema propagation to learn more.
Retrieve top and bottom values¶
Let’s find the largest and smallest purchases for every card_id.
From the Actions menu of the transactions_known_prepared dataset, choose Top N.
Name the output dataset top_purchase_amt_by_card, and click Create Recipe.
The first step is to determine how many top and/or bottom rows to return, and according to which column(s).
In the Top N step, specify the number of top and bottom rows to retrieve.
Select purchase_amount as the column to sort by.
Click the icon to the right of the selected column to sort in descending order.
By default, Dataiku retrieves and sorts the top and bottom values from the whole dataset.
Change this behavior in the “from” section by selecting each group of rows identified by… and specifying card_id as the column to use as key.
In addition, for each row, choose to compute the count of rows in its group and the rank of each row within its group.
The Top N recipe provides the option of returning all or a selection of columns in the output dataset.
On the Retrieve columns step, choose to retrieve “A selection of columns.”
Select the columns you want to retrieve, and then Run the recipe.
For every card_id, the resulting dataset displays the top and bottom purchases according to the sort column (purchase_amount), as well as the other retrieved columns.
In addition, the _rank column shows how each transaction ranks from highest to lowest within its group, and the _duplicate_count column shows the total number of transactions made with a given card. Applying a filter to a single card_id makes this easier to see.
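Outside Dataiku, the same per-group logic can be sketched in pandas. This is a hypothetical equivalent of the recipe, assuming columns named card_id and purchase_amount, with N set to 1 for brevity; the helper column names _rank and _count are illustrative, not the recipe's exact output names:

```python
import pandas as pd

df = pd.DataFrame({
    "card_id": ["A", "A", "A", "A", "B", "B", "B"],
    "purchase_amount": [10.0, 75.0, 33.0, 5.0, 120.0, 60.0, 90.0],
})

N = 1  # number of top and bottom rows to keep per group

# Rank rows within each card_id, highest purchase first,
# and count the number of rows in each group.
df["_rank"] = (df.groupby("card_id")["purchase_amount"]
                 .rank(method="first", ascending=False)
                 .astype(int))
df["_count"] = df.groupby("card_id")["purchase_amount"].transform("count")

# Keep the top N and bottom N rows of each group.
top_bottom = df[(df["_rank"] <= N) | (df["_rank"] > df["_count"] - N)]
top_bottom = top_bottom.sort_values(["card_id", "_rank"]).reset_index(drop=True)
print(top_bottom)
```

For card A, this keeps the largest purchase (75.0) and the smallest (5.0); for card B, it keeps 120.0 and 60.0.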
Compute additional columns¶
Certain visual recipes also offer the option of computing additional columns within the recipe itself, instead of requiring a separate Prepare recipe.
Let’s compute the range of purchase_amount for the rows and groups specified in the recipe.
On the Computed columns step of the Top N recipe, click +Add a Computed Column.
Name the new computed column card_purchase_amount_range.
In the Mode dropdown menu, you can choose between DSS formula and SQL Expression. Keep the default selection, DSS formula.
To compute the difference between the min and max of the card purchase amount, type the following Dataiku formula expression into the formula editor:
(card_purchase_amount_max - card_purchase_amount_min)
The correct storage type in this case (double) should already be specified.
Run the recipe again, updating the schema.
The output dataset contains the newly computed column card_purchase_amount_range.
Instead of using the Computed columns step in the Top N recipe, we could also have used the Formula processor in a Prepare recipe. However, the Computed columns step offers flexibility in how and when this column is calculated.
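For comparison, the same computed column can be expressed in pandas (a sketch assuming the Window-generated columns card_purchase_amount_min and card_purchase_amount_max already exist, with toy values):

```python
import pandas as pd

df = pd.DataFrame({
    "card_id": ["A", "B"],
    "card_purchase_amount_min": [5.0, 60.0],
    "card_purchase_amount_max": [75.0, 120.0],
})

# Same expression as the DSS formula: max minus min, evaluated per row.
df["card_purchase_amount_range"] = (
    df["card_purchase_amount_max"] - df["card_purchase_amount_min"]
)
print(df["card_purchase_amount_range"].tolist())  # [70.0, 60.0]
```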
In this lesson, we used the Top N recipe in Dataiku to filter a dataset based on the top and bottom values of some of its rows.
We also learned to display aggregated row statistics and to create additional columns in a dataset using the Computed columns step in the Top N recipe.
Now you can take your advanced data preparation skills to the next level with other Academy courses, such as the Plugin Store course.