Tutorial | Pivot Recipe (Advanced Designer part 4)¶
The Pivot Recipe transforms datasets into pivot tables, which are tables of summary statistics. This operation can also be referred to as reshaping from long to wide format.
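As a point of reference outside of Dataiku, the same long-to-wide reshaping idea can be sketched in pandas (this is purely illustrative and not part of the tutorial steps; the column names and values below are made up for the example):

import pandas as pd

# A "long" table: one row per transaction
long_df = pd.DataFrame({
    "merchant_state": ["Delaware", "Delaware", "Ohio", "Ohio"],
    "card_fico_range": ["GOOD", "BAD", "GOOD", "GOOD"],
    "purchase_amount": [20.0, 35.0, 10.0, 40.0],
})

# A "wide" pivot table: one row per state, one column per FICO range,
# each cell holding a summary statistic (here, the average amount)
wide_df = long_df.pivot_table(
    index="merchant_state",
    columns="card_fico_range",
    values="purchase_amount",
    aggfunc="mean",
)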
Let’s get started!¶
In this tutorial, you’ll learn how to:
create a pivot table using the visual Pivot recipe;
define a project variable; and
apply a Pre-filter within a Pivot recipe.
We will continue to work with the credit card transactions use case found in other Advanced Designer tutorials, and we’ll use the transactions_known_prepared dataset as input for the Pivot recipe.
Advanced Designer Prerequisites
This lesson assumes that you have basic knowledge of working with Dataiku datasets and recipes.
Note
If not already on the Advanced Designer learning path, completing the Core Designer Certificate is recommended.
To complete the Advanced Designer learning path, you’ll need access to an instance of Dataiku (version 8.0 or above) with the following plugins installed:
Census USA (minimum version 0.3)
These plugins are available through the Dataiku Plugin store (or for Dataiku Cloud users, the Plugins panel of the Launchpad), and you can find the instructions for installing plugins in the reference documentation. To check whether the plugin is already installed on your instance, go to the Installed tab in the Plugin Store to see a list of all installed plugins.

Note
If your goal is to complete only the tutorials in Visual Recipes 102, the Census USA plugin is not required.
Concept lessons in the Advanced Designer learning path explain the ideas you'll be working with in this hands-on tutorial.
Create your project¶
Click +New Project > DSS Tutorials > Advanced Designer > Visual Recipes & Plugins (Tutorial).
Note
If you’ve already completed the Advanced Formula & Regex hands-on tutorials, you can use the same project.
Note
You can also download the starter project from this website and import it as a zip file.
Change dataset connections (optional)
Aside from the input datasets, all of the other datasets in the Flow are empty, managed filesystem datasets.
You are welcome to leave the storage connection of these datasets in place, but you can also use another storage system depending on the infrastructure available to you.
To use another connection, such as a SQL database, follow these steps:
Select the empty datasets from the Flow. (On a Mac, hold Shift to select multiple datasets).
Click Change connection in the “Other actions” section of the Actions sidebar.
Use the dropdown menu to select the new connection.
Click Save.
Note
For a dataset that is already built, changing to a new connection clears the dataset, so it will need to be rebuilt.

Note
Another way to select datasets is from the Datasets page (G+D). There are also programmatic ways of doing operations like this that you’ll learn about in the Developer learning path.
The screenshots below demonstrate using a PostgreSQL database.
Whether you are starting from an existing or a fresh project, ensure that the transactions_known_prepared dataset is built and that its schema includes the columns created in the Window recipe.
See build details here if necessary
From the Flow, select the end dataset required for this tutorial: transactions_known_prepared
Choose Build from the Actions sidebar.
Choose Recursive > Smart reconstruction.
Click Build to start the job, or click Preview to view the suggested job.
If previewing, in the Jobs tab, you can see all the activities that Dataiku will perform.
Click Run, and observe how Dataiku progresses through the list of activities.
See schema propagation details here if necessary
If the transactions_known_prepared dataset does not include columns like card_purchase_amount_min, then you need to propagate the schema changes downstream. If you completed the Advanced Formula & Regex tutorial, this should already be done.
Enter the compute_transactions_known_prepared recipe.
Click Run from inside the recipe editor.
Accept the schema change update, dropping and recreating the output.
Confirm the output dataset includes the Window-generated columns.
Note
See the reference documentation on schema propagation to learn more.
Create the Pivot recipe¶
The Prepare recipe that creates the transactions_known_prepared dataset already has a step that categorizes FICO scores (a US system of rating credit worthiness) into four bins.
In this exercise, we will focus on reshaping data from the transactions_known_prepared dataset from long to wide format using these bins.
From the Actions menu of the transactions_known_prepared dataset, choose Pivot.
Choose card_fico_range as the column to pivot by.
Name the output dataset transactions_by_card_fico_range, and click Create Recipe.

Configure the pivot step¶
On the Pivot step, the Create columns with field at the top right is already populated with card_fico_range because of our choice when creating the recipe. For each distinct value of this column, the Pivot recipe will create a new output column for every aggregation specified in the Populate content with section at the bottom right.
By default, only the 20 most frequently occurring values are pivoted. In this case, new columns will be created for the four FICO ranges as specified in the parent Prepare recipe: BAD, FAIR, GOOD, and EXCELLENT.
We also want to add Row identifiers, which allow us to produce aggregations according to specific groups. Without a row identifier, the output dataset will just be a single row.
Select merchant_state and merchant_subsector as the Row identifiers.
The output dataset will have a single row for each unique combination of merchant_state and merchant_subsector.
Choose to populate content with purchase_amount.
The count of purchase_amount is selected by default. Change it from “Count” to “Avg”.
For each unique combination of merchant_state and merchant_subsector, we also want to calculate the sum of all purchases by FICO range.
Click on Add new, and select purchase_amount once again, this time changing the aggregation to “Sum”.
Deselect Count of records.
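For readers who know pandas, a rough equivalent of this configuration might look like the sketch below. It is not part of the tutorial steps, and it assumes you are running Python inside the project, where the dataiku package is available:

import dataiku

# Load the input dataset into a pandas DataFrame
df = dataiku.Dataset("transactions_known_prepared").get_dataframe()

# Rough pandas equivalent of the configured Pivot recipe
pivoted = df.pivot_table(
    index=["merchant_state", "merchant_subsector"],   # Row identifiers
    columns="card_fico_range",                        # Create columns with
    values="purchase_amount",                         # Populate content with
    aggfunc=["mean", "sum"],                          # Avg and Sum of purchase_amount
)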

Before running the recipe, check the Output step. Note that the schema of the output has not yet been inferred.

Think about why that may be the case, and then Run the recipe.
The resulting dataset contains the average and sum of all purchase amounts, for each merchant state and subsector, divided across credit score ranges.

Note
An empty cell means that there are no transactions with that credit range for that merchant state and subsector.
What happens when new values appear?¶
Unlike most other visual recipes, the schema of a Pivot recipe's output dataset depends on the values in the input dataset, not only on its columns.
In this case, the four distinct values in the card_fico_range column (“BAD”, “FAIR”, “GOOD”, and “EXCELLENT”) determine the number of columns in the output dataset. But what happens if there is another FICO range category added, like “PRETTY GOOD”?
The Pivot recipe uses a two-pass algorithm: the first pass determines the values to pivot on and constructs the schema of the output dataset; the second pass performs the pivot itself.
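As a rough mental model only (this sketch is illustrative and is not Dataiku's actual implementation), the two passes behave something like this:

# Illustrative two-pass pivot over a tiny made-up input
input_rows = [
    {"merchant_state": "Delaware", "merchant_subsector": "gas", "card_fico_range": "GOOD", "purchase_amount": 20.0},
    {"merchant_state": "Delaware", "merchant_subsector": "gas", "card_fico_range": "BAD", "purchase_amount": 35.0},
    {"merchant_state": "Ohio", "merchant_subsector": "internet", "card_fico_range": "GOOD", "purchase_amount": 10.0},
]

# Pass 1: scan the pivot column to fix the output schema
pivot_values = sorted({row["card_fico_range"] for row in input_rows})   # ["BAD", "GOOD"]

# Pass 2: aggregate into the columns fixed by pass 1; a value that only
# appears after pass 1 has no column to land in and is ignored
sums = {}
for row in input_rows:
    key = (row["merchant_state"], row["merchant_subsector"])
    row_sums = sums.setdefault(key, {value: 0.0 for value in pivot_values})
    row_sums[row["card_fico_range"]] += row["purchase_amount"]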
The schema of the output dataset was established after the Pivot recipe’s first run. If you return to the Output step, you can now see the output column names listed.
Rebuilding the dataset continues to use the same schema, performing a single pass on the input of the Pivot recipe – and ignoring any new values of card_fico_range.
There are two options for updating the schema from the Output step of the Pivot recipe:
Manually drop the existing schema by clicking the Drop button.
Select Recompute schema at each run to update the schema automatically every time the transactions_by_card_fico_range dataset is built. The danger in this method is that an updated schema can break the Flow if downstream recipes expect the old schema.

These options will become clearer through the use of a pre-filter.
Apply a pre-filter¶
When using the Pivot recipe, if there are too many distinct values in the column to pivot by, the number of columns in the output dataset can grow out of control.
Warning
By default, only the 20 most frequently occurring values are pivoted, and the maximum possible number of columns created by the Pivot recipe is 200.
In order to avoid having too many columns in the output dataset, it can be useful to apply a Pre-filter in the Pivot recipe.
Let’s define a pre-filter to keep only rows where the card FICO range is “BAD” or “FAIR”. Accordingly, the schema of the output dataset will only have aggregated columns for these two values.
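Continuing the illustrative pandas sketch from earlier (reusing the df DataFrame loaded there), the pre-filter amounts to subsetting the rows before pivoting, so only the remaining values can become columns:

# Keep only BAD and FAIR rows before pivoting
filtered = df[df["card_fico_range"].isin(["BAD", "FAIR"])]
pivoted = filtered.pivot_table(
    index=["merchant_state", "merchant_subsector"],
    columns="card_fico_range",     # only BAD and FAIR can appear as columns now
    values="purchase_amount",
    aggfunc=["mean", "sum"],
)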
Click on the Pre-filter step of the same Pivot recipe.
Activate Filter, and choose to keep rows that satisfy at least one of the following conditions: card_fico_range equals BAD, or card_fico_range equals FAIR.

Before running the recipe, go to the Output step.
Notice the warning that, if label columns have changed, the schema may be out of date. After setting the pre-filter, the output schema should contain aggregated columns only for the “BAD” and “FAIR” values. However, the “GOOD” and “EXCELLENT” columns remain in the schema shown on this step.
Run the recipe again, without dropping the existing schema or recomputing it.
The GOOD and EXCELLENT columns remain in the output dataset, but the values are empty.

Instead, let’s drop the existing schema.
Return to the Output step of the Pivot recipe, and click on the Drop button to manually update the schema.
Run the recipe again.
The schema of the output dataset now only contains columns for the “BAD” and “FAIR” values of the card_fico_range column.

Note
If using file-based datasets, the order of your rows may be sorted differently than what’s shown above using PostgreSQL. Scroll down to see rows ordered by merchant_state.
Use a variable in a pre-filter¶
We may often wish to introduce greater flexibility into pre- and post-filters by using variables.
In our Pivot recipe, imagine we now want to look at summary statistics by FICO range for only a specific merchant state: Delaware.
In the Pre-filter step of the Pivot recipe, clear the existing filter conditions.
Filter for records where merchant_state equals Delaware.

In the Output step, drop the existing schema again (since the values of the pivot column will be different again), and run the recipe.
The resulting dataset only contains records for the state of Delaware. Also, the GOOD and EXCELLENT columns have returned to the schema.

As it turns out, the exact same filtering condition used in the Pivot recipe is also found in the Group recipe.
Navigate to the Pre-filter step of the Group recipe that creates transactions_by_merchant_id to confirm this for yourself.

Instead of manually defining the same filter value in multiple locations, we can define a project variable.
Go to the More Options menu, and click on Variables.
In Project variables > Global variables, define a variable called state_name with the syntax below.
{
"state_name": "Delaware"
}
Note
Variable names and their string values must be wrapped in quotation marks and separated by a colon.
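Project variables are not limited to visual recipes. If you later add code to the project, they can be read from a Python recipe or notebook; here is a minimal sketch using the dataiku package:

import dataiku

# Read the project's variables as a dictionary
variables = dataiku.get_custom_variables()
state_name = variables["state_name"]   # "Delaware"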

Let’s put this variable to work in our Flow!
Return to the Pre-filter step of the Group recipe.
Change the filter condition so that, instead of the fixed value Delaware, merchant_state must equal the value of the state_name variable. You can call the value of a variable with a dollar sign followed by the variable name wrapped in curly braces, like ${state_name}.
Then Run the recipe.

Finally, let’s do the same for the Pre-filter step of the Pivot recipe.
Return to the Pre-filter step of the Pivot recipe.
Change the filter condition so that it references the state_name variable instead of the fixed value Delaware, in the same way as you did for the Group recipe.
The schema has not changed, so you can immediately run the recipe.
Explore the output datasets of the Group and Pivot recipes. Their output should be the same as before using the variable. However, if we wanted to conduct this analysis for another state, we’d only need to update one project variable.
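For example, switching the analysis to another state could be done by editing the variable in the project settings, or programmatically through the Python API client. The sketch below is only an illustration: the project key is a placeholder you would replace with your own, and "Ohio" is just an example value.

import dataiku

# Update the project variable through the public API client
client = dataiku.api_client()
project = client.get_project("YOUR_PROJECT_KEY")   # placeholder: use your project's key
variables = project.get_variables()
variables["standard"]["state_name"] = "Ohio"
project.set_variables(variables)

After changing the variable, rebuilding the downstream datasets applies the new filter value everywhere the variable is used.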
What’s next?¶
In this tutorial, we learned how to create pivot tables to report summary statistics using the Pivot recipe in Dataiku. We also learned how to apply a Pre-filter in a Pivot recipe and how to define and use variables in a Pre-filter step.
You can see how the Pivot recipe was used to compute “Pivot by Year” in a project on the Dataiku gallery.