Tutorial | Pivot recipe#

The Pivot recipe transforms datasets into pivot tables, which are tables of summary statistics. This operation can also be referred to as reshaping from long to wide format.
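
If it helps to build intuition, the reshaping is conceptually similar to a pandas pivot_table. The sketch below is purely illustrative and uses made-up toy data rather than the tutorial's dataset.

```python
import pandas as pd

# Toy "long" table: one row per transaction (made-up data)
long_df = pd.DataFrame({
    "fico_range": ["bad", "fair", "fair", "good"],
    "amount": [10.0, 25.0, 40.0, 15.0],
})

# "Wide" pivot table: one column per distinct fico_range value,
# filled with a summary statistic (here, the average amount)
wide_df = long_df.pivot_table(columns="fico_range", values="amount", aggfunc="mean")
print(wide_df)
```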

Objectives#

In this tutorial, you will:

  • Create a pivot table using the visual Pivot recipe.

  • Understand how the Pivot recipe computes a schema using a pre-filter step.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Access to an instance of Dataiku 12+.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

  • You may also want to review this tutorial’s associated concept article.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Pivot recipe.

  2. From the project homepage, click Go to Flow.

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Create the Pivot recipe#

Early in the Flow, the first Prepare recipe includes a step that categorizes FICO scores (a US measure of creditworthiness) into four bins.

In this exercise, we will focus on reshaping the downstream tx_prepared dataset from long to wide format using these bins.

  1. From the Flow, select tx_prepared.

  2. In the Actions tab on the right, select Pivot from the menu of visual recipes.

  3. In the dialog, choose to pivot by cardholder_fico_range.

  4. Name the output dataset tx_pivot.

  5. Click Create Recipe.

Dataiku screenshot of the Pivot recipe dialog.

Configure the Pivot step#

On the Pivot step, cardholder_fico_range already populates the Create columns with field at the top right because of the choice made when creating the recipe.

For each distinct value of this pivot column, the Pivot recipe will create a new column for each aggregation specified in the Populate content with section in the bottom right.

This includes Count of records by default. Before we add any row identifiers or more aggregations, let’s first see how the four unique values in the selected pivot column get transposed into columns.

  1. On the Pivot step, make no changes.

  2. Navigate to the Output step, and observe how the schema could not be inferred. We’ll explore this soon.

  3. Click Run at the bottom left, and open the output dataset when the recipe finishes running.

Dataiku screenshot of the Pivot step of the Pivot recipe.

Without selecting any row identifiers, the output of this Pivot recipe is a single row. So far, we only have the record counts for each bin. Because we have four values in the pivot column, one aggregation, and no row identifier, the output dataset has four columns.

Dataiku screenshot of an output to the Pivot recipe.
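
For reference, here is a minimal pandas sketch of the same count-only pivot with no row identifier. The data is a made-up stand-in for tx_prepared, not the actual tutorial dataset.

```python
import pandas as pd

# Toy stand-in for tx_prepared (made-up values)
df = pd.DataFrame({
    "cardholder_fico_range": ["bad", "fair", "fair", "good", "excellent"],
})

# One count aggregation and no row identifier: a single row with
# one column per distinct value of the pivot column
counts = df["cardholder_fico_range"].value_counts().to_frame().T
print(counts)
```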

Add a row identifier#

What’s typically more useful is to add row identifiers. These allow us to produce aggregations according to specific groups.

  1. Near the top right, click Parent Recipe to return to the Pivot step.

  2. On the Pivot step, select authorized_flag as a row identifier.

  3. Click Run at the bottom left, and open the output dataset when the recipe finishes running.

Dataiku screenshot of the Pivot step of the Pivot recipe.

Because authorized_flag has two values, we now have two rows in the output dataset, producing a table of counts by FICO range for both authorized (1) and unauthorized (0) transactions.

Dataiku screenshot of an output to the Pivot recipe.
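
The equivalent cross-tabulation in pandas would look roughly like this (again with made-up stand-in data; the column names follow the tutorial):

```python
import pandas as pd

# Toy stand-in for tx_prepared (made-up values)
df = pd.DataFrame({
    "authorized_flag": [1, 1, 0, 1, 0],
    "cardholder_fico_range": ["bad", "fair", "fair", "good", "excellent"],
})

# One row per value of the row identifier, one count column per FICO range
counts = pd.crosstab(df["authorized_flag"], df["cardholder_fico_range"])
print(counts)
```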

Tip

To go further on your own, experiment with adding additional row identifiers, such as item_category!

Populate content with aggregations#

Now let’s select an additional aggregation: the average purchase_amount of authorized and unauthorized transactions across FICO categories.

  1. Click Parent Recipe to return to the Pivot step.

  2. On the Pivot step, in the Populate content with tile, click Add new, and select purchase_amount.

  3. Click on count(purchase_amount) to change the aggregation. Switch to Avg.

  4. Click Run at the bottom left, and open the output dataset when the recipe finishes running.

Dataiku screenshot of the Pivot step of a Pivot recipe with an aggregation.

The addition of one aggregation has added four columns to the output dataset (one for each unique value of the pivot column).
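
In pandas terms, the recipe is now computing something like the sketch below: two aggregations (record count and average purchase_amount), each contributing one column per distinct FICO range. The data is made up for illustration.

```python
import pandas as pd

# Toy stand-in for tx_prepared (made-up values)
df = pd.DataFrame({
    "authorized_flag": [1, 1, 0, 1, 0],
    "cardholder_fico_range": ["bad", "fair", "fair", "good", "excellent"],
    "purchase_amount": [12.0, 30.0, 45.0, 20.0, 8.0],
})

# Two aggregations pivoted by FICO range: each one adds a column
# per distinct value of the pivot column
pivot = df.pivot_table(
    index="authorized_flag",
    columns="cardholder_fico_range",
    values="purchase_amount",
    aggfunc=["count", "mean"],
)
print(pivot)
```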

What happens when new values appear?#

The Pivot recipe is a two-pass algorithm:

  • On the first pass, it finds the values for new output columns and generates the dataset schema.

  • On the second pass, it populates the pivot table.
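
Purely as an illustration (this is not how Dataiku implements the recipe internally), the two passes could be sketched in Python as:

```python
import pandas as pd

# Toy stand-in for tx_prepared (made-up values)
df = pd.DataFrame({"cardholder_fico_range": ["bad", "fair", "fair", "good", "excellent"]})

# Pass 1: scan the pivot column for its distinct values and derive the
# output schema (one column per value and per aggregation)
pivot_values = df["cardholder_fico_range"].unique()
schema = [f"{value}_count" for value in pivot_values]

# Pass 2: populate the table using the schema computed in pass 1
row = {f"{value}_count": int((df["cardholder_fico_range"] == value).sum())
       for value in pivot_values}
print(schema)
print(row)
```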

If you return to the Output step, you can see the output column names listed, just as in many other visual recipes.

  1. Click Parent Recipe to return to the Pivot step.

  2. Navigate to the Output step on the left.

Dataiku screenshot of the Output step of the Pivot recipe with the schema already computed.

What’s different is that in most visual recipes, only the column names of the input dataset determine the schema of the output dataset. In the Pivot recipe, however, the actual values in the pivot column also shape the output schema.

In this case, the four distinct values in the cardholder_fico_range column (“bad”, “fair”, “good”, and “excellent”) determine the number of columns in the output dataset. But what happens if another FICO range category — such as “extraordinary” — is added?

By default, rebuilding the dataset skips the first pass described above. It does not look for any new values of the pivot column and just re-populates the table (the second pass).

Two options exist for updating the schema from the Output step of the Pivot recipe:

  • Manually drop the existing schema by clicking the Drop button.

  • Select Recompute schema at each run to update the schema automatically every time the tx_pivot dataset is built.

The danger in the second option is that an updated schema can break the Flow if downstream recipes expect the old schema.
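
To make the two options concrete, here is a minimal sketch (not Dataiku code; the new “extraordinary” value is hypothetical):

```python
import pandas as pd

# Toy stand-in for tx_prepared after a new FICO category appears
df = pd.DataFrame({
    "cardholder_fico_range": ["bad", "fair", "good", "excellent", "extraordinary"],
})

# Option 1 (Drop button): keep using the schema cached on an earlier run
# until you explicitly drop it and rebuild
cached_schema = ["bad", "fair", "good", "excellent"]

# Option 2 (Recompute schema at each run): rediscover the distinct values
# on every build, so "extraordinary" becomes a new column, which may
# surprise downstream recipes that expect the old schema
current_schema = sorted(df["cardholder_fico_range"].unique().tolist())

print(cached_schema)
print(current_schema)
```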

Apply a pre-filter to the Pivot recipe#

These options will become clearer through the use of a pre-filter. When using the Pivot recipe, if there are too many distinct values in the column to pivot by, the number of columns in the output dataset can grow out of control.

Warning

By default, the Pivot recipe pivots only the 20 most frequently occurring values, and it can create a maximum of 200 columns.

Let’s define a pre-filter to keep only rows where the cardholder’s FICO range is “bad” or “fair”. Then, the first pass of the Pivot recipe’s algorithm will find only two values instead of four.

  1. Navigate to the Pre-filter step of the Pivot recipe.

  2. Toggle the Filter On, and choose to keep rows that satisfy the condition:

    cardholder_fico_range Equals bad
    Or
    cardholder_fico_range Equals fair
    Dataiku screenshot of the Pre-filter step of the Pivot recipe.
  3. Before running the recipe, navigate to the Output step.

    Notice the warning that if label columns have changed, then the schema may be out of date. After setting the pre-filter, the dataset will have only “bad” and “fair” values in the pivot column. However, the “good” and “excellent” columns remain in the schema.

  4. Click Run to execute the recipe again without dropping the existing schema or recomputing it. Then open the output dataset.

The “good” and “excellent” columns remain in the output dataset, but the values are empty. In many cases, this may be a perfectly fine outcome. The Flow did not break!

Dataiku screenshot of the Explore tab of the pre-filtered Pivot recipe output without dropping or re-computing the schema.
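
A pandas analogue of running with the stale schema versus a recomputed one (toy stand-in data; the pre-filter keeps only “bad” and “fair”):

```python
import pandas as pd

# Toy stand-in for tx_prepared (made-up values)
df = pd.DataFrame({
    "authorized_flag": [1, 1, 0, 1, 0],
    "cardholder_fico_range": ["bad", "fair", "fair", "good", "excellent"],
})

# Pre-filter: keep only "bad" and "fair" rows before pivoting
filtered = df[df["cardholder_fico_range"].isin(["bad", "fair"])]
counts = pd.crosstab(filtered["authorized_flag"], filtered["cardholder_fico_range"])

# Stale schema (no drop, no recompute): the old columns stay, but empty
stale = counts.reindex(columns=["bad", "fair", "good", "excellent"])

# Recomputed schema (after dropping): only "bad" and "fair" remain
print(stale)
print(counts)
```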

Drop the existing schema of a Pivot recipe#

To remove the empty columns, let’s drop the existing schema before re-running the recipe.

  1. Click Parent Recipe to return to the Pivot step.

  2. Navigate to the Output step.

  3. Click on the Drop button to manually update the schema and Confirm.

  4. Click Run again, and open the output dataset.

Dataiku screenshot of the output dataset to the Pivot recipe.

The schema of the output dataset now contains columns only for the “bad” and “fair” values of the cardholder_fico_range column. Because the “good” and “excellent” columns have been dropped, any downstream recipes that expected them would need to be checked for schema incompatibilities.

What’s next?#

In this tutorial, we learned how to create pivot tables to report summary statistics using the Pivot recipe in Dataiku. We also learned how to apply a pre-filter in a Pivot recipe and, in doing so, saw how the output schema is computed.

There are still many other visual recipes. Try the Distinct recipe next!