Tutorial | Group recipe#

The Group recipe allows you to aggregate data in a project based on specified keys (i.e. criteria).

Get started#

Objectives#

In this tutorial, you will:

  • Use the Group recipe to aggregate data.

Prerequisites#

To complete this tutorial, you’ll need the following:

  • A Dataiku instance (version 12.0 or above).

Create the project#

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Group Recipe.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

You’ll next want to build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Use case summary#

Let’s say we’re a financial company that uses credit card data to detect fraudulent transactions.

The project comes with three datasets, described in the table below.

Dataset | Description

tx | Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. The authorized_flag column indicates whether the transaction was authorized (a value of 1) or flagged for potential fraud (a value of 0).

merchants | Each row is a unique merchant with information such as the merchant’s location and category.

cards | Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

Create the Group recipe#

The Group recipe allows you to aggregate the values of some columns by the values of one or more keys. Let’s create the recipe.

  1. From the Flow, select the tx_prepared dataset and click the Actions icon (+) from the right panel to open the Actions tab.

  2. Under the Visual Recipes section, click on Group.

    The New group recipe window opens.

  3. In the recipe dialog, keep tx_prepared as the input dataset.

  4. Choose to group by card_id.

  5. Name the output dataset tx_group.

  6. Click Create Recipe.

Dataiku screenshot of how to create the Group recipe.

Select aggregations by group key#

The core step of the Group recipe is the Group step, where you choose which columns serve as keys and which aggregations to perform.

In this tutorial, we want to group the data by card ID and compute the sum and average of the purchase amount for each card.

  1. On the Group step, ensure that the card_id column is listed as a group key.

  2. In the Per field aggregations section, select the following aggregations:

    • Avg of purchase_amount

    • Sum of purchase_amount

    For each unique card ID grouping, the output dataset will provide the sum of purchases and the average amount spent on purchases. By default, Dataiku will also compute the count of purchases for each group.

    Note

    The recipe reminds us of the storage type of each column in the Per field aggregations tile. We are able to compute the purchase sum and average because purchase_amount is stored as a number (more specifically, a double).

  3. Before running the recipe, navigate to the Output step. Notice that the output has four columns: one for the group key, one for each of the two selected aggregations, and one for the default count.

  4. Click Run to create the new grouped output dataset.

Dataiku screenshot of the configuration of a Group recipe.
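
For readers who prefer to see the logic in code, the aggregation configured above can be sketched in plain Python. This is only an illustration of what grouping by card_id computes, not how Dataiku executes the recipe. The sample rows are made up, and the output column names (purchase_amount_avg, purchase_amount_sum, count) are assumptions chosen for this sketch.

```python
from collections import defaultdict

# Made-up rows standing in for tx_prepared (values are illustrative only).
rows = [
    {"card_id": "C_1", "merchant_id": "M_1", "purchase_amount": 10.0},
    {"card_id": "C_1", "merchant_id": "M_2", "purchase_amount": 30.0},
    {"card_id": "C_2", "merchant_id": "M_1", "purchase_amount": 5.0},
]


def group_by_card(rows):
    """Return one row per card_id with the avg, sum, and count of purchase_amount."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["card_id"]].append(row["purchase_amount"])
    return [
        {
            "card_id": card_id,
            "purchase_amount_avg": sum(values) / len(values),  # assumed column name
            "purchase_amount_sum": sum(values),                # assumed column name
            "count": len(values),                              # assumed column name
        }
        for card_id, values in amounts.items()
    ]


tx_group = group_by_card(rows)
for row in tx_group:
    print(row)
```

Each output row has four fields, matching the group key plus the three aggregations. The merchant_id values do not survive the grouping because they are neither a key nor an aggregated column.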

Note

Columns in the input dataset not used in the group key or per field aggregations (like purchase_date and merchant_id) are not included in the output dataset. To keep columns not included in the aggregation, you likely need a Window recipe.
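
The difference between the two behaviors can be sketched in plain Python. This is a simplified illustration under stated assumptions: the rows are made up, and the "window-style" result only mimics the idea of keeping every input row while attaching a per-card aggregate. The actual Window recipe offers more options (such as ordering and window frames) that are not modeled here.

```python
from collections import defaultdict

# Made-up rows standing in for the input dataset (values are illustrative only).
rows = [
    {"card_id": "C_1", "purchase_date": "2018-01-03", "purchase_amount": 10.0},
    {"card_id": "C_1", "purchase_date": "2018-02-11", "purchase_amount": 30.0},
    {"card_id": "C_2", "purchase_date": "2018-01-20", "purchase_amount": 5.0},
]

# Total purchase amount per card.
totals = defaultdict(float)
for row in rows:
    totals[row["card_id"]] += row["purchase_amount"]

# Group-style result: one row per key; columns such as purchase_date are dropped.
grouped = [
    {"card_id": card_id, "purchase_amount_sum": total}
    for card_id, total in totals.items()
]

# Window-style result: every input row is kept, with the aggregate attached.
windowed = [
    {**row, "purchase_amount_sum": totals[row["card_id"]]}
    for row in rows
]

print(len(grouped), "grouped rows;", len(windowed), "windowed rows")
```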

Explore the output dataset#

Let’s quickly observe the output.

  1. Open the tx_group dataset.

  2. Click on the card_id column dropdown, and select Analyze.

Exploring a column of the output dataset using the Analyze tool.

Note

All values are unique. We have exactly one record for every card after grouping by card_id.

Also, keep in mind that by default, the output dataset shows a sample of your data rather than all of the data.
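
As a rough analogy for the uniqueness check, the sketch below counts how often each key value occurs in a list of rows. The helper name key_is_unique and the sample rows are hypothetical and exist only for this illustration. As with a sample in the Explore view, a check like this only speaks for the rows it is given.

```python
from collections import Counter


def key_is_unique(rows, key):
    """Return True if every value of `key` appears exactly once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return all(n == 1 for n in counts.values())


# Made-up rows: transactions before grouping, and the grouped result.
transactions = [
    {"card_id": "C_1", "purchase_amount": 10.0},
    {"card_id": "C_1", "purchase_amount": 30.0},
    {"card_id": "C_2", "purchase_amount": 5.0},
]
grouped = [
    {"card_id": "C_1", "count": 2},
    {"card_id": "C_2", "count": 1},
]

print(key_is_unique(transactions, "card_id"))  # a card can appear many times
print(key_is_unique(grouped, "card_id"))       # one row per card after grouping
```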

What’s next?#

Continue to the Tutorial | Window recipe to use advanced grouping in Dataiku.