Tutorial | Group recipe

The Group recipe lets you aggregate the rows of a dataset by the values of one or more key columns.

Get started

Objectives

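In this tutorial, you will:

  • Create a Group recipe that groups orders by customer.

  • Select the aggregations to compute for each group key.

  • Explore the grouped output dataset.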

Prerequisites

To complete this tutorial, you’ll need the following:

  • A Dataiku instance (version 9.0 or above).

Create the project

To create the project:

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Group Recipe.

  2. From the project homepage, click Go to Flow (or G + F).

  3. Click Flow Actions > Build all at the bottom right of the Flow.

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary

Let’s say we’re a company selling t-shirts and our dataset details each t-shirt order from 2013 to 2017.

To understand our customers better, we want to group all past orders by customer, aggregating each customer’s past interactions.

Create a Group recipe

Tip

A screencast at the end of the page recaps all of the actions described here.

Let’s get started!

  1. From the Flow, select the orders_prepared dataset.

  2. In the Actions tab of the right panel (+ button), choose Group in the list of Visual recipes. The Group recipe allows you to aggregate the values of some columns by the values of one or more keys.

  3. In the recipe dialog, choose to group by customer_id.

  4. Change the name of the output dataset to orders_by_customer.

  5. Select Create Recipe.

Create the Group recipe.

Select aggregations by group key

The core step of the Group recipe is the Group step, where you choose which columns serve as keys and which aggregations to perform.

  1. On the Group step, in the Per field aggregations section, select the following aggregations:

    • Min of order_date

    • Avg of pages_visited

    • Sum of total

    For each unique customer ID, the output will contain the date of the first order, the average number of pages visited per order, and the sum of the total column across all of that customer’s orders. We’ll also keep the count of rows for each group, which is selected by default.

    Group step in the Settings tab of a Group recipe.

    Note

    The recipe reminds us of the storage type of each column in the Per field aggregations tile. We can retrieve the chronological minimum of order_date because its storage type is a date. If it were a string, the “minimum” would be the first value in alphabetical order.

  2. Before running the recipe, navigate to the Output step.

  3. Rename order_date_min to first_order_date.

  4. Click Run to create the new grouped output dataset.

Output column names in the Output step of the Settings tab of a Group recipe.
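The storage-type note in step 1 can be illustrated outside Dataiku with a minimal pandas sketch. The dates and the day/month/year format below are invented for illustration and are not taken from the tutorial dataset. The sketch only shows why a minimum over strings can differ from a minimum over parsed dates.

```python
import pandas as pd

# Hypothetical order dates, stored as day/month/year strings (an assumption
# made for this example; the tutorial dataset may use another format).
raw_dates = pd.Series(["15/03/2014", "02/11/2016", "28/01/2013"])

# As strings, min() compares character by character (alphabetical order).
string_min = raw_dates.min()

# Parsed as dates, min() compares chronologically.
date_min = pd.to_datetime(raw_dates, format="%d/%m/%Y").min()

print("string minimum:", string_min)
print("date minimum:  ", date_min.date())
```

Here the alphabetical minimum is the 2016 order, while the chronological minimum is the 2013 order, which is the one a "first order date" should return.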

Note

Columns in the input dataset not used in the group key or per field aggregations (like order_id and tshirt_category) are not included in the output dataset.
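For readers who think in code, the Group and Output steps above can be approximated with a pandas `groupby`. This is only a sketch of the same logic, not how Dataiku runs the recipe. The sample rows are invented, and the output names `pages_visited_avg` and `total_sum` are assumed by analogy with `order_date_min`.

```python
import pandas as pd

# Hypothetical stand-in for orders_prepared; all values are invented.
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "customer_id": ["c1", "c1", "c2", "c1"],
    "order_date": pd.to_datetime(
        ["2014-05-01", "2013-02-10", "2016-07-04", "2015-09-30"]
    ),
    "pages_visited": [4, 6, 3, 8],
    "tshirt_category": ["Black M", "White F", "Hoodie", "Black M"],
    "total": [20.0, 35.0, 15.0, 25.0],
})

# Group by the key and compute one aggregation per selected column.
# Naming the first output first_order_date mirrors the rename in the Output step.
orders_by_customer = orders.groupby("customer_id", as_index=False).agg(
    first_order_date=("order_date", "min"),     # Min of order_date
    pages_visited_avg=("pages_visited", "mean"),  # Avg of pages_visited
    total_sum=("total", "sum"),                 # Sum of total
    count=("order_id", "size"),                 # number of rows per group
)

print(orders_by_customer)
```

As in the recipe, columns that are neither the key nor aggregated (`order_id` and `tshirt_category` here) do not appear in the result.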

Explore the output dataset

Let’s quickly observe the output.

  1. Open the orders_by_customer dataset.

  2. Click on the customer_id column dropdown, and select Analyze.

Exploring a column of the output dataset using the Analyze tool.

Note

All values are unique: after grouping by customer_id, the dataset has exactly one record per customer.
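The same uniqueness check can be sketched in pandas, assuming the grouped dataset has been loaded into a DataFrame. The small frame below is a hypothetical stand-in with invented values.

```python
import pandas as pd

# Hypothetical stand-in for the orders_by_customer output dataset.
orders_by_customer = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "count": [3, 1, 2],
})

# True when no customer_id value appears more than once.
all_unique = orders_by_customer["customer_id"].is_unique

# Number of distinct customers; equals the row count when keys are unique.
n_customers = orders_by_customer["customer_id"].nunique()

print(all_unique, n_customers == len(orders_by_customer))
```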

See a screencast covering this tutorial’s steps

What’s next?

Continue to the tutorial on enriching the dataset to learn more about basic data preparation in Dataiku.