Tutorial | Group recipe#
The Group recipe allows you to aggregate data in a project based on specified keys (i.e. criteria).
Get started#
Objectives#
In this tutorial, you will:
Use the Group recipe to aggregate data.
Prerequisites#
To complete this tutorial, you’ll need the following:
A Dataiku instance (version 9.0 or above).
Create the project#
To create the project:
From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Group Recipe.
From the project homepage, click Go to Flow (or G + F).
Click the Flow Actions > Build all menu at the bottom left of the Flow.
Note
You can also download the starter project from this website and import it as a zip file.
Use case summary#
Let’s say we’re a company selling t-shirts and our dataset details each t-shirt order from 2013 to 2017.
To understand our customers better, we want to group all past orders by unique customers, aggregating their past interactions.
Create a Group recipe#
Tip
A screencast at the end of the page recaps all of the actions described here.
Let’s get started!
From the Flow, select the orders_prepared dataset.
In the Actions tab of the right panel (+ button), choose Group in the list of Visual recipes. The Group recipe allows you to aggregate the values of some columns by the values of one or more keys.
In the recipe dialog, choose to group by customer_id.
Change the name of the output dataset to orders_by_customer.
Select Create Recipe.
Select aggregations by group key#
The core step of the Group recipe is the Group step, where you choose which columns serve as keys and which aggregations to perform.
On the Group step, in the Per field aggregations section, select the following aggregations:
Min of order_date
Avg of pages_visited
Sum of total
For each unique customer ID, the output will contain the date of the first order, the average number of pages visited per order, and the sum of all order totals. The recipe also computes the count of records in each group, which is a default setting.
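To make the aggregation concrete, here is a minimal sketch in plain Python of what the Group step computes for these settings. It is not how Dataiku runs the recipe. The sample rows are hypothetical, and the output names pages_visited_avg and total_sum are assumed by analogy with the order_date_min column that the tutorial mentions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; the real orders_prepared dataset has more rows and columns.
orders = [
    {"customer_id": "c1", "order_date": date(2014, 3, 2), "pages_visited": 4, "total": 20.0},
    {"customer_id": "c1", "order_date": date(2013, 7, 15), "pages_visited": 6, "total": 35.0},
    {"customer_id": "c2", "order_date": date(2016, 1, 9), "pages_visited": 3, "total": 15.0},
]


def group_orders(rows):
    """Group rows by customer_id and compute min, avg, sum, and count per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["customer_id"]].append(row)

    result = []
    for customer_id, members in groups.items():
        result.append({
            "customer_id": customer_id,
            # Min of order_date: the date of the customer's first order.
            "order_date_min": min(r["order_date"] for r in members),
            # Avg of pages_visited (assumed output name).
            "pages_visited_avg": sum(r["pages_visited"] for r in members) / len(members),
            # Sum of total (assumed output name).
            "total_sum": sum(r["total"] for r in members),
            # Number of input rows in the group.
            "count": len(members),
        })
    return result


grouped = group_orders(orders)
for row in grouped:
    print(row)
```

Each input row contributes to exactly one group, so the output has one row per distinct customer_id.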
Note
The recipe reminds us of the storage type of each column in the Per field aggregations tile. We are able to retrieve the minimum of order_date because its storage type is a date. If it were a string, the “minimum” would be the first result in alphabetical order.
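The effect of the storage type on a minimum can be sketched in plain Python. The date strings and the parse_mdy helper below are hypothetical and only illustrate why an alphabetical minimum can differ from a chronological one.

```python
from datetime import date

# Hypothetical month/day/year strings, as they might look before parsing.
raw_dates = ["9/1/2013", "10/3/2014", "2/20/2015"]


def parse_mdy(text):
    """Parse a month/day/year string into a date (illustrative helper)."""
    month, day, year = (int(part) for part in text.split("/"))
    return date(year, month, day)


# Strings compare character by character, so "1..." sorts before "2..." and "9...".
string_min = min(raw_dates)

# Dates compare chronologically.
date_min = min(parse_mdy(text) for text in raw_dates)

print(string_min)  # alphabetical minimum
print(date_min)    # chronological minimum
```

With these values the alphabetical minimum is "10/3/2014", while the earliest date is 2013-09-01.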
Before running the recipe, navigate to the Output step.
Rename order_date_min to first_order_date.
Click Run to create the new grouped output dataset.
Note
Columns in the input dataset not used in the group key or per field aggregations (like order_id and tshirt_category) are not included in the output dataset.
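As a rough sketch of how the output schema follows from the recipe settings, the snippet below derives the output column names from the group key, the selected aggregations, and the rename. The column_function naming pattern is assumed from order_date_min, the column order is illustrative only, and the input column list contains just the columns named in this tutorial.

```python
# Columns named in this tutorial; the real dataset may contain others.
input_columns = [
    "order_id", "customer_id", "order_date",
    "pages_visited", "total", "tshirt_category",
]
group_keys = ["customer_id"]
aggregations = [("order_date", "min"), ("pages_visited", "avg"), ("total", "sum")]
renames = {"order_date_min": "first_order_date"}


def output_columns(keys, aggs, renames, with_count=True):
    """Build output column names: keys, then one column per aggregation, then count."""
    columns = list(keys)
    columns += [f"{column}_{function}" for column, function in aggs]
    if with_count:
        columns.append("count")
    # Apply the renames set on the Output step.
    return [renames.get(name, name) for name in columns]


def dropped_columns(inputs, keys, aggs):
    """Input columns that are neither a key nor the source of an aggregation."""
    used = set(keys) | {column for column, _ in aggs}
    return [name for name in inputs if name not in used]


out = output_columns(group_keys, aggregations, renames)
dropped = dropped_columns(input_columns, group_keys, aggregations)
print(out)
print(dropped)
```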
Explore the output dataset#
Let’s quickly observe the output.
Open the orders_by_customer dataset.
Click on the customer_id column dropdown, and select Analyze.
Note
All values are unique. We have exactly one record for every customer after grouping by customer_id.
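The same uniqueness check can be expressed in a few lines of plain Python. This is only a sketch: the customer IDs are hypothetical, and duplicated_keys is an illustrative helper, not a Dataiku function.

```python
from collections import Counter


def duplicated_keys(values):
    """Return the sorted list of values that appear more than once."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical customer_id values read from the grouped output.
grouped_ids = ["c1", "c2", "c3"]
# Hypothetical customer_id values from the ungrouped input, where customers repeat.
ungrouped_ids = ["c1", "c2", "c1"]

print(duplicated_keys(grouped_ids))    # empty: every key is unique after grouping
print(duplicated_keys(ungrouped_ids))  # lists keys that occur more than once
```

An empty result corresponds to what the Analyze window shows here: no customer_id value is repeated in the grouped dataset.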
See a screencast covering this tutorial’s steps
What’s next?#
Continue to the tutorial on enriching the dataset to learn more about basic data preparation in Dataiku.