Tutorial | Join recipe#

Getting started#

The Join recipe allows you to enrich your data with columns from another dataset.

Objectives#

In this tutorial, you will:

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Join Recipe.

  2. From the project homepage, click Go to Flow (or G + F).

  3. Click the Flow Actions > Build all menu at the bottom left of the Flow.

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

Let’s say we’re a company selling t-shirts and have two datasets, described in the table below.

Dataset

Description

orders_by_customer

Details t-shirt orders grouped by unique customers from 2013 to 2017.

customers

Includes some detailed information on each customer.

In this tutorial, we want to enrich our orders_by_customer dataset with the customer data coming from customers.

Explore the customer data#

Tip

A screencast at the end of the page recaps the instructions described here.

Let’s explore the customers dataset.

  1. From the Flow, double-click on customers dataset to open it.

  2. Explore the dataset. As you can see, each row in this dataset represents a separate customer, and records:

    • The unique customer ID

    • The customer’s gender

    • The customer’s birth date

    • The user agent most commonly used by the customer

    • The customer’s IP address

    • Whether the customer is part of Haiku T-Shirts’ marketing campaign

Note

Take a few minutes to explore it with tools like Analyze. Also, note the gray portion of the gender column’s data quality bar representing missing values.

Create a Join recipe#

We can use the Join recipe to enrich the customers dataset with the information about the aggregate orders customers have made from the orders_by_customer dataset.

  1. Click the Actions button in the top right of the Explore tab of the customers dataset. The Actions tab opens in the right panel.

  2. Under Visual recipes, choose Join with….

  3. Select orders_by_customer as the second input dataset.

    Note

    Although only two datasets can be added in the Join recipe creation dialog, more datasets can be added at the Join step after creating the recipe.

  4. Change the name of the output dataset to customers_orders_joined.

  5. Click Create Recipe.

Screenshot of the New join recipe dialog box with two input datasets selected.

Define the join condition#

The core step of this recipe is the Join step, where you choose how to match rows between the datasets. In this case, we want to match rows from the customers and orders_by_customer datasets that have the same value of customerID and customer_id, respectively.

Screenshot of the Join recipe and steps in the left navigation bar.

Note

Notice the + button at the top right of each dataset in the Join step. You can use this button to add more datasets to join with the customers and the orders_by_customer datasets.

  1. Click on Add a Condition to tell Dataiku which columns to match.

    The Join conditions dialog opens, where Dataiku automatically recognizes that the ID columns are the join key, even though they have different names. This is the only condition we need to add here.

  2. Select OK and return to the Join recipe.

The join conditions dialog box in the Join recipe.

Note

By default, the recipe will drop unmatched rows, but we can also configure the recipe to send unmatched rows to another output dataset. This can be helpful to confirm your expectations of the output.

Dataiku screenshot of a Join recipe dialog showing the drop unmatched rows option.

Set the join type#

By default, the Join recipe performs a left join, which retains all rows in the left dataset, even if there is no matching information in the right. Since in this case we only want to work with customers who have made at least one order, let’s modify the join type to inner join, which keeps only the rows that match in both datasets.

Note

For more information about join types, visit Concept | Join recipe.

  1. Click on the Left join indicator.

  2. Select Inner join.

This will retain only the customers who have made an order, and remove the others from the output dataset.

Dataiku screenshot showing the join types with Inner join selected.

Define output columns#

The Selected columns step of the Join recipe allows you to choose which columns to retain from the input datasets.

In this tutorial, let’s skip this step and keep all the columns.

Execute the recipe and explore the output#

Now, let’s run the recipe and view the output.

  1. Click Run to execute the recipe.

  2. When the recipe has finished running, click Explore dataset customers_orders_joined at the bottom of the screen to explore the customers_orders_joined dataset.

  3. From the top navigation bar, go back to the Flow to check your progress.

Dataiku screenshot of resulting Flow showing the schema of the output dataset.

See a screencast covering these steps

What’s next?#

Now that you have a few datasets and recipes in the Flow, it’s time to take stock of what you’ve accomplished in the next tutorial on exploring the Flow.