Tutorial | Enrich the dataset (Core Designer part 8)

Having explored the Flow in another tutorial, now let’s demonstrate a key visual recipe, the Join recipe, which allows you to enrich your data with columns from another dataset.

Objectives

In this tutorial, you will:

  • Use a Join recipe to join two datasets.

Resume or create the project

There are two ways to get started:

Starting here?

If you skipped the previous sections, you need to:

  • Have access to a Dataiku instance (version 9.0 or above).

  • Create the project (+New Project > DSS Tutorials > Core Designer / Basics > Basics 103)) or download and import it from this website.

Screenshot showing starter projects for the Dataiku basics courses.

Continuing from previous tutorials?

If you have worked on previous tutorials in this series so that you already have a project with the orders_by_customer dataset, you’ll need to import one new dataset to that project.

  • Download the customers CSV file.

  • From the Flow of your existing project, click + Dataset > Upload your files.

  • Click on Select Files, and choose your customers.csv file.

  • Finish importing the dataset by clicking the Create button or using the shortcut Cmd/Ctrl+S.


Once you have the dataset customers and orders_by_customer in your Flow, you’re ready to start!

Dataiku screenshot of the Flow with four datasets.

Explore the customer data

Tip

A screencast at the end of the page recaps the instructions described here.

The Flow contains a dataset of orders grouped by unique customers. Now we have a customers dataset with more information about our customers.

  • Open the customers dataset by double-clicking on its icon in the Flow.

Each row in this dataset represents a separate customer, and records:

  • the unique customer ID

  • the customer’s gender

  • the customer’s birth date

  • the user agent most commonly used by the customer

  • the customer’s IP address

  • whether the customer is part of Haiku T-Shirts’ marketing campaign

Note

Take a few minutes to explore it with tools like Analyze. Also, note the gray portion of the gender column’s data quality bar representing missing values.

Create a Join recipe

We can use the Join recipe to enrich the customers dataset with the information about the aggregate orders customers have made from the orders_by_customer dataset.

  • From the Actions tab in the right panel, choose Join with… from the list of visual recipes.

Dataiku screenshot showing the Join with... recipe in the Actions menu.

  • Select orders_by_customer as the second input dataset.

Note

Although only two datasets can be added in the Join recipe creation dialog, more datasets can be added at the Join step after creating the recipe.

  • Change the name of the output dataset to customers_orders_joined.

  • Click Create Recipe.

Screenshot of the New join recipe dialog box with two input datasets selected.

Define the join condition

The core step of this recipe is the Join step, where you choose how to match rows between the datasets. In this case, we want to match rows from the customers and orders_by_customer datasets that have the same value of customerID and customer_id, respectively.

Screenshot of the Join recipe and steps in the left navigation bar.

Note

Notice the + button at the top right of each dataset in the Join step. You can use this button to add more datasets to join with the customers and the orders_by_customer datasets.

  • Click on Add a condition to tell Dataiku which columns to match.

It opens the Join conditions dialog, where Dataiku automatically recognizes that the ID columns are the join key, even though they have different names. This is the only condition we need to add here.

  • Select OK and return to the Join recipe.

The join conditions dialog box in the Join recipe.

Note

By default, the recipe will drop unmatched rows, but we can also configure to send unmatched rows to another output dataset. This can be helpful to confirm your expectations of the output.

Set the join type

By default, the Join recipe performs a left join, which retains all rows in the left dataset, even if there is no matching information in the right. Since we only want to work with customers who have made at least one order, let’s modify the join type.

Note

Types of joins

There are multiple methods for joining two datasets; the method you choose will depend upon your data and your goals in analysis.

  • Left join (default type) keeps all rows of the left dataset and adds information from the right dataset when there is a match. This is useful when you need to retain all the information in the rows of the left dataset, and the right dataset is providing extra, possibly incomplete, information.

  • Inner join keeps only the rows that that match in both datasets. This is useful when only the rows with complete information from both datasets will be useful downflow.

  • Outer join keeps all rows from both datasets, combining rows where there is a match. This is useful when you need to retain all the information in both datasets.

  • Right join is similar to a left join, but keeps all rows of the right dataset and adds information from the left dataset when there is a match.

  • Cross join is a Cartesian product that matches all rows of the left dataset with all rows of the right dataset. This is useful when you need to compare every row in one dataset to every row of another.

  • Advanced join provides custom options for row selection and deduplication for when none of the other options are suitable.

Keeping unmatched rows

When performing a left, right, or inner join in Dataiku versions 11.3 and above, you can add a dataset to capture the unmatched rows.

  • Click on the Left join indicator.

  • Select Inner join.

This will retain only the customers who have made an order, and remove the others from the output dataset.

Dataiku screenshot showing the join types with Inner join selected.

Define the columns to keep in the output dataset

The next step is to choose which columns to retain from the input datasets.

We want to carry over all columns from both datasets into the output dataset, with the exception of customer_id (since the customerID column from the customers dataset has the same information).

  • On the left-hand side, navigate to the Selected columns step.

  • Deselect the customer_id column in the orders_by_customer dataset.

  • Select Run to execute the recipe.

Screenshot showing selected columns in the Join recipe.

When the recipe has finished running, click Explore dataset customers_orders_joined at the bottom of the screen to explore the customers_orders_joined dataset.

Screenshot of resulting joined dataset with 10 columns.

See a video covering these steps

What’s next?

So far all of your work has been in the Flow. Now it’s time to learn about the Lab!