Tutorial | Join recipe#

Get started#

The Join recipe allows you to enrich your data with columns from one or more datasets.

Objectives#

In this tutorial, you will:

  • Use a Join recipe to join three datasets into one output dataset.

Prerequisites#

To complete this tutorial, you’ll need the following:

  • A Dataiku instance (version 12.0 and above).

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Core Designer > Join Recipe.

  2. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

Let’s say we’re a financial company that uses data to detect fraudulent transactions.

The project comes with three datasets, described in the table below.

Dataset

Description

tx

Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made.

It also indicates whether the transaction has either been:

  • Authorized (a score of 1 in the authorized_flag column)

  • Flagged for potential fraud (a score of 0)

merchants

Each row is a unique merchant with information such as the merchant’s location and category.

cards

Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

In this tutorial, we want to enrich our tx dataset with:

  • The merchants data coming from merchants.

  • The cards data coming from cards.

Preview the datasets#

Let’s explore the data at hand and which columns could be used as the join key.

To do so:

  1. From the Flow, click on the tx dataset to select it.

  2. Click the Preview button at the bottom of the Flow to see a preview of the dataset.

  3. Explore its columns to determine which column could serve as the join key.

  4. With the Preview panel still open, select the cards and merchants to preview them as well.

Screenshot of the tx dataset preview in the Flow.

As you can see:

To join tx with …

We should use the following columns as join keys …

cards

  • card_id from tx, and

  • id from cards.

merchants

  • merchant_id from tx, and

  • id from merchants.

Join the datasets#

We can use the Join recipe to enrich the tx dataset with the information on:

  • Cards from the cards dataset.

  • Merchants from the merchants dataset.

Create a Join recipe#

Let’s first create the recipe.

  1. From the Flow, select the tx dataset and click the Actions icon (+) from the right panel to open the Actions tab.

  2. Under Visual recipes, choose Join with….

  3. Select cards as the second input dataset.

    Note

    Although only two datasets can be added in the Join recipe creation dialog, more datasets can be added at the Join step after creating the recipe.

  4. Ensure that the name of the output dataset is tx_joined.

  5. Click Create Recipe.

Screenshot of the New join recipe dialog box with two input datasets selected.

Join tx and cards#

The core step of this recipe is the Join step, where you choose how to match rows between the datasets.

Define the join condition#

In this first join, we want to match rows from the tx and cards datasets that have the same value of card_id and id, respectively.

  1. Click on Add a Condition to tell Dataiku which columns to match.

    The Join conditions dialog opens, where Dataiku automatically defines that the ID columns are the join key. Yet, we have to change the condition as we want to use the card_id column from the tx dataset (the id column in the tx dataset stores the transaction ID, not the card ID).

  2. Under Column from tx, select card_id.

  3. Select OK and return to the Join recipe.

Dataiku screenshot of the Join conditions dialog box in the Join recipe.

Note

By default, the recipe will drop unmatched rows, but we can also configure the recipe to send unmatched rows to another output dataset. This can be helpful to confirm your expectations of the output.

Dataiku screenshot of a Join recipe dialog showing the drop unmatched rows option.

Set the join type#

By default, the Join recipe performs a left join, which retains all rows in the left dataset, even if there is no matching information in the right.

Let’s keep this join type.

Note

For more information about join types, visit Concept | Join recipe.

Execute the recipe and explore the output#

Now, let’s run the recipe and view the output.

  1. Click Run to execute the recipe.

  2. When the recipe has finished running, click Explore dataset tx_joined at the bottom of the screen to explore the tx_joined dataset.

  3. Switch to column view to easily confirm the tx_joined dataset has the original columns of tx (in purple below), plus new columns from cards like first_active_month (in green below).

    Dataiku screenshot of the dataset output after joining *tx* and *cards*.

Join tx and merchants#

At this stage, we have only joined tx and cards.

Now, we have to configure the join condition between tx and merchants. We can achieve this in the same Join recipe.

Edit the Join step#

  1. From the output dataset, click the Parent Recipe button at the top right of the Explore tab.

    This takes you back to the Join step of the Join recipe.

  2. Click the + button at the top right of the tx dataset in the Join step. It allows adding more datasets to join with the tx and the cards datasets.

  3. In the Add a dataset to join with tx dialog, select merchants.

  4. Click Add Dataset. Merchants is added to the Join step.

  5. Now, define the join condition between tx and merchants, like we did before when joining tx and cards. This time though use the merchant_id column from tx.

Screenshot of the Join steps with the three datasets.

Define output columns#

The Selected columns step of the Join recipe allows you to:

  • Choose which columns to retain from the input datasets. By default, Dataiku keeps all non-conflicting columns in the output dataset, which means it excludes columns that would have the same name as a column from another dataset in the join. It also means that new, non-conflicting columns added upstream will be automatically added to the output.

  • Add a prefix to the name of all columns from a dataset of the join in order to avoid conflicts.

As we’re joining three datasets, to avoid conflicts and confusion in the column names of the output dataset, let’s use this step to add a prefix to all columns from the merchant dataset.

  1. Go to the Selected columns step.

  2. Under the merchants column, enter merchant in the Prefix field.

  3. Go to the Output step to check the name of the new merchant columns in the schema.

Screenshot of the Selected columns step.

Execute the recipe again and explore the output#

Now, let’s run the recipe once more and view the output.

  1. Click Run to execute the recipe.

  2. When the recipe has finished running, click Explore dataset tx_joined at the bottom of the screen to explore the tx_joined dataset.

    As you can see, Dataiku has added the merchant prefix to all the relevant columns.

    Screenshot of the final output dataset.
  3. Go back to the Flow to check your progress.

Dataiku screenshot of resulting Flow showing the schema of the output dataset.

The yellow circle icon represents the Join recipe. At any time, you can double-click on it to open it and edit its configuration.

Each time you run the recipe, it updates the output dataset, tx_joined.

What’s next?#

See also

For more information on the Join recipe, see also the Join: joining datasets article in the reference documentation.

Now that we have joined our three datasets into a consolidated one, we recommend that you follow the Tutorial | Prepare recipe article to learn how to prepare the data.