Concept | Join recipe#

Watch the video

The primary purpose of the Join recipe is to enrich one dataset with columns from another.

Types of joins#

Dataiku matches values using a key column that is common to both datasets. The output of the matching depends on the join type. There are many methods to choose from for joining datasets:

Screenshot showing the join types in Dataiku.



Left join (default)

Keeps all rows of the left dataset and adds information of matched records from the right dataset.

Inner join

Keeps only the rows that match in both datasets. This is useful when only the rows with complete information from both datasets will be useful downstream in the Flow.

Outer join

Keeps all rows from both datasets, combining rows where there is a match. This is useful when you need to retain all the information in both datasets.

Right join

Similar to a left join, this join keeps all rows of the right dataset and adds information from the left dataset when there is a match.

Left anti-join

Keeps all the rows from the left dataset that have no match in the right dataset.

Right anti-join

Keeps all the rows from the right dataset that have no match in the left dataset.

Cross join

Denotes a Cartesian product that matches all rows of the left dataset with all rows of the right dataset. This is useful when you need to compare every row in one dataset to every row of another.

Advanced join

Provides custom options for row selection and deduplication for when none of the other options are suitable.

Join steps#

Change key column#

You can change the detected key column by selecting your own columns to match on and setting the conditions in the Join step.

Screenshot showing the join conditions dialog.

Add dataset#

When performing a left, right, or inner join, you can add a dataset to capture any unmatched rows.

Screenshot showing the option to Send unmatched rows to other output dataset(s).

Choose output columns#

In the Selected columns step, you can tell Dataiku which columns you want to see in the output dataset.

Screenshot showing the Selected columns step of a Join recipe.

Set pre-filters#

There are a few other options including Pre-filters which allows you to keep or drop rows based on your criteria. This can be useful before joining unwanted rows in large datasets.

Screenshot showing the pre-filter options in the Join recipe.

Set post-filters#

You can use the Post-filter step to filter the results of the Join operation before writing the output dataset. For example, you can inform Dataiku if duplicate rows are allowed and if you want to return only rows that match a condition.

Screenshot showing the post-filter options in Dataiku.

Review output#

Finally, you can use the Output step to review the execution specs and output columns.

What’s next?#

There are a lot of reasons to use joins when building a Flow.

Continue learning about this recipe by working through the Tutorial | Join recipe article.