Concept | Join recipe#

The primary purpose of the Join recipe is to enrich one dataset with columns from another.

Types of joins#

Dataiku matches values using a key column that’s common to both datasets. The output of the matching depends on the join type. There are many methods to choose from for joining datasets:

Type	Description
Left join (default)	Keeps all rows of the left dataset and adds information of matched records from the right dataset.
Inner join	Keeps only the rows that match in both datasets. This is useful when only the rows with complete information from both datasets will be useful downstream in the Flow.
Outer join	Keeps all rows from both datasets, combining rows where there is a match. This is useful when you need to retain all the information in both datasets.
Right join	Similar to a left join, this join keeps all rows of the right dataset and adds information from the left dataset when there is a match.
Left anti-join	Keeps all the rows from the left dataset that have no match in the right dataset.
Right anti-join	Keeps all the rows from the right dataset that have no match in the left dataset.
Cross join	Denotes a Cartesian product that matches all rows of the left dataset with all rows of the right dataset. This is useful when you need to compare every row in one dataset to every row of another.
Advanced join	Provides custom options for row selection and deduplication for when none of the other options are suitable.

Join steps#

Change key column#

You can change the detected key column by selecting your own columns to match on and setting the conditions in the Join step.

Add dataset#

When performing a left, right, or inner join, you can add a dataset to capture any unmatched rows.

Choose output columns#

In the Selected columns step, you can tell Dataiku which columns you want to see in the output dataset.

Set pre-filters#

There are a few other options including Pre-filters which allows you to keep or drop rows based on your criteria. This can be useful before joining unwanted rows in large datasets.

Set post-filters#

You can use the Post-filter step to filter the results of the Join operation before writing the output dataset. For example, you can tell Dataiku if duplicate rows are allowed and if you want to return only rows that match a condition.

Review output#

Finally, you can use the Output step to review the execution specs and output columns.

Next steps#

There are a lot of reasons to use joins when building a Flow.

Continue learning about this recipe by working through the Tutorial | Join recipe article.

Tip

You can find this content (and more) by registering for the Dataiku Academy course, Visual Recipes. When ready, challenge yourself to earn a certification!