Concept | Build modes#

Datasets in Dataiku often have dependencies on upstream datasets in the Flow. As a result, these downstream datasets can become outdated if changes are made to preceding datasets or recipes.

Dataiku offers many build methods to counter or prevent issues that may arise from these dependencies. But why so many options? Once you start scaling your datasets, you won’t want to run any unnecessary computation. This article outlines different build strategies that will help you build exactly what you need.

Note

Once you’ve learned these strategies, you can apply them to building other objects in Dataiku like models and model evaluation stores.

Build modes#

There are three main build modes to consider when building datasets in Dataiku.

Build only this#

The simplest way to build a dataset is with the Build only this option. This is the default build option for all datasets aside from those at the beginning of the Flow. It runs only the recipe that outputs the selected dataset, so upstream datasets are not rebuilt.

Slide showing flow items that run "build only this".
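To make the idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Dataiku's implementation or API: the Flow is assumed to be a dictionary that maps each dataset to the inputs of the recipe producing it, and all dataset names and the function name are hypothetical.

```python
# Toy Flow: each dataset maps to the inputs of the recipe that produces it.
# Dataset names are hypothetical; source datasets have no inputs.
FLOW = {
    "orders_raw": [],
    "orders_prepared": ["orders_raw"],
    "orders_joined": ["orders_prepared"],
    "orders_report": ["orders_joined"],
}


def build_only_this(flow, selected):
    """Return the datasets rebuilt when only the selected dataset's recipe runs."""
    if not flow[selected]:
        raise ValueError(f"{selected} is a source dataset with no recipe to run")
    # Inputs are read as they currently are; nothing upstream is refreshed.
    return [selected]


print(build_only_this(FLOW, "orders_joined"))
```

In this model, only the selected dataset appears in the plan, even if one of its inputs is outdated.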

Build upstream#

The Build upstream option builds the selected dataset and upstream datasets. There are two types of upstream builds:

Build only modified#

This option builds upstream datasets only from the first outdated dataset onward. This way, you won’t have to rebuild the entire branch.

Slide showing flow items that build when you build only modified or required dependencies.
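The selection logic can be sketched in plain Python. This is a toy model under stated assumptions, not Dataiku's actual algorithm: the Flow is a dictionary from each dataset to its recipe inputs, the set of outdated datasets is given by hand, and a dataset is rebuilt if it is outdated itself or if one of its inputs was rebuilt. All names are hypothetical.

```python
# Toy Flow: dataset -> inputs of the recipe that produces it (hypothetical names).
FLOW = {
    "raw": [],
    "cleaned": ["raw"],
    "enriched": ["cleaned"],
    "report": ["enriched"],
}


def upstream_order(flow, target):
    """Return the target and everything upstream of it, inputs first."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for parent in flow[name]:
            visit(parent)
        order.append(name)

    visit(target)
    return order


def build_only_modified(flow, target, outdated):
    """Rebuild a dataset if it is outdated or if any of its inputs was rebuilt."""
    rebuilt = []
    for name in upstream_order(flow, target):
        if not flow[name]:
            continue  # source datasets have no recipe to run
        if name in outdated or any(parent in rebuilt for parent in flow[name]):
            rebuilt.append(name)
    return rebuilt


print(build_only_modified(FLOW, "report", outdated={"enriched"}))
```

With `enriched` marked as outdated, the plan starts at `enriched` and continues to `report`, while `cleaned` is left untouched.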

Build all upstream#

This option rebuilds all upstream datasets, regardless of whether they are already up to date.

Slide showing flow items that build when you build all upstream.

You can also choose to build upstream and Stop at zone boundary. This means that upstream dependencies located outside the Flow zone will not be rebuilt even if they are outdated.
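As an illustration, the sketch below models a forced upstream build with an optional zone boundary. It is a simplified toy, not Dataiku's implementation: the Flow dictionary, the zone assignments, the dataset names, and the `stop_at_zone_boundary` parameter name are all assumptions made for this example.

```python
# Toy Flow: dataset -> inputs of the recipe that produces it (hypothetical names).
FLOW = {
    "raw": [],
    "cleaned": ["raw"],
    "features": ["cleaned"],
    "scored": ["features"],
}

# Hypothetical Flow zone for each dataset.
ZONES = {
    "raw": "ingestion",
    "cleaned": "ingestion",
    "features": "modeling",
    "scored": "modeling",
}


def build_all_upstream(flow, zones, target, stop_at_zone_boundary=False):
    """Return every dataset with a recipe upstream of the target, inputs first."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        if stop_at_zone_boundary and zones[name] != zones[target]:
            return  # outside the target's zone: neither rebuilt nor explored
        for parent in flow[name]:
            visit(parent)
        if flow[name]:  # source datasets have no recipe to run
            order.append(name)

    visit(target)
    return order


print(build_all_upstream(FLOW, ZONES, "scored"))
print(build_all_upstream(FLOW, ZONES, "scored", stop_at_zone_boundary=True))
```

In this model, the first call plans `cleaned`, `features`, and `scored`, while the second call stops at the edge of the `modeling` zone and plans only `features` and `scored`.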

Build downstream#

The Build downstream option builds datasets downstream of the selected dataset. Let’s look at its Advanced settings.

Build all downstream#

This runs recipes from the selected dataset until the end of the Flow is reached. The selected dataset itself is not built.

Slide showing flow items that build when you run downstream recipes.
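A rough sketch of this traversal follows, using only the Python standard library. It is a toy model of the behavior described above rather than Dataiku's implementation; the Flow dictionary, dataset names, and function names are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy Flow: dataset -> inputs of the recipe that produces it (hypothetical names).
FLOW = {
    "a": [],
    "b": ["a"],
    "c": ["b"],
    "d": ["b"],
    "e": ["c", "d"],
}


def downstream_of(flow, selected):
    """Return every dataset that depends, directly or not, on the selected one."""
    found, frontier = set(), [selected]
    while frontier:
        current = frontier.pop()
        for name, inputs in flow.items():
            if current in inputs and name not in found:
                found.add(name)
                frontier.append(name)
    return found


def build_all_downstream(flow, selected):
    """Plan every downstream recipe in dependency order, excluding the selection."""
    targets = downstream_of(flow, selected)
    return [name for name in TopologicalSorter(flow).static_order() if name in targets]


print(build_all_downstream(FLOW, "b"))
```

The selected dataset `b` never appears in the plan, and `e` always comes last because it depends on both `c` and `d`.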

Find outputs and build recursively#

With this option, Dataiku finds all final datasets downstream of the selected dataset and builds them along with their upstream dependencies. You can choose either to build only the required dependencies or to force-build all of those upstream datasets.

Slide showing flow items that build when you find outputs and build recursively.
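The difference from a plain downstream build is easier to see in a small sketch. The code below is a toy model under stated assumptions, not Dataiku's algorithm: a final dataset is one that no recipe consumes, the outdated set is supplied by hand, and the `force` flag stands in for the force-build choice. All names are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy Flow: dataset -> inputs of the recipe that produces it (hypothetical names).
FLOW = {
    "a": [],
    "c_raw": [],
    "b": ["a"],
    "c": ["c_raw"],
    "d": ["b", "c"],
    "e": ["d"],
}


def downstream_of(flow, selected):
    """Return every dataset that depends, directly or not, on the selected one."""
    found, frontier = set(), [selected]
    while frontier:
        current = frontier.pop()
        for name, inputs in flow.items():
            if current in inputs and name not in found:
                found.add(name)
                frontier.append(name)
    return found


def upstream_of(flow, target):
    """Return the target plus everything it depends on."""
    found, frontier = {target}, [target]
    while frontier:
        for parent in flow[frontier.pop()]:
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


def find_outputs_and_build(flow, selected, outdated=(), force=False):
    """Find final downstream datasets, then plan a build of their upstream."""
    consumed = {parent for inputs in flow.values() for parent in inputs}
    finals = {name for name in downstream_of(flow, selected) if name not in consumed}
    candidates = set()
    for final in finals:
        candidates |= upstream_of(flow, final)
    rebuilt = []
    for name in TopologicalSorter(flow).static_order():
        if name not in candidates or not flow[name]:
            continue  # not involved, or a source dataset without a recipe
        if force or name in outdated or any(p in rebuilt for p in flow[name]):
            rebuilt.append(name)
    return rebuilt


print(find_outputs_and_build(FLOW, "b", outdated={"c"}))
print(find_outputs_and_build(FLOW, "b", force=True))
```

Note that `c` is not downstream of `b`, yet it enters the plan because the final dataset `e` depends on it.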

Update output schemas#

A schema describes the structure of a dataset. It includes the list of columns and their respective names and storage types. Often, the schema of our datasets will change when designing the Flow. Dataiku provides a way to ensure that your schema changes are correctly applied to downstream datasets.

Slide showing one type of schema change, removing a column, that would require a schema update.

The Update output schemas option propagates the schema to output datasets before each recipe runs. This way, your data is built into the correct structure.

Screenshot of the update output schemas checkbox.
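As a rough illustration of schema propagation, the sketch below treats a schema as a list of (column name, storage type) pairs and each recipe as a function from an input schema to an output schema. This is a toy model, not Dataiku's mechanism; the dataset names, column names, and recipe functions are hypothetical.

```python
def keep_columns(schema):
    """Hypothetical recipe whose output columns mirror its input columns."""
    return list(schema)


def add_name_length(schema):
    """Hypothetical recipe that appends one computed column."""
    return list(schema) + [("name_length", "int")]


# The "email" column was just removed from the first dataset, so the schemas
# stored for the two downstream datasets are stale.
SCHEMAS = {
    "customers": [("id", "bigint"), ("name", "string")],
    "customers_prepared": [("id", "bigint"), ("name", "string"), ("email", "string")],
    "customers_final": [
        ("id", "bigint"),
        ("name", "string"),
        ("email", "string"),
        ("name_length", "int"),
    ],
}

# (input dataset, recipe, output dataset), listed in Flow order.
RECIPES = [
    ("customers", keep_columns, "customers_prepared"),
    ("customers_prepared", add_name_length, "customers_final"),
]


def update_output_schemas(schemas, recipes):
    """Recompute each output schema from its input before the recipe would run."""
    updated = dict(schemas)
    for input_name, recipe, output_name in recipes:
        updated[output_name] = recipe(updated[input_name])
    return updated


for name, schema in update_output_schemas(SCHEMAS, RECIPES).items():
    print(name, [column for column, _ in schema])
```

After propagation, the removed `email` column no longer appears in either downstream schema, while the computed `name_length` column is preserved.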

Building sections of the Flow#

You can build the entire Flow using the Build all button in the Flow Actions menu. If you prefer, you can build a Flow zone using the Build button directly on the Flow zone instead.

Slide showing the different Dataiku user interface options for building sections or all of your flow.

Building shared datasets#

Note that a shared dataset can only be built from the Flow where it was created, that is, from its source project.

Selecting Build upstream or Build downstream options from our current project will not build a shared dataset in a Flow outside of its source project. In other words, only the datasets that are created in the current project will be built. Furthermore, non-managed or external datasets are also unaccounted for in these builds.
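The filtering described above can be pictured with a short sketch. It is a toy model only: the `Dataset` class, its fields, and the project keys are hypothetical and do not correspond to Dataiku's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record describing a dataset that appears in a Flow."""
    name: str
    source_project: str
    managed: bool = True


def buildable_here(datasets, current_project):
    """Keep only managed datasets that were created in the current project."""
    return [
        dataset.name
        for dataset in datasets
        if dataset.source_project == current_project and dataset.managed
    ]


FLOW_DATASETS = [
    Dataset("crm_export", source_project="SALES", managed=False),  # external
    Dataset("customers_shared", source_project="CRM"),  # shared from another project
    Dataset("customers_joined", source_project="SALES"),
    Dataset("customers_scored", source_project="SALES"),
]

print(buildable_here(FLOW_DATASETS, current_project="SALES"))
```

In this model, a recursive build launched from the `SALES` project skips both the shared dataset and the external dataset.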

Advanced rebuild behavior#

The Rebuild behavior of a dataset lets us control whether and how that dataset can be built.

For example, if we set the rebuild behavior of a dataset to Explicit, the dataset won’t be rebuilt unless we specifically choose to rebuild it. We can also make a dataset Write-protected so that it never gets rebuilt.

A screenshot highlighting the rebuild behavior settings of a dataset.
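One way to picture these settings is as a filter applied to a build plan. The sketch below is a simplified toy, not Dataiku's implementation: the behavior labels are plain strings chosen for this example (with "normal" assumed as the default), and the dataset names are hypothetical.

```python
def apply_rebuild_behavior(plan, behaviors, selected):
    """Filter a build plan according to each dataset's rebuild behavior."""
    kept = []
    for name in plan:
        behavior = behaviors.get(name, "normal")  # assumed default label
        if behavior == "write-protected":
            continue  # never rebuilt, even when selected directly
        if behavior == "explicit" and name != selected:
            continue  # rebuilt only when it is the dataset we chose to build
        kept.append(name)
    return kept


PLAN = ["cleaned", "reference_table", "features", "scored"]
BEHAVIORS = {
    "cleaned": "write-protected",
    "reference_table": "explicit",
}

print(apply_rebuild_behavior(PLAN, BEHAVIORS, selected="scored"))
print(apply_rebuild_behavior(PLAN[:2], BEHAVIORS, selected="reference_table"))
```

In the first call, both protected datasets drop out of the plan. In the second, `reference_table` is built because it is the dataset that was explicitly selected.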

What’s next?#

Now that you’ve explored different build modes and configurations, there’s more learning to do!

If you’re curious about automating builds, check out our Automation course or other courses in the Advanced Designer learning path.