Concept | Build modes
Datasets in Dataiku often have dependencies on upstream datasets in the Flow. As a result, these downstream datasets can become outdated if changes are made to preceding datasets or recipes.
Dataiku offers many build methods to counter or prevent issues that may arise from these dependencies. But why so many options? Once you start scaling your datasets, you won’t want to run any unnecessary computation. This article outlines different build strategies that will help you build exactly what you need.
There are three main build modes to consider when building datasets in Dataiku.
The simplest way to build a dataset is by using the Build only this option. This is the default build option for all datasets aside from those at the beginning of the Flow. It only runs the specific recipe that outputs the dataset.
The Build upstream option builds the selected dataset and upstream datasets. There are two types of upstream builds:
One type of upstream build only builds upstream datasets starting from the first outdated dataset. This way, you won’t have to rebuild the entire branch.
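To make the idea of building from the first outdated dataset concrete, here is a minimal sketch in plain Python. It does not use the Dataiku API. The toy Flow, the `upstream_build_plan` function, and the `outdated` set are all hypothetical names introduced for illustration, and the sketch simply takes "outdated" as a given flag rather than reproducing how Dataiku detects it.

```python
from typing import Dict, List, Set


def upstream_build_plan(
    target: str,
    parents: Dict[str, List[str]],
    outdated: Set[str],
) -> List[str]:
    """Return the datasets to build, in dependency order, for a toy upstream build.

    Assumption of this sketch: a dataset is rebuilt if it is flagged as
    outdated or if any dataset it depends on is rebuilt. Datasets with no
    parents (the start of the Flow) have no recipe and are never built.
    """
    needs_rebuild: Dict[str, bool] = {}
    plan: List[str] = []

    def visit(name: str) -> bool:
        if name in needs_rebuild:
            return needs_rebuild[name]
        # Visit every parent first so the plan stays in dependency order.
        parent_flags = [visit(p) for p in parents.get(name, [])]
        has_recipe = bool(parents.get(name))
        rebuild = has_recipe and (name in outdated or any(parent_flags))
        needs_rebuild[name] = rebuild
        if rebuild:
            plan.append(name)
        return rebuild

    visit(target)
    return plan


# Hypothetical linear Flow: raw -> cleaned -> joined -> report
flow = {"raw": [], "cleaned": ["raw"], "joined": ["cleaned"], "report": ["joined"]}

# Only "joined" is outdated, so "cleaned" is skipped.
print(upstream_build_plan("report", flow, {"joined"}))
```

In this toy model, marking `cleaned` as outdated instead would pull `cleaned`, `joined`, and `report` into the plan, which mirrors the point that the build starts at the first outdated dataset rather than at the top of the branch.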
The Build downstream option builds downstream datasets. Let’s look at the Advanced settings.
A downstream build runs recipes from the selected dataset until the end of the Flow is reached. The selected dataset itself is not built.
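The downstream traversal can be pictured with a small sketch, again in plain Python rather than the Dataiku API. The `children` mapping and the `downstream_build_plan` function are hypothetical. The sketch only shows the ordering idea: everything reachable from the selected dataset is built, parents before children, and the selected dataset itself is left out.

```python
from typing import Dict, List, Set


def downstream_build_plan(selected: str, children: Dict[str, List[str]]) -> List[str]:
    """Return every dataset downstream of `selected`, in a valid build order.

    Uses a depth-first search and reverses the post-order, which yields a
    topological order for an acyclic Flow. The selected dataset is excluded.
    """
    seen: Set[str] = set()
    post_order: List[str] = []

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for child in children.get(name, []):
            visit(child)
        post_order.append(name)

    visit(selected)
    post_order.reverse()
    # After reversing, the selected dataset is first; drop it.
    return post_order[1:]


# Hypothetical Flow where "report" depends on both "joined" and "stats".
flow_children = {
    "cleaned": ["joined", "stats"],
    "joined": ["report"],
    "stats": ["report"],
    "report": [],
}

plan = downstream_build_plan("cleaned", flow_children)
print(plan)
```

Note that `report` always comes after both `joined` and `stats` in the returned plan, since a recipe cannot run before its inputs have been built.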
A schema describes the structure of a dataset. It includes the list of columns and their respective names and storage types. Often, the schema of our datasets will change when designing the Flow. Dataiku provides a way to ensure that your schema changes are correctly applied to downstream datasets.
The Update output schemas option propagates the schema to output datasets before each recipe runs. This way, your data is built into the correct structure.
You can build the entire Flow using the Build all button in the Flow Actions menu. If you prefer, you can build a Flow zone using the Build button directly on the Flow zone instead.
The Rebuild behavior of a dataset lets us control whether and how that dataset can be built.
For example, if we set the rebuild behavior of a dataset to Explicit, the dataset won’t be rebuilt unless we specifically choose to rebuild it. We can also make a dataset Write-protected so it never gets rebuilt.
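As a rough mental model, the rebuild behaviors described above can be written as a small decision function. This is a sketch in plain Python, not Dataiku code. The `RebuildBehavior` enum, the `NORMAL` member (standing in for the default behavior), and the `may_rebuild` function are names assumed here for illustration.

```python
from enum import Enum


class RebuildBehavior(Enum):
    NORMAL = "normal"                    # assumed name for the default behavior
    EXPLICIT = "explicit"
    WRITE_PROTECTED = "write_protected"


def may_rebuild(behavior: RebuildBehavior, is_selected_target: bool) -> bool:
    """Decide whether a dataset may be rebuilt during a build job.

    `is_selected_target` is True when the user chose to build this dataset
    directly, and False when it is only reached through its dependencies.
    """
    if behavior is RebuildBehavior.WRITE_PROTECTED:
        return False  # never rebuilt
    if behavior is RebuildBehavior.EXPLICIT:
        return is_selected_target  # rebuilt only when chosen specifically
    return True


# An Explicit dataset reached through a dependency is left alone.
print(may_rebuild(RebuildBehavior.EXPLICIT, is_selected_target=False))
```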