Concept | Build modes#
Datasets in Dataiku often have dependencies on upstream datasets in the Flow. As a result, these downstream datasets can become outdated if changes are made to preceding datasets or recipes.
Dataiku offers many build methods to counter or prevent issues that may arise from these dependencies. But why so many options? Once you start scaling your datasets, you won’t want to run any unnecessary computation. This article outlines different build strategies that will help you build exactly what you need.
Note
Once you’ve learned these strategies, you can apply them to building other objects in Dataiku like models and model evaluation stores.
Build modes#
There are three main build modes to consider when building datasets in Dataiku.
Build only this#
The simplest way to build a dataset is by using the Build only this option. This is the default build option for all datasets aside from those at the beginning of the Flow. It runs only the recipe that outputs the dataset, without rebuilding anything upstream.
Build upstream#
The Build upstream option builds the selected dataset and the datasets upstream of it. There are two types of upstream builds:
Build only modified#
This option will only build upstream datasets starting from the first outdated dataset. This way, you won’t have to rebuild the entire branch.
Build all upstream#
This option rebuilds all upstream datasets, regardless of whether they are already up to date.
You can also choose to build upstream and Stop at zone boundary. This means that upstream dependencies located outside the Flow zone will not be rebuilt even if they are outdated.
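To make the difference between these options concrete, here is a minimal sketch in plain Python. It is a toy model of a linear Flow, not Dataiku's implementation or API: the dataset names, the `outdated` set, and the `plan_build` helper are all hypothetical, and the sketch assumes a dataset counts as stale when it is outdated itself or when any of its inputs is stale.

```python
# Toy Flow: each dataset maps to the datasets its recipe reads from.
# All names here are hypothetical; "raw" is a source dataset with no recipe.
FLOW = {
    "raw": [],
    "cleaned": ["raw"],
    "joined": ["cleaned"],
    "report": ["joined"],
}


def upstream_of(flow, target):
    """Return the target and all of its ancestors, parents before children."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for parent in flow[name]:
            visit(parent)
        ordered.append(name)

    visit(target)
    return ordered


def plan_build(flow, target, mode, outdated=frozenset()):
    """List the datasets this toy model would recompute when building `target`."""
    if mode == "only_this":
        return [target]
    if mode == "all_upstream":
        # Every upstream dataset that has a recipe, up to date or not.
        return [d for d in upstream_of(flow, target) if flow[d]]
    if mode == "only_modified":
        stale, rebuilt = set(), []
        for name in upstream_of(flow, target):
            # Assumption: staleness propagates from inputs to outputs.
            if name in outdated or any(p in stale for p in flow[name]):
                stale.add(name)
                if flow[name]:  # source datasets have no recipe to run
                    rebuilt.append(name)
        return rebuilt
    raise ValueError(f"unknown mode: {mode}")


print(plan_build(FLOW, "report", "only_this"))
print(plan_build(FLOW, "report", "only_modified", outdated={"joined"}))
print(plan_build(FLOW, "report", "all_upstream"))
```

In this model, marking `joined` as outdated means only `joined` and `report` are recomputed under the "only modified" mode, while "all upstream" also recomputes `cleaned`.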
Build downstream#
The Build downstream option builds the datasets downstream of the selected dataset. Two options are available in the Advanced settings:
Build all downstream#
This runs recipes from the selected dataset until the end of the Flow is reached. The selected dataset itself is not built.
Find outputs and build recursively#
This option makes Dataiku find all final datasets downstream from the selected dataset and then build their upstream dependencies. In this case, you can choose either to build only the required dependencies or to force-build those upstream datasets.
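The two downstream options can be contrasted with a small sketch in plain Python. This is a toy model under stated assumptions rather than Dataiku's own logic: the Flow, the dataset names, and the helper functions are hypothetical, and a "final" dataset is assumed to be one that no other dataset reads from.

```python
# Toy Flow: each dataset maps to the datasets its recipe reads from.
# All names are hypothetical; "raw" and "lookup" are source datasets.
FLOW = {
    "raw": [],
    "cleaned": ["raw"],
    "lookup": [],
    "joined": ["cleaned", "lookup"],
    "report": ["joined"],
    "scores": ["joined"],
}


def children_of(flow, name):
    """Datasets whose recipe reads directly from `name`."""
    return [d for d, parents in flow.items() if name in parents]


def downstream_of(flow, selected):
    """All descendants of `selected`, excluding `selected` itself."""
    found = []
    queue = children_of(flow, selected)
    while queue:
        name = queue.pop(0)
        if name not in found:
            found.append(name)
            queue.extend(children_of(flow, name))
    return found


def final_outputs(flow, selected):
    """Downstream datasets that nothing else reads from (assumed 'final')."""
    return [d for d in downstream_of(flow, selected) if not children_of(flow, d)]


def upstream_closure(flow, targets):
    """The targets plus every ancestor that has a recipe, parents first."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for parent in flow[name]:
            visit(parent)
        if flow[name]:  # source datasets have no recipe to run
            ordered.append(name)

    for target in targets:
        visit(target)
    return ordered


# "Build all downstream": everything after the selected dataset.
print(downstream_of(FLOW, "cleaned"))
# "Find outputs and build recursively": locate the final datasets first,
# then consider everything upstream of them as build candidates.
outputs = final_outputs(FLOW, "cleaned")
print(outputs)
print(upstream_closure(FLOW, outputs))
```

Note that in this model the recursive option brings `cleaned` itself back into the set of candidates, because it is upstream of the final datasets. Whether each candidate is actually recomputed would then depend on the choice between building required dependencies and force-building.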
Update output schemas#
A schema describes the structure of a dataset. It includes the list of columns and their respective names and storage types. Often, the schema of our datasets will change when designing the Flow. Dataiku provides a way to ensure that your schema changes are correctly applied to downstream datasets.
The Update output schemas option propagates the schema to output datasets before each recipe runs. This way, your data is built into the correct structure.
Building sections of the Flow#
You can build the entire Flow using the Build all button in the Flow Actions menu. If you prefer, you can build a Flow zone using the Build button directly on the Flow zone instead.
Advanced rebuild behavior#
The Rebuild behavior of a dataset lets us control whether and how that dataset can be built.
For example, if we set the rebuild behavior of a dataset to Explicit, the dataset won't be rebuilt unless we specifically choose to rebuild it. We can also make a dataset Write-protected so that it never gets rebuilt.
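As a rough illustration of how these settings could shape an upstream build, the sketch below extends a toy Flow model in plain Python. It is not Dataiku's implementation: the `"NORMAL"`, `"EXPLICIT"`, and `"WRITE_PROTECTED"` labels, the dataset names, and the `plan_recursive_build` helper are hypothetical, and the sketch assumes that an upstream traversal stops when it reaches an Explicit or Write-protected dataset.

```python
# Toy Flow: each dataset maps to the datasets its recipe reads from.
# All names and behavior labels below are hypothetical.
FLOW = {
    "raw": [],
    "cleaned": ["raw"],
    "joined": ["cleaned"],
    "report": ["joined"],
}


def plan_recursive_build(flow, target, behavior):
    """List the datasets an upstream build of `target` would recompute."""
    plan = []

    def visit(name, is_target):
        mode = behavior.get(name, "NORMAL")
        if mode == "WRITE_PROTECTED":
            return  # never rebuilt, even when selected directly
        if mode == "EXPLICIT" and not is_target:
            return  # rebuilt only when it is the dataset we chose to build
        for parent in flow[name]:
            visit(parent, False)
        if flow[name] and name not in plan:  # sources have no recipe to run
            plan.append(name)

    visit(target, True)
    return plan


# Building "report" does not reach past the Explicit dataset "joined".
print(plan_recursive_build(FLOW, "report", {"joined": "EXPLICIT"}))
# Selecting "joined" directly still rebuilds it and its upstream datasets.
print(plan_recursive_build(FLOW, "joined", {"joined": "EXPLICIT"}))
# A Write-protected dataset is left alone in both cases.
print(plan_recursive_build(FLOW, "joined", {"joined": "WRITE_PROTECTED"}))
```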
What’s next?#
Now that you’ve explored different build modes and configurations, there’s more learning to do!
If you’re curious about automating builds, check out our Automation course or other courses in the Advanced Designer learning path.