Concept | Dataset building strategies
Datasets in Dataiku often have dependencies on upstream datasets in the Flow. As a result, these downstream datasets can become outdated if changes are made to preceding datasets or recipes.
Dataiku comes with various build methods to counter or prevent issues that may arise from these dependencies. This lesson outlines these different build strategies and how to use them.
Build dataset options
The simplest way to build a dataset is by using the Not recursive option. A non-recursive build is the default build option for all datasets aside from those at the beginning of the Flow. It only runs the specific recipe that outputs the dataset.
You’ll also see the option to Update output schemas for this build. This checkbox is available for all build types. We’ll explain this further in the section Update output schemas.
The Recursive upstream option builds the selected dataset and upstream datasets.
This build type lets you choose how to handle dependencies. You have the choice to:
Build required dependencies. Before building the selected dataset, this rebuilds only the upstream datasets that have become outdated.
Force-build. This rebuilds all upstream datasets, regardless of whether they are already up to date.
You can also choose to build upstream and Stop at zone boundary. This means that upstream dependencies located outside the selected dataset’s Flow zone will not be rebuilt, even if they are outdated.
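To make the difference between these upstream choices concrete, here is a minimal sketch in plain Python. It is a toy model of a Flow, not Dataiku's API: the `Dataset` class, the `plan_upstream_build` function, the `mode` values, and the dataset and zone names are all hypothetical. It assumes that a dataset is rebuilt when it is the selected dataset, when it is outdated, or when one of its own inputs was rebuilt, and that datasets without inputs are sources that are never built.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    zone: str
    parents: list = field(default_factory=list)  # names of upstream datasets
    outdated: bool = False


def plan_upstream_build(flow, target, mode="required", stop_at_zone_boundary=False):
    """Return dataset names, in build order, for a recursive upstream build.

    mode="required" rebuilds only what is outdated (or fed by something rebuilt);
    mode="force" rebuilds every upstream dataset that is reached.
    """
    target_zone = flow[target].zone
    order = []
    decided = {}  # dataset name -> True if it is part of the plan

    def visit(name):
        if name in decided:
            return decided[name]
        ds = flow[name]
        if not ds.parents:
            rebuilt = False  # source dataset: nothing to run
        elif stop_at_zone_boundary and ds.zone != target_zone:
            rebuilt = False  # outside the zone: neither rebuilt nor traversed
        else:
            parent_rebuilt = [visit(p) for p in ds.parents]
            rebuilt = (
                name == target
                or mode == "force"
                or ds.outdated
                or any(parent_rebuilt)
            )
            if rebuilt:
                order.append(name)  # parents were appended first
        decided[name] = rebuilt
        return rebuilt

    visit(target)
    return order


# Hypothetical Flow: two branches in an "ingest" zone feed a "prep" zone.
flow = {d.name: d for d in [
    Dataset("raw", "ingest"),
    Dataset("ref_raw", "ingest"),
    Dataset("cleaned", "ingest", ["raw"], outdated=True),
    Dataset("ref_clean", "ingest", ["ref_raw"]),
    Dataset("joined", "prep", ["cleaned", "ref_clean"]),
    Dataset("features", "prep", ["joined"]),
]}

print(plan_upstream_build(flow, "features", mode="required"))
print(plan_upstream_build(flow, "features", mode="force"))
print(plan_upstream_build(flow, "features", stop_at_zone_boundary=True))
```

In this toy Flow, the required-dependencies plan skips `ref_clean` because it is up to date, the force-build plan includes it, and stopping at the zone boundary leaves the outdated `cleaned` dataset untouched. A non-recursive build would correspond to running only the recipe that outputs `features`.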
The Recursive downstream option builds the selected dataset and downstream datasets. When performing a recursive downstream build, you have the option to either:
Run downstream recipes. This runs recipes from the selected dataset until the end of the Flow is reached.
Find outputs and build recursively. Dataiku finds the final output datasets downstream of the selected dataset and builds each of them recursively. Because these are recursive builds, upstream dependencies of the output datasets may also be built, including ones that are not downstream of the selected dataset. In this case, you can choose to either build required dependencies or force-build those upstream datasets.
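The two downstream choices can be contrasted with a small sketch, again in plain Python rather than Dataiku's API. The Flow is a hypothetical dictionary mapping each dataset name to the names of its inputs, and the function names are made up for illustration. The sketch assumes that "run downstream recipes" covers only datasets written by recipes after the selected one, and that "find outputs and build recursively" considers every non-source ancestor of the final outputs as a build candidate.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def children_map(flow):
    """Invert the parent mapping: dataset name -> names of direct children."""
    children = {name: [] for name in flow}
    for name, parents in flow.items():
        for parent in parents:
            children[parent].append(name)
    return children


def reachable(graph, start_nodes):
    """All nodes reachable from start_nodes by following graph edges."""
    seen, stack = set(), list(start_nodes)
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def in_flow_order(flow, names):
    """Keep only `names`, ordered so that inputs come before outputs."""
    return [n for n in TopologicalSorter(flow).static_order() if n in names]


def run_downstream_recipes(flow, selected):
    """Datasets written by recipes downstream of the selected dataset."""
    return in_flow_order(flow, reachable(children_map(flow), [selected]))


def find_outputs_and_build_recursively(flow, selected):
    """Final outputs below `selected`, plus every candidate for their recursive build."""
    children = children_map(flow)
    downstream = reachable(children, [selected]) | {selected}
    outputs = {n for n in downstream if not children[n]}
    ancestors = reachable(flow, outputs)  # follows parent edges upstream
    candidates = {n for n in outputs | ancestors if flow[n]}  # skip sources
    return sorted(outputs), in_flow_order(flow, candidates)


# Hypothetical Flow: dataset name -> names of its input datasets.
flow = {
    "orders": [],
    "customers": [],
    "orders_clean": ["orders"],
    "customers_clean": ["customers"],
    "joined": ["orders_clean", "customers_clean"],
    "report": ["joined"],
}

print(run_downstream_recipes(flow, "orders_clean"))
outputs, candidates = find_outputs_and_build_recursively(flow, "orders_clean")
print(outputs, candidates)
```

Starting from `orders_clean`, the first function stays on the downstream path, while the second also reaches `customers_clean`, which sits upstream of the `report` output but not downstream of the selection. Which candidates are actually rebuilt would then depend on the required-dependencies or force-build choice.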
Update output schemas
A schema describes the structure of a dataset. It includes the list of columns and their respective names and storage types. Often, the schema of our datasets will change when designing the Flow. Dataiku provides a way to ensure that your schema changes are correctly applied to downstream datasets.
The Update output schemas option propagates the schema to output datasets before each recipe runs. This way, your data is built into the correct structure.
Note that this is the default option when running recipes. When building datasets, on the other hand, you will have to manually select the checkbox to update output schemas.
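As a rough illustration of what propagating a schema before each recipe runs means, the sketch below models a schema as a list of (column name, storage type) pairs and each recipe as a function from its input schema to its output schema. Everything here is hypothetical (the helper names, the dataset names, and the linear chain of recipes); it does not reproduce how Dataiku computes schemas internally.

```python
def add_column(name, storage_type):
    """Toy recipe: output schema is the input schema plus one new column."""
    def transform(schema):
        return schema + [(name, storage_type)]
    return transform


def keep_columns(*names):
    """Toy recipe: output schema keeps only the named columns."""
    def transform(schema):
        return [col for col in schema if col[0] in names]
    return transform


def propagate_schemas(source_schema, recipes, stored_schemas):
    """Walk a linear chain of (output dataset, recipe) pairs.

    Recompute each output schema from its input schema, update the stored
    schema when it differs, and return the schemas that changed.
    """
    current = source_schema
    changes = {}
    for output_name, transform in recipes:
        expected = transform(current)
        if stored_schemas.get(output_name) != expected:
            stored_schemas[output_name] = expected
            changes[output_name] = expected
        current = expected  # the next recipe reads this output
    return changes


source = [("order_id", "bigint"), ("amount", "double"), ("country", "string")]

recipes = [
    ("orders_prepared", add_column("amount_eur", "double")),
    ("orders_slim", keep_columns("order_id", "amount_eur")),
]

# Schemas saved before the new column was added upstream.
stored = {
    "orders_prepared": list(source),
    "orders_slim": [("order_id", "bigint")],
}

changes = propagate_schemas(source, recipes, stored)
for name, schema in changes.items():
    print(name, schema)
```

Without the propagation step, `orders_slim` would keep its stale one-column schema even though the upstream recipe now produces `amount_eur`. Running the propagation first is what lets the data be built into the correct structure.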
Building sections of the Flow
You can build the entire Flow using the Build all button in the Flow Actions menu. If you prefer, you can build a Flow zone using the Build button directly on the Flow zone instead.
There are a few things to keep in mind while building datasets.
First, when we are using a shared dataset in our Flow, the dataset can only be built from the Flow where it was created — that is, from its source project.
Selecting any of the recursive build options from our current project will not build a shared dataset in a Flow outside of its source project. In other words, the recursive build only builds the datasets that are created in the current project.
Furthermore, non-managed (external) datasets are also skipped by recursive builds.
We can change a dataset’s settings to control how or if it can be built.
For one, when we set the rebuild behavior of a dataset to Explicit, recursive builds no longer rebuild it. The dataset is rebuilt only when we specifically choose to build it.
We can also write-protect a dataset so that it never gets rebuilt. A write-protected dataset is read-only, and the only way to edit it is by writing to it from a code notebook.
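The rules above can be summarized as a single decision function. This is a sketch under stated assumptions, not Dataiku code: the `DatasetSettings` class, the `will_rebuild` function, the project keys, and the `"NORMAL"`, `"EXPLICIT"`, and `"WRITE_PROTECT"` labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSettings:
    name: str
    source_project: str               # project where the dataset was created
    managed: bool = True              # False for external (non-managed) datasets
    rebuild_behavior: str = "NORMAL"  # hypothetical labels, see lead-in


def will_rebuild(ds, current_project, explicitly_selected=False):
    """Decide whether a build launched from current_project rebuilds ds."""
    if ds.source_project != current_project:
        return False  # shared dataset: only buildable from its source project
    if not ds.managed:
        return False  # external dataset: skipped by recursive builds
    if ds.rebuild_behavior == "WRITE_PROTECT":
        return False  # never rebuilt
    if ds.rebuild_behavior == "EXPLICIT":
        return explicitly_selected  # only when we ask for it specifically
    return True


datasets = [
    DatasetSettings("orders_joined", "SALES"),
    DatasetSettings("customers_shared", "CRM"),
    DatasetSettings("crm_export", "SALES", managed=False),
    DatasetSettings("reference_rates", "SALES", rebuild_behavior="EXPLICIT"),
    DatasetSettings("manual_labels", "SALES", rebuild_behavior="WRITE_PROTECT"),
]

for ds in datasets:
    print(ds.name, will_rebuild(ds, current_project="SALES"))
```

Only `orders_joined` would be picked up by a recursive build launched from the hypothetical `SALES` project. Passing `explicitly_selected=True` changes the answer for `reference_rates` but not for the write-protected `manual_labels`.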
Automating the build
Note that you can schedule or automate rebuilds, including with complex triggers, thanks to Dataiku’s scenario system. Please refer to the Automation course for more details.
Congrats on discovering the different ways to rebuild a Flow and how dependencies impact the build!
To learn more about Flow Views & Actions, including through hands-on exercises, please register for the free Academy course on this subject found in the Advanced Designer learning path.
The reference documentation also contains more information about dataset rebuilding strategies.