Dataset Building Strategies¶
In this lesson, you’ll learn about important concepts for rebuilding datasets, including:
recursive vs. non-recursive builds,
types of recursive builds, and
how or when datasets can be rebuilt.
This content is also included in a free Dataiku Academy course on Flow Views & Actions, which is part of the Advanced Designer learning path. Register for the course there if you’d like to track and validate your progress alongside concept videos, summaries, hands-on tutorials, and quizzes.
Datasets in a Flow can depend on other datasets that lie upstream, or they can depend on the outputs of upstream recipes. As a result, these dependent (or downstream) datasets can become outdated if changes are made upstream.
Dataiku DSS is aware of the dependencies for every dataset in our Flow and can efficiently rebuild outdated downstream datasets. When there are changes upstream, we have different strategies to choose from, including performing a non-recursive or a recursive build.
The simplest way to build or rebuild a dataset is by simply building the dataset or running its parent recipe in a non-recursive manner. A non-recursive build is the default build option. It runs only the specific recipe that outputs the dataset.
The Recursive build option builds the dataset along with upstream datasets. The recursive build aims to rebuild dependencies, while avoiding a rebuild of upstream datasets that are already up-to-date.
With a recursive build, we have three options:
Forced recursive rebuild, and
‘Missing’ data only.
With Smart reconstruction, Dataiku DSS utilizes its awareness of upstream changes. When selected, Dataiku DSS first rebuilds any required upstream datasets that have become outdated due to changes such as edits to their parent recipes. Only after building the upstream datasets is the dataset selected for re-building built by Dataiku DSS.
In this example, we’ll build merchants_by_state using Smart reconstruction.
When we select to preview the job, we can see the computed dependencies. For example, building the dataset, cardholder_info_copy, is required because it has not been built, and it lies upstream relative to the merchant_by_state dataset. The upstream datasets that are already up-to-date, are not rebuilt.
As a result of this recursive build of the merchant_by_state dataset using smart reconstruction, the empty dataset and all the intermediate datasets that depend on it are built.
Forced Recursive Build¶
The second option, Forced recursive rebuild, rebuilds all of the dependencies of the selected dataset, going back to the beginning of the Flow. In this example, we’ll build merchants_by_state using Forced recursive build.
When we select to preview the job, we can see the computed dependencies. The upstream datasets, even those that are already up-to-date, are rebuilt.
As a result of this recursive build of the merchant_by_state dataset using forced recursive build, the dataset and all required upstream datasets are rebuilt, even if they are already up-to-date.
Missing Data Only¶
The third option ‘Missing data only’ rebuilds only the required upstream datasets that are completely missing. This method is very specific and generally not recommended.
Controlling How and When a Dataset Can Be Built¶
We can change a dataset’s settings to control how it can be built, or if it can be re-built.
Explicit Rebuild Behavior¶
By specifying the rebuild behavior of a dataset as “Explicit”, the dataset does not get rebuilt, unless we specifically choose to rebuild it.
We can also write protect a dataset so that it never gets rebuilt. In this example, the dataset is “read-only”, and the only way to edit it is by writing from a code notebook.
Building From the Flow¶
We can build all the downstream datasets relative to a particular recipe or dataset, right from the Flow. To do this, we use the Build Flow outputs reachable from here option. Using this option, the selected dataset and downstream datasets along its branches get built according to the specified way for handling dependencies.
Congrats on discovering the different ways to rebuild a Flow and how dependencies impact the build!
The product documentation also contains more information about dataset rebuilding strategies.