Tutorial | Build modes#

Let’s practice using various build modes in Dataiku to recompute datasets and manage schema changes.

Get started#

Objectives#

In this tutorial, you will:

  • Build datasets using upstream and downstream builds.

  • Successfully troubleshoot a failed job.

  • Build datasets by Flow zones.

  • Manage schema changes such as adding columns.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Dataiku 12.0 or later.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Build Modes.

  2. From the project homepage, click Go to Flow.

Note

You can also download the starter project from this website and import it as a zip file.

Use case summary#

The project has three data sources:

  • tx: Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. It also indicates whether the transaction has either been authorized (a score of 1 in the authorized_flag column) or flagged for potential fraud (a score of 0).

  • merchants: Each row is a unique merchant with information such as the merchant’s location and category.

  • cards: Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

Build datasets#

Build only this#

Let’s try to build the tx_windows dataset.

  1. In the Data Preparation Flow zone, right-click on the tx_windows dataset and select Build.

  2. Keep the default Build only this selection.

  3. Click Build Dataset to try it.

You should see an error message when you try this! You can’t build an item whose upstream datasets are not built. Let’s troubleshoot this issue.
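Incidentally, everything in this tutorial can also be scripted. As a rough sketch, here is how you might trigger the same Build only this action through Dataiku’s public Python API (dataikuapi); the instance URL, API key, and project key are placeholder assumptions, and this call fails for the same reason the UI build did.

```python
import dataikuapi

# Placeholder connection details -- substitute your own instance and key
client = dataikuapi.DSSClient("https://dss.example.com", "YOUR_API_KEY")
project = client.get_project("BUILD_MODES")  # hypothetical project key

# "Build only this" maps to a non-recursive build job
job = project.new_job("NON_RECURSIVE_FORCED_BUILD")
job.with_output("tx_windows")
job.start_and_wait()  # fails like the UI build: the upstream data doesn't exist
```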

Review the job log#

After any job fails, you can look at the job log to figure out what went wrong.

  1. From the top navigation bar, select the Jobs menu (or use g + j).

  2. Click on the Build tx_windows (NP) failed job.

  3. Check the failed compute_tx_windows activity.

A screenshot of the failed "compute_tx_windows" job in the job log.

You’ll see that the root path of an input dataset does not exist. In other words, the upstream data was never built, so the recipe had nothing to read.
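The same diagnosis works from the Python API, reusing the client and project objects from the sketch above. Assuming the failed build is the most recent job on the project:

```python
# Jobs are listed most recent first; each entry's "def" holds the job id
last_job = project.list_jobs()[0]
job = project.get_job(last_job["def"]["id"])

status = job.get_status()  # overall state plus per-activity details
log = job.get_log()        # same text as the Jobs page
print(log[-2000:])         # tail of the log, where the error usually sits
```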

Build upstream#

Let’s fix that by instructing Dataiku to build our target dataset, as well as any out-of-date upstream dependencies!

  1. Return to the Flow.

  2. Right-click on the tx_windows dataset and select Build.

  3. Choose to Build upstream and keep the default selections.

  4. Select Preview to see what the build job will run.

  5. Click Run in the top right corner.

    A Dataiku screenshot of the Preview of the build upstream job.
  6. When the job completes, return to the Flow. You’ll see that while everything upstream of tx_windows was built, the other terminal branches of the Flow were not.
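In API terms, Build upstream is a recursive build. A minimal sketch, again reusing the project object from the earlier snippet:

```python
# Recursive build: Dataiku walks the Flow backwards from the target
# and first rebuilds any out-of-date upstream dependencies
job = project.new_job("RECURSIVE_BUILD")
job.with_output("tx_windows")
job.start_and_wait()  # succeeds now that upstream inputs get built first
```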

Build a Flow zone#

After building upstream, the other final outputs in the Data Preparation Flow zone remain empty. Let’s build these datasets by instructing Dataiku to build an entire Flow zone.

  1. Click on the Build button at the top right of the Data Preparation Flow zone. This opens the same dialog window as when you click Flow Actions > Build all from the Flow. However, it automatically enables the Stop at zone boundary option to restrict the build to the selected Flow zone.

  2. For the purpose of demonstration, select Force-rebuild all dependencies next to Handling of dependencies.

  3. Click Preview.

    Notice that every recipe in the Data Preparation zone is set to run, even though not all of its datasets are out of date. Also, note that none of the recipes in the Data Ingestion zone will run, even though we are force-rebuilding all dependencies: the Stop at zone boundary option keeps the build inside the selected zone.

    A Dataiku screenshot of the Preview of the build Flow zones job.
  4. Run the job.
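For reference, the Force-rebuild all dependencies option corresponds to the RECURSIVE_FORCED_BUILD job type. The Stop at zone boundary toggle belongs to the Flow zone dialog rather than the job API, so a rough scripted equivalent is to target the zone’s final datasets explicitly; the list below is an assumption about which datasets terminate the Data Preparation zone.

```python
# Force-rebuild: every dependency recomputes, out of date or not.
# Listing the zone's terminal datasets approximates "Stop at zone boundary".
job = project.new_job("RECURSIVE_FORCED_BUILD")
for name in ["tx_windows", "tx_pivot", "tx_distinct", "tx_topn"]:
    job.with_output(name)
job.start_and_wait()
```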

Manage schema changes#

It’s important to consider how schema changes will impact datasets downstream in your Flow. We’ll see what happens when we add a column to a dataset early in the Flow.

Add a column#

  1. From the Flow, open the Prepare recipe next to the cards dataset.

  2. Click the dropdown arrow next to the first_active_month column and select Parse date.

  3. Keep the default and click Use Date Format.

  4. Take a moment to review the new first_active_month_parsed column.

  5. Click Run. In the dialog that opens, make sure the Run downstream option is selected, then click Run once more.

A Dataiku screenshot of the "Run recipe 'compute_cards_prepared'" window.

You just saw that when running a recipe, the default action is to run downstream and update output schemas. This helps to keep schemas updated in the Flow as you’re designing your recipes.
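This schema propagation can also be triggered programmatically. As a sketch (reusing the project object from the earlier snippets), you can ask a recipe which schema updates it implies for its outputs and apply them, mirroring the update behavior of the Run dialog:

```python
# Compute the schema changes this recipe would make to its outputs,
# then apply them -- the scripted version of "update output schemas"
recipe = project.get_recipe("compute_cards_prepared")
updates = recipe.compute_schema_updates()
if updates.any_action_required():
    updates.apply()
```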

Let’s verify this and review the schemas of the downstream datasets.

  1. Select tx_joined from the Flow.

  2. Switch to the Schema tab in the right panel.

  3. Review the schema and notice that the new first_active_month_parsed column is present.

  4. Select downstream datasets one by one and review whether they include the new column.

A Dataiku screenshot of the schema of the tx_joined dataset.
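To check all the downstream schemas at once, a short loop over the datasets in this Flow does the trick; this sketch again assumes the project object from the earlier snippets:

```python
# Report which downstream datasets picked up the parsed column
for name in ["tx_joined", "tx_prepared", "tx_pivot",
             "tx_distinct", "tx_windows", "tx_topn"]:
    schema = project.get_dataset(name).get_schema()
    columns = [c["name"] for c in schema["columns"]]
    print(name, "first_active_month_parsed" in columns)
```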

How do certain recipes handle schema changes?#

You’re probably wondering why some downstream datasets include the column you added upstream, whereas others do not. Here’s what happened for each recipe:

  • Join: The tx_joined dataset includes the new column. In a Join recipe, schema propagation happens if you choose Select all non-conflicting columns or Select all columns. However, a new column will not be added if you choose Manually select columns for the changed dataset.

  • Prepare: The tx_prepared dataset includes the new column. You don’t have to explicitly select every column for it to be included in the output of a Prepare recipe; all new columns are present by default.

  • Pivot: The tx_pivot dataset does not include the new column. Because the Pivot recipe generates an entirely new schema from the values of the pivot column, the new column is not added automatically. It is available to add in the Other columns step, but Dataiku does not forcibly add it.

  • Distinct: The tx_distinct dataset does not include the new column because we explicitly selected columns in the Distinct recipe. The new column is available for selection inside the recipe, but Dataiku does not assume that it should be added.

  • Window: The tx_windows dataset includes the new column because the Window recipe’s aggregation is set to Always retrieve all. As in other recipes, Dataiku will not retrieve additional columns if you have manually selected columns.

  • Top N: The tx_topn dataset includes the new column because in the Retrieve columns step of the Top N recipe, we kept the default All columns mode. If we had manually selected columns, no new column would be automatically selected.

Note

Though we didn’t cover every Dataiku recipe here, you can use what you’ve learned to test how other recipes handle schema changes.
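One way to run that test at scale is to loop over every recipe in the project and ask whether it has pending schema updates; recipes that propagate automatically will report changes, while those with manual column selection will not. A rough sketch, reusing the project object from earlier (not every recipe type supports this computation, hence the try/except):

```python
# Survey how each recipe reacts to an upstream schema change
for item in project.list_recipes():
    recipe = project.get_recipe(item["name"])
    try:
        updates = recipe.compute_schema_updates()
        print(item["name"], "update required:", updates.any_action_required())
    except Exception as e:
        print(item["name"], "schema check not supported:", e)
```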

What’s next?#

If you’re interested in automating builds, check out the Data Quality & Automation course in the Academy!