Tutorial | Automation scenarios#

Scenarios are the main tool for automating actions in Dataiku, such as rebuilding datasets or retraining models. Let’s see how they work!

Get started#

Objectives#

In this tutorial, you will:

  • Create a scenario to automate actions in Dataiku.

  • Add triggers to control the timing of a scenario’s execution.

  • Use metrics, checks, and data quality rules to control the logic of a scenario’s actions.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Access to an instance of Dataiku 12+.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

You may also want to review this tutorial’s associated concept article.

For those interested, this tutorial also includes an optional exercise on SQL triggers. To complete it, you’ll need a supported SQL connection.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Automation Scenarios.

  2. From the project homepage, click Go to Flow.

Note

You can also download the starter project from this website and import it as a zip file.

Create and manually run a scenario#

Consider the final dataset at the end of a pipeline. This dataset may be the key input to a dashboard, webapp, or Dataiku application. It may be routinely shared with other Dataiku projects or exported to other software tools.

It’s a common need to automate the rebuilding of a dataset like this as new data becomes available at the start of a pipeline. To automate this task, let’s create a scenario.

Note

For this example, we are rebuilding a dataset, but the logic works the same for other Dataiku objects such as retraining a model or refreshing a model evaluation store.

  1. From the Jobs menu in the top navigation bar, click Scenarios.

  2. Click + New Scenario.

  3. Name it Data Refresh.

  4. Click Create.

Dataiku screenshot of the dialog for creating a new scenario.

Tip

Here we are creating a step-based scenario. Although this type of scenario can include custom Python and SQL steps, it’s also possible to create custom scenarios entirely in Python.
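
For instance, here is a minimal sketch of a fully Python-based scenario script using the dataiku.scenario API. The build target mirrors this tutorial’s tx_windows dataset; treat it as an illustration rather than a required step.

from dataiku.scenario import Scenario

# In a custom Python scenario, the Scenario object exposes
# the same actions as the built-in visual steps.
s = Scenario()

# Equivalent of a Build / Train step on the tx_windows dataset
s.build_dataset("tx_windows")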

Add steps to a scenario#

Let’s assume tx_windows is the downstream dataset that we want to rebuild. We just need to add that instruction to the scenario.

  1. Navigate to the Steps tab of the Data Refresh scenario.

  2. Click Add Step at the bottom left.

  3. Select Build / Train.

  4. Click Add Dataset to Build.

  5. Choose tx_windows as the dataset to build, and click Add.

  6. Click Save (or Cmd/Ctrl + S).

Dataiku screenshot of the steps tab of a scenario.

Manually run a scenario#

Before testing the scenario for the first time, let’s make sure we have an empty Flow. If it’s not already empty, clear all data computed by Dataiku so we can observe the first scenario run starting from scratch.

  1. Go to the Datasets page (g + d).

  2. Click the box at the top left to select all datasets.

  3. Deselect the tx, cards, and merchants datasets.

  4. Click Clear data in the Actions sidebar.

  5. Click Confirm.

Dataiku screenshot of the dialog for clearing data.

With an empty Flow, let’s manually run the scenario.

  1. Return to the Data Refresh scenario.

  2. Click the Run button near the top right.

  3. Navigate to the Last runs tab of the scenario to see what happens.

Dataiku screenshot of the last runs tab of a scenario.

Running this scenario triggered the exact same set of actions as a default upstream build of the tx_windows dataset from the Flow. Why then bother with the scenario?

There are at least two important reasons:

  • We can automate exactly when this set of actions should run: for example, at a specified time, when a dataset changes, or even when another scenario finishes.

  • We can execute a scenario in more flexible ways: for example, through a button on a dashboard, a Dataiku application, or the Python API. This enables other users on the platform to execute actions created by colleagues.
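
For example, here’s a minimal sketch of running this scenario through the public Python API. The host, API key, project key, and scenario id below are all placeholder assumptions for illustration.

import dataikuapi

# Connect to the Design node (host and API key are placeholders)
client = dataikuapi.DSSClient("https://dss.example.com:11200", "my-api-key")
project = client.get_project("TUT_SCENARIOS")  # assumed project key

# Run the Data Refresh scenario and wait for it to finish
scenario = project.get_scenario("DATAREFRESH")  # assumed scenario id
run = scenario.run_and_wait()
print(run.outcome)  # e.g. SUCCESS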

Trigger a scenario#

Let’s demonstrate the value of automating when actions occur by adding various kinds of triggers to the scenario.

Add a time-based trigger#

The simplest trigger is a time-based one. We’ll use a very short interval for testing purposes.

  1. Navigate to the Settings tab of the Data Refresh scenario.

  2. Click Add Trigger.

  3. Choose Time-based trigger.

  4. Under Triggers, change hours to minutes.

  5. In the Run tile, toggle Auto-triggers for the entire scenario to On.

  6. Click Save.

Dataiku screenshot of the settings tab of a scenario.

Depending on your instance settings, you may soon see a pop-up notification that the scenario has started and finished.

  1. Once one run has finished, toggle the time-based trigger (not the auto-triggers for the entire scenario) to Off to stop it from repeatedly running, and click Save.

  2. Navigate to the Last runs tab of the scenario.

  3. With the most recent scenario run selected, click to view the job it triggered.

Dataiku screenshot of the Last runs tab of a scenario.

Important

You’ll see that there was nothing to do for this job. The scenario was triggered based on the small amount of time that had passed, but there was no actual work for the scenario to do. If you return to the Steps tab, you’ll see that the build mode is Build dependencies then these items. With no upstream changes, no action was required.

Add a dataset change trigger#

Caution

Only business and enterprise license users have access to triggers on dataset changes. Please skip ahead to Use data quality rules or checks in a scenario if these triggers are not available to you.

For other use cases, we may want to initiate actions based on dataset modifications. Let’s demonstrate how this works first for filesystem datasets and then (optionally) for SQL datasets.

  1. Return to the Data Refresh scenario, and navigate to the Settings tab.

  2. Click Add Trigger.

  3. Choose Trigger on dataset change.

  4. For quicker feedback during this test, change it to run every 9 seconds with a grace delay of 12 seconds.

  5. Click Add Dataset. Choose tx, and click Add.

  6. Click Save.

Dataiku screenshot of the settings tab of a scenario.

Now let’s simulate a change to the upstream dataset on which we’ve added the trigger.

  1. Open the tx dataset, and navigate to the Settings tab.

  2. Click List Files.

  3. Check the box next to /tx_2018.csv to include both files in the tx dataset.

  4. Click Save.

Dataiku screenshot of the settings tab of a dataset.

This should set off a new run shortly!

  1. Navigate back to the Last runs tab of the Data Refresh scenario.

  2. Find the run log triggered by the dataset modification.

  3. Once one run has finished, return to the Settings tab.

  4. Toggle the dataset modification trigger to Off, and click Save.

Dataiku screenshot of the Last runs tab of a scenario.

Optional: Trigger on SQL query change

The Trigger on dataset change option only reads the dataset’s settings (not the actual data). As a result, it cannot detect changes in SQL tables managed outside of Dataiku. For those cases, use a SQL query trigger instead: if the output of the query changes, the trigger sets off a scenario run.

For example, let’s trigger the scenario if the latest purchase date in the transactions dataset changes.

  1. From the Flow, create and run a Sync recipe to move the tx dataset to a SQL connection.

  2. Navigate to the Settings tab of the Data Refresh scenario.

  3. Click Add Trigger.

  4. Choose Trigger on SQL query change.

  5. For quicker feedback during this test, change it to run every 9 seconds with a grace delay of 12 seconds.

  6. Select the connection holding the synced transactions data (tx_copy).

  7. For the SQL script, copy-paste the following block, adjusting the project key and table name in the FROM clause as needed.

    SELECT MAX("purchase_date")
    FROM "TUT_SCENARIOS_tx_copy"
    

    Warning

    It is advisable to test this query first in a SQL notebook. Dataiku Cloud users, for example, will need to include instance information in the FROM clause.

  8. Click Save.

Dataiku screenshot of the settings tab of a scenario.

As before, let’s simulate a dataset change to set off the trigger.

  1. To force a new maximum purchase date value, return to the Settings tab of the tx dataset.

  2. Click Show Advanced options, change which files to include, and then Save.

  3. Build the tx_copy dataset to simulate a change in the dataset tied to the SQL trigger.

  4. When the Sync recipe finishes running, switch back to the Last runs tab of the Data Refresh scenario to see the run log triggered by the SQL query change.

Optional: Custom Python trigger

Although we could achieve this functionality with a standard time-based trigger, have a look at a custom Python trigger that runs the scenario on the first Friday of each month.

  1. Navigate to the Settings tab of the Data Refresh scenario.

  2. Click Add Trigger.

  3. Choose Custom trigger.

  4. Name it Every first Friday.

  5. Copy-paste the following code block, and then Save.

from dataiku.scenario import Trigger
from datetime import date

t = Trigger()

today = date.today()
dayofweek = today.isoweekday()  # Monday = 1 ... Sunday = 7
day = today.day

# The first Friday of a month is the only Friday whose
# day of the month is 7 or less.
if dayofweek == 5 and day <= 7:
    t.fire()

See also

See the API documentation for more on Python-based scenarios.

Turn off auto-triggers#

Now that we have finished testing triggers for this scenario, let’s remember to switch them off to avoid unwanted runs. There are two levels of control to be aware of:

  • The On/Off switch for an individual trigger.

  • The On/Off switch for all triggers for the entire scenario.

Let’s turn off all triggers for the scenario.

  1. Open the main Scenarios page (or the Settings tab of the Data Refresh scenario).

  2. Toggle auto-triggers to Off.

Dataiku screenshot of the scenarios page showing auto-triggers off.
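
You can also flip this switch programmatically. Here’s a hedged sketch using the public Python API, assuming the active flag on the scenario’s settings corresponds to the scenario-level auto-trigger switch; the host, API key, project key, and scenario id are placeholders.

import dataikuapi

# Host, API key, project key, and scenario id are placeholders
client = dataikuapi.DSSClient("https://dss.example.com:11200", "my-api-key")
scenario = client.get_project("TUT_SCENARIOS").get_scenario("DATAREFRESH")

settings = scenario.get_settings()
settings.active = False  # turn off auto-triggers for the entire scenario
settings.save()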

Use data quality rules or checks in a scenario#

Triggers give us control over when scenarios run. Metrics, checks, and data quality rules give us more fine-grained control over how scenarios run once triggered.

For example, a scenario may have a weekly trigger. However, if only a small number of new records are available, we may want steps within the scenario to take different actions.
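
As a sketch of this kind of branching, a custom Python step could read the last computed record count and decide what to do. The metric id and the 100,000-record threshold below are illustrative assumptions.

import dataiku
from dataiku.scenario import Scenario

s = Scenario()

# Fetch the last computed metric values on tx_prepared
metrics = dataiku.Dataset("tx_prepared").get_last_metric_values()

# "records:COUNT_RECORDS" is the id of the built-in record count metric
metric = metrics.get_metric_by_id("records:COUNT_RECORDS")
record_count = int(metric["lastValues"][0]["value"])

# The threshold is an arbitrary example value
if record_count >= 100000:
    s.build_dataset("tx_windows")
else:
    print("Only %d records available; skipping the rebuild." % record_count)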

Important

Data quality rules were introduced in Dataiku 12.6. If using a pre-12.6 version of Dataiku, you’ll be able to achieve the same functionality shown here with metrics and checks.

View the project’s data quality or existing checks#

This project already has data quality rules (or checks) in place on the tx_prepared dataset.

  1. From the Flow, open the tx_prepared dataset.

  2. Navigate to either the Data Quality tab or, for pre-12.6 users, the Status > Checks tab.

  3. On this page, review the rules or checks already in place. Moving forward, we’ll specifically pay attention to whether the record count of tx_prepared is within an expected range.

Verify data quality rules or run checks in a scenario#

Let’s have the scenario execute only if all rules (or checks) on the tx_prepared dataset pass.

  1. Return to the Steps tab of the Data Refresh scenario.

  2. Click Add Step.

  3. Select Verify rules or run checks.

  4. Drag this step to the first position before the build step.

  5. Click Add Dataset to Verify.

  6. Choose tx_prepared, and click Add.

  7. Open the Build step, and confirm that the “Run this step” setting is set to If no prior step failed.

Dataiku screenshot of the verify rules step in a scenario.

Now, before executing the build step, the scenario first verifies the data quality rules (or runs the checks) on the tx_prepared dataset. Only if no prior step fails will the scenario move on to the build step.

Without new data in the pipeline, we know one rule (or check) will fail. Let’s see what happens when we trigger the scenario.

  1. Click Run to manually trigger the scenario.

  2. Navigate to the Last runs tab to see its progress.

  3. For the most recent scenario run, click on the failed step to find the source of failure. Note how the scenario did not move on to the build step.

Dataiku screenshot of the last runs tab of a failed scenario.

Tip

Experiment with adjusting the values in the record count rule (or check) to generate failure, warning, and success states. Then explore the Run this step options to produce different outcomes.

What’s next?#

Congratulations on creating a scenario to automate a routine action in Dataiku! Once you have mastered these basic building blocks, you can explore other built-in scenario steps, such as those that create exports or execute code.

A common next step would be to add a reporter to the scenario to communicate activities through various messaging channels.

Try that in Tutorial | Scenario reporters!

See also

You can learn more about scenarios in the reference documentation.