Tutorial | Automation scenarios¶
Scenarios are the main tool for automating actions in Dataiku, such as rebuilding datasets or retraining models. Let’s see how they work!
Objectives¶
In this tutorial, you will:
Create a scenario to automate actions in Dataiku.
Add triggers to control the timing of a scenario’s execution.
Use metrics and checks to control the logic of a scenario’s actions.
Prerequisites¶
To reproduce the steps in this tutorial, you’ll need:
Access to an instance of Dataiku 12+.
Basic knowledge of Dataiku (Core Designer level or equivalent).
You may also want to review this tutorial’s associated concept article.
For those interested, this tutorial also includes optional exercises for SQL triggers, metrics, and checks. To complete these, you’ll also need a supported SQL connection.
Create the project¶
From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Automation scenarios.
From the project homepage, click Go to Flow.
Note
You can also download the starter project from this website and import it as a zip file.
Create and manually run a scenario¶
Consider the final dataset at the end of a pipeline. This dataset may be the key input to a dashboard, webapp, or Dataiku application. It may be routinely shared with other Dataiku projects or exported to other software tools.
It’s a common need to automate the rebuilding of a dataset like this as new data becomes available at the start of a pipeline. To automate this task, let’s create a scenario.
Note
For this example, we are rebuilding a dataset, but the logic works the same for other Dataiku objects such as retraining a model or refreshing a model evaluation store.
From the Jobs menu in the top navigation bar, click Scenarios.
Click + New Scenario.
Name it Data Refresh.
Click Create.

Note
Here we are creating a step-based scenario. Although this type of scenario can include custom Python and SQL steps, it’s also possible to create custom scenarios entirely in Python.
Add steps to a scenario¶
Let’s assume tx_windows is the downstream dataset that we want to rebuild. We just need to add that instruction to the scenario.
Navigate to the Steps tab of the Data Refresh scenario.
Click Add Step at the bottom left.
Select Build / Train.
Click Add Dataset to Build.
Choose tx_windows as the dataset to build, and click Add.
In the top right, click Save.

Manually run a scenario¶
Before this first test run, let's make sure we start from an empty Flow. If it's not already empty, clear all data computed by Dataiku so you can observe the scenario rebuilding everything.
Go to the Datasets page (G+D).
Click the box at the top left to select all datasets.
Deselect the tx, cards, and merchants datasets.
Click Clear data in the right panel.
Click Confirm.

With an empty Flow, let’s manually run the scenario.
Return to the Data Refresh scenario.
Click the Run button near the top right.
Navigate to the Last runs tab of the scenario to see what happens.

Running this scenario triggered the exact same set of actions as a default upstream build of the tx_windows dataset from the Flow. Why then bother with the scenario?
At least two important reasons:
We can automate exactly when this set of actions should run: for example, at a specified time, when a dataset changes, or even when another scenario finishes.
We can execute a scenario in more flexible ways: for example, through a button on a dashboard, a Dataiku application, or the Python API. This enables other users on the platform to execute actions created by colleagues.
Trigger a scenario¶
Let’s demonstrate the value of automating when actions occur by adding a trigger to the scenario. We’ll use a very short interval for testing.
Navigate to the Settings tab of the scenario.
Click Add Trigger.
Choose Time-based trigger.
Under Triggers, change hours to minutes.
In the Run tile, toggle Auto-triggers for the entire scenario to On.
Click Save.

Depending on your instance settings, you may see a pop-up notification that the scenario has started and finished.
Once one run has finished, toggle the time-based trigger (not the auto-triggers for the entire scenario) to Off to stop it from repeatedly running, and click Save.
Navigate to the Last runs tab of the scenario.
With the most recent scenario run selected, click to view the job it triggered.

Note
You’ll see that there was nothing to do for this job. The scenario was triggered based on the small amount of time that had passed, but there was no actual work for the scenario to do. If you return to the Steps tab, you’ll see that the build mode is Build dependencies then these items. With no upstream changes, no action was required.
Trigger a scenario on a dataset change¶
Warning
Only business and enterprise license users have access to this feature. Please skip ahead to Compute metrics if these triggers are not available to you.
For other use cases, we may want to initiate actions based on dataset modifications. Let’s demonstrate how this works first for filesystem datasets and then (optionally) for SQL datasets.
Return to the Data Refresh scenario, and navigate to the Settings tab.
Click Add Trigger.
Choose Trigger on dataset change.
For quicker feedback during this test, change it to run every 9 seconds with a grace delay of 12 seconds.
Click Add Dataset. Choose tx, and click Add.
Click Save.

Now let’s simulate a change to the upstream dataset on which we’ve added the trigger.
Open the tx dataset, and switch to the Settings tab.
Click List Files.
Click Add next to /tx_2018.csv.
Click Save.

This should set off a new run shortly!
Navigate back to the Last runs tab of the Data Refresh scenario.
Find the run log triggered by the dataset modification.
Once one run has finished, return to the Settings tab.
Toggle the dataset modification trigger to Off, and click Save.

Optional: Trigger on SQL query change
The Trigger on dataset change option only reads the dataset's settings (not the actual data), so it does not detect changes in SQL tables managed outside of Dataiku. For those cases, use a SQL query trigger instead. If the output of the query changes, the trigger sets off the scenario run.
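To see why a query-based probe catches such changes, here is a minimal standalone sketch of the idea, using Python's built-in sqlite3 in place of a real SQL connection (the table name and values are hypothetical, not the tutorial's actual data):

```python
import sqlite3

def query_fingerprint(conn, sql):
    # Run the probe query and return a hashable fingerprint of its output.
    rows = conn.execute(sql).fetchall()
    return tuple(rows)

# In-memory table standing in for the synced transactions data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tx_copy (purchase_date TEXT)")
conn.execute("INSERT INTO tx_copy VALUES ('2017-12-30'), ('2017-12-31')")

probe = "SELECT MAX(purchase_date) FROM tx_copy"
before = query_fingerprint(conn, probe)

# New rows arrive with a later purchase date...
conn.execute("INSERT INTO tx_copy VALUES ('2018-01-15')")
after = query_fingerprint(conn, probe)

# ...so the fingerprint differs, and the trigger would fire.
print(before != after)  # True
```

Dataiku's trigger does this comparison for you on the interval you configure; the sketch only illustrates the change-detection principle.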
For example, let’s trigger the scenario if the latest purchase date in the transactions dataset changes.
From the Flow, create and run a Sync recipe to move the tx dataset to a SQL connection.
Navigate to the Settings tab of the Data Refresh scenario.
Click Add Trigger.
Choose Trigger on SQL query change.
For quicker feedback during this test, change it to run every 9 seconds with a grace delay of 12 seconds.
Select the connection holding the synced transactions data (tx_copy).
For the SQL script, copy-paste the following block, making any necessary changes to your project key and table name in the FROM clause.
SELECT MAX("purchase_date") FROM "TUT_SCENARIOS_tx_copy"
Warning
It is advisable to test this query first in a SQL notebook. Dataiku Cloud users, for example, will need to include instance information in the FROM clause.
Click Save.

As before, let’s simulate a dataset change to set off the trigger.
In order to force a new maximum purchase date value, return to the Settings tab of the tx dataset.
Click Show Advanced options, change which files to include, and then Save.
Build the tx_copy dataset to simulate a change in the dataset tied to the SQL trigger.
When the Sync recipe finishes running, switch back to the Last runs tab of the Data Refresh scenario to see the log run triggered by the SQL query change.
Optional: Custom Python trigger
Although we could achieve this functionality with the standard time-based trigger, let's look at a custom Python trigger that fires on the first Friday of each month.
Navigate to the Settings tab of the Data Refresh scenario.
Click Add Trigger.
Choose Custom trigger.
Name it Every first Friday.
Copy-paste the following code block, and then Save.
from dataiku.scenario import Trigger
from datetime import date
t = Trigger()
today = date.today()
dayofweek = today.isoweekday()
day = today.day
if dayofweek == 5 and day <= 7:
    t.fire()
Note
See the API documentation for more on Python-based scenarios.
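If you'd like to sanity-check the date logic outside Dataiku, the same test can be wrapped in a standalone function (the `Trigger` call is left out here, since it only exists inside a running scenario):

```python
from datetime import date

def is_first_friday(d: date) -> bool:
    # isoweekday(): Monday=1 ... Friday=5. The first Friday of a month
    # always falls on one of its first seven days.
    return d.isoweekday() == 5 and d.day <= 7

print(is_first_friday(date(2023, 6, 2)))  # True: first Friday of June 2023
print(is_first_friday(date(2023, 6, 9)))  # False: second Friday
print(is_first_friday(date(2023, 6, 5)))  # False: a Monday
```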
Tip
An additional option for every scenario is the ability to add a reporter to send alerts about scenario activities via email or other messaging channels. For an example, see this tutorial on creating a reporter for a Microsoft Teams channel.
Turn off auto-triggers¶
Now that we have finished testing triggers for this scenario, let’s remember to switch them off to avoid unwanted runs. There are two levels of controls to be aware of:
the On/Off switch for an individual trigger
the On/Off switch for all triggers for the entire scenario
Let’s turn off all triggers for the scenario.
Open the main Scenarios page (or the Settings tab of the Data Refresh scenario).
Toggle auto-triggers to Off.

Compute metrics¶
Triggers grant us control over when scenarios run. Metrics and checks allow us to take more fine-grained control of how scenarios run, once triggered. For example, a scenario may have a weekly trigger. However, if only a small number of new records are available, we may want steps within the scenario to take different actions.
Our upstream dataset of interest is tx. Let’s start by computing the default metrics already available for this dataset.
Open the tx dataset, and navigate to the Status tab.
On the Metrics subtab, click Compute to calculate the default metrics such as column count and record count.

Create new metrics¶
In addition to the default metrics already available, we can also create our own. Let’s create a metric on the minimum and maximum of the authorized_flag column since we expect these values to always be 0 and 1, respectively.
Within the Status tab of the tx dataset, navigate to the Edit subtab.
Toggle On the Column statistics section of metrics.
Add the Min and Max of authorized_flag.
Select Click to run this now and then Run to test out these new metrics.
Click to view the Last run results.
Click Save.
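Under the hood, these column statistics are simple aggregates over the column. As a rough illustration with toy data (not the actual tx contents):

```python
import pandas as pd

# Toy stand-in for the tx dataset: authorized_flag should only contain 0 and 1.
df = pd.DataFrame({"authorized_flag": [0, 1, 1, 0, 1]})

metrics = {
    "Min of authorized_flag": int(df["authorized_flag"].min()),
    "Max of authorized_flag": int(df["authorized_flag"].max()),
}
print(metrics)  # {'Min of authorized_flag': 0, 'Max of authorized_flag': 1}
```

If either value drifts away from 0 or 1, that is exactly the kind of anomaly the checks created below are designed to surface.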

Optional: Custom Python metric
In addition to the built-in metric options, you can also create custom metrics with Python. For example, we might want to track the most and least authorized item category.
Scroll to the bottom of the Edit Metrics page.
Click New Python Probe.
Toggle it On, and click the pencil icon to rename it Most/least authorized categories.
Copy-paste the following code block.
import numpy as np

# Define here a function that returns the metric.
def process(dataset, partition_id):
    # dataset is a dataiku.Dataset object
    df = dataset.get_dataframe()
    df.authorized_flag.fillna(0, inplace=True)
    df_per_category = (df
                       .groupby('item_category')
                       .agg({'authorized_flag': np.mean})
                       .sort_values('authorized_flag', ascending=False)
                       )
    most_authorized_category = df_per_category.index[0]
    least_authorized_category = df_per_category.index[-1]
    return {'Most authorized item category': most_authorized_category,
            'Least authorized item category': least_authorized_category}
Select Click to run this now and then Run to test out these new metrics.
Click to view the Last run results.
Click Save.
Display metrics¶
The last step is to display any new metrics you have created.
Navigate back to the Metrics subtab.
Click X/Y Metrics.
In the column of available metrics, click Add all.
Click Save.

Create checks on metrics¶
After creating metrics important to our use case, the next step is to use them as the basis for checks.
Check values in a range¶
Let’s assume we have strong expectations about the number of records in the dataset. We can create a check to see if a metric falls outside of this range, and generate a warning or a failure when it does.
Let’s purposely create a check that will fail.
Navigate to the Edit subtab of the tx dataset.
Click on the Checks panel.
Click Metric Value is in a Numeric Range.
Name the check 50k < Records < 200k.
Select Record count as the metric to check.
Set the minimum to 50000, the soft minimum to 100000, the soft maximum to 150000, and the maximum to 200000.
Click Check to test it, and confirm the error.

Check values in a set¶
We can also check if a metric is in a set of values. Since we know that the minimum value for authorized_flag must be 0, we can create a check to ensure that this is true. If it is not 0, the check will return an error.
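The logic of this check type is simple set membership, which can be sketched as follows (a hypothetical helper, not the built-in implementation):

```python
def check_in_set(value, allowed):
    # Mimics a set-of-values check: OK if the metric is in the allowed set.
    return "OK" if value in allowed else "ERROR"

# min(authorized_flag) should always be 0.
print(check_in_set(0, {0}))  # OK
print(check_in_set(1, {0}))  # ERROR
```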
At the bottom of the Edit checks page, click Metric Value is in a Set of Values.
Name it min(authorized_flag) == 0.
Select Min of authorized_flag as the metric.
Click Add Value, and give 0 as the value.
Click Check to test it, and confirm it is OK.
Save.

Optional: Custom Python check
We can also create custom checks in Python. For example, we could create a check that the difference between the minimum and maximum values of authorized_flag is 1.
Scroll to the bottom of the Edit checks page.
Click Custom (Python).
Name the check Expected target values.
Copy-paste the following code block.
Click Check to test it returns “OK”, and then Save.
# Define here a function that returns the outcome of the check.
def process(last_values, dataset, partition_id):
    # last_values is a dict of the last values of the metrics,
    # with the values as a dataiku.metrics.MetricDataPoint.
    # dataset is a dataiku.Dataset object
    min_flag = int(last_values['col_stats:MIN:authorized_flag'].get_value())
    max_flag = int(last_values['col_stats:MAX:authorized_flag'].get_value())
    if max_flag - min_flag == 1:
        return 'OK', 'Expected target values'
    else:
        return 'ERROR', 'Unexpected target values'
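Outside Dataiku, you can sanity-check this kind of function by stubbing the MetricDataPoint objects that Dataiku would normally supply (the stub class and metric values below are hypothetical):

```python
# Hypothetical stand-in for dataiku.metrics.MetricDataPoint.
class StubMetricDataPoint:
    def __init__(self, value):
        self._value = value
    def get_value(self):
        return self._value

def process(last_values, dataset, partition_id):
    # Same logic as the check above: min and max should differ by exactly 1.
    min_flag = int(last_values['col_stats:MIN:authorized_flag'].get_value())
    max_flag = int(last_values['col_stats:MAX:authorized_flag'].get_value())
    if max_flag - min_flag == 1:
        return 'OK', 'Expected target values'
    else:
        return 'ERROR', 'Unexpected target values'

last_values = {
    'col_stats:MIN:authorized_flag': StubMetricDataPoint('0'),
    'col_stats:MAX:authorized_flag': StubMetricDataPoint('1'),
}
print(process(last_values, dataset=None, partition_id=None))
# ('OK', 'Expected target values')
```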
Display checks¶
The last step is to display any new checks you have created.
Navigate to the Checks subtab.
Click X/Y Checks.
In the column of available checks, click Add all.
Click Save.

Use metrics and checks in a scenario¶
Metrics and checks can be informative on their own. For example, we can quickly view not only their last value, but also their history. Moreover, we can create a dedicated metrics (or checks) dataset where we can dive deeper into the numbers themselves.

However, metrics and checks can be even more powerful when teamed with scenarios. To demonstrate this, let’s first clear an upstream dataset.
From the Flow, select the tx_joined dataset (as an example of an upstream dataset).
In the Actions menu on the right, click Clear Data, and Confirm.
Because of the cleared upstream dataset, we know that the Data Refresh scenario, once triggered, will rebuild the tx_windows dataset. However, we may want the build step of the scenario to run only if all checks pass.
Return to the Steps tab of the Data Refresh scenario.
Click Add Step.
Select Run checks.
Drag this step to the first position before the build step.
Click Add Dataset to Check.
Choose tx, and click Add.
Open the Build step, and confirm that the “Run this step” setting is set to If no prior step failed.

Now before executing the build step, this scenario first executes checks on the tx dataset. Only if no prior step fails will the scenario move on to executing the build step. Since we know one check will fail, let’s see what happens.
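The gating behavior can be sketched as follows (a toy model of sequential step execution with the If no prior step failed setting, not the actual scenario engine):

```python
def run_scenario(steps):
    # Run (name, callable) steps in order; once any step fails, later
    # steps configured to run only "if no prior step failed" are skipped.
    results = []
    failed = False
    for name, step in steps:
        if failed:
            results.append((name, "SKIPPED"))
            continue
        ok = step()
        results.append((name, "OK" if ok else "FAILED"))
        failed = failed or not ok
    return results

# The record-count check fails, so the build step never runs.
print(run_scenario([
    ("Run checks on tx", lambda: False),
    ("Build tx_windows", lambda: True),
]))
# [('Run checks on tx', 'FAILED'), ('Build tx_windows', 'SKIPPED')]
```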
Click Run to manually trigger the scenario.
Navigate to the Last runs tab to see its progress.
Click on the failed check to find the most recent scenario run’s source of failure. Note how the scenario did not move on to the build step.

Tip
Experiment with adjusting the values in the record count check to generate failure, warning, and success states. Then explore the Run this step options to produce different outcomes.
What’s next?¶
Congratulations on creating a scenario to automate a routine action in Dataiku! Once you have mastered these basic building blocks, you can explore other built-in scenario steps, such as those that create exports or execute code.
Note
You can learn more about scenarios in the reference documentation.