Tutorial | Automation scenarios#

Scenarios are the main tool for automating actions in Dataiku, such as rebuilding datasets or retraining models. Let’s see how they work!

Objectives#

In this tutorial, you will:

  • Create a scenario to automate actions in Dataiku.

  • Add triggers to control the timing of a scenario’s execution.

  • Use metrics and checks to control the logic of a scenario’s actions.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Access to an instance of Dataiku 12+.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

You may also want to review this tutorial’s associated concept article.

For those interested, this tutorial also includes optional exercises for SQL triggers, metrics, and checks. To complete these, you’ll also need a supported SQL connection.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Automation scenarios.

  2. From the project homepage, click Go to Flow.

Note

You can also download the starter project from this website and import it as a zip file.

Create and manually run a scenario#

Consider the final dataset at the end of a pipeline. This dataset may be the key input to a dashboard, webapp, or Dataiku application. It may be routinely shared with other Dataiku projects or exported to other software tools.

It’s a common need to automate the rebuilding of a dataset like this as new data becomes available at the start of a pipeline. To automate this task, let’s create a scenario.

Note

For this example, we are rebuilding a dataset, but the logic works the same for actions on other Dataiku objects, such as retraining a model or refreshing a model evaluation store.

  1. From the Jobs menu in the top navigation bar, click Scenarios.

  2. Click + New Scenario.

  3. Name it Data Refresh.

  4. Click Create.

Dataiku screenshot of the dialog for creating a new scenario.

Note

Here we are creating a step-based scenario. Although this type of scenario can include custom Python and SQL steps, it’s also possible to create custom scenarios entirely in Python.
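
For reference, here is a minimal sketch of what a fully Python-based scenario script might look like, using the dataiku.scenario package. It is not part of this tutorial's steps; it simply builds the same dataset we target in the next section.

from dataiku.scenario import Scenario

# Inside a custom Python scenario, the Scenario object exposes the same
# kinds of actions as step-based scenarios, such as building a dataset.
scenario = Scenario()
scenario.build_dataset("tx_windows")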

Add steps to a scenario#

Let’s assume tx_windows is the downstream dataset that we want to rebuild. We just need to add that instruction to the scenario.

  1. Navigate to the Steps tab of the Data Refresh scenario.

  2. Click Add Step at the bottom left.

  3. Select Build / Train.

  4. Click Add Dataset to Build.

  5. Choose tx_windows as the dataset to build, and click Add.

  6. In the top right, click Save.

Dataiku screenshot of the steps tab of a scenario.

Manually run a scenario#

Before this first test run, let's start from an empty Flow. If it isn't already empty, clear all of the data computed by Dataiku so that you can observe what the scenario does when it starts from scratch.

  1. Go to the Datasets page (g + d).

  2. Click the box at the top left to select all datasets.

  3. Deselect the tx, cards, and merchants datasets.

  4. Click Clear data in the right panel.

  5. Click Confirm.

Dataiku screenshot of the dialog for clearing data.

With an empty Flow, let’s manually run the scenario.

  1. Return to the Data Refresh scenario.

  2. Click the Run button near the top right.

  3. Navigate to the Last runs tab of the scenario to see what happens.

Dataiku screenshot of the last runs tab of a scenario.

Running this scenario triggered the exact same set of actions as a default upstream build of the tx_windows dataset from the Flow. Why then bother with the scenario?

There are at least two important reasons:

  • We can automate exactly when this set of actions should run: for example, at a specified time, when a dataset changes, or even when another scenario finishes.

  • We can execute a scenario in more flexible ways: for example, through a button on a dashboard, a Dataiku application, or the Python API (sketched below). This enables other users on the platform to execute actions created by their colleagues.
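
To illustrate the last point, here is a minimal sketch of launching a scenario through the Python API. It assumes the code runs inside Dataiku (for example, in a notebook); the project key and scenario ID shown are placeholders to adapt to your own instance.

import dataiku

# Placeholders: replace with your own project key and scenario ID
client = dataiku.api_client()
project = client.get_project("TUT_SCENARIOS")
scenario = project.get_scenario("DATAREFRESH")

# Launch the scenario and wait for the run to finish
scenario.run_and_wait()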

Trigger a scenario#

Let’s demonstrate the value of automating when actions occur by adding a trigger to the scenario. We’ll use a very short interval for testing.

  1. Navigate to the Settings tab of the scenario.

  2. Click Add Trigger.

  3. Choose Time-based trigger.

  4. Under Triggers, change hours to minutes.

  5. In the Run tile, toggle Auto-triggers for the entire scenario to On.

  6. Click Save.

Dataiku screenshot of the settings tab of a scenario.

Depending on your instance settings, you may see a pop-up notification that the scenario has started and finished.

  1. Once one run has finished, toggle the time-based trigger (not the auto-triggers for the entire scenario) to Off to stop it from repeatedly running, and click Save.

  2. Navigate to the Last runs tab of the scenario.

  3. With the most recent scenario run selected, click to view the job it triggered.

Dataiku screenshot of the Last runs tab of a scenario.

Note

You’ll see that there was nothing to do for this job. The trigger fired because the short time interval had elapsed, but there was no actual work for the scenario to do. If you return to the Steps tab, you’ll see that the build mode is Build dependencies then these items. With no upstream changes, no action was required.

Trigger a scenario on a dataset change#

Warning

Only business and enterprise license users have access to this feature. Please skip ahead to Compute metrics if these triggers are not available to you.

For other use cases, we may want to initiate actions based on dataset modifications. Let’s demonstrate how this works first for filesystem datasets and then (optionally) for SQL datasets.

  1. Return to the Data Refresh scenario, and navigate to the Settings tab.

  2. Click Add Trigger.

  3. Choose Trigger on dataset change.

  4. For quicker feedback during this test, change it to run every 9 seconds with a grace delay of 12 seconds.

  5. Click Add Dataset. Choose tx, and click Add.

  6. Click Save.

Dataiku screenshot of the settings tab of a scenario.

Now let’s simulate a change to the upstream dataset on which we’ve added the trigger.

  1. Open the tx dataset, and switch to the Settings tab.

  2. Click List Files.

  3. Click Add next to /tx_2018.csv.

  4. Click Save.

Dataiku screenshot of the settings tab of a dataset.

This should set off a new run shortly!

  1. Navigate back to the Last runs tab of the Data Refresh scenario.

  2. Find the run log triggered by the dataset modification.

  3. Once one run has finished, return to the Settings tab.

  4. Toggle the dataset modification trigger to Off, and click Save.

Dataiku screenshot of the Last runs tab of a scenario.
Optional: Trigger on SQL query change

The Trigger on dataset change option only reads the dataset’s settings (not the actual data), so it cannot detect changes in SQL tables managed outside of Dataiku. In that case, a SQL query trigger must be used instead: if the output of the query changes, the trigger sets off a scenario run.

For example, let’s trigger the scenario if the latest purchase date in the transactions dataset changes.

  1. From the Flow, create and run a Sync recipe to copy the tx dataset to a SQL connection.

  2. Navigate to the Settings tab of the Data Refresh scenario.

  3. Click Add Trigger.

  4. Choose Trigger on SQL query change.

  5. For quicker feedback during this test, change it to run every 9 seconds with a grace delay of 12 seconds.

  6. Select the connection holding the synced transactions data (tx_copy).

  7. For the SQL script, copy-paste the following block, making any necessary changes to your project key and table name in the FROM clause.

    SELECT MAX("purchase_date")
    FROM "TUT_SCENARIOS_tx_copy"
    

    Warning

    It is advisable to test this query first in a SQL notebook. Dataiku Cloud users, for example, will need to include instance information in the FROM clause.

  8. Click Save.

Dataiku screenshot of the settings tab of a scenario.

As before, let’s simulate a dataset change to set off the trigger.

  1. In order to force a new maximum purchase date value, return to the Settings tab of the tx dataset.

  2. Click Show Advanced options, change which files to include, and then Save.

  3. Build the tx_copy dataset to simulate a change in the dataset tied to the SQL trigger.

  4. When the Sync recipe finishes running, switch back to the Last runs tab of the Data Refresh scenario to see the run log triggered by the SQL query change.

Optional: Custom Python trigger

Although we could achieve the same result with the standard time-based trigger, let’s have a look at a custom Python trigger that fires on the first Friday of each month.

  1. Navigate to the Settings tab of the Data Refresh scenario.

  2. Click Add Trigger.

  3. Choose Custom trigger.

  4. Name it Every first Friday.

  5. Copy-paste the following code block, and then Save.

from dataiku.scenario import Trigger
from datetime import date

t = Trigger()

today = date.today()
dayofweek = today.isoweekday()  # Monday = 1 ... Sunday = 7
day = today.day

# Fire only on a Friday (5) within the first seven days of the month,
# i.e. the first Friday of the month
if dayofweek == 5 and day <= 7:
    t.fire()

Note

See the API documentation for more on Python-based scenarios.

Tip

An additional option for every scenario is the ability to add a reporter to send alerts about scenario activities via email or other messaging channels. For an example, see this tutorial on creating a reporter for a Microsoft Teams channel.

Turn off auto-triggers#

Now that we have finished testing triggers for this scenario, let’s remember to switch them off to avoid unwanted runs. There are two levels of controls to be aware of:

  • The On/Off switch for an individual trigger.

  • The On/Off switch for all triggers for the entire scenario.

Let’s turn off all triggers for the scenario.

  1. Open the main Scenarios page (or the Settings tab of the Data Refresh scenario).

  2. Toggle auto-triggers to Off.

Dataiku screenshot of the scenarios page showing auto-triggers off.

Compute metrics#

Triggers grant us control over when scenarios run. Metrics and checks allow us to take more fine-grained control of how scenarios run, once triggered. For example, a scenario may have a weekly trigger. However, if only a small number of new records are available, we may want steps within the scenario to take different actions.

Our upstream dataset of interest is tx. Let’s start by computing the default metrics already available for this dataset.

  1. Open the tx dataset, and navigate to the Status tab.

  2. On the Metrics subtab, click Compute to calculate the default metrics such as column count and record count.

Dataiku screenshot of the metrics subtab of a dataset.
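
If needed, metrics can also be computed programmatically, for example from a notebook. Here is a minimal sketch using the Python API; the project key is a placeholder.

import dataiku

# Placeholder project key; compute the metrics configured on the tx dataset
client = dataiku.api_client()
dataset = client.get_project("TUT_SCENARIOS").get_dataset("tx")
results = dataset.compute_metrics()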

Create new metrics#

In addition to the default metrics already available, we can also create our own. Let’s create a metric on the minimum and maximum of the authorized_flag column since we expect these values to always be 0 and 1, respectively.

  1. Within the Status tab of the tx dataset, navigate to the Edit subtab.

  2. Toggle On the Column statistics section of metrics.

  3. Add the Min and Max of authorized_flag.

  4. Select Click to run this now and then Run to test out these new metrics.

  5. Click to view the Last run results.

  6. Click Save.

Dataiku screenshot of the edit metrics page of a dataset.
Optional: Custom Python metric

In addition to the built-in metric options, you can also create custom metrics with Python. For example, we might want to track the most and least authorized item category.

  1. Scroll to the bottom of the Edit Metrics page.

  2. Click New Python Probe.

  3. Toggle it On, and click the pencil icon to rename it Most/least authorized categories.

  4. Copy-paste the following code block.

    # Define here a function that returns the metric.
    def process(dataset, partition_id):
        # dataset is a dataiku.Dataset object
        df = dataset.get_dataframe()

        # Treat missing authorization flags as not authorized
        df['authorized_flag'] = df['authorized_flag'].fillna(0)

        # Average authorization rate per item category, highest first
        df_per_category = (df
                           .groupby('item_category')
                           .agg({'authorized_flag': 'mean'})
                           .sort_values('authorized_flag', ascending=False)
                           )

        most_authorized_category = df_per_category.index[0]
        least_authorized_category = df_per_category.index[-1]
        return {'Most authorized item category': most_authorized_category,
                'Least authorized item category': least_authorized_category}
    
  5. Select Click to run this now and then Run to test out these new metrics.

  6. Click to view the Last run results.

  7. Click Save.

Display metrics#

The last step is to display any new metrics you have created.

  1. Navigate back to the Metrics subtab.

  2. Click X/Y Metrics.

  3. In the column of available metrics, click Add all.

  4. Click Save.

Dataiku screenshot of the dialog for displaying metrics.

Create checks on metrics#

After creating metrics important to our use case, the next step is to use them as the basis for checks.

Check values in a range#

Let’s assume we have strong expectations about the number of records in the dataset. We can create a check to see if a metric falls outside of this range, and generate a warning or a failure when it does.

Let’s purposely create a check that will fail.

  1. Navigate to the Edit subtab of the tx dataset.

  2. Click on the Checks panel.

  3. Click Metric Value is in a Numeric Range.

  4. Name the check 50k < Records < 200k.

  5. Select Record count as the metric to check.

  6. Set the minimum to 50000, the soft minimum to 100000, the soft maximum to 150000, and the maximum to 200000.

  7. Click Check to test it, and confirm the error.

Dataiku screenshot of a check for a metric in a range of values.

Check values in a set#

We can also check if a metric is in a set of values. Since we know that the minimum value for authorized_flag must be 0, we can create a check to ensure that this is true. If it is not 0, the check will return an error.

  1. At the bottom of the Edit checks page, click Metric Value is in a Set of Values.

  2. Name it min(authorized_flag) == 0.

  3. Select Min of authorized_flag as the metric.

  4. Click Add Value, and give 0 as the value.

  5. Click Check to test it, and confirm it is OK.

  6. Save.

Dataiku screenshot of a check for a metric in a set of values.
Optional: Custom Python check

We can also create custom checks in Python. For example, we could create a check that the difference between the minimum and maximum values of authorized_flag is 1.

  1. Scroll to the bottom of the Edit checks page.

  2. Click Custom (Python).

  3. Name the check Expected target values.

  4. Copy-paste the following code block.

  5. Click Check to confirm it returns “OK”, and then Save.

# Define here a function that returns the outcome of the check.
def process(last_values, dataset, partition_id):
    # last_values is a dict of the last values of the metrics,
    # with the values as a dataiku.metrics.MetricDataPoint.
    # dataset is a dataiku.Dataset object
    min_flag = int(last_values['col_stats:MIN:authorized_flag'].get_value())
    max_flag = int(last_values['col_stats:MAX:authorized_flag'].get_value())

    if max_flag - min_flag == 1:
        return 'OK', 'Expected target values'
    else:
        return 'ERROR', 'Unexpected target values'

Display checks#

The last step is to display any new checks you have created.

  1. Navigate to the Checks subtab.

  2. Click X/Y Checks.

  3. In the column of available checks, click Add all.

  4. Click Save.

Dataiku screenshot of the dialog for displaying checks.

Use metrics and checks in a scenario#

Metrics and checks can be informative on their own. For example, we can quickly view not only their last value, but also their history. Moreover, we can create a dedicated metrics (or checks) dataset where we can dive deeper into the numbers themselves.

Dataiku screenshot of the metrics page of a dataset.
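
These values can also be retrieved programmatically. Here is a minimal sketch using the Python API, assuming a placeholder project key and the built-in record count metric ID.

import dataiku

# Placeholder project key; fetch the most recent metric values for tx
client = dataiku.api_client()
dataset = client.get_project("TUT_SCENARIOS").get_dataset("tx")
metrics = dataset.get_last_metric_values()

# Look up a single metric by its ID (here, the built-in record count)
record_count = metrics.get_metric_by_id("records:COUNT_RECORDS")
print(record_count)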

However, metrics and checks can be even more powerful when teamed with scenarios. To demonstrate this, let’s first clear an upstream dataset.

  1. From the Flow, select the tx_joined dataset (as an example of an upstream dataset).

  2. In the Actions menu on the right, click Clear Data, and Confirm.

Because of the cleared upstream dataset, we know that the Data Refresh scenario, once triggered, will rebuild the tx_windows dataset. What’s interesting is that we may want the build step to run only if all checks pass.

  1. Return to the Steps tab of the Data Refresh scenario.

  2. Click Add Step.

  3. Select Run checks.

  4. Drag this step to the first position before the build step.

  5. Click Add Dataset to Check.

  6. Choose tx, and click Add.

  7. Open the Build step, and confirm that the “Run this step” setting is set to If no prior step failed.

Dataiku screenshot of the run checks and build step in a scenario.

Now, before executing the build step, this scenario first runs checks on the tx dataset. Only if no prior step fails will the scenario move on to the build step. Since we know one check will fail, let’s see what happens.

  1. Click Run to manually trigger the scenario.

  2. Navigate to the Last runs tab to see its progress.

  3. Click on the failed check to find the most recent scenario run’s source of failure. Note how the scenario did not move on to the build step.

Dataiku screenshot of the last runs tab of a failed scenario.
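
You can also run a dataset’s checks programmatically to preview their outcome without launching the whole scenario. Here is a minimal sketch using the Python API, again with a placeholder project key.

import dataiku

# Placeholder project key; run the checks configured on the tx dataset
client = dataiku.api_client()
dataset = client.get_project("TUT_SCENARIOS").get_dataset("tx")
results = dataset.run_checks()
print(results)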

Tip

Experiment with adjusting the values in the record count check to generate failure, warning, and success states. Then explore the Run this step options to produce different outcomes.

What’s next?#

Congratulations on creating a scenario to automate a routine action in Dataiku! Once you have mastered these basic building blocks, you can explore other built-in scenario steps, such as those that create exports or execute code.

Note

You can learn more about scenarios in the reference documentation.