Tutorial | Custom step-based scenarios#

Scenarios are the main tool for automating actions, such as rebuilding datasets or retraining models, in Dataiku. Every scenario feature, such as triggers and steps, can be customized with Python code so you can adapt your automation to your specific use case.

Let’s see how they work!

Get started#

Objectives#

In this tutorial, you will:

  • Add custom steps with Python code inside a step-based scenario.

  • Write custom scenario triggers with Python code.

  • Create custom scenario variables.

Note

This tutorial covers adding Python within a step-based scenario. For a scenario written entirely in Python code, please see Tutorial | Custom script scenarios.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Access to an instance of Dataiku 12+.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

  • Basic knowledge of Python code.

Note

This tutorial builds on the scenario feature, so we encourage you to complete the Data Quality & Automation course first.

Create the project#

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Developer > Custom step-based scenarios.

  2. From the project homepage, click Go to Flow.

Note

You can also download the starter project from this website and import it as a zip file.

Add a Custom Python step#

Consider the final dataset at the end of a pipeline. This dataset may be the key input to a dashboard, webapp, or Dataiku application. It may be routinely shared with other Dataiku projects or exported to other software tools.

A common need is to automate the rebuilding of a dataset like this as new data becomes available at the start of a pipeline. The starter project already contains a scenario for this purpose.

Note

For this example, we are rebuilding a dataset, but the logic works the same way for other actions, such as retraining a model or refreshing a model evaluation store.

  1. From the Jobs menu in the top navigation bar, click Scenarios.

  2. Open the Data Refresh scenario.

The scenario consists of a single visual step, Dataset build, which rebuilds the tx_windows dataset if the check on the upstream tx folder passes. Let’s assume a colleague created this visual scenario. It works as expected, but we would like to customize how the step behaves.

Build a dataset with code#

Let’s write the code that replaces the visual step with a code step.

  1. Navigate to the Steps tab of the Data Refresh scenario.

  2. Click Add Step at the bottom left.

  3. Select Execute Python code under Code.

    Note

    A Python code step gives you access to the Python API, so you can manage your scenario directly through code. This opens up a wide range of possible actions.

  4. Rename the step Python Refresh.

  5. Copy-paste the code below.

    from dataiku.scenario import Scenario
    
    # The Scenario object is the main handle from which you initiate steps
    s = Scenario()
    
    # The dataset to be built
    step = s.build_dataset("tx_windows", asynchronous=True)
    
  6. Click Save (or Cmd/Ctrl + S).

Dataiku screenshot of a custom python scenario step creation.

We now have the same scenario as at the beginning, but implemented as a Python code step. Let’s add custom code to make the step more flexible.

Add custom functionality#

Sometimes a scenario runs far longer than expected, for example when an error leaves a job hanging. To guard against this, we can set a timeout directly within the code step. To do so:

  1. Replace the previous code with the complete version below:

    import dataiku
    import time
    from dataiku.scenario import Scenario
    from dataikuapi.dss.future import DSSFuture
    
    # Timeout in seconds.
    TIMEOUT_SECONDS = 3600
    # Pause between two status checks, in seconds.
    POLL_INTERVAL_SECONDS = 5
    
    # The Scenario object is the main handle from which you initiate steps.
    s = Scenario()
    
    # Start the build without blocking so that we can monitor its duration.
    step = s.build_dataset("tx_windows", asynchronous=True)
    
    start = time.time()
    while not step.is_done():
        elapsed = time.time() - start
        print("Duration: {:.0f}s".format(elapsed))
        if elapsed > TIMEOUT_SECONDS:
            # Abort the underlying job, then fail the step.
            f = DSSFuture(dataiku.api_client(), step.future_id)
            f.abort()
            raise Exception("Scenario was aborted because it took too much time.")
        # Wait before polling again to avoid a busy loop and flooded logs.
        time.sleep(POLL_INTERVAL_SECONDS)
    
  2. Click Save.
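
The polling-with-timeout pattern used in this step can be tried outside Dataiku. The sketch below is a minimal illustration under stated assumptions, not Dataiku's implementation: FakeStep is a hypothetical stand-in for the handle returned by an asynchronous build, and wait_with_timeout is a helper name invented for this example.

```python
import time


class FakeStep:
    """Hypothetical stand-in for an asynchronous step handle (not a Dataiku class)."""

    def __init__(self, polls_until_done):
        self.polls_left = polls_until_done
        self.aborted = False

    def is_done(self):
        # Report completion after a fixed number of polls.
        self.polls_left -= 1
        return self.polls_left <= 0

    def abort(self):
        self.aborted = True


def wait_with_timeout(step, timeout_seconds, poll_interval=0.01):
    """Poll step.is_done(); abort the step and raise if the timeout is exceeded."""
    start = time.monotonic()
    while not step.is_done():
        if time.monotonic() - start > timeout_seconds:
            step.abort()
            raise TimeoutError("Step aborted after {}s".format(timeout_seconds))
        # Sleep between polls so the loop does not spin.
        time.sleep(poll_interval)
    return time.monotonic() - start


# A step that finishes after a few polls completes normally.
fast = FakeStep(polls_until_done=3)
wait_with_timeout(fast, timeout_seconds=1)

# A step that does not finish in time is aborted.
slow = FakeStep(polls_until_done=10**6)
try:
    wait_with_timeout(slow, timeout_seconds=0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
```

Keeping the wait logic in one function makes the timeout and the polling interval easy to adjust without touching the rest of the step.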

Our step is complete. We can now disable the previous visual step:

  1. Check the box of the Build datasets step card.

  2. Click on Actions.

  3. Select Toggle enable/disable.

  4. Click Save.

Dataiku screenshot of a custom python scenario step creation.

To test the new Python step:

  1. Click Run. You should soon have a notification that says the scenario and the job of building the dataset are completed.

  2. Navigate to the Last runs tab of the scenario to see what happens in detail.

Optional: Trigger a scenario with custom Python code#

Important

This section covers custom triggers. This feature is limited to Business or Enterprise license users. If you don’t have this feature, skip ahead to Add scenario variables.

Let’s demonstrate the value of automating when actions occur by adding a trigger to the scenario. A custom trigger lets you write a Python script that starts your scenario when you need a fully flexible approach. Here, we want to trigger the scenario based on the result of a check on the tx folder.

  1. Navigate to the Settings tab of the Data Refresh scenario.

  2. Click Add Trigger.

  3. Choose Custom trigger.

  4. Name it Ok check status.

  5. Copy-paste the following code block.

    import dataiku
    from dataiku.scenario import Trigger
    
    # Call the Managed Folder "tx" where the check is computed.
    folder = dataiku.Folder("tx")
    
    # Get the last outcome of the check called "Empty folder".
    checks_message = folder.get_last_check_values().get_check_by_name('Empty folder')['lastValues'][0]['outcome']
    
    # Call the trigger.
    t = Trigger()
    
    # Set up the condition.
    if checks_message == 'OK':
        t.fire()
    
  6. Click Save.
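
The one-line lookup in the trigger assumes that the check exists and has at least one recorded value. A more defensive variant is sketched below, and it runs without Dataiku. The dictionary shape (lastValues, outcome) is taken from the trigger code above. The names last_outcome and should_fire and the sample dictionaries are hypothetical.

```python
def last_outcome(check):
    """Return the outcome of the first entry in 'lastValues', or None if missing."""
    if not check:
        return None
    values = check.get("lastValues") or []
    if not values:
        return None
    return values[0].get("outcome")


def should_fire(check):
    """Fire only when the recorded outcome is exactly 'OK'."""
    return last_outcome(check) == "OK"


# Hypothetical check payloads, shaped like the one indexed in the trigger code.
ok_check = {"lastValues": [{"outcome": "OK"}]}
error_check = {"lastValues": [{"outcome": "ERROR"}]}
never_run = {"lastValues": []}
```

In the trigger itself, t.fire() would then be called only when should_fire returns True for the value obtained from get_check_by_name.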

Dataiku screenshot of adding a custom trigger to a step-based scenario.

To test the trigger:

  1. Enter 5 in the Run every (seconds) input.

  2. Toggle on the Auto triggers in the Run panel.

  3. Click Save.

  4. Wait for the scenario to start and finish. Depending on your instance settings, you may see a pop-up notification.

  5. Toggle off the Auto triggers and click Save again to avoid unwanted runs.

  6. Navigate to the Last runs tab of the scenario.

  7. With the most recent scenario run selected, click to view the job it triggered.

Dataiku screenshot of the auto-triggers running.

Depending on your needs, you can leave auto-triggers on to fully automate the scenario or turn them off to avoid unwanted runs.

Tip

Trigger logs are not recorded in the scenario logs. If the trigger doesn’t actually run as expected, you can ask your admin to check the backend logs in Administration > Maintenance > Log files > backend.log.

Add scenario variables#

You can define variables that are available throughout the scenario. They give access to information during the scenario run, including in the reporters executed at the end of the run. For example, the trigger passes data to the scenario, so we can store the trigger type in a variable. We will use it to change the timeout value when the scenario is triggered manually.

A variable can be defined in the UI with a Define Variable step, but we prefer using Python code because of its flexibility.

  1. Navigate to the Steps tab.

  2. Click Add Step and Execute Python code.

  3. Drag the step card into the first position in the left panel (with the grip handle on the left of the card).

  4. Rename the step Set Variable.

  5. Copy-paste the code below:

    from dataiku.scenario import Scenario
    
    # Handling our scenario.
    s = Scenario()
    
    # Get the trigger type
    trigger_type = s.get_trigger_type()
    
    # trigger_var is now the name of the variable across the scenario
    # and can be changed at your convenience.
    s.set_scenario_variables(trigger_var = trigger_type)
    
  6. Click Save.

Dataiku screenshot of custom scenario variable creation.

Now we have a scenario variable that every later step can read, either in code or within the UI. Let’s use it in our first Python code step, Python Refresh. To do so:

  1. Navigate back to the Python Refresh step.

  2. Copy-paste the updated code below:

    import dataiku
    import time
    from dataiku.scenario import Scenario
    from dataikuapi.dss.future import DSSFuture
    
    # The Scenario object is the main handle from which you initiate steps.
    s = Scenario()
    
    # Read the variable set by the Set Variable step.
    trigger_type = s.get_all_variables()['trigger_var']
    
    # Use a shorter timeout when the scenario was triggered manually.
    if trigger_type == "manual":
        TIMEOUT_SECONDS = 120
    else:
        TIMEOUT_SECONDS = 3600
    
    step = s.build_dataset("tx_windows", asynchronous=True)
    
    start = time.time()
    while not step.is_done():
        elapsed = time.time() - start
        print("Duration: {:.0f}s".format(elapsed))
        if elapsed > TIMEOUT_SECONDS:
            f = DSSFuture(dataiku.api_client(), step.future_id)
            f.abort()
            raise Exception("Scenario was aborted because it took too much time.")
        # Wait before polling again to avoid a busy loop.
        time.sleep(5)
    
  3. Click Save.

This code shortens the timeout when the scenario is triggered manually. For example, during a manual test run, we may not want to wait a full hour if an error causes the build to hang.
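
If more trigger types need their own timeout, the if/else can be replaced by a lookup table. The sketch below is one possible way to do this and runs without Dataiku. The values 120 and 3600 and the "manual" trigger type come from the step above. The names timeout_for and TIMEOUT_BY_TRIGGER and the "some_other_trigger" value are hypothetical. A plain dictionary stands in for the result of get_all_variables().

```python
DEFAULT_TIMEOUT_SECONDS = 3600

# Per-trigger overrides; "manual" is the value tested in the step above.
TIMEOUT_BY_TRIGGER = {"manual": 120}


def timeout_for(variables):
    """Pick a timeout from scenario variables, tolerating a missing 'trigger_var'."""
    trigger_type = variables.get("trigger_var")
    return TIMEOUT_BY_TRIGGER.get(trigger_type, DEFAULT_TIMEOUT_SECONDS)


manual_timeout = timeout_for({"trigger_var": "manual"})
other_timeout = timeout_for({"trigger_var": "some_other_trigger"})
missing_timeout = timeout_for({})
```

Using .get() instead of direct indexing means the step falls back to the default timeout rather than failing with a KeyError if the Set Variable step is disabled or has not run.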

You can now test this scenario. To do so:

  1. Click Run.

  2. Navigate to the Last runs tab of the scenario to see what happens.

Tip

You can debug your code by adding print() statements and then reading their output in the Step logs of the Last runs tab.

Visualize your results#

The project includes a dataset designed to record scenario results. To populate it, create a reporter that writes the scenario results into this dataset at the end of each scenario run. To do so:

  1. Navigate to the Settings tab, click Add Reporter.

  2. Select Send to dataset.

Configure a reporter#

Let’s configure the contents of the reporter next.

  1. Name the reporter Store scenario results.

  2. Turn Off the run condition to report all results, regardless of the scenario’s success or failure.

  3. Provide the project key found in your URL.

  4. Provide the dataset name scenario_results.

  5. Provide timestamp as the name of the Timestamp column.

  6. Copy-paste the JSON below to fill the remaining columns found in the schema.

    {
      "scenario": "${scenarioName}",
      "status": "${outcome}",
      "triggerType": "${triggerType}"
    }
    
  7. Click Save to activate the reporter.
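
The ${...} placeholders in the JSON are replaced with values from the scenario run when the reporter executes. To get a feel for the result, the sketch below mimics the substitution with Python's string.Template, which happens to use the same ${name} syntax. This is only an analogy, not how Dataiku performs the expansion, and the values in run_info are made up for illustration.

```python
import json
from string import Template

# Column definitions copied from the reporter configuration.
reporter_columns = {
    "scenario": "${scenarioName}",
    "status": "${outcome}",
    "triggerType": "${triggerType}",
}

# Hypothetical values for one scenario run.
run_info = {
    "scenarioName": "Data Refresh",
    "outcome": "SUCCESS",
    "triggerType": "manual",
}

# Expand every placeholder to build the row that would be written.
row = {
    column: Template(value).substitute(run_info)
    for column, value in reporter_columns.items()
}
print(json.dumps(row, indent=2))
```

Each key of the JSON becomes a column of the scenario_results dataset, and each expanded value becomes the cell written for that run.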

Dataiku screenshot of reporter creation.

Check the results#

Once your scenario has been run a few times, you can then check the results:

  1. Navigate to the Flow.

  2. Open the scenario_results dataset in the Results Flow zone.

  3. Click on Update Sample to force the refresh and have the latest results.

Dataiku screenshot of adding a custom trigger to a step-based scenario.

You can see the dataset populated with your previous test runs. To see more on this feature, please check this tutorial.

What’s next?#

Congratulations on creating a custom step-based scenario to automate a routine action in Dataiku with the flexibility added by custom code!

A common next step would be to add a reporter to the scenario to communicate activities through various messaging channels.

Create one with code in the Custom Script Scenario tutorial!

See also

You can learn more about scenarios in the reference documentation.