Automation Scenarios, Metrics & Checks¶
Learn how to use metrics, checks, and scenarios to schedule jobs and monitor the status and quality of your dataset.
Tip
Validate your knowledge of this area by registering for the Dataiku Academy course, Automation. Then challenge yourself to earn a certification!
Tutorials¶
- Tutorial | Metrics & Checks (Automation part 1)
- Tutorial | Scenarios (Automation part 2)
- Tutorial | Custom Metrics, Checks & Scenarios (Automation part 3)
- Tutorial | Automation practice
- Tutorial | Scenario reporter for a Microsoft Teams channel
- Tutorial | Create a Jira issue after a scenario failure
How-to | Create a Google Chat reporter¶
Dataiku provides the means to add reporters in a scenario. These reporters can be used to inform teams of users about scenario activities. For example, scenario reporters can update users about the training of models or changes in data quality. Reporters can also create actionable messages that users can receive within their email or through other messaging channels.
There are many built-in messaging channels available to scenario reporters, and many more can be reached through the use of a Webhook reporter.
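Under the hood, a Google Chat incoming webhook is just an HTTP endpoint that accepts a POST with a small JSON body of the form `{"text": "..."}`. Before wiring up the reporter, you can verify that your webhook URL works with a minimal hand-rolled sketch like the one below (the URL and message text are placeholders, and `build_chat_payload`/`post_to_chat` are hypothetical helper names, not part of Dataiku):

```python
import json
from urllib import request

# Placeholder - replace with the webhook URL generated in your Google Chat space
WEBHOOK_URL = "https://chat.googleapis.com/v1/spaces/AAAA/messages?key=KEY&token=TOKEN"

def build_chat_payload(text):
    """Build the JSON body that Google Chat incoming webhooks expect."""
    return {"text": text}

def post_to_chat(url, text):
    """POST the payload to the webhook URL (network call, not run here)."""
    data = json.dumps(build_chat_payload(text)).encode("utf-8")
    req = request.Request(
        url, data=data,
        headers={"Content-Type": "application/json; charset=UTF-8"})
    return request.urlopen(req)

# Example payload for a scenario report message
payload = build_chat_payload("Scenario MY_SCENARIO finished with outcome: SUCCESS")
print(json.dumps(payload))
```

Calling `post_to_chat(WEBHOOK_URL, "hello")` from a notebook should make the message appear in the chat space, which confirms the URL you will paste into the reporter.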
First, within a project, navigate to the Scenarios page.

From here you can select a scenario to add a reporter to.

Create a new reporter with the “Webhook” type.

Then fill in the necessary fields for the scenario reporter. Instructions on how to set up scenario reporters can be found in the reference documentation. For a Google Chat reporter, pay particular attention to the URL field: paste the Google Chat webhook URL where you want scenario reports to be sent. Instructions on setting up a webhook for Google Chat can be found in the Google Chat documentation.

Now, once the scenario runs and the reporter conditions are met, you will receive scenario reports directly in Google Chat.

You’re all set: scenario reports will now be sent to any Google Chat room of your choosing for improved monitoring and increased collaboration!
How-to | Build missing partitions with a scenario¶
Consider the following situation:
- You have a Dataiku Flow with datasets partitioned by date (format: YYYY-MM-DD).
- You run a scenario daily to build this Flow with the current date.
- You’d like a way to build your Flow/scenario for (many) dates other than the current date, in particular for all missing dates/partitions.
You can easily adapt this how-to to other similar use cases with partitions.
Step 1: Scenario to build your Flow for one given partition¶
You have a Dataiku Flow with datasets partitioned by date that you would like to rebuild.

First, define a scenario that runs the Flow for a single partition.
You probably already have such a scenario that runs, for example, for the current day using the keyword CURRENT_DAY
as partition identifier.

Add a new step in your scenario that will first run to define a scenario variable.
Let’s call this variable partition
and evaluate it with the following Dataiku formula:
coalesce(partition_to_build, scenarioTriggerParam_partition_to_build, now().toString('yyyy-MM-dd'))
This variable takes the value of another variable called partition_to_build
if it is defined (our meta-scenario will define it in step 2), otherwise the value of scenarioTriggerParam_partition_to_build
that we can set manually, or the current date as a fallback.
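The fallback order of that formula can be made explicit with a small Python mimic (a hypothetical `coalesce` helper; the real evaluation happens in the Dataiku formula language, where undefined variables behave like null values):

```python
from datetime import date

def coalesce(*values):
    # Return the first non-null value, mirroring the fallback order
    # of the Dataiku formula
    for value in values:
        if value is not None:
            return value
    return None

partition_to_build = None       # not set by a meta-scenario
scenario_trigger_param = None   # not set via "Run with custom parameters"
current_day = date.today().strftime("%Y-%m-%d")

partition = coalesce(partition_to_build, scenario_trigger_param, current_day)
print(partition)  # falls back to the current date
```

If either of the first two variables were set, it would win over the current-date fallback, which is exactly what lets the meta-scenario in step 2 override the partition.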

Now, use this variable in the build steps as a partition identifier:
${partition}

You can now run the scenario; it will run for the current day.
You can also run the scenario for another date by choosing “Run with custom parameters” in the top-right corner and entering a value for the parameter partition_to_build:

Step 2: Meta-scenario that runs the first scenario for all missing partitions¶
Now that we have a scenario that can build the Flow for a given partition, let’s create another scenario that will be able to run this scenario for all missing partitions.
First, create a “Custom Python script” scenario.

You can now add a script that:
- gets all existing partitions
- generates the list of partitions that should exist
- finds the missing partitions (the difference of the two previous lists)
- runs the scenario that builds the Flow for each missing partition, one by one
from dataiku.scenario import Scenario
import dataiku
from datetime import timedelta, date

# Object for this scenario
scenario = Scenario()

# Get all current existing partitions from a dataset of the Flow
dataset = dataiku.Dataset('weather_conditions_prepared')
partitions = dataset.list_partitions()
print("Existing partitions:")
print(partitions)

# Generate all partitions that should be built (here from Jan 1st 2020 until the current day)
def dates_range(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)

all_dates = [dt.strftime("%Y-%m-%d") for dt in dates_range(date(2020, 1, 1), date.today())]
print("Partitions that should exist:")
print(all_dates)

# Find the missing partitions and build them one by one
for partition in all_dates:
    if partition not in partitions:
        print("%s : missing partition" % partition)
        # Set a variable (on the current scenario) with the missing partition to build
        scenario.set_scenario_variables(partition_to_build=partition)
        # Run the scenario that builds the Flow for a given partition.
        # Scenario variables are propagated to child scenarios, so that
        # scenario will be able to read the variable 'partition_to_build'
        scenario.run_scenario("BUILD_ONE_DAY")
Here is the same Python script as a scenario.

Finally, you can run the scenario and see in the list of jobs that missing partitions get built.
Code Sample | Set a timeout for a scenario build step¶
There is no explicit timeout functionality for a Build step within a Dataiku scenario. A common question is how to set up a timeout for a scenario or scenario step, to avoid situations where a scenario gets stuck in a running state indefinitely.
You can implement one using the Dataiku Python API. The same scenario step can be rewritten as a custom Python step, in which case you can add Python code to implement a timeout.
Here is a code sample that you can try:
import time

import dataiku
from dataiku.scenario import Scenario
from dataikuapi.dss.future import DSSFuture

s = Scenario()

# Define your build step below - this code is specific to building a dataset.
# asynchronous=True returns a handle immediately instead of blocking
step_handle = s.build_dataset("your_dataset", asynchronous=True)

start = time.time()
while not step_handle.is_done():
    end = time.time()
    print(end, start, end - start)
    # Define your timeout - the example below aborts after more than 1000 seconds
    if end - start > 1000:
        f = DSSFuture(dataiku.api_client(), step_handle.future_id)
        f.abort()
        raise Exception("Took too much time")
    time.sleep(5)  # avoid busy-waiting between checks
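The same polling pattern can be factored into a reusable helper that works with any step handle (`wait_or_abort` is a hypothetical name; `is_done` is any callable reporting completion, and `on_timeout` is where you would abort the DSS future):

```python
import time

def wait_or_abort(is_done, on_timeout, timeout_seconds, poll_interval=1.0):
    """Poll is_done() until it returns True; once timeout_seconds have
    elapsed, call on_timeout() and raise TimeoutError."""
    start = time.monotonic()
    while not is_done():
        if time.monotonic() - start > timeout_seconds:
            on_timeout()  # e.g. abort the underlying DSS future
            raise TimeoutError("Step took more than %s seconds" % timeout_seconds)
        time.sleep(poll_interval)

# Toy usage: a "step" that completes on the third poll
state = {"polls": 0}
def fake_is_done():
    state["polls"] += 1
    return state["polls"] >= 3

wait_or_abort(fake_is_done, on_timeout=lambda: None,
              timeout_seconds=5, poll_interval=0.01)
print("completed after", state["polls"], "polls")
```

In a real custom Python step you would pass `step_handle.is_done` as the first argument and an abort callback built on `DSSFuture` as the second.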
Code Sample | Set email recipients in a “Send email” reporter¶
The Dataiku API allows users to programmatically interact with the product. It can be very useful for operations that would be quite repetitive to achieve through the UI, or for automating interactions with Dataiku.
In this use case, let’s consider we have a project containing multiple scenarios, and for all of them, we want to add a new recipient for all the existing “Send email” reporters.
We’re going to achieve that with a Python script that will be executed from inside of Dataiku, but the same logic can be used from outside of Dataiku.
The idea of this operation is to list all the existing scenarios in a specified project, search for all the “Send email” reporters, retrieve the list of recipients, and finally update that list if the new recipient isn’t already present.
To interact with scenarios, we first need to access the REST API client.
From a script running inside of Dataiku, it’s pretty straightforward using dataiku.api_client():
import dataiku
client = dataiku.api_client()
From outside of Dataiku, you need additional steps to configure how to access the instance. Please refer to the article Using the APIs outside of DSS from the Developer Guide to know more.
Then, let’s create a variable to store the new recipient email address:
new_recipient_email_address = "john.doe@here.com"
The next step is to retrieve the project and the list of scenarios it contains:
project = client.get_project("PROJECT_KEY")
scenarios_list = project.list_scenarios()
list_scenarios()
returns a list of dictionaries; let’s use the id property of each to retrieve a handle to interact with the scenario:
for scenario_metadata in scenarios_list:
    scenario = project.get_scenario(scenario_metadata['id'])
Let’s then retrieve the scenario definition to interact with the scenario attributes:
scenario_definition = scenario.get_definition(with_status=False)
Now, it’s time to iterate over all the existing reporters, check whether they are “Send email” reporters, and if so, retrieve the existing list of recipients and add the new recipient email address when it is missing:
update_scenario = False
for i in range(0, len(scenario_definition['reporters'])):
    if scenario_definition['reporters'][i]['messaging']['type'] == "mail-scenario":
        recipients = [recipient.strip() for recipient in scenario_definition['reporters'][i]['messaging']['configuration']['recipient'].split(',')]
        if new_recipient_email_address not in recipients:
            recipients.append(new_recipient_email_address)
            scenario_definition['reporters'][i]['messaging']['configuration']['recipient'] = ', '.join(recipients)
            update_scenario = True
            print("Updating recipient for mail reporter \"{}\" of scenario \"{}\"".format(scenario_definition['reporters'][i]['name'], scenario_metadata['name']))
Finally, if we’ve edited the list of recipients, let’s update the definition of the scenario:
if update_scenario:
    scenario.set_definition(scenario_definition, with_status=False)
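The recipient-handling logic can also be isolated into a small pure function, which makes the comma-splitting and deduplication easy to test outside of Dataiku (`add_recipient` is a hypothetical helper name, not part of the Dataiku API):

```python
def add_recipient(recipient_field, new_address):
    """Add new_address to a comma-separated recipient string.

    Returns (updated_field, was_modified)."""
    recipients = [r.strip() for r in recipient_field.split(",") if r.strip()]
    if new_address in recipients:
        return recipient_field, False
    recipients.append(new_address)
    return ", ".join(recipients), True

field, changed = add_recipient("alice@here.com, bob@here.com", "john.doe@here.com")
print(field)    # "alice@here.com, bob@here.com, john.doe@here.com"
print(changed)  # True
```

Inside the loop you would then call it on `...['configuration']['recipient']` and write back the updated string when `was_modified` is true.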
Final code sample¶
import dataiku

client = dataiku.api_client()
project = client.get_project("PROJECT_KEY")
scenarios_list = project.list_scenarios()

new_recipient_email_address = "john.doe@here.com"

for scenario_metadata in scenarios_list:
    scenario = project.get_scenario(scenario_metadata['id'])
    scenario_definition = scenario.get_definition(with_status=False)
    update_scenario = False
    for i in range(0, len(scenario_definition['reporters'])):
        if scenario_definition['reporters'][i]['messaging']['type'] == "mail-scenario":
            recipients = [recipient.strip() for recipient in scenario_definition['reporters'][i]['messaging']['configuration']['recipient'].split(',')]
            if new_recipient_email_address not in recipients:
                recipients.append(new_recipient_email_address)
                scenario_definition['reporters'][i]['messaging']['configuration']['recipient'] = ', '.join(recipients)
                update_scenario = True
                print("Updating recipient for mail reporter \"{}\" of scenario \"{}\"".format(scenario_definition['reporters'][i]['name'], scenario_metadata['name']))
    if update_scenario:
        scenario.set_definition(scenario_definition, with_status=False)
This code sample is also available on our GitHub repository.
FAQ | Can I control which datasets in my Flow get rebuilt during a scenario?¶
When configuring a Build / Train step in a Dataiku scenario, different build modes let you control which items in your Flow are rebuilt or retrained when the scenario is triggered.
In the Steps tab of your Scenario, click Add Step and select Build / Train.
Next, select the item you would like to build, the appropriate build mode, and — if applicable — your preferred handling of dependencies.

Note
To understand the various build modes that you can choose in Dataiku, visit the reference documentation or our concept article on dataset building strategies for further detail.