# Automation Scenarios, Metrics & Checks

Learn how to use metrics, checks, and scenarios to schedule jobs and monitor the status and quality of your dataset.

Tip

To validate your knowledge of this area, register for the Automation course, part of the Advanced Designer learning path, on the Dataiku Academy.

## How-to | Create a Google Chat reporter

Dataiku DSS provides the means to add reporters in a scenario. These reporters can be used to inform teams of users about scenario activities. For example, scenario reporters can update users about the training of models or changes in data quality. Reporters can also create actionable messages that users can receive within their email or through other messaging channels.

There are many built-in messaging channels available to scenario reporters, and many more can be accessed through a webhook reporter.

First, within a project, navigate to the Scenarios page.

From here you can select a scenario to add a reporter to.

Create a new reporter with the “Webhook” type.

Then fill in the necessary fields for a scenario reporter. Instructions on how to set up scenario reporters can be found in the Reference Documentation. For setting up a Google Chat reporter, pay particular attention to the URL field. Here you should paste the Google Chat webhook URL where you want scenario reports to be sent. Instructions on setting up a webhook for Google Chat can be found in the Google Chat Documentation.

Now, once a scenario runs and the reporter conditions are met, you will receive scenario reports directly in Google Chat.

You’re all set to have scenario reports sent to any Google Chat room of your choosing, for improved monitoring and increased collaboration!
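If you want to verify the webhook outside DSS, Google Chat incoming webhooks accept a plain JSON payload with a `text` field. Here is a minimal sketch; the URL and message wording are placeholders, not values from this article:

```python
import json

# Placeholder - paste the webhook URL generated in your Google Chat space settings.
WEBHOOK_URL = "https://chat.googleapis.com/v1/spaces/YOUR_SPACE/messages?key=KEY&token=TOKEN"

def build_payload(scenario_name, outcome):
    """Minimal JSON body a Google Chat incoming webhook accepts."""
    return {"text": "Scenario {} finished with outcome: {}".format(scenario_name, outcome)}

payload = build_payload("BUILD_ONE_DAY", "SUCCESS")
body = json.dumps(payload)

# To post it manually (the DSS reporter does this for you):
# import urllib.request
# req = urllib.request.Request(WEBHOOK_URL, data=body.encode("utf-8"),
#                              headers={"Content-Type": "application/json; charset=UTF-8"})
# urllib.request.urlopen(req)
```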

## How-to | Build missing partitions with a scenario

Consider the following situation:

• You have a Dataiku DSS Flow with datasets partitioned by date (format: YYYY-MM-DD)

• You run a scenario daily to build this Flow with the current date

You’d like a way to build your Flow/scenario for (many) dates other than the current date, in particular for all missing dates/partitions.

You can easily adapt this how-to to other similar use cases with partitions.

### Step 1: Scenario to build your Flow for one given partition

You have a DSS Flow with datasets partitioned by date that you would like to rebuild.

First, define a scenario that runs the Flow for a single partition.

You probably already have such a scenario that runs, for example, for the current day using the keyword CURRENT_DAY as partition identifier.

Add a new step in your scenario that will run first to define a scenario variable.

Let’s call this variable partition and evaluate it with the following DSS formula:

coalesce(partition_to_build, scenarioTriggerParam_partition_to_build, now().toString('yyyy-MM-dd'))


This variable takes the value of another variable called partition_to_build if defined (our meta-scenario will define it in step 2), otherwise the value of scenarioTriggerParam_partition_to_build that we can provide manually, or the current date as a fallback.
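In plain Python, the fallback logic of that formula looks roughly like this (a sketch of the semantics, not the actual DSS formula engine; the variable values are hypothetical):

```python
from datetime import date

def coalesce(*values):
    """Return the first defined (non-None, non-empty) value, like the DSS coalesce() formula."""
    for value in values:
        if value is not None and value != "":
            return value
    return None

# Hypothetical state: neither override is defined, so today's date wins.
partition_to_build = None
scenario_trigger_param = None
partition = coalesce(partition_to_build,
                     scenario_trigger_param,
                     date.today().strftime("%Y-%m-%d"))
```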

Now, use this variable in the build steps as a partition identifier:

${partition}


You can try to run the scenario. It will run for the current day.

You can also run the scenario for another date by choosing “Run with custom parameters” in the top-right corner and entering a value for the parameter “partition_to_build”.

### Step 2: Meta-scenario that runs the first scenario for all missing partitions

Now that we have a scenario that can build the Flow for a given partition, let’s create another scenario that will be able to run this scenario for all missing partitions.

First, create a “Custom Python script” scenario.

You can now add a script that:

• gets all existing partitions

• generates a list of partitions that should exist

• finds missing partitions (the difference between the two previous lists)

• executes the scenario to build the Flow for any missing partition, one by one

from dataiku.scenario import Scenario
import dataiku
from datetime import timedelta, date

# object for this scenario
scenario = Scenario()

# get all currently existing partitions from a dataset of the Flow
dataset = dataiku.Dataset('weather_conditions_prepared')
partitions = dataset.list_partitions()
print("Existing partitions:")
print(partitions)

# generate all partitions that should be built (here from Jan 1st 2020 until the current day)
def dates_range(date1, date2):
    for n in range(int((date2 - date1).days) + 1):
        yield date1 + timedelta(n)

all_dates = [dt.strftime("%Y-%m-%d") for dt in dates_range(date(2020, 1, 1), date.today())]
print("Partitions that should exist:")
print(all_dates)

# find the missing partitions and build them one by one
for partition in all_dates:
    if partition not in partitions:
        print("%s : missing partition" % partition)
        # set a variable (on the current scenario) with the missing partition to build
        scenario.set_scenario_variables(partition_to_build=partition)
        # run the scenario that builds the Flow for a given partition.
        # Scenario variables are propagated to child scenarios, so the scenario
        # will be able to read the variable 'partition_to_build'.
        scenario.run_scenario("BUILD_ONE_DAY")


Here is the same Python script as a scenario.

Finally, you can run the scenario and see in the list of jobs that missing partitions get built.

## Code Sample | Set a timeout for a scenario build step

There is no explicit timeout functionality for a Build step within a Dataiku DSS scenario. A common question is how to set up a timeout for a scenario or scenario step, to avoid situations where a scenario gets stuck in a running state indefinitely.

You can implement one using the Dataiku Python API: rewrite the same scenario step as a custom Python step, then add additional Python code that implements the timeout.

Here is a code sample that you can try:

import time
import dataiku
from dataiku.scenario import Scenario
from dataikuapi.dss.future import DSSFuture

s = Scenario()

# Define your build step below - this code builds one specific dataset
# (the parameter is named `async` in older, Python 2 versions of DSS)
step_handle = s.build_dataset("your_dataset", asynchronous=True)

start = time.time()
while not step_handle.is_done():
    end = time.time()
    print(end, start, end - start)
    # Define your timeout - the example below aborts after 1000 seconds
    if end - start > 1000:
        f = DSSFuture(dataiku.api_client(), step_handle.future_id)
        f.abort()
        raise Exception('Took too much time')
    time.sleep(5)  # avoid busy-waiting


## Code Sample | Set email recipients in a “Send email” reporter

The Dataiku DSS API allows users to programmatically interact with the product. It is very useful for operations that would be repetitive to achieve through the UI, or for automating interactions with DSS.

In this use case, let’s consider a project containing multiple scenarios; for all of them, we want to add a new recipient to all the existing “Send email” reporters.

We’re going to achieve this with a Python script executed from inside DSS, but the same logic can be used from outside of DSS.

The idea is to list all the existing scenarios in the specified project, find all the “Send email” reporters, retrieve each one’s list of recipients, and finally update that list if the new recipient isn’t already in it.

To interact with scenarios, we first need to access the REST API client. From a script running inside DSS, this is straightforward using dataiku.api_client():

import dataiku
client = dataiku.api_client()


From outside of DSS, you need additional steps to configure how to access the instance. Please refer to the article Using the APIs outside of DSS from the reference documentation to learn more.
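For orientation, the outside-of-DSS setup boils down to instantiating a client explicitly. A sketch, assuming the dataiku-api-client package is installed (e.g. via pip); the host URL and key below are placeholders:

```python
# Placeholders - use your instance URL and a key generated in your DSS profile
host = "https://dss.example.com:11200"
api_key = "YOUR_API_KEY"

try:
    import dataikuapi
    client = dataikuapi.DSSClient(host, api_key)
except ImportError:
    # Not running in an environment with the client package installed
    client = None
```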

Then, let’s create a variable to store the new recipient email address:

new_recipient_email_address = "john.doe@here.com"


The next step is to retrieve the project and the list of scenarios it contains:

project = client.get_project("PROJECT_KEY")
scenarios_list = project.list_scenarios()


list_scenarios() returns a list of dictionaries. Let’s use the id property of each entry to retrieve a handle to interact with the scenario:

for scenario_metadata in scenarios_list:
    scenario = project.get_scenario(scenario_metadata['id'])


Let’s then retrieve the scenario definition to interact with the scenario attributes:

scenario_definition = scenario.get_definition(with_status=False)


Now, it’s time to iterate over all the existing reporters, check whether each is a “Send email” reporter, and if so, retrieve the existing list of recipients and add the new recipient email address when it’s missing:

update_scenario = False
for i in range(0, len(scenario_definition['reporters'])):
    if scenario_definition['reporters'][i]['messaging']['type'] == "mail-scenario":
        recipients = [recipient.strip() for recipient in scenario_definition['reporters'][i]['messaging']['configuration']['recipient'].split(',')]
        if new_recipient_email_address not in recipients:
            recipients.append(new_recipient_email_address)
            scenario_definition['reporters'][i]['messaging']['configuration']['recipient'] = ', '.join(recipients)
            update_scenario = True
            print("Updating recipient for mail reporter \"{}\" of scenario \"{}\"".format(scenario_definition['reporters'][i]['name'], scenario_metadata['name']))


Finally, if we’ve edited the list of recipients, let’s update the definition of the scenario:

if update_scenario:
    scenario.set_definition(scenario_definition, with_status=False)


### Final code sample

import dataiku

client = dataiku.api_client()

new_recipient_email_address = "john.doe@here.com"

project = client.get_project("PROJECT_KEY")
scenarios_list = project.list_scenarios()

for scenario_metadata in scenarios_list:
    scenario = project.get_scenario(scenario_metadata['id'])
    scenario_definition = scenario.get_definition(with_status=False)

    update_scenario = False
    for i in range(0, len(scenario_definition['reporters'])):
        if scenario_definition['reporters'][i]['messaging']['type'] == "mail-scenario":
            recipients = [recipient.strip() for recipient in scenario_definition['reporters'][i]['messaging']['configuration']['recipient'].split(',')]
            if new_recipient_email_address not in recipients:
                recipients.append(new_recipient_email_address)
                scenario_definition['reporters'][i]['messaging']['configuration']['recipient'] = ', '.join(recipients)
                update_scenario = True
                print("Updating recipient for mail reporter \"{}\" of scenario \"{}\"".format(scenario_definition['reporters'][i]['name'], scenario_metadata['name']))
    if update_scenario:
        scenario.set_definition(scenario_definition, with_status=False)


This code sample is also available on our GitHub repository.

## FAQ | Can I control which datasets in my Flow get rebuilt during a scenario?

When configuring a Build/Train step in a Dataiku scenario, four different build modes let you control which items in your Flow are rebuilt or retrained when the scenario is triggered.

In the Steps tab of your Scenario, click Add Step and select Build / Train.

Next, select the item you would like to build (a dataset, folder, or model). In the options below, choose among four build modes to control which datasets and upstream dependencies are executed each time the trigger fires.

Build only this dataset

• Select this option to build only the selected dataset using its parent recipe. This option requires the least computation, but does not take into account any upstream changes to datasets or recipes. Therefore, if the upstream dependencies of this recipe have not been built, the job will fail.

Build required datasets

• Also known as smart reconstruction, this option checks each dataset and recipe upstream of the selected dataset to see if it has been modified more recently than the selected dataset. Dataiku DSS then rebuilds all impacted datasets down to the selected one. This is the recommended default.

Force-rebuild dataset and dependencies

• Also known as a forced recursive rebuild, select this option to rebuild all of the dependencies of the selected dataset, going back to the start of the Flow. This is the most computationally intensive of the build modes, but it can be useful for overnight-build scenarios in order to start the next day with a double-checked, up-to-date Flow.

Build missing dependent datasets then this one

• This option is not recommended for general use. It works somewhat like “Build required datasets”, except that a dependent dataset is (re)built only if it is completely empty, before the specified dataset itself is built.
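When scripting a Build step as a custom Python step, these four UI options map to build_mode values accepted by the scenarios Python API. A sketch of the correspondence, using the mode names as documented for Scenario.build_dataset (treat the mapping as an assumption to verify against your DSS version):

```python
# UI build mode -> build_mode value for the scenarios Python API (sketch)
BUILD_MODES = {
    "Build only this dataset": "NON_RECURSIVE_FORCED_BUILD",
    "Build required datasets": "RECURSIVE_BUILD",
    "Force-rebuild dataset and dependencies": "RECURSIVE_FORCED_BUILD",
    "Build missing dependent datasets then this one": "RECURSIVE_MISSING_ONLY_BUILD",
}

# Inside a custom Python scenario step, this would look like:
# from dataiku.scenario import Scenario
# Scenario().build_dataset("your_dataset",
#                          build_mode=BUILD_MODES["Build required datasets"])
```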

Note

In all of these cases, if a dataset is built, its siblings (other outputs of the same recipe) will also be built. If you specify multiple datasets in the same build step, they will be built in parallel at runtime. Shared datasets will not be built, as the Flow build stops when it reaches foreign objects.