Hands-On: Automation with Metrics, Checks & Scenarios

Tip

The hands-on tutorial below on how to automate workflows in Dataiku DSS using metrics, checks, and scenarios is also found in the Dataiku Academy’s Automation course, which is part of the Advanced Designer learning path. Register for the course if you’d like to track and validate your knowledge alongside concept videos, summaries, and quizzes.

Hands-On: Metrics & Checks (Part 1)

A data science project is never finished. Datasets need to be updated. Models need to be rebuilt. In order to put such a project into production, it is essential to have tools that can track the evolution of objects like datasets and models. Key to achieving this process in Dataiku DSS are metrics, checks, and scenarios.

This hands-on tutorial has three parts:

  • In this part, we’ll focus on understanding how to use metrics and checks in Dataiku DSS in order to monitor the status of datasets and models.

  • In the second part, we’ll demonstrate how to use metrics and checks inside of a scenario in order to safely automate workflows.

  • In the third part, we’ll look at ways to customize metrics, checks, and scenarios with code.

Prerequisites

This hands-on lesson assumes that you have a basic level of comfort using Dataiku DSS, as well as knowledge of variables.

Note

If you are not already on the Advanced Designer learning path, we recommend completing the Core Designer certificate first.

You’ll need access to an instance of Dataiku DSS (version 8.0 or above) with the following plugins installed:

  • US Census

  • Reverse geocoding

These plugins are available through the Dataiku Plugin store, and you can find the instructions for installing plugins in the reference documentation. To check whether a plugin is already installed on your instance, go to the Installed tab in the Plugin Store to see a list of all installed plugins.

../../_images/automation-plugins.png

Tip

Two notes for users of Dataiku Online:

  • The process for installing a plugin is slightly different.

    • From your instance launchpad, open the Features panel on the left-hand side.

    • Click Add a Feature and choose “US Census” from the Extensions menu. (Reverse geocoding is already available by default).

  • The end of this tutorial includes a demonstration of custom Python triggers in scenarios, which is not available to Online users. All other parts of the tutorial, however, can be completed.

Create Your Project

Rather than starting from scratch, we’ll use an existing Flow.

  • Click +New Project > DSS Tutorials > Advanced Designer > Automation (Tutorial).

Note

You can also use a successfully completed project from the Plugin Store course.

Change Connections

Aside from the input datasets, all of the others are empty managed filesystem datasets.

You are welcome to leave the storage connection of these datasets in place, but you can also use another storage system depending on the infrastructure available to you.

To use another connection, such as a SQL database, follow these steps:

  • Select the empty datasets from the Flow. (On a Mac, hold Shift to select multiple datasets).

  • Click Change connection in the Other actions section of the Actions sidebar.

  • Use the dropdown menu to select the new connection.

  • Click Save.

../../_images/automation-change-connection.png

Note

Another way to select datasets is from the Datasets page (G+D). There are also programmatic ways of doing operations like this that you’ll learn about in the Developer learning path.

The screenshots below demonstrate using a PostgreSQL database.

Build Your Project

Your starter project only has the skeleton of the Flow. The datasets have not yet been built.

../../_images/starting-flow.png

Let’s build the parts we need for this tutorial.

  • From the Flow, select the two output datasets relevant for this tutorial:

    • merchants_with_tract_income

    • transactions_unknown_scored.

  • With both datasets selected, choose Build from the Actions sidebar.

  • By default, the option “Build required dependencies” should be chosen. Click Preview to view the suggested job.

  • Now in the Jobs tab, we can see all of the activities Dataiku DSS will perform.

  • Click Run and observe how Dataiku DSS progresses through the list of activities.

../../_images/job-preview.png

Note

Dataiku DSS issues a warning about a possible quoting issue in an unusually large column. This warning is generated in response to the product_description column. As we are not working with natural language data here, we can safely ignore this warning.

When the job finishes, note that the project has a variable state_name defined as “Delaware” (see the Variables tab of the More Options menu). This variable gets used in the Group recipe that creates the merchants_by_state dataset. Accordingly, the only value for merchant_state in the merchants_with_tract_income dataset is “Delaware”. We’ll see how to change this with a scenario later.

Default Metrics

A key dataset in the Flow is transactions_joined_prepared. Let’s establish some metrics and checks for this dataset in order to monitor its status.

On the Status tab of every dataset in Dataiku DSS, we find a few default metrics depending on the type of dataset.

  • Open transactions_joined_prepared and navigate to the Status tab.

  • On the Metrics subtab, we can control which metrics to display and in what format. Two metrics in this case, Column count and Record count, are displayed by default.

  • Click Compute to calculate all of the displayed metrics if you have not already done so.

../../_images/default-metrics.png

Create Your Own Metrics

Now let’s take further control of our metrics. Navigate to the Edit subtab.

We can see that both “Column count” and “Record count” are switched ON. Only “Column count”, however, is set to auto-compute after build (although this may differ based on your chosen storage connection).

../../_images/auto-compute.png

Now, let’s create a new metric from the available options.

  • Turn on the Column statistics section. Here we can create metrics out of basic statistics like the minimum, maximum, or count of any column in the dataset.

By definition, FICO scores, a system of scoring credit worthiness, range from 300 to 850.

  • With this range in mind, let’s track the Min and Max of card_fico_score.

  • Also, select the Distinct value count of merchant_state.

  • From this screen, run the probe to compute the three requested metrics.

../../_images/col-stats-metrics.png
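To make these three statistics concrete, here is a minimal plain-Python sketch of what a minimum, a maximum, and a distinct value count amount to. The `rows` sample and the `column_stats` helper are hypothetical and exist only for illustration. Skipping missing values is a simplification of this sketch, and it is not a description of how Dataiku DSS computes probes internally.

```python
# Hypothetical sample rows; the real dataset is far larger.
rows = [
    {"card_fico_score": 300, "merchant_state": "Delaware"},
    {"card_fico_score": 712, "merchant_state": "Colorado"},
    {"card_fico_score": 850, "merchant_state": "Delaware"},
    {"card_fico_score": None, "merchant_state": None},
]


def column_stats(rows, column):
    """Return the min, max, and distinct count of a column, skipping missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "min": min(values),
        "max": max(values),
        "distinct": len(set(values)),
    }


fico = column_stats(rows, "card_fico_score")
states = column_stats(rows, "merchant_state")
print(fico["min"], fico["max"], states["distinct"])  # 300 850 2
```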

The next section of metrics allows us to compute the most frequent values of a column.

  • Turn on Most frequent values.

  • Select the Mode of merchant_subsector and the Top 10 values of merchant_state.

  • Click to run this probe as well and see the last run results.

../../_images/most-frequent-values.png
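As a rough illustration of what the Mode and Top N options report, the following sketch counts values with the standard library. The sample values and the helper names `mode` and `top_n` are assumptions made for this example, not part of Dataiku DSS.

```python
from collections import Counter

# Hypothetical column values; the subsector names are illustrative.
merchant_subsector = ["gas", "misc", "gas", "internet", "gas", "misc"]


def mode(values):
    """Return the most frequent value."""
    return Counter(values).most_common(1)[0][0]


def top_n(values, n):
    """Return the n most frequent values, most frequent first."""
    return [value for value, _count in Counter(values).most_common(n)]


print(mode(merchant_subsector))      # gas
print(top_n(merchant_subsector, 2))  # ['gas', 'misc']
```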

In addition to column statistics, we can also retrieve cell values as metrics. Imagine that we want to be alerted if any transaction from one of our largest customers is not authorized. We can create this kind of metric with a cell value probe.

  • From the bottom of the screen, create a New Cell Value Probe.

  • With the Filter enabled, keep all rows satisfying the conditions:

    • card_id equals C_ID_2d1fec14d8 (the card_id of one large customer).

    • authorized_flag equals 0.0 (those failing authorization).

  • The output of the filter should be a First row matching the filter.

  • Select the authorized_flag column.

  • Save the metric. Click to run it now.

../../_images/cell-value-probe.png

The cell value probe should return “No data” because no transaction from this customer has failed authorization. We could verify this with the present dataset using a Filter recipe.
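The logic of this probe can be sketched in plain Python: filter the rows, then take one cell from the first match. The helper `first_matching_cell`, the sample rows, and the second card ID are hypothetical; returning `None` stands in for the “No data” result.

```python
def first_matching_cell(rows, conditions, column):
    """Return `column` from the first row meeting all conditions, else None."""
    for row in rows:
        if all(row.get(key) == expected for key, expected in conditions.items()):
            return row[column]
    return None  # plays the role of the probe showing "No data"


rows = [
    {"card_id": "C_ID_2d1fec14d8", "authorized_flag": 1.0},
    {"card_id": "C_ID_0000000000", "authorized_flag": 0.0},  # hypothetical card
]
conditions = {"card_id": "C_ID_2d1fec14d8", "authorized_flag": 0.0}

result = first_matching_cell(rows, conditions, "authorized_flag")
print(result)  # None
```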

Note

In addition to these kinds of built-in metrics, we can also define custom metrics with Python, SQL, or plugins. We’ll explore these in another lesson.

When returning to the Metrics subtab, we find our newly-created metrics as options available to be displayed.

  • Click on the button showing the number of metrics currently displayed, and Add all.

  • Click Compute to calculate the metrics if you have not already done so.

  • Experiment with different views by changing the Display from “Last value”, to “History”, to “Columns”.

../../_images/computed-metrics.png

Create Your Own Checks

Now let’s use these metrics to establish checks on the dataset.

  • Navigate to the Checks subtab.

  • None exist yet, so create one in the Edit subtab.

By definition, we know FICO scores range from 300 to 850. If a value is outside of this range, we will assume there must be some kind of data quality problem, and we want a failure notice.

  • Under “Add a new check”, select Metric value is in a numeric range.

  • Name it FICO >= 300.

  • Choose Min of card_fico_score as the metric.

  • Set the Minimum to 300.

  • Click Check to see how it will work. It should return “OK”, and the message 300. We could have found the same result in the Analyze tool.

Note

When we check if a metric is in a numeric range, we have the ability to define a hard or soft maximum or minimum, depending on whether we want to trigger an error or a warning.

  • A value less than the minimum or greater than the maximum value produces an error.

  • A value less than the soft minimum or greater than the soft maximum produces a warning.

Let’s test this out on the same check.

  • Change the minimum of the “FICO >= 300” check from 300 to 320 and run the check. Instead of returning “OK”, it returns an error because we have values in the dataset less than 320.

  • Reset the minimum to 300 and add a soft minimum of 320 (a very risky credit score). Now the check returns “WARNING”.

../../_images/fico-above-300.png
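The interplay between hard and soft bounds can be summarized in a few lines of Python. This is only a sketch of the behavior described above, with a hypothetical `range_check` helper; it is not the product’s implementation.

```python
def range_check(value, minimum=None, soft_minimum=None,
                soft_maximum=None, maximum=None):
    """Classify a metric value as 'OK', 'WARNING', or 'ERROR'."""
    # Hard bounds take precedence: breaking one is an error.
    if minimum is not None and value < minimum:
        return "ERROR"
    if maximum is not None and value > maximum:
        return "ERROR"
    # Soft bounds only produce a warning.
    if soft_minimum is not None and value < soft_minimum:
        return "WARNING"
    if soft_maximum is not None and value > soft_maximum:
        return "WARNING"
    return "OK"


min_fico = 300  # the Min of card_fico_score reported by the check above
print(range_check(min_fico, minimum=300))                    # OK
print(range_check(min_fico, minimum=320))                    # ERROR
print(range_check(min_fico, minimum=300, soft_minimum=320))  # WARNING
```

The three calls mirror the three configurations tried on the “FICO >= 300” check.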

We can follow a similar process to check the upper bound.

  • Under “Add a new check”, select Metric value is in a numeric range.

  • Name it FICO <= 850.

  • Choose Max of card_fico_score as the metric.

  • Set the Maximum to 850.

  • Click Check to confirm it is working as intended.

Assuming that all of these transactions are from the United States, we know that there should not be more than 51 distinct values of merchant_state (including the District of Columbia as a state). We can create a check to monitor this.

  • Under “Add a new check”, select Metric value is in a numeric range.

  • Name it Valid merchant_state.

  • Choose Distinct value count of merchant_state as the metric.

  • Set the Maximum to 51.

  • After running the check, it should return “OK”.

We can also check if a metric is within a set of categorical values. For example, our domain knowledge might create the expectation that the most frequent merchant_subsector should be “gas”. Let’s make this a check.

  • Under “Add a new check”, select Metric value is in a set of values.

  • Name it 'gas' is subsector mode.

  • Choose Mode of merchant_subsector as the metric.

  • Add gas as the value.

  • After running the check, it should return “OK”.

After saving the new checks, navigate from the Edit subtab to the Checks subtab.

  • Display all of the newly-created Checks.

  • Click Compute.

../../_images/computed-checks.png

Note

We can also create custom checks with Python or plugins. We’ll see this in another lesson.

Model Metrics & Checks

Datasets are not the only Dataiku DSS object for which we can monitor metrics and checks. Models are another object in need of close monitoring.

In the Flow, the green diamond represents the deployed model that predicts whether a credit card transaction will be authorized or not.

  • Select it and Retrain it (non-recursively) from the Actions sidebar.

On opening the deployed model, we can see the active and previous versions. Note how the ROC AUC, or AUC, a common performance metric for classification models, is around 0.75.

../../_images/model-versions.png

We can track this metric, along with many other common indicators of model performance.

  • Navigate to the Metrics & Status tab.

  • On the View subtab, click on the Display to show the built-in model metrics available.

  • Add and save AUC to the list of metrics to display. Other kinds of performance metrics, such as accuracy, precision and recall, are also available.

../../_images/model-metrics.png

Now let’s create a check to monitor this metric.

  • Navigate to the Settings tab.

  • Under the Status checks subtab, add a new check for Metric value is in a numeric range.

  • Name it 0.60 <= AUC <= 0.95.

  • Select AUC as the metric to check.

  • Set a minimum of 0.6 and a maximum of 0.95 to throw an error if the model performance has either degraded or become suspiciously high.

  • Set a soft minimum of 0.65 and a soft maximum of 0.9 to warn us if the performance of the deployed model has decreased or increased.

  • Run the check to see if it is working. Save it.

../../_images/model-check.png

Note

This AUC range is just an example. For your own use case, an allowable deviation in AUC may be very different.

Return to the Metrics & Status tab and add the new check to the Display.

../../_images/model-checks-computed.png

Hands-On: Scenarios (Part 2)

In the section above, we saw how to use built-in metrics and checks to monitor the status of datasets and models in Dataiku DSS.

Now let’s see how to use these metrics and checks inside of a scenario to automate workflows.

Create a Scenario

From the Jobs menu, navigate to the Scenarios panel and create a new scenario.

In Dataiku DSS, we can create two kinds of scenarios. We’ll see the Custom Python scenario later. For now, create a step-based scenario named My Scenario.

../../_images/new-scenario.png

A scenario has two required components:

  • triggers that activate a scenario and cause it to run, and

  • steps, or actions, that a scenario takes when it runs.

An optional third component of a scenario is a reporter. Please see the product documentation for more information on reporters.

Add a Trigger

Dataiku DSS allows us to create many kinds of triggers to activate a scenario. These triggers can be based on:

  • a change in time,

  • a change in a dataset,

  • a change in the result of a SQL query,

  • or custom code.

Note

If using Dataiku Online, only time-based triggers are available.

Let’s start with the simplest kind of trigger.

  • Within the Triggers panel of the Settings tab, click the Add Trigger dropdown button.

  • Add a Time-based trigger.

  • Name it Every 3 min.

  • Change “Repeat every” to 3 minutes.

    • If using DSS version 8, change the frequency to Every X minutes and set the interval to 3.

  • Make sure its activity status is marked as ON.

../../_images/time-trigger.png

Note

For many use cases, it would be unusual to have a scenario run this frequently. This short duration is only to let us see the scenario running in a short amount of time.

Add Steps

Now that we have a trigger in place, we need to provide the steps that the scenario will take when a trigger is activated. In a somewhat contrived example, let’s demonstrate how to rebuild datasets, compute metrics, and set project variables in a scenario.

Navigate to the Steps tab.

  • From the Add Step dropdown, first add a Build / Train step. Add transactions_joined_prepared as the dataset to build.

    • Note that this step has Build required datasets as the Build mode. Accordingly, because we won’t have “new” data in this tutorial, datasets upstream to transactions_joined_prepared will never be out of date, and this step will only build transactions_joined_prepared.

Note

To learn more about building strategies in Dataiku DSS, see the course Flow Views & Actions.

  • Add the step Compute metrics. Choose transactions_joined_prepared as the dataset to compute.

    • Earlier we manually computed metrics like the minimum and maximum of card_fico_score. Now the scenario will execute this computation for us.

  • Add the step Run checks. Choose transactions_joined_prepared as the dataset to check.

    • Checks we created, such as whether the minimum of card_fico_score is greater than or equal to 300, will be recomputed.

  • Add the step Check project consistency.

    • As our data has not changed, we know that the schema consistency check is going to run smoothly, but in a production use case, it can help detect when the schema of an input dataset has changed.

  • Add the step Set project variables. The current state_name variable is “Delaware”. Reset it to another state as shown below. You can imagine changing the value of the variable if some condition is met.

{
    "state_name": "Rhode Island"
}

  • Add a Clear step. Choose merchants_by_state as the dataset to clear.

    • This is the output dataset to the Group recipe which uses the state_name variable.

  • Add one more Build / Train step. Choose merchants_with_tract_income as the dataset to build.

    • With a new state_name variable and an empty merchants_by_state dataset, this section of the Flow will get rebuilt from the Group recipe.

../../_images/scenario-steps.png

After saving, we could click the green Run button to manually start the scenario. Running a scenario manually is a useful way to test it before relying on its triggers.

In this case though, navigate to the Scenarios panel in the Jobs menu where all scenarios for the project are listed. Turn the “Auto-triggers” switch to “ON” to have Dataiku DSS start monitoring the triggers of the scenario.

../../_images/scenario-auto-trigger-status.png

Once the scenario starts running, check its progress in the Jobs panel. Jobs attached to a scenario have a label including the name of the scenario. We can see that this scenario creates two jobs.

  • Recall that we chose “Build required datasets” for the build mode instead of an option like “Build only this dataset” or “Force rebuild”. Accordingly, there is “nothing to do” for building transactions_joined_prepared as there is no new data.

  • Rebuilding merchants_with_tract_income, however, does require activity because one of the datasets it depends on has been cleared.

../../_images/scenario-jobs.png
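The difference between the two jobs can be pictured with a small dependency walk. The sketch below is a deliberately simplified model: a dataset is rebuilt if it is empty or if anything upstream of it is rebuilt. The `datasets_to_build` helper, the shortened Flow in `upstream`, and the source dataset names are all hypothetical, and the real build mode also accounts for factors such as whether upstream data has changed.

```python
def datasets_to_build(target, upstream, is_built):
    """List the datasets to (re)build to produce `target`, in build order.

    A dataset needs building if it is empty or if any of its parents needs
    building. Source datasets (no parents) are never rebuilt.
    """
    order = []
    needs = {}

    def visit(name):
        if name in needs:
            return needs[name]
        parents = upstream.get(name, [])
        # Visit every parent first so that parents precede children in `order`.
        parent_needs = [visit(parent) for parent in parents]
        needs[name] = bool(parents) and (not is_built[name] or any(parent_needs))
        if needs[name]:
            order.append(name)
        return needs[name]

    visit(target)
    return order


# Hypothetical, shortened version of the Flow.
upstream = {
    "transactions_joined_prepared": ["transactions"],
    "merchants_by_state": ["merchant_info"],
    "merchants_with_tract_income": ["merchants_by_state"],
}
is_built = {
    "transactions": True,
    "merchant_info": True,
    "transactions_joined_prepared": True,
    "merchants_by_state": False,  # emptied by the Clear step
    "merchants_with_tract_income": True,
}

print(datasets_to_build("transactions_joined_prepared", upstream, is_built))
# []  (nothing to do)
print(datasets_to_build("merchants_with_tract_income", upstream, is_built))
# ['merchants_by_state', 'merchants_with_tract_income']
```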

Hint

Check the merchants_with_tract_income dataset again. Has the merchant_state column updated from “Delaware” to the new variable?

We can keep tabs on the progress through the Last runs tab of the scenario. Once you have seen a few runs, turn off the scenario from the main Scenarios page.

../../_images/scenario-last-runs.png

Note

If we have multiple scenarios, the Automation Monitoring panel in the Jobs menu provides similar information for all scenario activities coming from the project. This feature is not available to community edition users.

../../_images/automation-monitoring.png

Hands-On: Custom Metrics, Checks & Scenarios (Part 3)

In the sections above, we saw how to use built-in metrics, checks, and scenarios to automate workflows in Dataiku DSS.

The built-in options cover a wide range of common actions. However, for certain objectives, you may need greater flexibility than what is possible with the built-in tools. In these cases, you can customize all of these components with your own code.

Create a Custom Metric

In addition to the built-in probes, we can also create our own:

  • Python probes,

  • SQL query probes (where applicable), or

  • a custom probe from a plugin.

For example, we may want a metric to track the most authorized and least authorized merchant subsectors. Let’s achieve this with a Python probe.

  • In the Status tab of the transactions_joined_prepared dataset, navigate to the Edit subtab.

  • At the bottom of the Metrics panel, create a New Python Probe.

  • Turn it on and replace the starter code with the following snippet.

Note

In this tutorial, we’ll just copy and paste code snippets for the purposes of demonstration, but in a real situation, you’d want to test it interactively in a code notebook.

# Define here a function that returns the metric.
def process(dataset, partition_id):
    # dataset is a dataiku.Dataset object
    df = dataset.get_dataframe()
    # Treat missing flags as not authorized
    df['authorized_flag'] = df['authorized_flag'].fillna(0)
    # Mean authorization rate per subsector, highest first
    df_per_subsector = (df
                        .groupby('merchant_subsector')
                        .agg({'authorized_flag': 'mean'})
                        .sort_values('authorized_flag', ascending=False)
                       )

    most_authorized_subsector = df_per_subsector.index[0]
    least_authorized_subsector = df_per_subsector.index[-1]
    return {
        'Most authorized subsector': most_authorized_subsector,
        'Least authorized subsector': least_authorized_subsector,
    }

Run the probe. The most authorized subsector should be “financial services”, and the least should be “software”.

../../_images/custom-metric.png

Create a Custom Check

The same kind of flexibility code brings to custom metrics can also be brought to custom checks.

In an earlier section, we added a metric to report the top ten most frequent states found in the data. We can use this built-in metric in a custom check to verify if a specific state is in that group.

  • Navigate from the Metrics panel to the Checks panel of the Edit subtab.

  • At the bottom of the screen, choose a new Custom (Python) check.

  • Name it Delaware in top10 states.

  • Replace the starter code with the following snippet.

def process(last_values, dataset, partition_id):
    # last_values is a dict of the last values of the metrics,
    # with the values as a dataiku.metrics.MetricDataPoint.
    # dataset is a dataiku.Dataset object
    if "Delaware" in (last_values['adv_col_stats:TOP10:merchant_state'].get_value()) :
        return 'OK', 'Delaware in top10 state'
    else:
        return 'ERROR', 'Delaware not in top10 state'

Running the check should return an error. Using the Analyze tool in the Explore tab, we can verify this to be true.

../../_images/custom-check.png
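One way to try out a check function like this outside of DSS is to call it with a stand-in for the metric data point. In the sketch below, `FakeMetricDataPoint` is a hypothetical stub, and the list passed to it is an assumption about the shape of the metric value; the only requirement of the check is that the value supports the `in` operator.

```python
class FakeMetricDataPoint:
    """Hypothetical stand-in for dataiku.metrics.MetricDataPoint."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value


def process(last_values, dataset, partition_id):
    # Same logic as the custom check in the tutorial.
    if "Delaware" in last_values['adv_col_stats:TOP10:merchant_state'].get_value():
        return 'OK', 'Delaware in top10 state'
    else:
        return 'ERROR', 'Delaware not in top10 state'


# Illustrative values only; these are not the real top ten states.
without_delaware = {
    'adv_col_stats:TOP10:merchant_state': FakeMetricDataPoint(["Colorado", "Texas"])
}
with_delaware = {
    'adv_col_stats:TOP10:merchant_state': FakeMetricDataPoint(["Colorado", "Delaware"])
}

print(process(without_delaware, None, None))  # ('ERROR', 'Delaware not in top10 state')
print(process(with_delaware, None, None))     # ('OK', 'Delaware in top10 state')
```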

Create a Custom Scenario

In Dataiku DSS, we can introduce custom code to a scenario in two ways. We can create a custom Python scenario where a Python Script tab with starter code replaces the usual Steps tab.

../../_images/py-scenario.png

Alternatively, we can add custom steps (SQL, Python, plugin) to a step-based scenario. We’ll focus on this option.

  • Create a new step-based scenario.

  • Name it My Custom Scenario.

Create a Custom Trigger

Tip

Users of Dataiku Online will need to skip this step.

Dataiku DSS provides a large amount of flexibility with built-in triggers. For example, just in terms of time-based triggers, we can trigger a scenario once every few minutes, once per day, once per month, or on specific days of the week.

However, if we wanted to launch a scenario on only the first Friday of the month, we’d need to code our own solution. Let’s do this with a custom Python trigger.

  • On the Settings tab of the newly-created scenario, add a “Custom trigger” from the Add Trigger dropdown.

  • Name it Every first Friday.

  • Normally, we’d increase the “Run every (seconds)” parameter, but we can leave the default for demonstration purposes.

  • We won’t need any special libraries so the default built-in code environment can remain.

  • Replace the starter code with the following Python code snippet.

from dataiku.scenario import Trigger
from datetime import date

t = Trigger()

today = date.today()
dayofweek = today.isoweekday()
day = today.day


# The first Friday of a month always falls on day 1 through 7
if dayofweek == 5 and day <= 7:
    t.fire()

../../_images/custom-trigger.png
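Because the trigger depends on today’s date, it can be hard to test on demand. A small sketch that isolates the date logic into a function makes it possible to check specific dates. The function name `is_first_friday` is hypothetical, and the function does not use the `Trigger` class at all.

```python
from datetime import date


def is_first_friday(day):
    """Return True when `day` is the first Friday of its month."""
    # isoweekday() returns 5 for Friday; the first Friday falls on day 1 to 7.
    return day.isoweekday() == 5 and day.day <= 7


print(is_first_friday(date(2021, 1, 1)))   # True
print(is_first_friday(date(2021, 5, 7)))   # True
print(is_first_friday(date(2021, 5, 14)))  # False
```

The second date shows why the seventh day of the month has to be included: in May 2021, the first Friday was the 7th.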

Add Custom Steps to a Scenario

What actions should we instruct Dataiku DSS to take on the first Friday of every month? Earlier we rebuilt part of the Flow by hard coding a variable change in a scenario step. Now let’s rebuild the same part of the Flow, but change the variable based on the result of a custom step.

SQL-based

If you’re using a SQL-based connection, we’ll start with a SQL query step to find the state with the most transactions.

  • On the Steps tab, add an “Execute SQL” step.

  • Name it top_merchant_state.

  • Choose the connection you’re using.

  • Copy and paste the SQL script below.

SELECT
    COUNT(*) AS "state_transactions",
    "merchant_state"
FROM "${projectKey}_transactions_joined_prepared"
WHERE "merchant_state" IS NOT NULL
GROUP BY "merchant_state"
ORDER BY "state_transactions" DESC
LIMIT 1

Now we’ll use the result of this SQL query step to update a project variable.

  • Add a “Set project variables” step.

  • Turn Evaluated variables ON.

  • Add a variable:

    • The key is state_name.

    • The value is parseJson(stepOutput_top_merchant_state)['rows'][0][1].
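To see what this expression extracts, here is a rough Python equivalent using the standard `json` module. The payload is hypothetical: only the `rows` key comes from the formula above, and the count is made up for illustration. The index `[0][1]` picks the first row and its second column, which is merchant_state given the column order of the SELECT statement.

```python
import json

# Hypothetical step output: only the 'rows' key is taken from the formula above,
# and the transaction count is an illustrative number.
step_output = json.dumps({"rows": [[1000, "Colorado"]]})


def top_state(step_output_json):
    """Mimic parseJson(...)['rows'][0][1]: first row, second column."""
    rows = json.loads(step_output_json)["rows"]
    return rows[0][1]


print(top_state(step_output))  # Colorado
```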

File-based

If you are using a file-based connection, you won’t be able to use an “Execute SQL” step. Instead, you can combine the SQL query step and the “Set project variables” step into one custom Python step.

  • On the Steps tab, add an “Execute Python code” step.

  • Copy and paste the code snippet below into the script field.

import dataiku
from dataiku.scenario import Scenario

# Defining scenario object
s = Scenario()

# Defining python client for public API
client = dataiku.api_client()
p = client.get_default_project()

# Computing state with most transactions
dataset = dataiku.Dataset("transactions_joined_prepared")
df = dataset.get_dataframe()
state_name = (df
            .groupby("merchant_state")
            .agg({"transaction_id": "count"})
            .sort_values('transaction_id', ascending=False)
            .index[0]
            )

# Set variable
variables = p.get_variables()
variables["standard"]["state_name"] = state_name

p.set_variables(variables)

Clear and build datasets

Once that variable is updated (either through a SQL query + “Set project variables” step or one Python step), we can clear and build just like we did in the previous scenario. Instead of using the built-in steps, however, let’s use a Python step.

For example, instead of a “Clear” step, we can use the clear_dataset() function. Instead of a “Build / Train” step, we can use the build_dataset() function.

  • Add an “Execute Python code” step.

  • Copy and paste the code snippet below into the script field.

from dataiku.scenario import Scenario
import dataiku

# The Scenario object is the main handle from which you initiate steps
scenario = Scenario()
state_name = scenario.get_all_variables()["state_name"]
if state_name != "Rhode Island":
    scenario.clear_dataset("merchants_by_state")
    scenario.build_dataset("merchants_by_state")

../../_images/custom-step.png

Running this scenario, even though it has custom steps and triggers, is no different from a fully built-in scenario. Instead of waiting for the first Friday of a month, manually trigger the scenario by clicking Run.

../../_images/custom-scenario-job.png

When the scenario finishes, open the merchants_by_state dataset to observe the change in output. It turns out that “Colorado” is the state with the most transactions.

What’s Next?

Congratulations on taking your first steps using metrics, checks, and scenarios in Dataiku DSS!

If you have not already done so, register for the Academy course on Automation to validate your knowledge of this material.

For more information on these topics, consult the product documentation.

Once you have become familiar with automating workflows, you may be ready to begin putting pipelines into production. Among other courses in the Operationalization section, the course on Flow Deployment will teach you how to take a Flow like this one and deploy it to a production environment.