Automate the Flow#

See a screencast covering this section’s steps

Once you’ve mastered the basics, you can start automating your MLOps processes with Dataiku’s system of scenarios. A scenario in Dataiku is a set of actions to run, along with conditions for when they should execute and who should be notified of the results.

Let’s design a scenario that rebuilds the furthest downstream dataset only if an upstream dataset satisfies certain conditions.

Note

These automation tools can be implemented visually, with code, or a mixture of both. To get started using code in your MLOps workflows, see the Developer Guide.
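For example, a minimal sketch of getting a programmatic handle on a project with the Dataiku Python API might look like the following. The instance URL, API key, and project key are placeholders; substitute the values from your own instance.

```python
# Inside DSS (a notebook, recipe, or scenario step), no credentials are needed:
import dataiku

client = dataiku.api_client()

# From outside DSS, you would instead connect with the instance URL and a
# personal API key (both values here are placeholders):
#   import dataikuapi
#   client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

# "MLOPS_QUICK_START" is a placeholder; use your own project key.
project = client.get_project("MLOPS_QUICK_START")
print(project.project_key)
```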

View the existing scenario#

This project already includes a basic one-step scenario for rebuilding the data pipeline.

  1. Navigate back to the Design node project.

  2. From the Jobs menu in the top navigation bar, open the Scenarios page.

  3. Click to open the Score Data scenario.

  4. On the Settings tab, note that the scenario already has a weekly trigger.

  5. Navigate to the Steps tab.

  6. Click on the Build step to see that this scenario will build the test_scored dataset (and its upstream dependencies, if required) whenever the scenario is triggered.

  7. Recognize that this step will only run if no previous step in the scenario has failed.

Dataiku screenshot of the build step of a scenario.
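If you prefer to work in code, a rough equivalent of this inspection is sketched below with the Python API. The project key and scenario id are assumptions; raw_triggers and raw_steps expose the scenario's definition as JSON-like dicts.

```python
import dataiku

client = dataiku.api_client()  # assumes you are running inside DSS
project = client.get_project("MLOPS_QUICK_START")  # placeholder project key

# List the scenarios defined in the project.
for item in project.list_scenarios():
    print(item["id"], item["name"])

# Open the Score Data scenario and inspect its definition.
scenario = project.get_scenario("score_data")  # placeholder scenario id
settings = scenario.get_settings()
print(settings.raw_triggers)  # should include the weekly time-based trigger
print(settings.raw_steps)     # should include the build step for test_scored
```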

Tip

You’ll learn about build modes in the Data Pipelines course of the Advanced Designer learning path.

Select a data quality rule type#

As it stands, this scenario will attempt to build the test_scored dataset once a week, but only if its upstream dependencies have changed.

In addition to offering many options for when a scenario should execute (e.g. time periods, dataset changes, or code), Dataiku also provides tools for controlling how a scenario should execute. For example, you may want to interrupt (or proceed with) a scenario’s execution depending on whether a condition is met.

Let’s demonstrate this principle by adding a data quality rule to an upstream dataset of interest.

  1. In the Data Preparation Flow zone, open the job_postings_prepared dataset.

  2. Navigate to the Data Quality tab.

  3. Click Edit Rules.

  4. Select the rule type Record count in range.

Dataiku screenshot of the rule type selection page.
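As an aside, the record count that this rule checks is also available programmatically as a built-in dataset metric. The sketch below assumes the record count metric is enabled on the dataset and uses placeholder connection details.

```python
import dataiku

client = dataiku.api_client()  # assumes you are running inside DSS
project = client.get_project("MLOPS_QUICK_START")  # placeholder project key
dataset = project.get_dataset("job_postings_prepared")

# Recompute the dataset's enabled metrics, then read back the last values.
dataset.compute_metrics()
metrics = dataset.get_last_metric_values()

# "records:COUNT_RECORDS" is the id of the built-in record count metric.
print(metrics.get_metric_by_id("records:COUNT_RECORDS"))
```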

Configure a data quality rule#

Now let’s configure the details of this rule, assuming you have expectations about the number of records at the start of the pipeline.

  1. Set the min as 100 and the soft min as 300.

  2. Set the soft max as 20000 and the max as 25000. Make sure all four thresholds are turned ON.

  3. Click Run Test, and confirm that the record count is indeed within the expected range.

Dataiku screenshot of a data quality rule.
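To make the four thresholds concrete, here is a plain-Python illustration (not the product’s implementation) of how hard and soft bounds typically map to rule statuses: values outside the hard min/max fail, values between a hard and a soft bound warn, and everything else passes.

```python
def record_count_status(count, min_=100, soft_min=300, soft_max=20000, max_=25000):
    """Illustrative mapping of the four thresholds to a rule status."""
    if count < min_ or count > max_:
        return "ERROR"    # outside the hard bounds
    if count < soft_min or count > soft_max:
        return "WARNING"  # inside the hard bounds, outside the soft bounds
    return "OK"

print(record_count_status(50))     # ERROR   (below hard min)
print(record_count_status(200))    # WARNING (below soft min)
print(record_count_status(5000))   # OK
print(record_count_status(24000))  # WARNING (above soft max)
print(record_count_status(30000))  # ERROR   (above hard max)
```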

Verify a data quality rule in a scenario#

If this rule were to fail (that is, if the number of upstream records falls outside the expected range), you could avoid computing the rest of the pipeline and send a notification about the unexpected result.

Let’s have the scenario verify this rule is met before building the pipeline.

  1. From the Jobs menu in the top navigation bar, return to the Scenarios page, and click to open the Score Data scenario.

  2. Navigate to the Steps tab.

  3. Click Add Step to view the available steps, and choose Verify rules or run checks.

  4. Click Add Dataset to Verify. Select job_postings_prepared, and click Add.

  5. Using the dots on the left side of the step, drag the verification step above the build step.

  6. Click the green Run button to manually trigger the scenario’s execution.

Dataiku screenshot of a verify rule step of a scenario.
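The same manual run can also be triggered from code. A sketch with the Python API is shown below; the identifiers are placeholders, and no_fail=True returns the run object even if the verification step fails rather than raising an exception.

```python
import dataiku

client = dataiku.api_client()  # assumes you are running inside DSS
scenario = client.get_project("MLOPS_QUICK_START").get_scenario("score_data")

# Start the scenario and block until it finishes.
run = scenario.run_and_wait(no_fail=True)
print(run.outcome)  # e.g. SUCCESS, WARNING, or FAILED
```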

Inspect the scenario run#

Let’s take a closer look at what should be a successful scenario run.

  1. Navigate to the Last runs tab of the scenario.

  2. Click on the most recent run to view its details.

  3. The scenario’s build step triggered a job. Click to open it, and see that there was Nothing to do for it.

Dataiku screenshot of the last runs tab of a scenario.

Because the data in the Flow has not changed, the scenario was able to verify the Record count in range rule, just as when you tested the rule directly on the dataset. With the verification step passed, the scenario could move on to the build step.

Moreover, the build step on the downstream test_scored dataset was set to build required dependencies. As this dataset was not out of date, Dataiku did not waste resources rebuilding it.
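The run history is also available through the Python API, which can be useful when auditing scheduled executions. A short sketch with placeholder identifiers:

```python
import dataiku

client = dataiku.api_client()  # assumes you are running inside DSS
scenario = client.get_project("MLOPS_QUICK_START").get_scenario("score_data")

# Review the most recent runs and their outcomes.
for run in scenario.get_last_runs(limit=5):
    print(run.get_start_time(), run.outcome)
```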

Tip

To see this job do some actual work, try the Quick Start | Dataiku for AI collaboration, where you’ll execute the same scenario via a reusable Dataiku Application!