Automate the Flow#
Once you’ve mastered the basics, you can start automating your MLOps processes with Dataiku’s system of scenarios. A scenario in Dataiku is a set of actions to run, along with conditions for when they should execute and who should be notified of the results.
Since the same tools can be used to retrain models or redeploy new versions of API services, let’s start small by designing a scenario that rebuilds the furthest downstream dataset only if an upstream dataset satisfies certain conditions.
See also
These automation tools can be implemented visually, with code, or a mixture of both. To get started using code in your MLOps workflows, see the Developer Guide.
View the existing scenario#
This project already has a basic one-step scenario for rebuilding the data pipeline.
Navigate back to the Design node project.
From the Jobs menu in the top navigation bar, open the Scenarios page.
Click to open the Score Data scenario.
On the Settings tab, note that the scenario already has a weekly trigger, but does not yet have a reporter.
Navigate to the Steps tab.
Click on the Build step to see that this scenario will build the test_scored dataset (and its upstream dependencies, if required) whenever the scenario is triggered.
Recognize that this step will only run if no previous step in the scenario has failed.
Tip
You’ll learn about build modes in the Data Pipelines course of the Advanced Designer learning path.
Select a data quality rule type#
As it stands, this scenario will attempt once a week to build the test_scored dataset, rebuilding it only if its upstream dependencies have changed.
In addition to offering many options for when a scenario should execute (e.g. time periods, dataset changes, or code), Dataiku also provides tools for controlling how a scenario should execute. For example, you may want to interrupt (or proceed with) a scenario's run depending on whether a condition is met.
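This gating behavior can be sketched in a few lines of plain Python (hypothetical helper names, not the Dataiku API): steps run in order, and a failed condition interrupts everything after it.

```python
# Illustration only: mimics a scenario that interrupts execution when a
# condition fails. These names are hypothetical, not part of the Dataiku API.

def run_scenario(steps):
    """Run (name, callable) steps in order; stop at the first failure."""
    for name, step in steps:
        if not step():
            print(f"{name} failed; remaining steps skipped.")
            return False
        print(f"{name} succeeded.")
    return True

# A failing verification prevents the (potentially expensive) build from running.
outcome = run_scenario([
    ("verify_rules", lambda: False),   # condition not met
    ("build_pipeline", lambda: True),  # never reached
])
```

With the lambdas swapped for real checks and builds, this is the pattern you will set up below: a verification step that guards the build step.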
Let’s demonstrate this principle by adding a data quality rule to an upstream dataset of interest.
In the Data Preparation Flow zone, open the job_postings_prepared dataset.
Navigate to the Data Quality tab.
Click Edit Rules.
Select the rule type Record count in range.
Configure a data quality rule#
Now let’s configure the details of this rule, assuming you have expectations about the number of records at the start of the pipeline.
Set the min as 100 and the soft min as 300.
Set the soft max as 20000 and the max as 25000.
Make sure all four thresholds are turned ON.
Click Run Test, and confirm that the record count is indeed within the expected range.
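Conceptually, the four thresholds partition record counts into error, warning, and OK zones: hard bounds (min/max) fail the rule outright, while soft bounds only raise a warning. A rough sketch of that logic (my own illustration, not Dataiku's implementation):

```python
def record_count_status(n, min_=100, soft_min=300, soft_max=20000, max_=25000):
    """Classify a record count against hard (min/max) and soft thresholds."""
    if n < min_ or n > max_:
        return "ERROR"    # hard bound violated: the rule fails
    if n < soft_min or n > soft_max:
        return "WARNING"  # soft bound violated: the rule passes with a warning
    return "OK"

print(record_count_status(50))    # ERROR
print(record_count_status(200))   # WARNING
print(record_count_status(5000))  # OK
```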
Tip
Feel free to adjust these values to simulate warnings or errors on your own!
Verify a data quality rule in a scenario#
If this rule were to fail (the number of upstream records falls outside the expected range), you could avoid computing the rest of the pipeline, as well as send a notification about the unexpected result.
Let’s have the scenario verify this rule is met before building the pipeline.
From the Jobs menu in the top navigation bar, return to the Scenarios page, and click to open the Score Data scenario.
Navigate to the Steps tab.
Click Add Step to view the available steps, and choose Verify rules or run checks.
Click + Add Item > Dataset > job_postings_prepared > Add Item.
Using the dots on the left side of the step, drag the verification step above the build step.
Click the green Run button to manually trigger the scenario’s execution.
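Incidentally, the same manual trigger is also available programmatically through Dataiku's Python client (`pip install dataiku-api-client`). The sketch below assumes you supply your instance URL, a personal API key, and the project key; the scenario id `"ScoreData"` is an assumption based on the scenario's name and may differ on your instance.

```python
def run_score_data_scenario(host, api_key, project_key, scenario_id="ScoreData"):
    """Trigger a scenario and block until it finishes, returning the run object."""
    # Deferred import so the sketch stays self-contained until actually used.
    import dataikuapi

    client = dataikuapi.DSSClient(host, api_key)
    scenario = client.get_project(project_key).get_scenario(scenario_id)
    # run_and_wait() starts the scenario and waits for it to complete
    return scenario.run_and_wait()
```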
Inspect the scenario run#
Let’s take a closer look at what should be a successful scenario run.
Navigate to the Last runs tab of the scenario.
Click on the most recent run to view its details.
The scenario’s build step triggered a job. Click on the job for the build step, and see that there was Nothing to do for it.
All that for nothing? What happened?
The data in the Flow has not changed. Not surprisingly then, the scenario was first able to successfully verify the Record count in range rule. This is the same result as when you directly tested the rule on the dataset. With this verification step done, the scenario could proceed to the build step.
The build step on the downstream test_scored dataset was set to build required dependencies. As this dataset was not out of date, Dataiku did not waste resources rebuilding it.
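The "nothing to do" outcome reflects a staleness check: a dataset is rebuilt only if some input changed after the dataset was last built. A toy illustration of that idea (a hypothetical function; Dataiku's dependency engine is considerably more sophisticated):

```python
def needs_rebuild(output_built_at, inputs_changed_at):
    """Rebuild only if any input changed after the output was last built."""
    return any(t > output_built_at for t in inputs_changed_at)

# Upstream data unchanged since the last build: "Nothing to do".
print(needs_rebuild(output_built_at=100, inputs_changed_at=[40, 90]))   # False
# An upstream change would trigger a rebuild on the next run.
print(needs_rebuild(output_built_at=100, inputs_changed_at=[40, 120]))  # True
```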
Tip
To see this job do some actual work, try the AI Collaboration Quick Start, where you’ll execute the same scenario via a reusable Dataiku Application!