Tutorial | Model monitoring basics (MLOps part 1)

Before we can put the credit card fraud prediction project into production, we need to consider how to monitor the project’s model. As we get further in time from the training data, how do we ensure our model stays relevant?


In this tutorial, you will:

  • Use the Evaluate recipe and a model evaluation store (MES) to monitor model metrics in situations where you do and do not have access to ground truth data.

  • Conduct drift analysis to interpret how well the model is performing compared to its initial training.

  • Create a scenario to retrain the model based on a metric found in the MES.

  • Create a model monitoring dashboard.

Starting here?

This is the first tutorial in the series, but be sure you have read the introduction and prerequisites first.

Create the project

We’ll start from a project that includes a basic classification model and a zone for scoring new, incoming data.

  • From the Dataiku Design homepage, click +New Project > DSS tutorials > MLOps Practitioner > MLOps (Training).


You can also download the starter project from this website and import it as a zip file.

Review the Score recipe

Before we use the Evaluate recipe for model monitoring, let’s review the purpose of the Score recipe.

The classification model was trained on three months of transaction data between January and March 2017. The new_transactions dataset currently holds the next month of transactions (April 2017). You can verify this by checking the Settings tab.


The new_transaction_data folder feeding the new_transactions dataset holds nine CSV files: one for each month following the model’s training data. This monthly data has already been prepared using the same transformations as the model’s training data, and so it’s ready to be scored or evaluated.

It is also already labeled. In other words, it has known values for the target authorized_flag column. However, we can ignore these known values when, for example, scoring data or monitoring input drift.

For this quick review, assume new_transactions has empty values for authorized_flag. If that were the case, our next step would be to feed these new, unlabeled records, along with the model, into the Score recipe to output a prediction of how likely each record is to be fraudulent.

  • In the Model Scoring Flow zone, click on the Score recipe.

  • Click Run in the Actions tab of the right panel.

  • Click Run again to non-recursively build test_scored.

Compare the schema of new_transactions and test_scored. The Score recipe added three new columns (proba_0, proba_1, and prediction) to the test_scored dataset.
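Conceptually, the Score recipe takes unlabeled records plus a trained model and appends these three columns. Here is a minimal sketch in plain Python; the `score_records` helper and the `proba_fn` stand-in for the model are hypothetical illustrations, not Dataiku APIs:

```python
# Conceptual sketch of what the Score recipe appends to each record.
# The model is stood in for by any function returning P(fraud);
# the column names mirror the recipe's output.

def score_records(records, proba_fn, threshold=0.5):
    """Return copies of records with the three columns the Score recipe adds."""
    scored = []
    for record in records:
        proba_1 = proba_fn(record)            # model's probability of class 1 (fraud)
        scored.append({
            **record,
            "proba_0": 1.0 - proba_1,         # probability of class 0 (authorized)
            "proba_1": proba_1,
            "prediction": 1 if proba_1 >= threshold else 0,
        })
    return scored

# A record scored at 0.8 probability of fraud is predicted as class 1.
print(score_records([{"amount": 10.0}], lambda r: 0.8))
```

Note that the prediction is just the probability passed through a threshold; the probabilities themselves are what downstream drift analysis will compare.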

Dataiku screenshot of a Flow zone including a Score recipe.

The Score recipe outputs predictions for new records, but how do we know if these predictions are similar to those produced during model training?


You can learn more about model scoring in the Knowledge Base.

Ground truth vs. input drift monitoring

Over time, a model’s input data may trend differently from its training data. Therefore, a key question for MLOps practitioners is whether a model is still performing well or if it has degraded after being deployed. In other words, they must ask: is there model drift?

To definitively answer this question, we must know the ground truth, or the correct model output. However, in many cases, obtaining the ground truth can be slow, costly, or incomplete. In such cases, we must instead rely on input drift evaluation. Using this approach, we compare the model’s training data against the new production data to see if there are significant differences.
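As a concrete, simplified illustration of comparing training data against production data: Dataiku computes its own drift metrics (covered below), but the population stability index (PSI) is a common hand-rolled proxy for measuring how much a feature's distribution has shifted. The `psi` helper below is illustrative only, not a Dataiku API:

```python
import math

def psi(ref_props, cur_props, eps=1e-6):
    """Population stability index between two binned distributions.

    Inputs are per-bin proportions (each list sums to 1). A common rule of
    thumb: < 0.1 suggests little shift, > 0.25 suggests a significant shift.
    """
    return sum((c - r) * math.log((c + eps) / (r + eps))
               for r, c in zip(ref_props, cur_props))

# Identical distributions: essentially zero drift.
print(psi([0.5, 0.5], [0.5, 0.5]))
# A feature that was 90/10 in training but 50/50 in production drifts strongly.
print(psi([0.9, 0.1], [0.5, 0.5]))
```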


See this article on monitoring model performance and drift in production to learn more about ground truth vs. input drift monitoring.

Create a model monitoring pipeline for each approach

For many real-life use cases, these two approaches are not mutually exclusive:

  • Input drift and prediction drift (to be defined below) are computable as soon as one has enough data to compare. You might calculate them daily or weekly.

  • Ground truth data, on the other hand, typically comes with a delay and may often be incomplete or require extra data preparation. Therefore, true performance drift monitoring is less frequent. You might only be able to calculate it monthly or quarterly.

Keeping this reality in mind, let’s set up two separate model monitoring pipelines that we can run independently of each other.

A model evaluation store for ground truth monitoring

Let’s start by creating the model monitoring pipeline for cases where the ground truth is available. For this, we’ll need the scored dataset.

  • From the Model Scoring Flow zone, select the saved model and the test_scored dataset.

  • In the Actions tab of the right panel, click on the Evaluate recipe.

  • For Outputs, set an Evaluation store named mes_for_ground_truth.

  • Click Create Evaluation Store, and then Create Recipe.

  • Adjust the Sampling method to Random (approx. nb. records), and keep the default of 10,000.

  • Click Save.

Take a moment to organize the Flow.

  • From the Flow, select both the Evaluate recipe and the MES.

  • In the Actions tab, click Move.

  • Click New Zone.

  • Name it Ground Truth Monitoring, and click Confirm.

A model evaluation store for input drift monitoring

Now let’s follow the same process to create a second model evaluation store for cases where the ground truth is not available. This time, we’ll need the “new” transactions, which we can assume have an unknown target variable.

  • From the Model Scoring Flow zone, select the saved model and the new_transactions dataset.

  • In the Actions tab of the right panel, click on the Evaluate recipe.

  • For Outputs, set an Evaluation store named mes_for_input_drift.

  • Click Create Evaluation Store, and then Create Recipe.

  • As before, adjust the Sampling method to Random (approx. nb. records), and keep the default of 10,000.

In addition to the unknown input data, there is one more important difference in the configuration of the Evaluate recipe for input drift monitoring.

  • In the Settings tab of the recipe, check the box Skip performance metrics computation found in the Output tile.

  • Save the recipe, and return to the Flow.

  • Following the steps above, move the second Evaluate recipe and MES into a new Flow zone called Input Drift Monitoring.


If you do not have the ground truth, you cannot compute performance metrics, so without this setting the recipe would return an error.

We now have one Flow zone dedicated to model monitoring using the ground truth and another Flow zone for the input drift approach.

Dataiku screenshot of the Flow showing two Flow zones for model monitoring.


See the reference documentation to learn more about the Evaluate recipe.

Compare and contrast model monitoring pipelines

These model evaluation stores are still empty! Let’s evaluate the April 2017 data, the first month beyond our model’s training data.

Build the MES for ground truth monitoring

  • From the Flow, non-recursively run the Evaluate recipe that builds the mes_for_ground_truth.

  • Double click to open the MES, and observe a full range of performance metrics for a single model evaluation.

Dataiku screenshot of a model evaluation store for ground truth with one evaluation.


One run of the Evaluate recipe produces one model evaluation.

A model evaluation contains metadata on the model and its input data, as well as the computed metrics (in this case on data, prediction, and performance).

Build the MES for input drift monitoring

  • Return to the Flow, and non-recursively run the Evaluate recipe that builds the mes_for_input_drift.

  • Double click to open the store, and observe how it has far fewer available metrics.

Dataiku screenshot of a model evaluation store for input drift with one evaluation.


If you examine the job log for building either MES, you may notice an ML diagnostic warning—in particular, a dataset sanity check. As we’re not focused on the actual quality of the model, we can ignore this warning, but in a live situation, you’d want to pay close attention to such warnings.

Run more model evaluations

Before diving into the meaning of these metrics, let’s add more data to the pipelines for more comparisons between the model’s training data and the new “production” data.

Get another month of transactions

  • In the Model Scoring Flow zone, navigate to the Settings tab of the new_transactions dataset.

  • In the Files subtab, click Show Advanced Options to confirm that the Files section field is set to Explicitly select files.

  • Click the trash can icon to remove /transactions_prepared_2017_04.csv.

  • On the right, click List Files to refresh, and then click Add for /transactions_prepared_2017_05.csv.

  • Save and refresh the page to confirm that the dataset now only contains data from May.

Dataiku screenshot of the Settings tab of a dataset.
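The monthly files follow a predictable naming pattern, which is handy if you later want to script this rotation instead of clicking through the Settings tab. The `month_file` helper below is a hypothetical illustration of that pattern:

```python
def month_file(year, month):
    """Path of a prepared monthly transactions file, following this project's pattern."""
    return f"/transactions_prepared_{year}_{month:02d}.csv"

# The nine monthly files following the training window: April through December 2017.
files = [month_file(2017, m) for m in range(4, 13)]
print(files[0])   # /transactions_prepared_2017_04.csv
print(files[-1])  # /transactions_prepared_2017_12.csv
```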

Rebuild the MES for input drift monitoring

We can immediately evaluate the new data in the Input Drift Monitoring Flow zone.

  • Return to the Flow, and non-recursively run the Evaluate recipe that builds the mes_for_input_drift.

Rebuild the MES for ground truth monitoring

For ground truth monitoring, we first need to send the new data through the Score recipe to maintain consistency.

  • Select the mes_for_ground_truth, and click Build in the Actions sidebar.

  • Choose Recursive upstream.

  • Click Preview to confirm that the job will run first the Score recipe and then the Evaluate recipe.

  • Click Run, and open the mes_for_ground_truth to find a second evaluation row when the job has finished.

Dataiku screenshot of a model evaluation store with two evaluations.

Repeat for additional months (optional)

At this point, both model evaluation stores should have two rows (two model evaluations).

  • Feel free to repeat the process above for the months of June (/transactions_prepared_2017_06.csv) and beyond so that your model evaluation stores have more data to observe.

Conduct drift analysis

Now that we have some evaluation data to examine, let’s dive into what information the model evaluation store contains. Recall that our main concern is the model becoming obsolete over time.

The model evaluation store enables monitoring of three different types of model drift:

  • input data drift

  • prediction drift

  • performance drift (when ground truth is available)


See the reference documentation to learn more about drift analysis in Dataiku.

Input data drift

Input data drift analyzes the distribution of features in the evaluated data.

Slide representing the concept of input data drift.
  • Open the mes_for_ground_truth that has the performance metrics.

  • For the most recent model evaluation, click Open.

  • Navigate to the input data drift panel, and explore the visualizations.

Dataiku screenshot of the input data drift computed.


See the reference documentation on input drift analysis to understand how these figures can provide an early warning sign of model degradation.
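The intuition behind a drift score: train a classifier to distinguish reference (training) rows from new rows. If it performs no better than chance, the two distributions overlap; if it separates them easily, the input data has drifted. Below is a toy one-feature version of that idea—`drift_score` is an illustration of the concept, not Dataiku's actual implementation:

```python
def drift_score(reference, current):
    """Toy 1-D domain-classifier drift score.

    Label reference rows 0 and current rows 1, find the best threshold
    classifier separating them, and rescale its accuracy so that
    0 = indistinguishable (no drift) and 1 = perfectly separable (high drift).
    """
    data = [(v, 0) for v in reference] + [(v, 1) for v in current]
    best_accuracy = 0.5  # chance level for balanced labels
    for threshold in sorted({v for v, _ in data}):
        for predict_one_if_above in (True, False):
            correct = sum(
                1 for v, label in data
                if (label == 1) == ((v >= threshold) == predict_one_if_above)
            )
            best_accuracy = max(best_accuracy, correct / len(data))
    return 2 * best_accuracy - 1

# Same distribution: the classifier cannot tell the two months apart.
print(drift_score([1, 2, 3, 4], [1, 2, 3, 4]))       # 0.0
# Shifted distribution: perfectly separable, maximal drift.
print(drift_score([1, 2, 3, 4], [10, 11, 12, 13]))   # 1.0
```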

Prediction drift

Prediction drift analyzes the distribution of predictions on the evaluated data.

Slide representing the concept of prediction drift.
  • Still within the mes_for_ground_truth, navigate to the Prediction drift panel.

  • If not already present, click Compute, and explore the output in the fugacity and predicted probability density chart.

Dataiku screenshot of the prediction drift computed.
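Fugacity, in this context, compares the share of records predicted in each class between the training-time data and the evaluated data. A minimal sketch of that comparison follows; the helper names are hypothetical illustrations, not Dataiku APIs:

```python
def class_proportions(predictions):
    """Share of records predicted in each class."""
    n = len(predictions)
    return {c: predictions.count(c) / n for c in sorted(set(predictions))}

def fugacity_shift(reference_preds, current_preds):
    """Per-class change, in percentage points, between reference and current."""
    ref = class_proportions(reference_preds)
    cur = class_proportions(current_preds)
    return {c: round(100 * (cur.get(c, 0.0) - ref.get(c, 0.0)), 1)
            for c in set(ref) | set(cur)}

# At training time the model predicted 10% fraud; this month it predicts 25%.
print(fugacity_shift([1] * 10 + [0] * 90, [1] * 25 + [0] * 75))
# {0: -15.0, 1: 15.0}
```

A shift like this can signal trouble even before any ground truth arrives.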

Performance drift

Performance drift analyzes whether the actual performance of the model changes.

Slide representing the concept of performance drift.
  • Lastly, navigate to the Performance drift panel of the mes_for_ground_truth.

  • If not already present, click Compute, and explore the table and charts comparing the performance metrics of the current test_scored and reference training data.

Dataiku screenshot of the performance drift computed.
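Once the ground truth is available, performance drift reduces to comparing a metric between the reference (training-time) data and the newly labeled data. A minimal sketch using accuracy (Dataiku compares a full battery of metrics; these helpers are illustrative only):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions matching the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def performance_drift(ref_true, ref_pred, cur_true, cur_pred):
    """Change in accuracy between reference and current data.

    A negative value means the model performs worse on the new month.
    """
    return accuracy(cur_true, cur_pred) - accuracy(ref_true, ref_pred)

# Perfect accuracy at training time vs. 80% on the newly labeled month.
print(performance_drift([1, 0, 1, 0, 1], [1, 0, 1, 0, 1],
                        [1, 0, 1, 0, 1], [1, 0, 0, 0, 1]))  # ~ -0.2
```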


Thus far, we’ve only examined the drift analysis for the MES that computes performance metrics. Check the other MES to confirm that performance drift is not available. Moreover, you need to be using at least Dataiku 11.3 to have the prediction drift computed without ground truth.

Automate model monitoring

Of course, we don’t want to manually build the model evaluation stores every time. We can automate this task with a scenario.

Assume our goal is to automatically retrain the model if a certain metric (data drift for example) exceeds a certain threshold. Let’s create the bare bones of a scenario to accomplish this kind of objective.

Create a check on a MES metric

Our first step is to choose a metric important to our use case. Since it’s one of the most common, let’s choose data drift.

  • From the Flow, open the mes_for_input_drift, and navigate to the Settings tab.

  • Under the Status checks subtab, click Metric Value is in a Numeric Range.

  • Name the check Data Drift < 0.4.

  • Choose Data Drift as the metric to check.

  • Set the Soft maximum to 0.3 and the Maximum to 0.4.

  • Click Check to confirm it returns an error, and then click Save.

  • On the Status tab, click X/Y Metrics, and add both the data drift metric and the new check to the display.

  • Click Save once more.

Dataiku screenshot of a data drift check on a model evaluation store.


Here we’ve deliberately chosen a data drift threshold to throw an error. Defining an acceptable level of data drift is dependent on your use case.
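The threshold semantics above can be summarized in a few lines. The `check_metric` function below mirrors the OK/WARNING/ERROR logic of a numeric-range check with only a soft maximum and a maximum set; it is an illustration, not Dataiku code:

```python
def check_metric(value, soft_maximum=0.3, maximum=0.4):
    """Mimic a 'Metric Value is in a Numeric Range' check on data drift.

    Crossing the soft maximum yields WARNING; crossing the maximum yields
    ERROR, the failure state the retraining scenario will react to.
    """
    if value > maximum:
        return "ERROR"
    if value > soft_maximum:
        return "WARNING"
    return "OK"

print(check_metric(0.2))   # OK: drift within the acceptable range
print(check_metric(0.35))  # WARNING: above the soft maximum
print(check_metric(0.5))   # ERROR: above the maximum
```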

Design the scenario

Just like any other check, we can now use this MES check to control the state of a scenario run.

  • From the Jobs menu in the top navigation bar, open the Scenarios page.

  • Click + New Scenario.

  • Name it Retrain Model, and click Create.

First we need to build the MES and run the checks.

  • On the Steps tab, click Add Step > Build / Train.

  • Name it Build MES.

  • Click Add Evaluation Store to Build, and select mes_for_input_drift.

  • Click Add Step > Run checks.

  • Name it Run MES checks.

  • Again, click Add Evaluation Store to Check, and select mes_for_input_drift.

Finally, we need to build the model, but only in cases where the checks fail.

  • Click Add Step > Build / Train.

  • Name it Build model.

  • Click Add Model to Build, and choose the model.

  • Change the Run this step setting to If some prior step failed (that step being the Run checks step).

  • Then check the box to Reset failure state.

  • Click Save when finished.

Dataiku screenshot of the steps tab of the model retrain scenario.
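The control flow of these three steps can be sketched in plain Python: build the MES, run the checks, and retrain only if the checks fail (the "Reset failure state" option then lets the scenario finish successfully). `run_retrain_scenario` is a conceptual illustration, not the scenario engine itself:

```python
class CheckFailed(Exception):
    """Raised when a MES check returns ERROR."""

def run_retrain_scenario(data_drift_value, maximum=0.4):
    """Mirror the scenario's steps: retrain only when the drift check fails."""
    steps_run = ["Build MES"]
    try:
        steps_run.append("Run MES checks")
        if data_drift_value > maximum:
            raise CheckFailed(f"data drift {data_drift_value} > {maximum}")
    except CheckFailed:
        # 'Run this step: if some prior step failed' + 'Reset failure state'
        steps_run.append("Build model")
    return steps_run

print(run_retrain_scenario(0.5))  # ['Build MES', 'Run MES checks', 'Build model']
print(run_retrain_scenario(0.2))  # ['Build MES', 'Run MES checks']
```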

Add a scenario trigger (optional)

For this demonstration, we’ll trigger the scenario manually, but in real-life cases, we’d create a trigger based on how often or under what conditions we’d want to run the scenario.

Let’s imagine we have enough data to make a fair comparison every week.

  • On the Settings tab of the scenario, click Add Trigger > Time-based trigger.

  • Name it Weekly, and have it repeat every 1 week.


Feel free to add a reporter to receive alerts about the scenario’s progress!

Run the scenario

Let’s introduce another month of data to the pipeline, and then run the scenario.

  • Return to the new_transactions dataset in the Model Scoring Flow zone.

  • On the Settings tab, switch the data to the next month as done previously.

  • Click Save.

  • Return to the Retrain Model scenario, and click Run to manually trigger it.

  • On the Last Runs tab, observe its progression.

  • Assuming your MES check failed, open the saved model to see a new active version of the model!

Dataiku screenshot of the last runs tab of the model retrain scenario.


The goal of this tutorial is to cover the foundations of model monitoring. But you can also think about how this specific scenario would fail to meet real-world requirements.

  • For one, it retrained the model on the original data!

  • Secondly, model monitoring is a production task, and so this kind of scenario should be moved to the Automation node.

Create a model monitoring dashboard

Initially the visualizations inside the MES may be sufficient, but you may soon want to embed these metrics inside a dashboard to more easily share results with collaborators.

  • From the Dashboards page (G+P), open the project’s default dashboard.

  • On the Edit tab, click the plus button to add a tile.

  • For the first tile, choose Metrics, with type as Model evaluation store, source as mes_for_input_drift, and metric as Data Drift.

  • Click Add, and adjust the tile to a more readable size.

  • Click the Copy icon near the top right of the tile, and click Copy once more to duplicate the tile in the same slide of the dashboard.

  • For the second tile, in the Tile tab on the right, change Metrics options to History, and adjust the tile size.

Although we could add much more detail, let’s add just one more tile.

  • Click the plus button to add a third tile, and choose Scenario.

  • With the Last runs option selected, choose the Retrain Model scenario, and click Add.

  • Adjust the tile size, click Save, and navigate to the View tab to see the foundation of a model monitoring dashboard.

Dataiku screenshot of a model monitoring dashboard.


When even more customization is required, you’ll likely want to explore building a custom webapp (which can also be embedded inside a native dashboard).

Create MES metrics datasets (optional)

Dataiku allows for dedicated datasets for metrics and checks on objects like datasets, saved models, and managed folders. We can do the same for model evaluation stores. These datasets can be particularly useful for feeding into charts, dashboards, and webapps.

  • Open either MES, and navigate to the Status tab.

  • Click the gear icon, and select Create dataset from metrics data.

  • Repeat for the other MES.

  • Move both MES metrics datasets to their respective Flow zones.

Dataiku screenshot of a metrics dataset for a model evaluation store.

Next steps

You have achieved a great deal in this tutorial! Most importantly, you:

  • Created pipelines to monitor a model in situations where you do and do not have access to ground truth data.

  • Used input drift, prediction drift, and performance drift to evaluate model degradation.

  • Designed a scenario to automate periodic model retraining based on the value of a MES metric.

  • Gave stakeholders visibility into this process with a basic dashboard.

Now that we have monitoring infrastructure set up on both data quality and the model, let’s learn how to batch deploy to a production infrastructure!