Concept | Monitoring model performance and drift in production#

When a machine learning model is deployed in production, it can start degrading in quality quickly and without warning. Therefore, we need to monitor models in production in a way that alerts us to potential problems and lets us reassure stakeholders that our AI project is under control.

For example, let’s say we have a credit card fraud analysis project where we’ve designed a model to predict fraud and deployed it to production.


Initially, we were satisfied with the model’s performance. Later on, we noticed drops in the active model’s performance metrics like AUC, precision, and recall. However, monitoring models in production should include more than just reviewing a metrics dashboard.

Ideally, we want to monitor our models in a way that gives us control over our AI projects and lets us deploy new model versions with confidence. Therefore, we want to answer the following questions:

  • Does new incoming data reflect the same patterns as the data the model was originally trained on?

  • Is the model performing as well as during the design phase?

  • If not, why?


Tracking model degradation#

How well a model performs is a reflection of the data used to train it. A significant change in the distribution or composition of values for the input variables or the target variable is known as data drift. This drift can cause the model to degrade, leading to inaccurate predictions and insights.

Monitoring model degradation has the following benefits:

  • Alerting data teams before model performance drops significantly (giving them time to correct data drift).

  • Deploying AI projects with confidence.

  • Preparing for compliance with regulatory frameworks around AI.

  • Showing stakeholders that our AI project remains under control.
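
The first benefit above, alerting before performance drops significantly, can be sketched as a simple threshold check. The `check_degradation` helper and the 0.05 tolerance are hypothetical illustrations, not part of any standard tooling:

```python
# Sketch of a threshold-based degradation alert (hypothetical helper and
# tolerance). Compares the model's latest production score against the
# score recorded at design time and flags a drop beyond a tolerated margin.

def check_degradation(design_auc: float, live_auc: float,
                      tolerance: float = 0.05) -> bool:
    """Return True when live performance has degraded past the tolerance."""
    return (design_auc - live_auc) > tolerance

# Design-time AUC was 0.92; the latest production window scored 0.84,
# a 0.08 drop, which exceeds the 0.05 tolerance and should raise an alert.
degraded = check_degradation(0.92, 0.84)
```

In practice, the same check would run on a schedule against each rolling window of scored data, notifying the data team when it fires.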


There are two approaches to consider for tracking model degradation: one based on ground truth and the other on input drift.


Ground truth monitoring#

The ground truth is the correct answer to the question that the model was asked to solve. For example, in the case of predicting credit card fraud, the ground truth is the correct answer to the question, “Was this transaction actually fraudulent?” When we know the ground truth for all of the predictions a model has made, we can judge with certainty how well the model is performing.

Ground truth monitoring requires waiting for the label event, such as whether or not a specific transaction was actually fraudulent, and then computing the performance of the model based on these “ground truth” observations.

When the ground truth is available, monitoring against it is the best way to detect model degradation.


However, obtaining the ground truth can be slow and costly. For example, in the case of predicting customer churn, we may not know the ground truth — whether the customer actually churned or not — until a few months after the record was scored.
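As a minimal sketch of ground truth monitoring for the fraud example: once label events arrive, we match them to the model's predictions and recompute metrics such as precision and recall. The `precision_recall` helper and the sample batch below are hypothetical:

```python
# Sketch of ground truth monitoring (hypothetical helper and data).
# Once the label event arrives (was the transaction actually fraudulent?),
# compare it to the model's prediction and recompute precision and recall.

def precision_recall(predictions, ground_truth):
    """Compute precision and recall from matched prediction/label pairs (1 = fraud)."""
    tp = sum(1 for p, t in zip(predictions, ground_truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predictions, ground_truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predictions, ground_truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical batch: the model flagged 4 transactions; 3 were truly
# fraudulent, and it missed 1 fraudulent transaction.
preds = [1, 1, 1, 1, 0, 0]
truth = [1, 1, 1, 0, 1, 0]
p, r = precision_recall(preds, truth)  # precision 0.75, recall 0.75
```

Recomputing these metrics on each batch of newly labeled records, and comparing them to the design-phase values, is exactly what a ground truth monitoring dashboard automates.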


Input drift monitoring#

If our use case requires rapid feedback or if the ground truth is not available, input drift evaluation may be the way to go.

The premise of input drift monitoring is that a model will only predict accurately if the data it was trained on reflects the real world. If a comparison of recent requests to a deployed model against the training data shows distinct differences, model performance is likely compromised.

Unlike with ground truth evaluation, the data required for input drift evaluation already exists. There is no need to wait for any other information.
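One common way to quantify such a difference is the Population Stability Index (PSI), which compares the binned distribution of a feature in the training data against recent production traffic. The bin fractions below and the 0.2 alert threshold are rule-of-thumb assumptions for illustration, not values from the text:

```python
import math

# Sketch of input drift detection via the Population Stability Index (PSI).
# Both arguments are fractions of records falling into each bin of a feature:
# one computed on the training data, one on recent production requests.

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI between binned training (expected) and live (actual) distributions."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected_fracs, actual_fracs)
    )

# Training distribution of a feature vs. recent production traffic.
train_bins = [0.25, 0.25, 0.25, 0.25]
live_bins = [0.10, 0.20, 0.30, 0.40]

# PSI comes out near 0.23 here, above the common 0.2 rule-of-thumb
# threshold, so this feature would be flagged as drifting.
drifted = psi(train_bins, live_bins) > 0.2
```

Because this only needs the scored requests themselves, the check can run as soon as new data arrives, with no waiting for labels.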

In summary, ground truth is the cornerstone, but input drift monitoring can provide early warning signs.

Works Cited

Mark Treveil and the Dataiku team. Introducing MLOps: How to Scale Machine Learning in the Enterprise. O’Reilly, 2020.