Concept | Monitoring and feedback in the AI project lifecycle#

When developing an ML pipeline, we take an iterative approach. We connect to our data, explore and prepare it, train and evaluate our models, test our candidate models in a simulated production environment, and finally deploy our model to production. Perhaps it has taken months just to get to this point.

What happens the moment our models are deployed? The data in production can be different from the data we trained the model on, and so our models begin to decay. How do we ensure our models continue to perform at their best? By incorporating a monitoring system that supports the iteration and improvement of our models.

To iterate on our models, we need a monitoring system that provides feedback. The goal of monitoring is to alert us to issues (hopefully with enough time to take action before a problem occurs).

In this article, we’ll discuss one of the more challenging but crucial aspects of the AI project lifecycle: the monitoring and feedback loop.


Components of a monitoring and feedback loop#

All effective machine learning projects implement a form of feedback loop. This is where information from the production environment flows back to the model prototyping environment for iteration and improvement.


With Dataiku, that feedback loop consists of three main components:

  • A model evaluation store that versions models and compares performance across different model versions.

  • An online system that performs model comparisons in the production environment, with a shadow scoring (champion/challenger) setup or with A/B testing.

  • A logging system that collects data from production servers.

Let’s look at each of these components in more detail.


This article describes the feedback loop from a best practices perspective.

Model evaluation stores#

Let’s say after reviewing information from our logging system, we decide our model performance has degraded, and we want to improve our model by retraining it.

Formally, a model evaluation store serves as a structure that centralizes the data needed to take several trained candidate models and compare them with the deployed model. Each model and each model version needs to be easily accessible. This structure, or group of models, is referred to as the logical model.

In addition, once we have a model evaluation store with comparison data, we can track model performance over time.

Each logged version of the logical model must come with all the essential information concerning its training phase, including:

  • The list of features used

  • The preprocessing techniques applied to each feature

  • The algorithm used, along with the chosen hyperparameters

  • The training dataset

  • The test dataset used to evaluate the trained model (this is necessary for the version comparison phase)

  • Evaluation metrics comparing the performance between different versions of a logical model

The evaluation metrics can then be stored in the model evaluation store where we can begin to identify top model candidates.
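As an illustration, here is a minimal sketch in plain Python of such a store. This is not the Dataiku API; the class and field names are hypothetical. Each logged version carries the training metadata listed above, and finding the top candidate reduces to a lookup over the stored metrics:

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    """One logged version of a logical model, with its training metadata."""
    version: str
    features: list          # list of features used
    preprocessing: dict     # feature name -> preprocessing technique
    algorithm: str
    hyperparameters: dict
    train_dataset: str      # reference to the training dataset
    test_dataset: str      # reference to the held-out test dataset
    metrics: dict           # evaluation metrics, e.g. {"auc": 0.89}

class ModelEvaluationStore:
    """Centralizes versions of a logical model so they can be compared."""
    def __init__(self):
        self.versions = {}

    def log(self, mv: ModelVersion):
        self.versions[mv.version] = mv

    def best(self, metric, higher_is_better=True):
        """Return the version with the best value for the given metric."""
        ranked = sorted(self.versions.values(),
                        key=lambda mv: mv.metrics[metric],
                        reverse=higher_is_better)
        return ranked[0]
```

Because every version keeps a reference to the same test dataset, the metrics stored here stay comparable across versions.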


See this concept article to learn more about model evaluation stores.

Online evaluation#

Once we’ve identified that a candidate model is outperforming our deployed model, there are two ways we could proceed. We could update the model in the production environment or move to an online evaluation.

There is often a substantial discrepancy between the performance of a model that we are iterating on in development (also known as offline) and a model in production (also known as online). Evaluating our model online would give us the most truthful feedback about the behavior of our candidate model when it is fed real data.

There are two main modes of online evaluation:

  • Champion/challenger (otherwise known as shadow testing), where the candidate model shadows the deployed model and scores the same live requests

  • A/B testing, where the candidate model scores a portion of the live requests and the deployed model scores the others


Champion/challenger#

The champion/challenger mode involves deploying one or more additional models (the challengers) to the production environment. These models receive and score the same incoming requests as the active model (the champion). However, the challenger models do not return any response or prediction to the system. That’s still the job of the champion model.

The challenger predictions are simply logged for further analysis. That’s why this method is also called “shadow testing” or “dark launch”.

This setup allows us to verify that the performance of the challenger model is better than the active model in the production environment.

How long should a challenger be deployed before it’s clear that it outperforms the active model? There is no definitive answer; however, once both versions have scored a sufficient number of requests, we can compare the results statistically.
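The shadow scoring pattern can be sketched in a few lines of plain Python (the helper name and log format are illustrative assumptions, with models exposed as simple callables). The champion alone answers the request; challenger predictions are only logged:

```python
def score_request(request, champion, challengers, shadow_log):
    """Score one live request: the champion answers, challengers only shadow."""
    response = champion(request)  # only the champion's prediction is returned
    for name, challenger in challengers.items():
        # Challenger predictions are logged for later comparison, never returned.
        shadow_log.append({"request": request,
                           "model": name,
                           "prediction": challenger(request)})
    return response
```

Because the challengers see exactly the same requests as the champion, their logged predictions can later be compared on identical inputs.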

A/B testing#

In A/B testing, the candidate model is deployed to the production environment (along with the active model), but input requests are split between the two models. Each request is processed randomly by one or the other model — not both. Results from the two models are logged for analysis. Drawing statistically meaningful conclusions from an A/B test requires careful planning of the test.

For ML models, A/B testing should be used only when champion/challenger is not possible. This might happen when:

  • The ground truth cannot be evaluated for both models.

  • The objective to optimize is only indirectly related to the performance of the prediction (e.g. a business metric is more important than a statistical one).
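The traffic split can be sketched as follows (the function name and hashing scheme are illustrative assumptions, not a Dataiku feature). Hashing a request ID instead of drawing a random number per request keeps the assignment deterministic, so repeated requests from the same source always land on the same model, which simplifies the later analysis:

```python
import hashlib

def assign_arm(request_id, traffic_to_b=0.2):
    """Deterministically route a request to model A or B.

    Hashing the request ID maps it to a stable bucket in [0, 1);
    requests whose bucket falls below `traffic_to_b` go to model B.
    """
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "B" if bucket < traffic_to_b else "A"
```

A plain `random.random()` draw would match the description above just as well; the sticky assignment is simply a common design choice for A/B tests.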


See this tutorial to learn how to implement A/B testing of API endpoints with Dataiku.

Logging system#

Monitoring a live system, with or without machine learning components, means collecting and aggregating data about its states. The logging system is a time-stamped event log that captures and centralizes the following information for analysis:

  • Model metadata: Information that identifies the model, such as the model version.

  • Model inputs: Values of new observations to detect data drift.

  • Model outputs: The model’s predictions along with the ground truth (collected later on) to give us an idea of the model’s performance.

  • System action: The action taken by the system. For example, in the case of a system that detects credit card fraud, the system action might be to send a warning to the bank.

  • Model explanation: Which features have the most influence on the prediction. This is crucial information for some highly regulated industries such as finance (e.g., we would want to know if there is bias when the model makes predictions about who will pay back a loan and who will not).

Once in place, the logging system periodically fetches data from the production environment. We can set up automated monitoring to alert us in the event of data drift. If we receive an alert, we’ll want to evaluate our models for performance degradation. To do this, we can use a model evaluation store.
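To make this concrete, here is a minimal sketch of a time-stamped event record and a naive drift alert. The field names and the drift rule are illustrative assumptions, not Dataiku’s implementation; real systems typically use richer drift statistics, such as the population stability index:

```python
import statistics
import time

def log_event(log, model_version, inputs, prediction, action, explanation=None):
    """Append one time-stamped event capturing what the production model did."""
    log.append({
        "timestamp": time.time(),
        "model_version": model_version,  # model metadata
        "inputs": inputs,                # new observation, kept for drift detection
        "prediction": prediction,        # ground truth is joined in later
        "action": action,                # e.g. warn the bank about suspected fraud
        "explanation": explanation,      # top feature influences, if available
    })

def mean_shift_alert(train_values, live_values, threshold=3.0):
    """Naive drift check: alert when the live mean of a feature drifts more
    than `threshold` training standard deviations from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) > threshold * sigma
```

Running a check like this on each periodic fetch of production data gives the automated alert described above; an alert is the trigger to re-evaluate candidate models in the model evaluation store.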


Data powers our insights, and so changing data could lead to different or inaccurate insights. When we monitor our models in production using a feedback loop, we obtain crucial information for iterating on and improving our models. Incorporating the three main components of a feedback loop gives us:

  • A model evaluation store that lets us identify a candidate model that outperforms our deployed model.

  • An online system where we can obtain the most truthful feedback about the behavior of our candidate model.

  • A logging system that can alert us to data drift.
