Concept | Monitoring and feedback in the AI project lifecycle#

Developing an ML pipeline often takes an iterative approach. You connect to, explore, and prepare data. You train and evaluate models. You test candidate models in a simulated production environment. Finally, you deploy a model to production. Perhaps it has taken months just to get to this point.

What happens the moment models are deployed? The data in production can be different from the model’s training data, and so the models begin to decay. How do you ensure models continue to perform at their best? By incorporating a monitoring system that supports the iteration and improvement of models.

To iterate on models, you need a monitoring system that provides feedback. The goal of monitoring is to draw attention to issues (hopefully with enough time to act before a problem occurs).

This article discusses one of the more challenging, but crucial aspects of the AI project lifecycle — the monitoring and feedback loop.

[Figure: The AI project lifecycle]

Components of a monitoring and feedback loop#

All effective machine learning projects implement a form of feedback loop. This is where information from the production environment flows back to the model prototyping environment for iteration and improvement.

[Figure: The monitoring and feedback loop]

With Dataiku, that feedback loop consists of three main components:

  • A model evaluation store that versions models and evaluates performance across different model versions.

  • An online system that performs model comparisons in the production environment, with a shadow scoring (champion/challenger) setup or with A/B testing.

  • A logging system that collects data from production servers.

Let’s look at each of these components in more detail.

Model evaluation stores#

After reviewing information from a logging system, you may decide that model performance has degraded, and you want to retrain the model.

Formally, a model evaluation store serves as a structure that centralizes the data needed to take several trained candidate models and compare them with the deployed model. Each model and each model version needs to be accessible. This structure, or group of models, is referred to as the logical model.

In addition, once you have a model evaluation store with comparison data, you can track model performance over time.

Each logged version of the logical model must come with all the essential information concerning its training phase, including:

  • The list of features used

  • The preprocessing techniques that are applied to each feature

  • The algorithm used, along with the chosen hyperparameters

  • The training dataset

  • The test dataset used to evaluate the trained model (this is necessary for the version comparison phase)

  • Evaluation metrics comparing the performance between different versions of a logical model

The evaluation metrics can then be stored in the model evaluation store where you can begin to identify top model candidates.
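To make this concrete, here is a minimal, hypothetical sketch (plain Python, not the Dataiku API) of the kind of record a model evaluation store keeps for each version of a logical model, and how stored metrics could be used to rank candidates. All class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelVersionRecord:
    """One logged version of a logical model, with its training metadata."""
    version_id: str
    features: list[str]                # list of features used
    preprocessing: dict[str, str]      # preprocessing technique applied to each feature
    algorithm: str                     # algorithm used
    hyperparameters: dict              # chosen hyperparameters
    train_dataset: str                 # reference to the training dataset
    test_dataset: str                  # reference to the test dataset used for evaluation
    metrics: dict[str, float]          # e.g. {"auc": 0.87, "logloss": 0.31}
    trained_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class ModelEvaluationStore:
    """Centralizes every logged version of one logical model for comparison."""
    logical_model: str
    versions: list[ModelVersionRecord] = field(default_factory=list)

    def log_version(self, record: ModelVersionRecord) -> None:
        self.versions.append(record)

    def top_candidates(self, metric: str, n: int = 3) -> list[ModelVersionRecord]:
        """Rank logged versions by a metric where higher is better."""
        return sorted(self.versions, key=lambda v: v.metrics[metric], reverse=True)[:n]
```

In Dataiku, this bookkeeping is handled by the model evaluation store itself; the sketch only illustrates the information each logged version must carry so that versions remain comparable over time.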

Note

See Concept | Model evaluation stores to learn more.

Online evaluation#

Once you’ve identified that a candidate model is outperforming the deployed model, there are at least two ways you could proceed. You could update the model in the production environment or move to an online evaluation.

A discrepancy often exists between the performance of a model in development (also known as offline) and a model in production (also known as online). Evaluating the model online would give the most truthful feedback about the behavior of a candidate model when it’s fed real data.

There are two main modes of online evaluation:

  • Champion/challenger (otherwise known as shadow testing), where the candidate model shadows the deployed model and scores the same live requests.

  • A/B testing, where the candidate model scores a portion of the live requests and the deployed model scores the others.

Champion/challenger#

The champion/challenger mode involves deploying one or more additional models (the challengers) to the production environment. These models receive and score the same incoming requests as the active model (the champion). However, the challenger models don’t return any response or prediction to the system. That’s still the job of the champion model.

The challenger predictions are simply logged for further analysis. That’s why this method is also called “shadow testing” or “dark launch”.

This setup lets you verify whether the challenger model performs better than the active model in the production environment.

How long should a challenger model be deployed before it’s clear whether it outperforms the active model? There is no definitive answer. However, once both versions have scored enough requests, you can compare the results statistically.
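As an illustration only, the routing logic of a shadow setup might look like the sketch below. The `champion`, `challengers`, and logger names are hypothetical stand-ins for your deployed scoring functions and logging system; the key point is that every challenger scores the same request, but only the champion’s prediction is returned.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("shadow_scoring")


def score_request(
    features: dict[str, Any],
    champion: Callable[[dict], float],
    challengers: dict[str, Callable[[dict], float]],
) -> float:
    """Return the champion's prediction; log challenger predictions for later analysis."""
    champion_pred = champion(features)

    for name, challenger in challengers.items():
        try:
            challenger_pred = challenger(features)
            # Challenger outputs are only logged, never returned to the caller.
            logger.info(
                "shadow_score challenger=%s champion_pred=%.4f challenger_pred=%.4f",
                name, champion_pred, challenger_pred,
            )
        except Exception:
            # A failing challenger must never break the live response path.
            logger.exception("challenger %s failed", name)

    return champion_pred
```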

A/B testing#

In A/B testing, the candidate model is deployed to the production environment (along with the active model). However, input requests are split between the two models. Each request is processed randomly by one or the other model — not both. Results from the two models are logged for analysis. Drawing statistically meaningful conclusions from an A/B test requires careful planning of the test.
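A minimal sketch of that split, assuming hypothetical `model_a` (the active model) and `model_b` (the candidate) scoring functions: each request is routed to exactly one arm, and the assignment is logged so the two arms can later be compared statistically once ground truth is available.

```python
import logging
import random
from typing import Any, Callable

logger = logging.getLogger("ab_test")


def route_request(
    features: dict[str, Any],
    model_a: Callable[[dict], float],   # active (deployed) model
    model_b: Callable[[dict], float],   # candidate model
    traffic_to_b: float = 0.1,          # fraction of requests sent to the candidate
) -> float:
    """Send each request to exactly one model and log which arm handled it."""
    arm = "B" if random.random() < traffic_to_b else "A"
    prediction = model_b(features) if arm == "B" else model_a(features)
    logger.info("ab_test arm=%s prediction=%.4f", arm, prediction)
    return prediction
```

The traffic fraction and the duration of the test should be planned up front so that each arm collects enough observations for a meaningful statistical comparison.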

For ML models, A/B testing should be used only when a champion/challenger setup isn’t possible. This might happen when:

  • The ground truth can’t be evaluated for both models.

  • The objective to optimize is only indirectly related to the performance of the prediction (for example, a business metric is more important than a statistical one).

Note

See Tutorial | Real-time API deployment to learn how to implement A/B testing of API endpoints with Dataiku.

Logging system#

Monitoring a live system, with or without machine learning components, means collecting and aggregating data about its states. The logging system is a timestamped event log that captures and centralizes the following information for analysis:

  • Model metadata to identify the model, such as the model version.

  • Model inputs: Values of new observations to detect data drift.

  • Model outputs: The model’s predictions along with the ground truth (collected later on) to give you an idea of the model’s performance.

  • System action: The action taken by the system. For example, in the case of a system that detects credit card fraud, the system action might be to send a warning to the bank.

  • Model explanation: Which features have the most influence on the prediction. This is crucial information in some highly regulated industries such as finance. For example, you want to know whether there is bias when the model predicts who will pay back a loan and who won’t.

Once in place, the logging system periodically fetches data from the production environment. You can set up automated monitoring to send an alert in the event of data drift. If you receive an alert, you’ll want to evaluate the models for performance degradation. To do this, you can use a model evaluation store.
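As a rough illustration (not Dataiku’s built-in monitoring), the sketch below shows a timestamped event record capturing the information listed above, plus a simple population-stability-style drift check on one numeric feature. The field names and the alert threshold are assumptions.

```python
import json
import math
from datetime import datetime, timezone


def log_prediction_event(model_version: str, inputs: dict, prediction: float,
                         system_action: str, explanation: dict) -> str:
    """Build one timestamped log record: model metadata, inputs, output,
    the action taken by the system, and the per-feature explanation."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "inputs": inputs,
        "prediction": prediction,
        "system_action": system_action,
        "explanation": explanation,   # e.g. feature influence scores
    }
    return json.dumps(event)


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between training-time and production values."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def frac(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        return [(c or 1e-6) / len(values) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# A PSI above roughly 0.2 is a common rule-of-thumb threshold for raising a drift alert.
```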

Summary#

Data powers insights, and so changing data can lead to different or inaccurate insights. When you monitor models in production using a feedback loop, you obtain crucial information for iterating on and improving your models. A feedback loop incorporates three main components:

  • A model evaluation store for identifying a candidate model that outperforms a deployed model.

  • An online system for obtaining the most truthful feedback about the behavior of a candidate model.

  • A logging system that can alert you to data drift.
