Concept | Model validation and evaluation#

How can we check that our deployed model behaves the way we want it to? Once a model is in the Flow, we can use a validation set and the Evaluate recipe to get a sense of model fit. This lesson walks you through that process.

Active version of the model#

The model deployed to the Flow is our first and active version of the model, which means that it’s the version used when running the Retrain, Score, or Evaluate recipes.

The idea of which model is the “active version” is important because data science projects are highly iterative. Over time, your projects and models will evolve as new labelled data becomes available, or as you come up with new feature engineering strategies for your training data.

Whether you have a batch of new labelled records or are ready to try a new feature strategy, you will need to replace the deployed model with a new, updated version.

In our sample use case, Hospital Readmission, new patients continue to arrive at the hospital. Some are readmitted, and some never return. It’s possible that the distributions and relationships between features and the target in the data change over time. Maybe the hospital has changed policies? Maybe a pandemic is keeping patients away? To prevent the model from becoming stale, it’s good practice to retrain the model regularly when fresh training data becomes available.

A validation set#

How can we design a Flow that lets us manage the model lifecycle? One way is to adjust the Split recipe to carve out a validation set from the labelled training data.

Because the model never sees these records during training, a validation set gives us a less biased estimate of how well the model fits new data.
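Outside of the visual Split recipe, the same idea can be sketched in a few lines of plain Python. This is only an illustration: the function name `split_train_validation`, the 20% hold-out fraction, the seed, and the toy records are all assumptions, not settings taken from the project.

```python
import random

def split_train_validation(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = int(round(len(shuffled) * validation_fraction))
    validation = shuffled[:n_validation]   # held out, never used for training
    train = shuffled[n_validation:]        # the slightly smaller training set
    return train, validation

# Hypothetical labelled records: (patient_id, readmitted)
labelled = [(patient_id, patient_id % 3 == 0) for patient_id in range(100)]
train, validation = split_train_validation(labelled)
print(len(train), len(validation))
```

Fixing the seed means the same records land in the validation set on every run, so successive model versions are compared on the same held-out data.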

Model retraining#

To retrain our model, we’ll run the Train recipe again and train the model on the new, slightly smaller training data that excludes all records in the validation set. To help automate the retraining of our model, we could create a scenario.

The Evaluate recipe#

But how do we know if the new version of the model is better than the previous one? That’s a job for the Evaluate recipe.

As we update the active version of the model, we’ll be able to compare a model’s performance against previous versions using data that was never seen during training.

The Evaluate recipe requires two inputs: a labelled dataset and a deployed prediction model. It produces two outputs: the scored input dataset and another dataset of model metrics.

The scored validation set (validation_scored) produced by the Evaluate recipe is similar to the output of the Score recipe. The Readmitted variable was already present, but we can now compare it against the class probabilities and predictions according to the active version of the model.
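To make the two outputs concrete, here is a minimal sketch of what an evaluation step computes. It is not the Evaluate recipe's implementation: the `evaluate` and `roc_auc` functions, the toy scoring rule, the 0.5 threshold, and the `proba_1` and `prediction` column names are assumptions chosen for illustration. Only the `Readmitted` target comes from the use case.

```python
def roc_auc(labels, probabilities):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [p for y, p in zip(labels, probabilities) if y]
    negatives = [p for y, p in zip(labels, probabilities) if not y]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def evaluate(validation_rows, predict_proba, threshold=0.5):
    """Return (scored rows, metrics) for a labelled dataset and a model."""
    scored = []
    for row in validation_rows:
        proba = predict_proba(row)
        # Keep the original columns and add the model's outputs next to them.
        scored.append({**row, "proba_1": proba, "prediction": int(proba >= threshold)})
    labels = [r["Readmitted"] for r in scored]
    accuracy = sum(r["prediction"] == r["Readmitted"] for r in scored) / len(scored)
    metrics = {"accuracy": accuracy, "auc": roc_auc(labels, [r["proba_1"] for r in scored])}
    return scored, metrics

# Hypothetical validation rows and a toy stand-in for the deployed model.
validation_rows = [
    {"prior_visits": 1, "Readmitted": 0},
    {"prior_visits": 5, "Readmitted": 1},
    {"prior_visits": 6, "Readmitted": 0},
    {"prior_visits": 9, "Readmitted": 1},
]

def toy_model(row):
    return min(1.0, row["prior_visits"] / 10)

scored, metrics = evaluate(validation_rows, toy_model)
print(metrics)
```

The scored rows keep the true label alongside the predicted probability and class, which is what allows the side-by-side comparison described above.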

After running the Evaluate recipe once, the model metrics dataset contains a single row. After each update to the active version of the model, we'll re-run the Evaluate recipe and see a new row of metrics added.
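The append-only behaviour of the metrics dataset can be pictured as a log that gains one row per evaluation run. In the sketch below, `append_metrics_row`, the column names, the version labels, and the metric values are all hypothetical placeholders, not results from the Hospital Readmission project.

```python
from datetime import datetime, timezone

def append_metrics_row(metrics_log, model_version, metrics, evaluated_at=None):
    """Add one row describing a single evaluation run; earlier rows are kept."""
    timestamp = evaluated_at or datetime.now(timezone.utc)
    row = {
        "evaluated_at": timestamp.isoformat(),
        "model_version": model_version,
        **metrics,
    }
    metrics_log.append(row)
    return row

metrics_log = []
# Placeholder values for illustration only.
append_metrics_row(metrics_log, "v1", {"auc": 0.75, "accuracy": 0.75})
append_metrics_row(metrics_log, "v2", {"auc": 0.76, "accuracy": 0.74})

for row in metrics_log:
    print(row["model_version"], row["auc"], row["accuracy"])
```

Because rows are never overwritten, the log doubles as a history of how each active version performed on held-out data.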

Retraining for new historic data#

With this evaluation plan in place, consider a case sometime in the near future where our historic data has changed. After it travels through our data pipeline, we have new, labelled training and validation sets recording which patients were readmitted.

Once again, let’s retrain the model on the new training data.

The model we just trained on our new data is now the active version of the model. Its reported ROC AUC has slightly improved, but evaluating it against the validation set gives a fairer assessment.

With a fresh active version of the model, let’s re-run the Evaluate recipe.

We have a new row in the model metrics dataset recording the performance of the newly trained model. The metrics are quite close, which suggests the model performs about as well on the new labelled data as it did on the older data.

Perhaps we should keep this model as the active version? If not, we can always roll back to a previous version.
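The keep-or-roll-back decision can be written down as a simple rule over two metrics rows. This is one possible policy, not a built-in feature: the function `choose_active_version`, the choice of AUC as the deciding metric, the 0.01 tolerance, and the example values are all assumptions.

```python
def choose_active_version(previous, candidate, metric="auc", tolerance=0.01):
    """Keep the candidate unless it is worse than the previous version
    by more than the tolerance on the chosen metric."""
    if candidate[metric] >= previous[metric] - tolerance:
        return candidate["model_version"]
    return previous["model_version"]  # roll back

# Placeholder metrics rows for illustration only.
previous = {"model_version": "v1", "auc": 0.750}
close_candidate = {"model_version": "v2", "auc": 0.745}
weak_candidate = {"model_version": "v2", "auc": 0.700}

print(choose_active_version(previous, close_candidate))
print(choose_active_version(previous, weak_candidate))
```

A small tolerance avoids rolling back over differences that are likely noise, while still catching a clear drop in performance.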

Next steps#

Retraining your model is also important in the long term to ensure that it keeps producing good predictions as the input data evolves.

To learn more about model monitoring, see our resources in the Knowledge Base.