Concept | Model validation and evaluation#

How can we check that a deployed model behaves as intended? Once a model is in the Flow, we can use a validation set and the Evaluate recipe to measure how well it performs on data it has not seen. This lesson walks through that process.

Active version of the model#

The model deployed to the Flow is our first and active version of the model, which means that it is the version used when running the Retrain, Score, or Evaluate recipes.

Active version of the model.

The idea of which model is the “active version” is important because data science projects are highly iterative. Over time, your projects and models will evolve as new labelled data becomes available, or as you come up with new feature engineering strategies for your training data.

Matrix for determining when to retrain the model.

Whether you have a batch of new labelled records or are ready to try a new feature strategy, you will need to replace the deployed model with a new, updated version.

Graphic design showing the relationship between new training data and unlabeled data.

In our sample use case, Hospital Readmission, new patients continue to arrive at the hospital. Some are readmitted, and some never return. Feature distributions, and the relationships between the features and the target, may change over time. Maybe the hospital has changed its policies, or a pandemic is keeping patients away. To prevent the model from becoming stale, it is good practice to retrain it regularly as fresh training data becomes available.

Graphic design describing new labeled records.

A validation set#

How can we design a Flow that lets us manage the model lifecycle? One way is to adjust the Split recipe to carve out a validation set from the labelled training data.

Because the model never sees these records during training, the validation set gives us an unbiased estimate of how well the model fits new data.

Flow with Split recipe and output dataset highlighted.
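As a rough illustration of what carving out a validation set means, here is a minimal Python sketch that uses only the standard library. It is not how the Split recipe is implemented. The function name `split_validation`, the 20% fraction, the fixed seed, and the toy patient records are all assumptions made for this example.

```python
import random


def split_validation(records, validation_fraction=0.2, seed=42):
    """Return (train, validation) lists with no record in both.

    Hypothetical helper: the fraction and seed are illustrative defaults.
    """
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded shuffle keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Made-up labelled records: (patient_id, readmitted)
labelled = [(patient_id, patient_id % 3 == 0) for patient_id in range(100)]
train, validation = split_validation(labelled)
```

The property that matters is that the two sets are disjoint: a record held out for validation must never reach the training step.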

Model retraining#

To retrain our model, we’ll run the Train recipe again, this time on the new, slightly smaller training set that excludes all records in the validation set. To automate retraining, we could create a scenario.

See also

For more information about scenarios, see Concept | Scenarios in the Knowledge Base, or Automation scenarios in the reference documentation.


We can also retrain the model by selecting the saved model and choosing Retrain from the available actions.

Flow with Model object selected and Retrain highlighted.

Once training completes, we can see the new version of the model by clicking on the saved model in the Flow. During retraining, all of the settings in the model design (the algorithm, hyperparameters, and feature handling) stay identical to those of the active model; only the data the model is fitted to changes.

By default, the retrained model becomes the new, active version. We could also change settings to require manual activation of any new model version. From this page, we can activate a new version of the model or roll back to a previous version of the model.

Active version of the model within a display showing all versions of the active model.
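The relationship between model versions and the active version can be sketched with a small Python class. This is a toy stand-in, not the product's API: the class `SavedModel`, its `auto_activate` flag, and the version ids are hypothetical names chosen for this example. It also assumes that the first version is always activated, as in the Flow described above.

```python
class SavedModel:
    """Toy stand-in for a saved model that keeps every trained version."""

    def __init__(self, auto_activate=True):
        self.auto_activate = auto_activate  # assumption: mirrors the default behaviour
        self.versions = []                  # version ids, in training order
        self.active = None                  # id of the version used for scoring

    def add_version(self, version_id):
        """Record a newly trained version and activate it if allowed."""
        self.versions.append(version_id)
        # The first version is always activated; later ones only if auto_activate is set.
        if self.active is None or self.auto_activate:
            self.active = version_id

    def activate(self, version_id):
        """Manually activate a version, for example to roll back."""
        if version_id not in self.versions:
            raise KeyError(f"unknown version: {version_id}")
        self.active = version_id


model = SavedModel()
model.add_version("v1")
model.add_version("v2")  # becomes the active version by default
model.activate("v1")     # roll back to the previous version
```

With `auto_activate=False`, a retrained version is stored but stays inactive until `activate` is called, which corresponds to requiring manual activation.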

The Evaluate recipe#

But how do we know if the new version of the model is better than the previous one? That’s a job for the Evaluate recipe.

As we update the active version of the model, we’ll be able to compare a model’s performance against previous versions using data that was never seen during training.

The Evaluate recipe requires two inputs: a labelled dataset and a deployed prediction model. It produces two outputs: the scored input dataset and another dataset of model metrics.

Evaluate recipe configuration showing two outputs including a dataset and model metrics.

The scored validation set (validation_scored) produced by the Evaluate recipe is similar to the output of the Score recipe. The Readmitted variable was already present, but we can now compare it against the class probabilities and predictions according to the active version of the model.

Flow with the schema of a scored validation dataset.

After running the Evaluate recipe one time, we have only one row in the model metrics dataset. After each update to the active version of the model, we’ll re-run the Evaluate recipe and see a new row of metrics added.

Model metrics dataset.
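To make the two outputs concrete, here is a minimal Python sketch of an evaluation step that returns a scored dataset and appends one row to a metrics log. It does not reproduce the Evaluate recipe itself. The function names, the 0.5 threshold, the choice of metrics, and the toy data are assumptions for this example.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def evaluate(version_id, predict_proba, labelled_rows, metrics_log, threshold=0.5):
    """Score labelled rows and append one metrics row for this model version."""
    scored = []
    for features, label in labelled_rows:
        proba = predict_proba(features)
        scored.append({"features": features, "label": label,
                       "proba": proba, "prediction": proba >= threshold})
    labels = [row["label"] for row in scored]
    probas = [row["proba"] for row in scored]
    accuracy = sum(row["prediction"] == row["label"] for row in scored) / len(scored)
    metrics_log.append({"version": version_id, "accuracy": accuracy,
                        "auc": roc_auc(labels, probas)})
    return scored


# Toy validation set: a single numeric feature and a readmission label.
validation = [(0.9, True), (0.8, False), (0.3, False), (0.6, True)]
metrics_log = []
# Hypothetical "model" that uses the feature directly as the class probability.
scored = evaluate("v1", lambda x: x, validation, metrics_log)
```

Calling `evaluate` again after activating a new version appends a second row to `metrics_log`, so the log grows by one row per evaluation.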

Retraining for new historic data#

With this evaluation plan in place, consider a case sometime in the near future where our historic data has changed. After it travels through our data pipeline, we have new, labelled training and validation sets recording which patients were readmitted.

Hypothetical data pipeline where historic data has changed.

Once again, let’s retrain the model on the new training data.

The model we just trained on our new data is now the active version of the model. Its ROC AUC has improved slightly, but evaluating it against the validation set will give us a fairer assessment.

Active version of the model within a display showing all versions of the active model.

With a fresh active version of the model, let’s re-run the Evaluate recipe.

We have a new row in the model metrics dataset recording the performance of the newly trained model. The metrics are quite close: the model appears to perform about as well on the new labelled data as it did on the older data.

Model metrics dataset after running the Evaluate recipe.

Perhaps we should keep this model as the active version? If not, we can always roll back to a previous version.

Versions of the active model with a choice to set one version as the new active model.
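The decision to keep or roll back a version can be expressed as a simple rule over the metrics rows. The sketch below is only one possible rule, not a built-in feature. The function name, the `auc` key, the tolerance of 0.01, and the sample values are hypothetical, and the rule assumes that both rows were computed on comparable validation data.

```python
def choose_active_version(metrics_log, metric="auc", tolerance=0.01):
    """Keep the newest version unless it is worse than the previous one by more than tolerance."""
    if not metrics_log:
        raise ValueError("metrics_log is empty")
    if len(metrics_log) == 1:
        return metrics_log[-1]["version"]
    previous, latest = metrics_log[-2], metrics_log[-1]
    if latest[metric] >= previous[metric] - tolerance:
        return latest["version"]
    return previous["version"]  # roll back


# Made-up metrics rows, one per evaluation run.
close_metrics = [{"version": "v1", "auc": 0.70}, {"version": "v2", "auc": 0.695}]
worse_metrics = [{"version": "v1", "auc": 0.70}, {"version": "v2", "auc": 0.60}]

keep = choose_active_version(close_metrics)      # metrics are close: keep the new version
rollback = choose_active_version(worse_metrics)  # clear drop: return to the previous version
```

In practice the tolerance and the metric would depend on the use case, and a human review of the metrics may be preferable to an automatic rule.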

What’s next?#

Retraining your model regularly also matters in the long term: it helps ensure that the model keeps producing good predictions as the input data evolves.

To learn more about model monitoring, see our resources in the Knowledge Base.