Concept | Model evaluation stores#
At any point, we can evaluate the efficacy of our model over time if we have the right historical data. Additionally, as we receive new input data over time, we can concurrently evaluate whether our model is still performing well. To accomplish these tasks, we’ll utilize model evaluation stores in Dataiku.
Create an Evaluate recipe#
The first step to evaluating a model in Dataiku is to create an Evaluate recipe, which requires two inputs: a saved model deployed to the Flow and an evaluation dataset.
This recipe can have up to three outputs:
An output dataset containing the input features, predictions, and correctness of predictions;
A metrics dataset containing one row of performance metrics for each run of the Evaluate recipe;
And/or a model evaluation store, which contains the same performance metrics, but also visualizes them and offers the familiar status checks for monitoring purposes.
We’ll also keep the default settings for the Evaluate recipe, which means the recipe will use the active version of the saved model.
Each time the Evaluate recipe runs, new data is added to the model evaluation store.
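For instance, rather than rerunning the recipe by hand, a scenario can rerun it on a schedule as new data arrives. Below is a minimal sketch of a scenario Python step that builds one of the recipe’s outputs (a metrics dataset assumed here to be named transactions_metrics), which reruns the Evaluate recipe and so adds a new entry to the model evaluation store.

```python
from dataiku.scenario import Scenario

# Scenario Python step: rerun the Evaluate recipe by building one of its outputs.
# "transactions_metrics" is a placeholder name for the metrics output dataset.
scenario = Scenario()
scenario.build_dataset("transactions_metrics")
```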
Explore the model evaluation store#
In this example, we have run the Evaluate recipe three times. For each run, the prediction model trained on Q1 data remained the same. However, the contents of the input dataset transaction_to_train_Q were changed to Q2, Q3, and finally Q4 data.
After running the recipe on three subsequent quarters of data, we can begin to see the model degrade in the model evaluation store. Data drift, particularly for Q4, is increasing, and ROC AUC is decreasing.
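The same trend can also be inspected programmatically. As a rough sketch, assuming the Evaluate recipe also writes a metrics output dataset named transactions_metrics with an auc column (check the actual schema of your metrics dataset), we could read it with pandas and look at the metric run by run:

```python
import dataiku

# One row per run of the Evaluate recipe; the dataset and column names
# here are placeholders for illustration.
metrics = dataiku.Dataset("transactions_metrics").get_dataframe()

# Inspect how ROC AUC evolves from one evaluation run to the next.
print(metrics["auc"])
```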
View drift analysis#
In addition to the summary view above, we can take a closer look at any individual run by clicking on its row. When we open the third row, containing the Q4 input data, we find the same model information and performance metrics available in the Lab, plus a Drift Analysis section.
The input data drift section reports how well a random forest classifier can distinguish the test set drawn from the original reference data (Q1) from the new data (Q4). In this case, we clearly have input data drift and may need to retrain the model.
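Conceptually, this score comes from a “domain classifier” approach: pool the reference and new rows, label which sample each row came from, and measure how well a classifier can tell them apart. The sketch below illustrates the idea with scikit-learn on already-numeric features; it is only a conceptual illustration, not Dataiku’s internal implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def drift_score(reference: pd.DataFrame, new: pd.DataFrame) -> float:
    """Train a classifier to separate reference rows from new rows.

    An AUC near 0.5 means the samples look alike (little drift);
    an AUC near 1.0 means they are easy to tell apart (strong drift).
    """
    data = pd.concat([reference, new], ignore_index=True)
    labels = [0] * len(reference) + [1] * len(new)

    X_train, X_test, y_train, y_test = train_test_split(
        data, labels, test_size=0.3, random_state=0, stratify=labels
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```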
Scrolling down in the same pane, we can see exactly which input features have drifted. After sorting by the KS test column, we find that five features drifted significantly between Q1 and Q4, as their p-values are lower than the significance level of 5%.
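The per-feature check works along the same lines as a two-sample Kolmogorov-Smirnov test: compare each feature’s distribution in the reference and new data, and flag it as drifted when the p-value falls below the chosen significance level. A minimal sketch with scipy, again as an illustration rather than Dataiku’s exact computation:

```python
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(reference: pd.DataFrame, new: pd.DataFrame, alpha: float = 0.05):
    """Return numeric features whose distributions differ significantly."""
    drifted = []
    for col in reference.select_dtypes("number").columns:
        statistic, p_value = ks_2samp(reference[col].dropna(), new[col].dropna())
        if p_value < alpha:  # reject "same distribution" at the 5% level
            drifted.append((col, statistic, p_value))
    return drifted
```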
Establish status checks#
As with a saved model or a metrics dataset output by the Evaluate recipe, we can establish status checks based on any chosen metric. For example, we might want a warning if the model’s ROC AUC dips below a certain threshold. Using a scenario, we could also automatically retrain the model when this occurs, as sketched below.
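As a rough sketch of that last idea, a scenario Python step could read the latest evaluation results and trigger retraining only when the metric crosses the threshold. The dataset name (transactions_metrics), column name (auc), threshold, and saved model ID below are placeholders, and the sketch assumes the scenario step API’s train_model() is available in your DSS version.

```python
import dataiku
from dataiku.scenario import Scenario

AUC_THRESHOLD = 0.75  # placeholder threshold for the status check

# Latest row of the metrics dataset written by the Evaluate recipe.
metrics = dataiku.Dataset("transactions_metrics").get_dataframe()
latest_auc = metrics["auc"].iloc[-1]

# Retrain the saved model only if performance has degraded past the threshold.
if latest_auc < AUC_THRESHOLD:
    Scenario().train_model("SAVED_MODEL_ID")  # placeholder saved model ID
```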
What’s next?#
Model evaluation stores are also essential for monitoring the health of your models over time.
To learn more about the AI lifecycle, visit Concept | Monitoring and feedback in the AI project lifecycle, or work through this tutorial on model monitoring for more hands-on experience.