Tutorial | ML diagnostics#
ML diagnostics are a set of checks that help you to detect and correct common problems, such as overfitting and data leakage, during the model development phase. In this tutorial, you’ll learn how to use and interpret ML diagnostics using Dataiku’s visual machine learning tools.
Create the project#
If you have already created the Machine Learning 102 project in your Dataiku instance, open it.
Otherwise, from the Dataiku Design homepage, click +New Project > DSS tutorials > ML Practitioner > Machine Learning 102.
Note
The Machine Learning 102 project is shared by the following tutorials:
Feature generation
Model overrides
ML assertions
ML diagnostics
What if analysis
Causal prediction
Double click on the ML diagnostics & assertions Flow zone.
View model diagnostics#
This tutorial uses a dataset of a fictional company’s customers and some demographic and business information about those customers. It includes two models, already trained in the Lab, that predict whether each customer is a high-revenue customer, as indicated in the high revenue column.
Dataiku calculates diagnostics continually during training so you can decide whether to continue or abort training. It also provides a summary in the model Result tab after training is complete.
Note
The specific checks performed will depend on the algorithm, as well as the type of modeling task. You can read more information about the checks performed in the reference documentation.
To access the models and their diagnostics summaries:
From the top navigation bar, click on Visual Analyses.
Select the model Quick modeling of high_revenue on customers labeled.
In the model Result page that appears, click on Diagnostics for the Random forest model.
This brings you to the Training information report, where the Diagnostics section shows three warnings.
First, in Dataset sanity checks, we see warnings that the training and test sets are imbalanced. This can cause our classification model to perform poorly when attempting to predict the underrepresented value.
Now we should check how our data is imbalanced. In this data, we find that the proportion of high revenue customers is relatively small. Because these are the customers we want most to attract and retain, it’s important to identify this issue and take action to address it.
Next, under Model check, we can see that the model does not perform better than a random classifier. Dataiku calculates the performance of a dummy classifier and compares that to the trained models. The problem with our model could also relate to the underrepresented classes.
Note that the logistic regression model that also was trained has similar warnings from its diagnostic checks. You can view those diagnostics following the same steps.
Tip
Click on any of the links on the diagnostic checks, such as Dataset sanity checks or Modeling parameters to read more about that issue in the Dataiku Help Center.
Set diagnostics#
Diagnostic checks are automatically activated by default, and you can turn them off in the Design tab.
From the Training information report, click on the link to Go to the design to enable/disable diagnostics.
In this panel, you can turn off all diagnostics or selected checks.