Tutorial | Tune the model (ML Practitioner part 3)¶
In the Machine Learning Basics series, you built a basic model to classify high revenue customers and looked at a few ways to evaluate its performance.
You’ll also find this tutorial as part of the Academy course, Machine Learning Basics, which is part of the ML Practitioner learning path.
Because modeling is an iterative process, let’s now turn our attention to improving the model’s results and speeding up the evaluation process.
In the Machine Learning Basics (Tutorial) project, return to the Models tab of the High revenue analysis.
By default, you’ll be in the tab showing the Results of your training session. This is where you can get a sneak preview of the results of the visual ML diagnostics.
In this lesson, you’ll start in the Design tab.
Configure the train / test split¶
By default, Dataiku randomly splits the first N rows of the input dataset into a training set and a test set. The default ratio is:
80% for training, and
20% for testing.
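This means Dataiku takes the first N rows of the dataset and randomly assigns 80% of those rows to the training set. If the dataset is ordered in some way (for example, sorted by date or by the target), the first N rows may not be representative, which could give the model a biased view of the data.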
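To make the sampling behavior concrete, here is a minimal sketch in plain Python of a "first N rows, then random 80/20 split" strategy. It is not Dataiku's implementation; the function name, the seed, and the toy dataset (sorted so that all positive rows come last) are assumptions chosen for illustration.

```python
import random

def first_n_random_split(rows, n, train_ratio=0.8, seed=42):
    """Take the first n rows, shuffle them, and split into train/test."""
    sample = list(rows[:n])              # only the first n rows are considered
    rng = random.Random(seed)            # fixed seed so the split is repeatable
    rng.shuffle(sample)
    cut = round(len(sample) * train_ratio)
    return sample[:cut], sample[cut:]

# Hypothetical dataset of 1000 rows, sorted so the 100 positives come last.
rows = [{"id": i, "high_revenue": i >= 900} for i in range(1000)]

train, test = first_n_random_split(rows, n=500)
positives_in_train = sum(r["high_revenue"] for r in train)

print(len(train), len(test))   # 400 100
print(positives_in_train)      # 0, because no positive row is in the first 500
```

In this toy case the training set never sees a positive example, which is the kind of bias the paragraph above warns about.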
Analyzing high_revenue, the target column of our dataset, we can see that there is a class imbalance.
This could be problematic when taking only the first N rows of the dataset and randomly splitting them into train and test sets. However, since our dataset is small, we’ll keep the default sampling & splitting strategy.
One way to address a class imbalance is to apply a class rebalance sampling method. Visit Settings: Train / Test set to discover how Dataiku allows you to configure sampling and splitting.
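As a rough illustration of what class rebalancing can mean, the sketch below randomly undersamples the majority class until both classes have the same number of rows. This is one common rebalancing technique and is not necessarily how Dataiku's class rebalance sampling works; the function name and the toy data are assumptions.

```python
import random
from collections import Counter

def undersample_majority(rows, target, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))  # sample without replacement
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced data: 10 positives and 90 negatives.
rows = [{"high_revenue": i < 10} for i in range(100)]
before = Counter(r["high_revenue"] for r in rows)

balanced = undersample_majority(rows, "high_revenue")
after = Counter(r["high_revenue"] for r in balanced)

print(before[True], before[False])  # 10 90
print(after[True], after[False])    # 10 10
```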
Configure ML assertions¶
One of the ways to streamline and accelerate the model evaluation process is by automatically checking that predictions for specific subpopulations meet certain conditions.
A business analyst has analyzed the relationship between the top two variables from the Variable importance chart, age_first_order and pages_visited_avg, and the target, high_revenue, to assert the following:
When age_first_order is greater than or equal to 40, the customer is likely to be labeled “high revenue = true” at least 10% of the time.
When pages_visited_avg is between 6 and 12, the customer is likely to be labeled “high revenue = true” at least 10% of the time.
Rather than having to spot check the predicted results, we can add a conditional statement, known as an ML Assertion, to check that the model is behaving intuitively.
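Conceptually, an assertion of this kind filters the rows that match a condition, measures the share of those rows predicted as the expected class, and compares that share to a minimum ratio. The sketch below expresses the two assertions above in plain Python. The function, the `prediction` field name, the sample rows, and the choice to fail when no row matches are all assumptions for illustration, not Dataiku's API or behavior.

```python
def check_assertion(rows, condition, expected_class, min_valid_ratio):
    """Return matching row count, valid ratio, and pass/fail for one assertion."""
    matching = [r for r in rows if condition(r)]
    if not matching:
        # Assumption of this sketch: an assertion with no matching rows fails.
        return {"matching_rows": 0, "valid_ratio": None, "passed": False}
    valid = sum(1 for r in matching if r["prediction"] == expected_class)
    ratio = valid / len(matching)
    return {
        "matching_rows": len(matching),
        "valid_ratio": ratio,
        "passed": ratio >= min_valid_ratio,
    }

# Hypothetical scored rows.
predictions = [
    {"age_first_order": 45, "pages_visited_avg": 7.0, "prediction": True},
    {"age_first_order": 52, "pages_visited_avg": 3.0, "prediction": False},
    {"age_first_order": 41, "pages_visited_avg": 9.5, "prediction": False},
    {"age_first_order": 30, "pages_visited_avg": 8.0, "prediction": False},
    {"age_first_order": 63, "pages_visited_avg": 14.0, "prediction": False},
]

# Assertion 1: age_first_order >= 40, expect "true" at least 10% of the time.
age_result = check_assertion(
    predictions, lambda r: r["age_first_order"] >= 40, True, 0.10
)
# Assertion 2: pages_visited_avg between 6 and 12, same expectation.
pages_result = check_assertion(
    predictions, lambda r: 6 <= r["pages_visited_avg"] <= 12, True, 0.10
)

print(age_result["matching_rows"], age_result["passed"])      # 4 True
print(pages_result["matching_rows"], pages_result["passed"])  # 3 True
```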
To add assertions:
In the Design tab, locate the Basic section.
Choose Debugging, then scroll down or zoom out to view Assertions.
Select Add An Assertion to add the first assertion.
Configure the following conditional statement:
On rows that satisfy all the following conditions
Expected class is
With a valid ratio greater than or equal to
Select Add Another Assertion.
Configure the following conditional statement:
On rows that satisfy a formula
Type the formula below: pages_visited_avg >= 6 && pages_visited_avg <= 12
Ensure that Expected class is set to
Set the valid ratio greater than or equal to
Save your changes.
Now, whenever we train the model, Dataiku will run ML diagnostics, including the assertion checks we just configured. We’ll then be able to find the results of our assertion checks under Metrics and assertions in the model’s Performance section.
To control how variables are preprocessed before training the model, we’ll use the Features handling panel. Here, Dataiku lets you tune the settings for each feature.
Select Features handling in the Features section.
Reject geopoint feature¶
The Role of a variable (or feature) determines whether it is used (Input) or not used (Reject) in the model.
Let’s remove ip_address_geopoint from the model.
Turn off ip_address_geopoint.
This action changes the handling of the feature to Reject.
Disable rescaling behavior¶
Each variable type can be handled differently.
The Type of the variable determines how it should be preprocessed before it is fed to the machine learning algorithm:
Numerical variables are real-valued ones. They can be integer or numerical with decimals.
Categorical variables store nominal values: red/blue/green, a zip code, a gender, etc. A variable that looks Numerical should often be treated as Categorical instead, for example when a numeric “id” is used in lieu of the actual value.
Text is meant for raw blocks of textual data, such as a Tweet, or customer review. Dataiku is able to handle raw text features with specific preprocessing.
The numerical variables age_first_order and pages_visited_avg have been automatically normalized using a standard rescaling (this means that the values are normalized to have a mean of 0 and a variance of 1).
We’ll want to disable this behavior and use No rescaling instead.
Select the checkboxes for the variables age_first_order and pages_visited_avg.
Dataiku displays a menu where you can select the handling of the selected features.
Under Rescaling, select No Rescaling.
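For reference, the standard rescaling mentioned above subtracts the mean from each value and divides by the standard deviation, so the rescaled column has a mean of 0 and a variance of 1. The sketch below shows that arithmetic with Python's standard library; the sample ages are made up, and using the population standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev

def standard_rescale(values):
    """Rescale values to mean 0 and variance 1: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)           # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - mu) / sigma for v in values]

# Hypothetical age_first_order values.
ages = [22, 35, 41, 58, 64]
scaled = standard_rescale(ages)

# The rescaled values have a mean close to 0 and a standard deviation close to 1.
print(mean(scaled), pstdev(scaled))
```

Selecting No rescaling corresponds to passing the original values through unchanged.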
Generating new features can reveal unexpected relationships between the inputs (variables/features) and the target.
We can automatically generate new numeric features using Pairwise linear combinations and Pairwise polynomial combinations of existing numeric features.
The Script tab of a visual analysis includes all of the processors found in the Prepare recipe. Any features created here can be immediately fed to models. Please review lessons on the Prepare recipe and the Lab if this is unfamiliar to you.
In the Features section, select the Feature generation panel.
Select Pairwise linear combinations, then set Enable to Yes.
Select Pairwise polynomial combinations, then set Enable to Yes.
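To give a feel for what these options produce, the sketch below builds combinations for each pair of numeric features. It assumes that a pairwise linear combination means the sum and difference of two features (A+B, A-B) and that a pairwise polynomial combination means their product (A*B). The function and the generated feature names are hypothetical and may not match what Dataiku displays.

```python
from itertools import combinations

def pairwise_features(row, numeric_cols):
    """Generate sum, difference, and product for every pair of numeric columns."""
    generated = {}
    for a, b in combinations(numeric_cols, 2):
        generated[f"{a}+{b}"] = row[a] + row[b]   # linear combination (assumed)
        generated[f"{a}-{b}"] = row[a] - row[b]   # linear combination (assumed)
        generated[f"{a}*{b}"] = row[a] * row[b]   # polynomial combination (assumed)
    return generated

# Hypothetical customer row.
row = {"age_first_order": 40, "pages_visited_avg": 8}
features = pairwise_features(row, ["age_first_order", "pages_visited_avg"])

for name, value in features.items():
    print(name, value)
# age_first_order+pages_visited_avg 48
# age_first_order-pages_visited_avg 32
# age_first_order*pages_visited_avg 320
```

With k numeric features, this produces three new columns for each of the k(k-1)/2 pairs, so the number of generated features grows quickly as inputs are added.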
Retrain the model¶
After altering the model’s settings, you can now train and build some new models.
Select Save and then click Train.
Select Train again to start the session.
Once the session has completed, you can see that the performance of the random forest model has slightly increased.
Evaluate the model from session 2¶
Session 2 results in a Random Forest model with an AUC value that is higher than the first model.
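As a reminder of what the AUC measures, it can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counting as one half. The sketch below computes it by comparing every positive/negative pair. The labels and scores are invented, and the `session_1` and `session_2` names are only stand-ins for two models' predicted probabilities.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities from two sessions.
labels = [True, False, True, False, False]
session_1 = [0.60, 0.70, 0.40, 0.30, 0.20]
session_2 = [0.80, 0.70, 0.75, 0.30, 0.20]

auc_1 = roc_auc(labels, session_1)
auc_2 = roc_auc(labels, session_2)
print(round(auc_1, 3), round(auc_2, 3))  # 0.667 1.0
```

The pairwise loop is quadratic and only suitable for small examples; libraries compute the same quantity from sorted ranks.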
When training is complete, we can go directly to ML diagnostics.
Select Diagnostics in the Result tab of the random forest model to view the results of the ML diagnostics checks.
Dataiku displays Model Information > Training information. Here, we can view warnings and get advice to avoid common pitfalls, including when a feature has a suspiciously high importance, which could be due to a data leak or overfitting.
This is like having a second set of eyes that provides warnings and advice, so that you can identify and correct these issues when developing the model.
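The idea behind a "suspiciously high importance" warning can be sketched as a simple check on each feature's share of the total importance. The rule Dataiku actually applies is not described here, so the 0.5 threshold, the function name, and the `leaky_feature` column below are purely illustrative assumptions.

```python
def flag_dominant_features(importances, threshold=0.5):
    """Return features whose share of total importance exceeds the threshold."""
    total = sum(importances.values())
    if total <= 0:
        return []
    return [name for name, value in importances.items() if value / total > threshold]

# Hypothetical importances; "leaky_feature" is an invented column name.
importances = {
    "campaign": 0.10,
    "age_first_order": 0.15,
    "leaky_feature": 0.75,
}

flagged = flag_dominant_features(importances)
print(flagged)  # ['leaky_feature']
```

A flagged feature is not necessarily a leak; it is a prompt to check whether that column would really be available at prediction time.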
Metrics and assertions¶
Now we can find out if our ML assertion check passed or failed.
Select Metrics and assertions in the Performance section.
Dataiku displays the results of the assertion check. We can see whether or not our assertion checks passed, the number of rows matching the criteria, along with the percentage of valid rows.
Finally, let’s look at the Variable importance chart for the latest model.
Select Variable importance in the Explainability section.
We can see that the importance is spread across the campaign variable along with the features automatically generated from age_first_order and pages_visited_avg. The generated features may have uncovered some previously hidden relationships.
You might find that your actual results are different from those shown. This is due to differences in how rows are randomly assigned to training and testing samples.
Now that you have trained several models, all the results may not fit your screen. To see all your models at a glance:
Go back to the Result tab.
Switch to the Table view.
You can sort the Table view on any column, such as ROC AUC. To do so, just click on the column title.
Congratulations, you just built, evaluated, and tuned your first predictive model using Dataiku!
How do we know, however, if this model to predict high revenue customers is biased? Is it performing similarly for male and female customers, for example?
In Hands-On: Explain Your Model, we’ll spend more time trying to understand and interpret the model’s predictions.