Tutorial | Tune the model (ML Practitioner part 3)

By the end of the tutorial on evaluating a model, you had built a basic model to classify high revenue customers and looked at a few ways to evaluate its performance.

Because modeling is an iterative process, let’s now turn our attention to improving the model’s results and speeding up the evaluation process.

Objectives

In this tutorial, you will:

  • Review the train/test split of the first model training session.

  • Iterate on the design of the model by configuring ML assertions, adjusting the feature handling, and generating new features.

  • Train another session of models with new design settings.

Starting here?

You’ll need to complete the tutorial on creating the model in order to reproduce the steps here.

Return to the model design

We first need to return to the design of the modeling task.

  • Navigate to the High revenue analysis attached to the customers_labeled dataset.

  • Click on the Design tab near the top center of the page.

Configure the train / test split

One setting we may wish to tune is the split between training and testing sets.

  • Click on the Train / Test Set panel of the model’s design.

By default, Dataiku randomly splits the first N rows of the input dataset into a training set and a test set. The default ratio is:

  • 80% for training, and

  • 20% for testing.

This means Dataiku will take the first N rows of the dataset and randomly assign 80% of those rows to the training set. If the rows are stored in a meaningful order, the first N rows may not be representative, which could result in a very biased view of the dataset.
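To make this behavior concrete, here is a minimal sketch in plain Python. It is not Dataiku's implementation: the function name split_first_n, the seed, and the stand-in rows are assumptions made for illustration only.

```python
import random

def split_first_n(rows, n, train_ratio=0.8, seed=42):
    """Keep only the first n rows, then randomly assign them to train and test."""
    sample = list(rows[:n])            # rows after position n are ignored entirely
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(sample)
    cut = round(len(sample) * train_ratio)
    return sample[:cut], sample[cut:]

rows = list(range(1000))               # stand-in for dataset rows, in stored order
train, test = split_first_n(rows, n=100)
print(len(train), len(test))           # 80 20
print(max(train + test))               # 99
```

The last line shows why the order of the dataset matters: no row beyond the first 100 can ever reach either set, however the random split turns out.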

If we return to the customers_labeled dataset in the Flow, and analyze the high_revenue column, our target column, we can see that there is a class imbalance.

A dataset with a class imbalance problem.

This could be problematic when taking only the first N rows of the dataset and randomly splitting them into train and test sets. However, since our dataset is small, we’ll keep the default sampling & splitting strategy.

Note

One way to address a class imbalance is to apply a class rebalance sampling method. Visit the reference documentation to discover how Dataiku allows you to configure sampling and splitting.
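As an illustration of what class rebalancing can mean, the sketch below undersamples the larger class until both classes have the same number of rows. This is only one possible rebalancing technique and is not necessarily what Dataiku does. The undersample helper and the 90/10 example data are hypothetical.

```python
import random
from collections import Counter

def undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until each class matches the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)              # avoid leaving the classes in separate blocks
    return balanced

# Made-up data: 90 rows with high_revenue False, 10 rows with high_revenue True.
rows = [{"high_revenue": False}] * 90 + [{"high_revenue": True}] * 10
balanced = undersample(rows, lambda r: r["high_revenue"])
counts = Counter(r["high_revenue"] for r in balanced)
print(counts[False], counts[True])     # 10 10
```

Undersampling discards data, so it is usually a better fit for large datasets than for a small one like ours.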

Adjust feature handling settings

To control how variables are preprocessed before training the model, we’ll use the Features handling panel. Here, Dataiku lets you tune different settings.

  • Select Features handling in the Features section.

Reject geopoint feature

The Role of a variable (or feature) determines whether it is used (Input) or not used (Reject) in the model.

Let’s remove ip_address_geopoint from the model.

  • Turn off ip_address_geopoint.

This action changes the handling of the feature to Reject.

Handling a geopoint feature in the Design tab of a visual analysis.

Disable rescaling behavior

The Type of the variable is very important to define how it should be preprocessed before it is fed to the machine learning algorithm:

  • Numerical variables are real-valued ones. They can be integers or numbers with decimals.

  • Categorical variables are the ones storing nominal values: red/blue/green, a zip code, a gender, etc. Also, there will often be times when a variable that looks like Numerical should actually be Categorical instead. For example, this will be the case when an “id” is used in lieu of the actual value.

  • Text is meant for raw blocks of textual data, such as a tweet, or customer review. Dataiku is able to handle raw text features with specific preprocessing.

The numerical variables age_first_order and pages_visited_avg have been automatically normalized using a standard rescaling (this means that the values are normalized to have a mean of 0 and a variance of 1).

Let’s disable this behavior, and use No rescaling instead.

  • Select the checkboxes for the variables age_first_order and pages_visited_avg.

Dataiku displays a menu where you can select the handling of the selected features.

  • Under Rescaling, select No Rescaling.

Handling numeric features in the Design tab of a visual analysis.
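For reference, the standard rescaling that we just disabled subtracts the mean of a column from each value and divides by the standard deviation. The sketch below shows that arithmetic in plain Python. The standard_rescale helper and the sample ages are made up for illustration, and Dataiku's own implementation may differ in details such as how it estimates the standard deviation.

```python
from statistics import fmean, pstdev

def standard_rescale(values):
    """Rescale values to a mean of 0 and a variance of 1."""
    mean = fmean(values)
    std = pstdev(values)               # population standard deviation
    if std == 0:
        return [0.0 for _ in values]   # a constant column carries no spread to rescale
    return [(v - mean) / std for v in values]

ages = [22, 35, 47, 51, 60]            # made-up age_first_order values
scaled = standard_rescale(ages)
print(scaled)
print(fmean(scaled), pstdev(scaled))   # approximately 0 and 1
```

With No rescaling, the model receives the original values such as 22 or 60. Tree-based algorithms like random forest are generally insensitive to this choice, whereas linear models can be affected by it.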

Generate new features

Generating new features can reveal unexpected relationships between the inputs (variables/features) and the target.

We can automatically generate new numeric features using Pairwise linear combinations and Pairwise polynomial combinations of existing numeric features.

Note

The Script tab of a visual analysis includes all of the processors found in the Prepare recipe. Any features created here can be immediately fed to models. Please review the article on data preparation in the Lab if this is unfamiliar to you.

  • In the Features section, select the Feature generation panel.

  • Select Pairwise linear combinations; then set Enable to Yes.

  • Select Pairwise polynomial combinations; then set Enable to Yes.

Feature generation in the Design tab of a visual analysis.
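To give an idea of what such generated features look like, here is a rough sketch. It assumes that linear combinations are the sum and difference of each pair of numeric features and that polynomial combinations are their product. The pairwise_features helper, the generated column names, and the sample values are hypothetical and do not reflect Dataiku's internal naming.

```python
from itertools import combinations

def pairwise_features(row, numeric_columns):
    """Build sum, difference, and product features for every pair of columns."""
    generated = {}
    for a, b in combinations(numeric_columns, 2):
        generated[f"{a}+{b}"] = row[a] + row[b]   # pairwise linear combination
        generated[f"{a}-{b}"] = row[a] - row[b]   # pairwise linear combination
        generated[f"{a}*{b}"] = row[a] * row[b]   # pairwise polynomial combination
    return generated

row = {"age_first_order": 30, "pages_visited_avg": 4}   # made-up values
features = pairwise_features(row, ["age_first_order", "pages_visited_avg"])
print(features)
```

With only two numeric features there is a single pair, so three new features are produced. The number of pairs grows quickly as more numeric features are added, which is worth keeping in mind for wider datasets.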

Train new models

After altering the Design settings, you can build some new models.

  • Select Save, and then click Train.

  • Select Train again to start the second training session.

Once the session has completed, you can see that the performance of the random forest model has now slightly increased.

Evaluate the new models

In our case, session 2 resulted in a random forest model with an AUC value that is slightly higher than that of the first model.
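As a reminder of what the AUC measures, it can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it directly from that definition. The roc_auc function, labels, and scores are made up for illustration and are unrelated to the models trained in this tutorial.

```python
def roc_auc(labels, scores):
    """Share of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative row")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 0, 0]                # made-up target values
scores = [0.9, 0.4, 0.35, 0.2, 0.1]     # made-up predicted probabilities
auc = roc_auc(labels, scores)
print(auc)                              # 5 of the 6 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so even a small increase between sessions means the new model orders customers slightly better.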

Diagnostics

When training is complete, we can go directly to ML diagnostics.

  • Select Diagnostics in the Result tab of the random forest model to view the results of the ML diagnostics checks.

ML diagnostics in the result tab of a visual analysis.

Dataiku displays Model Information > Training information. Here, we can view warnings and get advice to avoid common pitfalls, such as a feature with a suspiciously high importance, which could be due to a data leak or overfitting.

This is like having a second set of eyes that provides warnings and advice, so that you can identify and correct these issues when developing the model.
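To illustrate the kind of check involved, the sketch below flags any feature that accounts for most of the total importance. It is not how Dataiku implements its diagnostics: the flag_dominant_features helper, the 0.8 threshold, the leaky_column name, and the importance values are all assumptions chosen for the example.

```python
def flag_dominant_features(importances, threshold=0.8):
    """Return features whose share of the total importance exceeds the threshold."""
    total = sum(importances.values())
    if total <= 0:
        return []
    return [name for name, value in importances.items() if value / total > threshold]

# Made-up importances: leaky_column is a hypothetical feature derived from the target.
importances = {"campaign": 0.05, "leaky_column": 0.9, "age_first_order": 0.05}
flagged = flag_dominant_features(importances)
print(flagged)                          # ['leaky_column']
```

A flagged feature is not proof of a problem. It is a prompt to check whether that column would really be available at prediction time.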

Variable importance

Finally, let’s look at the Variable importance chart for the latest model.

  • Select Variable importance in the Explainability section.

We can see that the importance is spread across the campaign variable along with the features automatically generated from age_first_order and pages_visited_avg. The generated features may have uncovered some previously hidden relationships.

ML Assertion configuration in the Design tab of a visual analysis.

Note

You might find that your actual results are different from those shown. This is due to differences in how rows are randomly assigned to training and testing samples.

Table view

Now that you have trained several models, the results may not all fit on your screen. To see all your models at a glance:

  • Go back to the Result tab.

  • Switch to the Table view.

You can sort the Table view on any column, such as ROC AUC. To do so, just click on the column title.

Table view in the results of a visual analysis.

What’s next?

Congratulations, you just built, evaluated, and tuned your first predictive model using Dataiku!

How do we know, however, if this model to predict high revenue customers is biased? Is it performing similarly for male and female customers, for example?

In the tutorial on explaining models, we’ll spend more time trying to understand and interpret the model’s predictions.