Tune the model#

You have built a basic model to classify high revenue customers and looked at a few ways to evaluate its performance. Because modeling is an iterative process, let’s now turn our attention to improving the model’s results and speeding up the evaluation process.

Return to the model design#

We first need to return to the design of the modeling task.

  1. Navigate to the High revenue analysis attached to the customers_labeled dataset by clicking on Models within the High revenue analysis/Models/Random forest title at the top of the page.

  2. Click on the Design tab near the top center of the page.

Configure the train / test split#

One setting we may wish to tune is the split between training and testing sets.

  1. Click on the Train / Test Set panel of the model’s design.

By default, Dataiku randomly splits the first N rows of the input dataset into a training set and a test set. The default ratio is:

  • 80% for training.

  • 20% for testing.

This means Dataiku takes the first N rows of the dataset and randomly assigns 80% of those rows to the training set. Because only the first N rows are considered, the sample can give a biased view of the dataset if those rows are not representative of the whole.
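
To make the default behavior concrete, here is a minimal sketch in plain Python. It is not Dataiku's implementation; the function name `split_first_n`, the seed, and the stand-in rows are assumptions made for illustration.

```python
import random

def split_first_n(rows, n, train_ratio=0.8, seed=42):
    """Take the first n rows, shuffle them, and split them into train and test."""
    sample = list(rows[:n])            # only the head of the dataset is considered
    rng = random.Random(seed)          # fixed seed (assumed) so the split is reproducible
    rng.shuffle(sample)
    cut = round(len(sample) * train_ratio)
    return sample[:cut], sample[cut:]

rows = list(range(1000))               # stand-in for the rows of a dataset
train, test = split_first_n(rows, n=100)
print(len(train), len(test))           # 80 20
```

Note that rows 100 to 999 never reach either set, which is why a sorted or otherwise ordered dataset can yield an unrepresentative sample.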

If we return to the customers_labeled dataset in the Flow, and analyze the high_revenue column, our target column, we can see that there is a class imbalance.

A dataset with a class imbalance problem.

This could be problematic when taking only the first N rows of the dataset and randomly splitting it into train and test sets. However, since our dataset is small, we’ll keep the default sampling & splitting strategy.

Note

One way to mitigate a class imbalance is to apply a class rebalance sampling method. Visit the reference documentation to discover how Dataiku allows you to configure sampling and splitting.
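
As a rough illustration of the idea, the sketch below measures class shares and then undersamples the majority class. This is a generic technique, not necessarily what Dataiku's class rebalance option does; the rows are made up, and `class_shares` and `undersample_majority` are hypothetical helper names.

```python
import random
from collections import Counter

def class_shares(labels):
    """Return the fraction of rows that falls in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in counts.items()}

def undersample_majority(rows, label_of, seed=0):
    """Randomly keep, for every class, as many rows as the rarest class has."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

# Made-up rows: one customer in ten is labeled high revenue.
rows = [{"id": i, "high_revenue": i % 10 == 0} for i in range(200)]
shares = class_shares([row["high_revenue"] for row in rows])
balanced = undersample_majority(rows, lambda row: row["high_revenue"])
balanced_counts = Counter(row["high_revenue"] for row in balanced)
print(shares[True], len(balanced))
```

Undersampling discards data, so on a small dataset it trades one problem for another; that is consistent with keeping the default strategy here.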

Adjust feature handling settings#

To control how variables are pre-processed before the model is trained, we’ll use the Features handling panel. Here, Dataiku lets you tune the settings for each feature.

  1. Select Features handling in the Features section.

Reject geopoint feature#

The Role of a variable (or feature) determines whether it is used in the model (Input) or not (Reject).

Let’s remove ip_address_geopoint from the model.

  1. Turn off ip_address_geopoint to exclude the feature from the model.

Handling a geopoint feature in the Design tab of a visual analysis.

Disable rescaling behavior#

The Type of the variable is very important to define how it should be preprocessed before it is fed to the machine learning algorithm:

  • Numerical variables are real-valued. They can be integers or decimals.

  • Categorical variables store nominal values: red/blue/green, a zip code, a gender, etc. A variable that looks numerical should often be treated as categorical instead. For example, this is the case when an “id” is used in lieu of the actual value.

  • Text is meant for raw blocks of textual data, such as a social media post or customer review. Dataiku is able to handle raw text features with specific preprocessing.

The numerical variables age_first_order and pages_visited_avg have been automatically normalized using a standard rescaling (this means that the values are normalized to have a mean of 0 and a variance of 1).
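
Standard rescaling subtracts the mean and divides by the standard deviation. The sketch below shows the arithmetic using only the standard library; the sample values are made up, and `standard_rescale` is a hypothetical helper rather than a Dataiku function.

```python
from statistics import fmean, pstdev

def standard_rescale(values):
    """Rescale values to a mean of 0 and a variance of 1."""
    mean = fmean(values)
    std = pstdev(values)               # population standard deviation
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]

ages = [18, 25, 32, 47, 63]            # made-up values for age_first_order
scaled = standard_rescale(ages)
print(scaled)
```

Tree-based models such as random forests split on thresholds, so they are generally insensitive to this kind of monotonic rescaling.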

Let’s disable this behavior, and use No rescaling instead.

  1. Select the checkboxes for the variables age_first_order and pages_visited_avg.

  2. In the menu for handling the two selected features, under Rescaling, select No Rescaling.

Handling numeric features in the Design tab of a visual analysis.

Generate new features#

Generating new features can reveal unexpected relationships between the inputs (variables/features) and the target.

We can automatically generate new numeric features using Pairwise linear combinations and Pairwise polynomial combinations of existing numeric features.

  1. In the Features section, select the Feature generation panel.

  2. Select Pairwise linear combinations.

  3. Set Enable to Yes.

  4. Select Pairwise polynomial combinations.

  5. Set Enable to Yes.

Feature generation in the Design tab of a visual analysis.
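
The two kinds of generated features can be sketched as follows. The sketch assumes that a pairwise linear combination means the sum and difference of two features and that a pairwise polynomial combination means their product; the helper name and the generated column names are made up and may differ from what Dataiku produces.

```python
from itertools import combinations

def pairwise_features(row, numeric_columns):
    """Build sum, difference, and product features for every pair of columns."""
    generated = {}
    for a, b in combinations(numeric_columns, 2):
        generated[f"{a}+{b}"] = row[a] + row[b]   # linear combination (assumed form)
        generated[f"{a}-{b}"] = row[a] - row[b]   # linear combination (assumed form)
        generated[f"{a}*{b}"] = row[a] * row[b]   # polynomial combination (assumed form)
    return generated

row = {"age_first_order": 30.0, "pages_visited_avg": 4.0}   # made-up values
new_features = pairwise_features(row, ["age_first_order", "pages_visited_avg"])
print(new_features)
```

With k numeric features there are k(k-1)/2 pairs, so the number of generated columns grows quadratically.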

Train new models#

After altering the Design settings, you can build some new models.

  1. Select Save, and then click Train.

  2. Select Train again to start the second training session.

Once the session has completed, you can see that the performance of the random forest model has now slightly increased.

Evaluate the new models#

In our case, session 2 resulted in a Random Forest model with an AUC value that is slightly higher than the first model.
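
As a reminder of what the metric measures, ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are made up, and this is not how Dataiku computes the value internally.

```python
def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]                  # made-up ground truth
scores = [0.1, 0.4, 0.35, 0.8]         # made-up predicted probabilities
print(roc_auc(labels, scores))         # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a small gain between sessions means slightly more positive/negative pairs are ordered correctly.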

Diagnostics#

When training is complete, we can go directly to ML diagnostics.

  1. Select Diagnostics in the Result tab of the random forest model to view the results of the ML diagnostics checks.

ML diagnostics in the result tab of a visual analysis.

Dataiku displays Model Information > Training information. Here, we can view warnings and get advice to avoid common pitfalls, such as a feature with suspiciously high importance, which could be due to a data leak or overfitting.

This is like having a second set of eyes that provides warnings and advice, so that you can identify and correct these issues when developing the model.
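
To illustrate the kind of check involved, the sketch below flags any feature whose share of total importance exceeds a threshold. The rule, the 0.8 threshold, the importance values, and the `leaky_column` name are all hypothetical; Dataiku's actual diagnostics may use different criteria.

```python
def flag_dominant_features(importances, threshold=0.8):
    """Return features whose share of total importance exceeds the threshold."""
    total = sum(importances.values())
    if total <= 0:
        return []
    return [name for name, value in importances.items() if value / total > threshold]

# Made-up importances; "leaky_column" is a hypothetical feature.
importances = {"campaign": 0.05, "age_first_order": 0.03, "leaky_column": 0.92}
flagged = flag_dominant_features(importances)
print(flagged)                         # ['leaky_column']
```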

Feature importance#

Finally, let’s look at the Feature importance charts for the latest model.

  1. Select Feature importance in the Explainability section.

  2. At the top of the page, change the Feature importance method to Gini to view the Variable importance chart.

We can see that the importance is spread across the campaign variable along with the features automatically generated from age_first_order and pages_visited_avg. The generated features may have uncovered some previously hidden relationships.

The Feature importance chart for the model.
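
Gini importance in tree ensembles is commonly derived from how much each split on a feature reduces Gini impurity, accumulated over the trees. The sketch below shows the impurity decrease for a single split; it is a simplified illustration with made-up labels, not the full importance computation.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

parent = [1, 1, 0, 0]                  # made-up target values at a tree node
left, right = [1, 1], [0, 0]           # a split that separates the classes perfectly
print(impurity_decrease(parent, left, right))   # 0.5
```

A feature used in many splits with large decreases ends up with a high importance, which is why generated features can rank highly when they separate the classes well.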

You can also switch the Feature importance method back to Shapley to view more feature importance charts such as Feature effects and Feature dependence plots.

Table view#

Now that you have trained several models, the results may not all fit on your screen. To see all your models at a glance:

  1. Go back to the Result tab. You can do this by clicking on Models in the High revenue analysis/Models/Random forest (s2) title.

  2. Switch to the Table view.

You can sort the Table view on any column, such as ROC AUC. To do so, just click on the column title.

Table view in the results of a visual analysis.