Hands-On: Tune the Model

Thus far you have built a basic model to classify high-revenue customers and looked at a few ways to evaluate its performance. Because modeling is an iterative process, let’s now turn our attention to improving the model’s results.

In the Machine Learning Basics (Tutorial) project, return to the Models tab of the High revenue analysis. By default, you’ll be in the tab showing the Results of your training session. Navigate to the Design tab.

Feature Handling

The Design tab is where you can change how the models are built.


To adjust how the variables are used, proceed directly to the Features handling tab, where DSS lets you tune a number of settings.

The Role of a variable (or feature) determines whether it is used (Input) or not used (Reject) by the model.

  • Here, we want to remove ip_address_geopoint from the model.

  • Click on ip_address_geopoint and click the Reject button (or alternatively use the on/off toggle directly):


The Type of the variable is very important to define how it should be preprocessed before it is fed to the machine learning algorithm:

  • Numerical variables are real-valued; they can be integers or numbers with decimals.

  • Categorical variables store nominal values: red/blue/green, a zip code, a gender, etc. Often a variable that looks Numerical should actually be treated as Categorical, for example when an “id” stands in for the actual value.

  • Text is meant for raw blocks of textual data, such as a tweet or a customer review. Dataiku DSS is able to handle raw text features with specific preprocessing.
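To make the distinction concrete, here is a hedged sketch of how numerical and categorical columns are typically preprocessed before reaching an algorithm, using scikit-learn as a stand-in for what DSS does automatically. The column names and values are invented for illustration:

```python
# Illustrative sketch (not DSS's actual implementation): numerical
# columns are rescaled, categorical columns are dummy-encoded.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical sample with one numerical and one categorical column
df = pd.DataFrame({
    "age_first_order": [25.0, 40.0, 31.0, 58.0],   # numerical
    "campaign": ["email", "ads", "email", "none"],  # categorical
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age_first_order"]),
    ("cat", OneHotEncoder(), ["campaign"]),
])

X = preprocess.fit_transform(df)
# One rescaled numeric column plus one dummy column per category
print(X.shape)
```

The categorical column expands into one column per distinct value, which is why declaring the correct type matters: an “id” treated as Numerical would be fed in as a meaningless quantity instead.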

Each type can be handled differently. For instance, the numerical variables age_first_order and pages_visited_avg have been automatically normalized using a standard rescaling, meaning the values are rescaled to have a mean of 0 and a variance of 1.

  • For these two variables, disable this behavior by selecting both variable names in the list again and clicking the No rescaling button:
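For reference, standard rescaling (also called z-scoring) is a simple transform; a minimal sketch with illustrative values:

```python
# Standard rescaling: subtract the mean, divide by the standard
# deviation, so the result has mean 0 and variance 1.
import numpy as np

ages = np.array([22.0, 35.0, 41.0, 58.0])  # illustrative values
rescaled = (ages - ages.mean()) / ages.std()

print(rescaled.mean())  # ~0.0 (up to floating-point error)
print(rescaled.var())   # ~1.0
```

Tree-based models such as random forests are largely insensitive to this kind of monotonic rescaling, which is part of why disabling it is safe to try here.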


After altering these settings, you can now click on Train and build some new models:

The performance of the random forest model has now slightly increased:


Feature Generation


Here we focus on automatically generated features such as linear and polynomial combinations. Note, however, that the Script tab of a visual analysis includes all of the processors found in the Prepare recipe, and any features created there can be fed directly to models. Please review the lessons on the Prepare recipe and the Lab if this is unfamiliar to you.

Return to the Design tab again, and click the Feature generation tab. We can automatically generate new numeric features using Pairwise linear combinations and Polynomial combinations of existing numeric features. Sometimes these generated features can reveal unexpected relationships between the inputs and target.

  • Click on these feature generation methods and set “Enable” to Yes.


When done, train a third session by clicking on the Train button:


Interpreting Results Again

The resulting Random Forest beats the previous one – the AUC value is now higher than in either of the first two models – possibly because of the changes we made to the handling of features.

Looking at the Variables importance chart for the latest model, the importance is spread across the campaign variable along with the features automatically generated from age_first_order and pages_visited_avg, so the generated features may have uncovered some previously hidden relationships. That said, the increase in AUC is modest, so we shouldn’t read too much into it.
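If you want to reproduce these two diagnostics outside DSS, here is a hedged sketch of computing ROC AUC and variable importances for a random forest with scikit-learn, on synthetic data (the dataset and numbers are stand-ins, not the tutorial’s actual data):

```python
# Train a random forest on synthetic data, then compute the two
# quantities DSS reports: ROC AUC on held-out data, and the
# per-feature importances (which sum to 1).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)

# AUC uses the predicted probability of the positive class
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print(f"AUC: {auc:.3f}")
print("importances:", rf.feature_importances_.round(3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is why a small bump near the top of the range is worth less than it might appear.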


Now that you have trained several models, all the results may not fit your screen anymore. To see all your models at a glance, you can switch to the Table view, which can be sorted on any column. Here we have sorted on ROC AUC.


What’s next?

Congratulations, you just built, evaluated, and tuned your first predictive model using DSS!

How do we know, however, if this model to predict high revenue customers is biased? Is it performing similarly for male and female customers, for example?

In the next section on Explainable AI, we’ll spend more time trying to understand and interpret the model’s predictions.