Creating the Prediction Model

Now that we have a dataset ready on which to train models, let’s use machine learning to predict car breakdown.

From the open Lab Script, navigate to the Models tab. If no models exist yet, we want to Create our First Model. Dataiku DSS lets us choose between two types of modeling tasks:

  1. Prediction (or supervised learning): to predict a target variable (including labels), given a set of input features

  2. Clustering (or unsupervised learning): to create groups of observations based on some shared patterns or characteristics

In this case, we are trying to determine whether or not a rental car will have problems. So, opt for a Prediction model.

Dataiku DSS then asks us to select the target variable. In this case, we want to estimate the probability of one of two outcomes, failure or non-failure; in other words, we want to perform two-class classification. Accordingly, choose failure_bin as the target variable.

Once we have picked the type of machine learning problem, we can customize the model through either Automated Machine Learning or Expert Mode.

Automated Machine Learning helps with some important decisions, such as which algorithms to use and how to set their parameters. Select Automated Machine Learning and then Quick Prototypes, the default suggestion.

With the default design prepared, clicking Train will build models using the two default algorithms for this type of problem.

Once we have the results of some initial models, we can return to the Design tab to adjust all settings. For example, after navigating to the Basic > Metrics menu in the left sidebar, we can define how we want model selection to occur. By default, the platform optimizes for AUC (Area Under the ROC Curve), i.e., it picks the model with the best AUC, while the threshold (or probability cut-off) is selected to give the best F1 score. Feature engineering can be tailored in the same way, from which features are used and how they are handled to the options for dimension reduction.
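To make the two default criteria concrete, here is a minimal Python sketch of the idea: pick the model with the best AUC, then choose the probability cut-off that maximizes F1 for that model. This is not Dataiku DSS code and does not reflect how the platform implements it. The labels, the scores, and the names model_a and model_b are made-up illustrations.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_at(labels, scores, threshold):
    """F1 score when predicting failure for every score >= threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(labels, scores):
    """Try each distinct score as a cut-off and keep the one with the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_at(labels, scores, t))


# Made-up validation data: 1 = failure, 0 = non-failure.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
# Made-up predicted failure probabilities from two hypothetical models.
model_scores = {
    "model_a": [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9],
    "model_b": [0.3, 0.6, 0.4, 0.5, 0.7, 0.45, 0.2, 0.55],
}

# Step 1: model selection by AUC.
aucs = {name: auc(labels, s) for name, s in model_scores.items()}
winner = max(aucs, key=aucs.get)

# Step 2: threshold selection by F1, for the winning model only.
cutoff = best_threshold(labels, model_scores[winner])
print(winner, aucs[winner], cutoff)
```

On this toy data, model_a wins with an AUC of 0.875, and the cut-off that maximizes F1 is 0.7. Note that the two steps answer different questions: AUC measures how well a model ranks failures above non-failures regardless of any cut-off, while the F1-based threshold decides where to draw the line once a model has been chosen.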

An important setting is the type of algorithms with which to model the data (under Modeling > Algorithms). In addition, we can define hyperparameters for each of them. For now, we’ll run two machine learning algorithms: Logistic Regression and Random Forest. They come from two classes of algorithms popular for these kinds of problems, linear and tree-based respectively.
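As a rough illustration of how these two classes of algorithms differ, the Python sketch below scores a single car with a linear model and with a small ensemble of decision trees. It is a hand-written toy, not what Dataiku DSS trains: the feature names age_years and mileage_10k, the weights, and the tree rules are all hypothetical, and a real Random Forest learns its trees from random samples of the data rather than using fixed rules.

```python
import math


def linear_score(row, weights, bias):
    """Logistic regression style: a weighted sum squashed into a probability."""
    z = bias + sum(weights[name] * row[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Three hand-written decision trees (hypothetical rules, not learned from data).
def tree_a(row):
    if row["age_years"] > 6:
        return 0.8 if row["mileage_10k"] > 10 else 0.5
    return 0.1


def tree_b(row):
    return 0.9 if row["mileage_10k"] > 15 else 0.3


def tree_c(row):
    return 0.7 if row["age_years"] > 5 else 0.2


def forest_score(row, trees):
    """Random forest style: average the predictions of several trees."""
    return sum(tree(row) for tree in trees) / len(trees)


# One made-up rental car: 8 years old, 120,000 km on the odometer.
car = {"age_years": 8, "mileage_10k": 12}

p_linear = linear_score(car, {"age_years": 0.3, "mileage_10k": 0.1}, bias=-3.0)
p_forest = forest_score(car, [tree_a, tree_b, tree_c])
print(round(p_linear, 3), round(p_forest, 3))
```

The linear model changes its prediction smoothly as each feature grows, whereas the tree-based model changes in steps as features cross thresholds. That difference is one reason it is useful to try one algorithm from each class on the same data.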