Tutorial | Create the model (ML Practitioner part 1)

In this series of tutorials, we will create a machine learning model to predict whether or not a new customer will become a high-revenue customer. To do this, we’ll train a classification model using historical customer records and order logs from a fictional company, Haiku T-Shirts.


In this tutorial, you will:

  • Create a baseline classification model.

  • Examine the summary results of this prediction task.

Business objectives

It costs more to acquire new customers than it does to keep existing ones. Therefore, the business wants to target high-revenue Haiku T-Shirts customers for the next marketing campaign.

Since the business is focused on finding and labeling (predicting) high-revenue customers, we’ll want to maximize the number of correct predictions while also minimizing false negatives, that is, the number of times the model incorrectly predicts that a customer is not a high-revenue customer. To do this, we’ll pay close attention to the confusion matrix when evaluating model results.
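To make the confusion matrix concrete, here is a minimal sketch in plain Python. It is not Dataiku code: the function names and the small label lists are made up for illustration, and 1 is assumed to mean "high revenue". It counts the four cells of the matrix and computes recall, the measure that drops whenever false negatives increase.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels where 1 means high revenue."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # high-revenue customer correctly found
        elif a == 0 and p == 1:
            fp += 1  # customer wrongly flagged as high revenue
        elif a == 1 and p == 0:
            fn += 1  # high-revenue customer the model missed
        else:
            tn += 1  # customer correctly left out
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def recall(counts):
    """Share of actual high-revenue customers that the model found."""
    positives = counts["tp"] + counts["fn"]
    return counts["tp"] / positives if positives else 0.0


# Made-up labels for illustration only (1 = high revenue, 0 = not).
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)                      # {'tp': 3, 'fp': 1, 'fn': 1, 'tn': 3}
print("recall:", recall(counts))   # recall: 0.75
```

In this toy example the model misses one of the four high-revenue customers, so recall is 0.75. Reducing false negatives pushes that value toward 1.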

The business also wants to be able to measure how well the model is aligning with business expectations for specific known cases.

To enable this measurement, we’ll take advantage of Dataiku’s built-in model diagnostics such as model assertions. Model assertions, or ML assertions, act as a “sanity check” to ensure predictions align with business expectations for specific known cases.
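The idea behind such a sanity check can be sketched in plain Python. This is only an illustration of the concept, not Dataiku's implementation or API: the function `check_assertion`, the column `spend_so_far`, the threshold values, and the sample rows are all hypothetical. The check selects the rows that meet a business condition and verifies that a minimum share of them receive the expected prediction.

```python
def check_assertion(rows, predictions, condition, expected_class, min_valid_ratio):
    """Return (passed, valid_ratio, n_matching) for one sanity check.

    rows: list of dicts, one per customer.
    predictions: list of predicted classes, aligned with rows.
    condition: function that returns True for rows the check applies to.
    """
    matching = [pred for row, pred in zip(rows, predictions) if condition(row)]
    if not matching:
        # No row meets the condition, so the check cannot be evaluated.
        return None, None, 0
    valid = sum(1 for pred in matching if pred == expected_class)
    ratio = valid / len(matching)
    return ratio >= min_valid_ratio, ratio, len(matching)


# Hypothetical customers and model predictions (True = high revenue).
rows = [
    {"customer_id": "a", "spend_so_far": 950.0},
    {"customer_id": "b", "spend_so_far": 40.0},
    {"customer_id": "c", "spend_so_far": 1200.0},
    {"customer_id": "d", "spend_so_far": 15.0},
]
predictions = [True, False, True, True]

# Hypothetical business expectation: customers who have spent very little
# should almost never be predicted as high revenue.
passed, ratio, n_matching = check_assertion(
    rows,
    predictions,
    condition=lambda row: row["spend_so_far"] < 50,
    expected_class=False,
    min_valid_ratio=0.9,
)
print(passed, ratio, n_matching)  # False 0.5 2
```

Here two customers meet the condition but only one gets the expected prediction, so the check fails. A failed check like this is a prompt to investigate the model before trusting it.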

Machine learning objectives

  • Build a high-performing, interpretable model to predict high-revenue customers.

  • Minimize false negatives, that is, the number of times the model incorrectly predicts that a customer is not a high-revenue customer.

  • Configure ML assertions.


Create your project

Let’s get started!

  • From the Dataiku homepage, click +New Project > DSS Tutorials > ML Practitioner > Machine Learning Basics (Tutorial).


You can also download the starter project from this website and import it as a zip file.

  • Select Go to Flow.

You’ll see an initial Flow that imports, prepares, and joins the customers and orders datasets.

Machine learning tutorial data pipeline in Dataiku.

In addition, there is a customers_unlabeled dataset representing the new customers that we want to predict. These customers have been joined with the orders log and prepared in much the same way as the historical customer data.

Train a baseline model

Our goal is to predict (i.e., make a calculated guess about) whether a customer will become a “high-revenue” customer. If we can predict this correctly, we can assess the quality of cohorts of new customers and help the business run its acquisition campaigns and channels more effectively.

  • In the Flow, select the customers_labeled dataset, and click on the Lab button in the right panel.

  • Select New Analysis.

  • Give the analysis the more descriptive name High revenue analysis.

  • Click Create Analysis.

Creating a new visual analysis from a training dataset.

Our labeled dataset contains information about each customer, including their age at the time of their first order, the average number of pages visited, and whether or not the customer responded to a campaign. The last column, high_revenue, is a flag for customers who generate a lot of revenue based on their purchase history. It will be used as the target variable of our modeling task.

The target variable of the prediction task is the high_revenue column.

Now let’s build our baseline model!

  • In the top right corner, go to the Models tab, and then click Create first model.

Dataiku displays modeling task choices where you can choose the type of modeling task you want to perform.

  • Select AutoML Prediction since we want to predict high_revenue.

  • Select high_revenue as the target variable.

Dataiku displays automated machine learning templates for creating models depending on what you want to achieve; for example, using machine learning to get some insights on your data or creating a highly performant model.

  • Keep the default Quick Prototypes template on the In-memory (Python) backend and select Create.

  • Select Train on the next screen.

Dataiku infers suitable preprocessing to apply to the features of your dataset before applying the machine learning algorithms.

A few seconds later, Dataiku presents a summary of the results. In this case, two classes of algorithms are used on the data:

  • a simple generalized linear model (logistic regression)

  • a more complex ensemble model (random forest)

Baseline machine learning model results.


While training the models, Dataiku runs visual ML diagnostics and displays the results in real time. You can hover over the visual ML diagnostics as the models are trained to view any warnings.

Examine baseline results

We can use the results from our baseline model training session as a comparison as we iterate on our model and try to improve its performance.

Later tutorials will take a closer look at the model results, but we can take a brief tour.

The model summary includes the following information:

  • the type of model

  • a performance measure; here the Area Under the ROC Curve or AUC is displayed

  • a summary of the most important variables in predicting your target

The AUC is a handy summary measure: a value of 0.5 corresponds to random guessing, and the closer the value is to 1, the better the model separates the two classes. Here the Random forest model has the highest AUC.
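One way to build intuition for the AUC is to read it as the probability that a randomly chosen high-revenue customer receives a higher score than a randomly chosen customer who is not high revenue. The sketch below computes it that way in plain Python. It is an illustration only, not how Dataiku computes the metric, and the function name and score lists are made up.

```python
def roc_auc(actual, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores (1 = high revenue, 0 = not).
actual = [1, 1, 1, 0, 0, 0]
perfect = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]        # every positive outranks every negative
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # the scores carry no information
mixed = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]          # one positive is ranked below a negative

print(roc_auc(actual, perfect))         # 1.0
print(roc_auc(actual, uninformative))   # 0.5
print(round(roc_auc(actual, mixed), 3)) # 0.889
```

The pairwise loop is quadratic in the number of rows, which is fine for a small illustration. Library implementations typically use a sort-based computation instead.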

  • On the left, select the Random forest model to view its model report.

The Summary panel shows an ROC AUC value of approximately 0.791, well above the 0.5 expected from random guessing and a reasonable starting point for a baseline model.

Machine learning model summary.


You might find that your actual results are different from those shown. This is due to differences in how rows are randomly assigned to training and testing samples.
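The effect of random assignment can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Dataiku's splitting code: the function `split_rows`, the 80/20 ratio, and the seed values are chosen only for the example.

```python
import random


def split_rows(row_ids, test_ratio, seed):
    """Shuffle row ids with a seeded generator and split them into train and test."""
    rng = random.Random(seed)   # a local generator, so the global state is untouched
    shuffled = list(row_ids)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


row_ids = list(range(10))

train_a, test_a = split_rows(row_ids, test_ratio=0.2, seed=1)
train_b, test_b = split_rows(row_ids, test_ratio=0.2, seed=2)
train_c, test_c = split_rows(row_ids, test_ratio=0.2, seed=1)

print("seed 1 test rows:", test_a)
print("seed 2 test rows:", test_b)       # usually differs from the seed 1 split
print("same seed, same split:", test_a == test_c)
```

A model evaluated on a different test sample will generally report slightly different metrics, which is why your AUC may not match the value shown here exactly. Reusing the same seed reproduces the same split.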

What’s next?

Congratulations! You have successfully built your first prediction model in Dataiku. There is a lot more to be done, however.

In the next tutorial on evaluating the model, we’ll dive deeper into how Dataiku can help you evaluate models like the one you have just built.