Tutorial | Create the model (ML Practitioner part 1)

In this lesson, we will create a machine learning model to predict whether or not a new customer will become a high-revenue customer. To do this, we’ll train a classification model using historical customer records and order logs from Haiku T-Shirts. This historical data is known as labeled data. Later, we’ll use the model to score unlabeled data.


When evaluating machine learning models, it helps to have a baseline model to compare against, so that we can confirm later models actually perform better as we iterate on them.

We do not expect our initial, or baseline, model to be a high-performing model. We’ll need to evaluate and improve the model, implementing checks to ensure its predictions align with business expectations.
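To make the idea of comparing against a baseline concrete, here is a minimal sketch in plain Python. It is not how Dataiku builds its baseline (in this tutorial, the baseline is the first AutoML training session); as a stand-in, it uses a majority-class predictor, and all labels and predictions are hypothetical toy values.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return the most frequent label seen in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels: True means "high revenue".
train_labels = [False, False, False, True, False, True, False, False]
test_labels = [False, True, False, False, True]

# The stand-in baseline always predicts the majority class.
baseline_label = majority_class_baseline(train_labels)
baseline_preds = [baseline_label] * len(test_labels)
baseline_accuracy = accuracy(test_labels, baseline_preds)

# Hypothetical predictions from a later model iteration.
candidate_preds = [False, True, False, False, False]
candidate_accuracy = accuracy(test_labels, candidate_preds)

# A later iteration is only worth keeping if it beats the baseline.
improved = candidate_accuracy > baseline_accuracy
print(baseline_accuracy, candidate_accuracy, improved)
```

Accuracy is used here only to keep the sketch short; the same comparison applies to any metric you track across training sessions.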

After we’ve deployed our best model to the Flow, we’ll score the unlabeled dataset, labeling the customers as True (high revenue) or False (not high revenue). We’ll do this in the scoring section of the ML Practitioner course series.


You’ll also find this tutorial as part of the Academy course, Machine Learning Basics, which is part of the ML Practitioner learning path.

Business objectives

It costs more to add new customers than it does to keep existing customers. Therefore, the business wants to target high-revenue Haiku T-Shirts customers for the next marketing campaign.

Since the business is focused on finding and labeling (predicting) high-revenue customers, we’ll want to maximize the number of correct predictions while also minimizing the number of times the model incorrectly predicts that a customer is not a high-revenue customer. To do this, we’ll pay close attention to the confusion matrix when evaluating model results.
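As an illustration of what the confusion matrix captures, the following sketch counts the four outcomes for a binary classifier and derives recall, the metric that drops as false negatives increase. It is a plain-Python illustration with hypothetical toy data, not Dataiku's implementation.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for boolean labels."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1  # high-revenue customer correctly identified
        elif actual and not predicted:
            fn += 1  # high-revenue customer missed (false negative)
        elif not actual and predicted:
            fp += 1  # customer wrongly flagged as high revenue
        else:
            tn += 1  # non-high-revenue customer correctly identified
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def recall(counts):
    """Share of actual high-revenue customers that the model found."""
    positives = counts["tp"] + counts["fn"]
    return counts["tp"] / positives if positives else 0.0


# Hypothetical toy data: True means "high revenue".
actual = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

counts = confusion_counts(actual, predicted)
print(counts, recall(counts))
```

Fewer false negatives means higher recall, which is why the false-negative cell of the matrix deserves particular attention for this business objective.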

The business also wants to be able to measure how well the model is aligning with business expectations for specific known cases.

To enable this measurement, we’ll take advantage of Dataiku’s built-in model diagnostics such as model assertions. Model assertions, or ML assertions, act as a “sanity check” to ensure predictions align with business expectations for specific known cases.
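Conceptually, such a check can be sketched as: select the rows that match a known case, then verify that a minimum share of them receive the expected prediction. The code below is only an illustration of that idea in plain Python. The function name, the field names (`campaign`, `pages_visited_avg`), and the thresholds are all hypothetical; in Dataiku, assertions are configured in the visual ML interface rather than written this way.

```python
def check_assertion(rows, predictions, condition, expected_class, min_valid_ratio):
    """Check that enough rows matching `condition` get `expected_class`."""
    matched = [pred for row, pred in zip(rows, predictions) if condition(row)]
    if not matched:
        # No row matches the known case, so nothing can be verified.
        return {"matched": 0, "valid_ratio": None, "passed": False}
    valid_ratio = sum(pred == expected_class for pred in matched) / len(matched)
    return {
        "matched": len(matched),
        "valid_ratio": valid_ratio,
        "passed": valid_ratio >= min_valid_ratio,
    }


# Hypothetical customers and model predictions (True means "high revenue").
rows = [
    {"campaign": True, "pages_visited_avg": 12.0},
    {"campaign": True, "pages_visited_avg": 3.0},
    {"campaign": False, "pages_visited_avg": 15.0},
    {"campaign": True, "pages_visited_avg": 11.0},
    {"campaign": True, "pages_visited_avg": 14.0},
]
predictions = [True, False, True, True, False]

# Hypothetical business expectation: engaged campaign responders
# should mostly be predicted as high revenue.
result = check_assertion(
    rows,
    predictions,
    condition=lambda r: r["campaign"] and r["pages_visited_avg"] > 10,
    expected_class=True,
    min_valid_ratio=0.6,
)
print(result)
```

With these toy values, three rows match the condition and two of them are predicted True, so the check passes at a 0.6 threshold and would fail at a stricter one such as 0.9.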

Machine learning objectives

  • Build a high-performing, interpretable model to predict high-revenue customers.

  • Minimize false negatives; that is, minimize the number of times the model incorrectly predicts that a customer is not a high-revenue customer.

  • Configure ML assertions.


This lesson assumes that you have completed Basics 101, 102, and 103, which are part of the Core Designer Learning Path in Dataiku Academy, prior to beginning this one!

Create your project

From the Dataiku homepage, click +New Project > DSS Tutorials > ML Practitioner > Machine Learning Basics (Tutorial).


You can also download the starter project from this website and import it as a zip file.

  • Select Go to Flow.

You’ll see an initial Flow that creates, prepares, and joins the customers and orders datasets.

Machine learning tutorial data pipeline in Dataiku.

In addition, there is a dataset of unlabeled customers representing the new customers that we want to predict. These customers have been joined with the orders log and prepared in much the same way as the historical customer data.

Alternatively, you can continue in the same project you worked on in Basics 103, by:

  1. Removing the total_sum and count columns from the customers_labeled dataset.

  2. Downloading a copy of the customers_unlabeled.csv file and uploading it to the project.

  3. Preparing the customers_unlabeled dataset to match the schema of the customers_labeled dataset. Remember to use an inner join to join customers_unlabeled with orders_by_customer. You can even copy-paste the Prepare recipe steps from the script you used to prepare the customers_orders_joined dataset.
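In Dataiku, the steps above are carried out with visual Join and Prepare recipes. For readers who want to see the logic of step 3 spelled out, here is a rough plain-Python sketch of an inner join followed by a schema comparison. The key name `customerID`, the other column names, and the sample rows are assumptions made for illustration only.

```python
def inner_join(left_rows, right_rows, key):
    """Keep only left rows whose key also appears on the right.

    Assumes the right side has at most one row per key.
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for left in left_rows:
        right = right_by_key.get(left[key])
        if right is not None:
            joined.append({**right, **left})
    return joined


def schema_diff(reference_columns, candidate_columns):
    """Report columns missing from, or extra in, the candidate schema."""
    reference, candidate = set(reference_columns), set(candidate_columns)
    return {"missing": reference - candidate, "extra": candidate - reference}


# Hypothetical sample rows.
customers_unlabeled = [
    {"customerID": "c1", "campaign": True},
    {"customerID": "c2", "campaign": False},
    {"customerID": "c3", "campaign": True},
]
orders_by_customer = [
    {"customerID": "c1", "pages_visited_avg": 8.5},
    {"customerID": "c3", "pages_visited_avg": 4.0},
]

joined = inner_join(customers_unlabeled, orders_by_customer, "customerID")

# Hypothetical column list for the labeled dataset.
labeled_columns = ["customerID", "campaign", "pages_visited_avg", "high_revenue"]
diff = schema_diff(labeled_columns, joined[0].keys())
print(len(joined), diff)
```

The inner join drops customer c2, who has no matching orders. Any column listed under `missing` or `extra` points to a preparation step that still needs attention.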

Train the baseline model

Our goal is to predict (i.e., make a calculated guess) whether the customer will become a “high revenue” customer. If we can predict this correctly, we can assess the quality of the cohorts of new customers and help the business drive acquisition campaigns and channels more effectively.

  • In the Flow, select the customers_labeled dataset and click on the Lab button in the right panel.

  • Select New Analysis.

  • Give the analysis the more descriptive name High revenue analysis, then click Create Analysis.

Creating a new visual analysis from a training dataset.

Our labeled dataset contains information about each customer, including their age at the time of their first order, the average number of pages visited, and whether or not the customer responded to a campaign. The last column, high_revenue, is a flag for customers generating a lot of revenue based on their purchase history. It will be used as the target variable of our modeling task.

The target variable of the prediction task is the high revenue column.

Now let’s build our baseline model!

  • In the top right corner, go to the Models tab and then click Create first model.

Creating your first machine learning model.

Dataiku displays modeling task choices where you can choose the type of modeling task you want to perform.

  • Select AutoML Prediction since we want to predict high_revenue.

  • Select high_revenue as the target variable.

Dataiku displays automated machine learning templates for creating models depending on what you want to achieve; for example, using machine learning to get some insights on your data or creating a highly performant model.

  • Keep the default Quick Prototypes template on the In-memory (Python) backend and select Create.

  • Select Train on the next screen.

Dataiku guesses the best preprocessing to apply to the features of your dataset before applying the machine learning algorithms.

A few seconds later, Dataiku presents a summary of the results. By default, two classes of algorithms are used on the data:

  • a simple generalized linear model (logistic regression)

  • a more complex ensemble model (random forest)

While training the models, Dataiku runs visual ML diagnostics and displays the results in real time.

You can hover over the visual ML diagnostics as the models are trained to view any warnings.

Baseline machine learning model results.

Later, we’ll take a closer look at the model results including the ML diagnostics.

We can use the results from our baseline model training session as a comparison as we iterate on our model and try to improve its performance.

In the next section, we’ll evaluate our baseline model using the model summary. The model summary includes the following information:

  • the type of model

  • a performance measure; here the Area Under the ROC Curve or AUC is displayed

  • a summary of the most important variables in predicting your target

The AUC is a convenient summary measure: a value of 0.5 corresponds to random guessing, and the closer the value is to 1, the better the model separates the two classes. Here, the Random forest model seems to perform best.
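One way to build intuition for the AUC is to read it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python, using hypothetical labels and scores. It is meant for illustration and is not how Dataiku computes the metric internally.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (True means "high revenue") and predicted probabilities.
y_true = [True, False, True, False, True, False]
scores = [0.9, 0.3, 0.6, 0.7, 0.8, 0.2]

auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9. The pairwise loop is quadratic in the number of rows, which is fine for a demonstration but not for large datasets.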

  • Select the Random forest model to view the main Results page.

Machine learning model summary.

The Summary panel shows an ROC AUC value of approximately 0.791, which is pretty good for this type of application.


You might find that your actual results are different from those shown. This is due to differences in how rows are randomly assigned to training and testing samples.
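The effect of random assignment can be sketched in a few lines of plain Python. The helper below is hypothetical, as are the 20% test ratio and the seed values. It shows that the same seed reproduces the same split, while a different seed shuffles the rows differently and therefore generally yields a different test sample.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=None):
    """Shuffle rows with a seeded generator, then cut off a test sample."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical row identifiers.
customer_ids = list(range(100))

train_a, test_a = train_test_split(customer_ids, seed=1)
train_b, test_b = train_test_split(customer_ids, seed=1)  # same seed as A
train_c, test_c = train_test_split(customer_ids, seed=2)  # different seed

print(test_a == test_b)  # same seed, same split
print(sorted(test_a)[:5], sorted(test_c)[:5])
```

Because the metrics are computed on the test sample, a different split can shift values such as the AUC slightly from one run or instance to another.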

What’s next?

Congratulations! You have successfully built your first prediction model in Dataiku. There is a lot more to be done, however. In Hands-On: Evaluate the Model, we’ll dive deeper into how Dataiku can help you evaluate models like the one you have just built.