Hands-On: Create the Model

In this lesson, you will create an initial, or baseline, machine learning model by analyzing the historical customer records and order logs from Haiku T-Shirts.

The goal is to predict whether a new customer will become a high-value customer, based on the information gathered during their first purchase.

We do not expect our initial model to be a high-performing model. Throughout the hands-on lessons in this course, we will evaluate and improve it. Finally, in the Scoring course that follows, we will use this same model for scoring.

Prerequisites

This lesson assumes that you have completed Basics 101, 102, and 103.

Create Your Project

From the Dataiku homepage, click +New Project > DSS Tutorials > ML Practitioner > Machine Learning Basics (Tutorial). Click on Go to Flow. In the Flow, you can see the steps used in the previous tutorials to create, prepare, and join the customers and orders datasets.

../../../_images/tshirts-ml-flow.png

Additionally, there is a dataset of “unlabeled” customers representing the new customers that we want to predict. These customers have been joined with the orders log and prepared in much the same way as the historical customer data.

Alternatively, you can continue in the same project you worked on in Basics 103 by:

  1. Removing the total_sum and count columns from the customers_labeled dataset.

  2. Downloading a copy of the customers_unlabeled.csv file and uploading it to the project.

  3. Preparing the customers_unlabeled dataset to match the schema of the customers_labeled dataset. Remember to use an inner join to join customers_unlabeled with orders_by_customer. You can even copy-paste the Prepare recipe steps from the script you used to prepare the customers_orders_joined dataset.
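To make the inner join in step 3 concrete, here is a minimal sketch in plain Python. It is not how DSS implements the Join recipe; the join key name customer_id and the sample rows are hypothetical and only illustrate that an inner join keeps customers who appear in both datasets.

```python
# Hypothetical sample rows: the key name "customer_id" and the other
# columns are assumptions made for illustration only.
customers_unlabeled = [
    {"customer_id": "c1", "campaign": True},
    {"customer_id": "c2", "campaign": False},
    {"customer_id": "c3", "campaign": True},
]
orders_by_customer = [
    {"customer_id": "c1", "pages_visited_avg": 4.5},
    {"customer_id": "c3", "pages_visited_avg": 7.0},
]


def inner_join(left, right, key):
    """Keep only rows whose key value appears in both inputs."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the columns of the two matching rows.
            joined.append({**row, **match})
    return joined


joined = inner_join(customers_unlabeled, orders_by_customer, "customer_id")
print([row["customer_id"] for row in joined])  # c2 has no orders, so it is dropped
```

In this sketch, customer c2 disappears from the output because it has no matching row in orders_by_customer, which is the behavior that distinguishes an inner join from a left join.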

Predicting Whether a Customer Will be of High Value

Based on the joined customer and order data, our goal is to predict (i.e., guess) whether the customer will become a “high revenue” customer. If we can predict this correctly, we can assess the quality of cohorts of new users and drive acquisition campaigns and channels more effectively.

In the Flow, select the customers_labeled dataset and click on the Lab button to create a new visual analysis. Give the analysis the more descriptive name High revenue analysis.

../../../_images/lab-creation.png

Our labeled dataset contains personal information about each customer, their device, and their location. The last column, high_revenue, is a flag for customers generating a lot of revenue based on their purchase history. It will be used as the target variable of our modeling task.

Now let’s build our first model!

Click on the Models tab in the visual analysis and then click Create first model. A modal dialog appears where you must choose the type of modeling task you want to perform.

../../../_images/modeling-types.png

Here, we want to predict high_revenue. Choose the Prediction option, and select high_revenue as the target variable. Dataiku DSS gives you complete control over the machine learning algorithms, but since this is our first model, let’s click on Automated Machine Learning.

../../../_images/prediction-style.png

Automated machine learning provides templates to create models depending on what you want to achieve; for example, either using machine learning to get some insights on your data or creating a highly performant model. Let us keep the default Quick Prototypes template on the In-memory (Python) backend and click Create. Click Train on the next screen.

../../../_images/quick-models.png

DSS guesses the best preprocessing to apply to the features of your dataset before applying the machine learning algorithms.
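As a rough illustration of what feature preprocessing can look like, the sketch below rescales a numeric feature and dummy-encodes a categorical one using only the Python standard library. This is an assumption about typical preprocessing, not a description of what DSS does internally; the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale numeric values to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Turn a categorical column into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical feature values, for illustration only.
ages = [20, 30, 40]
devices = ["mobile", "desktop", "mobile"]

print(standardize(ages))
print(one_hot(devices))
```

Rescaling puts numeric features on a comparable scale, and indicator columns let algorithms that expect numbers work with categories.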

A few seconds later, DSS presents a summary of the results of this modeling session. By default, two classes of algorithms are used on the data:

  • a simple generalized linear model (logistic regression)

  • a more complex ensemble model (random forest)

../../../_images/tshirt-ml-model-01.png

The model summaries contain some important information:

  • the type of model

  • a performance measure; here the Area Under the ROC Curve or AUC is displayed

  • a summary of the most important variables in predicting your target

The AUC measure is handy: it ranges from 0 to 1, where 0.5 corresponds to random guessing, and the closer to 1, the better the model. Here the Random forest model has the higher AUC, so it seems to be the better of the two. Click on it, and you will be taken to the main Results page for this specific model.
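For intuition about what the AUC measures, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the labels and scores are made-up values, and this pairwise approach is meant for illustration rather than as the computation DSS performs.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = high revenue, 0 = not high revenue.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.8, 0.3]

print(roc_auc(labels, scores))  # 8 of 9 pairs are ranked correctly
```

A model that ranks every positive above every negative scores 1.0, while one that assigns the same score to everyone scores 0.5.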

../../../_images/tshirt-ml-model-summary.png

The Summary tab shows an AUC value of 0.767, which is pretty good for this type of application. Your actual figure might vary due to differences in how rows are randomly assigned to training and testing samples.
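The effect of random assignment can be sketched as follows. The helper below is hypothetical, and the 80/20 ratio and the seed values are assumptions chosen for illustration; the point is only that a different random draw puts different rows in the test sample, which in turn shifts the reported metric.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=None):
    """Shuffle a copy of the rows and cut it into train and test samples."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 customer records

train_a, test_a = train_test_split(rows, seed=1)
train_b, test_b = train_test_split(rows, seed=2)
train_c, test_c = train_test_split(rows, seed=1)

print(len(train_a), len(test_a))
print(test_a == test_c)  # same seed reproduces the same split
print(test_a == test_b)  # a different seed gives a different split
```

Fixing the seed makes a split reproducible, which is why two runs with different random draws can report slightly different AUC values for the same algorithm.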

To get a better understanding of your model results, Dataiku DSS also offers several different outputs in the left panel. These outputs are grouped into:

  • Interpretation, for assessing model behavior and the effects of features

  • Performance, for evaluating the model, using performance metrics

  • Model Information, for providing more information about the model

What’s next?

Congratulations! You have successfully built your first prediction model in DSS. There is a lot more to be done, however. In the next few lessons, you’ll dive deeper into how DSS can help you evaluate models like the one you have just built.