Building a first model

Now that our target variable is built, we need to create the complete train set, which will also contain the features. The features are the variables we’ll use to try to predict churn.

Let’s create a new SQL recipe, with the train and events_complete datasets as inputs, and a new SQL dataset named train_enriched as output. The recipe consists of the following code:

SELECT train.*
    , nb_products_seen
    , nb_distinct_product
    , nb_dist_0, nb_dist_1, nb_dist_2
    , amount_bought, nb_product_bought
    , active_time
FROM train
LEFT JOIN (
    -- generate features based on past data
    SELECT user_id
        , COUNT(product_id) AS nb_products_seen
        , COUNT(distinct product_id) AS nb_distinct_product
        , COUNT(distinct category_id_0) AS nb_dist_0
        , COUNT(distinct category_id_1) AS nb_dist_1
        , COUNT(distinct category_id_2) AS nb_dist_2
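        -- (event_type = 'buy_order')::int casts the purchase flag to 0/1, so the
        -- next two sums only take purchase events into account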
        , SUM(price::numeric * (event_type = 'buy_order')::int) AS amount_bought
        , SUM((event_type = 'buy_order')::int) AS nb_product_bought
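        -- number of days between the user's first event and the reference date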
        , EXTRACT(
           EPOCH FROM (
            TIMESTAMP '2014-08-01 00:00:00' - MIN(event_timestamp)
           )
          )/(3600*24)
              AS active_time
    FROM events_complete
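    -- only aggregate events that happened before the reference date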
    WHERE event_timestamp < TIMESTAMP '2014-08-01 00:00:00'
    GROUP BY user_id
    ) features ON train.user_id = features.user_id

This query adds several features to the train dataset (which contains the target variable). The intuition is that the more active a user is, the less likely they are to churn.

We hence create the following features for each user:

  • total number of products seen or bought on the website

  • number of distinct products seen or bought

  • number of distinct categories (to get an idea of the diversity of a user’s purchases)

  • total number of products bought

  • total amount spent

  • and number of days of activity (time between the user’s first recorded event and the reference date)
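
One thing to keep in mind: because of the LEFT JOIN, any user with no event before the reference date ends up with NULL feature values in train_enriched. The quick sanity-check query below (a minimal sketch reusing the column names defined above) shows how many such users there are and what the feature averages look like:

SELECT COUNT(*) AS nb_users
    , SUM((nb_products_seen IS NULL)::int) AS nb_users_without_history
    , AVG(nb_products_seen) AS avg_products_seen
    , AVG(nb_product_bought) AS avg_products_bought
    , AVG(amount_bought) AS avg_amount_bought
FROM train_enriched

These NULL values can either be left to the model's missing-value handling or replaced with zeros in the recipe (for example with COALESCE(nb_products_seen, 0)).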

Once this first limited set of features is created, our training set is ready. As in the Machine Learning Basics course, create a baseline model to predict the target:

"Screenshot of two trained models: Random Forest and Logistic Regression ROC AUC 0.816"

These AUC scores are not that bad for a first model. Let’s deploy it.