Tutorial | Mining association rules and frequent item sets with R and Dataiku#

Overview#

Business case#

Looking for associations between items is a very common method to mine transaction data. A famous example is the so-called “market basket analysis”, where one looks for products frequently bought together at a grocery store or e-commerce site for instance. These kinds of associations, whether between purchases, TV shows or music, can serve as the basis of recommender systems.

In this tutorial, we demonstrate how to mine frequent item sets using an R package, from within Dataiku.

Supporting data#

We’ll be using the 1 million ratings version of the MovieLens dataset. It consists of a series of movie ratings made by users, and we are going to look for pairs of movies frequently rated, and hence presumably seen, by the same users.

This zip archive includes three files:

  • ratings.dat: UserID, MovieID, Rating, timestamp

  • movies.dat: MovieID, Title, Genre

  • users.dat: UserID, Gender, Age, Occupation, Zip Code

Workflow overview#

The final Dataiku pipeline appears below.

Dataiku screenshot of the final project flow.

This Flow includes the following high-level steps:

  • Import raw data from the MovieLens dataset.

  • Wrangle the data into a format ready for mining association rules.

  • Use an R script to generate rules.

Prerequisites#

This tutorial assumes that you are familiar with the basics of Dataiku and with writing R code.

Technical requirements#

  • A proper installation of R on the server running Dataiku.

    Tip

    See the reference documentation if you do not have the R integration installed.

  • An existing R code environment including the arules package or the permission to create a new code environment.

    Note

    Instructions for creating an R code environment can be found in the reference documentation.

Detailed walkthrough#

  1. Create a new blank Dataiku project and name it Association Rules.

Data acquisition#

The data acquisition stage in this case requires uploading three flat files to Dataiku.

  1. Download the MovieLens 1M file and uncompress the zip archive.

  2. One at a time, upload the three files to Dataiku. For each file, before creating the new dataset, navigate to the “Format/Preview” tab and change the type to One record per line.

  3. Append _raw to the end of each file name. For example, movies_raw.
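
If you'd like to sanity check the raw files before uploading them, the short R sketch below (assuming the unzipped .dat files sit in your working directory) shows why the One record per line format is needed: each line is a single record with fields separated by "::".

# Optional sanity check of the raw MovieLens files, outside Dataiku.
# Assumes the unzipped .dat files are in the current working directory.
readLines("ratings.dat", n = 3)   # each line looks like UserID::MovieID::Rating::timestamp
readLines("movies.dat", n = 3)    # MovieID::Title::Genre
readLines("users.dat", n = 3)     # UserID::Gender::Age::Occupation::Zip Code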

Data preparation#

Before we can use this data for mining association rules, a few simple preparation steps in Dataiku are needed.

Clean data#

All three datasets require a similar Prepare recipe. In the Lab, create a new visual analysis on the movies_raw dataset.

  1. Use the Split column processor on the column line, using “::” as the delimiter.

  2. Remove the original line column.

  3. Rename the freshly-generated columns: MovieID, Title, and Genre.

  4. Use the Extract with regular expression processor on the Title column using the pattern ^.*(?<year>\d{4}).*$ in order to create a year column.

  5. Deploy this script to the Flow, simplifying the output dataset name to just movies.

Nearly the same process can be repeated on the remaining two datasets, ratings_raw and users_raw. Only the column names differ.

  1. For ratings_raw, rename columns line_0 to line_3 with the names UserID, MovieID, Rating, and timestamp, respectively.

  2. For users_raw, rename columns line_0 to line_4 with the names UserID, Gender, Age, Occupation, and Zip_code, respectively.

In addition, the step extracting the year with a regular expression can be omitted for these two datasets.
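
For reference, the sketch below mirrors in plain R what the movies Prepare recipe does. It is not part of the Flow; the file path and the handling of titles without a year are assumptions.

# Rough R equivalent of the movies Prepare recipe, for illustration only.
lines  <- readLines("movies.dat")
fields <- strsplit(lines, "::", fixed = TRUE)   # split on the "::" delimiter

movies <- data.frame(
    MovieID = sapply(fields, `[`, 1),
    Title   = sapply(fields, `[`, 2),
    Genre   = sapply(fields, `[`, 3),
    stringsAsFactors = FALSE
)

# Extract the 4-digit year embedded in the title, like the
# "Extract with regular expression" processor.
movies$year <- sub("^.*(\\d{4}).*$", "\\1", movies$Title)

head(movies)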

Join data#

After building the prepared datasets, join all three together with the Join recipe.

  1. Initiate a Join recipe between ratings and users. Name the output dataset transactions.

  2. Use a left join with UserID as the key.

  3. Add movies as a third input dataset by inner joining ratings and movies on the key MovieID.

  4. On the Selected Columns step, add the prefixes User and Movie to their respective columns for greater clarity on the origin of these columns.

  5. Run the recipe, updating the schema to 11 columns.

This will create a completely denormalized dataset, ready for the association rules analysis. At this point, the Flow should appear as below:

Dataiku screenshot of the Flow having finished data import and preparation steps.
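
To make the join logic explicit, here is a hedged dplyr sketch of the same two joins. It is not part of the Flow; it assumes the prepared ratings, users, and movies tables are available as data frames, and the prefixed column names are only indicative.

# Illustrative R equivalent of the Join recipe.
library(dplyr)

transactions <- ratings %>%
    left_join(users, by = "UserID") %>%     # step 2: left join on UserID
    inner_join(movies, by = "MovieID") %>%  # step 3: inner join on MovieID
    # step 4: mimic the column prefixes, e.g. Movie_Title, User_Gender
    rename(Movie_Title = Title, User_Gender = Gender)

head(transactions)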

Mining frequent associations with R#

Mining association rules and frequent item sets is a family of techniques that, in this case, can be used to look for movies frequently reviewed together by users.

The arules R package provides an implementation of the Apriori algorithm (its apriori() function), which we will rely on here.

From the transactions dataset, we only need two pieces of information: a “grouping” key, here the UserID, and an “item” column, here the title of the movie seen:

Dataiku screenshot of a visual analysis.
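
To make that shape concrete, the toy example below (with made-up user IDs and titles) shows how such a two-column table is turned into the “transactions” class that arules expects:

# Toy example of the "transactions" format expected by arules.
library(arules)

toy <- data.frame(
    UserID      = c(1, 1, 2, 2, 2),
    Movie_Title = c("Movie A", "Movie B", "Movie A", "Movie B", "Movie C"),
    stringsAsFactors = FALSE
)

# One "basket" of movie titles per user
baskets <- split(toy$Movie_Title, toy$UserID)
trans   <- as(baskets, "transactions")

inspect(trans)   # two transactions: {Movie A, Movie B} and {Movie A, Movie B, Movie C}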

Create a code environment#

The default built-in R code environment includes popular packages like dplyr and ggplot2, but it does not include arules. Accordingly, create a new R code environment including this package, if one does not already exist.

Note

Users require specific permissions to create, modify and use code environments. If you do not have these permissions, contact your Dataiku administrator.

For detailed instructions on creating a new R environment, please consult the reference documentation.

Create a code recipe#

  1. From the transactions dataset, create a new R code recipe, naming the output dataset associations.

  2. Paste the script below into the code recipe. Run it to produce the output dataset associations.

library(dataiku)
library(arules)

# Recipe inputs: read a sample of the denormalized transactions dataset
transactions <- dkuReadDataset("transactions", samplingMethod="head", nbRows=100000)

# Transform the data into the arules "transactions" format:
# one basket of movie titles per user
transactions <- as(
    split(as.vector(transactions$Movie_Title), as.vector(transactions$UserID)),
    "transactions"
)

# Analyze: mine rules of exactly 2 items, with minimum support and confidence thresholds
rules <- apriori(
    transactions,
    parameter=list(supp=0.02, conf=0.8, target="rules", minlen=2, maxlen=2)
)

# Order the rules by descending lift
rules <- sort(rules, by="lift")

# Recipe outputs: write the rules back to a Dataiku dataset
dkuWriteDataset(as(rules, "data.frame"), "associations")

This script does the following:

  1. Imports the required packages, including the Dataiku R API.

  2. Reads the dataset.

  3. Transforms the dataset into a suitable “transaction” format for the arules functions.

  4. Applies the apriori algorithm with a few parameters:

    • A minimum level of support and confidence (more on this later)

    • minlen and maxlen set to 2, so that only rules made of exactly two items are extracted

  5. Sorts the resulting rules by descending lift.

  6. Writes the resulting data frame into a Dataiku dataset.

Note

Instead of directly running this R recipe, you could have interactively written this script in an R (Jupyter) notebook within Dataiku. Alternatively, you could create a blank R recipe, develop it in RStudio through its integration with Dataiku, and save it back into Dataiku.
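
For example, in a notebook you could quickly explore the mined rules with standard arules helpers before writing them out:

# Interactive exploration of the rules, e.g. in an R notebook.
summary(rules)             # number of rules and distribution of support, confidence, lift
inspect(head(rules, 10))   # the 10 rules with the highest lift (rules are already sorted)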

Interpreting results#

The association rules are stored in the final associations dataset, shown below.

The first column, rules, is structured as {Movie A} => {Movie B}. It can be interpreted as “people who saw Movie A also saw Movie B”.

Dataiku screenshot showing the output dataset.

It then includes three important statistics:

  • The support is the fraction of all users in the dataset who saw both movies.

  • The confidence tells us the percentage of people who saw Movie A that also saw Movie B.

  • Finally, the lift is a measure that helps discard trivial rules, those that appear only because both movies in the rule are popular. It compares the observed joint probability of seeing both movies with the probability expected if the two movies were watched independently: lift = support(A and B) / (support(A) × support(B)).
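
For intuition, here is a toy computation of these three measures using made-up numbers (not taken from the MovieLens data):

# Toy illustration of support, confidence, and lift.
supp_A  <- 0.05   # 5% of users saw Movie A
supp_B  <- 0.04   # 4% of users saw Movie B
supp_AB <- 0.02   # 2% of users saw both: the support of the rule {A} => {B}

confidence <- supp_AB / supp_A             # 0.4: 40% of A's viewers also saw B
lift       <- supp_AB / (supp_A * supp_B)  # 10: the pair occurs 10x more often than if independent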

Rules with a lift higher than 1 are the rules of greatest interest. The higher the value, the stronger the association. Going further, you may want to experiment with adjusting the parameters of the algorithm, making the support, confidence, or rule-length settings more or less restrictive.

In this sample, some of the rules with the highest lift scores make sense. Some rules include two films from the same trilogy (Three Colors), a film and its sequel (Ace Ventura), or two films from the same director (Mallrats and Clerks).

Wrap-up#

Congratulations! You have created a workflow that takes advantage of the visual interface of Dataiku for data wrangling, while also harnessing a statistical technique from an R package.