Solution | Market Basket Analysis#

Overview#

Business case#

Personalization is a huge opportunity for retail and CPG businesses: 80% of companies report seeing an uplift since implementing personalization, which includes recommending relevant products to users.

Retailers use several techniques to build relevant recommendations. One of them is market basket analysis, which retailers use to increase sales by better understanding customer purchasing patterns. It relies on analyzing a large purchase history dataset to identify products that are likely to be purchased together.

One of the most famous examples is the well-known e-commerce giant that makes heavy use of “frequently bought together” items on its product pages. Brick-and-mortar stores can also benefit. For example, based on the analysis, a sports shop could place running shoes next to swimsuits to increase sales.

Overall, it’s a powerful way to generate value through several use cases, such as optimizing product placement both online and offline or offering product bundle deals. By driving additional sales for the retailer and enhancing the shopping experience for customers, market basket analysis also helps build customer loyalty to the brand.

The Solution consists of a data pipeline that computes association rules and identifies product recommendations for customers. In doing so, it opens up a wide range of product and purchasing analyses. Analysts can input their own data and surface the outputs in a dashboard or interactive webapp to analyze their organization’s own transaction data. Data scientists should use this Solution as an initial building block to develop advanced analytics and support decision making. Dataiku can offer roll-out and customization services on demand.

Installation#

From the Design homepage of a Dataiku instance connected to the internet, click + Dataiku Solutions.
Search for and select Market Basket Analysis.
If needed, change the folder into which the Solution will be installed, and click Install.
Follow the modal to either install the technical prerequisites below or request an admin to do it for you.

Note

Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Technical requirements#

To use this Solution, you must meet the following requirements:

Have access to a Dataiku 13+ instance.
To benefit natively from the Solution, your transaction data (see Data requirements) should be in one of the following connections:
- Snowflake
- Google Cloud Platform: BigQuery + GCS (You need both if you want to leverage BigQuery).
- Azure: Azure Blob Storage
- PostgreSQL
- Microsoft SQL Server (With this storage, dates must be stored in ISO 8601 or RFC 822 format).
A Python 3.9 code environment named solution_market-basket-analysis with the following required packages:

mlxtend==0.18.0
dateparser==1.0.0
regex==2022.3.2
flask==2.2.2
Werkzeug==2.3.7

The required core package versions for this code environment are:

pandas==1.1.5
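
One way to confirm that a code environment matches these pins is to compare them against the installed versions. The sketch below is only an illustration: `parse_pins`, `installed_version`, and `find_mismatches` are hypothetical helpers that aren't part of the Solution, and the lookup relies on the standard library's `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the requirements listed above.
PINS = """
mlxtend==0.18.0
dateparser==1.0.0
regex==2022.3.2
flask==2.2.2
Werkzeug==2.3.7
pandas==1.1.5
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, _, version = line.partition("==")
            pins[name] = version
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (expected, found)} for every package that is off-pin."""
    return {
        name: (expected, lookup(name))
        for name, expected in pins.items()
        if lookup(name) != expected
    }
```

Run inside the code environment, `find_mismatches(parse_pins(PINS))` returns an empty dict when every package matches its pin.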

Data requirements#

The Dataiku Flow was initially built using publicly available data. However, the project is intended to be used with your own data, which you can upload using the Dataiku app. A historical transactions dataset is mandatory to run the project. Each row of the dataset should contain the following columns:

Column	Description
Description	Describes the item and is later used as the item identifier.
InvoiceNo	Serves as the transaction identifier.
InvoiceDate	Contextualizes the purchase of an item, within a transaction, on a given date.
CustomerID (in transactions_dataset only)	Identifies the customer who made the purchase.
Country (in transactions_dataset only)	Indicates where the transaction occurred.
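
Before uploading, it can help to check that each row carries the columns above. The following is a minimal sketch, not the Solution's own validation: `validate_rows` is a hypothetical helper, it assumes rows are dictionaries of strings, it treats all five listed columns as required, and it assumes ISO 8601 dates (which this document only mandates for Microsoft SQL Server). The sample rows are made up.

```python
from datetime import datetime

# Column names taken from the table above.
REQUIRED_COLUMNS = ("Description", "InvoiceNo", "InvoiceDate", "CustomerID", "Country")


def validate_rows(rows):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    for index, row in enumerate(rows):
        # A column counts as missing when it is absent or empty.
        missing = [col for col in REQUIRED_COLUMNS if not row.get(col)]
        if missing:
            problems.append(f"row {index}: missing {', '.join(missing)}")
            continue
        try:
            # fromisoformat accepts only a subset of ISO 8601 on older Pythons.
            datetime.fromisoformat(row["InvoiceDate"])
        except ValueError:
            problems.append(f"row {index}: InvoiceDate is not ISO 8601")
    return problems


# Made-up sample rows for illustration.
rows = [
    {"Description": "Blue Pen", "InvoiceNo": "536365",
     "InvoiceDate": "2011-12-09T12:50:00", "CustomerID": "17850", "Country": "France"},
    {"Description": "", "InvoiceNo": "536366",
     "InvoiceDate": "2011-12-09T13:00:00", "CustomerID": "17850", "Country": "France"},
    {"Description": "Party Balloons", "InvoiceNo": "536367",
     "InvoiceDate": "09/12/2011", "CustomerID": "13047", "Country": "Germany"},
]
problems = validate_rows(rows)
```

Here the first row passes, the second is flagged for its empty `Description`, and the third for its non-ISO date.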

Workflow overview#

You can follow along with the sample project in the Dataiku gallery.

The project has the following high-level steps:

Connect your data as an input and select your analysis parameters via the Dataiku app.
Ingest and pre-process the data to be compatible with the association rules computation.
Compute the association rules and filter the most relevant rules for better consumption downstream.
Identify products to recommend to customers based on their past transactions.
Interactively visualize the most frequently bought items, and the products associated with them for smarter product recommendations.
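
To make the recommendation step more concrete, here is a rough sketch of applying rules to each customer's purchase history. It isn't the Solution's implementation: `items_by_customer` and `recommend` are hypothetical helpers, the sample data is made up, and the split used here (outcomes the customer already bought count as repeat-purchase candidates, the rest as cross-sell candidates) is an assumption made for illustration.

```python
from collections import defaultdict


def items_by_customer(rows):
    """Collect the set of distinct items each customer has purchased."""
    purchased = defaultdict(set)
    for row in rows:
        purchased[row["CustomerID"]].add(row["Description"])
    return purchased


def recommend(purchased_items, rules):
    """Split rule outcomes into (repeat-purchase, cross-sell) candidates.

    `rules` is a list of (antecedent, consequent) pairs of frozensets.
    """
    repeat, cross_sell = set(), set()
    for antecedent, consequent in rules:
        # A rule applies when the customer bought every trigger item.
        if antecedent <= purchased_items:
            repeat |= consequent & purchased_items
            cross_sell |= consequent - purchased_items
    return sorted(repeat), sorted(cross_sell)


# Made-up transactions and rules for illustration.
rows = [
    {"CustomerID": "C1", "Description": "Blue Pen"},
    {"CustomerID": "C1", "Description": "Notebook"},
    {"CustomerID": "C2", "Description": "Party Balloons"},
]
rules = [
    (frozenset({"Blue Pen"}), frozenset({"Notebook"})),
    (frozenset({"Blue Pen"}), frozenset({"Ink Refill"})),
    (frozenset({"Party Balloons"}), frozenset({"Candles"})),
]
recommendations = {
    customer: recommend(items, rules)
    for customer, items in items_by_customer(rows).items()
}
```

With this data, customer C1 gets Notebook as a repeat-purchase candidate and Ink Refill as a cross-sell candidate, while C2 only gets Candles as a cross-sell candidate.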

Walkthrough#

Note

In addition to reading this document, we recommend reading the project wiki before beginning. It provides a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Plug and play with your own data and parameter choices#

To begin, you’ll need your own instance of the Dataiku app associated with this Solution.

From the waffle menu in the Design node’s top navigation bar, select Dataiku Apps.
Search for and select the application with the name of this Solution.
Then, click Create App Instance.

Once you have created a new instance, you can walk through the steps of the app to add your data, and select the analysis parameters to run.

Once you’ve built all elements of the Dataiku app you can either continue to the Project View to explore the generated datasets or go straight to the dashboards and webapp to visualize the data. If you’re mainly interested in the visual components of this pre-packaged Solution, feel free to skip over the next section.

Under the hood: The Dataiku app’s underlying Flow#

The Dataiku app is built on top of a Dataiku Flow that has been optimized to accept input datasets and respond to your selected parameters. Let’s quickly walk through the different Flow zones to get an idea of how this was done.

Flow zone	Description
inputs	Contains a single dataset which is populated by ingesting the transactions table defined in the Inputs section of the app. By default it contains a publicly available dataset we’ve provided.
transactions_preprocessing	Looks a bit intimidating but is, in reality, a series of visual steps that clean the data and filter it based on the parameters set in the Dataiku app. It outputs the transactions_preprocessed dataset which is used for association rules computation.
association_rules_computation	Generates all learned association rules between items based on the transaction history and the parameters defined in the app. The five resulting datasets contain the learned association rules, the identified itemsets (products bought together), consequents, antecedents, and a general summary. The project wiki goes into great detail on these datasets and their contents. As you will see when exploring the output datasets, the computation can produce a large number of rules, so it’s important to keep only the most relevant ones based on self-defined thresholds. As a reminder, you can filter rules on four metrics selected in the Dataiku app. Support is the proportion of transactions that contain both itemsets (for example, tomato and onion). Confidence is the conditional probability of getting one itemset (for example, tomato) given another itemset (for example, onion). Lift measures the power of a rule, that is, the strength of the dependency between its itemsets. Conviction measures how strongly a rule’s outcome depends on its trigger.
recommendations_preprocessing (Dedicated to creating recommendations based on the association rules we’ve chosen to use)	Uses the filtering parameters set in the Dataiku app to filter on transaction dates before computing all the distinct purchased items, identifying unique customers, and applying association rules to identify all the associated items that each customer could have purchased based on what they did purchase. This zone results in two product-oriented outputs: (1) repeat-purchase candidates and (2) cross-sell candidates.
cross_sales_recommendations (Dedicated to creating recommendations based on the association rules we’ve chosen to use)	Identifies products that are likely to be purchased together based on the learned association rules, so that you can personalize product recommendations online and place strongly associated items close together in physical stores.
repeat_purchase_recommendations (Dedicated to creating recommendations based on the association rules we’ve chosen to use)	Also identifies products using association rules, but focuses on items that are likely to be repurchased. This lets you tailor promotions and marketing to your existing customer base.
webapp_zone	Isolates the datasets required for the backend of this Solution’s webapps.
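
The four rule metrics described for the association_rules_computation zone can be worked through on a toy example. The sketch below uses the standard textbook definitions and only the Python standard library. It isn't the Solution's own code, the helper names (`support`, `rule_metrics`, `passes`) are hypothetical, and the baskets are made up.

```python
import math


def support(baskets, itemset):
    """Share of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(baskets, antecedent, consequent):
    """Compute the four metrics for the rule antecedent -> consequent.

    Assumes the antecedent appears in at least one basket.
    """
    both = support(baskets, antecedent | consequent)
    sup_a = support(baskets, antecedent)
    sup_c = support(baskets, consequent)
    confidence = both / sup_a          # P(consequent | antecedent)
    lift = confidence / sup_c          # > 1 suggests a positive association
    # Conviction is unbounded when the rule never fails.
    conviction = math.inf if confidence == 1 else (1 - sup_c) / (1 - confidence)
    return {"support": both, "confidence": confidence,
            "lift": lift, "conviction": conviction}


def passes(metrics, thresholds):
    """True when every metric named in `thresholds` reaches its minimum."""
    return all(metrics[name] >= minimum for name, minimum in thresholds.items())


# Made-up baskets: one set of items per transaction.
baskets = [
    {"onion", "tomato"},
    {"onion", "tomato", "garlic"},
    {"onion"},
    {"tomato"},
    {"garlic"},
]
metrics = rule_metrics(baskets, frozenset({"onion"}), frozenset({"tomato"}))
```

For the rule onion → tomato, two of five baskets contain both items (support 0.4), onion appears in three baskets (confidence ≈ 0.67), and tomato appears in three baskets (lift ≈ 1.11, conviction ≈ 1.2). A call such as `passes(metrics, {"support": 0.3, "lift": 1.0})` then mimics threshold-based filtering.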

Further explore your association rules with shareable visualizations#

The Market Basket Analysis Solution comes with a prebuilt dashboard containing:

Three pages of visualizations built with Dataiku charts to make the project’s datasets easier to consume.
Two pages with interactive webapps to allow you to explore the association rules and product recommendations derived from your transactions data.

Note

Parameters selected in the Dataiku app impact both the charts and webapps so the final renderings in your own projects may differ as a result.

The dashboard charts give a variety of visual ways to understand the transactions dataset both before and after applying preprocessing. These visualizations alone can give an overview of the transactions impacting your market basket analysis, identify purchasing patterns, understand the origin of certain association rules, and better tune the parameters of the Dataiku app to find association rules.

The final two pages of the dashboard contain two webapps: Items frequency analysis and Rules browser. If you are unable to interact with the webapp within the dashboard, you might need to start/restart the webapp backend. You can do this by going to the webapp menu or by running the restart_webapp_backend scenario.

Let’s first take a look at the Items frequency analysis webapp, which allows you to analyze the support of the most frequent items. The images contained in this article use the “Country” column of the transactions dataset as an association rule scope. Within this webapp there are some additional helpers that you can expand to better understand the wording used throughout.

If you configured a rules scope in the Dataiku app, you will need to choose at least one rules scope (for example, one Country). You can additionally choose to focus on items that are common to a specified number of rules and/or select specific items to focus on from the full list of frequent items identified in the transactions dataset. A counter at the top of the webapp shows how many items out of the total item count match the filters set.
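
The idea of item support within a rules scope can be sketched in a few lines. This is an illustration under assumptions, not the webapp's backend: `baskets_by_scope` and `item_support` are hypothetical helpers, rows are assumed to be dictionaries using the column names from the data requirements, and the sample rows are made up.

```python
from collections import Counter, defaultdict


def baskets_by_scope(rows, scope_column="Country"):
    """Group rows into baskets (one set of items per InvoiceNo) per scope value."""
    scoped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        scoped[row[scope_column]][row["InvoiceNo"]].add(row["Description"])
    return {scope: list(invoices.values()) for scope, invoices in scoped.items()}


def item_support(baskets, top_n=10):
    """Return the `top_n` most frequent items as (item, support) pairs."""
    counts = Counter(item for basket in baskets for item in basket)
    return [(item, count / len(baskets)) for item, count in counts.most_common(top_n)]


# Made-up transactions for illustration.
rows = [
    {"InvoiceNo": "1", "Country": "France", "Description": "Blue Pen"},
    {"InvoiceNo": "1", "Country": "France", "Description": "Party Balloons"},
    {"InvoiceNo": "2", "Country": "France", "Description": "Blue Pen"},
    {"InvoiceNo": "3", "Country": "Germany", "Description": "Party Balloons"},
]
scoped = baskets_by_scope(rows)
france_top_items = item_support(scoped["France"])
```

In this sample, Blue Pen appears in both French baskets (support 1.0) and Party Balloons in one of them (support 0.5).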

Let’s end by taking a look at the Rules browser webapp on the final page of the dashboard. This webapp lets you choose among the most frequent items and browse the computed association rules linked to them. Once again, there are expandable helpers to clarify the language used in this webapp. Start by searching for and selecting one or more items (for example, Blue Pen or Party Balloons). Selecting an item displays its associated rules on the right, split into triggers (the antecedent of the rule, that is, the item selected) and outcomes (the consequent of the rule, that is, the items it’s associated with). Hovering over an underlined value provides a quick description of the metric.

You can browse the computed association rules using the other interactive elements of the webapp by:

Swapping between the two tabs of triggers or outcomes.
Filtering rules based on a rule metric threshold.
Ordering results (ascending or descending) by one of the rules metrics.

Additionally, you can export the results of your interactive analysis as a CSV for further sharing and/or analysis.
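
The filter, order, and export interactions can be approximated outside the webapp as well. The sketch below is only an analogy for what the Rules browser offers: `browse_rules` and `to_csv` are hypothetical helpers, the rule dictionaries use assumed field names, and the metric values are made up.

```python
import csv
import io


def browse_rules(rules, metric, threshold, descending=True):
    """Keep rules whose `metric` is at least `threshold`, ordered by that metric."""
    kept = [rule for rule in rules if rule[metric] >= threshold]
    return sorted(kept, key=lambda rule: rule[metric], reverse=descending)


def to_csv(rules):
    """Serialize rule dicts to CSV text, using the first rule's keys as the header."""
    if not rules:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rules[0]))
    writer.writeheader()
    writer.writerows(rules)
    return buffer.getvalue()


# Made-up rules for illustration.
rules = [
    {"trigger": "Blue Pen", "outcome": "Notebook",
     "support": 0.05, "confidence": 0.60, "lift": 2.4},
    {"trigger": "Blue Pen", "outcome": "Party Balloons",
     "support": 0.02, "confidence": 0.25, "lift": 1.1},
    {"trigger": "Party Balloons", "outcome": "Candles",
     "support": 0.04, "confidence": 0.70, "lift": 3.0},
]
selected = browse_rules(rules, metric="lift", threshold=2.0)
csv_text = to_csv(selected)
```

With a lift threshold of 2.0 and descending order, the Candles rule comes first, followed by the Notebook rule, and `csv_text` holds a header line plus one line per kept rule.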

A short note on automation#

It’s possible to automate the Flow of this Solution based on new data, a specific time, etc. via the Project Setup. You can tune all trigger parameters in the Scenarios menu of the project.

Additionally, you can create reporters to send messages to Teams, Slack, email, etc. to keep your full organization informed. You can also run these scenarios ad-hoc as needed. You can find full details on the scenarios and project automation in the wiki.