Solution | Store Segmentation#

Overview#

Business Case#

The evolving customer landscape requires retailers to maintain a hyper-focus on the consumer. Retailers and their partners are expected to deliver the right products, in the right quantities, in the right stores, and at the right prices. Meeting these expectations requires retailers to develop sales and merchandising strategies that address shopper needs at the local level.

This plug-and-play solution enables you to group stores with similar characteristics using demographic and sales data to optimize operations. This data-driven approach aids in store planning, product allocation, promotion strategies, and assortment execution.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 12.6+ instance.

  • To benefit natively from the solution, your data (see Data Requirements) should be stored in one of the following connections:

    • Snowflake

    • Filesystem managed connection (the Solution is delivered with demo data using this connection type)

  • No code environment is required to use this solution.

Data Requirements#

The Dataiku Flow was initially built using publicly available data. However, this project is meant to be used with your own data, which can be uploaded via the Project Setup. To help you prepare these datasets, you can also refer to the Data Model article in the project wiki.

Below are the input datasets that the solution was built with:

Mandatory Datasets:

stores

| Column | Type | Description |
| --- | --- | --- |
| store_id | [string] | Unique identifier for a store |
| latitude | [double] | Store location latitude coordinate |
| longitude | [double] | Store location longitude coordinate |

transactions

| Column | Type | Description |
| --- | --- | --- |
| transaction_date | [date] | Date of the transaction |
| store_id | [string] | Unique identifier for a store |
| product_id | [string] | Unique identifier for a product |
| product_purchase_price | [double] | Price of the product at time of purchase |
| product_quantity | [double] | Quantity of the product bought |
| transaction_id | [string] | Unique identifier for a transaction |

census_data: mandatory only for the demographic segmentation.

| Column | Type | Description |
| --- | --- | --- |
| census_polygon | [string] | Census area for which all the demographic information is gathered |
| population_count | [integer] | Number of inhabitants per census area |
| age_A_B | [integer] | Number of inhabitants per census area who are between A and B years old |
| income | [integer] | Average income per census area |
| occupation_X | [integer] | Number of inhabitants per census area having job occupation X |
| household_composition_couple | [integer] | Number of inhabitants per census area living as a couple |
| household_composition_family | [integer] | Number of inhabitants per census area living in a family with at least one child |
| household_composition_other | [integer] | Number of inhabitants per census area living in other types of household |
| gender_female | [integer] | Number of female inhabitants per census area |
| gender_male | [integer] | Number of male inhabitants per census area |
| gender_other | [integer] | Number of inhabitants per census area who identified as another gender |
| native | [integer] | Number of inhabitants per census area who have the country’s citizenship |
| non-native | [integer] | Number of inhabitants per census area who don’t have the country’s citizenship |
| work_from_home | [integer] | Number of inhabitants per census area working from home |
| work_outside | [integer] | Number of inhabitants per census area working outside of the home |

products: mandatory only for the sales per category segmentation.

| Column | Type | Description |
| --- | --- | --- |
| product_id | [string] | Unique identifier for a product |
| target_category | [string] | Level of the category(ies) to analyze; contains all the possible category values |
| sub_category_1 | [string] | First sub-category level under the target category |
| sub_category_X | [string] | Any other sub-categories under sub_category_1 |
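
Before loading your own files through the Project Setup, it can be helpful to check that they expose the expected columns. Below is a minimal, hypothetical pandas sketch of such a check; the file names are placeholders, and the column lists simply mirror the tables above (only the stores, transactions, and products datasets are shown):

```python
# Hypothetical sanity check: verify that your input files contain the columns
# listed in the tables above before uploading them through the Project Setup.
# File paths are placeholders; adapt them to your environment.
import pandas as pd

EXPECTED_COLUMNS = {
    "stores": ["store_id", "latitude", "longitude"],
    "transactions": [
        "transaction_date", "store_id", "product_id",
        "product_purchase_price", "product_quantity", "transaction_id",
    ],
    "products": ["product_id", "target_category", "sub_category_1"],
}

def check_schema(name: str, path: str) -> None:
    """Print any expected column that is missing from the file at `path`."""
    df = pd.read_csv(path, nrows=5)  # a few rows are enough to read the header
    missing = set(EXPECTED_COLUMNS[name]) - set(df.columns)
    if missing:
        print(f"{name}: missing columns {sorted(missing)}")
    else:
        print(f"{name}: all expected columns are present")

for dataset, file_path in [
    ("stores", "stores.csv"),
    ("transactions", "transactions.csv"),
    ("products", "products.csv"),
]:
    check_schema(dataset, file_path)
```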

Workflow Overview#

You can follow along with the sample project in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Connect your data as an input, and select your analysis parameters via the Project Setup.

  2. Run the Flow to prepare the data, create the features, and train the models.

  3. Explore the results in the demographic and/or sales per category dashboards.

Walkthrough#

Note

In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Preparation Questions#

Setting up the Solution requires some knowledge of data science. You may want to ask a data scientist or another technical colleague for assistance.

To begin, you will need to answer some questions before setting up the project with your own data and parameters. The questions are:

  • In which country are your stores located?

  • For which time period do you want to analyze store sales performance?

  • Do you want to analyze your stores based on the population living around them? Or do you want to analyze your stores based on your category’s sales performance? Or both?

Analyze stores based on the population around them#

The Solution examines the population living in your store trade areas, which are calculated within the Solution. How far do your trade areas extend from your stores?

  • Do you want to segment your stores based on absolute sub-population counts?

  • Or do you want to segment your stores based on relative sub-population counts?

  • Do you want a specific number of clusters? Or do you prefer to use the default numbers (it will be either 3, 4, or 5 clusters depending on the model’s performance)?

Analyze stores based on a category’s sales performance#

  • Do you want to segment your stores based on total category revenue?

  • Or do you want to segment your stores based on category revenue share within each store?

  • Do you want a specific number of clusters, or do you prefer to use the default numbers (it will be either 3, 4, or 5 clusters depending on the model’s performance)?

With answers to these questions, the data scientist should be able to set up the Solution.

Project Setup#

Here is an overview of the Project Setup and how to use it.

Connection Settings#

Dataiku screenshot of the Connection Settings part of the Project Setup for Store Segmentation

The project is initially shipped with all datasets using the filesystem connection. The user can either leave it this way by not modifying the connection settings section or switch to their preferred connection. First, select the connection from the list of available connections. Then, press the REFRESH button to display the filesystem connection settings.

Important

If the main connection is either Redshift, Synapse, or BigQuery, you will also be asked to select a filesystem connection, because some processes might need to write data to a filesystem connection.

If the main connection is either S3, Azure, or Google Cloud Storage, you will be asked to choose an advanced file format.

Then, press RECONFIGURE to switch the dataset connections on the Flow.

Store Segmentation Strategies#

Dataiku screenshot of the Store Segmentation Strategies part of the Project Setup for Store Segmentation

This section lets you choose one or multiple store segmentation strategies.

Data Loading and Store Preprocessing#

Dataiku screenshot of the Data Loading part of the Project Setup for Store Segmentation

In this section, you can load the datasets required for the segmentation strategies selected above. The datasets need to be stored in a data preparation project, on the same connection as the one selected above. You need to enter the project key of the data preparation project, which you can find in its URL:

Dataiku screenshot of the Data Loading part of the Project Setup for Store Segmentation
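
If you want to double-check which part of the URL is the project key, here is a tiny, purely illustrative Python sketch (the URL below is made up; your instance address and key will differ):

```python
# Purely illustrative: locate the project key inside a Dataiku project URL.
# The URL is a made-up example, not a real instance.
from urllib.parse import urlparse

url = "https://my-instance.example.com/projects/DATA_PREPARATION/flow/"
path_parts = urlparse(url).path.strip("/").split("/")
project_key = path_parts[path_parts.index("projects") + 1]
print(project_key)  # -> DATA_PREPARATION
```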

Once the project key is entered in the text box and the datasets are selected, you can click on the RUN NOW button.

Once the inputs successfully run, you can refresh the page by pressing REFRESH to update the data displayed on the Project Setup page.

Transaction Preprocessing#

Dataiku screenshot of the Transaction Preprocessing part of the Project Setup for Store Segmentation

The Store Segmentation Solution focuses on a certain time period since it uses transactional data.

The first RUN button computes the first and last transaction dates available in the loaded dataset. To display them, click on the REFRESH button.

Then, you can define the time granularity that will be used to visualize the transaction data in the dashboards. You can choose between days, weeks, months, or quarters.

Finally, define the time frame by selecting its end date, either by taking the last date available in your data or by choosing it explicitly, and then by selecting a backward window in months.

You can click on the RUN NOW button to preprocess the transactional data.
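
For intuition only, here is a minimal pandas sketch of this kind of time-window filtering and granularity aggregation. It is not the recipe the solution actually runs; the file name, the 12-month window, and the monthly granularity are assumptions:

```python
# Illustrative only: filter the transactions to a backward window ending at the
# last available date, then aggregate revenue per store at a monthly granularity.
# The file name, the 12-month window, and the "M" granularity are assumptions.
import pandas as pd

transactions = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])

# End of the analysis window: the last transaction date available in the data.
end_date = transactions["transaction_date"].max()
# Backward window of 12 months before the end date.
start_date = end_date - pd.DateOffset(months=12)

window = transactions[
    transactions["transaction_date"].between(start_date, end_date)
].copy()

# Derive a revenue column and the chosen time granularity for visualization.
window["revenue"] = window["product_purchase_price"] * window["product_quantity"]
window["period"] = window["transaction_date"].dt.to_period("M").astype(str)

monthly_revenue = window.groupby(["store_id", "period"])["revenue"].sum().reset_index()
print(monthly_revenue.head())
```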

Demographic Segmentation#

Dataiku screenshot of the Demographic Segmentation section of the Project Setup for Store Segmentation

Store Preprocessing#

If you have selected the demographic segmentation, you can first set up the Geo join recipe for joining the census and store data.

To join the census and store data, we create a trade area around each store. This trade area can be based either on distance (a circular area around the store) or on travel time with a given transportation mode.

First, select the trade area computation method: either distance or travel time.

If it’s distance, define the distance from the store (it must be an integer) and the distance unit: either Kilometers or Miles.

If it’s travel time, define the travel time from the stores in minutes and the transportation mode: either car, bicycle, or pedestrian.
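
As a rough illustration of the distance-based option, the following geopandas sketch builds a circular trade area around each store. It is not the Geo join recipe itself; the file name, the 5 km radius, and the projected CRS are assumptions, and travel-time areas would additionally require an isochrone/routing service:

```python
# Illustrative only: build a circular (distance-based) trade area around each
# store with geopandas. The 5 km radius and the Web Mercator projection are
# assumptions; travel-time areas would require an isochrone/routing service.
import geopandas as gpd
import pandas as pd

stores = pd.read_csv("stores.csv")
stores_gdf = gpd.GeoDataFrame(
    stores,
    geometry=gpd.points_from_xy(stores["longitude"], stores["latitude"]),
    crs="EPSG:4326",
)

# Buffering requires a metric CRS; EPSG:3857 keeps the sketch simple, although
# a local projection would give more accurate distances.
trade_areas = stores_gdf.to_crs(epsg=3857)
trade_areas["geometry"] = trade_areas.geometry.buffer(5_000)  # 5 km radius
trade_areas = trade_areas.to_crs(epsg=4326)  # back to lat/lon for later joins
```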

Demographic Feature Engineering#

After defining the trade area, you can optionally define how the census data and store data will be joined. The default setting is “Join census areas if they are contained in the trade area”.

Regarding the census data feature engineering, you can decide whether to sum the sub-population counts or calculate a percentage of the sub-population over the total population in the trade area.
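
To picture the difference between summed and percentage sub-population counts, here is a hypothetical continuation of the previous sketch (again, not the solution’s own recipes; the census file format and the chosen sub-population columns are assumptions):

```python
# Illustrative continuation of the trade-area sketch: keep census areas that
# fall within each trade area, then build either absolute or relative
# sub-population features. The census file and the chosen columns are assumptions.
import geopandas as gpd

census = gpd.read_file("census_data.geojson")  # census polygons + demographic columns

joined = gpd.sjoin(
    census,
    trade_areas[["store_id", "geometry"]],
    predicate="within",  # "census areas contained in the trade area"
    how="inner",
)

subpop_cols = ["gender_female", "gender_male", "gender_other"]  # illustrative subset
sums = joined.groupby("store_id")[["population_count"] + subpop_cols].sum()

# Option 1: absolute sub-population counts per trade area.
absolute_features = sums[subpop_cols]

# Option 2: relative counts, as a share of the trade-area population.
relative_features = sums[subpop_cols].div(sums["population_count"], axis=0)
```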

Demographic Clustering and Results#

Finally, you can choose the number(s) of clusters you would like to test, or keep the default numbers. If you choose several numbers of clusters to train models on, the best-performing model will be deployed.

You can also choose to define advanced feature rescaling settings: either min-max rescaling or standard rescaling. The default rescaling method is min-max.

If the demographic clustering model has already been trained, you can choose whether or not to retrain it.
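
For readers curious about what selecting the best-performing model over several cluster counts can look like, here is a hedged scikit-learn sketch. The solution configures its clustering through Dataiku’s visual ML, so the algorithm (k-means), the candidate cluster counts, and the min-max rescaling below are illustrative assumptions:

```python
# Illustrative only: rescale the demographic features, train one clustering
# model per candidate number of clusters, and keep the best silhouette score.
# K-means, the (3, 4, 5) candidates, and min-max rescaling are assumptions;
# the solution configures this through Dataiku's visual ML instead.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler  # or StandardScaler

X = MinMaxScaler().fit_transform(relative_features)  # features from the sketch above

best_k, best_score, best_model = None, -1.0, None
for k in (3, 4, 5):  # default candidate numbers of clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, model.labels_)
    if score > best_score:
        best_k, best_score, best_model = k, score, model

print(f"Best number of clusters: {best_k} (silhouette = {best_score:.3f})")
```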

At this point, you can click on RUN NOW to execute the whole demographic branch and then view the results on the demographic dashboards.

Sales per Category Segmentation#

Dataiku screenshot of the Sales per Category Segmentation section of the Project Setup for Store Segmentation

Sales per Category Feature Engineering#

If you have selected the sales per category segmentation strategy, you need to define some feature engineering settings, as well as clustering settings.

As we aggregate the transactional data for each store, we either sum the total sales per category or calculate the category share in each store.

You can also choose which category(ies) you want to analyze. The categories suggested are from the target_category column in the products dataset.
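
The two aggregation options can be sketched with pandas as follows, reusing the `window` dataframe from the transaction preprocessing sketch; the products file name is an assumption, and this is not the solution’s own recipe:

```python
# Illustrative only: aggregate transactions per store and target category,
# either as total revenue or as each category's revenue share within the store.
# Reuses the `window` dataframe from the transaction sketch; the products file
# name is an assumption.
import pandas as pd

products = pd.read_csv("products.csv")
sales = window.merge(products[["product_id", "target_category"]], on="product_id")

# Option 1: total revenue per store and category.
category_revenue = sales.pivot_table(
    index="store_id",
    columns="target_category",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

# Option 2: each category's revenue share within the store (rows sum to 1).
category_share = category_revenue.div(category_revenue.sum(axis=1), axis=0)
```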

Sales per Category Clustering and Results#

Finally, you can choose the number(s) of clusters you would like to test, or keep the default numbers. If you choose several numbers of clusters to train models on, the best-performing model will be deployed.

You can also choose to define advanced feature rescaling settings: either min-max rescaling or standard rescaling. The default rescaling method is min-max.

If the sales per category clustering model has already been trained, you can choose whether or not to retrain it.

At this point, you can click on RUN NOW to execute the whole sales per category branch and then view the results on the sales per category dashboards.

Dashboards#

Once everything runs successfully, you can view the dashboards. Each dashboard has two pages: one for the segmentation results and another for the clustering model analysis.

The first page is typically for a business user, while the second page is for a technical user, such as a data scientist, to monitor model performance.

Demographic Dashboard#

The main idea of the demographic dashboard is to provide insights into the population living around the stores, as well as to monitor the performance of the clustering model.

Tip

The Segmentation Results page is for a business user, while the Clustering Model Analysis page is for a technical user.

Segmentation Results#

This dashboard will help you better understand the demographic patterns specific to each cluster.

To analyze these clusters, you can rely on the map and on the Cluster Identity Card, which describes the characteristics of each cluster. Furthermore, you can explore more insights about the population’s characteristics in each cluster (gender, nationality, age groups, occupation, income, workplace, and household composition).

Dataiku screenshot of the Demographic Map and Cluster Identity Card of the Dashboard for Store Segmentation

Dataiku screenshot of the Demographic Population Charts (part 1) of the Dashboard for Store Segmentation

Clustering Model Analysis#

This page is for a technical user, such as a data scientist, to monitor model performance. It provides insights into the model’s performance, such as the silhouette score, the feature importance, and the cluster profiles.

Dataiku screenshot of the Demographic Feature Importance chart of the Dashboard for Store Segmentation

Sales per Category Dashboard#

The main idea of the sales per category dashboard is to provide insights into the sales performance of the selected product category(ies) by cluster, as well as to monitor the performance of the clustering model.

Thus, the “Segmentation Results” page is for a business user, while the “Clustering Model Analysis” page is for a technical user.

Segmentation Results#

The Sales per Category dashboard will help you understand not only your stores’ performance in each category, but also the performance at the sub-category and product levels.

You will be able to explore the clusters on a map, analyze the evolution and distribution of revenue per cluster, view a radar chart on sub-category performance, and find a table ranking the sub-category and product performances.

Dataiku screenshot of the Sales per Category Map of the Dashboard for Store Segmentation

Dataiku screenshot of the Sales per Category Radar chart of the Dashboard for Store Segmentation

Clustering Model Analysis#

This page is built the same way as the demographic one. It is made for a technical user, such as a data scientist, to monitor model performance. It provides insights into the model’s performance, such as the silhouette score, the feature importance, and the cluster profiles.

Dataiku screenshot of the Demographic Cluster Profiles chart of the Dashboard for Store Segmentation

Reproducing these Processes With Minimal Effort For Your Own Data#

The intent of this project is to enable category managers, marketing teams, and retail real estate managers to have a plug-and-play solution built with Dataiku to optimize product assortments and merchandising strategies, adjust marketing campaigns and promotions, and analyze store performance for relocations, openings, and closures.

We’ve provided several suggestions on how to use store, transaction, product, and census data to cluster stores, but ultimately the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.