Solution | Store Segmentation#
Overview#
Business Case#
The evolving customer landscape requires an increasingly sharp focus on the consumer. Retailers and their partners are expected to deliver the right products, in the right quantities, in the right stores, and at the right prices. These expectations require retailers to develop sales and merchandising strategies that deliver on shopper needs at the local level.
This plug-and-play solution enables you to group stores with similar characteristics using demographic and sales data to optimize operations. This data-driven approach aids in store planning, product allocation, promotion strategies, and assortment execution.
Installation#
The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.
Dataiku Cloud users should follow the instructions for installing solutions on cloud.
The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.
Once the Solution has been added to your space, move ahead to Data Requirements.
After meeting the technical requirements below, self-managed users can install the Solution in one of two ways:
On your Dataiku instance connected to the internet, click + New Project > Dataiku Solutions > Search for Store Segmentation.
Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.
Additional note for 12.1+ users
If you are using a Dataiku 12.1+ instance and are missing the technical requirements for this Solution, a popup will appear that allows admin users to easily install the requirements, or non-admin users to request the installation of the code environments and/or plugins needed for the Solution.
Admins can process these requests in the admin request center, after which non-admin users can re-trigger the installation of the Solution.
Technical Requirements#
To leverage this solution, you must meet the following requirements:
Have access to a Dataiku 12.6+ instance.
To benefit natively from the solution, your data (see Data Requirements) should be stored in one of the following connections:
Snowflake
Filesystem managed connection (the Solution is delivered with demo data using this connection type)
No code environment is needed for using this solution.
Data Requirements#
The Dataiku Flow was initially built using publicly available data. However, this project is meant to be used with your own data which can be uploaded using the Project Setup. Below are the input datasets that the solution has been built with:
To better prepare these datasets, you can also refer to the Data Model article in the wiki.
Mandatory Datasets:
- stores
| Column | Type | Description |
|---|---|---|
| store_id | [string] | Unique identifier for a store |
| latitude | [double] | Store location latitude coordinate |
| longitude | [double] | Store location longitude coordinate |
- transactions
| Column | Type | Description |
|---|---|---|
| transaction_date | [date] | Date of the transaction |
| store_id | [string] | Unique identifier for a store |
| product_id | [string] | Unique identifier for a product |
| product_purchase_price | [double] | Price of the product at time of purchase |
| product_quantity | [double] | Quantity of the product bought |
| transaction_id | [string] | Unique identifier for a transaction |
- census_data: mandatory only for the demographic segmentation.
| Column | Type | Description |
|---|---|---|
| census_polygon | [string] | Census area for which all the demographic information is gathered |
| population_count | [integer] | Number of inhabitants per census area |
| age_A_B | [integer] | Number of inhabitants per census area who are between A and B years old |
| income | [integer] | Average income per census area |
| occupation_X | [integer] | Number of inhabitants per census area having job occupation X |
| household_composition_couple | [integer] | Number of inhabitants per census area living as a couple |
| household_composition_family | [integer] | Number of inhabitants per census area living in a family with at least one child |
| household_composition_other | [integer] | Number of inhabitants per census area living in other types of household |
| gender_female | [integer] | Number of female inhabitants per census area |
| gender_male | [integer] | Number of male inhabitants per census area |
| gender_other | [integer] | Number of inhabitants who identified as another gender, per census area |
| native | [integer] | Number of inhabitants per census area who have the country's citizenship |
| non-native | [integer] | Number of inhabitants per census area who don't have the country's citizenship |
| work_from_home | [integer] | Number of inhabitants working from home per census area |
| work_outside | [integer] | Number of inhabitants per census area working outside of home |
- products: mandatory only for the sales per category segmentation.
| Column | Type | Description |
|---|---|---|
| product_id | [string] | Unique identifier for a product |
| target_category | [string] | Level of the category(ies) to analyze; contains all the possible category values |
| sub_category_1 | [string] | First sub-category level under the target category |
| sub_category_X | [string] | Any other sub-categories under sub_category_1 |
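As a rough starting point, here is a minimal pandas sketch for checking that your own files match the schemas above before loading them through the Project Setup. The file names and the listed column subsets are placeholders for illustration, not part of the Solution.

```python
import pandas as pd

# Expected columns per input dataset, taken from the tables above.
# File names are placeholders -- adapt them to however you stage your data.
EXPECTED_COLUMNS = {
    "stores": ["store_id", "latitude", "longitude"],
    "transactions": [
        "transaction_date", "store_id", "product_id",
        "product_purchase_price", "product_quantity", "transaction_id",
    ],
    "products": ["product_id", "target_category", "sub_category_1"],
}

def check_dataset(name: str, df: pd.DataFrame) -> None:
    """Warn about any expected column missing from the dataset."""
    missing = set(EXPECTED_COLUMNS[name]) - set(df.columns)
    if missing:
        print(f"{name}: missing columns {sorted(missing)}")
    else:
        print(f"{name}: all expected columns present ({len(df)} rows)")

for name in EXPECTED_COLUMNS:
    check_dataset(name, pd.read_csv(f"{name}.csv"))
```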
Workflow Overview#
You can follow along with the sample project in the Dataiku gallery.
The project has the following high-level steps:
Connect your data as an input, and select your analysis parameters via the Project Setup.
Run the Flow to prepare the data, create the features, and train the models.
Explore the results in the demographic and/or sales per category dashboards.
Walkthrough#
Note
In addition to reading this document, we recommend reading the project wiki before beginning, for a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.
Preparation Questions#
Setting up the Solution requires some knowledge of data science. You may want to ask a data scientist or another technical colleague for assistance.
To begin, you will need to answer some questions before setting up the project with your own data and parameters. The questions are:
In which country are your stores located?
For which time period do you want to analyze store sales performance?
Do you want to analyze your stores based on the population living around them? Or do you want to analyze your stores based on your category’s sales performance? Or both?
Analyze stores based on the population around them#
The Solution examines the population living in your store trade areas, which are calculated within the Solution. How far do your trade areas extend from your stores?
Do you want to segment your stores based on absolute sub-population counts?
Or do you want to segment your stores based on relative sub-population counts?
Do you want a specific number of clusters? Or do you prefer to use the default numbers (it will be either 3, 4, or 5 clusters depending on the model’s performance)?
Analyze stores based on a category’s sales performance#
Do you want to segment your stores based on each category's total revenue, or on each category's revenue share within each store?
Do you want a specific number of clusters, or do you prefer to use the default numbers (it will be either 3, 4, or 5 clusters depending on the model’s performance)?
With answers to these questions, the data scientist should be able to set up the Solution.
Project Setup#
Here is an overview of the Project Setup and how to use it.
Connection Settings#
The project is initially shipped with all datasets using the filesystem connection. You can either leave it this way, by not modifying the connection settings section, or switch to your preferred connection. First, select the connection from the list of available connections. Then, press the REFRESH button to display the filesystem connection settings.
Important
If the main connection is Redshift, Synapse, or BigQuery, you will also be asked to select a filesystem connection, because some processes might need to write to a filesystem connection.
If the main connection is S3, Azure, or Google Cloud Storage, you will be asked to choose an advanced file format.
Then, press RECONFIGURE to switch the dataset connections on the Flow.
Store Segmentation Strategies#
This section lets you choose one or multiple store segmentation strategies.
Data Loading and Store Preprocessing#
In this section, you can load the datasets required for the segmentation strategies selected above. The datasets need to be stored in a data preparation project, on the same connection as the one selected above. You need to enter the project key of the data preparation project, which you can find in its URL. Once the project key is entered in the text box and the datasets are selected, you can click on the RUN NOW button.
Once the inputs successfully run, you can refresh the page by pressing REFRESH to update the data displayed on the Project Setup page.
Transaction Preprocessing#
The Store Segmentation Solution focuses on a certain time period since it uses transactional data.
The first RUN button computes the first and last transaction dates available in the loaded dataset. To display them, click on the REFRESH button.
Then, you can define the time granularity that will be used to visualize the transaction data in the dashboards. You can choose between days, weeks, months, or quarters.
Finally, define the time frame by selecting its last date, either the last date available in your data or an explicitly chosen one. Then, you can select a backward window in months.
You can click on the RUN NOW button to preprocess the transactional data.
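Conceptually, this step restricts the transactions to a backward window ending at the chosen last date and aggregates them at the chosen granularity. The pandas sketch below illustrates that logic for intuition only; the file name, the 12-month window, and the monthly granularity are assumptions, and the Solution performs this preprocessing for you.

```python
import pandas as pd

# Hypothetical settings mirroring the Project Setup choices described above.
GRANULARITY = "M"       # "D" days, "W" weeks, "M" months, "Q" quarters
BACKWARD_WINDOW = 12    # number of months to keep before the end date

transactions = pd.read_csv("transactions.csv", parse_dates=["transaction_date"])

# End date: last transaction available in the data (or set it explicitly).
end_date = transactions["transaction_date"].max()
start_date = end_date - pd.DateOffset(months=BACKWARD_WINDOW)
window = transactions[transactions["transaction_date"].between(start_date, end_date)]

# Revenue per store at the chosen time granularity.
window = window.assign(revenue=window["product_purchase_price"] * window["product_quantity"])
revenue = (
    window.set_index("transaction_date")
          .groupby("store_id")["revenue"]
          .resample(GRANULARITY)
          .sum()
)
print(revenue.head())
```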
Demographic Segmentation#
Store Preprocessing#
If you have selected the demographic segmentation, you can first set up the Geo join recipe for joining the census and store data.
To join the census and store data, we create a trade area around each store. This trade area can be based either on distance (a circular area around the store) or on travel time with a certain transportation mode.
First, select the trade area computation method: either distance or travel time.
If it's distance, define the distance from the store (it must be an integer) and the distance unit: either Kilometers or Miles.
If it's travel time, define the time from the stores in minutes and the transportation mode: either car, bicycle, or pedestrian.
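To illustrate the distance-based option, here is a rough geopandas sketch of a circular trade area around each store. The 5 km radius and file name are assumptions, and this is not the Solution's implementation: the Geo join recipe computes trade areas (including travel-time ones) for you.

```python
import geopandas as gpd
import pandas as pd

TRADE_AREA_KM = 5  # assumed radius; replace with your own setting

stores = pd.read_csv("stores.csv")
stores_gdf = gpd.GeoDataFrame(
    stores,
    geometry=gpd.points_from_xy(stores["longitude"], stores["latitude"]),
    crs="EPSG:4326",
)

# Buffering requires a projected CRS; Web Mercator is a rough approximation,
# and a local projection will be more accurate for your country.
trade_areas = stores_gdf.to_crs(epsg=3857)
trade_areas["geometry"] = trade_areas.geometry.buffer(TRADE_AREA_KM * 1000)
```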
Demographic Feature Engineering#
After defining the trade area, you can optionally define how the census data and store data will be joined. The default setting is “Join census areas if they are contained in the trade area”.
Regarding the census data feature engineering, you can decide whether to sum the sub-population counts or to calculate each sub-population's percentage of the total population in the trade area.
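The sketch below illustrates the default join logic and the two feature-engineering options (absolute counts vs. shares). It assumes the trade areas from the previous sketch and, hypothetically, that census_polygon holds geometries as WKT; the column subset is illustrative, so adapt it to how your census data is actually stored.

```python
import geopandas as gpd
import pandas as pd

census = pd.read_csv("census_data.csv")
census_gdf = gpd.GeoDataFrame(
    census,
    geometry=gpd.GeoSeries.from_wkt(census["census_polygon"]),  # assumed WKT
    crs="EPSG:4326",
).to_crs(epsg=3857)

# Default join logic: keep census areas contained in the trade area.
joined = gpd.sjoin(census_gdf, trade_areas[["store_id", "geometry"]],
                   predicate="within", how="inner")

# Option 1 -- absolute counts: sum each sub-population per store.
absolute = joined.groupby("store_id")[
    ["population_count", "gender_female", "gender_male"]
].sum()

# Option 2 -- relative counts: share of each sub-population over the trade
# area population (population_count itself becomes 1).
relative = absolute.div(absolute["population_count"], axis=0)
```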
Demographic Clustering and Results#
Finally, you can choose the number(s) of clusters you would like to test, or keep the default numbers. If you choose several numbers of clusters to train models on, the best-performing model will be deployed.
You can also choose to define advanced feature rescaling settings: either min-max rescaling or standard rescaling. The default rescaling method is min-max.
If the demographic clustering model has already been trained, you can choose to either retrain it or not.
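Conceptually, the Solution trains one clustering model per requested number of clusters and deploys the best-performing one. The scikit-learn sketch below mirrors that idea using the silhouette score (the metric surfaced in the Clustering Model Analysis dashboard) and the two rescaling options; it reuses the relative feature table from the earlier sketch and is only an illustration of the selection logic, not the Solution's actual implementation.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Rescale the demographic features (min-max is the Solution's default).
scaler = MinMaxScaler()          # or StandardScaler() for standard rescaling
X = scaler.fit_transform(relative.fillna(0))

# Train one model per candidate number of clusters and keep the best one,
# using the silhouette score as the performance metric.
best_model, best_score = None, -1.0
for n_clusters in (3, 4, 5):     # the Solution's default candidates
    model = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    labels = model.fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_model, best_score = model, score

print(f"Best model: {best_model.n_clusters} clusters "
      f"(silhouette = {best_score:.3f})")
```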
At this point, you can click on RUN NOW to execute the whole demographic branch and then view the results on the demographic dashboards.
Sales per Category Segmentation#
Sales per Category Feature Engineering#
If you have selected the sales per category segmentation strategy, you need to define some feature engineering settings, as well as clustering settings.
As we aggregate the transactional data for each store, we either sum the total sales per category or calculate the category share in each store.
You can also choose which category(ies) you want to analyze. The categories suggested are from the target_category column in the products dataset.
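For intuition, the following pandas sketch shows the two feature-engineering options (total category revenue vs. category share per store). It assumes the window dataframe from the transaction-preprocessing sketch above and a products file with the columns described in Data Requirements; names are illustrative only.

```python
import pandas as pd

products = pd.read_csv("products.csv")
sales = window.merge(products[["product_id", "target_category"]], on="product_id")

# Option 1 -- total revenue per store and target category.
category_revenue = sales.pivot_table(
    index="store_id", columns="target_category",
    values="revenue", aggfunc="sum", fill_value=0,
)

# Option 2 -- each category's revenue share within the store.
category_share = category_revenue.div(category_revenue.sum(axis=1), axis=0)
```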
Sales per Category Clustering and Results#
Finally, you can choose the number(s) of clusters you would like to test, or keep the default numbers. If you choose several numbers of clusters to train models on, the best-performing model will be deployed.
You can also choose to define advanced feature rescaling settings: either min-max rescaling or standard rescaling. The default rescaling method is min-max.
If the sales per category clustering model has already been trained, you can choose to either retrain it or not.
At this point, you can click on RUN NOW to execute the whole sales per category branch and then view the results on the sales per category dashboards.
Dashboards#
Once everything runs successfully, you can view the dashboards. Each dashboard has two pages: one for the segmentation results and another for the clustering model analysis.
The first page is typically for a business user, while the second page is for a technical user, such as a data scientist, to monitor model performance.
Demographic Dashboard#
The main idea of the demographic dashboard is to provide insights into the population living around the stores, as well as to monitor the performance of the clustering model.
Tip
The Segmentation Results page is for a business user, while the Clustering Model Analysis page is for a technical user.
Segmentation Results#
This dashboard will help you better understand the demographic patterns specific to each cluster.
To analyze these clusters, you can use the map and the Cluster Identity Card, which describes the characteristics of each cluster. Furthermore, you can explore more insights about the population's characteristics in each cluster (gender, nationality, age groups, occupation, income, workplace, and household composition).
Clustering Model Analysis#
This page is for a technical user, such as a data scientist, to monitor model performance. It provides insights into the model’s performance, such as the silhouette score, the feature importance, and the cluster profiles.
Sales per Category Dashboard#
The main idea of the sales per category dashboard is to provide insights into the sales performance of the product category(ies) by cluster, as well as to monitor the performance of the clustering model.
Thus, the “Segmentation Results” page is for a business user, while the “Clustering Model Analysis” page is for a technical user.
Segmentation Results#
The Sales per Category dashboard will help you understand not only your stores’ performance in each category, but also the performance at the sub-category and product levels.
You will be able to explore the clusters on a map, analyze the evolution and distribution of revenue per cluster, view a radar chart on sub-category performance, and find a table ranking the sub-category and product performances.
Clustering Model Analysis#
This page is built the same way as the demographic one. It is made for a technical user, such as a data scientist, to monitor model performance. It provides insights into the model’s performance, such as the silhouette score, the feature importance, and the cluster profiles.
Reproducing these Processes With Minimal Effort For Your Own Data#
The intent of this project is to enable category managers, marketing teams, and retail real estate managers to have a plug-and-play solution built with Dataiku to optimize product assortments and merchandising strategies, adjust marketing campaigns and promotions, and analyze store performance for relocation, openings, and closures.
We've provided several suggestions on how to use store, transaction, product, and census data to cluster stores, but ultimately the "best" approach will depend on your specific needs and your data. If you're interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.