Solution | Customer Segmentation for Banking
Insightful customer segmentation is a cornerstone of effective business management, marketing, and product development within consumer banking. Many firms have developed deep business knowledge which is applied to their customer pools using business rules logic, slicing the overall customer base into subgroups based on actual or potential revenues, product mix, digital engagement, and much more.
These existing customer analytics provide powerful insight and are often driven by qualitative insights or historical practice. Yet 82% of bank executives say their teams have difficulties identifying new customer segments, which can drive up acquisition costs and reduce retention rates. Leveraging a purely data-driven approach to segmentation introduces the possibility of new perspectives, complementing rather than replacing existing expertise.
The goal of this plug-and-play solution is to use machine learning in the form of a clustering algorithm to identify distinct clusters of customers, which are referred to in our analysis as Segments. Further analysis is carried out to understand these clusters, and how they relate to the bank’s product mix and existing customer tiering approaches.
The Next Best Offer for Banking solution complements the Customer Segmentation solution within the marketing suite for banking. Users can plug in the same data used in this solution to build an initial model. They can also use the segmentation output as an input to the Next Best Offer for Banking solution.
To leverage this solution, you must meet the following requirements:
Have access to a Dataiku 12+ instance.
Have installed the Sankey Charts Plugin.
The solution requires a Python 3.6 built-in environment. For instances with a built-in environment of Python 2, users should create a basic Python 3 code env, and set the project to use this code env.
To benefit natively from the solution’s Dataiku Application, a PostgreSQL or Snowflake connection storing your data (see Data Requirements) is needed. However, the solution comes with demo data available on the filesystem-managed connection.
If the technical requirements are met, this solution can be installed in one of two ways:
On your Dataiku instance click + New Project > Dataiku Solutions > Search for Customer Segmentation for Banking.
Download the .zip project file and upload it directly to your Dataiku instance as a new project.
This solution comes with a simulated dataset that can be used to demo the solution and its components. In order to use the solution on your own data via Plug and Play usage, your input data must match the data model of the demo data (more details can be found in the wiki):
The input data should be separated into five different datasets with the same time frequency:
revenues, which includes the revenues generated by each product per customer over time
product_holdings, which includes product information and duration period of each product held by customers
customers, which includes customers’ static information. Optional columns can be added to this dataset.
balances, which includes balance amounts of each product per customer over time
additional_information, which includes optional additional columns
You can follow along with the solution in the Dataiku gallery.
The project has the following high-level steps:
Connect your data as an input and select your analysis parameters via the Dataiku Application.
Ingest and pre-process the data to be available for segmentation.
Train and apply a Clustering Model.
Apply cluster analysis to our identified customer segments.
Interactively explore your customer segments with a pre-built dashboard.
In addition to reading this document, it is recommended to read the wiki of the project before beginning in order to get a deeper technical understanding of how this solution was created, the different types of data enrichment available, longer explanations of solution-specific vocabulary, and suggested future direction for the solution.
To begin, you will need to create a new instance of the Customer Segmentation for Banking Application. This can be done by selecting the Dataiku Application from your instance home, and clicking Create App Instance. The project is delivered with sample data that should be replaced with your own data, assuming that it adopts the data model described above.
This can be done in three ways:
Data can be uploaded directly from the filesystem in the first section of the Dataiku app.
Data can be connected to your database of choice by selecting an existing connection.
Connection settings and data can be copied from a Next Best Offer for Banking project already built.
For options 1 and 2, users must click the Check button, which loads the data and verifies the schema.
Be sure to refresh the page so that the app can dynamically take your data into account.
With our data selected and loaded into the Flow, we can move to the following App section:
A few parameters must be set to configure the project. First, press the refresh button to populate the reference date dropdown with the available dates. Then select the reference_date; a natural choice is the latest available date. Next, define the lookback period, expressed in months. It can also be interpreted as a reference period: features are computed both on a monthly basis and over the reference period. Finally, select the number of clusters. Standard values range between 3 and 6. Depending on the use case this number can be higher, although interpreting each cluster then requires more extensive work.
The Run button triggers the scenario that rebuilds the whole Flow. This can take from a few minutes to several hours if the input data is very large. Finally, follow the link to the dashboard to access ready-made insights on this segmentation.
Data preparation comprises five Flow zones in the project Flow: Data Input, Date Preparation, Customer Data Preparation, Product Preparation, and Balances and Revenues Preparation.
Beginning with the Data Input Flow zone, when new data is uploaded to the solution via the Dataiku Application, the five starting datasets in this Flow zone are reconfigured and refreshed to incorporate the new data.
The Date Preparation Flow zone creates the initial dates dataset necessary for subsequent data transformations. It does this by identifying unique dates in the product_holdings dataset, extracting year and month values, retrieving the last date per month, and adding the project_reference_date. This value is stored in the project variables and defined in the Dataiku Application.
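The solution performs these steps with recipes in the Flow. As a rough illustration only, the same logic can be sketched in plain Python. The function name, the output field names `year`, `month`, and `last_date`, and the sample dates are assumptions made for this example and are not taken from the project.

```python
from datetime import date

def build_dates_history(holding_dates, reference_date):
    """Return one row per (year, month) with the last observed date in that month."""
    last_per_month = {}
    for d in set(holding_dates):              # keep unique dates only
        key = (d.year, d.month)               # extract year and month
        if key not in last_per_month or d > last_per_month[key]:
            last_per_month[key] = d           # retain the last date of the month
    return [
        {
            "year": year,
            "month": month,
            "last_date": last,
            "project_reference_date": reference_date,  # same value on every row
        }
        for (year, month), last in sorted(last_per_month.items())
    ]

# Hypothetical dates as they might appear in product_holdings.
holding_dates = [date(2023, 1, 15), date(2023, 1, 31), date(2023, 2, 28), date(2023, 2, 28)]
dates_history = build_dates_history(holding_dates, reference_date=date(2023, 2, 28))
```

Here `dates_history` holds one row for January 2023 and one for February 2023, each carrying the reference date.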
Customer Data Preparation takes in customer data, as well as the optional additional customer information, and applies a similar methodology to both datasets. A Join recipe replicates each row of customer information once for every row of the dates_history dataset, which allows us to perform historical computation for every period. Following this, we use the customer data to compute customer age and account age. We then filter out rows that fall outside the lookback period defined in the Dataiku App, and finally use a Group recipe to compute, per customer, the value of each feature for the last month and the average of each feature over the reference period.
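The sketch below mirrors that sequence (cross join with dates, age computation, lookback filter, grouping) in plain Python. It is not the project's implementation, which relies on visual recipes. The column names `birth_date` and `join_date`, the output feature names, and the choice to count the reference month as part of the lookback window are all assumptions.

```python
from datetime import date
from statistics import mean

def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def age_in_years(birth_date, on_date):
    """Age in completed years on `on_date`."""
    had_birthday = (on_date.month, on_date.day) >= (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - (0 if had_birthday else 1)

def customer_features(customers, month_ends, reference_date, lookback_months):
    features = {}
    for c in customers:
        rows = []
        for d in sorted(month_ends):  # cross join: one row per customer per date
            # Assumption: the window covers the reference month and the
            # (lookback_months - 1) months before it.
            if 0 <= months_between(d, reference_date) < lookback_months:
                rows.append({
                    "age": age_in_years(c["birth_date"], d),
                    "account_age_months": months_between(c["join_date"], d),
                })
        if not rows:
            continue  # no observation inside the lookback window
        features[c["customer_id"]] = {
            "age_last_month": rows[-1]["age"],
            "account_age_months_last_month": rows[-1]["account_age_months"],
            "account_age_months_avg": mean(r["account_age_months"] for r in rows),
        }
    return features

# Hypothetical inputs.
customers = [{"customer_id": "C1", "birth_date": date(1990, 6, 15), "join_date": date(2020, 1, 10)}]
month_ends = [date(2022, 11, 30), date(2022, 12, 31), date(2023, 1, 31), date(2023, 2, 28)]
features = customer_features(customers, month_ends, date(2023, 2, 28), lookback_months=3)
```

With a three-month lookback, the November 2022 row is dropped and the features are computed on December through February.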
The Product Preparation Flow zone outputs two key datasets. The first is a correspondence table giving the product and product type for each product id. The second is the result of several visual recipes applied to the product_holdings and dates_history datasets. The specific details of each recipe can be read in the wiki. Here we focus on their output, a dataset named product_holdings_portfolio, which contains customer ids, the defined reference date, the number of subscribed and terminated products at the given reference date, and the number of subscribed and terminated products before the lookback period. Missing values in the data are filled in with 0.
Finally, the Balances and Revenues Preparation Flow zone handles the preparation of the balances_renamed and revenues_renamed datasets. These datasets are joined together and enriched with the correspondence table of products from the previous Flow zone. Following the same methodology as the previous Flow zones, the data is joined with dates_history to properly contextualize it relative to our reference date. The output of the Flow zone is a dataset containing, for each product type:
The total number of products.
The average total balance for the product type over the reference period.
The total balance for the product type during the last month.
The average total revenue for the product type over the reference period.
The total revenue for the product type during the last month.
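To make the shape of this output concrete, here is a minimal plain-Python sketch of the aggregation per customer and product type. The row layout (`customer_id`, `product_id`, `month_end`, `balance`, `revenue`), the output key names, and the sample values are hypothetical, and the rows are assumed to be already restricted to the reference period.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def product_type_features(rows, product_types, reference_date):
    # (customer, product type) -> month -> [total balance, total revenue]
    monthly = defaultdict(lambda: defaultdict(lambda: [0.0, 0.0]))
    products = defaultdict(set)
    for r in rows:
        key = (r["customer_id"], product_types[r["product_id"]])  # correspondence table lookup
        totals = monthly[key][r["month_end"]]
        totals[0] += r["balance"]
        totals[1] += r["revenue"]
        products[key].add(r["product_id"])

    output = {}
    for key, by_month in monthly.items():
        last = by_month.get(reference_date, [0.0, 0.0])  # 0 when nothing is held in the last month
        output[key] = {
            "nb_products": len(products[key]),
            "balance_avg": mean(v[0] for v in by_month.values()),
            "balance_last_month": last[0],
            "revenue_avg": mean(v[1] for v in by_month.values()),
            "revenue_last_month": last[1],
        }
    return output

# Hypothetical data: two savings products and one loan for a single customer.
product_types = {"P1": "savings", "P2": "savings", "P3": "loan"}
jan, feb = date(2023, 1, 31), date(2023, 2, 28)
rows = [
    {"customer_id": "C1", "product_id": "P1", "month_end": jan, "balance": 100.0, "revenue": 1.0},
    {"customer_id": "C1", "product_id": "P2", "month_end": jan, "balance": 200.0, "revenue": 2.0},
    {"customer_id": "C1", "product_id": "P1", "month_end": feb, "balance": 300.0, "revenue": 3.0},
    {"customer_id": "C1", "product_id": "P2", "month_end": feb, "balance": 400.0, "revenue": 4.0},
    {"customer_id": "C1", "product_id": "P3", "month_end": feb, "balance": 1000.0, "revenue": 10.0},
]
by_type = product_type_features(rows, product_types, reference_date=feb)
```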
With our data sufficiently prepared, we are ready to train and deploy our model.
We begin the Segmentation Model Flow zone by joining all of our previously prepared datasets to create the customer_full_data dataset. We use a filter to keep only customers who are active at the given reference date, and then split our data to separate historical data from the reference data (which will be used to train our model). Prior to training, we take logs of the variables for income and of those coming from customer behaviors, revenues, and balances. The reason behind this decision is to avoid having too many outliers: most of these variables exhibit a log-normal distribution, with very high density around low values and a few very large values. Clustering is not an exact science, and the way features are preprocessed and included in the analysis hugely impacts the result, so great care must be taken to ensure that the business hypothesis is well reflected in the model.
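The effect of the log transform on a skewed variable can be seen in a small plain-Python example. The use of `log1p` (the logarithm of 1 + x, which tolerates zero values) is an assumption of this sketch and requires non-negative inputs. The text does not state which exact transform the solution applies, and the balances are made up.

```python
import math
from statistics import median

def log_transform(values):
    """Compress large values; log1p keeps zero balances at zero."""
    return [math.log1p(v) for v in values]

# Hypothetical balances: mostly small values and one very large one.
balances = [120.0, 300.0, 450.0, 800.0, 250_000.0]
logged = log_transform(balances)

# Ratio of the largest value to the median, before and after the transform.
spread_before = max(balances) / median(balances)
spread_after = max(logged) / median(logged)
```

After the transform, the largest value sits much closer to the rest of the distribution, so it weighs less on the distance computations used by the clustering model.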
Prior to the training we also do a bit of feature handling, depending on whether the features are numerical or categorical: categorical variables are dummy encoded and numerical variables are rescaled using standard rescaling. Choices can also be made on the set of variables that are included in the model. For instance, segmentation could be done by focusing mostly on revenues and ignoring demographics. Similarly, segments could be built by using behavioral data and dismissing revenues.
The clustering model is a K-means algorithm, with the number of clusters set programmatically according to the value entered in the Dataiku Application. This number can be adjusted depending on how diverse the customers are. We chose not to detect outliers, as we do not want any customer to be left without a segment. Clusters are automatically renamed based on the variables that most strongly define them. However, the user can replace these with more human-readable names using the edit button, and then rebuild the graphs using the rebuild_graphs scenario provided with the solution. The model is deployed on the Flow and used to score the reference data and historical data with identified segments.
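In the solution, feature handling and training are configured in Dataiku's visual ML interface. For readers who want to see the equivalent steps in code, the following scikit-learn sketch applies standard rescaling to numerical features, dummy encoding to a categorical feature, and K-means with a configurable number of clusters. The column names, the `region` feature, and the synthetic data are hypothetical, and `n_clusters` stands in for the value chosen in the Dataiku Application.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, already log-transformed features (illustrative only).
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "log_income": np.concatenate([rng.normal(m, 0.2, 20) for m in (8, 10, 12)]),
    "log_total_balance": np.concatenate([rng.normal(m, 0.2, 20) for m in (5, 7, 9)]),
    "region": rng.choice(["north", "south"], size=60),
})

n_clusters = 3  # stands in for the number selected in the Dataiku Application

preprocess = ColumnTransformer([
    ("numerical", StandardScaler(), ["log_income", "log_total_balance"]),  # standard rescaling
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),   # dummy encoding
])
model = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=0)),
])

# Every customer receives a segment: K-means has no outlier class.
customers["segment"] = model.fit_predict(customers)
```

Restricting the column lists passed to `ColumnTransformer` is one way to express the choice of focusing on revenues, behavior, or demographics.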
Five Flow zones are used to analyze our identified customer segments. Each Flow zone has output datasets that can be explored for additional analysis, but they all are used to create the visualizations provided in the pre-built dashboard.
The Summary Flow zone computes the quintiles for customer income, age, and account age. Prepare and Group recipes are applied to count the customers per category, for each segment and tier. The two output datasets serve as the base for several graphs on the dashboard, providing in-depth descriptive statistics about the segmentation and tiers.
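As an illustration of this kind of summary, the sketch below buckets customers into income quintiles and counts them per segment using only the Python standard library. The field names and sample values are hypothetical, and the quintile boundaries use the default method of `statistics.quantiles`, which may differ from the method used in the project.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

def quintile_counts(customers, feature):
    """Count customers per (segment, quintile) for one numerical feature."""
    cuts = quantiles([c[feature] for c in customers], n=5)  # four cut points
    counts = Counter()
    for c in customers:
        quintile = bisect_right(cuts, c[feature]) + 1       # 1 (lowest) to 5 (highest)
        counts[(c["segment"], quintile)] += 1
    return counts

# Hypothetical scored customers, alternating between two segments.
customers = [
    {"customer_id": i, "segment": "A" if i % 2 == 0 else "B", "income": 10 * (i + 1)}
    for i in range(10)
]
income_counts = quintile_counts(customers, "income")
```

The same function can be reused for age and account age, or grouped by tier instead of segment by changing the key.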
Cross Sell analyzes the output of the segmentation with respect to cross-sell, and also creates insights about tiers and cross-sell without the need for any segmentation.
Product Mix computes the distribution of product portfolios for each tier and segment.
Transition Analysis focuses on how customers move between segments and serves as the underlying dataset to create the interactive Sankey charts in the dashboard.
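A Sankey chart of this kind is typically fed with counts of customers per (source segment, target segment) pair. A minimal sketch of that computation follows. It is not the project's recipe, and the segment names and customer ids are invented.

```python
from collections import Counter

def segment_transitions(previous, current):
    """Count customers per (segment at earlier date, segment at later date).

    Both arguments map customer id to segment. Customers present at only
    one of the two dates are ignored in this sketch.
    """
    shared = previous.keys() & current.keys()
    return Counter((previous[cid], current[cid]) for cid in shared)

# Hypothetical segment assignments at two reference dates.
previous = {"C1": "Savers", "C2": "Savers", "C3": "Borrowers", "C4": "Borrowers"}
current = {"C1": "Savers", "C2": "Borrowers", "C3": "Borrowers", "C5": "Savers"}
transitions = segment_transitions(previous, current)
```

Each key of `transitions` corresponds to one link of the Sankey chart, and its count gives the width of that link.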
The Historical Analysis Flow zone groups the scored historical data to compute the count of customers, the sum of revenues, and the average of revenues for each segment and reference date.
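The grouping described here amounts to a count, a sum, and an average per (segment, reference date). A short plain-Python sketch, assuming one scored row per customer and reference date and using hypothetical field names:

```python
from collections import defaultdict
from datetime import date

def historical_summary(scored_rows):
    """Aggregate scored rows per (segment, reference date)."""
    revenues = defaultdict(list)
    for r in scored_rows:
        revenues[(r["segment"], r["reference_date"])].append(r["revenue"])
    return {
        key: {
            "customers": len(values),
            "revenue_sum": sum(values),
            "revenue_avg": sum(values) / len(values),
        }
        for key, values in revenues.items()
    }

# Hypothetical scored historical data.
jan, feb = date(2023, 1, 31), date(2023, 2, 28)
scored_rows = [
    {"customer_id": "C1", "segment": "Savers", "reference_date": jan, "revenue": 10.0},
    {"customer_id": "C2", "segment": "Savers", "reference_date": jan, "revenue": 30.0},
    {"customer_id": "C1", "segment": "Savers", "reference_date": feb, "revenue": 12.0},
    {"customer_id": "C2", "segment": "Borrowers", "reference_date": feb, "revenue": 50.0},
]
summary = historical_summary(scored_rows)
```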
The Create customer data input for NBO Flow zone creates a dataset similar to the input customers dataset with one additional column including the segmentation result.
The Customer Segmentation Summary dashboard contains five slides: Segmentation Model Analysis, Segment Analysis, Tier Analysis, Segment and Tier, and Segment Evolution.
The first slide, Segmentation Model Analysis, contains visualizations published from the model results. Additional visualizations and more details on the segmentation model can be found within the saved model interface linked from the dashboard. As mentioned, cluster names can be edited from the saved model interface. The scenario to rebuild the graphs can be run from this first slide of the dashboard. The graphs on this slide provide a summary of identified clusters, a cluster heatmap, and the importance of each variable to the clustering model.
The Segment Analysis and Tier Analysis slides contain the same graphs with the only difference being whether charts are created according to tiers or segments. These charts are a starting point to understanding how segments/tiers are constituted but additional charts can be built very easily. Four areas of analysis per segment/tier are provided with the graphs: Revenue and Cross Sell, Age Distributions, Product Mix, and Pivot tables with Total Revenue, Average Revenue per Customer, and Customer number.
The Segment and Tier slide compares the segmentation created by the project, which uses a data-centric approach, with the existing business tiering. The first two graphs allow us to understand whether there are close links between the tiers and segments, or whether they are built independently. The second pair of graphs focuses on the revenue distribution across tiers or segments. This approach helps to pinpoint the most profitable areas, and which tier and segment they come from. Additionally, there are two pivot tables to aid in understanding the revenue distribution.
Finally, the Segment Evolution tab presents us with a new way of looking at our Segments as dynamic identifiers. In looking at the Segment Stability, Transition, and Evolution graphs, we can see how customers move between different segments over time and potentially what might be the causes of these transitions. The Sankey charts are interactive and allow us to click on other segments to change the focus.
The intent of this project is to enable business management, marketing, and product development teams in consumer banking to understand how Dataiku can be used to identify new and existing customer segments. By creating a unified space where existing business knowledge and analytics (for example, on Cross Sell and Tiering) are presented alongside new and easily generated Machine Learning Segmentations, business teams can immediately understand the incremental value of an ML approach, without disrupting or separating their existing analytics and subject-matter expertise.
We’ve provided several suggestions on how to use customer data to identify customer segments and tiers but ultimately the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.