Customer Segmentation for Banking¶
Insightful customer segmentation is a cornerstone of effective business management, marketing, and product development within consumer banking. Many firms have developed deep business knowledge which is applied to their customer pools using business rules logic, slicing the overall customer base into subgroups based on actual or potential revenues, product mix, digital engagement, and much more. These existing customer analytics provide powerful insight and are often driven by qualitative insights or historical practice. Yet 82% of bank executives say their teams have difficulties identifying new customer segments, which can drive up acquisition costs and reduce retention rates. Leveraging a purely data-driven approach to segmentation introduces the possibility of new perspectives, complementing rather than replacing existing expertise.
The goal of this plug-and-play solution is to use machine learning in the form of a clustering algorithm to identify distinct clusters of customers, which are referred to in our analysis as Segments. Further analysis is carried out to understand these clusters, and how they relate to the bank’s product mix and existing customer tiering approaches.
To leverage this solution, you must meet the following requirements:
Have access to a DSS 9.0+ instance.
The solution requires a Python3 built-in environment. For instance, with a built-in environment of Python2, users should create a basic Python3 code env and set the project to use this code env.
To benefit natively from the solution’s Dataiku Application, a PostgreSQL or Snowflake connection storing your data (see Data Requirements) is needed. However, the Solution comes with demo data available on the filesystem-managed connection.
If the technical requirements are met, this solution can be installed in one of two ways:
On your Dataiku instance click + New Project > Business Solutions > Search for Customer Segmentation for Banking.
Download the .zip project file and upload it directly to your Dataiku instance as a new project.
This Solution comes with a simulated dataset that can be used to demo the Solution and its components. In order to use the Solution on your own data via Plug and Play usage, your input data must match the data model of the demo data (more details can be found in the wiki) :
customer_data_input contains general information about each customer.
product_holdings_input contains historical data of product holdings per customer.
revenues_input contains the sums of all revenues generated from each customer for each product. The initial analysis does not split out revenues by type (e.g. fee or non-fee).
balances_input contains the historical balances for each product that are held per customer.
additional_information_input contains optional additional information to enrich your analysis (e.g. dates for the last monthly statement, other customer values gathered on a monthly basis).
You can follow along with the solution in the Dataiku gallery.
The project has the following high-level steps:
Connect your data as an input and select your analysis parameters via the Dataiku Application.
Ingest and pre-process the data to be available for segmentation.
Train and apply a Clustering Model.
Apply cluster analysis to our identified customer segments.
Interactively explore our customer segments with a pre-built Dashboard.
In addition to reading this document, it is recommended to read the wiki of the project before beginning in order to get a deeper technical understanding of how this solution was created, the different types of data enrichment available, longer explanations of solution-specific vocabulary, and suggested future direction for the solution.
To begin, you will need to create a new instance of the Customer Segmentation for Banking Application. This can be done by selecting the Dataiku Application from your instance home, and clicking Create App Instance. The project is delivered with sample data that can be used to run an initial demo run of the Solution or should be replaced with your own data. A Snowflake or PostgreSQL data connection can be selected and used to reconfigure the flow and App to your own data. Once the reconfiguration is complete, we can input the names of each of the input datasets and press the Load button before refreshing the page to allow the Dataiku App to accurately reflect our data.
Within the Column Identification section of the App, we can go through each required column and map it to the matching column in our datasets. If we’ve included additional customer information, we can also multi-select which optional columns we want to enrich our analysis with. Once complete we can press the Build button to create a historical_dates dataset. We should once again refresh the page in order to make subsequent fields of the Dataiku Application reflect our actual available data and parameters. Finally, we use the Clustering Configuration section to configure our analysis. Specifically, we should select the reference_date (a natural choice for this is the latest date available). Then, we define the lookback period, which can also be interpreted as a reference period features will be computed both on a monthly basis and on a reference period basis. The value is expressed in the number of months. Next, select the number of clusters; standard values range between 3 and 6, but depending on the particular use case, this number can be higher although it would require more extensive work to interpret each of them.
Pressing the Run button will rebuild the whole flow and can take a few minutes to a few hours, depending on the data. When the job is complete, press the link to the dashboard to access ready-made insights on this segmentation. Continue reading the following sections if you’re interested in how the underlying flow of the Solution works. Otherwise, you can skip to the section detailing the Dashboard.
Data Preparation is comprised of 5 flow zones in the project flow: Input, Date Preparation, Customer Data Preparation, Product Preparation, and Balances and Revenues.
Beginning first with the Input flow zone, when new data is uploaded to the Solution via the Dataiku Application, the starting five datasets in this flow zone are reconfigured and refreshed to incorporate the new data. A Prepare recipe is then used to link the original column names with the ones defined in the Column Identification section of the App.
The Date Preparation flow zone creates the initial dates dataset necessary for subsequent data transformations. It does this by identifying unique dates in the product_holdings dataset, extracting year and month values, retrieving the last date per month, and adding the project_reference_date. This value is stored in the Product Variables and defined in the Dataiku Application.
Customer Data Preparation takes in customer data, as well as the optional additional customer information, and applies a similar methodology to both datasets. A Join recipe is used to replicate the datasets of customer information as many times as the number of rows in the dates_history dataset to allow us to perform historical computation for every period. Following this, we take the customer data to compute customer age and account age. Meanwhile, we filter out rows that aren’t in the lookback period defined by the Dataiku App before finally using a Group recipe to compute each feature per customer on the last month, and the average of each feature per customer on the reference period.
The Product Preparation flow zone outputs two key datasets. The first is a correspondence table representing the product and product type for each product id. The second is a result of several visual recipes applied to the products_holiding and dates_history datasets. The specific details of each recipe can be read in the wiki, but we can focus on the output of these recipes which is a dataset name product_holdings_portfolio which contains customer ids, the defined reference date, the number of subscribed and terminated products at the given reference date, and the number of subscribed and terminated products before the lookback period. If there are missing values in the data, they are filled in with 0.
Finally, the Balanced and Revenues flow zone handles the preparation of the balances_renamed and revenues_renamed datasets. These datasets are joined together and enriched with the correspondence table of products from the last flow zone. Repeating the similar methodology of previous flow zones, the data is joined with dates_history to properly contextualize the data relative to our reference date. The output of the flow zone is a dataset containing, for each product type:
The total number of products
The total balance on the product type on average during the reference period.
The total balance on the product type during the last month.
The total revenue on the product type on average during the reference period.
The total revenue on the product type during the last month.
With our data sufficiently prepared, we are ready to train and deploy our model.
We begin the Segmentation flow zone by joining all of our previously prepared datasets to create the customer_full_data dataset. We use a filter to keep only active customers at the given reference date and then split our data to separate historical data from the reference data (which will be used to train our model). Prior to training, we take logs of the variables for income and those coming from customer behaviors, revenues, and balances. The reason behind this decision is to avoid having too many outliers: most of these variables exhibit log-normal distribution, with very high density around low values and a few very large values. Clustering is not an exact science, and the way features are preprocessed and included in the analysis hugely impacts the result so great care must be put into ensuring that the business hypothesis is well reflected in the model.
Prior to the training we also do a bit of feature handling, depending on whether the features are numerical or categorical: categorical variables are dummy encoded and numerical variables are rescaled using standard rescaling. Choices can also be made on the set of variables that are included in the model. For instance, segmentation could be done by focusing mostly on revenues and ignoring demographics. Similarly, segments could be built by using behavioral data and dismissing revenues.
The clustering model is a Kmeans algorithm, with the number of clusters that are set programmatically according to the value input in the Dataiku Application. This number can be adjusted depending on how diverse the customers are. We chose not to detect outliers as we do not want to have customers not belonging to any of the segments. Clusters have been automatically renamed based on the variables that mostly define them. However, the user has the possibility to summarize this information with a more human-readable name thanks to the edit button and rebuild the graph using the rebuild_graphs scenario provided with the Solution. The model is deployed on the flow and used to score the reference data and historical data with identified segments.
Five flow zones are used to analyze our identified customer segments. Each flow zone has output datasets that can be explored for additional analysis, but they all are used to create the visualizations provided in the pre-built Dashboard.
The Summary flow zone outputs compute the quintiles for customer income, age, and account age. Prep and grouping recipes are applied to count the customers per category, for each segment and tier. The two output datasets serve as the base for several graphs on the dashboard providing in-depth descriptive statistics about the segmentation and tiers.
Cross Sell analyzes the output of the segmentation with respect to cross-sell, and also creates insights about tiers and cross-sell without the need for any segmentation.
Product Mix computes the distribution of product portfolios for each tier and segment.
Transition Analysis focuses on how customers move between segments and serves as the underlying dataset to create the interactive Sankey charts in the Dashboard.
The Historical Analysis flow zone groups the scored historical data to compute the count of customers, the sum of revenues, and the average of revenues for each segment and reference date.
The Customer Segmentation Summary Dashboard contains 5 slides: Segmentation Model, Segment Analysis, Tier Analysis, Segment and Tier and Segment Evolution.
The first slide, Segmentation Model contains visualizations published from the model results. Additional visualizations and more details on the segmentation model can be found within the saved model interface linked from the dashboard. As mentioned, cluster names can be edited from the saved model interface. The scenario to rebuild the graphs can be run from this first tab of the dashboard. The graphs on this tab provide a summary of identified clusters, a cluster heatmap, and variables’ importance to the clustering model.
The Segment Analysis and Tier Analysis slides contain the same graphs with the only difference being whether charts are created according to tiers or segments. These charts are a starting point to understanding how segments/tiers are constituted but additional charts can be built very easily. Four areas of analysis per segment/tier are provided with the graphs: Revenue and Cross Sell, Income and Age Distributions, Product Mix, and Pivot tables with Total Revenue, Average Revenue per Customer, and Customer number.
The Segment and Tier tab is a comparison of the segmentation created by the project using a data-centric approach to business tiering. The first two graphs allow us to understand if there are close links between the tiers and segments, or if they are built independently. The second pair of graphs focus on the revenue repartition between tiers, or segments. This approach helps to pinpoint the most profitable areas, and from which tier and segment. Additionally, there are two Pivot tables to aid in the understanding of the revenue repartition.
Finally, the Segment Evolution tab presents us with a new way of looking at our Segments as dynamic identifiers. In looking at the Segment Stability, Transition, and Evolution graphs, we can see how customers move between different segments over time and potentially what might be the causes of these transitions. The Sankey charts are interactive and allow us to click on other segments to change the focus.
The intent of this project is to enable business management, marketing, and product development teams in consumer banking to understand how Dataiku can be used to identify new and existing customer segments. By creating a unified space where existing business knowledge and analytics (for example, on Cross Sell and Tiering) are presented alongside new and easily generated Machine Learning Segmentations, business teams can immediately understand the incremental value of an ML approach, without disrupting or separating their existing analytics and subject-matter expertise.
We’ve provided several suggestions on how to use customer data to identify customer segments and tiers but ultimately the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.