Insurance Claims Modeling
Generalized Linear Models (GLMs) are a common approach to consumer insurance claims modeling across the world, with a deep, rich, and proven track record. They are an industry standard, well-understood, and acceptable to stakeholders inside and outside the insurance firm. Existing no- and low-code platforms for building and approving GLMs are often outdated and lack modern data science and analytic capabilities. They require complex and potentially unreliable nests of supporting systems to work effectively. This solution acts as a template of how actuaries could use Dataiku to perform their work. By using this solution, actuaries can benefit from training GLMs in a visual environment, conduct extensive Exploratory Data Analysis, and push their models to production through a simple API deployment interface.
To leverage this solution, you must meet the following requirements:
Have access to a DSS 10.0+ instance.
A Python 3.6 code environment named solution_claim-modeling with the following required packages:
dash==2.3.1
dash_bootstrap_components==1.0
scikit-learn>=0.20,<0.21
statsmodels>=0.10,<0.11
cloudpickle>=1.3,<1.6
Dataiku Online instances will auto-install these requirements when the Solution is created.
This solution is available to install on Dataiku and Dataiku Online instances.
Installing on your Dataiku Instance
If the technical requirements are met, this solution can be installed in one of two ways:
On your Dataiku instance click + New Project > Business solutions > Search for Insurance Claims Modeling with Visual GLM.
Download the .zip project file and upload it directly to your Dataiku instance as a new project.
Installing on a Dataiku Online Instance
Dataiku Online customers can add this Solution to their managed instance from the Launchpad: Features > Add A Feature > Extensions > Insurance Claims Modeling
The Dataiku flow was initially built using publicly available data from the French government and the CASDataset R package about insurance claims.
The car insurance datasets are:
claim_frequency: a 678,013-row dataset with one row per policyholder, containing information on the policyholder and their car, and the number of claims they made during the period.
claim severity: a 26,639-row dataset consisting of an id column that links to the claim frequency dataset and a claim amount column that holds the total amount claimed by the policyholder during the period.
These two datasets are joined on the id column to have the claim amount associated with claims when there have been any.
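As a rough illustration of this join outside Dataiku, the sketch below uses pandas on a few made-up rows. The column names `ClaimAmount` and `Exposure` and all values are assumptions for the example; only the `id` key and the `ClaimNb` column are named in this article.

```python
import pandas as pd

# Toy stand-ins for the two car insurance datasets (values are invented).
claim_frequency = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "ClaimNb": [0, 1, 0, 2],
    "Exposure": [1.0, 0.5, 0.8, 1.0],
})
claim_severity = pd.DataFrame({
    "id": [2, 4],
    "ClaimAmount": [1200.0, 3400.0],  # assumed column name
})

# A left join keeps every policyholder; those without claims get a
# missing claim amount, which is filled later during feature processing.
claims = claim_frequency.merge(
    claim_severity, on="id", how="left", validate="one_to_one"
)
```

The `validate="one_to_one"` argument makes pandas raise an error if an id appears more than once on either side, which is a cheap guard against accidentally duplicating policyholders.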
The geographic datasets consist of:
regions_correspondance: to match old French regions (before 2016) with current regions.
regions_polygons: containing polygons for each of the new regions.
You can follow along with the solution in the Dataiku gallery.
The project has the following high-level steps:
Input Historical Data and perform feature processing
Conduct Exploratory Data Analysis for a deeper understanding
Train models for claims modeling and pricing
Review model performance
Deploy models to an API for real-time predictions
Interactively explore our models’ predictions with a pre-built Webapp and Dashboard
The usage of Generalized Linear Models for Insurance Claims Modeling is a complex topic. This article serves as a very brief overview of the solution and is intentionally sparse in its details. In-depth technical details, summaries of the research that was involved in the building of this solution, and suggested next steps can be found in the wiki of the project. It is highly recommended you read the wiki before using this solution.
After the initial join of our historical claims data in the Input Flow Zone, we move the claims data to the Feature Processing Flow Zone to prepare it for training. There, a prepare recipe caps Exposure and Claim Numbers and fills missing claim amounts with 0. The decision to cap values is inspired by source research on Generalized Linear Models for Insurance Ratings; more details on this research can be found in the Project Wiki. The result of this prepare recipe is then split into Train/Test sets. We apply an additional prepare recipe to the train set to ensure the reliability of the test results. This final prepare recipe performs additional transformations to analyze relationships and applies some additional value caps to focus on significant data. Although our data is now ready for training, we first conduct Exploratory Data Analysis (EDA), which can reveal interesting patterns and insights in our historical data.
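The capping, filling, and splitting steps could be sketched in pandas as follows. This is only an illustration under stated assumptions: the cap values, the 80/20 split, the `ClaimAmount` column name, and the helper function names are hypothetical, and the thresholds actually used by the solution are documented in the Project Wiki.

```python
import numpy as np
import pandas as pd

# Hypothetical caps, chosen only for the example.
EXPOSURE_CAP = 1.0
CLAIM_NB_CAP = 4

def prepare_claims(claims: pd.DataFrame) -> pd.DataFrame:
    """Cap Exposure and ClaimNb, and fill missing claim amounts with 0."""
    out = claims.copy()
    out["Exposure"] = out["Exposure"].clip(upper=EXPOSURE_CAP)
    out["ClaimNb"] = out["ClaimNb"].clip(upper=CLAIM_NB_CAP)
    out["ClaimAmount"] = out["ClaimAmount"].fillna(0.0)
    return out

def split_train_test(df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 42):
    """Random split; the test rows are sampled and the rest form the train set."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test

claims = pd.DataFrame({
    "id": range(1, 11),
    "Exposure": [0.5, 1.0, 2.0, 0.3, 0.8, 1.0, 0.1, 1.5, 0.9, 0.6],
    "ClaimNb": [0, 1, 0, 11, 0, 2, 0, 0, 1, 0],
    "ClaimAmount": [np.nan, 1200.0, np.nan, 50000.0, np.nan,
                    800.0, np.nan, np.nan, 300.0, np.nan],
})
prepared = prepare_claims(claims)
train, test = split_train_test(prepared)
```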
Three flow zones comprise our EDA in this solution:
Univariate Analysis - aims at analyzing each of the possible variables one by one to check their distribution.
Cross Variable - identifies more complex dependencies in our data by looking at variables taken together and analyzing their joint distribution.
Geographic Analysis - plots geographical variables on maps to enable visual confirmation of intuition by experienced analysts.
Starting with the Univariate Analysis flow zone, our claims train dataset is used as an input and folded using a prepare step. Folding all the variables in this dataset allows us to avoid using as many group recipes as there are variables. The resulting, much longer dataset is then grouped and aggregated to compute the minimum claim number, the claim amount sum, and the exposure sum. We then unfold the data with a final prepare recipe to obtain the Claim Frequency, Claim Severity, and Pure Premium contained in our train dataset. These values are visualized in the Dashboard tabs of the same names.
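The fold-then-group idea can be sketched with pandas `melt` and `groupby`. The sketch uses the usual actuarial definitions (frequency = claims / exposure, severity = claim amount / claims, pure premium = claim amount / exposure) and plain sums per group; the exact aggregations in the solution's recipes may differ, and the `VehBrand` and `ClaimAmount` column names and all values are assumptions.

```python
import pandas as pd

train = pd.DataFrame({
    "Area": ["A", "A", "B", "B"],
    "VehBrand": ["B1", "B2", "B1", "B1"],
    "ClaimNb": [0, 1, 2, 0],
    "ClaimAmount": [0.0, 1000.0, 3000.0, 0.0],
    "Exposure": [1.0, 0.5, 1.0, 0.5],
})
measures = ["ClaimNb", "ClaimAmount", "Exposure"]

# Fold: one row per (original row, explanatory variable), so that a single
# group-by covers every variable instead of one group recipe per variable.
folded = train.melt(id_vars=measures, var_name="variable", value_name="value")

# Aggregate the measures for each level of each variable.
summary = folded.groupby(["variable", "value"], as_index=False)[measures].sum()

# Derive the three quantities shown on the dashboard tabs.
summary["claim_frequency"] = summary["ClaimNb"] / summary["Exposure"]
# Severity is undefined for groups without claims, so mask zero denominators.
summary["claim_severity"] = summary["ClaimAmount"] / summary["ClaimNb"].where(
    summary["ClaimNb"] > 0
)
summary["pure_premium"] = summary["ClaimAmount"] / summary["Exposure"]
```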
Moving along to the Cross Variable flow zone, we once again use our claims train dataset as an input and apply 3 different group by recipes. The first group by recipe computes the min and max of Density for each Area. This analysis reveals that the Density ranges of the different Areas do not intersect, meaning Area carries no information beyond Density, so any model using both Density and Area would be prone to overfitting. Additional group by recipes analyze the data by Vehicle Brand and Area, as well as by Vehicle Brand and Vehicle Power. There does not seem to be any correlation between Vehicle Brand and Area, although there is a correlation between Vehicle Brand and Vehicle Power. These relationships are visualized in the final 3 graphs of the Claim Frequency tab of the Dashboard.
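The Density-per-Area check can be reproduced in a few lines of pandas. The data below is invented purely to show the mechanics; the idea is that if each Area's Density range starts above the previous Area's maximum, Area is effectively a binned version of Density.

```python
import pandas as pd

train = pd.DataFrame({
    "Area": ["A", "A", "B", "B", "C", "C"],
    "Density": [5, 40, 55, 90, 120, 480],  # invented values
})

# Min and max of Density for each Area, ordered by the lower bound.
ranges = train.groupby("Area")["Density"].agg(["min", "max"]).sort_values("min")

# The ranges are disjoint when every Area's min exceeds the previous Area's max.
disjoint = bool(
    (ranges["min"].iloc[1:].to_numpy() > ranges["max"].iloc[:-1].to_numpy()).all()
)
```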
Finally, the Geographic Analysis flow zone uses our prepared claims train dataset, as well as two of our original input datasets (regions_correspondance and regions_polygons), to associate the polygons with each region. The resulting geographical data is visualized using Dataiku’s geographic map building capabilities in the Map View tab of the dashboard. The maps allow us to see how data is distributed across regions. We first look at the sum of exposure and claim numbers. Our solution was built with data representing French regions but can be easily adapted with geographical data for other countries. Before moving on to the modeling part of this solution, it is important to spend some time exploring the aforementioned Dashboard tabs to get a clear understanding of the underlying data that we will use to build our predictive models.
Similar to the previous section, 3 flow zones are involved in the model training process, all of which use the GLM plugin to enable training of Generalized Linear Models within Dataiku’s VisualML feature:
Claim Frequency Modeling - trains a model to predict the number of claims made by a policyholder.
Claim Severity Modeling - trains a model to predict the claim amount conditional on the existence of a claim.
Pure Premium Modeling - trains a model to predict the claim amount unconditional on the existence of a claim.
To train our Claim Frequency model, we take the claims train dataset directly into a VisualML recipe. The recipe applies some additional feature preprocessing and handling before training a Generalized Linear Model Regression on the dataset. The previously created claims test set is used to analyze the performance of our model on 3 metrics: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Deviance. Model results and comparison of GLM Actual vs Expected graphs can be explored in the GLM Summary View.
Our Claim Severity model requires that we first filter our claims dataset to only include observations where claims exist (i.e. ClaimNb > 0). Additionally, the claims test set is scored using the Claim Frequency model since we will rely on the ClaimNb prediction. The training script is the same as for Claim Frequency, with one obvious difference: here we are predicting Claim Amount. As a result, the metric of importance for assessing our model’s performance is the gamma deviance.
Finally, we train our Pure Premium model using the claims train dataset with the prediction target being Claim Amount. However, unlike our Claim Severity model, we do not need to filter out rows where claims don’t exist. We chose a Tweedie distribution to model the response and thus we analyze our model performance on the Tweedie deviance.
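The gamma and Tweedie deviances mentioned above belong to the same family, indexed by a variance power p (p = 1 is Poisson, p = 2 is gamma, and 1 < p < 2 is the compound Poisson-gamma case typically used for pure premium). The NumPy sketch below computes the mean unit deviance for these cases. The function name is hypothetical, and the power value 1.5 in the usage lines is only an example, not the value used by the solution.

```python
import numpy as np

def mean_tweedie_deviance(y, mu, power):
    """Mean unit deviance for power = 1 (Poisson), 2 (gamma), or 1 < power < 2."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    if np.any(mu <= 0):
        raise ValueError("predictions must be strictly positive")
    if power == 1:
        if np.any(y < 0):
            raise ValueError("Poisson deviance needs y >= 0")
        # y * log(y / mu) is taken as 0 when y == 0.
        ratio = np.where(y > 0, y / mu, 1.0)
        dev = 2 * (y * np.log(ratio) - y + mu)
    elif power == 2:
        if np.any(y <= 0):
            raise ValueError("gamma deviance needs y > 0")
        dev = 2 * (np.log(mu / y) + y / mu - 1)
    elif 1 < power < 2:
        if np.any(y < 0):
            raise ValueError("Tweedie deviance needs y >= 0")
        dev = 2 * (
            np.power(y, 2 - power) / ((1 - power) * (2 - power))
            - y * np.power(mu, 1 - power) / (1 - power)
            + np.power(mu, 2 - power) / (2 - power)
        )
    else:
        raise ValueError("power must be 1, 2, or strictly between 1 and 2")
    return float(dev.mean())

# Usage: severity is scored only on rows with claims (y > 0), while pure
# premium is scored on all rows, including those with a zero claim amount.
gamma_dev = mean_tweedie_deviance([1000.0, 2500.0], [1200.0, 2000.0], power=2)
tweedie_dev = mean_tweedie_deviance([0.0, 1.0], [1.0, 1.0], power=1.5)
```

The deviance is 0 for a perfect prediction and grows as predictions move away from the observed values, which is why the gamma case cannot accept zero claim amounts while the Tweedie case can.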
We’ve walked through the differences between each trained model, but which model performs the best? The Model Comparison flow zone uses scored holdout datasets to compare the performance of our 3 models. All 3 scored datasets are taken from their respective flow zones and joined on the id key they all share. Additionally, a prepare recipe is used to compute the Compound Model prediction as the product of the Claim Number and Claim Amount predictions. We can now compare this prediction with the Tweedie prediction. For our solution, both the parametric measures of fit and the Lorenz curves indicate stronger performance by the Compound Model. However, more work on the feature handling may produce different results. Results can be visually explored in the Model Comparison tab of the dashboard.
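To make the comparison concrete, the sketch below builds a compound prediction as frequency times severity and compares it with a direct pure premium prediction using a Lorenz curve and the associated Gini index. All numbers are invented, the function names are hypothetical, and this simplified version weights every policyholder equally instead of by exposure, so it is only an approximation of what the dashboard shows.

```python
import numpy as np

def lorenz_curve(actual, predicted):
    """Cumulative share of actual claim amounts, policyholders ranked by predicted risk."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    order = np.argsort(predicted, kind="stable")  # safest policyholders first
    cum_claims = np.cumsum(actual[order])
    cum_claims = np.insert(cum_claims / cum_claims[-1], 0, 0.0)
    cum_policies = np.linspace(0.0, 1.0, len(actual) + 1)
    return cum_policies, cum_claims

def gini(actual, predicted):
    """1 - 2 * area under the Lorenz curve; higher means better risk ranking."""
    x, y = lorenz_curve(actual, predicted)
    area = np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0  # trapezoidal rule
    return 1.0 - 2.0 * area

# Invented predictions for four policyholders.
freq_pred = np.array([0.05, 0.10, 0.20, 0.40])        # expected claim counts
sev_pred = np.array([1000.0, 1200.0, 1500.0, 2000.0])  # expected amount per claim
compound_pred = freq_pred * sev_pred                   # Compound Model prediction
tweedie_pred = np.array([300.0, 90.0, 60.0, 700.0])    # direct pure premium prediction
actual = np.array([0.0, 0.0, 500.0, 1500.0])           # observed claim amounts

gini_compound = gini(actual, compound_pred)
gini_tweedie = gini(actual, tweedie_pred)
```

A model whose Lorenz curve sits further below the diagonal concentrates more of the actual claim amount among the policyholders it ranks as riskiest, which translates into a higher Gini index.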
All of the previously trained models are exposed by a deployed API service. The API flow zone serves to group all models into a single flow zone for organizational purposes. The deployed API service, named claim_risk, contains 3 endpoints (one for each prediction model). No enrichment is necessary beforehand, as the model scripts were designed to contain the necessary feature processing. If you have an API Node as part of your Dataiku subscription, this API service can be pushed to the API node to allow real queries to be sent by your claims teams.
In addition to the pre-defined model comparison analysis detailed above and visualized in the Model Comparison dashboard tab, this solution comes with a pre-built Webapp to allow for interactive model comparison. The interactive view available in the Interactive Model Comparison dashboard tab provides a view to explore models’ predictions, understand how each feature affects the models, and compare the model predictions.
The impacting features can be modified using sliders or dropdown menus. Each change triggers an immediate call to the API, which in turn returns the predictions of each model. Due to the redundancy of Area and Density discovered during our EDA, Area is not an available feature in the Webapp.
It is possible to have the Webapp use models directly deployed on the flow, instead of the API service, by changing the use_api and api_node_url project variables.
The intent of this project is to enable claims teams to understand how Dataiku, and the new GLM Plugin, can be used to create an insurance pricing model based on historical claim data. By creating a singular solution that can benefit and influence the decisions of a variety of teams in a single organization, smarter and more holistic strategies can be designed in order to leverage GLM pricing solutions, establish effective governance, and centralize pricing workflows without sacrificing agility.
We’ve provided several suggestions on how to use historical claims data to train predictive models but ultimately the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.