Solution | Insurance Claims Modeling#

Overview#

Business Case#

Generalized Linear Models (GLMs) are a common approach to consumer insurance claims modeling across the world, with a deep, rich, and proven track record. They are an industry standard, well-understood, and acceptable to stakeholders inside and outside the insurance firm.

Existing no- and low-code platforms for building and approving GLMs are often outdated and lack modern data science and analytic capabilities. They require complex and potentially unreliable nests of supporting systems to work effectively.

This solution acts as a template for how actuaries could use Dataiku to perform their work. By using this solution, actuaries can train GLMs in a visual environment, conduct extensive Exploratory Data Analysis, and push their models to production through a simple API deployment interface.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

After meeting the technical requirements below, self-managed users can install the Solution in one of two ways:

  1. On your Dataiku instance connected to the internet, click + New Project > Dataiku Solutions > Search for Insurance Claims Modeling.

  2. Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Additional note for 12.1+ users

If you are using a Dataiku 12.1+ instance and are missing the technical requirements for this Solution, the popup below will appear, allowing admin users to install the requirements directly and non-admin users to request the installation of the required code environments and/or plugins on their instance.

Admins can process these requests in the admin request center, after which non-admin users can re-trigger the installation of the Solution.

Screenshot of the Request Install of Requirements menu available for Solutions.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 12.0+ instance.

  • Generalized Linear Models Plugin.

  • A Python 3.8 or 3.9 code environment named solution_claim-modeling with the following required packages:

scikit-learn>=1.0,<1.1
Flask<2.3
glum==2.6.0
cloudpickle>=1.3,<1.6
lightgbm>=3.2,<3.3
scikit-optimize>=0.7,<0.10
scipy>=1.5,<1.6
statsmodels==0.12.2
xgboost==0.82
dash==2.3.1
dash_bootstrap_components==1.0

Data Requirements#

The Dataiku Flow was initially built using publicly available insurance claims data from the French government and the CASdatasets R package.

The car insurance datasets are:

| Dataset | Description |
| --- | --- |
| claim_frequency | A 678,013-row dataset with one row per policyholder, containing information on the policyholder, their car, and the number of claims they made during the period. |
| claim_severity | A 26,639-row dataset consisting of an ID column linked to the claim_frequency dataset and a claim amount column that sums the total claim amount made by the policyholder for the period. |

These two datasets are joined on the ID column to have the claim amount associated with claims when there have been any.
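
For intuition, the same join can be sketched in a few lines of pandas outside Dataiku. This is a minimal illustration only; the file names and column names (IDpol, ClaimAmount) are assumptions modeled on the public French MTPL data and should be adjusted to your own schema.

```python
import pandas as pd

# Assumed file and column names (IDpol, ClaimAmount); adjust to your own schema.
freq = pd.read_csv("claim_frequency.csv")   # one row per policyholder
sev = pd.read_csv("claim_severity.csv")     # one row per policyholder with claims

# A left join keeps policyholders without claims; their ClaimAmount stays missing
# and is filled with 0 later, during feature processing.
claims = freq.merge(sev, on="IDpol", how="left")
```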

The geographic datasets consist of:

| Dataset | Description |
| --- | --- |
| regions_correspondance | Matches old French regions (pre-2016) with current regions. |
| regions_polygons | Contains polygons for each of the new regions. |

Workflow Overview#

You can follow along with the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Input Historical Data and perform feature processing.

  2. Conduct Exploratory Data Analysis for a deeper understanding.

  3. Train models for claims modeling and pricing.

  4. Review model performance.

  5. Deploy models to an API for real-time predictions.

  6. Interactively explore our models’ predictions with a pre-built webapp and dashboard.

Walkthrough#

Note

In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Gather Input Data and Prepare for Training#

Following the initial join of our historical claims data in the Input Flow zone, we move the claims data to the Feature Processing Flow zone to prepare it for training. To do so, we apply a prepare recipe in which Exposure and Claim Number are capped and missing claim amounts are filled with 0. The decision to cap values is inspired by published research on Generalized Linear Models for insurance rating; more details on this research can be explored in the project wiki. The result of this prepare recipe is then split into train and test sets. An additional prepare recipe is applied to the train set only, leaving the test set untouched to ensure the reliability of the test results; it performs additional transformations used to analyze relationships and applies further value caps to focus on significant data. Although our data is now ready for training, we first conduct Exploratory Data Analysis (EDA), which can reveal interesting patterns and insights in our historical data.
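
As a rough illustration of the capping and splitting described above, here is a pandas sketch. The thresholds and file name are placeholders, not the values used by the Solution's prepare recipes (those are documented in the project wiki).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

claims = pd.read_csv("claims_joined.csv")   # joined frequency/severity table from the previous step

# Cap extreme values and fill missing claim amounts; thresholds here are placeholders.
claims["Exposure"] = claims["Exposure"].clip(upper=1.0)
claims["ClaimNb"] = claims["ClaimNb"].clip(upper=4)
claims["ClaimAmount"] = claims["ClaimAmount"].fillna(0)

# Split first, so later preparation touches only the train set and the test set
# remains a reliable holdout.
train, test = train_test_split(claims, test_size=0.2, random_state=42)
```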

Extensive Exploratory Data Analysis for a Deeper Understanding#

Three Flow zones comprise our EDA in this solution:

| Flow zone | Description |
| --- | --- |
| Univariate Analysis | Aims at analyzing each of the possible variables one by one to check their distribution. |
| Cross Variable | Identifies more complex dependencies in our data by looking at variables taken together and analyzing their joint distribution. |
| Geographic Analysis | Plots geographical variables on maps to enable visual confirmation of intuition by experienced analysts. |

Starting with the Univariate Analysis Flow zone, the claims train dataset is used as input and folded with a prepare recipe. Folding all the variables in this dataset avoids needing as many group recipes as there are variables. The resulting, much longer dataset is then aggregated with a group recipe that computes the minimum claim number, the sum of claim amounts, and the sum of exposure. We then unfold the data with a final prepare recipe, which gives us the Claim Frequency, Claim Severity, and Pure Premium contained in our train dataset. These values are visualized in the dashboard tabs of the same names.
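
The fold / group / unfold pattern above boils down to computing, per modality of each variable, the three quantities visualized in the dashboard. A rough pandas equivalent, with column names assumed from the public MTPL data:

```python
import pandas as pd

def univariate_summary(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Frequency, severity, and pure premium per modality of one rating factor."""
    g = df.groupby(feature).agg(
        claim_nb=("ClaimNb", "sum"),
        claim_amount=("ClaimAmount", "sum"),
        exposure=("Exposure", "sum"),
    )
    g["claim_frequency"] = g["claim_nb"] / g["exposure"]      # claims per unit of exposure
    g["claim_severity"] = g["claim_amount"] / g["claim_nb"].where(g["claim_nb"] > 0)  # average cost per claim
    g["pure_premium"] = g["claim_amount"] / g["exposure"]     # expected cost per unit of exposure
    return g.reset_index()

# Example: univariate_summary(train, "VehBrand")
```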

Dataiku screenshot of one of the three dashboard tabs dedicated to the visualization of the applied Univariate Analysis

Moving along to the Cross Variable Flow zone, we once again use the claims train dataset as input and apply three different group recipes. The first computes the min and max of Density for each Area. This analysis reveals that the Density ranges of the different Areas do not intersect, so any model using both Density and Area would be prone to overfitting. The other group recipes analyze the data by Vehicle Brand and Area, as well as Vehicle Brand and Vehicle Power. There does not appear to be any correlation between Vehicle Brand and Area, although there is one between Brand and Power. These relationships are visualized in the final three charts of the Claim Frequency tab of the dashboard.
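
The Density/Area check can be reproduced with a single groupby; non-overlapping Density ranges across Area levels are what signal the redundancy (column names assumed from the public MTPL data):

```python
# Min and max Density per Area; if the ranges do not overlap, Area is effectively
# a binned version of Density and keeping both features is redundant.
density_by_area = train.groupby("Area")["Density"].agg(["min", "max"]).sort_values("min")
print(density_by_area)
```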

Dataiku screenshot of the charts generated as a result from the applied Cross Variable Analysis.

Finally, the Geographic Analysis Flow zone uses the prepared claims train dataset, along with two of the original input datasets (regions_correspondance and regions_polygons), to associate a polygon with each region. The resulting geographic data is visualized using Dataiku's map-building capabilities in the Map View tab of the dashboard. The maps show how the data is distributed across regions, starting with the sum of exposure and the number of claims. The solution was built with data representing French regions but can easily be adapted to geographic data for other countries. Before moving on to the modeling part of this solution, it is worth spending some time exploring the aforementioned dashboard tabs to get a clear understanding of the underlying data we will use to build our predictive models.
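
A rough sketch of the region mapping is shown below; the dataset and column names (old_region, new_region, Region) are illustrative, and the actual join keys are defined in the Flow.

```python
import pandas as pd

# Illustrative column names: old_region -> new_region, one polygon per new region.
regions = pd.read_csv("regions_correspondance.csv")
polygons = pd.read_csv("regions_polygons.csv")

geo = (
    train.merge(regions, left_on="Region", right_on="old_region", how="left")
         .merge(polygons, on="new_region", how="left")
)

# Aggregates drawn on the map: exposure and claim counts per current region.
by_region = geo.groupby("new_region").agg(exposure=("Exposure", "sum"),
                                          claim_nb=("ClaimNb", "sum"))
```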

Dataiku screenshot of the Map View Dashboard tab visualizing the applied Geographic Analysis.

Train models to predict claim frequency, severity, and amount#

Similar to the previous section, three Flow zones are involved in the model training process, all of which use the GLM plugin to enable training Generalized Linear Models within Dataiku's Visual ML:

| Flow zone | Description |
| --- | --- |
| Claim Frequency Modeling | Trains a model to predict the number of claims made by a policyholder. |
| Claim Severity Modeling | Trains a model to predict the claim amount, conditional on the existence of a claim. |
| Pure Premium Modeling | Trains a model to predict the claim amount, unconditional on the existence of a claim. |

To train our Claim Frequency model, we take the claims train dataset directly into a Visual ML recipe. The recipe applies some additional feature preprocessing and handling before training a Generalized Linear Model regression on the dataset. The previously created claims test set is used to analyze the performance of the model on three metrics: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and deviance. Model results and Actual vs. Expected comparison graphs can be explored in the GLM Summary view.
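
The plugin handles all of this visually, but for intuition, an equivalent frequency model can be sketched outside Dataiku with scikit-learn's PoissonRegressor. The feature lists and regularization strength below are assumptions, not the Solution's actual Visual ML design.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature lists; the real handling is defined in the Visual ML design screen.
categorical = ["VehBrand", "VehGas", "Region"]
numeric = ["VehAge", "DrivAge", "BonusMalus", "Density"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

# Claim frequency = ClaimNb / Exposure, with each policy weighted by its exposure.
freq_model = Pipeline([("prep", preprocess), ("glm", PoissonRegressor(alpha=1e-4))])
freq_model.fit(
    train[categorical + numeric],
    train["ClaimNb"] / train["Exposure"],
    glm__sample_weight=train["Exposure"],
)
```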

Dataiku screenshot of the GLM metrics made available by the GLM plugin

Our Claim Severity model requires that we first filter the claims dataset to include only observations where claims exist (i.e., ClaimNb > 0). Additionally, the claims test set is scored with the Claim Frequency model, since the severity model relies on the predicted ClaimNb. The training script is the same as for Claim Frequency with one obvious difference: here we are predicting the Claim Amount. As a result of that difference, the metric of importance for assessing the model's performance is the gamma deviance.
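
Continuing the sketch above, a common formulation of the severity model keeps only policies with claims, fits a Gamma GLM on the average cost per claim weighted by claim count, and evaluates with the gamma deviance. This reuses the `preprocess`, `categorical`, and `numeric` objects from the frequency sketch and remains an assumption, not the Solution's exact script.

```python
from sklearn.linear_model import GammaRegressor
from sklearn.metrics import mean_gamma_deviance
from sklearn.pipeline import Pipeline

# Severity is modeled only on policies with at least one claim.
train_claims = train[train["ClaimNb"] > 0]
test_claims = test[test["ClaimNb"] > 0]

sev_model = Pipeline([("prep", preprocess), ("glm", GammaRegressor(alpha=1e-4))])
sev_model.fit(
    train_claims[categorical + numeric],
    train_claims["ClaimAmount"] / train_claims["ClaimNb"],   # average cost per claim
    glm__sample_weight=train_claims["ClaimNb"],
)

pred = sev_model.predict(test_claims[categorical + numeric])
print(mean_gamma_deviance(test_claims["ClaimAmount"] / test_claims["ClaimNb"],
                          pred, sample_weight=test_claims["ClaimNb"]))
```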

Dataiku screenshot of the GLM Graphs available in Dataiku's VisualML interface by the GLM Plugin

Finally, we train our Pure Premium model using the claims train dataset with Claim Amount as the prediction target. Unlike our Claim Severity model, however, we do not need to filter out rows where no claims exist. We chose a Tweedie distribution to model the response, and therefore analyze model performance with the Tweedie deviance.
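
A matching Pure Premium sketch uses a Tweedie GLM; a power parameter between 1 and 2 accommodates the mass of zero-claim policies plus a skewed positive tail. The value used here is illustrative, not the Solution's setting.

```python
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance
from sklearn.pipeline import Pipeline

# power=1.9 and alpha are illustrative tuning choices; reuses `preprocess` from above.
pp_model = Pipeline([("prep", preprocess),
                     ("glm", TweedieRegressor(power=1.9, alpha=1e-4))])
pp_model.fit(
    train[categorical + numeric],
    train["ClaimAmount"] / train["Exposure"],   # pure premium per unit of exposure
    glm__sample_weight=train["Exposure"],
)

pred = pp_model.predict(test[categorical + numeric])
print(mean_tweedie_deviance(test["ClaimAmount"] / test["Exposure"],
                            pred, power=1.9, sample_weight=test["Exposure"]))
```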

Evaluate the performance of our models#

We’ve walked through the differences between each trained model, but which one performs best? The Model Comparison Flow zone uses the scored holdout datasets to compare the performance of our three models. All three scored datasets are taken from their respective Flow zones and joined on the id key they share. A prepare recipe then computes the Compound model prediction as the product of the Claim Number and Claim Amount predictions, which we can compare with the Tweedie prediction. For this solution, both the parametric fit measures and the Lorenz curves indicate stronger performance from the Compound model, although more work on feature handling may produce different results. Results can be visually explored in the Model Comparison tab of the dashboard.
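
The Compound prediction and a Lorenz curve can be reproduced directly from scored data. A minimal sketch, assuming the per-policy models from the earlier sketches (here the product of a per-exposure frequency and an average cost per claim gives a per-exposure premium, comparable to the Tweedie prediction):

```python
import numpy as np

# Compound prediction: expected claim count x expected cost per claim, per unit of exposure.
freq_pred = freq_model.predict(test[categorical + numeric])
sev_pred = sev_model.predict(test[categorical + numeric])
compound_pred = freq_pred * sev_pred

def lorenz_curve(y_true, y_pred, exposure):
    """Rank policies by predicted risk and accumulate observed losses."""
    order = np.argsort(y_pred)
    exposure = np.asarray(exposure)[order]
    losses = np.asarray(y_true)[order] * exposure   # y_true is a per-exposure amount
    cum_exposure = np.cumsum(exposure) / exposure.sum()
    cum_losses = np.cumsum(losses) / losses.sum()
    return cum_exposure, cum_losses

x, y = lorenz_curve(test["ClaimAmount"] / test["Exposure"], compound_pred, test["Exposure"])
```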

Dataiku screenshot of visualizations showing the results of our model comparison analysis.

Deploy our models to an API for real predictions#

All of the previously trained models are exposed through a deployed API service. The API Flow zone groups the models into a single Flow zone for organizational purposes. The deployed API service, named claim_risk, contains three endpoints (one for each prediction model). No enrichment of incoming queries is needed beforehand, as the model scripts were designed to contain the necessary feature processing. If you have an API node as part of your Dataiku subscription, this API service can be pushed to the API node so that real queries from your claims teams can be sent.
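
Once the service is deployed to an API node, a claims system (or a quick test script) can query the endpoints with the dataikuapi package. The node URL, endpoint id, and feature record below are placeholders; use the endpoint ids actually defined in the claim_risk service.

```python
from dataikuapi import APINodeClient

# Placeholder URL, endpoint id, and feature record.
client = APINodeClient("https://your-api-node:12000", "claim_risk")
record = {
    "VehAge": 4, "DrivAge": 35, "BonusMalus": 60,
    "VehBrand": "B12", "VehGas": "Regular", "Region": "R82", "Density": 1200,
}
response = client.predict_record("claim_frequency_endpoint", record)
print(response)
```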

Enable Claims Teams with Pre-Built Interactive Dashboards#

Note

For 11.4+ instances with UIF enabled, retraining of the model is necessary prior to starting the webapp for the first time to avoid a permission denied error.

In addition to the pre-defined model comparison analysis detailed above and visualized in the Model Comparison dashboard tab, this solution comes with a pre-built webapp for interactive model comparison. The interactive view, available in the Interactive Model Comparison dashboard tab, lets you explore the models’ predictions, understand how each feature affects each model, and compare the models’ predictions side by side.

Dataiku screenshot of the webapp enabling interactive model comparison.

The impacting features can be modified with sliders or dropdown menus, which triggers an immediate call to the API that returns the prediction of each model. Due to the redundancy between Area and Density discovered during our EDA, Area is not available as a feature in the webapp.

Note

It is possible to have the webapp use models directly deployed on the Flow, instead of the API service, by changing the use_api and api_node_url project variables.
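
For reference, those project variables can also be changed programmatically. This is a sketch using the Dataiku Python API from inside the project; the exact value formats expected by the webapp are defined in its code.

```python
import dataiku

# Switch the webapp to the models deployed on the Flow rather than the API service.
project = dataiku.api_client().get_default_project()
variables = project.get_variables()
variables["standard"]["use_api"] = False   # value format must match what the webapp expects
variables["standard"]["api_node_url"] = ""
project.set_variables(variables)
```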

Reproducing these Processes With Minimal Effort For Your Own Data#

The intent of this project is to enable claims teams to understand how Dataiku, and the new GLM plugin, can be used to create an insurance pricing model based on historical claims data. By creating a single solution that can benefit and influence the decisions of a variety of teams within an organization, smarter and more holistic strategies can be designed to leverage GLM pricing solutions, establish effective governance, and centralize pricing workflows without sacrificing agility.

We’ve provided several suggestions on how to use historical claims data to train predictive models but ultimately the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.