Solution | Credit Scoring#

Overview#

Business Case#

Credit decision-making is the cornerstone of successful lending operations and continuously evolves with customer behavior and data availability. The complexity and depth of analysis required to offer competitive pricing and accurate prediction of credit events is ever increasing. Higher performance models demonstrably increase revenue, reduce credit loss, and improve efficiency.

Credit scorecards are a foundational part of a credit team’s workflow, and enhancing them with more powerful data sources and faster collaborative review is vital to retaining and expanding a customer base. Existing tools can be difficult to adapt to this new environment, and future-focused approaches are often disconnected from the team’s current technology and needs. This silos the potential benefits and prevents them from being integrated into the working model that directly impacts customers.

Dataiku provides a unified space that brings together existing business knowledge, machine-assisted analytics (for example, automatic searching of a large number of features and feature iterations for credit signals), and real-time collaboration on credit scorecards. Credit teams can immediately benefit from the value of an ML-assisted approach and establish a foundation on which to build dedicated AI credit scoring models, all while remaining connected to their current customer base and systems.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

Technical Requirements#

To leverage this solution, you must meet the following requirements:

  • Have access to a Dataiku 13.2+* instance.

  • Generalized Linear Models Plugin.

  • A Python 3.8 or 3.9 code environment named solution_credit-scoring with the following required packages:

scorecardpy==0.1.9.2
monotonic-binning==0.0.1
scikit-learn>=1.0,<1.1
Flask<2.3
glum==2.6.0
cloudpickle>=1.3,<1.6
lightgbm>=3.2,<3.3
scikit-optimize>=0.7,<0.10
scipy>=1.5,<1.6
statsmodels==0.12.2
xgboost==0.82

Data Requirements#

The project is initially shipped with all datasets using the filesystem connection. One input dataset must contain at least two mandatory columns with some additional columns to build the models. The mandatory columns that need to be mapped are:

  • the credit event variable which should be constructed to fit the specifications of the study

  • the id column, a unique identifier for each applicant
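The two mandatory columns can be sanity-checked before the data is connected. The sketch below is only an illustration, not part of the Solution: the function name, the column names `applicant_id` and `credit_event`, and the assumption that the credit event is encoded as 0/1 are all hypothetical.

```python
def validate_input(rows, target_col, id_col):
    """Check the two mandatory columns on a dataset given as a list of dicts.

    Assumption (not from the Solution): the credit event is encoded as 0 or 1.
    """
    if not rows:
        raise ValueError("dataset is empty")
    missing = [c for c in (target_col, id_col) if c not in rows[0]]
    if missing:
        raise ValueError(f"missing mandatory columns: {missing}")
    ids = [r[id_col] for r in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("id column must be unique for each applicant")
    unexpected = {r[target_col] for r in rows} - {0, 1}
    if unexpected:
        raise ValueError(f"unexpected credit event values: {sorted(unexpected)}")
    return True


# Hypothetical sample rows; "income" stands in for the additional feature columns.
sample = [
    {"applicant_id": "A1", "credit_event": 0, "income": 42000},
    {"applicant_id": "A2", "credit_event": 1, "income": 18000},
]
is_valid = validate_input(sample, target_col="credit_event", id_col="applicant_id")
```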

Workflow Overview#

You can follow along with the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level steps:

  1. Connect your data as input and select your model parameters via the Dataiku Application.

  2. Explore your credit model with pre-built visualizations.

  3. Understand your scorecards through the webapp, within a responsible framework.

Walkthrough#

Note

In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Plug and play with your own data and parameter choices#

To begin, create a new instance of the Credit Scoring Application by selecting the Dataiku Application from your instance home and clicking Create App Instance. The project is delivered with sample data that can be used to run an initial demo of the solution, or it can be replaced with your own data. To use your own data, select a Snowflake or PostgreSQL data connection and press Reconfigure, which switches the dataset connections on the Flow. Once the reconfiguration is complete, enter the name of the input dataset, press the Load button, and refresh the page so that the Dataiku App reflects your data accurately.

Use the Feature Identification section to configure the two mandatory columns mentioned in the data requirements. The user can then select sensitive variables from the dataset. These will not be used as features to predict the applicant’s creditworthiness: they are removed from the core analysis and analyzed specifically with the Responsible Credit Scoring Consideration.

Dataiku screenshot of the Dataiku Application that enables feature identification.

The next step is Feature Filtering, which takes place in two successive steps. First, univariate filters are driven by Information Value and Chi-Square p-value. The user specifies a threshold for each of these metrics to separate kept variables from discarded ones. Then, the correlation filter removes, from each pair of correlated variables, the one with the lower Information Value. The user selects the correlation threshold (the absolute correlation is compared to this value) and chooses between Spearman and Pearson as the method for computing correlation. The filtering occurs when pressing the Run button, which triggers the scenario.
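The two filtering steps can be sketched in plain Python. This is not the Solution's implementation: the function names, the greedy strategy (visit variables by decreasing Information Value and drop any variable too correlated with one already kept), and the sample values are assumptions made for illustration.

```python
import math


def univariate_filter(iv, p_values, iv_min, p_max):
    """Keep variables with a high enough IV and a low enough chi-square p-value."""
    return [v for v in iv if iv[v] >= iv_min and p_values[v] <= p_max]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0  # constant column: treat as uncorrelated


def ranks(values):
    """Average ranks (1-based), so that ties share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def correlation_filter(features, iv, threshold, method="spearman"):
    """From each correlated pair, keep only the variable with the higher IV."""
    corr = spearman if method == "spearman" else pearson
    kept = []
    for name in sorted(features, key=lambda v: iv[v], reverse=True):
        if all(abs(corr(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: income_k is almost a rescaled copy of income.
features = {
    "income": [1, 2, 3, 4, 5],
    "income_k": [2, 4, 6, 8, 10.5],
    "age": [5, 1, 4, 2, 3],
}
iv = {"income": 0.4, "income_k": 0.3, "age": 0.1}
selected = correlation_filter(features, iv, threshold=0.8)
```

With these sample values, `income_k` is dropped because it is perfectly rank-correlated with `income` and has the lower Information Value.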

After that, Feature Binning bins the features using the weight of evidence and encodes the variables using that same metric. The Run button triggers the Bin Variables scenario, after which the user can edit the editable dataset to give the bins more meaningful labels.
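For readers unfamiliar with the metric, here is a minimal sketch of how weight of evidence (WoE) and Information Value are commonly computed for an already binned variable. It assumes the convention WoE = ln(share of non-events / share of events) and that every bin contains at least one event and one non-event; the Solution may use a different sign convention or smoothing, and the function name is hypothetical.

```python
import math


def woe_and_iv(bin_labels, target):
    """Return ({bin: WoE}, IV) for one binned variable and a 0/1 credit event."""
    counts = {}
    for label, y in zip(bin_labels, target):
        per_bin = counts.setdefault(label, [0, 0])  # [non-events, events]
        per_bin[y] += 1
    total_non_events = sum(c[0] for c in counts.values())
    total_events = sum(c[1] for c in counts.values())

    woe, iv = {}, 0.0
    for label, (non_events, events) in counts.items():
        share_non_events = non_events / total_non_events
        share_events = events / total_events
        woe[label] = math.log(share_non_events / share_events)
        iv += (share_non_events - share_events) * woe[label]
    return woe, iv


# Hypothetical binned variable: the "low" bin concentrates credit events.
bins = ["low"] * 4 + ["high"] * 4
events = [1, 1, 1, 0, 1, 0, 0, 0]
woe, iv = woe_and_iv(bins, events)
```

Encoding a variable then amounts to replacing each value with the WoE of the bin it falls into.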

Dataiku screenshot of the Dataiku Application that enables Plug and Play usage of the solution.

In Feature Selection, the user specifies one of the three available methods for feature selection and the desired number of selected features. Score Card Building fits the logistic regression coefficients and defines the three parameters used to scale the scorecard as desired (Base score, Base odds & Points to Double Odds). After launching the Build Score Card scenario, the scorecard can be accessed in the dashboard. The last step is the third scenario, which updates the API and refreshes the webapp.
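The three scaling parameters are typically combined as in the standard scorecard scaling formula sketched below. This is an assumption about the usual technique, not a description of the Solution's code: odds are taken here as non-event to event, and the default values (600, 50, 20) are purely illustrative.

```python
import math


def scaling_constants(base_score, base_odds, pdo):
    """Factor and offset such that score = offset + factor * ln(odds)."""
    factor = pdo / math.log(2)  # doubling the odds adds `pdo` points
    offset = base_score - factor * math.log(base_odds)
    return factor, offset


def score_from_probability(p_event, base_score=600, base_odds=50, pdo=20):
    """Scale a predicted credit event probability into a score."""
    factor, offset = scaling_constants(base_score, base_odds, pdo)
    odds = (1 - p_event) / p_event  # odds of no credit event
    return offset + factor * math.log(odds)


score_at_base_odds = score_from_probability(1 / 51)      # odds of 50 to 1
score_at_double_odds = score_from_probability(1 / 101)   # odds of 100 to 1
```

With these parameters, an applicant at odds of 50 to 1 receives the base score, and each doubling of the odds adds the Points to Double Odds.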

As borrowing is a key service for people in society and access to it can significantly affect one’s economic opportunities, a Fairness Analysis was added to the solution. This analysis is carried out on one sensitive variable at a time. The user selects the specific sensitive variable and runs the process with the scenario Analyze Sensitive Variable.

Further explore your credit model with pre-built visualizations#

The Credit Scoring Dashboard contains 5 pages: Feature Filtering, Feature Binning, Feature Selection, Credit Model, and Responsible Credit Scoring.

The first page, Feature Filtering, contains visualizations published from the Dataiku Application. The first two graphs represent the Information Value of each variable with respect to the credit event target. The higher the value, the more information the variable contains to explain the credit event. The graphs below them represent the Chi-Square p-value, which is read similarly except that here, the lower, the better. The page also has a statistical card in the form of a correlation matrix, shown below. The user should focus on the brighter tiles, either red or blue, which indicate significant positive or negative correlations.

Dataiku screenshot of the correlation matrix in the dashboard.

The Feature Binning page contains a set of three charts. The two graphs on the right help observe the bins and their weight of evidence. To read them, click on them and adjust the filters to show only one variable at a time. The third graph plots the information value of each variable after binning, and the line marks the threshold for keeping a variable.

Feature Selection is achieved using one of the three Automated Feature Selection algorithms available in the Dataiku Application: Forward Selection, Lasso Selection, and Tree-based Selection. The bar chart represents either the absolute value of the coefficients for Forward and Lasso Selection or the Feature Importance for Tree-based Selection. It indicates the rank of each variable in the selection.

Dataiku screenshot of some of the charts that can be used to understand feature selection

The Credit Model page helps the user understand the scorecard, which is presented in two formats. The first is a dataset that contains one row per bin of each selected feature, with a precomputed score for each bin. The second is a chart representing the average credit event frequency per group of scores, computed on a test dataset. Scores are computed using the scorecard and then binned so that the credit event frequency within each bin can be estimated reliably.
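The second format can be reproduced on any scored test set. The sketch below assumes fixed-width score bands, which may differ from the grouping used in the dashboard; the function name and sample values are hypothetical.

```python
def event_rate_by_score_band(scores, events, band_width=50):
    """Average credit event frequency per fixed-width band of scores."""
    bands = {}
    for score, event in zip(scores, events):
        lower = int(score // band_width) * band_width
        tally = bands.setdefault(lower, [0, 0])  # [applicants, credit events]
        tally[0] += 1
        tally[1] += event
    return {
        f"{lower}-{lower + band_width - 1}": n_events / n
        for lower, (n, n_events) in sorted(bands.items())
    }


# Hypothetical test-set scores and observed credit events (1 = event).
rates = event_rate_by_score_band([510, 540, 560, 590, 610], [1, 1, 0, 1, 0])
```

A well-behaved scorecard should show the event frequency decreasing as the score band increases, as it does in this toy sample.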

Dataiku screenshot of the Credit Model's tab in the dashboard to understand the scorecard.

Finally, the Responsible Credit Scoring page focuses on the aspect of the project that follows the Responsible AI (RAI) framework. First, two tests are run to check the relationship between the sensitive variable and the target. Then, the chi-square test looks at the independence of sensitive_variable and target. If they are not independent and/or their means differ, some concerns might be raised about bias in the data and how it can affect populations.
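As an illustration of the independence check, the chi-square statistic can be computed from a contingency table of sensitive-variable groups against the target. This sketch is not the Solution's code: it applies no continuity correction, the p-value helper is valid only for a 2x2 table (one degree of freedom), and the counts are made up.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts. Rows: two groups of the sensitive variable.
# Columns: credit event, no credit event.
table = [[30, 70], [50, 50]]
statistic, dof = chi_square_independence(table)
p_value = p_value_one_dof(statistic)
```

A small p-value, as in this made-up table, suggests the sensitive variable and the target are not independent, which is the situation that would warrant a closer look for bias.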

Dataiku screenshot of the Responsible Credit Scoring page in the dashboard.

Webapp, API and Responsible AI#

The webapp provides an interactive experience to navigate the scorecard and is built using the parameters defined in the Dataiku Application. Users can modify input values on the left-hand side panel and visualize the results on the main screen. For each variable, a card shows all the possible buckets a value can belong to and their corresponding number of points. The final score will be the sum of the points from each variable. Hence, depending on the number of points attributed to a variable, it might contribute positively or negatively to the final score.
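The additive logic described above can be sketched as a lookup followed by a sum. The scorecard layout, variable names, bucket boundaries, and point values below are all hypothetical and only illustrate the mechanism.

```python
def bucket_points(buckets, value):
    """Return the points of the bucket [lower, upper) that contains `value`."""
    for lower, upper, points in buckets:
        if lower <= value < upper:
            return points
    raise ValueError(f"value {value} does not fall into any bucket")


def total_score(scorecard, applicant):
    """Final score: the sum of the points obtained for each variable."""
    return sum(bucket_points(scorecard[var], applicant[var]) for var in scorecard)


# Hypothetical scorecard: (lower bound, upper bound, points) per bucket.
scorecard = {
    "age": [(18, 30, 20), (30, 50, 35), (50, 120, 45)],
    "debt_ratio": [(0.0, 0.2, 40), (0.2, 0.5, 25), (0.5, 10.0, -5)],
}
score = total_score(scorecard, {"age": 42, "debt_ratio": 0.6})
```

Here the age bucket adds 35 points while the high debt ratio subtracts 5, which shows how a single variable can pull the final score down.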

Dataiku screenshot of the solution's webapp

In the Flow, two datasets are exported to CSV format so that they can be used in the API service. The API exposes a model outside of the project and makes it actionable. In our case, the score is not a direct output of the model (which outputs a prediction and a probability from the raw logistic regression) but the result of further computation to scale the score.

As previously mentioned, credit scoring impacts people’s lives. Statistical techniques appear to have removed human-biased judgment from these decisions, as quantitative measures seem more objective. However, the ways in which data is processed and models are designed can also create or perpetuate bias that continues to affect some groups of people. That is why this project follows the Responsible AI framework developed by Dataiku.

Reproducing these Processes With Minimal Effort For Your Own Data#

The intent of this project is to enable Credit Risk Analysts to understand how Dataiku can be used to create a credit-worthiness model to build scorecards. By creating a single solution that can benefit and influence the decisions of a variety of teams in a single organization or across multiple organizations, credit teams can immediately benefit from the value of an ML-assisted approach and establish a foundation on which to build dedicated AI credit scoring models, all while remaining connected to their current customer base and systems.

We’ve provided several suggestions on how to use a large number of features and feature iterations for credit signals to build dedicated AI credit scoring models but ultimately, the best approach will depend on your specific needs and data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.