Solution | Credit Card Fraud#

Overview#

Business case#

Although fraud detection systems are complex and well established, they are often based on business rules alone. Integrating these setups with machine learning opens opportunities to detect fraudulent behavior more effectively and to focus investigators' attention where it matters most.

Providing a unified space where teams can manage business rules alongside machine learning approaches, with room for both sandbox experimentation and enterprise-grade productionization, ensures that the gains from machine learning are realized without losing the established success of existing approaches.

This solution provides a robust and data-oriented template project in Dataiku that empowers investigator teams to monitor customer transactions, successfully detect fraud and efficiently handle alerts.

Installation#

The process to install this Solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data requirements.

Technical requirements#

To leverage this Solution, you must meet the following requirements:

  • Have access to a Dataiku 12.0+ instance.

  • When you first download the Solution, you will need to change the path of the logs folder in the Flow so that it points to the folder containing your API logs. Your instance admin can provide you with the connection pointing to these logs.

Data requirements#

The Solution requires four datasets:

  • whitelist: Includes a whitelist of customers and terminals.

  • blacklist: Includes a blacklist of customers and terminals.

  • transactions: Contains information about each transaction made from a customer account to a payment terminal. This dataset contains 18 columns, detailed in the wiki.

  • country_list: Represents a list of countries and their associated risk levels. This data is publicly available from KnowYourCountry.

The 0 - Input Flow zone converts both the whitelist and blacklist datasets to editable datasets. You can use them to systematically block or authorize transactions from specific customers or to particular terminals.
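For reference, the effect of these lists can be expressed in a few lines of Python. The sketch below is illustrative only, assuming hypothetical column names (CUSTOMER_ID, TERMINAL_ID); the actual blocking and authorization logic lives in the Flow's recipes.

```python
import dataiku

# Illustrative only: column names are assumptions; see the wiki for the real schema.
tx = dataiku.Dataset("transactions").get_dataframe()
whitelist = dataiku.Dataset("whitelist").get_dataframe()
blacklist = dataiku.Dataset("blacklist").get_dataframe()

# Systematically authorize transactions from whitelisted customers or terminals...
tx["force_authorize"] = (
    tx["CUSTOMER_ID"].isin(whitelist["CUSTOMER_ID"])
    | tx["TERMINAL_ID"].isin(whitelist["TERMINAL_ID"])
)

# ...and systematically block transactions from blacklisted ones.
tx["force_block"] = (
    tx["CUSTOMER_ID"].isin(blacklist["CUSTOMER_ID"])
    | tx["TERMINAL_ID"].isin(blacklist["TERMINAL_ID"])
)
```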

Workflow overview#

You can follow along with the Solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow zones.

The project has the following high-level goals:

  1. Train and deploy a first model.

  2. Detect fraud in real-time.

  3. Investigate alerts.

Walkthrough#

Note

In addition to reading this document, it's recommended to read the project wiki before beginning, to gain a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

How to explore the solution#

To begin, you will need to create a new instance of the Credit Card Fraud application. You can do this by selecting the Dataiku application from your instance home and clicking Create App Instance. The project includes sample data, which you should replace with your own whitelist, blacklist, and transactions data. The Dataiku application will prompt you to upload or connect your own data as you follow along.

This Solution has been specifically designed with the Dataiku application as the primary means of exploration. You are welcome to walk through the Solution's underlying Flow instead, but the Dataiku application surfaces many important insights and explains how the Solution works.

When viewing the solution in its project view, it’s recommended to use the project tags. They have been applied to relate datasets, recipes, scenarios, and dashboards to project objectives, the type of insights they generate, and more.

Lastly, this Solution depends on a variety of project variables, which are automatically changed via scenarios. The wiki provides a full table describing each project variable.

Dataiku screenshot of the final project Flow showing the tagging system.

Exploratory data analysis#

The Train & Deploy a First Model section of the app begins by giving you the option to input your own transactions data. Optionally, you can provide the number of weeks you want to analyze in the dashboard. You can then click the Run Now button to Clean & Process your input data. This launches a scenario that builds the first few datasets of the 1 - Feature Engineering Flow zone.

Specifically, the involved recipes (see the sketch after this list):

  • Join the transactions to the list of countries.

  • Extract date components from the transactions dataset.

  • Exclude transactions with negative amounts.

  • Compute the log of the amount of the transactions for use later.

  • Create windows of time to observe customer spending behavior.
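As a rough guide to what those recipes do, here is a minimal pandas sketch of the same preparation steps. The column names (TX_DATETIME, TX_AMOUNT, COUNTRY, CUSTOMER_ID) and the 7-day window are assumptions for illustration; the actual recipes and schema are documented in the wiki.

```python
import numpy as np
import pandas as pd

def engineer_features(tx: pd.DataFrame, countries: pd.DataFrame) -> pd.DataFrame:
    # Join the transactions to the list of countries and their risk levels.
    tx = tx.merge(countries, on="COUNTRY", how="left")

    # Extract date components from the transaction timestamp.
    tx["TX_DATETIME"] = pd.to_datetime(tx["TX_DATETIME"])
    tx["tx_hour"] = tx["TX_DATETIME"].dt.hour
    tx["tx_weekday"] = tx["TX_DATETIME"].dt.weekday

    # Exclude transactions with negative amounts.
    tx = tx[tx["TX_AMOUNT"] >= 0].copy()

    # Compute the log of the amount for use by the model later.
    tx["tx_amount_log"] = np.log1p(tx["TX_AMOUNT"])

    # Windowed spending behavior: each customer's spend over the past 7 days.
    tx = tx.sort_values("TX_DATETIME").set_index("TX_DATETIME")
    tx["cust_spend_7d"] = (
        tx.groupby("CUSTOMER_ID")["TX_AMOUNT"]
          .transform(lambda s: s.rolling("7D").sum())
    )
    return tx.reset_index()
```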

The 2 - Data Exploration Flow zone is also used to prepare the visualizations for the Dashboard.

Dataiku screenshot of the Dataiku application steps to upload and build your transactions dataset.

Once the run has completed, you can explore the transactions data by going to the 1 - Exploratory Data Analysis dashboard. The dashboard contains three tabs:

  • About this Dashboard: Gives a quick overview of what's available in the other two tabs and how to change the visualizations.

  • Overview: Presents key metrics from the transactions, maps the locations associated with them, charts the evolution of transactions over time, and includes an Analysis by Payment Method.

  • Analysis of Frauds: Presents key metrics on fraudulent transactions in the data, maps the locations of fraudulent transactions, charts the evolution of fraud over time, and includes a fraud type analysis and an Analysis by Payment Method.

Dataiku screenshot of the Overview tab in the Exploratory Data Analysis dashboard.

Train and deploy a first model#

The steps described in this section correspond to recipes and datasets in the 3 - Model Training Flow zone.

Back in the Dataiku application, you’re asked to input time window parameters to scope the training of the model. Additionally, you can input the quantile used to define thresholds from anomaly ratios.

Clicking the Compute Ratio Thresholds Run Now button computes multiple ratios. The thresholds update dynamically in the application after you refresh the page, but you can also edit them manually by changing their associated project variables. The threshold computation occurs in the 1 - Feature Engineering Flow zone.
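For intuition, a quantile-based threshold computation might look like the following sketch. The dataset name, ratio columns, and "threshold_" variable naming scheme are hypothetical; the Solution's real implementation lives in the 1 - Feature Engineering Flow zone.

```python
import dataiku

# Hypothetical names: the ratios dataset and its ratio columns.
ratios = dataiku.Dataset("transactions_ratios").get_dataframe()
quantile = 0.99  # the quantile entered in the Dataiku application

thresholds = {
    "threshold_" + col: float(ratios[col].quantile(quantile))
    for col in ["amount_ratio", "frequency_ratio"]
}

# Persist the thresholds as project variables so the app (or you) can edit them.
project = dataiku.api_client().get_default_project()
variables = project.get_variables()
variables["standard"].update(thresholds)
project.set_variables(variables)
```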

Note

When viewing the parent project of the solution, there are several empty datasets in the 1 - Feature Engineering Flow zone. These intermediary datasets are intentionally empty to make the download size of the solution more manageable. None of these datasets are used for dashboard visualizations and were cleared after the full Flow had been built.

Once you have a set of satisfactory ratio thresholds, you can optionally define model metrics and train an ML model. Refresh the page after model training to see a dynamically updated list of resulting flag weights from the trained model.

Like the ratio thresholds, you can change these weights by editing the project variables. Once you're satisfied with the model's performance, you're given the option to upload blacklists and whitelists of customers whose transactions will be systematically blocked or authorized. You can also choose to use the blacklist and whitelist datasets provided with the Solution to see how the model performs.

Dataiku screenshot of the parameters used to train a model from the App.

The example build of this Solution splits the data into a training dataset containing all transactions with a date at least two months greater than the minimum date, and a validation dataset containing the transactions that occurred during the most recent weeks. The scoring model is based on an XGBoost algorithm (although other algorithms may perform better on different data) and the defined anomaly flags. The score associated with each transaction is a weighted average of the ML prediction and the anomaly flags.
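For intuition, a weighted average of this kind can be sketched as follows. The weight names and the exact averaging scheme are assumptions based on the description above, not the Solution's internal formula.

```python
def transaction_score(ml_probability, flags, weights):
    """Weighted average of the ML prediction and binary anomaly flags.

    The names and averaging scheme here are illustrative assumptions.
    """
    numerator = weights["ml"] * ml_probability
    denominator = weights["ml"]
    for name, value in flags.items():
        numerator += weights[name] * float(value)
        denominator += weights[name]
    return numerator / denominator

# Example: a model probability of 0.42 plus two triggered anomaly flags.
score = transaction_score(
    0.42,
    flags={"blacklisted_terminal": 1, "amount_above_threshold": 1},
    weights={"ml": 0.6, "blacklisted_terminal": 0.2, "amount_above_threshold": 0.2},
)
```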

Having trained a model, you're ready to deploy it. To do so, go to the main Credit Card Fraud project, and navigate to the API Designer section of the project. Select the cc_fraud_model API, and click Publish on Deployer. Lastly, update the project variable API_Service with the version ID you chose.

Dataiku screenshot of the API Deployer in Dataiku.

Detect fraud in real-time#

This section of the Dataiku application makes use of the 1 - Feature Engineering, 2 - Data Exploration, and 3 - Model Training Flow zones, retraining the model based on new data. In addition, it will run the 4 - Data Drift Monitoring Flow zone.

After the model has been deployed, the project runs a scenario every day to compute the features used by the scoring model. Another scenario runs every week to compute the performance of the scoring model. You can activate or deactivate these scenarios in the Scenarios tab of the project.
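If you prefer to trigger these scenarios programmatically, for example from a notebook, a minimal sketch using the Dataiku Python API could look like this. The scenario IDs are placeholders; look up the real ones in the Scenarios tab.

```python
import dataiku

project = dataiku.api_client().get_default_project()

# Scenario IDs are placeholders; look up the real ones in the Scenarios tab.
for scenario_id in ["COMPUTE_FEATURES_DAILY", "MODEL_PERFORMANCE_WEEKLY"]:
    scenario = project.get_scenario(scenario_id)
    scenario.run_and_wait()  # fire a run immediately, outside the schedule
```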

Dataiku screenshot of the steps in the Dataiku application used to set up Model performance monitoring.

You can create a reporter in the scenarios section to notify the organization when updates have been made. Go to the Model Monitoring dashboard to see the performance of the models and understand what values are used to assess performance. However, please note that when you first use this project on your data, it will take time before you can observe changes in performance.

The Model Monitoring dashboard contains four tabs:

  • About this Dashboard: Gives a quick overview of what's available in the other three tabs.

  • Prediction Drift: Shows to what extent the scoring model weights have changed over time.

  • Performance Drift: Contains metrics and visualizations that show how the performance of the model has evolved so far.

  • Input Data Drift: Provides several charts and metrics showing how the distribution of features has changed over time.

Dataiku screenshot showing Performance Drift in the Model Monitoring dashboard.

By exploring these values, you can see whether it's necessary to revisit the design of the model and, perhaps, retrain it with new metrics and drift thresholds. You can do this via the Dataiku application. However, note that the retrained model must be manually deployed following the same steps as for the first trained model. You can automate model retraining via a scenario, with triggers based on elapsed time or on model performance drift, as in the sketch below.
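As a sketch of what a drift-based trigger step could do, assuming a hypothetical drift metrics dataset and retraining scenario ID:

```python
import dataiku

# All names below are placeholders: a drift metrics dataset produced by the
# 4 - Data Drift Monitoring Flow zone and a retraining scenario ID.
drift = dataiku.Dataset("model_drift_metrics").get_dataframe()
latest_drift = float(drift.sort_values("date").iloc[-1]["data_drift_score"])

DRIFT_THRESHOLD = 0.3  # assumption: tune to your own tolerance

if latest_drift > DRIFT_THRESHOLD:
    project = dataiku.api_client().get_default_project()
    project.get_scenario("RETRAIN_MODEL").run()  # hypothetical scenario ID
```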

Investigate alerts#

This section of the Dataiku application uses recipes and datasets contained in the 6 - Alerts Management Flow zone.

Having trained a model with checks in place to ensure its long-term performance in production, you can set up the scenarios necessary for alert investigation. Specifically, you will:

  • Set up the means to view the last detected fraud alerts by priority order.

  • Assign each alert to an investigator.

  • Provide them with the relevant data.

  • Follow the status of investigations.

When you first download the Solution, you will need to change the path of the logs folder in the Flow. It should point to the folder containing your API logs. You can test the scenario by sending fake requests to the API.

Warning

The fake requests use the blacklist uploaded in the previous section of the Dataiku application to be sure that the request leads to alerts. If you left the blacklist empty, you will get an error message.

Dataiku screenshot of the section in the Dataiku application used to set up fraud alerts reporting.

In this section of the Dataiku application, you should input the URL of your API node before sending requests and collecting the alerts. The editable dataset alerts_to_assign allows for assigning alerts to specific investigators.
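A minimal sketch of such a test request, using the dataikuapi package, might look like the following. The node URL, endpoint ID, and feature names are placeholders; use your own API node values and the schema described in the wiki.

```python
import dataikuapi

# Placeholders: use your API node URL and the endpoint defined in the
# cc_fraud_model service; feature names must match your transactions schema.
node = dataikuapi.APINodeClient("https://your-api-node:12000", "cc_fraud_model")

record = {
    "CUSTOMER_ID": "C1093826151",
    "TERMINAL_ID": "T350",
    "TX_AMOUNT": 912.53,
}
response = node.predict_record("cc_fraud_endpoint", record)
print(response["result"])
```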

Moving to the 3 - Alerts Management Dashboard, you’re given three tabs:

  • About this Dashboard: Gives a quick overview of what's available in the other two tabs and how to change the names of the investigators whose assignments you want to see.

  • Overview: Provides key metrics and charts regarding identified fraud alerts.

  • Alerts Board: Allows for analyzing the data and exploring the graphs needed to handle each alert assigned to an investigator. You can filter all visualizations in the tab using the filters on the left.

Once you’ve explored the data related to an alert, you can confirm or invalidate alerts by editing the editable dataset alerts_after_investigation.

Dataiku screenshot of the visualizations provided to investigators for handling their assigned alerts.

Reproducing these processes with minimal effort for your data#

The intent of this project is to enable investigator teams to understand how they can use Dataiku to leverage established insights and rules for credit card fraud detection modeling within a robust and full-featured data science platform, while incorporating new machine learning approaches and ensuring real-time alerts management.

By creating a singular solution that can benefit and influence the decisions of a variety of teams in a single organization, you can design smarter and more holistic strategies to identify potential enhancements and lighten the workload of investigator teams.

This documentation has provided several suggestions on how to derive value from this Solution. Ultimately, however, the “best” approach will depend on your specific needs and data. If you’re interested in adapting this project to the specific goals and needs of your organization, Dataiku offers roll-out and customization services on demand.