Solution | Credit Card Fraud#

Overview#

Business case#

Although fraud detection setups are complex and well-established, they’re often based on business rules only. Combining these setups with machine learning opens opportunities to detect fraudulent behavior more efficiently and to focus investigators on the most relevant alerts.

Providing a unified space for teams to manage business rules alongside machine learning approaches, and allowing for sandbox experimentation and enterprise-grade productionization, ensures the gains from machine learning are realized, without losing established success through existing approaches.

This Solution provides a robust and data-oriented template project in Dataiku that empowers investigator teams to monitor customer transactions, successfully detect fraud and efficiently handle alerts.

Installation#

From the Design homepage of a Dataiku instance connected to the internet, click + Dataiku Solutions.
Search for and select Credit Card Fraud.
If needed, change the folder into which the Solution will be installed, and click Install.
Follow the modal to either install the technical prerequisites below or request an admin to do it for you.

Note

Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.

Technical requirements#

To leverage this Solution, you must meet the following requirements:

Have access to a Dataiku 12.0+ instance.
When you first download the Solution, you will need to change the path of the logs folder in the workflow so that it points to the one that contains the logs of your APIs. Your instance admin will be able to provide you with the connection pointing to these logs.

Data requirements#

The Solution requires four datasets:

Dataset	Description
whitelist	Includes a whitelist of customers and terminals.
blacklist	Includes a blacklist of customers and terminals.
transactions	Contains information about each transaction mode from a customer account to a payment terminal. This dataset contains 18 columns detailed in the wiki.
country_list	Represents a list of countries and their associated risk levels. This data is publicly available from KnowYourCountry.

The 0 - Input Flow zone converts both the whitelist and blacklist datasets to editable datasets. You can use them to systematically block or authorize transactions from specific customers or to particular terminals.
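
As an illustration of how such lists can override a score, here is a minimal Python sketch. It isn't the Solution's own implementation: the function name, the dictionary layout, and the rule that the blacklist takes precedence over the whitelist are all assumptions made for the example.

```python
from typing import Optional


def list_override(customer_id, terminal_id, whitelist, blacklist) -> Optional[str]:
    """Return 'block', 'authorize', or None when neither list applies."""
    # Assumption: the blacklist wins if an ID appears on both lists.
    if customer_id in blacklist["customers"] or terminal_id in blacklist["terminals"]:
        return "block"
    if customer_id in whitelist["customers"] or terminal_id in whitelist["terminals"]:
        return "authorize"
    return None  # no override: fall back to the scoring model


# Hypothetical list contents, for illustration only.
whitelist = {"customers": {"C001"}, "terminals": {"T100"}}
blacklist = {"customers": {"C666"}, "terminals": {"T999"}}

decision = list_override("C666", "T100", whitelist, blacklist)
```

Here the customer is blacklisted and the terminal is whitelisted, so under the stated precedence assumption the transaction is blocked.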

Workflow overview#

You can follow along with the Solution in the Dataiku gallery.

The project has the following high-level goals:

Train and deploy a first model.
Detect fraud in real-time.
Investigate alerts.

Walkthrough#

Note

In addition to reading this document, it’s recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

How to explore the Solution#

To begin, you’ll need your own instance of the Dataiku app associated with this Solution.

From the waffle menu of the Design node’s top navigation bar, select Dataiku Apps.
Search for and select the application with the name of this Solution.
Then, click Create App Instance.

The project includes sample data. You should replace it with your own whitelist, blacklist, and transaction data. The Dataiku app will prompt you to upload or connect your own data as you follow along.

This Solution has been specifically designed with the Dataiku app as the primary source of exploration. You are welcome to walk through the Solution’s underlying Flow instead, but the Dataiku app delivers many important insights and information about how it works.

When viewing the Solution in its project view, it’s recommended to use the project tags. They have been applied to relate datasets, recipes, scenarios, and dashboards to project objectives, the type of insights they generate, and more.

Lastly, this Solution depends on a variety of project variables, which are automatically changed via scenarios. The wiki provides a full table describing each project variable.

Exploratory data analysis#

The Train & Deploy a First Model section of the app begins by giving you the option to input your own transactions data. Optionally, you can provide the number of weeks you want to analyze in the dashboard. You can then click the Run Now button to Clean & Process your input data. This launches a scenario that builds the first few datasets of the 1 - Feature Engineering Flow zone.

Specifically, the involved recipes:

Join the transactions to the list of countries.
Extract date components from the transactions dataset.
Exclude transactions with negative amounts.
Compute the log of the amount of the transactions for use later.
Create windows of time to observe customer spending behavior.
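
The five steps above can be sketched in plain Python as follows. This is a rough illustration rather than the recipes the Solution actually ships: the column names (customer_id, country, amount, timestamp), the 7-day trailing window, and the use of log1p instead of a plain logarithm (so that zero amounts stay finite) are assumptions.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def prepare_transactions(transactions, country_risk, window_days=7):
    """Apply the five preparation steps to a list of transaction dicts."""
    prepared = []
    history = defaultdict(list)  # customer_id -> [(timestamp, amount), ...]
    for tx in sorted(transactions, key=lambda t: t["timestamp"]):
        if tx["amount"] < 0:
            continue  # exclude transactions with negative amounts
        ts = tx["timestamp"]
        row = dict(tx)
        # Left join on the country list; unknown countries get None.
        row["country_risk"] = country_risk.get(tx["country"])
        # Extract date components.
        row["hour"] = ts.hour
        row["day_of_week"] = ts.weekday()  # Monday is 0
        row["month"] = ts.month
        # Assumption: log1p rather than log, so a zero amount is allowed.
        row["log_amount"] = math.log1p(tx["amount"])
        # Trailing window over the customer's earlier transactions.
        start = ts - timedelta(days=window_days)
        past = [a for (t, a) in history[tx["customer_id"]] if t >= start]
        row["window_count"] = len(past)
        row["window_avg_amount"] = sum(past) / len(past) if past else None
        history[tx["customer_id"]].append((ts, tx["amount"]))
        prepared.append(row)
    return prepared


# Hypothetical sample data.
sample = [
    {"customer_id": "C1", "country": "FR", "amount": 100.0, "timestamp": datetime(2024, 1, 1, 10)},
    {"customer_id": "C1", "country": "FR", "amount": -5.0, "timestamp": datetime(2024, 1, 2, 9)},
    {"customer_id": "C1", "country": "XX", "amount": 50.0, "timestamp": datetime(2024, 1, 3, 12)},
]
rows = prepare_transactions(sample, {"FR": "low"})
```

The negative transaction is dropped, and the third transaction sees one earlier transaction of 100.0 in its window.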

The 2 - Data Exploration Flow zone is also used to prepare the visualizations for the Dashboard.

Once the run has completed, you can explore the transactions data by going to the 1 - Exploratory Data Analysis dashboard. The dashboard contains three tabs:

Tab	Description
About this Dashboard	Gives a quick overview of what’s available in the other two tabs and how to change the visualizations.
Overview	Presents key metrics from the transactions, maps of the locations associated with the transactions, visualizations of the evolution of transactions over time, and Analysis by Payment Method.
Analysis of Frauds	Presents key metrics on fraudulent transactions in the data, maps of the locations of fraudulent transactions, the evolution of fraud over time, fraud type analysis, and Analysis by Payment Method.

Train and deploy a first model#

The steps described in this section correspond to recipes and datasets in the 3 - Model Training Flow zone.

Back in the Dataiku app, you’re asked to input time window parameters to scope the training of the model. Additionally, you can input the quantile used to define thresholds from anomaly ratios.

Clicking the Run Now button under Compute Ratio Thresholds computes multiple ratios. The thresholds shown in the app update after you refresh the page. You can also edit these thresholds manually by editing their associated project variables. The threshold computation occurs in the 1 - Feature Engineering Flow zone.
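
To make the idea of quantile-based thresholds concrete, here is a small sketch. It assumes each anomaly ratio is a positive number per transaction (for example, the amount divided by the customer's average amount) and that a flag is raised when the ratio exceeds the chosen quantile. The ratio name, the 0.95 default, and the linear-interpolation quantile are illustrative choices, not values taken from the Solution.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, for 0 <= q <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def ratio_thresholds(ratios_by_name, q=0.95):
    """Compute one threshold per ratio from its historical values."""
    return {name: quantile(vals, q) for name, vals in ratios_by_name.items()}


def anomaly_flags(tx_ratios, thresholds):
    """Flag (1) every ratio of a transaction that exceeds its threshold."""
    return {name: int(tx_ratios[name] > thresholds[name]) for name in thresholds}


# Hypothetical ratio name and historical values.
history = {"amount_vs_customer_avg": [0.5, 0.8, 1.0, 1.2, 6.0]}
thresholds = ratio_thresholds(history, q=0.5)
flags = anomaly_flags({"amount_vs_customer_avg": 3.0}, thresholds)
```

A higher quantile flags fewer transactions, which is the trade-off the quantile input in the app lets you tune.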

Note

When viewing the parent project of the Solution, there are several empty datasets in the 1 - Feature Engineering Flow zone. These intermediary datasets are intentionally empty to make the download size of the Solution more manageable. None of them is used for dashboard visualizations, and they were cleared after the full Flow had been built.

Once you have a set of satisfactory ratio thresholds, you can optionally define model metrics and train an ML model. Refresh the page after model training to see a dynamically updated list of resulting flag weights from the trained model.

Like the ratio thresholds, you can change these weights by editing the project variables. After you’re satisfied with the model’s performance, you’re given the option to upload blacklist and whitelist customers for which transactions will be either systematically blocked or authorized. You can also choose to use the blacklist and whitelist datasets provided with the Solution to see how the model performs.

The example build of this Solution splits the data into a training dataset containing all transactions with a date at least two months greater than the minimum date, and a validation dataset containing all the transactions that occurred during the last weeks. The scoring model is based on an XGBoost algorithm (although other algorithms may perform better on different data) and the defined anomaly flags. The score associated with each transaction is a weighted average of the ML prediction and the anomaly flags.
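
A weighted average of this kind could look like the sketch below. The flag names and weight values are hypothetical, and the exact formula used by the Solution may differ; the wiki remains the reference.

```python
def fraud_score(ml_probability, flags, weights):
    """Weighted average of the ML prediction and the binary anomaly flags."""
    components = {"ml_prediction": ml_probability, **flags}
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * value for name, value in components.items())
    return weighted_sum / total_weight


# Hypothetical flags and weights.
example_flags = {"flag_amount_ratio": 1, "flag_night_purchase": 0}
example_weights = {"ml_prediction": 2.0, "flag_amount_ratio": 1.0, "flag_night_purchase": 1.0}
score = fraud_score(0.8, example_flags, example_weights)
```

With these numbers the score is (2.0 × 0.8 + 1.0 × 1 + 1.0 × 0) / 4.0 = 0.65. Because the ML probability and the flags all lie between 0 and 1, so does the score as long as the weights are non-negative.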

Having trained a model, you’re ready to deploy it. To do so, go to the main Credit Card Fraud project, and then navigate to the API Designer section of the project. Select the cc_fraud_model API, and click Publish on Deployer. Lastly, set the project variable API_Service to the version ID you chose.

Detect fraud in real-time#

This section of the Dataiku app makes use of the 1 - Feature Engineering, 2 - Data Exploration, and 3 - Model Training Flow zones, retraining the model based on new data. In addition, it will run the 4 - Data Drift Monitoring Flow zone.

After having deployed the model, the project runs a scenario every day computing the features used by the scoring model. Additionally, another scenario runs every week to compute the performance of the scoring model. You can activate or deactivate the scenarios in the Scenario tab of the project.

You can create a reporter in the scenarios section to notify the organization when updates have been made. Go to the Model Monitoring dashboard to see the performance of the models and understand what values are used to assess performance. However, please note that when you first use this project on your data, it will take time before you can observe changes in performance.

The Model Monitoring dashboard contains four tabs:

Tab	Description
About this Dashboard	Gives a quick overview of what’s available in the other three tabs.
Prediction Drift	Shows to what extent the scoring model weights have changed over time.
Performance Drift	Contains metrics and visualizations that show how the performance of the model has evolved so far.
Input Data Drift	Provides several charts and metrics showing how the distribution of features has changed over time.
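
For intuition about what an input data drift metric measures, the sketch below computes the Population Stability Index (PSI), one common way to compare a feature's distribution between a reference period and a recent one. PSI is used here only as a familiar example; this page doesn't state which drift metrics the dashboard relies on, and the bin count is an arbitrary choice.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two non-empty samples, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


reference_amounts = list(range(100))
shifted_amounts = [v + 50 for v in reference_amounts]
psi_same = population_stability_index(reference_amounts, reference_amounts)
psi_shifted = population_stability_index(reference_amounts, shifted_amounts)
```

Identical distributions give a PSI of zero, and the value grows as the recent distribution moves away from the reference.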

By exploring these values, you can see if it’s necessary to revisit the design of the model and, perhaps, retrain it with new metrics and thresholds on the drift. You can do this via the Dataiku app. However, please note that the retrained model will need to be manually deployed following the same steps as for deploying the first trained model. You can automate model retraining via a scenario and configure a trigger based on a period of time and on model performance drift.
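
The logic of such a trigger can be sketched as below. This is not a Dataiku scenario definition: the AUC metric, the 30-day period, the 0.05 tolerance, and the choice to retrain when either condition holds are assumptions for illustration.

```python
from datetime import date


def should_retrain(last_trained, today, reference_auc, current_auc,
                   max_age_days=30, max_auc_drop=0.05):
    """Return True when the model is too old or its performance has dropped."""
    too_old = (today - last_trained).days >= max_age_days
    degraded = (reference_auc - current_auc) > max_auc_drop
    # Assumption: either condition alone is enough to trigger retraining.
    return too_old or degraded


retrain_for_age = should_retrain(date(2024, 1, 1), date(2024, 2, 15), 0.90, 0.89)
retrain_for_drift = should_retrain(date(2024, 1, 1), date(2024, 1, 8), 0.90, 0.80)
keep_model = should_retrain(date(2024, 1, 1), date(2024, 1, 8), 0.90, 0.89)
```

Requiring both conditions instead (replacing `or` with `and`) would retrain less often, at the risk of keeping a degraded model until the period elapses.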

Investigate alerts#

This section of the Dataiku app uses recipes and datasets contained in the 6 - Alerts Management Flow zone.

Having trained a model with checks in place to ensure its long-term performance in production, you can set up the scenarios necessary for alert investigation. Specifically, you will:

Set up the means to view the last detected fraud alerts by priority order.
Assign each alert to an investigator.
Provide them with the relevant data.
Follow the status of investigations.
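
The first two steps can be pictured with the sketch below, which sorts alerts by score and distributes them in turn among investigators. In the Solution, assignment happens through an editable dataset, so the round-robin rule, the field names, and the "open" status are hypothetical.

```python
from itertools import cycle


def assign_alerts(alerts, investigators):
    """Order alerts by descending score and assign them round-robin."""
    if not investigators:
        raise ValueError("at least one investigator is required")
    ordered = sorted(alerts, key=lambda a: a["score"], reverse=True)
    return [
        {**alert, "investigator": investigator, "status": "open"}
        for alert, investigator in zip(ordered, cycle(investigators))
    ]


# Hypothetical alerts and investigator names.
alerts = [
    {"alert_id": "A1", "score": 0.4},
    {"alert_id": "A2", "score": 0.9},
    {"alert_id": "A3", "score": 0.7},
]
assigned = assign_alerts(alerts, ["alice", "bob"])
```

The highest-scoring alert (A2) goes to the first investigator, and the status field gives a place to record the outcome of each investigation later.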

When you first download the Solution, you will need to change the path of the logs folder in the workflow. It should point to the folder containing your API logs. You can test the scenario by sending fake requests to the API.

Warning

The fake requests use the blacklist uploaded in the previous section of the Dataiku app to be sure that the request leads to alerts. If you left the blacklist empty, you will get an error message.

In this section of the Dataiku app, you should input the URL of your API node before sending requests and collecting the alerts. The editable dataset alerts_to_assign allows for assigning alerts to specific investigators.
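
If you want to build a test request yourself, the sketch below prepares one with the Python standard library. It assumes the API node's usual REST prediction route (/public/api/v1/&lt;service&gt;/&lt;endpoint&gt;/predict with a JSON "features" object); check this against your own deployment. The host, the endpoint ID fraud_score, and the feature names are placeholders, while cc_fraud_model is the API name mentioned earlier on this page.

```python
import json
import urllib.request


def build_predict_request(api_node_url, service_id, endpoint_id, features):
    """Build (without sending) a POST request for a prediction endpoint."""
    base = api_node_url.rstrip("/")
    url = f"{base}/public/api/v1/{service_id}/{endpoint_id}/predict"
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL, endpoint ID, and features.
request = build_predict_request(
    "http://localhost:12000/",
    "cc_fraud_model",
    "fraud_score",
    {"customer_id": "C666", "terminal_id": "T100", "amount": 250.0},
)
```

Passing the request to urllib.request.urlopen would send it. Using a blacklisted customer ID in the features mirrors the warning above: it helps make sure the request leads to an alert.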

Moving to the 3 - Alerts Management Dashboard, you’re given three tabs:

Tab	Description
About this Dashboard	Gives a quick overview of what’s available in the other two tabs and how to change the name of the investigators for which you want to see assignments.
Overview	Provides key metrics and charts regarding identified fraud alerts.
Alerts Board	Allows for analyzing the data and exploring the graphs needed to handle each alert assigned to an investigator. You can filter all visualizations of the tab using the filters on the left.

Once you’ve explored the data related to an alert, you can confirm or invalidate alerts by editing the editable dataset alerts_after_investigation.

Reproducing these processes with minimal effort for your data#

The intent of this project is to enable investigator teams to understand how they can use Dataiku to leverage established insights and rules for credit card fraud detection modeling within a robust and full-featured data science platform, while incorporating new machine learning approaches and ensuring real-time alerts management.

By creating a single Solution that can benefit and influence the decisions of a variety of teams in one organization, you can design smarter and more holistic strategies to identify potential enhancements and lighten the workload of investigator teams.

This documentation has provided several suggestions on how to derive value from this Solution. Ultimately, however, the “best” approach will depend on your specific needs and data. If you’re interested in adapting this project to the specific goals and needs of your organization, Dataiku offers roll-out and customization services on demand.