Solution | Credit Card Fraud¶
Fraud detection rules are complex and well-established, but they are often based on business rules alone. Integrating machine learning into these set-ups can make the detection of fraudulent behavior more efficient and help teams focus their effort. A unified space where teams manage business rules alongside machine learning approaches, with room for both sandbox experimentation and enterprise-grade productionization, helps realize the gains from machine learning without losing the established success of existing approaches. This Solution provides a robust and data-oriented template project in Dataiku that empowers investigator teams to monitor customer transactions, successfully detect fraud, and efficiently handle alerts.
To leverage this solution, you must meet the following requirements:
Have access to a DSS 10.0+ instance.
A Python 3.6+ code environment named solution_cc-fraud with the following required packages:
When creating a new code environment, please be sure to use the name solution_cc-fraud or remapping will be required.
When you first download the solution, you will need to change the path of the logs folder in the workflow so that it points to the one that contains the logs of your APIs. Your instance Admin will be able to provide you with the connection pointing to these logs.
If the technical requirements are met, this solution can be installed in one of two ways:
On your Dataiku instance click + New Project > Business solutions > Search for Credit Card Fraud.
Download the .zip project file and upload it directly to your Dataiku instance as a new project.
Four datasets are required for the solution:
whitelist of customers and terminals
blacklist of customers and terminals
transactions contains information about each transaction made from a customer account to a payment terminal. This dataset contains 18 columns, which are detailed in the wiki.
country_list represents a list of countries and their associated risk levels. This data is publicly available and was sourced from KnowYourCountry.
In the 0 - Input flow zone, both the whitelist and blacklist datasets are converted to editable datasets which can be used to systematically block or authorize transactions from specific customers or to particular terminals.
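The override behavior of the two lists can be sketched as follows. This is a minimal illustration rather than the solution's own implementation: the function name, the `"block"` / `"authorize"` / `"score"` labels, the ID formats, and the choice to let the blacklist win when an ID appears in both lists are all assumptions.

```python
def apply_lists(customer_id, terminal_id, whitelist, blacklist):
    """Return 'block', 'authorize', or 'score' for one transaction.

    whitelist and blacklist are sets of customer or terminal IDs
    (a hypothetical format chosen for this sketch).
    """
    ids = {customer_id, terminal_id}
    # Assumption: a blacklisted ID takes precedence over a whitelisted one.
    if ids & blacklist:
        return "block"
    if ids & whitelist:
        return "authorize"
    # Neither list applies, so the transaction is left to the scoring model.
    return "score"


whitelist = {"C001", "T900"}
blacklist = {"C666"}
decision_blocked = apply_lists("C666", "T900", whitelist, blacklist)
decision_allowed = apply_lists("C001", "T123", whitelist, blacklist)
decision_scored = apply_lists("C002", "T123", whitelist, blacklist)
```

Keeping the lists as editable datasets means investigators can change these overrides without retraining anything.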
You can follow along with the solution in the Dataiku gallery.
The project has the following high-level goals:
Train & Deploy a First Model
Detect Fraud In Real-Time
In addition to reading this document, it is recommended to read the wiki of the project before beginning in order to get a deeper technical understanding of how this solution was created, the different types of data enrichment available, longer explanations of solution-specific vocabulary, and suggested future direction for the solution.
How to Explore the Solution¶
To begin, you will need to create a new instance of the Credit Card Fraud Application. This can be done by selecting the Dataiku Application from your instance home, and clicking Create App Instance. The project is delivered with sample data that should be replaced with our own white/blacklists and transaction data. The Dataiku Application will prompt you to upload or connect your own data as you follow along.
This solution has been specifically designed with the Dataiku Application as the primary source of exploration. You are welcome and able to walk through the solution’s underlying flow instead, but the Dataiku Application delivers a lot of important insights and information about how things work. When viewing the Solution in its project view, it is recommended to use the Project tags, which have been applied in order to relate datasets, recipes, scenarios, and dashboards to project objectives, the type of insights they generate, and more. Lastly, this solution depends on a variety of Project Variables which are automatically changed via scenarios. A full table describing each project variable is provided in the wiki.
Exploratory Data Analysis¶
The Train & Deploy a First Model section of the App begins by giving us the option to input our own transactions data and, optionally, to set the number of weeks to analyze for the dashboard. We can then press the Run Now button in order to Clean & Process our input data. This launches a scenario that builds the first few datasets of the 1 - Feature Engineering flow zone. Specifically, the involved recipes join our transactions to the list of countries, extract date components from the transactions dataset, exclude transactions with negative amounts, compute the log of the transaction amounts for later use, and create windows of time to observe customer spending behavior. The 2 - Data Exploration flow zone is also used to prepare the visualizations for the Dashboard.
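The processing steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names (`customer_id`, `country`, `amount`, `date`), the 7-day window, and the sample rows are hypothetical, and the solution itself performs these steps with Dataiku recipes.

```python
import math
from datetime import datetime, timedelta

# Hypothetical sample rows; the real column names are listed in the project wiki.
transactions = [
    {"customer_id": "C1", "country": "FR", "amount": 20.0, "date": "2023-03-01T10:00:00"},
    {"customer_id": "C1", "country": "FR", "amount": 80.0, "date": "2023-03-05T18:30:00"},
    {"customer_id": "C1", "country": "XX", "amount": -5.0, "date": "2023-03-06T09:00:00"},
    {"customer_id": "C1", "country": "XX", "amount": 400.0, "date": "2023-03-20T02:15:00"},
]
country_risk = {"FR": "low", "XX": "high"}  # stand-in for the country_list dataset


def engineer_features(rows, risk_by_country, window_days=7):
    # Exclude negative amounts, then sort so each window can look backwards in time.
    kept = [dict(r, ts=datetime.fromisoformat(r["date"])) for r in rows if r["amount"] >= 0]
    kept.sort(key=lambda r: r["ts"])
    out = []
    for row in kept:
        start = row["ts"] - timedelta(days=window_days)
        # Amounts spent by the same customer in the window before this transaction.
        past = [p["amount"] for p in kept
                if p["customer_id"] == row["customer_id"] and start <= p["ts"] < row["ts"]]
        out.append({
            **row,
            "risk_level": risk_by_country.get(row["country"]),  # join to the country list
            "day_of_week": row["ts"].weekday(),                  # date components (Monday = 0)
            "hour": row["ts"].hour,
            # log is undefined at zero, so zero amounts get None in this sketch
            "log_amount": math.log(row["amount"]) if row["amount"] > 0 else None,
            "window_count": len(past),
            "window_sum": sum(past),
        })
    return out


features = engineer_features(transactions, country_risk)
```

The window features compare each transaction with the customer's own recent spending, which is the kind of behavior the later anomaly ratios rely on.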
Once the run has successfully completed we can explore our transactions data by going to the 1 - Exploratory Data Analysis Dashboard. The dashboard contains 3 tabs:
About this Dashboard gives a quick overview of what is available in the other 2 tabs and how to change the visualizations
Overview presents key metrics from our transactions, maps of the locations associated with them, visualizations of the evolution of transactions over time, and an Analysis by Payment Method.
Analysis of Frauds presents key metrics on fraudulent transactions in our data, maps of the locations of fraudulent transactions, the evolution of fraud over time, a Fraud Type Analysis, and an Analysis by Payment Method.
Train & Deploy a First Model¶
The steps described in this section correspond to recipes and datasets in the Model Training flow zone.
Back in the Dataiku Application, we are asked to input time window parameters to scope the training of our model. Additionally, we can input the quantile used to define thresholds from anomaly ratios. Multiple ratios are computed when we click the Compute Ratio Thresholds Run Now button, and the updated thresholds appear in the Application after refreshing the page. These thresholds can also be adjusted manually by editing their associated project variables. The thresholds are computed in the 1 - Feature Engineering flow zone.
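As a rough sketch of how a quantile turns a set of anomaly ratios into a threshold, consider the example below. It assumes a ratio is something like a transaction amount divided by the customer's usual amount; the actual ratio definitions are in the project wiki, and the sample values and the linear-interpolation quantile are illustrative choices.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical ratios, e.g. amount / customer's average amount over a window.
ratios = [0.5, 0.8, 1.0, 1.1, 1.2, 1.5, 2.0, 3.0, 6.0, 10.0]

threshold = quantile(ratios, 0.9)                # the quantile entered in the Application
flagged = [r for r in ratios if r > threshold]   # ratios that would raise an anomaly flag
```

A higher quantile gives a stricter threshold and fewer flags; a lower one flags more transactions for the model and investigators to consider.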
When viewing the parent project of the Solution, there are several empty datasets in the 1 - Feature Engineering flow zone. These intermediary datasets were left intentionally empty so as to make the download size of the Solution more manageable. None of these datasets are used for dashboard visualizations and were cleared after the full flow had been built.
Once we have a set of ratio thresholds that we’re satisfied with, we can optionally define our model metrics and train our ML model. We can refresh the page after our model has been trained to see a dynamically updated list of resulting flag weights from our trained model. Like our ratio thresholds, these weights can be changed by editing the project variables. After we are satisfied with our model’s performance, we are given the option to upload our blacklist and whitelist customers for which transactions will be either systematically blocked or authorized. You can also choose to use the blacklist and whitelist datasets we have provided with the solution to see how the model performs.
In our build of this Solution, we split our data into a training dataset containing all transactions dated at least 2 months after the earliest date in the data, and a validation dataset containing all the transactions that occurred during the last weeks. The scoring model is based on an XGBoost algorithm (although other algorithms may perform better on different data) and the defined anomaly flags. The score associated with each transaction is a weighted average of the ML prediction and the anomaly flags.
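The weighted average can be illustrated with a short sketch. The flag names, the weights, and the way the ML prediction is given its own weight are assumptions made for this example; the solution stores its actual flag weights in project variables.

```python
def fraud_score(ml_probability, flags, weights, ml_weight):
    """Weighted average of the model probability and binary anomaly flags.

    flags maps a flag name to True/False; weights maps the same names to weights.
    """
    total_weight = ml_weight + sum(weights[name] for name in flags)
    weighted_sum = ml_weight * ml_probability + sum(
        weights[name] * int(active) for name, active in flags.items()
    )
    return weighted_sum / total_weight


# Hypothetical flag weights and one transaction's inputs.
weights = {"amount_ratio": 2.0, "risky_country": 1.0}
flags = {"amount_ratio": True, "risky_country": False}
score = fraud_score(0.4, flags, weights, ml_weight=5.0)
# (5.0 * 0.4 + 2.0 * 1 + 1.0 * 0) / (5.0 + 2.0 + 1.0) = 0.5
```

Because every input lies between 0 and 1, the resulting score does too, which keeps it comparable across transactions.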
Now that our model is trained, we are ready to deploy it. To do so, we go to the main Credit Card Fraud project and navigate to the API Designer section of the project. There we can select the cc_fraud_model API and click Publish on Deployer. Lastly, we should update the project variable API_Service with the version ID we chose.
Detect Fraud In Real-Time¶
This section of the Dataiku Application makes use of the 1 - Feature Engineering, 2 - Data Exploration, and 3 - Model Training flow zones since we will retrain our model based on new data. In addition, it will run the 4 - Data Drift Monitoring flow zone. Once our model is deployed, a scenario is run every day which will compute the features used by the scoring model. Additionally, another scenario is run every week to compute the performance of our scoring model. The scenarios can be activated/deactivated in the Scenario tab of the project.
A reporter can be set up in the scenarios to notify our organization when updates have been made. We can head over to the Model Monitoring Dashboard to see the performance of our models and understand what values are being used to assess performance. However, please note that when you first use this project on your data, it will take time before changes in performance can be observed.
The Model Monitoring dashboard contains 4 tabs:
About this Dashboard gives a quick overview of what is available in the other 3 tabs
Prediction Drift shows us to what extent the scoring model weights have changed over time
Performance Drift contains metrics and visualizations that show how the performance of the model has evolved so far
Input Data Drift provides several charts and metrics showing how the distribution of features has changed over time.
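To make the idea of input data drift concrete, here is a minimal sketch of one common drift measure, the Population Stability Index (PSI). This is not necessarily the metric the dashboard uses; the bin edges, sample values, and the small floor applied to empty bins are all assumptions for illustration.

```python
import math


def psi(reference, current, edges):
    """Population Stability Index between two samples over shared bin edges."""
    def shares(sample):
        counts = [0] * (len(edges) + 1)
        for x in sample:
            # The bin index is the number of edges the value has reached or passed.
            counts[sum(x >= e for e in edges)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical transaction amounts at training time and in production.
training_amounts = [10, 20, 30, 40, 50, 60, 70, 80]
recent_amounts = [60, 70, 80, 90, 100, 110, 120, 130]
edges = [25, 50, 75]

psi_same = psi(training_amounts, training_amounts, edges)
psi_shifted = psi(training_amounts, recent_amounts, edges)
```

Identical distributions give a PSI of zero, and the value grows as the recent distribution moves away from the reference one.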
By exploring these values we can decide whether it is necessary to revisit the design of the model and, perhaps, retrain it with new metrics and drift thresholds. This can be done via the Dataiku Application, but please note that the retrained model will need to be deployed manually, following the same steps that we completed to deploy our first trained model. Model retraining can be automated via a scenario and configured to be triggered based on a period of time AND on the drift in the model performance.
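The combined trigger condition can be sketched as a simple check. In the solution this logic would live in a Dataiku scenario; the function below is only an illustration, and the use of AUC as the performance metric, the 30-day minimum, and the 0.05 drop are hypothetical values.

```python
from datetime import date


def should_retrain(last_trained, today, current_auc, reference_auc,
                   min_days=30, max_auc_drop=0.05):
    """Retrain only when enough time has passed AND performance has drifted."""
    old_enough = (today - last_trained).days >= min_days
    drifted = (reference_auc - current_auc) >= max_auc_drop
    return old_enough and drifted


retrain_now = should_retrain(date(2023, 1, 1), date(2023, 3, 1), 0.82, 0.90)
too_recent = should_retrain(date(2023, 2, 20), date(2023, 3, 1), 0.82, 0.90)
still_good = should_retrain(date(2023, 1, 1), date(2023, 3, 1), 0.89, 0.90)
```

Requiring both conditions avoids retraining on every short-lived dip while still reacting to sustained degradation.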
This section of the Dataiku Application uses recipes and datasets contained in the 6 - Alerts Management flow zone.
Now that we have a trained model with checks in place to ensure its long-term performance in production, we can set up the scenarios necessary for alert investigation. Specifically, we will set up the means to view the last detected fraud alerts by priority order, assign each alert to an investigator and provide them with the relevant data, and follow the status of investigations. When you first download the solution, you will need to change the path of the logs folder in the workflow so that it points to the folder containing your API logs. You can test the scenario by sending fake requests to the API.
The fake requests use the blacklist uploaded in the previous section of the Dataiku Application to be sure that the request leads to alerts. If you left the blacklist empty, you will get an error message.
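For readers who want to see what such a test request might look like outside the Application, the sketch below builds (but does not send) a prediction request. It assumes the service ID matches the cc_fraud_model API name and follows the `/public/api/v1/<service>/<endpoint>/predict` URL pattern of Dataiku API node prediction endpoints; the node URL, the endpoint ID, and the feature names are hypothetical and should be checked against your own deployment.

```python
import json
from urllib.request import Request


def build_predict_request(api_node_url, service_id, endpoint_id, features):
    """Build a POST request for a prediction endpoint without sending it."""
    url = f"{api_node_url.rstrip('/')}/public/api/v1/{service_id}/{endpoint_id}/predict"
    body = json.dumps({"features": features}).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"}, method="POST")


request = build_predict_request(
    "http://localhost:12000",  # placeholder API node URL
    "cc_fraud_model",          # assumed service ID, taken from the API name
    "fraud_endpoint",          # hypothetical endpoint ID
    # Hypothetical feature names; a blacklisted customer should lead to an alert.
    {"customer_id": "C666", "terminal_id": "T123", "amount": 250.0},
)

# Sending is left commented out so the sketch runs without a live API node:
# from urllib.request import urlopen
# response = json.load(urlopen(request))
```

Using a customer or terminal from the blacklist in the payload mirrors how the built-in fake requests guarantee that an alert is produced.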
In this section of the Dataiku Application, we should input the URL of our API node before sending requests and collecting the alerts. The editable dataset alerts_to_assign allows us to assign alerts to specific investigators. Moving to the 3 - Alerts Management Dashboard, we are given 3 tabs:
About this dashboard gives a quick overview of what is available in the other 2 tabs and how to change the name of the investigator(s) for which we want to see assignments.
Overview provides key metrics and charts regarding identified fraud alerts
Alerts Board allows us to analyze the data and explore the graphs needed to handle each alert assigned to an investigator. We can filter all of the visualizations of the tab using the filters on the left.
Once we have explored the data related to an alert, we can confirm or invalidate alerts by editing the editable dataset alerts_after_investigation.
Reproducing these Processes With Minimal Effort For Your Own Data¶
The intent of this project is to enable investigator teams to understand how Dataiku can be used to leverage established insights and rules for credit card fraud detection modeling within a robust and full-featured data science platform, while easily incorporating new machine learning approaches and ensuring real-time alerts management. By creating a single solution that can benefit and influence the decisions of a variety of teams in one organization, smarter and more holistic strategies can be designed to identify potential enhancements and lighten the workload of investigator teams.
We’ve provided several suggestions on how to use transaction records to build models for credit card fraud detection but ultimately the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.