Solution | AML Alerts Triage#
Overview#
Business Case#
Anti-money laundering processes are complex and multifaceted, and generate large numbers of alerts which must be investigated. Most generated alerts are ultimately not escalated for further review. Reducing the total number of false-positive alerts is a complex and heavily regulated process.
Improvements in AML processes must occur at many points in the chain. A modular solution that slots into existing flows and processes existing alerts more efficiently can improve detection rates and reduce alert fatigue, and serves as a first step toward a more efficient AML set-up.
With this adapt-and-apply Solution, Financial Crime analysts are supported in their initial assessment through prioritization by risk likelihood. The Solution also delivers insights that can serve as a starting point for reviewing the effectiveness of existing business rules, paving the way for further reinforcement of the AML set-up.
Installation#
The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.
Dataiku Cloud users should follow the instructions for installing solutions on cloud.
The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.
Once the Solution has been added to your space, move ahead to Data Requirements.
After meeting the technical requirements below, self-managed users can install the Solution in one of two ways:
On your Dataiku instance connected to the internet, click + New Project > Dataiku Solutions > Search for AML Alerts Triage.
Alternatively, download the Solution’s .zip project file, and import it to your Dataiku instance as a new project.
Technical Requirements#
To leverage this solution, you must meet the following requirements:
Have access to a Dataiku 12.0+* instance.
The downloadable version uses filesystem managed datasets and the built-in Dataiku engine as the only processing engine. Performance can be greatly improved by switching all connections to Snowflake and enabling the in-database SQL engine: daily batch computation time can drop from 2 hours to a couple of minutes.
Data Requirements#
We will work with a fictional financial services company called Haiku Bank, and use their data to illustrate the steps to prioritize AML alerts. In this project, alerts are defined at a transaction level, so our initial alerts datasets contain transaction ids, alert ids, and, for historical alerts, the label is_escalated. Additionally, we have a segments input dataset which represents client categories that have been defined using KYC.
Note
This project is meant to be used as a template to guide development of your own analysis on Dataiku’s platform. The results of the model should not be used as actionable insights and the data provided with the project may not be representative of actual data in a real-life project.
Workflow Overview#
You can follow along with the solution in the Dataiku gallery.
The project has the following high level steps:
Join and prepare our input data.
Train an alert triage model.
Conduct rules based triage in parallel.
Monitor the performance of our model over time.
Enable compliance teams to understand the models they use via interactive Model Explainability in a pre-built Dashboard.
Automate the full pipeline to score new alerts, monitor model performance over time, and retrain the model with new data.
Walkthrough#
Note
In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.
Gather Input Data and Prepare for Training#
The first two Flow zones of our project are fairly straightforward.
We begin in the Input Flow zone by bringing in 6 initial datasets:
alerts_history is combined with new_alerts via a stack recipe. The alerts in new_alerts have no value in the is_escalated column since they haven’t yet been classified.
Similarly, transactions_history and new_transactions are stacked together. New transactions correspond to the last day’s transactions but both datasets share the same schema.
This zone also combines client information (accounts) with segments representing KYC created client categories.
This project has scenarios that can be configured to trigger on changes to these initial datasets, but some additional work is needed to connect the project to datasets that are regularly updated. Within the first Flow zone, some additional data preparation is done to handle email addresses, IP addresses, and home addresses.
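As a rough illustration of what the stack step does, here is a minimal plain-Python sketch. It is not the Solution's stack recipe: the rows and the helper name `stack_alerts` are made up for the example, and only the column names `alert_id`, `transaction_id`, and `is_escalated` come from the description above.

```python
def stack_alerts(history, new):
    """Stack labelled historical alerts and unlabelled new alerts into one list.

    Assumption: each alert is a dict; new alerts simply lack the label.
    """
    stacked = [dict(row) for row in history]
    for row in new:
        stacked_row = dict(row)
        # New alerts have not been classified yet, so the label stays empty.
        stacked_row.setdefault("is_escalated", None)
        stacked.append(stacked_row)
    return stacked


# Hypothetical sample rows, for illustration only.
alerts_history = [
    {"alert_id": "A1", "transaction_id": "T1", "is_escalated": 0},
    {"alert_id": "A2", "transaction_id": "T2", "is_escalated": 1},
]
new_alerts = [{"alert_id": "A3", "transaction_id": "T3"}]

alerts = stack_alerts(alerts_history, new_alerts)
```

Because both inputs share the same schema apart from the label, the stacked output can be processed by a single downstream pipeline.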
Moving along to the Data Preparation Flow zone, we sequentially process the data to engineer additional features for the model. A first prepare recipe computes the age of the customer, and then a window recipe creates all the major features that will be important for the model. The data is then filtered to keep only alerts before being split into three datasets:
| Dataset | Description |
|---|---|
| alerts_train | Contains the bulk of the data. |
| alerts_test | Contains the last 4 weeks of data (the split is done automatically via a project variable set by the scenario Evaluate Model). |
| alerts_unlabelled | Contains the unlabelled alerts from the last day. |
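The three-way split can be pictured with the following sketch. It assumes a hypothetical `alert_date` field and a `test_weeks` parameter standing in for the project variable; in the Solution the split is performed by visual recipes, not by this code.

```python
from datetime import date, timedelta


def split_alerts(alerts, today, test_weeks=4):
    """Split alerts into train, test (last `test_weeks` weeks) and unlabelled sets."""
    cutoff = today - timedelta(weeks=test_weeks)
    train, test, unlabelled = [], [], []
    for alert in alerts:
        if alert["is_escalated"] is None:
            unlabelled.append(alert)   # not yet classified
        elif alert["alert_date"] >= cutoff:
            test.append(alert)         # recent labelled alerts, held out
        else:
            train.append(alert)        # the bulk of the history
    return train, test, unlabelled


# Hypothetical sample rows, for illustration only.
sample = [
    {"alert_id": "A1", "alert_date": date(2024, 1, 5), "is_escalated": 0},
    {"alert_id": "A2", "alert_date": date(2024, 3, 20), "is_escalated": 1},
    {"alert_id": "A3", "alert_date": date(2024, 3, 31), "is_escalated": None},
]
alerts_train, alerts_test, alerts_unlabelled = split_alerts(sample, today=date(2024, 3, 31))
```

Splitting on time rather than at random keeps the test set representative of the most recent alerts, which is what the later performance monitoring relies on.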
Training an Alert Triage Model#
The Alert Triage Flow zone takes two datasets as input: alerts_train and alerts_unlabelled. We train a two-class classification model that predicts the previously mentioned is_escalated variable. The dataset is imbalanced, with around 4% of alerts labelled 1 and the rest labelled 0; the imbalance is handled using class weights as the weighting strategy. All variables that make sense are included in the model and processed in a standard way (dummy encoding for categorical variables and normalization for numerical variables). In the design part of the model, we choose a custom cost matrix function to optimize the threshold. False negatives are heavily weighted, while true positives and false positives have the same weight but with opposite signs. The user can adjust how much each outcome matters through these weights in the model’s parametrization.
We ultimately selected the XGBoost algorithm because it matches the performance of the Random Forest algorithm while being lighter. Looking at the variable importance of our model, orig_is_escalated_avg is the most important variable, which makes sense because past escalations are a strong signal that a new alert will be escalated.
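To make the weighting and threshold logic concrete, here is a small standalone sketch. The weight values, the "balanced" class-weight heuristic, and the function names are assumptions chosen for illustration; they are not the settings or the implementation used by the Solution.

```python
def balanced_class_weights(labels):
    """Common 'balanced' heuristic: weight = n_samples / (n_classes * class_count)."""
    classes = sorted(set(labels))
    n = len(labels)
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def cost_matrix_gain(y_true, proba, threshold, w_tp=1.0, w_fp=-1.0, w_fn=-10.0, w_tn=0.0):
    """Total gain of thresholded predictions under a cost matrix.

    Illustrative weights: false negatives are heavily penalised; true and
    false positives have the same magnitude with opposite signs.
    """
    gain = 0.0
    for y, p in zip(y_true, proba):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            gain += w_tp
        elif predicted == 1 and y == 0:
            gain += w_fp
        elif predicted == 0 and y == 1:
            gain += w_fn
        else:
            gain += w_tn
    return gain


def best_threshold(y_true, proba, candidates):
    """Pick the candidate threshold with the highest total gain."""
    return max(candidates, key=lambda t: cost_matrix_gain(y_true, proba, t))


# Roughly 4% positives, as in the text.
weights = balanced_class_weights([1] * 4 + [0] * 96)

# Tiny made-up example of threshold selection.
y_true = [0, 0, 0, 1, 1]
proba = [0.1, 0.3, 0.6, 0.4, 0.9]
chosen = best_threshold(y_true, proba, candidates=[0.2, 0.5, 0.8])
```

With a heavy penalty on false negatives, the lowest threshold wins in this toy example: missing an escalation costs more than reviewing a few extra alerts.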
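The source does not spell out how orig_is_escalated_avg is computed. One plausible reading, sketched below, is a running average of past escalations for the same originator, which is the kind of feature a window recipe can produce. The `originator_id` field and the helper name are hypothetical.

```python
from collections import defaultdict


def add_past_escalation_avg(alerts):
    """Add orig_is_escalated_avg: mean of is_escalated over strictly earlier
    labelled alerts from the same originator.

    Assumption: `alerts` is already sorted chronologically.
    """
    history = defaultdict(lambda: [0, 0])  # originator -> [sum of labels, count]
    enriched = []
    for alert in alerts:
        total, count = history[alert["originator_id"]]
        row = dict(alert)
        row["orig_is_escalated_avg"] = total / count if count else None
        enriched.append(row)
        # Only labelled alerts feed the history, and only after the row is
        # built, so the current label never leaks into its own feature.
        if alert["is_escalated"] is not None:
            history[alert["originator_id"]][0] += alert["is_escalated"]
            history[alert["originator_id"]][1] += 1
    return enriched


# Hypothetical chronological sample.
rows = add_past_escalation_avg([
    {"originator_id": "C1", "is_escalated": 1},
    {"originator_id": "C1", "is_escalated": 0},
    {"originator_id": "C2", "is_escalated": 0},
    {"originator_id": "C1", "is_escalated": None},
])
```

Using only earlier alerts matters: including the current alert's label in its own feature would leak the target into the model.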
The deployed model is then used to score the new alerts and the main output from this scoring is the proba_1 column which can be interpreted as a priority. Compliance officers would first investigate the high priority alerts before processing the ones further down the list. Thus the scored alerts are sorted by priority and represent the output of the project that will be used elsewhere.
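The prioritization itself amounts to a descending sort on proba_1, as in this short sketch with made-up scores:

```python
# Hypothetical scored alerts; proba_1 is the predicted escalation probability.
scored_alerts = [
    {"alert_id": "A1", "proba_1": 0.12},
    {"alert_id": "A2", "proba_1": 0.87},
    {"alert_id": "A3", "proba_1": 0.45},
]

# Highest-priority alerts first, so investigators start at the top of the list.
review_queue = sorted(scored_alerts, key=lambda alert: alert["proba_1"], reverse=True)
queue_ids = [alert["alert_id"] for alert in review_queue]
```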
Using a Rules-Based Approach for Alerts Triage#
In parallel to the Machine Learning model for alert triage, we also generate rules-based priority scores for new alerts in the Rules Based Triage Flow zone. There are two parallel branches in this Flow zone. In the first branch, a prepare recipe applies a predefined formula to the transactions_unlabelled dataset. The full dataset is then sorted on the resulting priority score to place alerts with the highest priority first.
In the second branch, we apply the same predefined formula in a prepare recipe on the alerts_test dataset in order to evaluate the performance of this rule and compare it to the machine learning model. Because we are using a prepare recipe and not a visual ML model to prioritize alerts, we cannot use the evaluate recipe to compute metrics. Instead, we compute the needed aggregates on the scored test alerts using a grouping recipe. Finally, the false positive and false negative rates are computed, and the results are appended each time a new evaluation is executed.
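The evaluation branch can be sketched as follows. The rule in `rule_priority`, its fields (`amount`, `high_risk_country`), and the 0.5 cut-off are invented for the example; the Solution's actual formula is defined in its prepare recipe and is not reproduced here.

```python
from collections import Counter


def rule_priority(alert):
    """Hypothetical rule: large amounts and high-risk countries raise priority."""
    score = 0.0
    if alert["amount"] > 10_000:
        score += 0.5
    if alert["high_risk_country"]:
        score += 0.5
    return score


def rule_error_rates(alerts, threshold=0.5):
    """Group by (actual, predicted) and derive false positive / negative rates."""
    counts = Counter(
        (a["is_escalated"], int(rule_priority(a) >= threshold)) for a in alerts
    )
    fp, tn = counts[(0, 1)], counts[(0, 0)]
    fn, tp = counts[(1, 0)], counts[(1, 1)]
    return {
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Made-up labelled test alerts.
test_alerts = [
    {"amount": 20_000, "high_risk_country": False, "is_escalated": 1},  # TP
    {"amount": 500, "high_risk_country": False, "is_escalated": 1},     # FN
    {"amount": 500, "high_risk_country": True, "is_escalated": 0},      # FP
    {"amount": 100, "high_risk_country": False, "is_escalated": 0},     # TN
    {"amount": 100, "high_risk_country": False, "is_escalated": 0},     # TN
    {"amount": 50, "high_risk_country": False, "is_escalated": 0},      # TN
]
rule_metrics = rule_error_rates(test_alerts)
```

The `Counter` plays the role of the grouping recipe: it aggregates alerts by actual and predicted class, from which both rates follow directly.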
Long Term Model Performance#
We end with the Model Drift Flow zone, where both the model’s performance and the data are monitored over time to ensure neither has drifted too much. Using the alerts_test dataset, we compute the performance metrics of the model on the last 4 weeks of alerts, which have not been used in model training. As time goes by, the model will become more and more outdated with regard to the updated test data. We use a model evaluation recipe to compute all the standard metrics for evaluating a binary classification model. A prepare recipe removes unnecessary metrics and creates the ones that are most relevant to the business, namely the false positive rate and the false negative rate. The two subsequent recipes pick the oldest and the most recent evaluations and then compute the difference between them for each of these values. We also compare the performance metrics from the rules-based prioritization with those from the ML-based prioritization. A trigger can be created on the output of the model drift performance to retrain the model when it has drifted too much.
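The oldest-versus-latest comparison can be sketched as below. The `evaluated_on` field, the sample values, and the 0.05 tolerance are assumptions for illustration; the Solution does not specify a retraining threshold here.

```python
def metric_drift(evaluations, metrics=("false_positive_rate", "false_negative_rate")):
    """Difference between the most recent and the oldest evaluation, per metric."""
    ordered = sorted(evaluations, key=lambda e: e["evaluated_on"])  # ISO dates sort as text
    oldest, latest = ordered[0], ordered[-1]
    return {m: latest[m] - oldest[m] for m in metrics}


def should_retrain(delta, tolerance=0.05):
    """Hypothetical trigger: retrain when any error rate worsened beyond tolerance."""
    return any(change > tolerance for change in delta.values())


# Made-up evaluation history.
evaluation_history = [
    {"evaluated_on": "2024-02-01", "false_positive_rate": 0.375, "false_negative_rate": 0.5},
    {"evaluated_on": "2024-01-01", "false_positive_rate": 0.25, "false_negative_rate": 0.5},
    {"evaluated_on": "2024-03-01", "false_positive_rate": 0.5, "false_negative_rate": 0.75},
]
delta = metric_drift(evaluation_history)
retrain = should_retrain(delta)
```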
In the upper part of the Flow zone, we also conduct data drift analysis. First, the alerts_validation dataset is created in the same way as in the model. We then leverage the evaluate recipe from the Model Evaluation Store to compute the data drift between the validation dataset used for training at the moment the model was deployed and the most recent validation dataset. At the first iteration, there is no data drift because the two datasets are identical. Then, as data drifts over time, the drift model accuracy increases, meaning that it is getting easier to discriminate between the two datasets. When the drift exceeds 0.5, the user should start worrying about the drift and consider retraining with fresher data.
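The idea behind a drift model can be illustrated with a deliberately tiny domain classifier: a single threshold on one numeric feature that tries to tell the reference sample from the recent one. This is only a conceptual sketch, not how the Model Evaluation Store computes drift; it assumes equally sized samples and scores the classifier on the same data it was fitted on.

```python
def drift_accuracy(reference, recent):
    """Accuracy of the best single-threshold classifier separating two samples.

    0.5 means the samples are indistinguishable; values closer to 1.0 mean
    the recent data is easy to tell apart from the reference data.
    """
    labelled = [(x, 0) for x in reference] + [(x, 1) for x in recent]
    best = 0.5
    for threshold in sorted({x for x, _ in labelled}):
        # Predict "recent" when the value is at or above the threshold.
        correct = sum(1 for x, origin in labelled if (x >= threshold) == (origin == 1))
        accuracy = correct / len(labelled)
        # Also allow the mirrored rule (predict "recent" below the threshold).
        best = max(best, accuracy, 1 - accuracy)
    return best


# Made-up values of one numeric feature.
no_drift = drift_accuracy([1, 2, 3, 4], [1, 2, 3, 4])
shifted = drift_accuracy([1, 2, 3, 4], [3, 4, 5, 6])
```

Identical samples give an accuracy of 0.5, matching the "no drift at the first iteration" behaviour described above, while the shifted sample is easier to separate.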
Enable Compliance Teams with Pre-Built Explainability Dashboards#
A key component of the AML process is the ability of Compliance Teams to understand precisely how the models they are using behave and ensure the models won’t break under certain conditions. In the pre-built project dashboard there are several components that can be used to investigate the trained model.
In the Global Explanation tab of the dashboard, we can analyze our model via feature importance analysis and partial dependence plots. If there are surprising results in the importance or dependence values for certain variables, then investigation into the model is warranted. Giving Compliance Teams direct access to these graphs empowers them to use their invaluable knowledge to further evaluate model performance.
The Individual Explanations tab allows us to dig a bit deeper into the model performance. Users of the Dashboard can interactively change model features and see the output from the model. Additionally, users can see how probability and influential features are impacted by their changes to the model feature values. A final view in this dashboard focused on individual explanations can be used to detect patterns in how the top and bottom predictions have been made.
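For readers unfamiliar with partial dependence, the following sketch shows the underlying computation on a toy scoring function. `toy_model`, its coefficients, and the `amount` field are invented stand-ins; the dashboard computes these plots from the deployed model.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value for every row."""
    curve = []
    for value in grid:
        predictions = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


def toy_model(row):
    """Hypothetical stand-in for the trained classifier's probability output."""
    score = 0.05 + 0.6 * row["orig_is_escalated_avg"]
    if row["amount"] > 10_000:
        score += 0.3
    return score


# Made-up rows; only the feature under study is overwritten by the grid.
sample_rows = [
    {"orig_is_escalated_avg": 0.2, "amount": 500},
    {"orig_is_escalated_avg": 0.7, "amount": 20_000},
]
pd_curve = partial_dependence(toy_model, sample_rows, "orig_is_escalated_avg", [0.0, 0.5, 1.0])
```

A curve that rises with orig_is_escalated_avg is what a reviewer would expect here; a flat or decreasing curve would be the kind of surprising result that warrants investigation.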
As previously mentioned, model and data drift should be continuously monitored. This can be done using auto-trigger scenarios for re-training, the Drift tab of the dashboard (where the model is evaluated each time new data comes in through the Model Evaluation Store), and the results view of our saved model.
Reproducing these Processes With Minimal Effort For Your Own Data#
The intent of this project is to enable compliance teams to understand how Dataiku can be used to supplement their existing AML processes by prioritizing alerts. By creating a single solution that can benefit and influence the decisions of a variety of teams in one organization, smarter and more holistic strategies can be designed to prioritize investigations, avoid additional regulatory burden, and provide insights for reviewing business rules.
We’ve provided several suggestions on how to use alert data to classify and triage alerts but ultimately the “best” approach will depend on your specific needs and your data. If you’re interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.