Solution | Reconciliation#

Overview#

Business Case#

Organizations across various industries often face reconciliation challenges, particularly when matching and verifying data between multiple datasets. These challenges are frequently managed using Excel spreadsheets, but this approach can be improved and made more robust through the use of Dataiku.

The Reconciliation Solution offers a range of matching capabilities, including perfect join for exact matches, fuzzy join based on customizable criteria, and access to a web application for manual matching.

This combination of automated and manual methods ensures a comprehensive and flexible reconciliation process.

Installation#

The process to install this solution differs depending on whether you are using Dataiku Cloud or a self-managed instance.

Dataiku Cloud users should follow the instructions for installing solutions on cloud.

  1. The Cloud Launchpad will automatically meet the technical requirements listed below, and add the Solution to your Dataiku instance.

  2. Once the Solution has been added to your space, move ahead to Data Requirements.

Technical Requirements#

To leverage this solution, you must meet the following requirement:

  • Have access to a Dataiku 13.2+* instance.

Data Requirements#

The project is initially shipped with all datasets using the filesystem connection.

The input data should be stored into two different datasets:

Dataset

Description

primary

Serves as the reference dataset.

secondary

Observations from this dataset will be matched to the primary dataset.

The solution accepts three types of data: numeric, string, and date.

The corresponding key columns between the two datasets must be of the same type. For example, if you want to match a column containing credit card numbers in both datasets, ensure that the column type is set to string in both cases.

The corresponding key columns between the two datasets do not need to have the same name, although they can if desired.

Workflow Overview#

You can follow along with the solution in the Dataiku gallery.

Dataiku screenshot of the final project Flow showing all Flow Zones.

The project has the following high-level steps:

  1. Prepare your data in the right format following the data requirements.

  2. Use the Project Setup interface to replace the demo data and setup your own version.

  3. Explore your own results in the dashboard.

  4. Review pending matches in the webapp.

Walkthrough#

Note

In addition to reading this document, it is recommended to read the wiki of the project before beginning to get a deeper technical understanding of how this Solution was created and more detailed explanations of Solution-specific vocabulary.

Plug and play with your own data and parameter choices#

The Project Setup allows users to customize parameters and use the solution with their own data through a visual interface.

This can be done in two ways:

  1. Data can be uploaded directly from the filesystem in the first section of the Dataiku app.

  2. Data can be connected to your database of choice by selecting an existing connection.

In both options, users must click the Check button which will load the data and verify the schema.

Dataiku screenshot of the accompanying Project Setup for this solution.

With our data selected and loaded into the Flow, we can move to the following sections:

Section

Description

ID Columns

The user can decide whether to use their own ID column(s) or allow the project to generate them automatically.

Columns Matching

The corresponding columns (referred to as keys) from each dataset to be matched are selected. In the current version of the project, up to five keys can be used.

Extra Columns

Users can choose extra columns (if any) that are not part of the keys, but are still important for displaying in the manual matching webapp and the final results, to help with decision-making.

Automatic Matching

Allows to establish an automatic matching threshold along with a maximum limit for potential matches.

Reconciliation

The last section is where the reconciliation process is launched.

Important

Be sure to refresh the page so that the app can dynamically take your data into account.

Reconciliation Analysis - Before Manual Review#

The Reconciliation Analysis dashboard provides an overview of how records are categorized based on their alignment with the secondary dataset. Records may fall into one of several match types: perfect, automatic, pending, manual, or no match.

On the first run, we can observe that there are no manual matches. This is because all potential manual matches have not yet been flagged manually and are still classified as pending matches.

Dataiku screenshot of the accompanying dashboard for this solution.

Manual Matching#

The manual matching webapp enables the manual approval or rejection of pending matches. It offers the option to review each match individually (focus table) or through a more comprehensive view (full table).

In the Focus Table tab, the ID from the primary dataset can be selected for review (e.g., 04baa4) either by using the dropdown menu or by typing the ID directly. The page will update dynamically based on the selected ID. Then, we can use the dropdown button to flag matches and insert comments if necessary.

Dataiku screenshot of the accompanying webapp for this solution.

In the Full Table tab, users can scroll down to view more matches and click the button in the bottom right to navigate to the next page if needed. If multiple potential matches exist, the observations are grouped together by the primary ID column.

Dataiku screenshot of the accompanying webapp for this solution.

Using one or both tabs, matches can be manually flagged as approved, rejected, or pending. Additionally, users can enter comments to provide more context or information regarding these decisions.

Reconciliation Analysis - After Manual Review#

To update the reconciliation analysis and include the reviews made in the manual matching webapp, click the Update Analysis button, wait a few seconds, and refresh the page.

Dataiku screenshot of the accompanying webapp for this solution.

Following our manual input in the webapp, we can see that some matches are now classified as manual matches. To view details about these matches, we can go to the Manual Matches section.

Dataiku screenshot of the accompanying webapp for this solution.

We can also see if any matches were manually rejected in the No Match section of the dashboard.

Dataiku screenshot of the accompanying webapp for this solution.

After completing the reconciliation process, users can either extract and use the reconciled observations or update the project settings to adjust the matching criteria as needed.

Responsible AI Considerations#

The Reconciliation Solution is designed to facilitate accurate and efficient data matching across datasets. However, it should not be used for critical decision-making without proper oversight and validation.

Reproducing these Processes With Minimal Effort For Your Own Data#

The intent of this project is to users to understand how Dataiku can be used to match and verify data between two distinct datasets.

However, the best approach will ultimately depend on your specific needs and the data of interest. If you are interested in adapting this project to the specific goals and needs of your organization, roll-out and customization services can be offered on demand.