Tutorial | Data quality#

Data quality rules allow you to ensure that your data meets quality standards, and to view data quality information at the dataset, project, and instance levels. Metrics are a quick way to check metadata on your dataset.

Let’s see how these work!

Get started#

Objectives#

In this tutorial you will:

  • Create metrics on a dataset.

  • Set up data quality rules on a dataset.

  • Monitor data quality at the dataset, project, and instance levels.

Prerequisites#

To reproduce the steps in this tutorial, you’ll need:

  • Dataiku 12.6 or later.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

Create the project#

  1. From the Dataiku Design homepage, click + New Project.

  2. Select Learning projects.

  3. Search for and select Data Quality.

  4. Click Install.

  5. From the project homepage, click Go to Flow (or g + f).

Note

You can also download the starter project from this website and import it as a zip file.

Next, build the Flow.

  1. Click Flow Actions at the bottom right of the Flow.

  2. Click Build all.

  3. Keep the default settings and click Build.

Use case summary#

The project has three data sources:

  • tx: Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. The authorized_flag column indicates whether the transaction was:

    • Authorized (a value of 1)

    • Flagged for potential fraud (a value of 0)

  • merchants: Each row is a unique merchant with information such as the merchant’s location and category.

  • cards: Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US).

Compute and view metrics#

Our dataset of interest is tx_prepared. Let’s start by computing the default metrics already available for this dataset.

  1. Open the tx_prepared dataset.

  2. Navigate to the Metrics tab.

  3. Click Compute All to calculate the default metrics such as column count and record count.

Dataiku screenshot of the metrics subtab of a dataset.
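To make the two default metrics concrete, here is a minimal plain-Python sketch of what record count and column count measure. It does not call Dataiku's API. The `rows` sample and the `basic_metrics` helper are hypothetical names introduced only for illustration.

```python
# Illustrative only: hypothetical sample rows, not the real tx_prepared data.
rows = [
    {"authorized_flag": 1, "purchase_amount": 12.5},
    {"authorized_flag": 0, "purchase_amount": 80.0},
    {"authorized_flag": 1, "purchase_amount": 3.2},
]


def basic_metrics(rows):
    """Return the record count and column count of a list of row dicts."""
    columns = set()
    for row in rows:
        columns.update(row.keys())
    return {"record_count": len(rows), "column_count": len(columns)}


print(basic_metrics(rows))  # {'record_count': 3, 'column_count': 2}
```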

Create custom metrics#

In addition to these default metrics, we can also natively configure many more kinds of metrics on properties such as the dataset’s statistics, most frequent values, percentiles, and meaning validity.

For instance, we would expect the values of the authorized_flag column to always be 0 or 1. We can create custom min and max metrics on the column to verify this expectation.

  1. Click Edit Metrics.

  2. Toggle On the Columns statistics section.

  3. For authorized_flag, select Min and Max.

  4. Click on “Click to run this now”, then Run.

  5. After the metric has been computed, click Last run results to view the results.

  6. Click Save in the top right to preserve the new metrics.

Dataiku screenshot of the edit metrics page of a dataset.
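Conceptually, the Min and Max metrics reduce a column to its two extreme values. The sketch below shows that idea in plain Python. It is an assumption-laden illustration, not Dataiku's implementation: `column_min_max` and the `sample` rows are hypothetical, and skipping missing values is an assumption made for the example.

```python
def column_min_max(rows, column):
    """Compute the min and max of one column, skipping missing values."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return {"min": None, "max": None}
    return {"min": min(values), "max": max(values)}


# Hypothetical sample; only the column name comes from the tutorial.
sample = [
    {"authorized_flag": 1},
    {"authorized_flag": 0},
    {"authorized_flag": None},
    {"authorized_flag": 1},
]

stats = column_min_max(sample, "authorized_flag")
# For an integer flag, min >= 0 and max <= 1 means every value is 0 or 1.
flag_within_bounds = stats["min"] >= 0 and stats["max"] <= 1
print(stats, flag_within_bounds)  # {'min': 0, 'max': 1} True
```

Note that Min and Max only bound the values. For an integer column that is enough to confirm the 0-or-1 expectation, but a decimal column could still hold values such as 0.5 between the bounds.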

Display metrics#

The next step is to display any new metrics you have created.

  1. Click the back button next to Edit Metrics to navigate back to the metrics view screen.

  2. Click X/Y Metrics to view all available metrics.

  3. Click Add all next to “Metrics available”.

  4. Click Save to include these metrics in the display.

Dataiku screenshot of the dialog for displaying metrics.

Create data quality rules#

While metrics are useful to perform ad-hoc checks on your data as you work on a project, you can also use data quality rules to perform systemic data quality checks across datasets, projects, and even your entire Dataiku instance.

The Data Quality tab contains a wide selection of built-in rule types:

  • Many types do not require you to explicitly create a metric as done above (e.g. column count in range).

  • Other types are built directly on top of an explicitly-created metric (e.g. metric value in range/set).

  • We can also create custom rules (or metrics for that matter) using Python code.

Tip

If you’re interested in creating Python metrics and data quality rules, see the Custom Automation Academy course.

Record count in range#

Let’s assume we expect the dataset to usually contain at least 50,000 records, and never fewer than 5,000. We can use a built-in data quality rule to check whether the record count meets these criteria, and return a warning or an error if it doesn’t.

Let’s purposely create a rule that will return a warning.

  1. Navigate to the Data Quality tab of the tx_prepared dataset.

  2. Click Edit Rules.

  3. Select Record count in range as the category. You can use the search box to speed up this process.

  4. Based on our expectations, set the min to 5k (5000) and the soft min to 50k (50000).

  5. Note the rule name that has been auto-generated. You can click on the pencil icon to overwrite it if you choose.

  6. Click Run Test to test it, and confirm the warning.

Dataiku screenshot of a check for a metric in a range of values.
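The relationship between the min and the soft min can be sketched as a small function. This is a plain-Python illustration, not Dataiku's code: the function name, the inclusive treatment of the bounds, and the record counts in the usage lines are assumptions made for the example.

```python
def record_count_in_range(record_count, minimum=None, soft_minimum=None):
    """Classify a record count against a hard min and a soft min.

    Below the hard min       -> "ERROR"
    Below the soft min only  -> "WARNING"
    Otherwise                -> "OK"
    Bounds are treated as inclusive here (an assumption).
    """
    if minimum is not None and record_count < minimum:
        return "ERROR"
    if soft_minimum is not None and record_count < soft_minimum:
        return "WARNING"
    return "OK"


# Hypothetical record counts checked against the thresholds from the steps.
for count in (3_000, 20_000, 60_000):
    print(count, record_count_in_range(count, minimum=5_000, soft_minimum=50_000))
```

With these thresholds, any record count from 5,000 up to (but not including) 50,000 yields the warning that the steps above set out to produce.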

Column min in range#

Using a different rule type, we can also check if a column’s minimum value falls within a certain range or is equal to an exact value.

Since we know that the minimum value for authorized_flag must be 0, we can create a rule to check that the column minimum falls in the range from 0 to 0, that is, that it equals exactly 0.

  1. In the left panel, click + New Rule.

  2. Select Column min in range.

  3. Select authorized_flag as the Column.

  4. Toggle on Min and set it to 0.

  5. Do the same for Max.

  6. Click Run Test to confirm it returns OK.

Dataiku screenshot of a data quality rule for a minimum value in a column.
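Setting both bounds to the same value turns a range rule into an equality check. Here is a minimal sketch of that idea, assuming a hypothetical `column_min_in_range` helper and made-up values. It is not how Dataiku evaluates the rule internally, and returning `None` for an empty column is an assumption.

```python
def column_min_in_range(values, minimum, maximum):
    """Return "OK" if the column minimum lies within [minimum, maximum]."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to evaluate (assumed behavior for this sketch)
    observed_min = min(present)
    return "OK" if minimum <= observed_min <= maximum else "ERROR"


# Hypothetical column values.
print(column_min_in_range([1, 0, 1, 1], minimum=0, maximum=0))  # OK
print(column_min_in_range([1, 1, 1], minimum=0, maximum=0))     # ERROR
```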

View data quality#

We’re now ready to compute the status of the data quality rules and view the results.

  1. Click the back arrow next to Edit Data Quality Rules to return to the Current Status view.

  2. Click Compute All.

  3. Take note of the Monitoring flag.

    • Monitoring is toggled on by default as soon as you create rules on a dataset. Monitored datasets are included in the project-level and instance-level data quality views. If you don’t want a dataset to be included in these views, you can toggle the flag off.

  4. Take note of the current dataset status and status breakdown. The current dataset status displays the most severe current status of all enabled rules. Since one rule returned a warning, the current status is a warning.

  5. Click through the subtabs Current Status, Timeline, and Rule History to view the results in different displays.

Dataiku screenshot of a data quality view for a dataset.
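The "most severe status of all enabled rules" logic can be sketched in a few lines. This is an illustration under stated assumptions: the severity ordering covers only the three outcomes seen in this tutorial, and `dataset_status` and its input shape are hypothetical.

```python
# Assumed ordering: a higher number means a more severe outcome.
SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}


def dataset_status(rule_results):
    """Return the most severe outcome among enabled rules.

    rule_results is a list of (outcome, enabled) tuples.
    Returns None when no enabled rule has an outcome.
    """
    outcomes = [outcome for outcome, enabled in rule_results if enabled]
    if not outcomes:
        return None
    return max(outcomes, key=SEVERITY.get)


# The two rules from this tutorial: one warning, one OK.
print(dataset_status([("WARNING", True), ("OK", True)]))  # WARNING
```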

This view shows data quality at the dataset level, but we can also get a broader view of data quality across the entire project.

  1. In the Jobs (play button icon) menu, select Data Quality (or type g + q).

  2. Take note of the current project status. It shows a warning because there is only one monitored dataset, tx_prepared, which is currently returning a warning.

  3. Click the blue Clear filters button for a look at all datasets in the project, regardless of whether they are monitored or not.

  4. Click through the subtabs Current Status and Timeline to see the different views for project data quality.

Dataiku screenshot of a data quality view for a project.
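The project view can be thought of as an aggregation over monitored datasets only. The sketch below assumes the project status follows the same worst-of rule as the dataset status, which the steps above suggest but do not state. `project_summary` and the dictionary layout are hypothetical.

```python
from collections import Counter

# Assumed ordering: a higher number means a more severe status.
SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}


def project_summary(datasets):
    """Return (overall_status, status_breakdown) for monitored datasets."""
    statuses = [
        info["status"]
        for info in datasets.values()
        if info["monitored"] and info["status"] in SEVERITY
    ]
    overall = max(statuses, key=SEVERITY.get) if statuses else None
    return overall, dict(Counter(statuses))


# Only tx_prepared has rules, so it is the only monitored dataset.
datasets = {
    "tx_prepared": {"monitored": True, "status": "WARNING"},
    "tx_windows": {"monitored": False, "status": None},
}
print(project_summary(datasets))  # ('WARNING', {'WARNING': 1})
```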

Tip

You can experiment with setting up data quality rules on other datasets and check back to see how the status breakdown changes in this view!

Also check out the instance data quality view, available from the waffle menu (waffle icon) > Data Quality, to see an overview of data quality for all projects you have access to on your Dataiku instance.

View data lineage#

As part of monitoring data quality, you can also view data lineage, or track data transformations throughout your pipeline. This can help you diagnose data issues upstream or potential impacts of changes downstream. Let’s view the data lineage of a column!

Important

You must use Dataiku version 13.2+ to complete this section of the tutorial.

  1. Return to the Flow and open the tx_prepared dataset.

  2. Click on the purchase_amount_eur column header.

  3. In the dropdown menu, choose See column lineage.

Steps to view the lineage of a column.

This takes you to the Data Lineage view, which is in the Data Catalog. The column we selected is highlighted in blue. Let’s investigate the lineage of this column.

  1. Follow the blue lines to the left to see that this column was created in a Prepare recipe using the input columns purchase_date and purchase_amount.

  2. Follow the blue lines to the right to see the column is used in one downstream dataset, tx_windows.

  3. Click Change Lineage and choose another column in this dataset to view its lineage and experiment with this view.

Steps to view the lineage of a column.
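Following the blue lines left or right amounts to walking a graph of column dependencies. The sketch below models the lineage described above as edges between (dataset, column) pairs. It is only an illustration: `prepare_input` is a hypothetical stand-in for the Prepare recipe's input dataset, and reusing the name purchase_amount_eur in tx_windows is an assumption.

```python
from collections import deque

# Each edge points from an input column to the column derived from it.
EDGES = [
    (("prepare_input", "purchase_date"), ("tx_prepared", "purchase_amount_eur")),
    (("prepare_input", "purchase_amount"), ("tx_prepared", "purchase_amount_eur")),
    (("tx_prepared", "purchase_amount_eur"), ("tx_windows", "purchase_amount_eur")),
]


def trace(start, edges, direction="upstream"):
    """Breadth-first walk of column lineage in one direction."""
    neighbors = {}
    for source, target in edges:
        key, value = (target, source) if direction == "upstream" else (source, target)
        neighbors.setdefault(key, []).append(value)

    found, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt != start and nxt not in found:
                found.append(nxt)
                queue.append(nxt)
    return found


column = ("tx_prepared", "purchase_amount_eur")
print(trace(column, EDGES, "upstream"))    # the two input columns
print(trace(column, EDGES, "downstream"))  # the tx_windows column
```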

What’s next?#

Congratulations on setting up metrics and data quality rules! On their own, these features can provide deep insight into objects in the Flow.

However, to operationalize these insights, you’ll want to explore scenarios. Learn how to automate your data pipeline with a scenario next.

See also

See the reference documentation to learn more about Data Quality Rules.