Tutorial | Data quality

Data quality rules allow you to ensure that your data meets quality standards, and to view data quality information at the dataset, project, and instance levels. Metrics offer a quick way to measure characteristics of your dataset, such as record counts and column statistics.

Let’s see how these work!

Objectives

In this tutorial you will:

  • Create metrics on a dataset.

  • Set up data quality rules on a dataset.

  • Monitor data quality at the dataset, project, and instance levels.

Prerequisites

To reproduce the steps in this tutorial, you’ll need:

  • Access to an instance of Dataiku 12.6.

  • Basic knowledge of Dataiku (Core Designer level or equivalent).

You may also want to review this tutorial’s associated concept article.

Create the project

  1. From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Data Quality.

  2. From the project homepage, click Go to Flow.

Note

You can also download the starter project from this website and import it as a zip file.

Compute and view metrics

Our dataset of interest is tx_prepared. Let’s start by computing the default metrics already available for this dataset.

  1. Open the tx_prepared dataset, and navigate to the Metrics tab.

  2. Click Compute all to calculate the default metrics such as column count and record count.

Dataiku screenshot of the metrics subtab of a dataset.
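
If you prefer to script this step, you can trigger the same computation with Dataiku’s Python API. Below is a minimal sketch, assuming it runs in a notebook on the same instance; the two metric IDs shown are the usual built-in ones, but verify them on your instance.

    import dataiku

    # Sketch: trigger and read the default metrics programmatically.
    client = dataiku.api_client()
    dataset = client.get_default_project().get_dataset("tx_prepared")

    # Same effect as clicking "Compute all" in the Metrics tab.
    dataset.compute_metrics()

    # Read back the most recent values (metric IDs assumed here).
    metrics = dataset.get_last_metric_values()
    print("Records:", metrics.get_global_value("records:COUNT_RECORDS"))
    print("Columns:", metrics.get_global_value("basic:COUNT_COLUMNS"))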

Create custom metrics

These default metrics give us a quick overview, but we can also set up our own metrics for more insight into the dataset’s values. For instance, we would expect the values of the authorized_flag column to always be 0 or 1. We can create custom min and max metrics on this column.

  1. Click on Edit metrics.

  2. Toggle On the Columns statistics section.

  3. Select Min and Max next to authorized_flag.

  4. Click on Click to run this now, then Run.

  5. After the metric has run, click Last run results to view the results.

Dataiku screenshot of the edit metrics page of a dataset.
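
As a quick sanity check of what these metrics will report, you could compute the same Min and Max by hand in a notebook. A minimal sketch with pandas, assuming tx_prepared is readable from your notebook:

    import dataiku

    # Sketch: compute the same Min/Max directly with pandas.
    df = dataiku.Dataset("tx_prepared").get_dataframe(columns=["authorized_flag"])
    print("min:", df["authorized_flag"].min())  # expected: 0
    print("max:", df["authorized_flag"].max())  # expected: 1
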
Optional: Custom Python metric

In addition to the out-of-the-box metric options, you can also create custom metrics with Python. For example, we might want to build a metric from two columns at once to track the most and least authorized item categories.

  1. Scroll to the bottom of the Edit Metrics page.

  2. Click New Python Probe.

  3. Toggle it On, and click the pencil icon to rename it Most/least authorized categories.

  4. Copy-paste the following code block:

    # Define here a function that returns the metric.
    def process(dataset, partition_id):
        # dataset is a dataiku.Dataset object
        df = dataset.get_dataframe()
        # Treat missing flags as unauthorized (0).
        df['authorized_flag'] = df['authorized_flag'].fillna(0)
        # Mean of the 0/1 flag per category = share of authorized transactions.
        df_per_category = (df
                           .groupby('item_category')
                           .agg({'authorized_flag': 'mean'})
                           .sort_values('authorized_flag', ascending=False)
                           )

        most_authorized_category = df_per_category.index[0]
        least_authorized_category = df_per_category.index[-1]
        return {'Most authorized item category': most_authorized_category,
                'Least authorized item category': least_authorized_category}
    
  5. Click on Click to run this now and then Run to test out these new metrics.

  6. Click to view the Last run results.

  7. Click Save.
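
Note that each key of the dictionary returned by process() becomes its own metric value in the results, so a single Python probe can report several metrics at once.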

Display metrics

The next step is to display any new metrics you have created.

  1. Click Save in the top right.

  2. Click the back button next to Edit Metrics to navigate back to the metrics view screen.

  3. Click X/Y Metrics.

  4. Click Add all next to “Metrics available”.

  5. Click Save.

Dataiku screenshot of the dialog for displaying metrics.

Create data quality rules

While metrics are useful for ad-hoc checks on your data as you work on a project, you can also use data quality rules to perform systematic data quality checks across datasets, projects, and even your entire Dataiku instance.

The Data Quality tab contains an array of built-in rule types, and we can also create custom rules using code.

Record count in range

Let’s assume we expect our dataset to contain at least 50,000 records. We can use a built-in data quality rule to check whether the record count meets this criterion, and return an Error if it doesn’t. (A rough Python equivalent of the check is sketched after the steps below.)

Let’s purposely create a rule that will fail.

  1. Navigate to the Data Quality subtab of the tx_prepared dataset.

  2. Click Edit rules.

  3. Select Record count in range. You can use the search box to speed up this process.

  4. Set the minimum to 50000.

  5. Note the rule name that has been auto-generated. You can click on the pencil icon to overwrite it if you choose.

  6. Click Run Test to test it, and confirm the error.

Dataiku screenshot of a check for a metric in a range of values.
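
For reference, the comparison this rule performs is roughly equivalent to the following sketch (not Dataiku’s implementation; it assumes the record count metric has already been computed):

    import dataiku

    # Sketch: the rule's comparison, done by hand via the Python API.
    dataset = dataiku.api_client().get_default_project().get_dataset("tx_prepared")
    metrics = dataset.get_last_metric_values()
    record_count = int(metrics.get_global_value("records:COUNT_RECORDS"))

    # With a minimum bound of 50,000, a smaller count yields an Error.
    print("OK" if record_count >= 50000 else "ERROR")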

Column min in range

Using a different rule type, we can also check whether a column’s minimum value falls within a certain range or equals an exact value. Since we know that the minimum value of authorized_flag must be 0, we can create a rule whose Min and Max bounds are both set to 0. If the column’s minimum falls within that range, it is exactly 0 and the rule returns OK; otherwise, it returns an Error. (See the sketch after these steps for the underlying logic.)

  1. In the left panel, click + New Rule.

  2. Select Column min in range.

  3. Select authorized_flag as the Column.

  4. Add 0 as the Min and toggle this check On.

  5. Do the same for Max.

  6. Click Run Test and confirm that the rule returns OK.

Dataiku screenshot of a data quality rule for a minimum value in a column.
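
In plain Python, the rule’s logic amounts to a simple bounds check. A sketch of the idea (not Dataiku’s code):

    # Sketch: with Min and Max bounds both set to 0, only a column
    # minimum of exactly 0 passes the check.
    def column_min_in_range(col_min, low=0, high=0):
        return "OK" if low <= col_min <= high else "ERROR"

    print(column_min_in_range(0))   # OK
    print(column_min_in_range(-1))  # ERROR
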
Optional: Custom Python rule

We can also create custom rules with Python code. For example, we could create a rule to check that the difference between the maximum and minimum values of authorized_flag is 1.

  1. In the left panel, click on + New Rule.

  2. Select Python code from the available rule types.

  3. Click on the pencil icon at the top of the screen to name the rule Expected target values. Note that rule names are not auto-generated for Python rules.

  4. Copy-paste the below code block.

  5. Click Run test to confirm that it returns OK, and then click Save.

# Define here a function that returns the outcome of the rule check.
def process(last_values, dataset, partition_id):
    # last_values is a dict of the last values of the metrics,
    # with the values as a dataiku.metrics.MetricDataPoint.
    # dataset is a dataiku.Dataset object
    min_flag = int(last_values['col_stats:MIN:authorized_flag'].get_value())
    max_flag = int(last_values['col_stats:MAX:authorized_flag'].get_value())

    if max_flag - min_flag == 1:
        return 'OK', 'Expected target values'
    else:
        return 'ERROR', 'Unexpected target values'
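
Note that this rule reads the last computed values of the Min and Max metrics created earlier, so those metrics must have been computed at least once before the rule can evaluate.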

View data quality

We’re now ready to compute the status of the data quality rules and view the results.

  1. Click on the back arrow in the top left to return to the Current Status view.

  2. Click Compute All.

  3. Take note of the Monitoring flag. Monitoring is toggled on by default as soon as you create rules on a dataset. Monitored datasets are included in the project and instance level data quality views. If you don’t want a dataset to be included in these views, you can toggle the flag off.

  4. Take note of the current Dataset status and status breakdown. The current Dataset status is the worst current status among all enabled rules (sketched after these steps). Since one rule returned an error, the current status is Error.

  5. Click through the subtabs Current Status, Timeline, and Rule History to view the results in different displays.

Dataiku screenshot of a data quality view for a dataset.
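
Conceptually, the roll-up takes the highest-severity status across rules. A tiny sketch of that logic (statuses such as Empty are left out for simplicity):

    # Sketch: a dataset's status is the worst status among its enabled rules.
    SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}

    def dataset_status(rule_statuses):
        return max(rule_statuses, key=SEVERITY.__getitem__)

    print(dataset_status(["OK", "OK", "ERROR"]))  # ERROR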

This view shows data quality at the dataset level, and we can also get a broader view of data quality across the entire project.

  1. In the Jobs menu, select Data Quality (or press g + q).

  2. Take note of the current Project status. The project status is Error because the only monitored dataset, tx_prepared, is currently returning an Error.

  3. Click the blue Clear filters button for a look at all datasets in the project, regardless of whether they are monitored or not.

  4. Click through the subtabs Current Status and Timeline to see the different views for project data quality.

Dataiku screenshot of a data quality view for a project.

You can experiment with setting up data quality rules on other datasets and check back to see how the status breakdown changes in this view! Also check out the instance data quality view in the Applications menu > Data Quality to see an overview of data quality for all projects you have access to on your Dataiku instance.