Concept | Data quality rules#

Important

Data quality rules and views are available in Dataiku versions 12.6 and above. For versions 12.5 and below, use metrics and checks to verify data quality.

Data quality is critical to ensuring that any analytics or machine learning project is reliable.

In Dataiku, you can create rules to monitor the data quality of datasets. You and your collaborators can also view data quality information at the dataset, project, and instance levels.

Setting up rules#

In the Data Quality tab of a dataset, you can choose from a menu of rule types, then set the relevant column(s) and parameters. Some available rule types include:

  • The sum, median, average, min, max, or standard deviation of a column is within a certain range.

  • Column values are empty, not empty, or unique.

  • The record count of a dataset is within a certain range.

  • Column values are valid according to their meaning.
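As a rough illustration of what rules like these check, the plain-Python sketch below evaluates a few of them on a small in-memory table. The function names and sample data are hypothetical, and skipping empty values when averaging is an assumption. This is not Dataiku's API: in Dataiku you configure these rules in the Data Quality tab.

```python
from statistics import mean

# Hypothetical sample data: each dict is one record of a dataset.
rows = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 2, "price": 25.5},
    {"order_id": 3, "price": None},
]

def column_values(rows, column):
    """Return the non-empty values of a column."""
    return [r[column] for r in rows if r[column] is not None]

def avg_in_range(rows, column, low, high):
    """Check that the column average is within [low, high].

    Assumption: empty values are skipped when computing the average.
    """
    return low <= mean(column_values(rows, column)) <= high

def values_not_empty(rows, column):
    """Check that every record has a value in the column."""
    return all(r[column] is not None for r in rows)

def values_unique(rows, column):
    """Check that no non-empty value appears more than once."""
    values = column_values(rows, column)
    return len(values) == len(set(values))

def record_count_in_range(rows, low, high):
    """Check that the number of records is within [low, high]."""
    return low <= len(rows) <= high
```

Here, `values_not_empty(rows, "price")` fails because the third record has no price, while `values_unique(rows, "order_id")` passes.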

The edit data quality rules page on a dataset.

For many types of rules, you can select one or multiple columns for the rule to check.

Menu to add one or many columns to a data quality rule.

Outputs#

Data quality rules will return one of four outputs:

| Output | Meaning |
|---|---|
| OK | The rule outcome satisfied the set condition. |
| Error | The rule condition is not respected or the computation itself failed. |
| Warning | The rule fails a soft condition but not a hard one. |
| Empty | The rule cannot be computed. |

Also note that if the rule has not yet been computed, the status will be “Not Computed”.
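To make the difference between soft and hard conditions concrete, here is a minimal sketch of how a range rule could map an observed value to one of the outputs above. The function and parameter names are hypothetical, and treating a missing value as Empty is an assumption made for illustration. It does not reproduce Dataiku's internal logic.

```python
def evaluate_range_rule(value, hard_min=None, soft_min=None,
                        soft_max=None, hard_max=None):
    """Map an observed value to a rule output.

    Hard bounds produce "Error", soft bounds produce "Warning".
    A bound set to None is not checked.
    """
    # Assumption: no observed value means the rule cannot be computed.
    if value is None:
        return "Empty"
    # Hard condition violated.
    if hard_min is not None and value < hard_min:
        return "Error"
    if hard_max is not None and value > hard_max:
        return "Error"
    # Soft condition violated, hard condition still respected.
    if soft_min is not None and value < soft_min:
        return "Warning"
    if soft_max is not None and value > soft_max:
        return "Warning"
    return "OK"
```

For example, with hard bounds of 0 and 100 and soft bounds of 10 and 90, a value of 95 breaks only the soft maximum and yields a Warning, while 120 breaks the hard maximum and yields an Error.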

By default, data quality rules run every time a dataset is built. You can turn this behavior off for an individual rule by clearing its “Auto-run after build” checkbox.

Rule templates#

To save time when creating similar rules on multiple datasets, you can save a set of data quality rules as a template that you or your colleagues can then add to any other dataset on your instance. You can create a template from the Edit Data Quality Rules page, and the template will save all rules created on that dataset.

When you use rules from a template on another dataset, you can then add, edit, or delete the rules to customize them for that dataset. The rules will apply to columns with the same name in the target dataset. If no column with that name exists, you can edit the rule to apply to the correct column.
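The name-based matching described above can be pictured with the following sketch, which splits template rules into those that find their column in the target dataset and those that would need editing. The rule dictionaries and the `apply_template` helper are hypothetical and only illustrate the idea. They are not how Dataiku stores or applies templates.

```python
def apply_template(template_rules, target_columns):
    """Split template rules by whether their column exists in the target.

    Rules without a column (for example, a record count rule) always apply.
    """
    applied, needs_edit = [], []
    for rule in template_rules:
        column = rule.get("column")
        if column is None or column in target_columns:
            applied.append(rule)
        else:
            # No column with that name: the rule must be edited by hand.
            needs_edit.append(rule)
    return applied, needs_edit

# Hypothetical template and target dataset schema.
template_rules = [
    {"type": "column_min_in_range", "column": "price"},
    {"type": "values_not_empty", "column": "customer_id"},
    {"type": "record_count_in_range", "column": None},
]
target_columns = {"price", "client_id"}

applied, needs_edit = apply_template(template_rules, target_columns)
```

In this example, the rule on `customer_id` ends up in `needs_edit` because the target dataset names that column `client_id`.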

Screenshot showing buttons to save data quality templates and add rules from a template.

Tip

You can also copy one or more data quality rules and paste them into other datasets manually, without templates. Select the rule(s) you want to copy, go to Actions > Copy, then go to Actions > Paste in the other dataset. Alternatively, use the keyboard shortcuts for copy (Ctrl + C) and paste (Ctrl + V).

Note

You can create custom data quality rules with Python code. See the reference documentation to learn more.

Rules and metrics#

When you compute a data quality rule, Dataiku will also automatically calculate the relevant metric, where applicable. For example, computing a column minimum in range rule on a “price” column will create a “Min of price” metric. It will also create a metric for the rule itself.

You can also set up data quality rules directly on top of existing metrics, using the Metric value in range or Metric value in set rule types. For example, you could use one of the metric value rule types to check the value of an SQL query probe metric.

Note

All existing checks will be migrated to metric rules when upgrading to version 12.6 or later.

Data quality views#

You can view data quality status information at the dataset, project, and instance levels.

Any dataset with at least one data quality rule becomes a monitored dataset by default. Monitored datasets are included in the project and instance level data quality views. You can toggle monitoring off if you don’t want to include a certain dataset in the broader views.

Dataset view#

At the dataset level, find the views in the Data Quality tab of each dataset. This tab provides three views:

| View | Content |
|---|---|
| Current Status | Shows the status of each rule as of the latest run. Here, you can also view a status breakdown showing the count of rules returning each status. You can also compute all enabled rules and perform several mass actions from this view. |
| Timeline | Shows a daily summary of data quality rules, including the status of the dataset overall and details from the last run as well as the worst daily status of each rule. |
| Rule History | Includes a searchable table listing the run history of all rules, with the associated status, observed value, run date, and run origin. |

The current status data quality view of a dataset with four rules set.

Project view#

At the project level, you can view data quality information by going to Jobs menu > Data Quality or by pressing g + q on your keyboard.

The Project Data Quality page shows the current status and a timeline of daily statuses for monitored datasets within the project. The current and daily statuses are determined by the worst status of all monitored datasets, meaning that all enabled rules must pass on every monitored dataset for the current project status to show OK.
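The "worst status" aggregation can be sketched as follows. The data structure and helper names are hypothetical, and the sketch only ranks OK, Warning, and Error. How an Empty status ranks is not stated here, so it is left out rather than guessed.

```python
# Severity ranking used for this illustration only.
SEVERITY = {"OK": 0, "Warning": 1, "Error": 2}

def worst_status(statuses):
    """Return the most severe status, or None if there are none."""
    statuses = list(statuses)
    if not statuses:
        return None
    return max(statuses, key=lambda s: SEVERITY[s])

def project_status(datasets):
    """Aggregate the status of monitored datasets only."""
    return worst_status(d["status"] for d in datasets if d["monitored"])

# Hypothetical project with three datasets, one of them unmonitored.
datasets = [
    {"name": "orders", "status": "OK", "monitored": True},
    {"name": "customers", "status": "Warning", "monitored": True},
    {"name": "scratch", "status": "Error", "monitored": False},
]
```

With this data, `project_status(datasets)` returns `"Warning"`: the unmonitored `scratch` dataset is ignored, and a single Warning on `customers` is enough to keep the project from showing OK.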

The project view also includes a table of the current status for each dataset in that project. This view is filtered for monitored datasets only by default. Click on a dataset name to navigate to the Data Quality tab within that dataset for a detailed look at the rules.

Data quality views at the project level.

Instance view#

Moving up another level, you can view data quality for monitored projects on your Dataiku instance under Applications menu > Data Quality. This view shows a status breakdown for all projects that you have access to on the instance and that contain at least one monitored dataset.

Data quality views at the instance level.

What’s next?#

To learn more about data quality rules, metrics, and checks, including through hands-on exercises, please register for the free Academy course on this subject found in the Advanced Designer learning path.

You can also learn more about metrics, checks, and data quality rules in the reference documentation.