Tutorial | Data quality#
Data quality rules allow you to ensure that your data meets quality standards, and to view data quality information at the dataset, project, and instance levels. Metrics are a quick way to check metadata on your dataset.
Let’s see how these work!
Get started#
Objectives#
In this tutorial you will:
Create metrics on a dataset.
Set up data quality rules on a dataset.
Monitor data quality at the dataset, project, and instance levels.
Prerequisites#
To reproduce the steps in this tutorial, you’ll need:
Access to an instance of Dataiku 12.6.
Basic knowledge of Dataiku (Core Designer level or equivalent).
Create the project#
From the Dataiku Design homepage, click + New Project > DSS tutorials > Advanced Designer > Data Quality.
From the project homepage, click Go to Flow.
Note
You can also download the starter project from this website and import it as a zip file.
You’ll next want to build the Flow.
Click Flow Actions at the bottom right of the Flow.
Click Build all.
Keep the default settings and click Build.
Use case summary#
The project has three data sources:
Dataset | Description |
---|---|
tx | Each row is a unique credit card transaction with information such as the card that was used and the merchant where the transaction was made. It also indicates whether the transaction has either been: |
merchants | Each row is a unique merchant with information such as the merchant’s location and category. |
cards | Each row is a unique credit card ID with information such as the card’s activation month or the cardholder’s FICO score (a common measure of creditworthiness in the US). |
Compute and view metrics#
Our dataset of interest is tx_prepared. Let’s start by computing the default metrics already available for this dataset.
Open the tx_prepared dataset, and navigate to the Metrics tab.
Click Compute all to calculate the default metrics such as column count and record count.
Create custom metrics#
These default metrics give us a quick overview, but we can also set up our own metrics to gain more insight into the dataset values. For instance, we would expect the values of the authorized_flag column to always be 0 or 1. To verify this, we can create custom min and max metrics on the column.
Click on Edit Metrics.
Toggle On the Columns statistics section.
Select Min and Max next to authorized_flag.
Click on Click to run this now, then Run.
After the metric has run, click Last run results to view the results.
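To make the idea behind the Min and Max metrics concrete, here is a minimal sketch in plain Python. It does not use the Dataiku API; the sample values and the helper name `column_min_max` are hypothetical and only illustrate what the two metrics tell us about authorized_flag.

```python
# Hypothetical sample of authorized_flag values (not taken from tx_prepared).
authorized_flag = [1, 0, 1, 1, 0, 1]


def column_min_max(values):
    """Return (min, max) of the non-missing values, or (None, None) if there are none."""
    present = [v for v in values if v is not None]
    if not present:
        return None, None
    return min(present), max(present)


col_min, col_max = column_min_max(authorized_flag)

# For an integer column, a minimum of at least 0 and a maximum of at most 1
# means every value is either 0 or 1.
flag_is_binary = col_min >= 0 and col_max <= 1
```

If either bound fell outside 0 and 1, that would signal unexpected values in the column without having to scan the rows by eye.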
Display metrics#
The next step is to display any new metrics you have created.
Click Save in the top right.
Click the back button next to Edit Metrics to navigate back to the metrics view screen.
Click X/Y Metrics.
Click Add all next to “Metrics available”.
Click Save.
Create data quality rules#
While metrics are useful for ad-hoc checks on your data as you work on a project, you can also use data quality rules to perform systematic data quality checks across datasets, projects, and even your entire Dataiku instance.
The Data Quality tab contains an array of built-in rule types, and we can also create custom rules using code.
Record count in range#
Let’s assume we expect our dataset to contain at least 50,000 records. We can use a built-in data quality rule to check whether the record count meets this criterion, and return an Error if it doesn’t.
Let’s purposely create a rule that will fail.
Navigate to the Data Quality subtab of the tx_prepared dataset.
Click Edit rules.
Select Record count in range. You can use the search box to speed up this process.
Set the minimum to 50000.
Note the rule name that has been auto-generated. You can click on the pencil icon to overwrite it if you choose.
Click Run Test to test it, and confirm the error.
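The logic of a record count range check can be sketched as follows. This is an illustration in plain Python rather than Dataiku's implementation: the function name, the inclusive bounds, and the sample record count are all assumptions.

```python
from typing import Optional


def record_count_in_range(count: int,
                          minimum: Optional[int] = None,
                          maximum: Optional[int] = None) -> str:
    """Return "OK" if count lies within the enabled bounds (assumed inclusive), else "ERROR"."""
    if minimum is not None and count < minimum:
        return "ERROR"
    if maximum is not None and count > maximum:
        return "ERROR"
    return "OK"


# Hypothetical record count, chosen to be below the 50,000 minimum.
sample_count = 12_000
status = record_count_in_range(sample_count, minimum=50_000)
```

With only a minimum set, any count below 50,000 yields an error, which is the outcome this step of the tutorial deliberately provokes.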
Column min in range#
Using a different rule type, we can also check whether a column’s minimum value falls within a certain range or equals an exact value. Since we know that the minimum value for authorized_flag must be 0, we can create a rule whose range has both its Min and Max bounds set to 0. If the column’s minimum falls within this range, it is exactly 0 and the rule will return OK. If not, the rule will return an Error.
In the left panel, click + New Rule.
Select Column min in range.
Select authorized_flag as the Column.
Add 0 as the Min and toggle this check On.
Do the same for Max.
Click Run Test to confirm that it returns “OK”.
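Setting both bounds of the range to 0 turns the range check into an equality check on the column’s minimum. The sketch below shows this in plain Python; the function name and sample values are hypothetical, and it assumes the column contains at least one non-missing value.

```python
def column_min_in_range(values, lower, upper):
    """Return "OK" if the minimum non-missing value lies in [lower, upper], else "ERROR".

    Assumes at least one non-missing value is present.
    """
    present = [v for v in values if v is not None]
    return "OK" if lower <= min(present) <= upper else "ERROR"


# Hypothetical authorized_flag values containing at least one 0.
authorized_flag = [1, 0, 1, 1, 0, 1]

# Lower and upper bounds both set to 0: the minimum must be exactly 0.
status = column_min_in_range(authorized_flag, lower=0, upper=0)
```

A column made up only of 1s would have a minimum of 1 and fail the same check.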
View data quality#
We’re now ready to compute the status of the data quality rules and view the results.
Click on the back arrow in the top left to return to the Current Status view.
Click Compute All.
Take note of the Monitoring flag. Monitoring is toggled on by default as soon as you create rules on a dataset. Monitored datasets are included in the project and instance level data quality views. If you don’t want a dataset to be included in these views, you can toggle the flag off.
Take note of the current Dataset status and status breakdown. The current Dataset status is the worst current status of all enabled rules. Since one rule returned an error, the current status is Error.
Click through the subtabs Current Status, Timeline, and Rule History to view the results in different displays.
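The "worst current status of all enabled rules" logic described above can be sketched as follows. The severity ordering, the WARNING level, and the rule records are assumptions made for illustration; only the OK and Error outcomes appear in this tutorial.

```python
# Assumed severity ordering: a higher number means a worse status.
SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}

# Hypothetical records for the two rules created in this tutorial.
rules = [
    {"name": "Record count in range", "enabled": True, "status": "ERROR"},
    {"name": "Column min in range", "enabled": True, "status": "OK"},
]


def dataset_status(rules):
    """Return the worst status among enabled rules, or None if no rule is enabled."""
    statuses = [r["status"] for r in rules if r["enabled"]]
    if not statuses:
        return None
    return max(statuses, key=SEVERITY.__getitem__)


current_status = dataset_status(rules)
```

Because a single failing rule outweighs any number of passing ones, the dataset stays in Error until the record count rule passes or is disabled.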
This view shows data quality at the dataset level, and we can also get a broader view of data quality across the entire project.
In the Jobs menu, select Data Quality (or press g + q).
Take note of the current Project status. The current project status is Error because there is only one monitored dataset, tx_prepared, which is currently returning an Error.
Click the blue Clear filters button for a look at all datasets in the project, regardless of whether they are monitored or not.
Click through the subtabs Current Status and Timeline to see the different views for project data quality.
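As a rough illustration of how the Monitoring flag shapes the project view, the sketch below keeps only monitored datasets, counts their statuses, and reports the worst one. It assumes the project status is the worst status among monitored datasets, which is consistent with the result above but is not confirmed here; the data structure and function name are hypothetical.

```python
from collections import Counter

# Assumed severity ordering: a higher number means a worse status.
SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}

# Only tx_prepared has rules, so it is the only monitored dataset.
datasets = [
    {"name": "tx_prepared", "monitored": True, "status": "ERROR"},
    {"name": "tx", "monitored": False, "status": None},
    {"name": "merchants", "monitored": False, "status": None},
    {"name": "cards", "monitored": False, "status": None},
]


def project_overview(datasets):
    """Return (worst status, status breakdown) over monitored datasets with a status."""
    monitored = [d for d in datasets if d["monitored"] and d["status"] is not None]
    breakdown = Counter(d["status"] for d in monitored)
    worst = max(breakdown, key=SEVERITY.__getitem__) if breakdown else None
    return worst, dict(breakdown)


project_status, breakdown = project_overview(datasets)
```

Toggling monitoring off for tx_prepared in this sketch would leave the project with no monitored datasets and therefore no status to report.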
You can experiment with setting up data quality rules on other datasets and check back to see how the status breakdown changes in this view! Also check out the instance data quality view in the Applications menu > Data Quality to see an overview of data quality for all projects you have access to on your Dataiku instance.
What’s next?#
Congratulations on setting up metrics and data quality rules!
Next, you may want to learn how to create scenarios to automate your data pipeline.
See also
See the reference documentation to learn more about Data Quality Rules.