To make the data quality analytics most effective and profitable, it is critical to check if the data meets the necessary conditions for further usability.
If the data meets those conditions, it is considered valid, and the process to validate the data is called data validity check.
It’s easy to make a mistake while entering data by hand; simply enter a wrong digit in the box for the ID card number, one extra figure in the age field, or a number in the field for text-type data. Data provided by offices or administrative units is frequently manually entered into excel type applications, where this kind of issue is common.
As a result, data gets inserted improperly and often it is illogical; think of a person’s age. Values exceeding 110, for example, are doubtful, and those under the age of 18 are subject to various legal constraints that must be considered. To uncover problems of this nature, tests targeted at determining the data’s validity must be performed, such as the data range validity test in the example presented. The values in the age column are presumed to be valid, if they fit in the range between 18 and 110.
To make the data quality analytics most effective and profitable, it is critical to check if the data meets the necessary conditions for further useability. If the data meets those conditions, it is considered valid, and the process to validate the data is called data validity check.
In this article, we will describe what validity checks are, when they are used, and we will create a validity check on numerical data.
Table of Contents
Types of validity checks
There are many types of validity checks, depending on the type of data we are dealing with. The most obvious validity check verifies the type of the data and its format.
Apart from that, when dealing with a specific type of data, e.g. numeric data, there is a vast range of different validity tests possible: checking if data is in a certain range, or whether its values are non-negative. Performing this type of check has a key significance on data quality management.
The validity dimension corresponds to certain rules the data must fit in.
Of course, the configuration of a table in a database requires a valid format in most cases. In other words, a column set as an integer can not store any other types of data.
The different situation is when dealing with VARCHAR type columns, where characters, digits, and special characters can occur. This leads to multiple problems such as:
In the example, we will use Productivity Prediction of Garment Employees Data Set, which can be found here.
Below are the top 10 rows from the table.
Example of a validity check - problem statement
In this example, we will perform a validity check that inspects the range of numeric values in the “actual productivity” column. The data type of this column is FLOAT, and the values should be between 0 and 1. The values within this range are considered valid, and the rest – invalid.
We will write a query to calculate the number of valid records per day and find out the percentage of valid rows.
Technologies we use
In this case study, we will use BigQuery – a serverless data warehouse hosted on the Google Cloud Platform, which nowadays is one of the most frequently used tools for dealing with big datasets, especially in the cloud environment. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform.In BigQuery when we refer to a table, the syntax is as follows (keep in mind this part):
Example of a validity check - solution
First, create a query that checks the value of the “actual_productivity”, setting the lower and upper bound of the range as 0 and 1 respectively. Notice that we use “>= 0.0” and “<= 1.0”, not “>= 0” and “<= 1”. It is related to the FLOAT data type, you should use the adequate one.
SELECT CASE WHEN actual_productivity >= 0.0 AND actual_productivity <= 1.0 THEN 1 ELSE 0 END AS value, date AS day_of_tested_data FROM `dqo-learning.data_quality.garments_worker_productivity` ORDER BY date
This result is not quite readable, it returns all of the valid data, but it would be a good idea to group it by day. Let’s do that and sum all of the valid rows per day.
SELECT SUM( CASE WHEN actual_productivity >= 0.0 AND actual_productivity <= 1.0 THEN 1 ELSE 0 END ) AS value, date AS day_of_tested_data FROM `dqo-learning.data_quality.garments_worker_productivity` GROUP BY day_of_tested_data ORDER BY date
Now it is much clearer.
Let’s find what is the percentage of valid records. To do that we need to divide the calculated sum by the sum of all records and multiply that by 100 to see the result in %.
SELECT SUM( CASE WHEN actual_productivity >= 0.0 AND actual_productivity <= 1.0 THEN 1 ELSE 0 END ) / COUNT(*) * 100.0 AS value, date AS day_of_tested_data FROM `dqo-learning.data_quality.garments_worker_productivity` GROUP BY day_of_tested_data
In the first 19 leading dates, only values within the expected range are concluded,
SELECT actual_productivity, date FROM `dqo-learning.data_quality.garments_worker_productivity` WHERE date = '2015-01-26'
There are more than 10 rows, but let’s focus on the 9’th record. As you can see this value is greater than 1.0, so it is an invalid one.
Data validity is one of the fundamental aspects of data observability. You need to choose carefully the validity principles for a given problem so that checks truly return correct results.
The covered case explains how to deal with the validation of numerical data range. The final query shows the percentage of valid data per day.
As listed at the beginning of the article, there are many more data types upon which various tests can be performed.
For more tutorials, check out our blog.