To get the most value out of data quality analytics, it is critical to check whether the data meets the requirements for its intended use.
If the data meets particular criteria, it is considered valid, and the process of verifying this is known as a data validity check.
Problem statement and solution
In this article we will look at the following situation: we receive a table where one of the columns is of the String or Varchar data type. Looking at the data, we see that this column contains values like “2022-01-01”. We would like to find out what percentage of the values are dates.
To find the percentage of valid records, we have to count the values that can be interpreted as dates by the database. In BigQuery there is SAFE_CAST, which works just like a normal CAST, but when the cast cannot be performed it returns NULL instead of raising an error. Another useful function is SAFE.PARSE_DATE, which, as the name suggests, works like a safe cast, but it parses a value according to a given format.
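The behavior of both functions can be illustrated with a short query (BigQuery Standard SQL; the literal values are just examples):

```sql
SELECT
  -- SAFE_CAST returns NULL instead of raising an error on failure
  SAFE_CAST('2022-01-01' AS DATE)  AS valid_cast,    -- 2022-01-01
  SAFE_CAST('not a date' AS DATE)  AS invalid_cast,  -- NULL
  -- SAFE.PARSE_DATE parses a value for a given format string
  SAFE.PARSE_DATE('%Y-%m-%d', '2022-01-01') AS parsed,      -- 2022-01-01
  SAFE.PARSE_DATE('%Y-%m-%d', '01/01/2022') AS not_parsed;  -- NULL
```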
Keep in mind that the safe functions are named differently in different SQL engines, as the dialects differ.
For instance, in Snowflake we use TRY_CAST instead of SAFE_CAST.
One more thing should be considered. Apart from “normal” date formats, we may deal with formats like epoch time, which is the number of seconds that have elapsed since January 1, 1970 (midnight UTC/GMT), not counting leap seconds (in ISO 8601: 1970-01-01T00:00:00Z).
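In BigQuery, an epoch value stored as a string can be interpreted as a date by safely casting it to an integer first and then using TIMESTAMP_SECONDS (the table and column names below are placeholders, not from the example data):

```sql
-- Safely interpret a string column holding epoch seconds as a date;
-- rows that are not valid integers yield NULL instead of an error
SELECT DATE(TIMESTAMP_SECONDS(SAFE_CAST(raw_value AS INT64))) AS as_date
FROM my_dataset.my_table;
```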
To find out the percentage of dates, we can run the column validity check date_type_percent. As the name suggests, it calculates the percentage of values that can be read as a date.
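Conceptually, the check boils down to a query along these lines (a sketch, not the exact query the tool generates; the table and column names are placeholders):

```sql
-- COUNT(expr) counts only non-NULL values, so counting the result of
-- SAFE_CAST gives the number of values that can be read as a DATE
SELECT
  100.0 * COUNT(SAFE_CAST(dates AS DATE)) / COUNT(*) AS date_type_percent
FROM my_dataset.my_table;
```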
Step by step solution
List the connections by typing connection list. If there is no connection, add one: type connection add, name the connection, select the provider, and choose the schema (in BigQuery, a schema is called a dataset).
The next step is to import tables. As suggested, type:
table import=<connection_name>
We have imported one table on which we are going to run a validity check.
If you try to run any check at this point, nothing will happen, because the check has to be configured first.
To run the check, open the table YAML configuration file, which looks like this:
We would like to perform a check on the column level, so go down to the columns section and type:
...
columns:
  dates:
    ...
    checks:
      validity:
        date_type_percent:
          rules:
            min_count:
              low: 95.0
              medium: 90.0
              high: 80.0
This configuration tells the app to run a data validity check that calculates the percentage of values in a column that can be interpreted as dates.
If the percentage is greater than or equal to 95%, everything is fine. If the score is between 90% and 95%, severity level 1 is assigned; if it is less than 90% and greater than or equal to 80%, severity level 2; and below 80%, severity level 3.
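The threshold logic defined by the rules above can be sketched as a CASE expression (illustrative only; in practice the severity is assigned by the tool, not by a query you write):

```sql
-- Map a date_type_percent score to a severity level (0 = passed)
SELECT
  CASE
    WHEN score >= 95.0 THEN 0  -- passed
    WHEN score >= 90.0 THEN 1  -- severity level 1
    WHEN score >= 80.0 THEN 2  -- severity level 2
    ELSE 3                     -- severity level 3
  END AS severity
FROM (SELECT 92.5 AS score);  -- example score
```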
Save the file and run checks:
checks run -c=<connection_name>
As we can see, the data is valid: 95% or more of the values satisfy our requirements.
If you would like to see how the checks run step by step, you can run checks in debug mode:
checks run -m=debug
We can copy the query and verify the results ourselves.
Data validity is one of the fundamental aspects of data observability. You need to choose the validity principles for a given problem carefully so that the checks truly return correct results.
The case covered here explains how to validate string values as dates in a column.
Of course, there are many more date formats, and as one might expect, this query would not cover them all.
This check has two more features. You can configure the date format to parse: we have built in the most popular date formats, so data like “Jan 1, 2022” would not be a problem once configured. Apart from that, the user can provide a custom format as a string in the YAML configuration file.
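Under the hood, parsing a non-standard format like this comes down to a format string (BigQuery syntax; %b is the abbreviated month name, %d the day, %Y the four-digit year):

```sql
-- Parse a "Month day, year" style value; SAFE.PARSE_DATE returns NULL
-- instead of an error when the value does not match the format
SELECT SAFE.PARSE_DATE('%b %d, %Y', 'Jan 1, 2022') AS parsed;
```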
Check out our blog for other articles.