One of the fundamental dimensions of data quality is timeliness. This dimension can be measured in several ways, depending on the aspect you want to check. In this article, we will show how to set up the table timeliness check column_datetime_difference_percent.
Running the BigQuery example
To demonstrate how the DateTime difference percent check works, we will use a BigQuery dataset. Before running this example, make sure that your GCP project environment variable is set correctly. If any problems occur, check our article on how to get started with DQOps, where you will find a paragraph describing the prerequisites for running the example.
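For instance, a minimal shell setup might look like the following sketch. The project ID is hypothetical, and GOOGLE_CLOUD_PROJECT is the standard variable recognized by Google Cloud client libraries; see the getting-started article for the exact steps:

export GOOGLE_CLOUD_PROJECT=my-gcp-project   # hypothetical project ID
gcloud auth application-default login        # authenticate so BigQuery queries can run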
The table used in our example is called bigquery-public-data.austin_crime.crime. For the purpose of demonstration, we use the columns clearance_date and timestamp. These two columns contain the timestamps we will calculate the difference between. To better illustrate the problem, here are the first 10 rows of the chosen columns:
+-------------+----------------------+----------------------+
| unique_key | clearance_date | timestamp |
+-------------+----------------------+----------------------+
| 2015150483 | 2015-01-27T12:00:00Z | 2015-01-15T12:00:00Z |
| 20151542027 | 2015-07-24T12:00:00Z | 2015-06-03T12:00:00Z |
| 20152001758 | 2015-08-15T12:00:00Z | 2015-07-19T12:00:00Z |
| 20151340282 | 2015-06-15T12:00:00Z | 2015-05-14T12:00:00Z |
| 20151871268 | 2015-10-27T12:00:00Z | 2015-07-06T12:00:00Z |
| 20155032413 | 2015-10-19T12:00:00Z | 2015-07-24T12:00:00Z |
| 2015900289 | 2015-04-23T12:00:00Z | 2015-03-31T12:00:00Z |
| 20151060966 | 2015-04-23T12:00:00Z | 2015-04-16T12:00:00Z |
| 20151460876 | 2015-07-15T12:00:00Z | 2015-05-26T12:00:00Z |
| 20155024205 | 2015-12-17T12:00:00Z | 2015-06-05T12:00:00Z |
+-------------+----------------------+----------------------+
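You can reproduce this sample yourself with a simple query (since no ordering is applied, the exact rows returned may differ):

SELECT unique_key, clearance_date, timestamp
FROM `bigquery-public-data.austin_crime.crime`
LIMIT 10;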
Check configuration
After establishing the connection and importing the table’s configuration, you can access it by running the following command:
table edit -c=test -t=austin_crime.crime
We should now take a look at the YAML configuration file. This is where we can configure how the check works.
# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: austin_crime
    table_name: crime
  time_series:
    mode: current_time
    time_gradient: day
  checks:
    timeliness:
      column_datetime_difference_percent:
        parameters:
          column1: clearance_date
          column2: timestamp
          max_difference: 10
        rules:
          min_count:
            low:
              min_value: 90.0
            medium:
              min_value: 80.0
            high:
              min_value: 70.0
  columns:
    unique_key:
      type_snapshot:
        column_type: INT64
        nullable: true
      ...
    clearance_date:
      type_snapshot:
        column_type: TIMESTAMP
        nullable: true
      ...
    timestamp:
      type_snapshot:
        column_type: TIMESTAMP
        nullable: true
      ...
At this point, our interest focuses on the part of the YAML file shown below:
checks:
  timeliness:
    column_datetime_difference_percent:
      parameters:
        column1: clearance_date
        column2: timestamp
        max_difference: 10
      rules:
        min_count:
          low:
            min_value: 90.0
          medium:
            min_value: 80.0
          high:
            min_value: 70.0
As you can find on the DQOps documentation site, the column_datetime_difference_percent check takes three parameters. The first two are column1 and column2; the check calculates the time difference between the timestamps provided in these two columns. The third parameter is max_difference, which sets the time threshold beyond which we consider our data to be out of date. For instance, in the sample rows shown earlier, the record with unique_key 20151060966 has a difference of 7 days (2015-04-16 to 2015-04-23), which falls within the 10-day threshold, while the first record’s 12-day difference exceeds it.
Note that you can also define the time_scale in which the difference is measured. By default, time_scale is set to DAY, but you can change it to HOUR, MINUTE, or SECOND, as shown in the sketch below.
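For example, a configuration that measures the difference in hours instead of days might look like this sketch (the max_difference value of 240 hours is a hypothetical equivalent of the 10-day threshold above):

checks:
  timeliness:
    column_datetime_difference_percent:
      parameters:
        column1: clearance_date
        column2: timestamp
        max_difference: 240 # 240 hours = 10 days
        time_scale: HOUR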
Running the check
To run the check, use the check run command and specify the connection and the table:
check run --connection=test --table=austin_crime.crime
Check evaluation summary per table:
+----------+------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection|Table |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+----------+------------------+------+--------------+-------------+------------+---------------+-------------+
|test |austin_crime.crime|1 |1 |0 |0 |0 |1 |
+----------+------------------+------+--------------+-------------+------------+---------------+-------------+
To see the execution details, you can set the run mode to debug:
check run --connection=test --table=austin_crime.crime --mode=debug
**************************************************
Executing quality checks on table austin_crime.crime from connection test
**************************************************
**************************************************
Executing a sensor for a check column_datetime_difference_percent on the table austin_crime.crime using a sensor definition table/timeliness/column_datetime_difference_percent
**************************************************
**************************************************
Jinja2 engine has rendered the following template:
table/timeliness/column_datetime_difference_percent/bigquery.sql.jinja2
**************************************************
**************************************************
Executing SQL on connection test (bigquery)
SQL to be executed on the connection:
SELECT
    100.0 * SUM(
        CASE
            WHEN ABS(DATETIME_DIFF(CAST(analyzed_table.clearance_date AS DATETIME), CAST(analyzed_table.timestamp AS DATETIME), DAY)) < 10 THEN 1
            ELSE 0
        END
    ) / COUNT(*) AS actual_value,
    CAST(CURRENT_TIMESTAMP() AS date) AS time_period
FROM `bigquery-public-data`.`austin_crime`.`crime` AS analyzed_table
GROUP BY time_period
ORDER BY time_period
**************************************************
**************************************************
Finished executing a sensor for a check column_datetime_difference_percent on the table austin_crime.crime using a sensor definition table/timeliness/column_datetime_difference_percent, sensor result count: 1
Results returned by the sensor:
+-----------------+-----------+
|actual_value |time_period|
+-----------------+-----------+
|58.01757017355903|2022-06-23 |
+-----------------+-----------+
**************************************************
**************************************************
Finished executing rules (thresholds) for a check column_datetime_difference_percent on the table austin_crime.crime, verified rules count: 1
Rule evaluation results:
+-----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+------------+----------+------------------+----------------------------------+-----------------+---------------------------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|actual_value |expected_value|time_period |time_gradient|dimension_id|connection_hash |connection_name|provider|table_hash |schema_name |table_name|check_hash |check_name |quality_dimension|sensor_name |executed_at |duration_ms|severity|rule_hash |rule_name|high_lower_bound|medium_lower_bound|low_lower_bound|
+-----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+------------+----------+------------------+----------------------------------+-----------------+---------------------------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|58.01757017355903|70.0 |2022-06-23T00:00|day |0 |7701720396464728665|test |bigquery|6240082149560342038|austin_crime|crime |192313333026701629|column_datetime_difference_percent|timeliness |table/timeliness/column_datetime_difference_percent|2022-06-23T12:07:29.705Z|4371 |3 |4146619101597371801|min_count|70.0 |80.0 |90.0 |
+-----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+------------+----------+------------------+----------------------------------+-----------------+---------------------------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
**************************************************
Check evaluation summary per table:
+----------+------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection|Table |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+----------+------------------+------+--------------+-------------+------------+---------------+-------------+
|test |austin_crime.crime|1 |1 |0 |0 |0 |1 |
+----------+------------------+------+--------------+-------------+------------+---------------+-------------+
Results
As of 2022-06-23, the timeliness metric, defined in this case as the percentage of records whose absolute DateTime difference between the two selected columns is less than 10 days, is around 58.02%. Since this value falls below even the high-severity threshold (min_value: 70.0), the check raises a high-severity alert.
To visualize the results on the dashboards, you first have to synchronize the data:
cloud sync data
After a while, the synchronized results should be available on the dashboards. To access them, run:
cloud login
Conclusion
Data timeliness is one of the main data quality metrics. It ensures that the data you want to use is up to date, and points out data that arrived late.
In this article, we’ve demonstrated how easily you can analyze one of the data timeliness metrics.
For more information, we highly recommend taking a look at the DQOps documentation page and our other articles.