Data timeliness: average delay check with dashboard

In this article, we check the data timeliness of a public BigQuery dataset using an average delay check.

After configuring and running the check, we verify the average timestamp difference between two columns and use the timeliness dashboard available in our application to visualize the results.

Our example uses the public dataset bigquery-public-data.austin_crime.crime.

Description

The table timeliness check average_delay calculates the average timestamp difference between corresponding records (the ones that appear in the same row) in two specified columns. Setting the time series mode to timestamp_column is essential here, since the rule for this check needs a certain number of readings to run properly. The unit of the timestamp difference is configurable with the time_scale parameter.
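
To make the definition concrete, below is a minimal Python sketch (not DQO's sensor implementation, just an illustration of the arithmetic) of an average delay over two made-up rows, with the difference expressed in days:

from datetime import datetime

# Two made-up rows; each pair holds the (column1, column2) timestamps from one record.
rows = [
    (datetime(2023, 5, 10, 8, 0), datetime(2023, 5, 8, 8, 0)),    # 2 days apart
    (datetime(2023, 5, 11, 20, 0), datetime(2023, 5, 10, 8, 0)),  # 1.5 days apart
]

# Per-row difference expressed in days; a different time_scale would only change the unit.
delays = [(c1 - c2).total_seconds() / 86400 for c1, c2 in rows]

# average_delay reports the mean of these per-row differences.
print(sum(delays) / len(delays))  # 1.75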

Daily average on BigQuery example

The data used in this article is available as a public dataset on BigQuery.

The table is bigquery-public-data.austin_crime.crime; here are the first 5 rows for the columns unique_key, clearance_date, and timestamp:

					+-------------+     +----------------------+     +----------------------+
| unique_key  | ... |    clearance_date    | ... |      timestamp       | ...
+-------------+     +----------------------+     +----------------------+ 
|  2015150483 | ... | 2015-01-27T12:00:00Z | ... | 2015-01-15T12:00:00Z | ...
| 20151542027 | ... | 2015-07-24T12:00:00Z | ... | 2015-06-03T12:00:00Z | ...
| 20152001758 | ... | 2015-08-15T12:00:00Z | ... | 2015-07-19T12:00:00Z | ...
| 20151340282 | ... | 2015-06-15T12:00:00Z | ... | 2015-05-14T12:00:00Z | ...
| 20151871268 | ... | 2015-10-27T12:00:00Z | ... | 2015-07-06T12:00:00Z | ...
+-------------+     +----------------------+     +----------------------+

For this check, we will use the clearance_date and timestamp columns to calculate the timestamp difference between them.
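
As a quick sanity check, the per-row delays for these five sample rows can be reproduced with a few lines of Python (purely illustrative; the actual sensor runs as a query on BigQuery):

from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%SZ"

# clearance_date and timestamp values copied from the five sample rows above.
rows = [
    ("2015-01-27T12:00:00Z", "2015-01-15T12:00:00Z"),
    ("2015-07-24T12:00:00Z", "2015-06-03T12:00:00Z"),
    ("2015-08-15T12:00:00Z", "2015-07-19T12:00:00Z"),
    ("2015-06-15T12:00:00Z", "2015-05-14T12:00:00Z"),
    ("2015-10-27T12:00:00Z", "2015-07-06T12:00:00Z"),
]

# Days between the crime being reported (timestamp) and cleared (clearance_date).
delays = [
    (datetime.strptime(clearance, fmt) - datetime.strptime(ts, fmt)).days
    for clearance, ts in rows
]
print(delays)  # [12, 51, 27, 32, 113]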

Check configuration

After adding a connection and importing tables, you can access the table configuration by running:

table edit -c=<connection_name> -t=austin_crime.crime

The YAML configuration looks like this (the time_series and checks sections contain everything needed to define this check):

# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: austin_crime
    table_name: crime
  time_series:
    mode: timestamp_column
    timestamp_column: timestamp
    time_gradient: day
  checks:
    timeliness:
      average_delay:
        parameters:
          column1: "clearance_date"
          column2: "timestamp"
        rules:
          moving_average:
            low:
              max_percent_above: 5.0
              max_percent_below: 5.0
            medium:
              max_percent_above: 10.0
              max_percent_below: 10.0
            high:
              max_percent_above: 15.0
              max_percent_below: 15.0
  columns:
    unique_key:
      type_snapshot:
        column_type: INT64
        nullable: true
    ...
    clearance_date:
      type_snapshot:
        column_type: TIMESTAMP
        nullable: true
    ...
    timestamp:
      type_snapshot:
        column_type: TIMESTAMP
        nullable: true
    ...

Notice that the column timestamp has two roles here:
  • it is used to calculate the timestamp difference,
  • it serves as the timestamp column in time_series, truncated to a DAY (in this case), as sketched below.
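
Because time_series uses mode: timestamp_column with time_gradient: day, the per-row delays are bucketed by the day part of the timestamp column before they are averaged. The short sketch below, using made-up values rather than rows from the crime table, illustrates that grouping:

from collections import defaultdict
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%SZ"

# Made-up (clearance_date, timestamp) pairs; the first two rows share the 2015-01-15 time period.
rows = [
    ("2015-01-27T12:00:00Z", "2015-01-15T06:00:00Z"),
    ("2015-01-17T12:00:00Z", "2015-01-15T18:00:00Z"),
    ("2015-01-20T12:00:00Z", "2015-01-16T12:00:00Z"),
]

# Group the per-row delays by the DAY-truncated value of the timestamp column...
per_day = defaultdict(list)
for clearance, ts in rows:
    c, t = datetime.strptime(clearance, fmt), datetime.strptime(ts, fmt)
    per_day[t.date()].append((c - t).total_seconds() / 86400)

# ...and average within each day, which is what one sensor reading per time_period represents.
for day, delays in sorted(per_day.items()):
    print(day, sum(delays) / len(delays))
# 2015-01-15 7.0
# 2015-01-16 4.0
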
To see the results of our check, type in the terminal:

check run

The results below are the same as those you would get by running the sensor query directly on the provider's platform (in this case BigQuery).

Finished executing a sensor for a check average_delay on the table austin_crime.crime using a sensor definition table/timeliness/average_delay, sensor result count: 1096

Results returned by the sensor:
+------------------+-----------+
|actual_value      |time_period|
+------------------+-----------+
|18.989583333333336|2014-01-01 | <--- Here are the dates from
+------------------+-----------+      the column `timestamp`.
|19.49122807017543 |2014-01-02 |
+------------------+-----------+
|21.253968253968246|2014-01-03 |
+------------------+-----------+
|19.614457831325293|2014-01-04 |
+------------------+-----------+
|19.990196078431367|2014-01-05 |
+------------------+-----------+
|17.62015503875969 |2014-01-06 |
+------------------+-----------+
|24.454545454545457|2014-01-07 |
+------------------+-----------+
|16.343749999999996|2014-01-08 |
+------------------+-----------+
|30.16666666666667 |2014-01-09 |
+------------------+-----------+
|21.54455445544554 |2014-01-10 |
+------------------+-----------+


Check summary

The check evaluation summary briefly informs us about the check execution:

Check evaluation summary per table:
+------------+------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection  |Table             |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+------------+------------------+------+--------------+-------------+------------+---------------+-------------+
|connection_1|austin_crime.crime|1     |1096          |232          |201         |181            |475          |
+------------+------------------+------+--------------+-------------+------------+---------------+-------------+

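The alert counts come from applying the moving_average rule to each daily reading. The exact rule logic is implemented in DQO; the sketch below is only a simplified illustration of how percentage bands around a moving average of earlier readings (the 5/10/15% thresholds from the YAML above) could map a reading to a severity level. The helper function and its name are hypothetical:

def classify(actual_value, moving_average, thresholds=(5.0, 10.0, 15.0)):
    """Illustrative only: map the deviation from a moving average to an alert severity."""
    deviation_percent = abs(actual_value - moving_average) / moving_average * 100
    # The thresholds mirror max_percent_above / max_percent_below from the YAML above.
    low, medium, high = thresholds
    if deviation_percent > high:
        return "high"
    if deviation_percent > medium:
        return "medium"
    if deviation_percent > low:
        return "low"
    return "valid"

# Example: a 30.2-day reading against a ~20-day moving average deviates by roughly 51%,
# which exceeds all three bands and would be counted as a high-severity alert.
print(classify(30.17, 20.0))  # high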

Each executed check can be visualized in our application. The dashboard below, available in cloud.dqo.ai, presents our results in graphical form.

Conclusion

There are many more checks and examples to explore. This article shows the most basic dashboard and YAML configuration that enable you to successfully visualize the results of your checks. The linked documentation provides more information about the described case. Check out our blog, where we describe how to use the application in detail.

Do you want to learn more about Data Quality?

Subscribe to our newsletter and learn the best data quality practices.
