What is a data timeliness check?

When collecting various sorts of data, we usually keep track of the date the data was collected.

But, in terms of data observability, what matters most is when the data was uploaded and whether it is current.

Assume we’re using a scheduler such as Airflow that delivers revenue data to the database at the end of the day. The data import may fail if we do not specify a timeout for a task in the DAG, or if we choose a timeout that is too short: one task might crash and the subsequent ones will not launch. As a consequence, we have no data for today, which creates a dilemma. Regular timeliness checks are crucial to detect such situations quickly.

The consistency of such data sets is critical. Before you begin any analysis, double-check that the data is current and free of gaps. Failures can be caused by a variety of factors, including a lost connection during insertion, data not being sent on time, or a loss of access to a database (due to a change of location or permissions). All of them mean that the data might not be uploaded in a timely manner, resulting in inaccurate analysis results.

Timeliness checks verify that the data is up to date, that there is no delay in loading files into the database, and that the loading does not take too much time.

In this article, we will show how to perform a data timeliness check. We will use electricity load data which should be updated every 15 minutes.

We will get the data timeliness information by extracting the delays between the timestamps of each pair of neighboring records. Then we will save them in a result table and see whether the data is current.


What is data timeliness

The dimension that checks whether the data is current is called data timeliness. A measure of this dimension is the time between when the data is expected to appear and when it actually becomes available. When to perform a timeliness check is dictated by when we expect a new batch of data.

There are a couple of different checks that you can associate with timeliness. The most fundamental one verifies that the data is up to date: for example, you can check whether new data arrives regularly, or measure the interval between two events (such as the time of data arrival and the time the tests run on it).
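
As a minimal sketch of such a freshness check (the table name `my_project.my_dataset.revenue` and the `inserted_at` column are hypothetical, chosen only for illustration and not part of this article’s example), you could measure how many minutes have passed since the newest record arrived:

-- A minimal freshness check; the table and column names are hypothetical
SELECT MAX(inserted_at) AS last_record_time,
    TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(inserted_at), MINUTE) AS minutes_since_last_record
FROM `my_project.my_dataset.revenue`

If minutes_since_last_record exceeds the expected loading interval, the data is late.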

Of course, when performing such checks everything may be fine – the data is processed regularly with no errors. So you might ask:

Why do we need the test if everything is correct? 

In a great majority of cases, it is dangerous to assume that the dataset has no flaws.

In a real-world scenario, there is a procedure in place to find out whether everything is correct with regard to timeliness and the other data quality dimensions.

As we discussed earlier, timeliness is a dimension we should test regularly. Constant insight into the data flow is a major aspect of data observability. If we observe timeliness issues, it is a sign that other problems might appear.

Another aspect of data quality is tracking the data timeline. A high level of data freshness indicates uninterrupted data processing at all stages up to the one we currently use.

Timeliness is one of the most important and valuable aspects of data observability, so we would like to take some time in the next couple of articles to walk you through examples of timeliness checks on different data sets.

Let’s begin by discussing the dataset and how to run the analysis in BigQuery. After that, we will move on to the example, where we will find out whether there are delays between the timestamps in our dataset.

Technologies we use

In this example, we will use BigQuery – a serverless data warehouse hosted on the Google Cloud Platform. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform.

In BigQuery, when we refer to a table, the syntax is as follows:

`[project_name].[dataset_name].[table_name]`.

The dataset

The dataset can be obtained from here, where you can also find the details.

It describes the energy consumption of individual clients. The first column is a timestamp, which indicates the time of the measurement. The interval between measurements should be regular and equal to 15 minutes. The rest of the columns are clients marked as “MT_i”, where “i” takes values from 001 to 370, so there are 371 columns in total.

Here are some exemplary records from the table. As you can see, the “MT_001” column contains only values equal to 0.0 for these timestamps. There are two possible explanations: the client did not exist at that time (in some cases there would be NULL values), or it did not consume any energy.
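
If you would like to peek at a few records yourself, a query along these lines should do (MT_001 and MT_002 are the first two client columns described above; the LIMIT of 5 is picked only for illustration):

-- Preview a handful of the earliest rows and the first two client columns
SELECT timestamp_field_0,
    MT_001,
    MT_002
FROM `dqo-learning.data_quality.electricity_load_diagrams`
ORDER BY timestamp_field_0 ASC
LIMIT 5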

The solution

First things first, let’s extract each row’s timestamp and day_of_tested_data, which will be the aggregation unit in the further queries, and order the records by date.

				
SELECT timestamp_field_0 AS present_timestamp,
    DATE(FORMAT_TIMESTAMP("%Y-%m-%d", timestamp_field_0)) AS day_of_tested_data
FROM `dqo-learning.data_quality.electricity_load_diagrams`
ORDER BY present_timestamp ASC

For the purpose of the further delay calculation, we are about to append the preceding timestamp to each row, as follows:

				
SELECT timestamp_field_0 AS present_timestamp,
    LAG(timestamp_field_0) OVER (
        ORDER BY timestamp_field_0 ASC
    ) AS preceding_timestamp,
    DATE(FORMAT_TIMESTAMP("%Y-%m-%d", timestamp_field_0)) AS day_of_tested_data
FROM `dqo-learning.data_quality.electricity_load_diagrams`
ORDER BY present_timestamp ASC

The preceding_timestamp for the first row is NULL, which is fine because the first row of the whole dataset does not have a preceding value.

Let’s calculate the delay between the present and the preceding timestamp. Keep in mind that the time difference does not make sense for the row with a NULL preceding_timestamp, so we filter it out using the WHERE statement.

				
WITH dq_check_preparation AS (
    SELECT timestamp_field_0 AS present_timestamp,
        LAG(timestamp_field_0) OVER (
            ORDER BY timestamp_field_0 ASC
        ) AS preceding_timestamp,
        DATE(FORMAT_TIMESTAMP("%Y-%m-%d", timestamp_field_0)) AS day_of_tested_data
    FROM `dqo-learning.data_quality.electricity_load_diagrams`
)
SELECT present_timestamp,
    preceding_timestamp,
    TIMESTAMP_DIFF(present_timestamp, preceding_timestamp, MINUTE) AS value, -- the delay
    day_of_tested_data
FROM dq_check_preparation
WHERE preceding_timestamp IS NOT NULL
ORDER BY present_timestamp

As you can see, we got the delays between consecutive records, expressed in minutes.

Next, let’s drop both timestamps and extract the maximum delay for each day.

				
WITH dq_check_preparation AS (
    SELECT timestamp_field_0 AS present_timestamp,
        LAG(timestamp_field_0) OVER (
            ORDER BY timestamp_field_0 ASC
        ) AS preceding_timestamp,
        DATE(FORMAT_TIMESTAMP("%Y-%m-%d", timestamp_field_0)) AS day_of_tested_data
    FROM `dqo-learning.data_quality.electricity_load_diagrams`
)
SELECT MAX(
        TIMESTAMP_DIFF(present_timestamp, preceding_timestamp, MINUTE)
    ) AS value, -- the delay
    day_of_tested_data
FROM dq_check_preparation
WHERE preceding_timestamp IS NOT NULL
GROUP BY day_of_tested_data
ORDER BY day_of_tested_data

Then we save the outputs to a specially prepared result table.

If you are curious how to create such a table in BigQuery, check the creation steps here.
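
If you only need a quick sketch, a result table compatible with the insert below could be defined roughly like this (the column types are our assumption, inferred from the values we insert, and not the exact definition from the linked article):

-- A possible definition of the result table; column types are assumed
CREATE TABLE IF NOT EXISTS `dqo-learning.data_quality.dq_check_results` (
    uuid STRING,
    tested_resource STRING,
    test_name STRING,
    value FLOAT64,
    created_time TIMESTAMP,
    day_of_tested_data DATE
)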

Here is the whole query that reads the tested table, calculates the delays, and stores the results:

				
INSERT INTO `dqo-learning.data_quality.dq_check_results` (
    uuid,
    tested_resource,
    test_name,
    value,
    created_time,
    day_of_tested_data
)
WITH dq_check_preparation AS (
    SELECT timestamp_field_0 AS present_timestamp,
        LAG(timestamp_field_0) OVER (
            ORDER BY timestamp_field_0 ASC
        ) AS preceding_timestamp,
        DATE(FORMAT_TIMESTAMP("%Y-%m-%d", timestamp_field_0)) AS day_of_tested_data
    FROM `dqo-learning.data_quality.electricity_load_diagrams`
),
delay_extraction AS (
    SELECT MAX(
            TIMESTAMP_DIFF(present_timestamp, preceding_timestamp, MINUTE)
        ) AS value, -- the delay
        day_of_tested_data
    FROM dq_check_preparation
    WHERE preceding_timestamp IS NOT NULL
    GROUP BY day_of_tested_data
)
SELECT GENERATE_UUID() AS uuid,
    'dqo-learning.data_quality.electricity_load_diagrams' AS tested_resource,
    'max_delay_in_minutes' AS test_name,
    value,
    CURRENT_TIMESTAMP() AS created_time,
    day_of_tested_data
FROM delay_extraction

We can now run queries on the inserted data.

				
SELECT *
FROM `dqo-learning.data_quality.dq_check_results`
WHERE tested_resource = 'dqo-learning.data_quality.electricity_load_diagrams'
    AND test_name = 'max_delay_in_minutes'
    AND created_time >= '2021-12-14T13:00:00'
ORDER BY day_of_tested_data

And finally, let’s filter the readings to find out whether any day was delayed.

				
SELECT *
FROM `dqo-learning.data_quality.dq_check_results`
WHERE tested_resource = 'dqo-learning.data_quality.electricity_load_diagrams'
    AND test_name = 'max_delay_in_minutes'
    AND created_time >= '2021-12-14T13:00:00'
    AND value > 15
ORDER BY day_of_tested_data

The query returns no rows. This means there is no delayed day, which might not be obvious at first sight, so let’s redefine the SQL to state the result more explicitly.

				
SELECT SUM(
        CASE
            WHEN value <= 15 THEN 1
            ELSE 0
        END
    ) AS timely_days,
    SUM(
        CASE
            WHEN value > 15 THEN 1
            ELSE 0
        END
    ) AS delayed_days
FROM `dqo-learning.data_quality.dq_check_results`
WHERE tested_resource = 'dqo-learning.data_quality.electricity_load_diagrams'
    AND test_name = 'max_delay_in_minutes'
    AND created_time >= '2021-12-14T13:00:00'

Conclusion

Data timeliness is another foundation of data observability. It tells us when data arrives and when certain operations take place. In the real world, when we deal with a whole operational data management system, it is a starting point for troubleshooting when an error occurs.

This example covered the case when the data appeared to be timely. We showed how to run a timeliness check, insert the results into a dedicated table, and ensure the correctness of the test.

We often want to visualize the data on dashboards. In terms of data observability, it is useful if the data behind them is timely, so data timeliness checks are crucial for dashboards to work properly.
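
For example, a simple view over the result table can serve as a dashboard data source. This is only a sketch: the view name and the is_delayed flag based on the 15-minute threshold are our suggestion, not part of the original setup.

-- A view a dashboard could read; the view name is hypothetical
CREATE OR REPLACE VIEW `dqo-learning.data_quality.max_delay_dashboard` AS
SELECT day_of_tested_data,
    value AS max_delay_in_minutes,
    value > 15 AS is_delayed
FROM `dqo-learning.data_quality.dq_check_results`
WHERE tested_resource = 'dqo-learning.data_quality.electricity_load_diagrams'
    AND test_name = 'max_delay_in_minutes'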

For more tutorials, check out our blog.

