Data timeliness: delay check

In a dataset, we usually keep track of the date on which the data was collected.

When it comes to data observability, two of the most important factors are when the data was delivered and whether it is current.

Suppose you maintain an ETL pipeline that inserts records into your database. Underestimating the volume of data and using inadequate hardware are two of the most typical mistakes. When the pipeline fails, it stops incrementally loading new data, which is a problem if we need the most recent entries.

Those errors are more common in on-premises setups. Of course, scalable cloud-based solutions can tackle these difficulties quite effectively with tools such as load balancers, but there are still situations when they cannot. The answer to this problem is a timeliness delay check, which can be implemented in the cloud as well.

In the previous articles, we discussed a timeliness average delay check, how to perform it when the dataset is incomplete, and how to verify that. This time we would like to focus on data delivery timeliness checks.

This article will show how to create a delay check for data upload. We will work on a dataset concerning weather in Poland, which contains columns with the day of temperature measurement and the data upload timestamp.


Defining the delay

To calculate a delay in the data flow, we need two columns in our database: the day of measurement and the time of data delivery.

Data delivery time depends on what we define as delivery. It can be the time the data appears in the system or the creation time of the file. Of course, depending on the situation, we should choose the adequate one. Nevertheless, the most commonly used, and the easiest to handle, is the time of data insertion. It is often named “created time”, “time of creation”, or “created at”.

Before querying the database, you have to be familiar with the dataset and know what each column represents. Especially in the case of a delay check, you have to be sure what to interpret as data delivery.
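To make the definition concrete, the delay computation can be sketched in Python. This is a toy example with hypothetical values, not the IMGW schema; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Toy example: a measurement taken on 2023-05-01,
# inserted into the system on 2023-05-03 at 06:30.
day_of_measurement = datetime(2023, 5, 1)  # start of the measurement day
created_at = datetime(2023, 5, 3, 6, 30)   # delivery ("created at") timestamp

# Delay expressed in whole 24-hour periods since the start of the measurement day.
delay_in_days = (created_at - day_of_measurement) // timedelta(days=1)

# A delay of more than 1 full day (i.e. more than 48 hours from the start
# of the measurement day) means the record arrived late.
print(delay_in_days)  # 2 -> late
```

The delivery timestamp you plug in here is whatever your system defines as delivery; swapping in a file-creation time instead of an insertion time changes the result accordingly.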

The dataset

The dataset is provided by IMGW (Institute of Meteorology and Water Management – National Research Institute) and can be found here. We will work on synoptic data.

The system we describe here collects weather data from various sources and performs calculations on it. The dataset is updated every day, and each day’s data should be inserted within 24 hours, that is, by the end of the following day.

Let’s have a look at the first 10 rows.

So far, so good: the data is inserted properly. But scrolling down a little, we discover that some records (e.g., row no. 53) appear after the established time (more than 48 hours counting from the day_of_data column).

In this example, we will analyze the delay of data insertion (a record counts as delayed if it appeared more than 48 hours after the start of the measurement day). Our goal is to find the percentage of timely data for each month.

Technologies we use

In this example, we will use BigQuery – a serverless data warehouse hosted on the Google Cloud Platform. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform.

In BigQuery, when we refer to a table, the syntax is as follows (we will use it in the queries below):

`[project_name].[dataset_name].[table_name]`.

Delay check

First, let’s extract the data collected from the IMGW API for the town of Siedlce.

SELECT station_id,
    name,
    average,
    day_of_data,
    create_time
FROM `dqo-learning.data_quality.imgw_temp`
WHERE name = 'SIEDLCE'
ORDER BY name,
    day_of_data

We will use only the data relevant to the timeliness check: the date and time information for the town.

The timeliness check computes the delay, in full days, between the day of measurement and the insertion time.

WITH data_table AS (
    SELECT station_id,
        name,
        average,
        day_of_data,
        create_time
    FROM `dqo-learning.data_quality.imgw_temp`
    WHERE name = 'SIEDLCE'
    ORDER BY name,
        day_of_data
)
SELECT TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days,
    day_of_data
FROM data_table

We will group the data by month to count the timely and untimely records. Let’s put the previous query into a CTE (Common Table Expression).

dq_check AS (
    SELECT TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days,
        DATE_TRUNC(day_of_data, MONTH) AS month_of_data
    FROM data_table
)
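For intuition, the month truncation can be sketched in Python: `DATE_TRUNC(day_of_data, MONTH)` maps a date to the first day of its month. The function name below is illustrative.

```python
from datetime import date

def truncate_to_month(d: date) -> date:
    # Same idea as BigQuery's DATE_TRUNC(d, MONTH):
    # map any date to the first day of its month.
    return d.replace(day=1)

print(truncate_to_month(date(2023, 5, 17)))  # 2023-05-01
```

Grouping by this truncated value is what lets us aggregate per month in the next step.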

Finally, extract the percentage of timely data.

SELECT SUM(
        CASE
            WHEN delay_in_days <= 1 THEN 1
            ELSE 0
        END
    ) * 100.0 / COUNT(*) AS value,
    month_of_data
FROM dq_check
GROUP BY month_of_data
ORDER BY month_of_data

It appears that most of the data is timely. There were only 4 months in which around 4% of the data was delivered after 2 days.

The overall timeliness is quite good: around 98.82%.
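The same aggregation can be mirrored in plain Python to check the logic. The rows below are toy data standing in for the dq_check CTE, not the real query results.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (delay_in_days, month_of_data) rows.
rows = [
    (0, date(2023, 5, 1)), (1, date(2023, 5, 1)), (3, date(2023, 5, 1)),
    (0, date(2023, 6, 1)), (1, date(2023, 6, 1)),
]

timely = defaultdict(int)
total = defaultdict(int)
for delay_in_days, month_of_data in rows:
    total[month_of_data] += 1
    if delay_in_days <= 1:  # mirrors the CASE WHEN condition
        timely[month_of_data] += 1

# Percentage of timely records per month, like the final SELECT.
for month in sorted(total):
    print(month, timely[month] * 100.0 / total[month])
```

In this toy sample, May has one record delivered 3 days late, so its timeliness drops below 100%, while June stays fully timely.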

The whole query is below.

WITH data_table AS (
    SELECT station_id,
        name,
        average,
        day_of_data,
        create_time
    FROM `dqo-learning.data_quality.imgw_temp`
    WHERE name = 'SIEDLCE'
    ORDER BY name,
        day_of_data
),
dq_check AS (
    SELECT TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days,
        DATE_TRUNC(day_of_data, MONTH) AS month_of_data
    FROM data_table
)
SELECT SUM(
        CASE
            WHEN delay_in_days <= 1 THEN 1
            ELSE 0
        END
    ) * 100.0 / COUNT(*) AS value,
    month_of_data
FROM dq_check
GROUP BY month_of_data
ORDER BY month_of_data

Conclusion

The analysis we performed is another example of a fundamental timeliness check. We showed how to create a check that verifies how much of the data was delivered within a certain time.

The dataset we used, weather data, is one of many examples of why data delivery timeliness is important. To gather statistics such as the monthly average temperature, we have to be sure that the monthly data is complete; otherwise, such an analysis is misleading.

It is important to observe the data from every point of view. Timeliness is one of the metrics that help achieve that.

For more tutorials, check out our blog.

Do you want to learn more about Data Quality?

Subscribe to our newsletter and learn the best data quality practices.
