We usually keep track of the date on which the data was collected. In data observability, the most important questions are when the data was delivered and whether it is current.
Suppose you are maintaining an ETL pipeline that inserts records into your database. The two most typical mistakes are underestimating the volume of data and using inadequate hardware. When the pipeline fails, it stops incrementally loading new data, which is a problem if we need the most recent entries.
These errors are more common with on-premises setups. Of course, scalable cloud-based solutions can tackle these difficulties quite effectively with tools such as load balancers, but there are still situations where they cannot. The answer to this problem is a timeliness delay check, which can be implemented in the cloud as well.
In the previous articles, we discussed a timeliness average delay check, how to perform it when the dataset is incomplete, and how to verify that. This time we would like to focus on data delivery timeliness checks.
This article will show how to create a delay check for data upload. We will work on a dataset concerning weather in Poland. It contains columns with the day of the temperature measurement and the data upload timestamp.
Defining the delay
To calculate a delay in the data flow, we need two columns in our database: the day of measurement and the time of data delivery.
Data delivery time depends on what we define as delivery. It can be the time the data appears in the system or the creation time of the file. Of course, depending on the situation, we should choose the adequate one. Nevertheless, the most commonly used and the easiest to handle is the time of data insertion. It is often named “created time”, “time of creation”, or “created at”.
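For illustration, a table that supports such a check might look like the sketch below. The table and column names here are hypothetical, chosen only to show the two ingredients the check needs:

```sql
-- Hypothetical layout: each row stores the measurement day plus a
-- "created_at" timestamp recording when the row was inserted (delivery).
CREATE TABLE weather_measurements (
  station_id STRING,
  day_of_data DATE,      -- day the measurement was taken
  created_at TIMESTAMP   -- time the record was inserted into the system
);
```

The delay is then simply the difference between created_at and day_of_data.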
Before querying the database, you have to be familiar with the dataset and know what each column represents. Especially in the case of a delay check, you have to be sure what to interpret as data delivery.
The data set is provided by IMGW (Institute of Meteorology and Water Management – National Research Institute) and can be found here. We will work on synoptic data.
The system we describe here collects weather data from various sources and performs calculations on it. The data set is updated every day, and the data should be inserted within 24 hours, by the end of the following day.
Let’s have a look at the first 10 rows.
In this example, we will analyze the delay of data insertion (a record counts as delayed when it appeared more than 48 hours after the start of the measurement day). Our goal is to find the percentage of the data that is timely for each month.
Technologies we use
In this example, we will use BigQuery – a serverless data warehouse hosted on the Google Cloud Platform. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform.
In BigQuery when we refer to a table, the syntax is as follows (keep in mind this part):
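A fully qualified table name consists of the project ID, the dataset name, and the table name, separated by dots and wrapped in backticks:

```sql
SELECT *
FROM `project_id.dataset_name.table_name`
```

In our case the project is dqo-learning, the dataset is data_quality, and the table is imgw_temp.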
First, let’s extract some data gathered from the IMGW API for the town of Siedlce.
SELECT station_id,
       name,
       average,
       day_of_data,
       create_time
FROM `dqo-learning.data_quality.imgw_temp`
WHERE name = 'SIEDLCE'
ORDER BY name, day_of_data
We will use the data relevant to the timeliness check – the date and time information for the town.
The timeliness test extracts the delay in full days between the day of the measurement and the insertion time.
WITH data_table AS (
  SELECT station_id, name, average, day_of_data, create_time
  FROM `dqo-learning.data_quality.imgw_temp`
  WHERE name = 'SIEDLCE'
  ORDER BY name, day_of_data
)
SELECT TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days,
       day_of_data
FROM data_table
We will group the data by month to calculate the counts of timely and untimely records. Put the query into a CTE (Common Table Expression):
dq_check AS (
  SELECT TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days,
         DATE_TRUNC(day_of_data, MONTH) AS month_of_data
  FROM data_table
)
Finally, extract the percentage of timely data.
SELECT SUM(CASE WHEN delay_in_days <= 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS value,
       month_of_data
FROM dq_check
GROUP BY month_of_data
ORDER BY month_of_data
It appears that most of the data is timely. There are only 4 months in which around 4% of the data was delivered after 2 days.
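To inspect those late records themselves, a variant of the check could list every row whose delay exceeds one full day. This is a sketch reusing the same CTE as above:

```sql
WITH data_table AS (
  SELECT station_id, name, average, day_of_data, create_time
  FROM `dqo-learning.data_quality.imgw_temp`
  WHERE name = 'SIEDLCE'
)
SELECT day_of_data,
       create_time,
       TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days
FROM data_table
WHERE TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) > 1
ORDER BY day_of_data
```

Such a listing is useful when debugging the pipeline, since it pinpoints exactly which days were affected.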
The overall timeliness is quite good: around 98.82%.
The whole query is below.
WITH data_table AS (
  SELECT station_id, name, average, day_of_data, create_time
  FROM `dqo-learning.data_quality.imgw_temp`
  WHERE name = 'SIEDLCE'
  ORDER BY name, day_of_data
),
dq_check AS (
  SELECT TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days,
         DATE_TRUNC(day_of_data, MONTH) AS month_of_data
  FROM data_table
)
SELECT SUM(CASE WHEN delay_in_days <= 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS value,
       month_of_data
FROM dq_check
GROUP BY month_of_data
ORDER BY month_of_data
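The overall timeliness figure can be computed with the same logic by dropping the monthly grouping — a sketch:

```sql
WITH data_table AS (
  SELECT station_id, name, average, day_of_data, create_time
  FROM `dqo-learning.data_quality.imgw_temp`
  WHERE name = 'SIEDLCE'
),
dq_check AS (
  SELECT TIMESTAMP_DIFF(create_time, CAST(day_of_data AS TIMESTAMP), DAY) AS delay_in_days
  FROM data_table
)
SELECT SUM(CASE WHEN delay_in_days <= 1 THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS overall_timeliness
FROM dq_check
```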
The analysis we performed is another example of a fundamental timeliness check. We showed how to create a check that verifies the share of data delivered within a certain time.
The weather data we used is one of many examples of why data delivery timeliness is important. To gather statistics such as the monthly average temperature, we have to be sure that the monthly data is complete; otherwise, such an analysis is misleading.
It is important to observe the data from every point of view. Timeliness is one of the metrics that helps achieve that.
For more tutorials, check out our blog.