One of the most important aspects of data quality is data timeliness.
It refers to the expected time of availability and accessibility of the data. Timeliness is a metric of major importance since out-of-date information can lead individuals to make poor decisions. As a result, companies lose time, money, and their reputation.
Of course, data timeliness has a lot of interpretations, it can answer many questions
- Is the data up-to-date?
- How fresh is the data?
- What is the most recent record in the table?
- Do we have the most recent data?
- Did we have data quality incidents related to timeliness recently?
- How long does it take to load data through our data pipeline?
- On which days of the week do we receive updated data from suppliers, business partners, or any other type of data providers?
- When did we load the data into the database?
- What is an average lag between records?
DQOps Data Quality Operations Center is an open-source data quality platform that supports all types of timeliness checks. DQOps both tracks the current data timeliness delays on dashboards and measures the data timeliness reliability by tracking timeliness KPI scores. Please refer to the DQOps documentation to learn how to start tracking timeliness for free.
Some timeliness-related questions refer to the current state of data. Simply asking if the data is up-to-date, and could be used for decision-making. Other questions need to apply continuous data quality monitoring, commonly referred to as data observability.
By checking the timeliness of data in regular intervals, such as daily, we can count the data quality timeliness issues raised when the data freshness exceeds the longest delay. If the timeliness issues are very rare, we can trust the data source and our data platform that the data used for analytics is always fresh and up-to-date.
The source of data timeliness issues
In order to fully understand the timeliness, we have to think about why data might not be timely. The delays can be caused by technical reasons, such as misconfiguration of the data platform. They are easy to fix, and fully under control of the data engineering group.
Other delays are caused by external parties, such as external vendors that are sharing data. Fixing those issues may require involving the legal department for renegotiating data sharing contracts.
In order to fully understand the timeliness, we should understand technical issues that cause data delays.
Well, data timeliness issues caused by technical issues depend on the architecture of the data platform.
Assume you have an ETL pipeline that loads data into your database. Many configuration issues may break the data loading process, skipping the most recent data.
- Credentials to the data source or the target data platform expired
- The source table has significantly grown since the data pipeline was set up, and the data loading process takes too long to finish
- The data pipeline jobs are failing due to a change to the data schema
- Reading, transforming, or loading bad records (values in a wrong format) breaks the data-loading process
Let’s analyze another issue. Imagine that you are using a scheduler like Airflow to upload data to your database at the end of each day. If we do not define a timeout for a specific task in DAG, or if we choose a timeout that is too short, the process may fail. It’s possible that one task will crash and the next won’t start. As a result, we don’t have data for today, which puts us in a bind.
There are many more examples of what could go wrong, but usually, the solution is the same in each situation: perform regular timeliness checks.
Data timeliness issues caused by external parties
We listed common sources of data timeliness issues when they are caused by the data platform that is under our control.
The world is highly interconnected, companies depend on other companies working together on a common goal. Manufacturers depend on suppliers, optimizing the manufacturing performance and efficiency across the whole supply chain process. Retailers receive data from suppliers. Marketers delegate campaigns to external marketing agencies.
The end-to-end process monitoring requires sharing data across collaborating business partners, not only delivering goods and services. For that purpose, collaborating companies decide to share data. When the data is shared between different parties, they play two types of roles in the process.
- The data publisher is the business party that provides data. The publisher follows the agreed data contract, making sure that the data is prepared and delivered on time. The publisher may face technical issues in its own data pipeline, delaying the data delivery.
- The data consumer is the business party that receives the data. Because data timeliness issues are cumulative, like a snowball, the consumer can be affected by two types of timeliness issues. Consumer’s own data pipeline can be affected by timeliness issues, or the data publisher did not provide the data on time. Those two root causes must be identified to avoid raising a complaint to the data publisher when the issue is caused internally.
Data delays caused by external parties may require handling with care. Especially when asking the data publisher to speed up the data delivery may require additional work that is costly.
We will discuss measuring data timeliness KPIs later in this article, showing how it is calculated on timelines dashboards in DQOps.
Data timeliness issues in a Data Mesh
The growing complexity of data platforms and the need to achieve agility in data analytics are often solved by decentralizing the data stack. Independent departments expose their data to other departments for consumption. The data-sharing contracts are loose, based on oral agreements. Even when issues in timely data delivery are detected, the department that is consuming the data can only escalate the incident up the management ladder to a common leader. For the interest of both the publishing and consuming parties, it is better if the timeliness issues are identified early, and both department heads discuss how to solve the problem in the most efficient way.
The agility in data delivery is a crucial competitive advantage for the company, but it causes issues. When two departments trust each other, the data publisher will release data before finishing extensive testing of the data delivery platform. Technical issues on the data publisher side will now flow downstream to the department consuming the data.
All technical issues on the way that caused failures in the data pipeline on the way will be visible as errors in logs on the publisher side. The department that is consuming the data will notice them after a delay, as data timeliness issues, when the new data is not received within an agreed time window.
What is a data timeliness issue?
The data timeliness issue is a technical issue on the data provider side, that was observed and detected by the data consumer when the new data was not received on time.
Source columns for data timeliness
Before we can dive into the definition of timeliness data quality checks, we must first understand how to decide if the data is up-to-date.
The only types of datasets for which timeliness could be measured are tables that contain some kind of business events that are identified at a point in time.
- Sales facts have a sales date
- Business transactions have a transaction timestamp
- Logs have the timestamp when the entry was generated
- Clickstreams have an impression timestamp
Before we can track the timeliness of each dataset, we must identify the column in the table that is filled with the event (or transaction) timestamp. It doesn’t matter what type of date or date/time type it is. For a DATE type, we will be limited to measuring delays at a whole-day scale, but that could be still acceptable for daily delivered data. We are referring to the event timestamp column as the event_timestamp column in the following article.
Another important column that is often added to a target table that is loaded is an inserted_at column, which is filled by the ETL engine with the current timestamp when the row was loaded. The exact name of the inserted_at column may differ because it depends on the naming convention applied in the data platform. The simplest way to fill the inserted_at during the data loading is to skip it in the column mapping. Instead, configure the default value as a call to the database engine’s function that captures the current system time.
Given that the monitored table has both the event_timestamp column and an inserted_at column, we can now show how to measure the timeliness, including the current system time returned by an SQL function NOW(), which is common in all database dialects.
Types of data timeliness measures
Once we have both the event_timestamp and the inserted_at columns in a table, we can calculate all types of timeliness measures as shown below. We will be using the MAX function to find both the timestamp of the most recent record by calculating MAX(event_timestamp). And the last time when the table was loaded, by calculating the MAX(inserted_at). The timeliness measures are just the difference between those two most recent timestamps and the current system time, captured by calling the NOW() SQL function.
The types of data timeliness measures are shown as the time ranges between timestamps.
The most important data timeliness measure is data freshness. It is the age of the most recent row in the monitored table.
Data freshness is measured by calculating the difference between the most recent event timestamp in a table to the current system time at the time of observing the freshness.
The SQL query that can calculate freshness used by DQOps on PostgreSQL database is shown below.
SELECT EXTRACT(EPOCH FROM ( CURRENT_TIMESTAMP - MAX((analyzed_table."event_timestamp")::TIMESTAMP) )) / 24.0 / 3600.0 AS actual_value FROM "your_postgresql_database"." "." " AS analyzed_table
The query calculates the time difference between the current system time and the most recent timestamp of the event timestamp column. The result is the number of days between now, and the most recent timestamp in the table.
Data staleness is a similar data timeliness check, but it measures the time since the target table was recently loaded. The data staleness also measures the time interval between now, and the most recent timestamp of writing the row to the table. The inserted_at column used in this check should be filled by the data pipeline at the time of loading rows.
Data staleness is useful when the data pipeline performs a full refresh of the target table or incremental reload of the last several days of data. Even if no new data was received from the data source and the data freshness is long, the short data staleness proves that the data pipeline is working and constantly reloading the target table.
The SQL query that can calculate data freshness used by DQOps on PostgreSQL database is shown below. It is the same as the query for calculating data freshness, except that a different column is used.
SELECT EXTRACT(EPOCH FROM ( CURRENT_TIMESTAMP - MAX((analyzed_table."inserted_at")::TIMESTAMP) )) / 24.0 / 3600.0 AS actual_value FROM "your_postgresql_database"." "." " AS analyzed_table
The query calculates the time difference between the current system time and the most recent timestamp of the last data load. The result is the number of days between now, and the time when the table was refreshed.
It is important to mention that not all data lakes and data warehouses are designed to store the most recent row insertion timestamp in a dedicated column, filled by the data transformation logic.
However, some database engines are returning pseudo-columns that are generated on the fly by the query engine.
- Google BigQuery has a pseudo-column named _PARTITIONTIME which is the partition date of ingestion-time partitioned tables. This fake column could be used for measuring the data ingestion up to an hourly precision.
- Databricks has a hidden column _metadata.file_modification_time that will return the last file modification timestamp that could be used to detect the time when a parquet file was uploaded.
The ingestion delay measures the time how long the data pipeline or the ETL process needed to pick the most recent data and load it into the target table.
The SQL query that can calculate the ingestion delay used by DQOps on PostgreSQL database is shown below. It is a mix of freshness and staleness queries, but it does not use the current system time. Only the most recent timestamp of the event timestamp column and the insertion column are compared.
SELECT EXTRACT(EPOCH FROM ( MAX((analyzed_table."inserted_at")::TIMESTAMP) - MAX((analyzed_table."event_timestamp")::TIMESTAMP) )) / 24.0 / 3600.0 AS actual_value FROM "your_postgresql_database"." "." " AS analyzed_table
The data ingestion delay is sensitive to data reloads. Especially, when the target data warehouse is rebuilt by fully reloading the source data. When the rows that were already loaded are deleted and loaded again, their insertion timestamp column will be different, giving false information that the data pipeline takes too long to load the data.
On the other hand, a long ingestion delay may tell us that the data was reloaded, which doesn’t need to be a bad thing.
Configuring data timeliness monitoring
DQOps supports configuring data timeliness monitoring checks both in a YAML file and using a user interface.
The YAML file can be stored in the source repository. This example of configuring a data freshness check provides the configuration of the column names in the monitored table that contain the event timestamp and the insertion timestamp column names. The name of the monitored table is not shown in the YAML file, because DQOps follows the principle of “configuration by convention”, storing the table name as the file name.
The daily_data_freshness monitoring check requires a threshold parameter to decide how long data freshness delay should trigger a data quality issue at a warning severity level.
apiVersion: dqo/v1 kind: table spec: timestamp_columns: event_timestamp_column: event_timestamp ingestion_timestamp_column: inserted_at monitoring_checks: daily: timeliness: daily_data_freshness: warning: max_days: 1.0
The data freshness can be also configured from the user interface, as shown on the next screenshot taken from DQOps. The green square before the name of the daily_data_freshness check shows the status that the data was 0.05 days old (around 1 hour) at the time of running the data freshness check.
The configuration uses two different alerting threshold configurations. Warning level issues are raised when the data is 1 day old, but an error severity data quality issue is raised for delays longer than 2 days.
The error severity configuration is important for a different reason. DQOps uses error severity issues for measuring the data quality KPI score of the table. The concept of measuring the trustfulness of data in regards to the data timeliness uses data quality KPIs. The data quality KPI is just a percentage of passed data freshness checks when the delay did not exceed the error level threshold.
Measuring data timeliness KPI score
As shown on the screenshot taken from DQOps above, the data timeliness KPI is a percentage of passed data timeliness checks for a period of time.
The data timeliness KPI is a useful metric for measuring the data timeliness data contract between the data publisher, such as an external vendor, supplier, or a department, and the data consumer that is measuring the compliance with the data delivery contract.
At best, the data timeliness KPI should be measured by both parties. The publisher should run data timeliness checks, measuring the data freshness on the landing zone where the data is exposed for consumption. In case of delays in the data delivery, the data publisher is able to prove that the data was provided on time, every day for the last week, month, or year.
The data consumer that is receiving the data and measuring the data freshness can detect if the source data is up to date, before trying to find the issue inside the consumer’s data pipeline that is transforming the data.
Quick reaction to data delays is also important. When data timeliness issues are detected, the data observability platform, such as DQOps, sends notifications to inform the data support team that the agreed data delivery delay was not met.
Data timeliness dashboards
The notifications raised when the data timeliness issues are detected allow only a reactive approach. The data timeliness incidents are raised, forwarded to the data support team, and fixed, without reviewing the root cause of the issue. Frequent data timeliness issues lead to a decrease in the trustfulness of the data source.
The trustfulness of data is based on proof that the data is always fresh, and the data timeliness issues were just rare incidents. When the business sponsor or a head of the data department asks how trustful is the data, a quantitative approach is required.
DQOps uses data quality dashboards for presenting both the current state of data timeliness and measuring the data timeliness KPI over time.
The list of most up-to-date tables is shown on the “Table freshness – tables with most current data” shown below.
The next dashboard shows how to calculate the data quality KPI. The data timeliness check that we just executed on the monitored source passed, so the data timeliness KPI is 100%.
DQOps provides also other types of data timeliness dashboards that can help in investigating data timeliness issues.
- Current timeliness issues show the current incidents when the data was not refreshed within an accepted time window
- History of timeliness issues allows us to investigate how often timeliness issues are detected.
- Table freshness – tables with the most current data dashboard shows the most up-to-date tables
- Table freshness – tables with the oldest data dashboard shows the tables that are not receiving new data and are probably outdated.
- Table staleness – tables most recently loaded dashboard shows when the data was loaded
- Table staleness – tables no longer loaded dashboard identifies tables that are not even refreshed and should be probably decommissioned
- History of table freshness dashboard shows a calendar for each table, showing the delay observed for each day
- Minimum, maximum, and average delay dashboard summarizes the statistics for all timeliness metrics, allowing for comparison of data timeliness between tables.
If the set of built-in data timeliness dashboards is not enough, DQOps supports creating custom data quality dashboards.
Data observability is data timeliness is the cornerstone of data observability. It tells us when data is collected when particular processes are performed and when data is available to use.
When dealing with a whole operational data management system in the real world, it is a starting point for troubleshooting when an error arises.
The data quality KPIs are the perfect way to apply data contracts for data timeliness, measuring the trustfulness and freshness of data. Both the data publisher and the data consumer can be sure that all delays in data delivery are detected early, and problems with data timeliness are analyzed to avoid new incidents in the future.
Check out the DQOps getting started guide to learn how to configure data timeliness checks. DQOps is an open-source platform, you can measure data quality right now.