We normally keep a record of the date on which the data in a data set was acquired.
When it comes to data timeliness, it is important to work on data that is relevant and valid.
Imagine you are running an ETL process. There are a variety of problems you may face: a table name change or a complete table deletion, a lost connection during data loading, or an insertion that never runs because the job is stuck in a queue. Any of these issues may result in NULL records being inserted.
As a result, suitable tests must be performed prior to executing timeliness checks.
In the previous article, we showed how to deal with data timeliness checks. It turned out that the dataset was timely and did not contain any missing values.
This time we would like to walk you through an example where the data set contains NULL values and invalid data types.
We will describe how NULL rows affect the timeliness dimension and present an example solution using BigQuery.
NULL records impact
Placeholder rows distort scores and overall analysis metrics such as timeliness. In some systems, whenever data is due to be delivered, a new row is inserted into the dataset even if no actual data was uploaded. If we then look at the delays between records, we will not find any gap, because of this specific behavior of the system.
The process of data acquisition from, for example, temperature sensors may fail a step before data insertion, or a single sensor might simply crash. Both cases would typically be handled by inserting a NULL value with the proper timestamp.
To be sure we can rely on the table’s data, we have to examine it with suitable methods.
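For illustration, a failed measurement typically leaves behind a row like the one inserted below (the project, table, and column names here are hypothetical):
INSERT INTO `my-project.sensors.temperature_readings` (measured_at, temperature)
VALUES (CURRENT_TIMESTAMP(), NULL) -- the timestamp is recorded, but the measurement is not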
Problem statement
This time we are dealing with a dataset that contains invalid rows – NULLs and unexpected values.
We will perform a quality check that excludes the invalid records from the dataset and checks timeliness on the valid data.
The solution consists of 3 steps:
- data cleaning – getting rid of corrupted data,
- preparation – preparing the data we would like to run the queries on,
- extraction – running a query to extract all of the necessary information.
Technologies we use
In this example, we will use BigQuery – a serverless data warehouse hosted on the Google Cloud Platform. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform.
In BigQuery, when we refer to a table, the syntax is as follows (keep this in mind):
`[project_name].[dataset_name].[table_name]`.
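For example, a minimal query against a table referenced this way (with placeholder names) could look like this:
SELECT COUNT(*) AS row_count
FROM `my-project.my_dataset.my_table`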
The dataset
The data set, together with its detailed description, can be obtained here.
This archive contains measurements gathered between December 2006 and November 2010 (47 months).
All timestamps are present in the dataset, but for some of them the measurement values are missing.
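To get a rough idea of how many rows are affected, we can count the incomplete records. Below is a minimal sketch that uses the Global_active_power column as a proxy for a missing measurement (as we will see shortly, missing values appear either as NULLs or as question marks):
SELECT COUNT(*) AS total_rows,
COUNTIF(Global_active_power IS NULL OR Global_active_power = "?") AS incomplete_rows
FROM `dqo-learning.data_quality.household_power_consumption`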
The solution
Let’s have a look at the data set.
SELECT *
FROM `dqo-learning.data_quality.household_power_consumption`
ORDER BY Date, Time
Run the following query to find the NULLs in the table.
SELECT *
FROM `dqo-learning.data_quality.household_power_consumption`
WHERE Date IS NULL
OR Time IS NULL
OR Global_active_power IS NULL
OR Global_reactive_power IS NULL
OR Voltage IS NULL
OR Global_intensity IS NULL
OR Sub_metering_1 IS NULL
OR Sub_metering_2 IS NULL
OR Sub_metering_3 IS NULL
Cells containing question marks can also be treated as missing values, since they are not valid numerical values. Besides the NULLs, we have therefore also detected a problem concerning data validity.
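Since the question marks are stored as text rather than as NULLs, a sketch of a query that lists every invalid row, NULL or question mark alike, could mirror the filters used in the cleaning step later on:
SELECT *
FROM `dqo-learning.data_quality.household_power_consumption`
WHERE Date IS NULL
OR Time IS NULL
OR Global_active_power IS NULL OR Global_active_power = "?"
OR Global_reactive_power IS NULL OR Global_reactive_power = "?"
OR Voltage IS NULL OR Voltage = "?"
OR Global_intensity IS NULL OR Global_intensity = "?"
OR Sub_metering_1 IS NULL OR Sub_metering_1 = "?"
OR Sub_metering_2 IS NULL OR Sub_metering_2 = "?"
OR Sub_metering_3 IS NULL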
Now that we know the invalid records, we can select the ones that are valid and perform a timeliness test on them.
First things first, let’s prepare the data we will be working on. We will select all of the recorded timestamps. It is a good idea to merge Date and Time into one column: the distances between two records, measured in units of time (e.g. minutes), are then easier to calculate. Otherwise, shifts across both the time and the date would be harder to deal with.
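For instance, assuming the Date and Time columns are stored as strings in YYYY-MM-DD and HH:MM:SS form (which is what the queries below expect), two merged timestamps can be compared directly in minutes:
SELECT TIMESTAMP_DIFF(
TIMESTAMP(CONCAT("2006-12-17", "T", "08:24:00")),
TIMESTAMP(CONCAT("2006-12-16", "T", "17:24:00")),
MINUTE) AS delay_in_minutes -- 900 minutes, i.e. 15 hours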
We begin by filtering out all of the invalid data in the table (remember that there are no gaps in the timestamps themselves).
WITH raw_data AS (
SELECT TIMESTAMP(CONCAT(Date, "T", Time)) AS last_updated
FROM `dqo-learning.data_quality.household_power_consumption`
WHERE Date IS NOT NULL
AND Time IS NOT NULL
AND Global_active_power IS NOT NULL AND Global_active_power != "?"
AND Global_reactive_power IS NOT NULL AND Global_reactive_power != "?"
AND Voltage IS NOT NULL AND Voltage != "?"
AND Global_intensity IS NOT NULL AND Global_intensity != "?"
AND Sub_metering_1 IS NOT NULL AND Sub_metering_1 != "?"
AND Sub_metering_2 IS NOT NULL AND Sub_metering_2 != "?"
AND Sub_metering_3 IS NOT NULL
)
,dq_check_preparation AS (
SELECT last_updated AS current_record_timestamp,
LAG(last_updated) OVER (
ORDER BY last_updated ASC
) AS preceding_timestamp,
DATE(last_updated) AS day_of_tested_data
FROM raw_data
)
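Note that a WITH clause on its own is not a runnable statement. While building the check, you can temporarily close it with a simple preview, for example:
SELECT *
FROM dq_check_preparation
LIMIT 10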
Using the previous queries, we are ready to show the gaps, in minutes, in the meaningful data wherever the delay is greater than one minute.
SELECT TIMESTAMP_DIFF(current_record_timestamp, preceding_timestamp, MINUTE) AS delay,
day_of_tested_data,
current_record_timestamp,
preceding_timestamp
FROM dq_check_preparation
WHERE preceding_timestamp IS NOT NULL
AND TIMESTAMP_DIFF(current_record_timestamp, preceding_timestamp, MINUTE) > 1
ORDER BY preceding_timestamp
To sum up, the solution works as follows:
- Data cleaning. Create a CTE (Common Table Expression) called raw_data from the original table, which:
- Concatenates the date and the time into the column last_updated.
- Keeps only the records that contain neither NULLs nor question marks.
- Data preparation. Create the CTE dq_check_preparation from raw_data:
- Include the column last_updated, aliased as current_record_timestamp. This is the current timestamp of each record.
- Create the column preceding_timestamp which, as the name suggests, holds the previous timestamp (when ordered by date) relative to the current one.
- Extract the date from last_updated, which indicates the day of the data being tested.
- Data extraction. Run a query that returns all the necessary data.
- Include the delay between neighboring timestamps.
- Exclude the NULL timestamp (there is no timestamp before the first one recorded).
- Filter the records where the interval between timestamps is greater than a certain period, in this case, 1 minute.
The complete query looks as follows:
WITH raw_data AS (
SELECT TIMESTAMP(CONCAT(Date, "T", Time)) AS last_updated
FROM `dqo-learning.data_quality.household_power_consumption`
WHERE Date IS NOT NULL
AND Time IS NOT NULL
AND Global_active_power IS NOT NULL AND Global_active_power != "?"
AND Global_reactive_power IS NOT NULL AND Global_reactive_power != "?"
AND Voltage IS NOT NULL AND Voltage != "?"
AND Global_intensity IS NOT NULL AND Global_intensity != "?"
AND Sub_metering_1 IS NOT NULL AND Sub_metering_1 != "?"
AND Sub_metering_2 IS NOT NULL AND Sub_metering_2 != "?"
AND Sub_metering_3 IS NOT NULL
)
,dq_check_preparation AS (
SELECT last_updated AS current_record_timestamp,
LAG(last_updated) OVER (ORDER BY last_updated ASC) AS preceding_timestamp,
DATE(last_updated) AS day_of_tested_data
FROM raw_data
)
SELECT
TIMESTAMP_DIFF(current_record_timestamp, preceding_timestamp, MINUTE) AS delay,
day_of_tested_data,
current_record_timestamp,
preceding_timestamp
FROM dq_check_preparation
WHERE preceding_timestamp IS NOT NULL
AND TIMESTAMP_DIFF(current_record_timestamp, preceding_timestamp, MINUTE) > 1
ORDER BY preceding_timestamp
The delay values in the resulting table vary greatly. A gap of two or three minutes should not affect the usefulness of the data; it may be caused by a brief loss of connectivity or another temporary, manageable failure. In the worst case, however, the system lost information about the state of electric power consumption for a little more than two and a half days, which explains the large value in row number seven.
If no maintenance work was planned for the period covered by such a gap, it would point to a system failure worth keeping track of.
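If you would rather track a single timeliness metric per day instead of inspecting individual gaps, the same raw_data and dq_check_preparation CTEs can feed a per-day aggregation. A sketch of such a variant:
SELECT day_of_tested_data,
MAX(TIMESTAMP_DIFF(current_record_timestamp, preceding_timestamp, MINUTE)) AS max_delay_in_minutes
FROM dq_check_preparation
WHERE preceding_timestamp IS NOT NULL
GROUP BY day_of_tested_data
ORDER BY max_delay_in_minutes DESC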
Conclusion
In conclusion, datasets are generally not perfect. They usually contain some missing values or invalid data types. Before running any data quality checks, it is important to prepare the data.
This article covers the case where NULLs and invalid values are present in the dataset. The solution shows how to deal with them and how to run a timeliness check on the valid data.
In the case of timeliness, data preparation is crucial, since checking whether irrelevant data was delivered on time does not make much sense.
For more tutorials check out our blog.