How does NULL data impact timeliness?

We normally maintain a record of the date the data was acquired in the data set.

When it comes to data timeliness, it is important to work on data that is relevant and valid.

Imagine you are running an ETL process. There is a variety of problems you may face: a table being renamed or deleted entirely, a lost connection during data loading, or an insertion that never runs because the job is stuck in a queue. Any of these issues may lead to NULL records being inserted.

As a result, suitable tests must be performed prior to executing timeliness checks.

In the previous article, we presented how to perform data timeliness checks. That dataset turned out to be timely and did not contain any missing values.

This time we would like to walk you through an example in which the data set contains NULL values and invalid data types.

We will describe the impact on the timeliness dimension when data is affected by null rows and present an exemplary solution using BigQuery.

NULL records impact

Placeholder rows change the scores and the overall analysis metrics such as timeliness. It can happen that whenever data is due, a new row is inserted into the dataset even though no actual data was uploaded. If we then look for delays between records, we will not find any gap, simply because of how such a system works.

The process of acquiring data from, for example, temperature sensors may fail a step before data insertion, or a single sensor might simply crash. Both cases would be handled by inserting a NULL record with a proper timestamp.
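As a purely illustrative sketch, such a fallback step could insert a placeholder row like the one below. The statement assumes the Date and Time columns are typed as DATE and TIME (how BigQuery detected them for this dataset, as discussed later), and any measurement column that is not listed simply defaults to NULL.

-- Hypothetical placeholder row written when a sensor read fails (illustrative only);
-- every measurement column left out of the column list ends up as NULL
INSERT INTO `dqo-learning.data_quality.household_power_consumption` (Date, Time)
VALUES (DATE "2010-11-26", TIME "21:02:00")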

To be sure we can rely on the table’s data, we have to examine it with suitable methods.

Problem statement

This time we are dealing with a dataset that contains invalid rows: NULLs and unexpected values.

We will perform a quality check that excludes the invalid records from the dataset and checks timeliness on the valid data.

The solution consists of 3 steps:

  • data cleaning – getting rid of corrupted data,
  • preparation – preparing the data we would like to run the queries on,
  • extraction – running a query to extract all of the necessary information.

Technologies we use

In this example, we will use BigQuery – a serverless data warehouse hosted on the Google Cloud Platform. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform.

In BigQuery, when we refer to a table, the syntax is as follows (keep this in mind for the queries below):

`[project_name].[dataset_name].[table_name]`.
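For example, a simple query against a hypothetical table (the project, dataset, and table names below are placeholders) looks like this:

-- Placeholder names; substitute your own project, dataset, and table
SELECT *
FROM `my-project.my_dataset.my_table`
LIMIT 10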

The dataset

The data set can be obtained from here, where you can also find its details.

This archive contains measurements gathered between December 2006 and November 2010 (47 months).

All timestamps are present in the dataset but for some of them, the measurement values are missing.

The solution

Let’s have a look at the data set.

SELECT *
FROM `dqo-learning.data_quality.household_power_consumption`
ORDER BY Date, Time

Run the following query to find the NULLs in the table.

SELECT *
FROM `dqo-learning.data_quality.household_power_consumption`
WHERE Date IS NULL
	OR Time IS NULL
	OR Global_active_power IS NULL
	OR Global_reactive_power IS NULL
	OR Voltage IS NULL
	OR Global_intensity IS NULL
	OR Sub_metering_1 IS NULL
	OR Sub_metering_2 IS NULL
	OR Sub_metering_3 IS NULL

Cells containing question marks can be treated as NULL as well, since they are not numerical values. This means we have also detected a problem concerning data validity.
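To get an idea of how widespread the question marks are, a quick per-column count helps. This is just a sketch covering a few columns, and it relies on those columns having been loaded as strings (more on that below):

-- Count the "?" placeholders in a few of the measurement columns
SELECT COUNTIF(Global_active_power = "?") AS qm_global_active_power,
	COUNTIF(Voltage = "?") AS qm_voltage,
	COUNTIF(Sub_metering_1 = "?") AS qm_sub_metering_1
FROM `dqo-learning.data_quality.household_power_consumption`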

Now that we know the invalid records, we can select the ones that are valid and perform a timeliness test on them.

First things first, let’s prepare the data we will be working on. We will select all of the recorded timestamps. A good idea is to merge Date and Time into one column: the distances between two records, measured in units of time (e.g. minutes), are then much easier to calculate. Otherwise, gaps that shift both the time and the day would be harder to deal with.
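As a minimal illustration (with literal values rather than rows from the table), a single merged timestamp lets TIMESTAMP_DIFF measure a gap that crosses midnight in one call:

-- 23:58 on one day to 00:03 on the next day is a 5-minute gap
SELECT TIMESTAMP_DIFF(TIMESTAMP("2006-12-17T00:03:00"),
	TIMESTAMP("2006-12-16T23:58:00"),
	MINUTE) AS gap_in_minutes
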
We begin by filtering out all of the useless records in the table (the datetimes themselves contain no gaps).

WITH raw_data AS (
    SELECT TIMESTAMP(CONCAT(Date, "T", Time)) AS last_updated
    FROM `dqo-learning.data_quality.household_power_consumption`
    WHERE Date IS NOT NULL
        AND Time IS NOT NULL
        AND Global_active_power      IS NOT NULL AND Global_active_power   != "?"
        AND Global_reactive_power    IS NOT NULL AND Global_reactive_power != "?"
        AND Voltage                  IS NOT NULL AND Voltage               != "?"
        AND Global_intensity         IS NOT NULL AND Global_intensity      != "?"
        AND Sub_metering_1           IS NOT NULL AND Sub_metering_1        != "?"
        AND Sub_metering_2           IS NOT NULL AND Sub_metering_2        != "?"
        AND Sub_metering_3           IS NOT NULL
)

You may think the above query is incomplete because it is missing the last comparison to the “?” symbol.

When importing a dataset into BigQuery, the schema is detected automatically. All columns except Sub_metering_3, Time, and Date were set to the STRING type because of the question marks in the rows. The last one, Sub_metering_3, was detected as a numeric type, since it only contains floats or NULLs and never “?”.

In other words, if we added Sub_metering_3 != “?”, the query would fail with an error caused by the inability to compare a numeric value to a string.
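If you want to double-check which types BigQuery actually detected, one option (a sketch using the dataset from this article) is to query the dataset’s INFORMATION_SCHEMA.COLUMNS view:

-- List the detected column types of the table
SELECT column_name, data_type
FROM `dqo-learning`.data_quality.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = "household_power_consumption"
ORDER BY ordinal_position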

Prepare the data for the quality check

dq_check_preparation AS (
	SELECT last_updated AS current_timesamp,
		LAG(last_updated) OVER (
			ORDER BY last_updated ASC
		) AS preceding_timesamp,
		DATE(FORMAT_TIMESTAMP("%Y-%m-%d", last_updated)) AS day_of_tested_data
	FROM raw_data
)

Using the previous queries, we are ready to show the gaps, in minutes, in the meaningful data wherever the delay is greater than one minute.

SELECT TIMESTAMP_DIFF(current_timesamp, preceding_timesamp, MINUTE) AS delay,
	day_of_tested_data,
	current_timesamp,
	preceding_timesamp
FROM dq_check_preparation
WHERE preceding_timesamp IS NOT NULL
	AND TIMESTAMP_DIFF(current_timesamp, preceding_timesamp, MINUTE) > 1
ORDER BY preceding_timesamp

To recap, the whole check works as follows:

  1. Data cleaning. Create a CTE (Common Table Expression) raw_data from the original table which:
    • Concatenates the date and the time into the column last_updated.
    • Selects all of the records that are neither NULL nor a question mark.
  2. Data preparation. Create a CTE dq_check_preparation from raw_data which:
    • Includes the column last_updated, aliased as current_timesamp. This is the current timestamp.
    • Creates the column preceding_timesamp which, as the name suggests, holds the previous timestamp (ordered by date) relative to the current one.
    • Extracts the date from last_updated, indicating the day of the data being tested.
  3. Data extraction. Run a query that returns all the necessary data:
    • Include the delay between neighboring timestamps.
    • Exclude the record with a NULL preceding timestamp (there is no timestamp before the first one recorded).
    • Filter the records where the interval between timestamps is greater than a certain period, in this case 1 minute.

The complete query:

WITH raw_data AS (
    SELECT TIMESTAMP(CONCAT(Date, "T", Time)) AS last_updated
    FROM `dqo-learning.data_quality.household_power_consumption`
    WHERE Date IS NOT NULL
        AND Time IS NOT NULL
        AND Global_active_power      IS NOT NULL AND Global_active_power   != "?"
        AND Global_reactive_power    IS NOT NULL AND Global_reactive_power != "?"
        AND Voltage                  IS NOT NULL AND Voltage               != "?"
        AND Global_intensity         IS NOT NULL AND Global_intensity      != "?"
        AND Sub_metering_1           IS NOT NULL AND Sub_metering_1        != "?"
        AND Sub_metering_2           IS NOT NULL AND Sub_metering_2        != "?"
        AND Sub_metering_3           IS NOT NULL
),
dq_check_preparation AS (
    SELECT last_updated AS current_timesamp,
        LAG(last_updated) OVER (ORDER BY last_updated ASC) AS preceding_timesamp,
        DATE(FORMAT_TIMESTAMP("%Y-%m-%d", last_updated)) AS day_of_tested_data
    FROM raw_data
)
SELECT TIMESTAMP_DIFF(current_timesamp, preceding_timesamp, MINUTE) AS delay,
    day_of_tested_data,
    current_timesamp,
    preceding_timesamp
FROM dq_check_preparation
WHERE preceding_timesamp IS NOT NULL
    AND TIMESTAMP_DIFF(current_timesamp, preceding_timesamp, MINUTE) > 1
ORDER BY preceding_timesamp

The delay values in the results above vary greatly. A gap of two or three minutes should not affect the usefulness of the data; it may be caused by a brief loss of connectivity or another temporary, manageable failure. In the worst case, the system lost information about the state of electric power consumption for a little more than two and a half days, which explains the largest value in the results (row number seven).
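If you only need the headline numbers rather than every single gap, a short aggregate (a sketch that reuses the raw_data and dq_check_preparation CTEs defined above) summarizes the delays:

-- Summarize the gaps instead of listing them one by one
SELECT MAX(TIMESTAMP_DIFF(current_timesamp, preceding_timesamp, MINUTE)) AS max_delay_minutes,
	AVG(TIMESTAMP_DIFF(current_timesamp, preceding_timesamp, MINUTE)) AS avg_delay_minutes,
	COUNTIF(TIMESTAMP_DIFF(current_timesamp, preceding_timesamp, MINUTE) > 1) AS gaps_over_one_minute
FROM dq_check_preparation
WHERE preceding_timesamp IS NOT NULL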

If no maintenance work was planned for the periods where these gaps occur, they point to system failures worth keeping track of.

Conclusion

Generally speaking, datasets are not perfect. They usually contain some missing values or invalid data types, so before running any data quality checks it is important to prepare the data.

This article covers the case when NULLs and invalid values are present in the dataset. The solution shows how to deal with this and run a timeliness check on the valid data.

In the case of timeliness, data preparation is crucial, since checking whether irrelevant data was delivered on time does not make much sense.

For more tutorials check out our blog.

Do you want to learn more about Data Quality?

Subscribe to our newsletter and learn the best data quality practices.
