Data quality relates to the reliability of a dataset.
To use the data quality management techniques in data development and implementation to ensure that it is appropriate for the organization's specific needs, the data requires to be of high quality and suitable for its intended purpose.
Modern companies collect data to make data-driven decisions: marketing, process improvements, or other purposes. There are comparatively rare cases of business-intended datasets that do not contain date information, for example, a list of products or a list of employees.
But typically in these cases, information about the date may be required as well (e.g. date of production, date of sale, date of employment)
Missing dates, especially in the case of data showing changes over time, can significantly impact the results of future analysis.
In this article, we will introduce you to the process of data quality management concerning finding missing dates in a dataset and explain in detail why it is important.
Table of Contents
Why do we care about missing dates in the database?
Let’s move on to the hands-on example.
The data we use in the example is a free dataset titled Online Retail II. And it contains all the transactions occurring for a UK-based and registered, non-store online retail between 01/12/2009 and 09/12/2011.
The company mainly sells unique all-occasion giftware. Many customers of the company are wholesalers. The dataset is acquired from here.
First rows of the table:
You can run a simple SELECT statement which returns data for May 2011 ordered by date of transaction like beneath:
SELECT * FROM `dqo-learning.data_quality.online_retail_II` WHERE InvoiceDate >= '2011-05-01' AND InvoiceDate < '2011-06-01' ORDER BY InvoiceDate
So far the data looks good, but scrolling down you may observe the following
As you see, there is a gap between May, 1st, and 3rd – the second day of May is missing. Because of this fact, you should find out whether any other days are misplaced as well.
We will cover the following problem: extract all dates from May 2011 when the retail does not make a single transaction.
Technologies we use
Business valued data are mostly stored in proper databases, such as BigQuery, SQL Server, Postgres, Oracle, MySQL, etc.
In this case study, we will use BigQuery – a serverless data warehouse hosted on the Google Cloud Platform, which nowadays is one of the most frequently used tools for dealing with big datasets, especially in the cloud environment. Check out our article, where we cover in detail how to add a database to BigQuery on Google Cloud Platform.
In BigQuery when we refer to a table, the syntax is as follows (keep in mind this part):
Missing days query example solution
Now we shall find missing dates in May 2011.
To keep the queries as readable as possible, let’s start with defining begin and end dates.
DECLARE beginDate DATE; DECLARE endDate DATE; SET beginDate = '2011-05-01'; SET endDate = '2011-06-01';
Secondly, create CTE (Common Table Expression). The newly generated table contains dates for each day of May 2011.
WITH generatedDays AS ( SELECT day FROM UNNEST( GENERATE_DATE_ARRAY(beginDate, endDate - 1, INTERVAL 1 DAY) ) AS day )
And here are the first 10 days of the generated table.
The next CTE view contains two columns, timestamps parsed to date, and a count of rows existing for those dates.
rawData AS ( SELECT DATE(FORMAT_TIMESTAMP("%Y-%m-%d", InvoiceDate)) AS dates, COUNT(*) AS cnt FROM `dqo-learning.data_quality.online_retail_II` GROUP BY dates )
Both tables need to be joined. All dates stored in the database will have information of rows in the cnt column.
SELECT generatedDays.day, cnt FROM generatedDays LEFT JOIN rawData r ON r.dates = generatedDays.day WHERE generatedDays.day >= beginDate AND generatedDays.day < endDate GROUP BY generatedDays.day, cnt ORDER BY generatedDays.day
The dates that were missing, will appear in the query result but cnt column will store NULL in the cell. So, important thing is to show only those dates where cnt column is NULL.
SELECT generatedDays.day, cnt FROM generatedDays LEFT JOIN rawData r ON r.dates = generatedDays.day WHERE generatedDays.day >= beginDate AND generatedDays.day < endDate AND cnt IS NULL GROUP BY generatedDays.day, cnt ORDER BY generatedDays.day
The above query parts give the final result shown below.
Finally, the whole SQL query used in our example looks like this.
DECLARE beginDate DATE; DECLARE endDate DATE; SET beginDate = '2011-05-01'; SET endDate = '2011-06-01'; WITH generatedDays AS ( SELECT day FROM UNNEST( GENERATE_DATE_ARRAY(beginDate, endDate - 1, INTERVAL 1 DAY) ) AS day ), rawData AS ( SELECT DATE(FORMAT_TIMESTAMP("%Y-%m-%d", InvoiceDate)) AS dates, COUNT(*) AS cnt FROM `dqo-learning.data_quality.online_retail_II` GROUP BY dates ) SELECT generatedDays.day, cnt FROM generatedDays LEFT JOIN rawData r ON r.dates = generatedDays.day WHERE generatedDays.day >= beginDate AND generatedDays.day < endDate AND cnt IS NULL GROUP BY generatedDays.day, cnt ORDER BY generatedDays.day
Things to watch out for
The data quality check described in this article is a simple case and it is designed for an easy, uniform dataset. Working with business data, we often have to deal with different date formatting. For example, the date 1st of October 2021 can be saved in the following formats:
- 01 Oct 2021
- 01 October 2021
- 1 October 2021
- Oct 01, 2021
As you can see, there are many different options and it’s not even a half of the sample formats. Sometimes people are free to complete the date field (STRING datatype) and as a result, we get a combination of all possible formats, with typos included. In practice, data quality queries should be adapted to operate on a specific type of data. Therefore, in the first parts of the code, there is a need to focus on date formatting, usually using the CONVERT() function or regular expressions.
At the end of this article, we would like to pay attention to one more issue. This data quality analysis lists the missing dates from the dataset. So, if you plan to analyze the data quality results, this outcome might be useless. In this case, it is not possible to compute even simple statistical quantities, like an average, mode, or standard deviation correctly.
To do this, you should have at least count of missing dates per month or per year. It depends on the business needs: why the company needs to check the data for missing dates, what kind of results are expected and what they will be used for.
Data quality checks that relate to date type of data are one of the most important ones, as they are most commonly used. We have discussed why databases have missing dates, an example of analysis on a dataset that has such gaps, and things to watch out for when dealing with similar data.
Understanding how to operate on datasets with date type of data is crucial to qualitative data observability.
For more tutorials, check out our blog.