The process of data quality analysis consists of many stages.
Assuming that the data collection process is already done, the following steps can be distinguished:
- design, build and integrate data from various sources,
- create and implement programs or queries to check data quality,
- store the results of data quality checks,
- prepare and manage dashboards showing data quality results,
- analyze the results.
In this article, we focus on the third step - how to store the results of data quality checks.
You will learn how to create a dataset and table in BigQuery (using data definition language with the built-in editor, the Cloud Console, and Cloud Shell), how to store data quality results, and why it is important.
To keep things simple, the data quality check we cover is one of the simpler cases: a NULL count for each day in the dataset.
The importance of proper results storage
As you can see in the list above, the data quality analysis process does not end with just checking the data quality. Throughout the whole process it is important to collect the results, create dashboards, and observe the continuity of those results.
The connection with the data source may break for some reason, or the data in the analyzed table may be lost for several hours. If the table is queried for data quality during that time, we will receive an error message. Such a single event does not tell us much: the data source might still be at the implementation stage, or it may turn out that this was the only such situation within a few months.
If we do not have historical data quality results for a specific table, we will not be able to see and understand the problems we are facing.
Analyzing the results gives us a broader view of the quality of the data. Therefore, we should collect enough data to perform the analysis and to show changes in data quality over time in the dashboards.
Technologies we use
In the following example, we will use Google Cloud BigQuery, a serverless, highly scalable data warehouse. BigQuery is designed to handle petabyte-scale data, so you can quickly load your data, run interactive queries, run batch queries, and create virtual tables from your data. These and other advantages of BigQuery attract the attention of corporations with large volumes of data, and the same companies also need to check the quality of that data. BigQuery works well for storing and querying data quality results, too.
Therefore, even though we use a very small dataset (46 MB), we highly recommend using Google Cloud BigQuery for cloud projects.
The dataset
In the example, we will use data from a free dataset, acquired from here.
This dataset contains all the transactions of a UK-based and registered, non-store online retailer over a two-year period.
First rows of the table:
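If you have already loaded the dataset into BigQuery, you can preview its first rows with a simple query (a quick sketch, assuming the table name `dqo-learning.data_quality.online_retail_II` used later in this article):
SELECT *
FROM `dqo-learning.data_quality.online_retail_II`
LIMIT 10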
Preparation of DQ check results
In our table we will need the following pieces of information:
- Uuid: a unique identifier of the row.
- Value: the result value of each test (data quality check).
- Tested resource: the name of the tested table or column.
- Test name: the name of the test, which makes it possible to understand the received values.
- Created time: the time when the row was created in the table.
- Day of tested data: the day the tested data refers to.
The table can be created using an SQL statement.
In BigQuery, when we refer to a table, the syntax is as follows (keep this format in mind):
`[project_name].[dataset_name].[table_name]`
In our case the DDL is:
CREATE TABLE `dqo-learning.data_quality.dq_check_results` (
  `uuid` STRING NOT NULL,
  `tested_resource` STRING NOT NULL,
  `test_name` STRING NOT NULL,
  `value` BIGNUMERIC NOT NULL,
  `created_time` TIMESTAMP NOT NULL,
  `day_of_tested_data` DATE NOT NULL
);
The table can be created using Google Console’s BigQuery UI as well.
Use the drop-down menu, navigate to the BIG DATA section, and click on BigQuery. We highly recommend pinning the most frequently used services.
On the left-hand side of the UI, you will find the name of your current project. Click Create dataset if you have not done so already.
The Project ID is automatically filled in. In the Dataset ID field, name your dataset and select the data location. Keep in mind that the location choice affects latency. There are more options that can be configured, but for this tutorial you can leave the defaults. Click CREATE DATASET.
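If you prefer DDL over the UI, the dataset can also be created with a CREATE SCHEMA statement (a minimal sketch, assuming the project and dataset names used in this article; the location is an example and should match your own choice):
CREATE SCHEMA `dqo-learning.data_quality`
OPTIONS (
  location = "EU"  -- example location, pick the one appropriate for you
);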
Once the dataset is successfully created (it will appear under the project name on the left-hand side), you can start adding data. Click Create table.
Now you have to name your table, and specify the data source or select an empty table.
You can see all of the currently available options below. Select where you want to get the data from. In this case, we would like to create an empty table and insert data using an SQL statement (this enables us to select specific data from other tables). To do that, select Empty table in the Source field.
In our case, we have to specify the schema (because of the INSERT statement construction, see the Insertion Preparation section).
Click Add field under the Schema section.
Add and fill all of the necessary fields.
Notice the difference in terminology: here, in the Mode field, we selected REQUIRED, which corresponds to the NOT NULL constraint.
Another way to define the schema is to use the JSON format. Enable Edit as text and define the schema as below.
[
  {
    "mode": "REQUIRED",
    "name": "uuid",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "tested_resource",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "test_name",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "value",
    "type": "BIGNUMERIC"
  },
  {
    "mode": "REQUIRED",
    "name": "created_time",
    "type": "TIMESTAMP"
  },
  {
    "mode": "REQUIRED",
    "name": "day_of_tested_data",
    "type": "DATE"
  }
]
Click Create table. If everything was set up correctly, a new table should appear under the database on the left panel.
The preferred method uses Google Cloud Storage and Cloud Shell, and we recommend saving the schemas in a GCS bucket. In our case, the bucket (dqo-learning-bigquery) has the following structure.
dqo-learning
│
└───dqo-learning-bigquery
    │
    └───schema
            schema.json
First of all, BigQuery cannot use a schema directly from a bucket. We have to copy the JSON file to the local machine (here, Cloud Shell), from which we can pass it to BigQuery.
The following command copies the JSON schema file from the bucket and uses it in the table creation command.
gsutil cp gs://dqo-learning-bigquery/schema/schema.json schema.json && bq mk --table data_quality.dq_check_results ./schema.json
Run the following command to display the information about the table.
bq show data_quality.dq_check_results
The output confirms that the table was created properly.
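The table schema can also be verified with a query against INFORMATION_SCHEMA (assuming the dataset and table names used above):
SELECT column_name, data_type, is_nullable
FROM `dqo-learning.data_quality.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'dq_check_results'
ORDER BY ordinal_position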
Implementing DQ check (null count)
The query below counts the NULL values in the Customer_ID column for each day in the dataset:
SELECT SUM(
    CASE
      WHEN Customer_ID IS NULL THEN 1
      ELSE 0
    END
  ) AS value,
  DATE(FORMAT_TIMESTAMP("%Y-%m-%d", InvoiceDate)) AS day_of_tested_data
FROM `dqo-learning.data_quality.online_retail_II`
GROUP BY DATE(FORMAT_TIMESTAMP("%Y-%m-%d", InvoiceDate))
A few rows returned from the data quality test:
Improving readability
Let’s improve code readability by using a CTE (Common Table Expression), without changing the logic of the query.
The query below prepares (extracts) only the necessary pieces of information from the table:
WITH data_preparation AS (
SELECT Customer_ID AS tested_column,
DATE(FORMAT_TIMESTAMP("%Y-%m-%d", InvoiceDate)) AS day_of_tested_data
FROM `dqo-learning.data_quality.online_retail_II`
)
The data quality check is implemented in the next part:
SELECT SUM(
CASE
WHEN tested_column IS NULL THEN 1
ELSE 0
END
) AS value,
day_of_tested_data
FROM data_preparation
GROUP BY day_of_tested_data
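Putting the two parts together, the readable version can be run as a single query:
WITH data_preparation AS (
SELECT Customer_ID AS tested_column,
DATE(FORMAT_TIMESTAMP("%Y-%m-%d", InvoiceDate)) AS day_of_tested_data
FROM `dqo-learning.data_quality.online_retail_II`
)
SELECT SUM(
CASE
WHEN tested_column IS NULL THEN 1
ELSE 0
END
) AS value,
day_of_tested_data
FROM data_preparation
GROUP BY day_of_tested_data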
Insertion Preparation
To prepare the insertion, the check query above is wrapped in a second CTE:
dq_check AS (
SELECT SUM(
CASE
WHEN tested_column IS NULL THEN 1
ELSE 0
END
) AS value,
day_of_tested_data
FROM data_preparation
GROUP BY day_of_tested_data
)
This SELECT statement returns all of the necessary columns.
SELECT GENERATE_UUID() AS uuid,
'dqo-learning.data_quality.online_retail_II' AS tested_resource,
'null_count' AS test_name,
value,
CURRENT_TIMESTAMP() AS created_time,
day_of_tested_data
FROM dq_check
The result is:
Finally, by adding an INSERT INTO clause at the beginning of the SQL query, the insertion is ready to be executed.
The complete SQL query, in its full glory, with the INSERT statement is written below:
INSERT INTO `dqo-learning.data_quality.dq_check_results` (
uuid,
tested_resource,
test_name,
value,
created_time,
day_of_tested_data
) WITH data_preparation AS (
SELECT Customer_ID AS tested_column,
DATE(FORMAT_TIMESTAMP("%Y-%m-%d", InvoiceDate)) AS day_of_tested_data
FROM `dqo-learning.data_quality.online_retail_II`
),
dq_check AS (
SELECT SUM(
CASE
WHEN tested_column IS NULL THEN 1
ELSE 0
END
) AS value,
day_of_tested_data
FROM data_preparation
GROUP BY day_of_tested_data
)
SELECT GENERATE_UUID() AS uuid,
'dqo-learning.data_quality.online_retail_II' AS tested_resource,
'null_count' AS test_name,
value,
CURRENT_TIMESTAMP() AS created_time,
day_of_tested_data
FROM dq_check
When executed, information similar to the following should appear:
From now on, it is possible to query the stored results. Let’s go ahead and do that:
SELECT *
FROM `dqo-learning.data_quality.dq_check_results` AS r
WHERE r.day_of_tested_data >= '2011-02-01'
AND r.day_of_tested_data < '2011-03-01'
ORDER BY r.day_of_tested_data
What to do next?
The result table is a basic one, but it has sufficient data for a simple analysis, e.g. calculating the average and the standard deviation of the values from one month, as sketched below.
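A minimal sketch of such an analysis over the results table (assuming the test name 'null_count' stored above; the date range is an example):
SELECT AVG(value) AS avg_null_count,
STDDEV(value) AS stddev_null_count
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name = 'null_count'
AND day_of_tested_data >= '2011-02-01'
AND day_of_tested_data < '2011-03-01'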
Having the results, the next logical step would be to visualize them using BI tools, such as Data Studio, PowerBI, or Tableau. After that, you could design a constant cycle of performing data quality checks, storing the result, and visualizing them.
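For example, a simple view over the results table (a sketch; the view name is our own choice) gives BI tools a stable source to plot from:
CREATE OR REPLACE VIEW `dqo-learning.data_quality.null_count_daily` AS
SELECT day_of_tested_data,
value AS null_count
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name = 'null_count'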
Conclusion
In conclusion, storing data quality check results is as important as performing the data quality analysis itself. It binds together the processes of gathering and analyzing data. Appropriate management of stored results is crucial if the data is to support advantageous decision-making, and it is the entry point for further analysis.
For more tutorials, check out our blog.