Developing and implementing programs or queries that check data quality is not the only part of the data quality analysis process.
To begin, we need to design, build, and integrate data from a variety of sources, frequently in incompatible environments.
Then we can use tools to perform the data quality checks and save the results. The next stage is to create and manage dashboards that display the data quality findings so that we can study them.
Assume you're gathering information from many sources.
Each of those sources is derived from data quality checks carried out on a large number of tables. The database may be updated with new records numerous times an hour, and the amount of data can add up to gigabytes or terabytes. At this scale, it is impractical to review the results as raw tables.
To adequately analyze the results, we can use quality checks and commonly available visualization tools. Dashboards are extremely useful for managing the data quality analysis process. We can use them to examine the results and ensure that the entire process is running successfully (including errors in the application and so on). In this article, we’ll look at how to visualize data quality results.
There are several steps in the data quality analysis process. Assuming that the data collection process has already been completed, they are as follows:
- design, build and integrate data from various sources,
- create and implement programs or queries to check data quality,
- store the results of data quality checks,
- prepare and manage dashboards showing data quality results,
- analyze the results.
In the previous article, we described why and how to store data quality results. Now we move on to the next step and use the stored data to build a data quality dashboard.
Technologies we use
The results of the analysis are stored in Google Cloud BigQuery, so the natural choice of visualization tool is Data Studio.
Data Studio is a widely used, free visualization tool that enables users to create dashboards. Data can be obtained from many sources thanks to support for services such as BigQuery, Redshift, Google Analytics, PostgreSQL, Google Sheets, and many more.
The dataset
In the previous article, we performed a simple quality check – null count. Let’s continue this example by visualizing some of the results. The results table contains the following columns:
- uuid – unique id for each result,
- tested_resource – the name of the analyzed table,
- test_name – the name of the data quality check,
- value – result, in this case, a number of null values per day,
- created_time – time the result was saved,
- day_of_tested_data – the day the data was tested; in this case, the date for which NULL values were counted.
So far, the test_name column can take one of the following values:
- simple_row_count,
- max_delay_in_minutes,
- null_count.
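Below is a minimal sketch of what this results table could look like in BigQuery DDL; the column types are assumptions based on the descriptions above:
CREATE TABLE IF NOT EXISTS `dqo-learning.data_quality.dq_check_results` (
  uuid STRING,                -- unique id for each result
  tested_resource STRING,     -- name of the analyzed table
  test_name STRING,           -- name of the data quality check
  value INT64,                -- result, e.g. the number of null values per day
  created_time TIMESTAMP,     -- time the result was saved
  day_of_tested_data DATE     -- the day the tested data refers to
);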
Ways to connect data to Data Studio
In order to prepare data for a certain dashboard, we can first create views in BigQuery. This allows us to visualize selected data quickly and easily; however, depending on the size of the table and the complexity of the query, dashboard performance may suffer.
Another option is to use Data Studio to extract the information and create additional data sources. It works by connecting Data Studio to the table and then extracting only the data we want to use in the dashboard.
There is also an option to connect Data Studio to the original BigQuery table and create dashboards directly from it. Creating dashboards in this way is more complicated – we have to filter the data for every table or chart in Data Studio, so for a data engineer who is fluent in SQL it may take more time.
Such an approach also runs the risk of omitting a required filter, resulting in incorrect or misleading data being displayed. The main disadvantage is that Data Studio has to process all the data from the table, which has a negative impact on dashboard loading times. We do not want to promote this solution, so it will not be covered in this article.
Preparing the data
As mentioned before, data may be prepared in two ways: creating views in BigQuery or extracting data in Data Studio.
It’s crucial to figure out what data we will need before moving on to the next step. In most cases, it is determined by the recipient. Dashboards are typically built to assist business people in making decisions. Technical teams, on the other hand, use dashboards from time to time for retrieving data or debugging.
Data in dashboards should be tailored to the needs of both parties, allowing them to improve their work. In our case, we will need the following columns:
- uuid,
- day_of_tested_data,
- value,
- created_time.
How to create a view in BigQuery
First, let’s check which tables were analyzed with the null_count check:
SELECT DISTINCT tested_resource
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
The query returned only one record, so everything is correct.
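We can also preview a few of the most recent null_count results before building the view (a quick sanity-check query, shown only for illustration):
SELECT uuid, tested_resource, test_name, value, created_time, day_of_tested_data
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
ORDER BY created_time DESC
LIMIT 10;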
Now we know how to filter the data. The query that creates the view is shown below.
SELECT
  uuid AS id,
  day_of_tested_data,
  value AS number_of_null_values,
  created_time
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
In order to create a view, expand SAVE, select the Save View option, and complete the information necessary to create the view: the project name is set automatically, while the dataset and the view name have to be specified.
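If you prefer SQL to the UI, the same view can also be created with a DDL statement. The sketch below names the view null_count_results so that it matches the data source used later in the report:
CREATE OR REPLACE VIEW `dqo-learning.data_quality.null_count_results` AS
SELECT
  uuid AS id,
  day_of_tested_data,
  value AS number_of_null_values,
  created_time
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count';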
How to extract required data from the table in Data Studio
First, click the Create button in the upper left corner of the Data Studio main page and select Data source. In the next step, select the technology we will connect to – in this example, we decided to use BigQuery.
Now we need to specify the data we want to add to the report.
Then we should complete the necessary fields.
The dimensions and the date range are simple to fill in.
We would like to see the sum of null values per day, so we have to create a metric. Click Add metric, then select the column the metric will be applied to – in our case, value.
Sum is selected by default; we can change the aggregation function by clicking on the square labeled SUM on the left side:
As you can see, there is also an option to enable auto-update. There are now two tables in the recent data sources list.
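Conceptually, the sum-of-nulls-per-day metric we just defined corresponds to the following aggregation over the check results (a sketch in BigQuery SQL, shown only for illustration):
SELECT
  day_of_tested_data,
  SUM(value) AS sum_of_null_values
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
GROUP BY day_of_tested_data
ORDER BY day_of_tested_data;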
How to create a dashboard
Data Studio offers a wide variety of plots and charts that we can put on our dashboard. Depending on the type of data we want to present, there are many options, ranging from tables and column charts to geo charts. As an example, we will create a table and a column chart.
Now we can start creating our first report. Click the Create button in the upper left corner of the Data Studio main website and this time select Report. On the Add data to report page, select ‘My data sources’, then choose a data source. In this case, we choose null_count_results and approve it using the Add button in the bottom right corner.
In the case of a NULL check, we want to know which of the tested data sources had the most NULL values and when it happened. The easiest way to find detailed information is to get the data quality result id. We will create a pivot table with a heatmap – it highlights outlier values (numbers of null values) using a color gradient.
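The data behind such a pivot table roughly corresponds to the query below (a sketch against the null_count_results view sketched earlier; the actual dimensions depend on how the pivot table is configured):
SELECT
  id,
  day_of_tested_data,
  SUM(number_of_null_values) AS number_of_null_values
FROM `dqo-learning.data_quality.null_count_results`
GROUP BY id, day_of_tested_data;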
The table is automatically completed, but it is not exactly what we expected.
On the right side, there are two panels: data and style. Let’s focus on the first one:
However, we want to make this table friendlier for business users, so we need to rename the columns. We can do this using the square with the inscription ‘ABC’ on the left side of the column or metric name.
The table is almost ready. Since NULL values are associated with unwanted or incorrect data, we can change the color to red for example.
Add a filter to select the proper dates, then click Save.
The chart is sorted by the “value” dimension in descending order. We would like to sort the data by date in ascending order. In the right-hand corner of the chart, click the three dots, find Sort by, and sort by date. We will also change the color to match the table.
The whole dashboard looks like this:
Conclusion
We’ve demonstrated practical ways to present data from BigQuery in Data Studio, including how to use metrics, dimensions, and filters, as well as how to create charts.
Of course, a table and a column chart alone are not enough to extract the most value from data observability.
Finally, dashboards are among the most commonly used tools for reviewing the results of data quality analysis. They are simple to grasp and, when appropriately constructed, enable both engineers and business users to keep track of data quality. Furthermore, when an error arises, they allow IT staff to quickly resolve the issue.
For more tutorials, check out our blog.