How to make a data quality dashboard?

Developing and implementing programs or queries that check data quality is only one part of the data quality analysis process.

To begin, we need to design, build, and integrate data from a variety of sources, frequently in incompatible environments.

Then we’ll be able to use tools to perform data quality checks and save the results. The next stage is to create and administer dashboards that display data quality findings so that we can study them.

Assume you're gathering information from many sources.

Each of those sources holds the results of data quality checks carried out on a large number of tables. Within a single hour, the database may be updated many times with new records, and the volume can quickly reach gigabytes or terabytes. At that scale, it is impractical to review the results as raw tables.

To adequately analyze the results, we can use quality checks and commonly available visualization tools. Dashboards are extremely useful for managing the data quality analysis process. We can use them to examine the results and ensure that the entire process is running successfully (including errors in the application and so on). In this article, we’ll look at how to visualize data quality results.


There are several steps in the data quality analysis process. Assuming that data collection has already been completed, the steps are as follows:

  • design, build and integrate data from various sources,
  • create and implement programs or queries to check data quality,
  • store the results of data quality checks,
  • prepare and manage dashboards showing data quality results,
  • analyze the results.

In the previous article we described why and how to store data quality results. Now we move on to the next step and use the stored data to build a data quality dashboard.

Technologies we use

The results of the analysis are stored in Google Cloud BigQuery. The natural choice of visualization tool is Data Studio.

Data Studio is a widely used, free visualization tool that enables users to create dashboards. Data can be obtained from many sources thanks to support for services such as BigQuery, Redshift, Google Analytics, PostgreSQL, Google Sheets, and many more.

The dataset

In the previous article, we performed a simple quality check – null count. Let’s continue this example by visualizing some results.

The dq_check_results table contains the following columns:
  • uuid – unique id for each result,
  • tested_resource – the name of the analyzed table,
  • test_name – the name of the data quality check,
  • value – result, in this case, a number of null values per day,
  • created_time – time the result was saved,
  • day_of_tested_data – the day the data was tested; in this case, the date for which NULL values were counted.
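
For reference, a table with these columns could be defined in BigQuery roughly as follows. This is only a sketch – the column types and table location are assumptions based on how the results were stored in the previous article.

CREATE TABLE IF NOT EXISTS `dqo-learning.data_quality.dq_check_results` (
    uuid STRING,              -- unique id for each result
    tested_resource STRING,   -- name of the analyzed table
    test_name STRING,         -- name of the data quality check
    value INT64,              -- check result, e.g. the number of NULL values (numeric type assumed)
    created_time TIMESTAMP,   -- time the result was saved
    day_of_tested_data DATE   -- the day of the tested data
);
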
Currently, in this table, there are results for three data quality checks:
  • simple_row_count,
  • max_delay_in_minutes,
  • null_count.
So we need to filter the results first. A simple query does this. We will focus on the results of the NULL count check for the online_retail_II table. This can be done in a few ways, each with its own advantages and disadvantages.
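
For example, a query along these lines narrows the results to that single check and table (a sketch; the tested_resource value is assumed to match the table name used in the checks):

SELECT *
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name = 'null_count'
  AND tested_resource = 'online_retail_II'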

Ways to connect data to Data Studio

In order to prepare data for a certain dashboard, we can first create views in BigQuery. This allows us to visualize selected data quickly and easily; however, depending on the size of the table and the complexity of the query, dashboard performance may suffer.

Another option is to use Data Studio to extract the information and create additional data sources. It works by connecting Data Studio to the table and then extracting only the data we want to utilize in the dashboard in Data Studio.

There is also an option to connect Data Studio directly to the original table in BigQuery and create dashboards from it. Creating dashboards this way is more complicated – we have to filter the data for every table or chart in Data Studio, so even for a data engineer fluent in SQL it may take more time.

Such an approach runs the risk of omitting a required filter, resulting in incorrect or misleading data being displayed. The main disadvantage is that Data Studio has to process all the data from the table, which has a negative impact on the time needed to load dashboards. We do not recommend this solution, so it will not be covered in this article.

Preparing the data

As mentioned before, data may be prepared in two ways: creating views in BigQuery or extracting data in Data Studio. 

It’s crucial to figure out what data we will need before moving on to the next step. In most cases, it is determined by the recipient. Dashboards are typically built to assist business people in making decisions. Technical teams, on the other hand, use dashboards from time to time for retrieving data or debugging. 

Data in dashboards should be tailored to the needs of both parties, allowing them to improve their work.

In this case, for the online_retail_II table we need the number of null values per day. “value” and “day_of_tested_data” are what we need to display in our dashboard. We should use “tested_resource” and “test_name” to filter data rather than display them in the table, because they will be the same for every row. If engineers will be using the dashboard, it is also worth including “uuid”.

Let’s assume that we want to create a dashboard for both a business client and programmers. It will present the results of the null count check applied to the online_retail_II table. Summing up, we need to see the following data:

  • uuid,
  • day_of_tested_data,
  • value,
  • created_time.

How to create a view in BigQuery

We will start in the BigQuery SQL workspace. We know that the null_count check was applied only to the online_retail_II table, but let’s make sure with the query below:

SELECT DISTINCT tested_resource
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'

The query returned only one record, so everything is correct.

Now we know how to filter the data. The query that creates the view is shown below.

SELECT uuid AS id,
    day_of_tested_data,
    value AS number_of_null_values,
    created_time
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'

To create a view, expand SAVE, select the Save view option, and complete the information needed to create the view: the project name is set automatically, and you have to specify the dataset and name the view.
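
Alternatively, the same view can be created directly with a DDL statement. Below is a sketch; the dataset and the view name null_count_results (the name we refer to later) are assumptions about where the view is saved.

CREATE OR REPLACE VIEW `dqo-learning.data_quality.null_count_results` AS
SELECT uuid AS id,
    day_of_tested_data,
    value AS number_of_null_values,
    created_time
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count';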

How to extract required data from the table in Data Studio

First, we need to add the original dq_check_results table as a data source in Data Studio. To do this, click the Create button in the upper left corner of Data Studio and select Data source.

In the next step, select the technology we will connect to. In this example we decided to use BigQuery.

Now, we need to specify the data which we want to add to the report.

In our recent data sources we can see the dq_check_results table.
Of course, following the same steps, we can connect Data Studio with the null_count_results view created before in BigQuery.

We will create a data source that, like our view in BigQuery, contains only the required data. We need to create a new data source again, but this time, under Google Connectors, we select “Extract Data”.

Now we should complete the necessary fields.

Dimensions and date range are simple to fill.

We would like to see the sum of null values per day, so we have to create a metric. Click Add metric, then select the column to which the metric will be applied – in our case, value.

Sum is selected by default; we can change the aggregation function by clicking the square with the inscription SUM on the left side:

The last part of this step is applying filters. Find the “Filter” section and click ADD A FILTER, then complete the formula and save it. We want to see results only for the “null_count” check, so our filter contains the following information:
Finally, the page for extracting the data source looks like this.

As you can see, there is also an option to set auto-update. There are now two tables in our recent data sources.
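
Whichever way the data source is prepared, the extracted data is roughly equivalent to the result of a query like the sketch below (column names follow the original table; the aggregation and filter mirror the settings configured above):

SELECT day_of_tested_data,
    SUM(value) AS sum_of_null_values
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name = 'null_count'
GROUP BY day_of_tested_data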

How to create a dashboard

Data Studio offers a wide variety of plots and charts that we can put on our dashboard. Depending on the type of data we want to present, there are many options, from tables and column charts to geo charts. As an example, we will create a table and a column chart.

Now we can start creating our first report. Click the Create button in the upper left corner of the Data Studio main page and this time select Report. On the Add data to report page, select ‘My data sources’, then choose a data source. In this case, we choose null_count_results and approve it with the Add button in the lower right corner.

In the case of a NULL check, we want to know which of the tested data sources had the most NULL values and when it happened. The easiest way to find detailed information is to get the data quality result id. We will create a pivot table with a heatmap – it highlights outlier values (the number of null values) using a color gradient.

The table is automatically completed, but it is not exactly what we expected.

On the right side, there are two panels: data and style. Let’s focus on the first one:

We should make changes here to create a table that meets our assumptions. What we need to do is remove “day_of_tested_data” from the column dimension and add it to the row dimension. The result is what we wanted – a table that shows the days with the highest number of null values, along with the id that programmers can use to deal with the issue.

However, we want to make this table friendlier for business users, so we need to rename the columns. We can do this using the square with the inscription ‘ABC’ on the left side of the column or metric name.

The table is almost ready. Since NULL values are associated with unwanted or incorrect data, we can change the color to red for example.

In some cases, a table is not the most readable way to present the data. Let’s go ahead and create a column chart that shows the daily count of NULL values during one week (between 2 December 2011 and 9 December 2011). Click Add a chart and select Column chart. The dimension is “day_of_tested_data” and the metric is “value”.

Add a filter to select the proper date. Click save.

The chart is sorted by the metric “value” in descending order. We would like to sort the data by date in ascending order. In the upper right-hand corner of the chart, click the three dots, find Sort by, and sort by date. We will also change the color to match the table.
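
The data behind this chart corresponds roughly to the query below (a sketch against the underlying results table; the date range matches the week chosen above):

SELECT day_of_tested_data,
    SUM(value) AS daily_null_count
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name = 'null_count'
  AND day_of_tested_data BETWEEN DATE '2011-12-02' AND DATE '2011-12-09'
GROUP BY day_of_tested_data
ORDER BY day_of_tested_data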

The whole dashboard looks like this:

When creating a dashboard in DQO, we can use ready-made dashboard templates. DQO offers a very large number of dashboards that can be modified to create your own, which is very convenient because you do not have to start from scratch. A description of how to create dashboards with DQO is available in the documentation (link to create dashboards here).
Importantly, DQO has built-in data quality checks that we can use to create dashboards that fit our needs. Below is an example of a Timeliness check.

The easiest way to create a dashboard is to use the ready-made dashboards available in DQO. After accessing Looker Studio and configuring the account (a description is available in the documentation), we can edit a dashboard by connecting our data sources and using the checks available in the application.
Here are some examples of the dashboards available in DQO.

KPIs scoreboard – summary

The above dashboard, as the name suggests, shows the KPIs scoreboard summary.
On it, we see percentage charts showing how many checks have passed. The bar chart shows monthly summaries for the KPIs.

In the boxes we can see the checks grouped by status: Correct, Warning, Error, Fatal. They show both the number and the percentage of checks with a given status.
The pie chart shows the percentage distribution of executed checks. We can see that, for example, checks that were more than 90% correct are shown in green.

This is just a fragment of the dashboard summarizing the KPIs, but it already shows how easily and clearly the results can be visualized.
This is especially important from a business point of view, as it provides quick insight into the results of the analysis.

Filters are available at the top of the dashboard. This gives us more options for the dashboard view. We can filter the results by Connection. Then we can see the results for a particular connection. Similarly, we can also select categories or filter by a specific table.
Such a detailed definition of what we want to see on the dashboard will certainly make it easier to analyze the data and enable us to work on improving the quality of the data.

Current table status per data quality dimension


The purpose of this dashboard is to show the current state of the table by data quality dimension.
On the dashboard we see several small tables in which the status, warnings, errors and fatals columns are visible.

The different columns show the results for a particular Connection, Schema name, Quality dimension, data group name, and table. Status indicates which tables, connections or dimensions we should work on to fix the detected anomalies.

This is extremely important from a data quality point of view. It allows a quick visual assessment of the amount of work, and indicates both the sources of irregularities and their severity. Fatal is marked in red because it requires the most attention.

Details of execution errors

This dashboard can be helpful when we want to analyze errors. As can be seen in the tables above, the results are again broken down by Connection, Schema, Table, Column, or Quality Dimension.

This is extremely important when working on errors, as we can locate them using the tables shown.

We can, for example, search for errors according to a particular connection, seeing where the errors are most numerous.

DQO offers a large number of dashboards that can be customized to fit your data set. Thanks to DQO, you save time and see what your dashboard will look like before you start creating it.

Conclusion

We’ve demonstrated several ways to present data from BigQuery in Data Studio, including how to use metrics, dimensions, and filters, as well as how to create charts.

Of course, a table and column chart are insufficient to extract the most value from data observability.

Finally, dashboards are the most commonly used tools for reviewing data quality results. They are simple to grasp and, when appropriately constructed, enable both engineers and business users to keep track of data quality. Furthermore, when an error arises, they allow IT staff to resolve the issue quickly.

For more tutorials check out our blog.

Do you want to learn more about Data Quality?

Subscribe to our newsletter and learn the best data quality practices.
