Developing and implementing programs or queries that check data quality is only one part of the data quality analysis process.
First, we need to design, build, and integrate data from a variety of sources, often located in incompatible environments.
Only then can we use tools to run data quality checks and store the results. The next stage is to create and maintain dashboards that present the data quality findings so that we can analyze them.
Assume you are gathering results from many sources.
Each of those sources comes from data quality checks carried out on a large number of tables. The database may be updated with new records many times per hour, and the volume of data quickly reaches gigabytes or terabytes. At that scale, it is impractical to present the results as raw tables.
To analyze the results effectively, we can combine quality checks with commonly available visualization tools. Dashboards are extremely useful for managing the data quality analysis process: we can use them to examine the results and make sure that the entire process is running correctly (including catching application errors). In this article, we’ll look at how to visualize data quality results.
There are several steps in the data quality analysis process. Assuming that the data collection process has already been completed, the steps are as follows:
- design, build and integrate data from various sources,
- create and implement programs or queries to check data quality,
- store the results of data quality checks,
- prepare and manage dashboards showing data quality results,
- analyze the results.
In the previous article we described why and how to store data quality results. Now we move on to the next step and use the stored data to build a data quality dashboard.
Technologies we use
The results of the analysis are stored in Google Cloud BigQuery. The natural choice of visualization tool is Data Studio.
Data Studio is a widely used, free visualization tool that enables users to create dashboards. Data can be pulled from many sources thanks to support for services such as BigQuery, Redshift, Google Analytics, PostgreSQL, Google Sheets, and many more.
The dataset
In the previous article, we performed a simple quality check – a null count. Let’s continue this example by visualizing some of the results. The results are stored in a table with the following columns:
- uuid – unique id for each result,
- tested_resource – the name of the analyzed table,
- test_name – the name of the data quality check,
- value – the result; in this case, the number of null values per day,
- created_time – time the result was saved,
- day_of_tested_data – the day for which the data was tested; in this case, the date for which NULL values were counted.
The test_name column can take one of the following values:
- simple_row_count,
- max_delay_in_minutes,
- null_count.
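For reference, here is a minimal sketch of what the dq_check_results table could look like in BigQuery. The column names come from the list above, but the data types are assumptions and may differ from the table created in the previous article.

-- Hypothetical DDL for the results table; the types are assumed, not taken from the previous article
CREATE TABLE IF NOT EXISTS `dqo-learning.data_quality.dq_check_results` (
  uuid                STRING,     -- unique id of each result
  tested_resource     STRING,     -- name of the analyzed table
  test_name           STRING,     -- e.g. 'simple_row_count', 'max_delay_in_minutes', 'null_count'
  value               FLOAT64,    -- check result, e.g. the number of NULL values per day
  created_time        TIMESTAMP,  -- time the result was saved
  day_of_tested_data  DATE        -- the day for which the data was tested
);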
Ways to connect data to Data Studio
In order to prepare data for a certain dashboard, we can first create views in BigQuery. This allows us to visualize selected data quickly and easily; however, depending on the size of the table and the complexity of the query, dashboard performance may suffer.
Another option is to use Data Studio to extract the information and create an additional data source. This works by connecting Data Studio to the table and then extracting only the data we want to use in the dashboard.
There is also an option to connect Data Studio directly to the original BigQuery table and build dashboards straight from it. Creating dashboards this way is more complicated – the data has to be filtered for every table or chart in Data Studio, so for a data engineer who is fluent in SQL it may actually take more time.
This approach also runs the risk of omitting a required filter, resulting in incorrect or misleading data being displayed. Its main disadvantage, however, is that Data Studio has to process all the data from the table, which negatively affects dashboard loading times. We do not recommend this solution, so it will not be covered in this article.
Preparing the data
As mentioned before, data may be prepared in two ways: creating views in BigQuery or extracting data in Data Studio.
It’s crucial to figure out what data we will need before moving on to the next step. In most cases, it is determined by the recipient. Dashboards are typically built to assist business people in making decisions. Technical teams, on the other hand, use dashboards from time to time for retrieving data or debugging.
Data in dashboards should be tailored to the needs of both parties, allowing them to improve their work. In our example, we will need the following columns:
- uuid,
- day_of_tested_data,
- value,
- created_time.
How to create a view in BigQuery
First, let’s check which tables the null_count check was run on:
SELECT DISTINCT tested_resource
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
The query returned only one record, so everything is correct.
Now we know how to filter the data. The query that creates the view is shown below.
SELECT uuid AS id,
day_of_tested_data,
value AS number_of_null_values,
created_time
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
In order to create a view, expand SAVE, select the Save view option, and complete the information necessary to create the view: the project name is set automatically, but you have to specify the dataset and name the view.
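Alternatively, the same view can be created with a DDL statement instead of the Save view button. The sketch below uses the null_count_results name that we will later select in Data Studio; adjust the dataset and view name to your own project if needed.

-- Creates (or replaces) the view with the filtered null_count results
CREATE OR REPLACE VIEW `dqo-learning.data_quality.null_count_results` AS
SELECT uuid AS id,
       day_of_tested_data,
       value AS number_of_null_values,
       created_time
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count';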
How to extract required data from the table in Data Studio
In the next step, select the technology we will connect to. In this example we decided to use BigQuery.
Now, we need to specify the data which we want to add to the report.
Now we should complete the necessary fields.
Dimensions and date range are simple to fill.
We would like to see the sum of null values per day, so we have to create a metric. Add a metric, then select the column to which the metric will be applied – in our case, value.
Sum is selected by default; by clicking the square labeled SUM on the left side of the metric, we can change the aggregation function:
As you can see, there is also an option to set auto-update. There are now two tables in the recent data sources list.
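For clarity, the data source extracted this way corresponds roughly to the aggregation below. This is only an illustrative sketch of the result we expect, not what Data Studio runs internally.

-- Sum of NULL values per day, as computed by the extracted data source
SELECT day_of_tested_data,
       SUM(value) AS number_of_null_values
FROM `dqo-learning.data_quality.dq_check_results`
WHERE test_name LIKE 'null_count'
GROUP BY day_of_tested_data;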
How to create a dashboard
Data Studio offers a wide variety of plots and charts that we can put on our dashboard. Depending on the type of data we want to present, there are many options, ranging from tables and column charts to geo charts. As an example, we will create a table and a column chart.
Now we can start creating our first report. Click the Create button in the upper left corner of the Data Studio main page and this time select Report. On the Add data to report page, select ‘My data sources’, then choose a data source. In this case, we choose null_count_results and confirm it with the Add button in the bottom right corner.
In the case of a NULL check, we want to know which of the tested data sources had the most NULL values and when it happened. The easiest way to find detailed information is to get the data quality result id. We will create a pivot table with a heatmap – it highlights outlier values (numbers of null values) using a color gradient.
The table is automatically completed, but it is not exactly what we expected.
On the right side, there are two panels: data and style. Let’s focus on the first one:
However, we want to make this table friendlier for business users, so we will rename the columns. We can do this using the square labeled ‘ABC’ on the left side of the column or metric name.
The table is almost ready. Since NULL values are associated with unwanted or incorrect data, we can change the color, for example to red.
Add a filter to select the proper date range, then click Save.
The chart is sorted by the value dimension in descending order. We would like to sort the data by date in ascending order. Click the three dots in the right-hand corner of the chart, find Sort by, and sort by date. We will also change the colors to match the table.
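If you prefer to keep such logic in SQL rather than configure it in Data Studio, a similar date filter and ordering can be expressed in a query against the view. This is just a sketch; the date range below is a placeholder.

-- Null counts for a chosen period, sorted by date in ascending order
SELECT day_of_tested_data,
       number_of_null_values
FROM `dqo-learning.data_quality.null_count_results`
WHERE day_of_tested_data BETWEEN DATE '2022-01-01' AND DATE '2022-01-31'  -- placeholder range
ORDER BY day_of_tested_data ASC;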
The whole dashboard looks like this:
When creating a dashboard in DQO, we can use ready-made dashboard templates. DQO provides a very large number of dashboards that can be modified to create your own, which is very convenient because you do not have to start from scratch. A description of how to create dashboards with DQO is available in the documentation (link to create dashboards here).
Importantly, DQO has built-in data quality checks that we can use to create dashboards tailored to our needs. Below is an example of a Timeliness check.
The easiest way to create a dashboard is to use the ready-made dashboards available in DQO. After accessing Looker Studio and configuring the account (a description is available in the documentation), we can edit the dashboard by connecting our data sources and using the checks available in the application.
Here are some examples of the dashboards available in DQO.
KPIs scoreboard – summary
The above dashboard, as the name suggests, shows the KPIs scoreboard summary.
On it, we see percentage charts showing how many checks have passed. The bar chart shows monthly summaries for the KPIs.
The boxes show the checks grouped by status: Correct, Warning, Error, and Fatal. They display both the number and the percentage of checks with a given status.
The pie chart shows the percentage distribution of the performed checks. We can see, for example, that the checks that were more than 90% correct are shown in green.
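As a rough illustration of how such a KPI could be computed, here is a sketch that calculates the monthly percentage of passed checks. The table and column names (check_results, severity, executed_at) are hypothetical and do not reflect DQO's actual schema.

-- Hypothetical example: monthly KPI as the share of checks with a 'correct' status
SELECT DATE_TRUNC(DATE(executed_at), MONTH) AS month,
       ROUND(100 * COUNTIF(severity = 'correct') / COUNT(*), 2) AS kpi_percent
FROM `my-project.data_quality.check_results`
GROUP BY month
ORDER BY month;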
This is just a fragment of the dashboard summarizing the KPIs, but it shows how easy and illustrative visualizing the results can be.
This is extremely important especially from a business point of view, as it provides a quick insight into the analysis of the results.
Filters are available at the top of the dashboard, which gives us more control over the dashboard view. We can filter the results by Connection to see the results for a particular connection, and similarly select categories or filter by a specific table.
Such a detailed definition of what we want to see on the dashboard will certainly make it easier to analyze the data and enable us to work on improving the quality of the data.
Current table status per data quality dimension
The purpose of this dashboard is to show the current state of the table by data quality dimension.
On the dashboard we can see several small tables in which the status, warnings, errors, and fatals columns are visible.
The remaining columns show the results for a particular Connection, Schema name, Quality dimension, data group name, and table. The status indicates which tables, connections, or dimensions we should work on to fix the detected anomalies.
This is extremely important from a data quality point of view. It allows a quick visual assessment of the amount of work, indicates the sources of irregularities, and shows their severity. Fatal is marked in red because it requires the most attention.
Details of execution errors
This dashboard can be helpful when we want to analyze execution errors. As can be seen in the tables above, the results are again broken down by Connection, Schema, Table, Column, and Quality Dimension.
This is very useful when working on errors, as we can locate them using the tables shown.
We can, for example, search for errors for a particular connection and see where the errors are most numerous.
DQO offers a large number of dashboards that can be customized to fit your data set. Thanks to DQO, you save time and see what your dashboard will look like before you start creating it.
Conclusion
We’ve demonstrated effective ways to present data from BigQuery in Data Studio, including how to use metrics, dimensions, and filters, as well as how to create charts.
Of course, a table and a column chart alone are not enough to extract the most value from data observability.
Finally, dashboards are the most commonly used tools for reviewing the results of data analysis. When appropriately constructed, they are simple to grasp and enable both engineers and business staff to keep track of data quality. Furthermore, when an error arises, they allow IT staff to resolve the issue quickly.
For more tutorials, check out our blog.