Verifying the number of rows in a table
Prerequisites
To use the examples, you need:
- DQO installed
- A BigQuery service account with the BigQuery > BigQuery Job User permission. You can create a free trial Google Cloud account at https://cloud.google.com/free.
- A working Google Cloud CLI if you want to use Google Application Credentials authentication.
After installing the Google Cloud CLI, log in to your GCP account by running:

```
gcloud auth application-default login
```
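With authentication in place, DQO reaches BigQuery through a connection definition stored with the example's configuration. The snippet below is a minimal sketch of what such a BigQuery connection file could look like; the key names (`provider_type`, `bigquery`, `source_project_id`, `billing_project_id`, `authentication_mode`) follow the layout used by recent DQO versions and are assumptions here, so check them against the file shipped with the example.

```yaml
# Hedged sketch of a BigQuery connection definition; key names are
# assumptions based on recent DQO versions and may differ in yours.
apiVersion: dqo/v1
kind: source
spec:
  provider_type: bigquery
  bigquery:
    # Public project hosting the America's Health Rankings dataset
    source_project_id: bigquery-public-data
    # Hypothetical placeholder - replace with the GCP project billed for queries
    billing_project_id: my-gcp-project
    # Reuses the credentials created by `gcloud auth application-default login`
    authentication_mode: google_application_credentials
```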
Problem
America’s Health Rankings provides an analysis of national health on a state-by-state basis by evaluating a historical and comprehensive set of health, environmental and socioeconomic data to determine national health benchmarks and state rankings.
The platform analyzes more than 340 measures of behaviors, social and economic factors, physical environment and clinical care data.
Data is based on public-use data sets, such as the U.S. Census and the Centers for Disease Control and Prevention’s Behavioral Risk Factor Surveillance System (BRFSS), the world’s largest, annual population-based telephone survey of over 400,000 people.
Here is a table with some sample data from the bigquery-public-data.america_health_rankings.ahr dataset. Some columns were omitted for clarity.
In this example, we will verify that the number of rows in the table does not fall below the minimum accepted count.
Solution
We will verify the data using the profiling `row_count` table check.
Our goal is to verify that the number of rows does not fall below the configured thresholds.
In this example, we will set three minimum count threshold levels for the check:
- warning: 692
- error: 381
- fatal: 150
If the number of rows falls below 692, a warning alert will be triggered; below 381, an error; and below 150, a fatal alert.
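Below is a minimal sketch of how these three thresholds could be expressed in the table's YAML configuration. It assumes the `profiling_checks` > `volume` > `row_count` layout used by recent DQO versions; the exact keys may differ in your version, so compare it with the YAML file included in the example directory.

```yaml
# Hedged sketch of a row_count profiling check with the three
# threshold levels used in this example; exact keys may vary by DQO version.
apiVersion: dqo/v1
kind: table
spec:
  profiling_checks:
    volume:
      row_count:
        warning:
          min_count: 692   # warning raised below 692 rows
        error:
          min_count: 381   # error raised below 381 rows
        fatal:
          min_count: 150   # fatal alert raised below 150 rows
```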
Running the example
To run the examples using the graphical interface, follow the steps below.
- Go to the directory where you installed DQO and navigate to examples/data-completeness/number-of-rows-in-the-table-bigquery. Run the command `run_dqo` on Windows or `./run_dqo` on macOS/Linux.
- Create the DQO `userhome` folder. After installation, you will be asked whether to initialize the DQO `userhome` folder in the default location. Type Y to create the folder. The `userhome` folder locally stores data such as sensor readings and check results, as well as data source configurations.
- Log in to DQO Cloud. To use DQO features, such as storing data quality definitions and results in the cloud or data quality dashboards, you must create a DQO Cloud account. After creating the `userhome` folder, you will be asked whether to log in to DQO Cloud. After typing Y, you will be redirected to https://cloud.dqo.ai/registration, where you can create a new account, use Google single sign-on (SSO), or log in if you already have an account. During the first registration, a unique identification code (API Key) is generated and automatically passed to the DQO application. The API Key is stored in the configuration file.
- Open the DQO User Interface Console at http://localhost:8888.
- Go to the Profiling section in the navigation bar at the top of the screen.
- Select the table or column mentioned in the example description from the tree view on the left.
- Select the Advanced Profiling tab.
- Run the enabled check using the Run check button.
- Review the results by clicking the Check details button.
- You should see results similar to those below.
The actual number of rows in this example is 18,155, which is above the minimum threshold level set for the warning (692).
The check returns a valid result (notice the green square to the left of the check name).
- After executing the checks, synchronize the results with your DQO Cloud account using the Synchronize button in the upper right corner of the graphical interface.
- To review the results on the data quality dashboards, go to the Data Quality Dashboards section and select a dashboard from the tree view on the left. Below you can see the results displayed on the Issues dashboard, which shows results by check, the number of issues per connection, and the number of issues per table.
Conclusion
The example showed how easily you can start monitoring the quality of your data with DQO.