Getting started with DQOps

Verifying the number of rows in the table

A standard DQO installation comes with a set of examples, which can be found in the `examples/` directory. These examples use openly available datasets from Google Cloud. The following example describes how to verify that the number of rows in a table does not fall below the minimum accepted count. Here you can check the full list of examples.

Prerequisites

To use the examples you need a DQO installation, a GCP account, and the Google Cloud CLI.

After installing the Google Cloud CLI, log in to your GCP account by running:

gcloud auth application-default login

Problem

America’s Health Rankings provides an analysis of national health on a state-by-state basis by evaluating a historical and comprehensive set of health, environmental and socioeconomic data to determine national health benchmarks and state rankings.

The platform analyzes more than 340 measures of behaviors, social and economic factors, physical environment and clinical care data.

Data is based on public-use data sets, such as the U.S. Census and the Centers for Disease Control and Prevention’s Behavioral Risk Factor Surveillance System (BRFSS), the world’s largest, annual population-based telephone survey of over 400,000 people.

This example uses sample data from the bigquery-public-data.america_health_rankings.ahr table.

In this example, we want to verify that the number of rows in the table does not fall below the minimum accepted count.

Solution

We will verify the data using the profiling row_count table check.
Our goal is to verify that the number of rows does not fall below the set thresholds.
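
Conceptually, a row count check comes down to counting the rows of the monitored table and passing that number to the threshold rules. The sketch below is not DQO's actual sensor. It is a minimal approximation that assumes a client object exposing `query(sql).result()` in the style of the `google-cloud-bigquery` library; the function names and the `actual_value` alias are illustrative.

```python
def build_row_count_sql(table: str) -> str:
    """Return a query that counts all rows in a fully qualified table."""
    # Backticks are needed because the project ID contains hyphens.
    return f"SELECT COUNT(*) AS actual_value FROM `{table}`"


def fetch_row_count(client, table: str) -> int:
    """Run the count query and return the single resulting number.

    `client` is assumed to behave like google.cloud.bigquery.Client:
    query(sql) returns a job whose result() yields rows.
    """
    rows = client.query(build_row_count_sql(table)).result()
    first_row = next(iter(rows))
    return int(first_row.actual_value)
```

With `google-cloud-bigquery` installed and application default credentials configured, calling `fetch_row_count(bigquery.Client(), "bigquery-public-data.america_health_rankings.ahr")` should return the current number of rows in the table.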

In this example, we will set three minimum count threshold levels for the check:

  • warning: 692
  • error: 381
  • fatal: 150

You can learn more about checks and threshold levels here.

Value

If the number of rows falls below 692, a warning alert will be triggered.
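
The way the three thresholds interact can be sketched in a few lines of Python. This is an illustration of the logic only, not DQO's rule implementation. It assumes that "falls below" means strictly less than the threshold and that the most severe breached level wins; the names `MIN_COUNT_THRESHOLDS` and `evaluate_row_count` are hypothetical.

```python
# Minimum row counts from this example, ordered from most to least severe.
MIN_COUNT_THRESHOLDS = [
    ("fatal", 150),
    ("error", 381),
    ("warning", 692),
]


def evaluate_row_count(row_count: int) -> str:
    """Return the most severe level whose minimum count is not met, else 'valid'."""
    for severity, min_count in MIN_COUNT_THRESHOLDS:
        if row_count < min_count:
            return severity
    return "valid"
```

Under these assumptions, `evaluate_row_count(500)` returns `"warning"`, `evaluate_row_count(300)` returns `"error"`, and `evaluate_row_count(100)` returns `"fatal"`, while a table with 692 rows or more is reported as `"valid"`.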

Running the example

To run the examples using the graphical interface, follow the steps below.

  1. Go to the directory where you installed DQO and navigate to
    examples/data-completeness/number-of-rows-in-the-table-bigquery. Run the command run_dqo on Windows or ./run_dqo on macOS/Linux.

  2. Create the DQO `userhome` folder. After installation, you will be asked whether to initialize the DQO `userhome` folder in the default location. Type Y to create the folder. The userhome folder locally stores data such as sensor readings and check results, as well as data source configurations.

  3. Log in to DQO Cloud.
    To use DQO features such as data quality dashboards or storing data quality definitions and results in the cloud, you must create a DQO Cloud account. After creating the userhome folder, you will be asked whether to log in to DQO Cloud. After typing Y, you will be redirected to https://cloud.dqo.ai/registration, where you can create a new account, use Google single sign-on (SSO), or log in if you already have an account. During the first registration, a unique identification code (API Key) is generated and automatically passed to the DQO application. The API Key is then stored in the configuration file.

  4. Open the DQO User Interface Console (http://localhost:8888).

  5. Go to the Profiling section in the navigation bar at the top of the screen.

  6. Select the table or column mentioned in the example description from the tree view on the left.

  7. Select the Advanced Profiling tab.
  8. Run the enabled check using the Run check button.

  9. Review the results by clicking the Check details button.

  10. You should see results similar to the one below.

    The actual number of rows in this example is 18155, which is above the minimum threshold set for the warning level (692).
    The check gives a valid result (notice the green square to the left of the check name).

  11. After executing the checks, synchronize the results with your DQO Cloud account using the Synchronize button in the upper right corner of the graphical interface.

  12. To review the results on the data quality dashboards, go to the Data Quality Dashboards section and select the dashboard from the tree view on the left. Below you can see the results displayed on the Issues dashboard showing results by check, number of issues per connection, and number of issues per table.

Conclusion

The example showed how easily you can start monitoring the quality of your data with DQO.
