Use cases
We have provided a variety of examples to help you use DQO effectively. These examples use openly available datasets from Google Cloud.
Prerequisites
To use the examples, you need:
- DQO installed.
- A BigQuery service account with the BigQuery > BigQuery Job User permission. You can create a free trial Google Cloud account here.
- A working Google Cloud CLI if you want to use Google Application Credentials authentication.

After installing the Google Cloud CLI, log in to your GCP account by running:
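With a standard Google Cloud CLI installation, Application Default Credentials are typically set up with:

```shell
# Authenticate and store Application Default Credentials locally
gcloud auth application-default login
```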
Running the use cases
The standard DQO installation comes with a set of examples, which can be found in the examples/ directory. You can view a complete list of the examples, with links to detailed explanations, at the bottom of this page.
The example directory contains two configuration files: connection.dqoconnection.yaml, which stores the data source configuration, and a *.dqotable.yaml file, which stores the table and column metadata and the check configuration.
While it is not necessary to manually add the connection in our examples, you can find information on how to do it in the Working with DQO section.
To run the examples, follow the steps below.
- Go to the directory where you installed DQO and navigate, for example, to examples/data-completeness/number-of-rows-in-the-table-bigquery. Run the command provided below.
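  A minimal sketch of this step, assuming the example folder ships DQO's standard launch scripts (the script names may differ in your release):

  ```shell
  # Start the DQO shell from the example folder (Windows)
  run_dqo.cmd

  # Start the DQO shell from the example folder (Linux / macOS)
  ./run_dqo
  ```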
- Create the DQO userhome folder. After installation, you will be asked whether to initialize the DQO userhome folder in the default location. Type Y to create the folder.
  The userhome folder locally stores data such as sensor readouts and check results, as well as data source configurations. You can learn more about data storage here.
- Log in to DQO Cloud. To use DQO features, such as storing data quality definitions and results in the cloud or data quality dashboards, you must create a DQO Cloud account.
  After creating the userhome folder, you will be asked whether to log in to DQO Cloud. After typing Y, you will be redirected to https://cloud.dqo.ai/registration, where you can create a new account, use Google single sign-on (SSO), or log in if you already have an account.
  During the first registration, a unique identification code (API Key) is generated and automatically passed to the DQO application. The API Key is then stored in the configuration file.
- To execute the checks that were prepared in the example, run the following command in the DQO Shell, as shown in the sketch below.
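  A sketch of this step, assuming the DQO shell's check run command (the exact command name may differ in your version):

  ```shell
  # Run all data quality checks defined in the example
  check run
  ```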
  You can also execute the checks using the graphical interface. Simply open the DQO User Interface Console (http://localhost:8888), go to the Profiling section, and select the table or column mentioned in the example description from the tree view on the left. Then select the Advanced Profiling tab, run the enabled check using the Run check button, and review the results by clicking the Check details button.
- After executing the checks, synchronize the results with your DQO Cloud account by running the following command, or by using the Synchronize button located in the upper right corner of the graphical interface.
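  A sketch of the synchronization step, assuming the DQO shell's cloud sync command (the exact subcommand may differ in your version):

  ```shell
  # Push locally stored results to your DQO Cloud account
  cloud sync
  ```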
- You can now review the results on the data quality dashboards, as described in the Working with DQO section.
List of the use cases
Here is a comprehensive list of the examples, with links to the relevant documentation sections that provide detailed descriptions.
Name of the example | Description | Link to the dataset description |
---|---|---|
Data accuracy | ||
Integrity check between columns in different tables | This example shows how to check the referential integrity of a column against a column in another table using the foreign_key_match_percent check. | Link |
Data completeness | ||
Number of rows in the table | This example shows how to check that the number of rows in a table does not fall below the minimum accepted count using the row_count check. | Link |
Number of null values | This example shows how to detect that the number of null values in a column does not exceed the maximum accepted count using the nulls_count check. | Link |
Data uniqueness | ||
Percentage of duplicates | This example shows how to detect that the percentage of duplicate values in a column does not exceed the maximum accepted percentage using duplicate_percent check. | Link |
Data validity | ||
Percentage of valid USA zipcodes | This example shows how to detect that the percentage of valid USA zip codes in a column does not fall below a set threshold using the valid_usa_zipcode_percent check. | Link |
Percentage of valid emails | This example shows how to detect that the percentage of valid email values in a column does not fall below a set threshold using the valid_email_percent check. | DQOps dataset |
Percentage of valid latitude and longitude | This example shows how to detect that the percentage of valid latitude and longitude values remains above a set threshold using the numeric_valid_latitude_percent and numeric_valid_longitude_percent checks. | Link |
Percentage of valid IP4 address | This example shows how to detect that the percentage of valid IP4 addresses in a column does not fall below a set threshold using the valid_ip4_address_percent check. | DQOps dataset |
Percentage of strings matching date regex | This example shows how to detect that the percentage of strings matching the date format regex in a column does not fall below a set threshold using the string_match_date_regex_percent check. | Link |
Percentage of negative values | This example shows how to detect that the percentage of negative values in a column does not exceed a set threshold using negative_percent check. | Link |
Percentage of valid currency codes | This example shows how to detect that the percentage of valid currency codes in a column does not fall below a set threshold using string_valid_currency_code_percent check. | DQOps dataset |
Percentage of rows passing SQL condition | This example shows how to detect that the percentage of rows that pass an SQL condition does not fall below a set threshold using the sql_condition_passed_percent check. | Link |
Percentage of valid UUID | This example shows how to detect that the percentage of valid UUID values in a column does not fall below a set threshold using the string_valid_uuid_percent check. | DQOps dataset |
Data reasonability | ||
Percentage of values in range | This example shows how to detect that the percentage of values within a set range in a column does not fall below a set threshold using the values_in_range_integers_percent check. | Link |
A string not exceeding a set length | This example shows how to check that the length of strings does not exceed the indicated value using the string_max_length check. | Link |
Percentage of false values | This example shows how to detect that the percentage of false values remains above a set threshold using bool_false_percent check. | Link |
Stability | ||
Table availability | This example shows how to verify that a query can be executed on a table and that the server does not return errors using table_availability check. | Link |
Data quality monitoring | ||
Running checks with a scheduler | This example shows how to set different schedules on multiple checks. | Link |
Data consistency | ||
Percent of rows having a string column value in an expected set | This example shows how to verify that the percentage of string values that belong to an expected set does not fall below a set threshold using the string_value_in_set_percent check. | Link |