In this article, we will show you how to start working with DQO.ai, using BigQuery as an example.
After the initial setup, we will show you how to run the examples.
Then we will walk you through the process of adding connections, importing and editing tables, and defining and running checks.
Examples prerequisites
The examples use BigQuery public datasets hosted in the bigquery-public-data public GCP project. You have to download and install the Google Cloud CLI.
After installing the Google Cloud CLI, log in to your GCP account (you can start one for free) by running:
gcloud auth application-default login
After setting up your GCP account, create a GCP project. This will be your GCP billing project, used to run SQL sensors on the public datasets provided by Google.
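You can create the project in the GCP console or from the command line. For example (my-billing-project is a placeholder name, substitute your own project ID):
gcloud projects create my-billing-project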
The examples use the name of your GCP billing project, read from the GCP_PROJECT environment variable.
You have to set and export this variable before starting the DQO shell (an example follows in the next section).
Running examples
We will show how to run the date_type_percent check from the examples (the rest are run analogously). Open up a terminal and go to the directory where dqo.ai is installed.
Define an environment variable that corresponds to your GCP project ID and billing project ID.
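For example, on Linux/macOS (my-billing-project is a placeholder for your own project ID):
export GCP_PROJECT=my-billing-project
On Windows, use set GCP_PROJECT=my-billing-project instead.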
Switch to the example directory and run the application. You can copy and paste the following commands into your terminal, and the application will start (dqo.cmd is the Windows launcher; on Linux/macOS, start the corresponding shell script from the installation root if your distribution includes one).
cd examples/bigquery-column-date-type-percent
..\..\dqo.cmd
Now you can go ahead and run the check by executing:
check run
The result is:
dqo.ai> check run
Check evaluation summary per table:
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection |Table |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|publicdata_bq_1|labeled_patents.extracted_data|1 |1 |0 |0 |0 |1 |
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
You can also run this check in debug mode:
check run -m=debug
and the output will show the step-by-step execution:
dqo.ai> check run -m=debug
**************************************************
Executing quality checks on table labeled_patents.extracted_data from connection publicdata_bq_1
**************************************************
**************************************************
Executing a sensor for a check date_type_percent on the table labeled_patents.extracted_data using a sensor definition column/validity/date_type_percent
**************************************************
**************************************************
Calling Jinja2 rendering template column/validity/date_type_percent/bigquery.sql.jinja2
**************************************************
**************************************************
Jinja2 engine has rendered the following template:
column/validity/date_type_percent/bigquery.sql.jinja2
**************************************************
**************************************************
Executing SQL on connection publicdata_bq_1 (bigquery)
SQL to be executed on the connection:
SELECT
100.0 * SUM(
CASE
WHEN SAFE_CAST(analyzed_table.`publication_date` AS FLOAT64) IS NOT NULL
OR SAFE_CAST(analyzed_table.`publication_date` AS DATE) IS NOT NULL
OR SAFE.PARSE_DATE('%d.%m.%Y', analyzed_table.`publication_date`) IS NOT NULL THEN 1
ELSE 0
END
) / COUNT(*) AS actual_value, CAST(CURRENT_TIMESTAMP() AS date) AS time_period
FROM `bigquery-public-data`.`labeled_patents`.`extracted_data` AS analyzed_table
GROUP BY time_period
ORDER BY time_period
**************************************************
**************************************************
Finished executing a sensor for a check date_type_percent on the table labeled_patents.extracted_data using a sensor definition column/validity/date_type_percent, sensor result count: 1
Results returned by the sensor:
+----------------+-----------+
|actual_value |time_period|
+----------------+-----------+
|71.7201166180758|2022-05-04 |
+----------------+-----------+
**************************************************
**************************************************
Finished executing rules (thresholds) for a check date_type_percent on the table labeled_patents.extracted_data, verified rules count: 1
Rule evaluation results:
+----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+---------------+--------------+-------------------+----------------+-------------------+-----------------+-----------------+---------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|actual_value |expected_value|time_period |time_gradient|dimension_id|connection_hash |connection_name|provider|table_hash |schema_name |table_name |column_hash |column_name |check_hash |check_name |quality_dimension|sensor_name |executed_at |duration_ms|severity|rule_hash |rule_name|high_lower_bound|medium_lower_bound|low_lower_bound|
+----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+---------------+--------------+-------------------+----------------+-------------------+-----------------+-----------------+---------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|71.7201166180758|90.0 |2022-05-04T00:00|day |0 |3506421218842057901|publicdata_bq_1|bigquery|7226004687294253978|labeled_patents|extracted_data|3815616989618007012|publication_date|3360128111520269927|date_type_percent|validity |column/validity/date_type_percent|2022-05-04T11:49:24.040Z|2908 |3 |8286102381750941021|min_count|90.0 |95.0 |98.0 |
+----------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+---------------+--------------+-------------------+----------------+-------------------+-----------------+-----------------+---------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
**************************************************
Check evaluation summary per table:
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection |Table |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
|publicdata_bq_1|labeled_patents.extracted_data|1 |1 |0 |0 |0 |1 |
+---------------+------------------------------+------+--------------+-------------+------------+---------------+-------------+
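The sensor measured that about 71.72% of publication_date values can be parsed as dates, which is below even the high-severity threshold of the min_count rule (min_value: 90.0), so the rule reports severity 3 and the summary shows one high alert. Below is a minimal sketch of this threshold logic, written in Python for illustration (this is our reading of the output above, not DQO.ai's actual implementation):

def evaluate_min_count(actual_value: float,
                       low: float = 98.0,
                       medium: float = 95.0,
                       high: float = 90.0) -> int:
    # Return 0 for a valid result, or 1/2/3 for a low/medium/high alert,
    # using the thresholds from the rule evaluation results above.
    if actual_value >= low:
        return 0  # meets the strictest threshold: valid result
    if actual_value >= medium:
        return 1  # low-severity alert
    if actual_value >= high:
        return 2  # medium-severity alert
    return 3      # high-severity alert

print(evaluate_min_count(71.7201166180758))  # prints 3, matching the severity column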
Adding connections manually
Let’s define the same check as in the example, but this time manually.
Let’s begin by adding a connection; provide your information in the following command:
connection add -n={name} -t=bigquery -P=bigquery-source-project-id={project_ID} -P=bigquery-billing-project-id={billing_ID} -P=bigquery-quota-project-id={quota_ID} -P=bigquery-authentication-mode=google_application_credentials -hl
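For example, with placeholder values (the source project, bigquery-public-data, hosts the public datasets; substitute your own billing and quota project IDs):
connection add -n=publicdata_bq_1 -t=bigquery -P=bigquery-source-project-id=bigquery-public-data -P=bigquery-billing-project-id=my-billing-project -P=bigquery-quota-project-id=my-billing-project -P=bigquery-authentication-mode=google_application_credentials -hl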
With all the necessary information provided, the connection will be added in headless mode (the -hl flag). Run the following command to import tables:
table import -c={connection_name}
and choose number 112 from the list. Then open the table configuration with:
table edit -c={connection_name} -t=labeled_patents.extracted_data
A code editor should open. Feel free to copy and paste the configuration below into your YAML file (or add the checks section under the publication_date column in the right place). The custom_date_format parameter, %d.%m.%Y, matches dates formatted like 04.05.2022.
# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: labeled_patents
    table_name: extracted_data
  time_series:
    mode: current_time
    time_gradient: day
  checks: {}
  columns:
    gcs_path:
      type_snapshot:
        column_type: STRING
        nullable: true
    issuer:
      type_snapshot:
        column_type: STRING
        nullable: true
    language:
      type_snapshot:
        column_type: STRING
        nullable: true
    publication_date:
      type_snapshot:
        column_type: STRING
        nullable: true
      checks:
        validity:
          date_type_percent:
            parameters:
              custom_date_format: "%d.%m.%Y"
            rules:
              min_count:
                low:
                  min_value: 98.0
                medium:
                  min_value: 95.0
                high:
                  min_value: 90.0
    class_international:
      type_snapshot:
        column_type: STRING
        nullable: true
    class_us:
      type_snapshot:
        column_type: STRING
        nullable: true
    application_number:
      type_snapshot:
        column_type: STRING
        nullable: true
    filing_date:
      type_snapshot:
        column_type: STRING
        nullable: true
    priority_date_eu:
      type_snapshot:
        column_type: STRING
        nullable: true
    representative_line_1_eu:
      type_snapshot:
        column_type: STRING
        nullable: true
    applicant_line_1:
      type_snapshot:
        column_type: STRING
        nullable: true
    inventor_line_1:
      type_snapshot:
        column_type: STRING
        nullable: true
    title_line_1:
      type_snapshot:
        column_type: STRING
        nullable: true
    number:
      type_snapshot:
        column_type: STRING
        nullable: true
Save the configuration and run the command to execute the check, just as in the example:
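check run
If everything is configured as above, the run should end with the same check evaluation summary as before, reporting one high alert for labeled_patents.extracted_data.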