Running checks with a scheduler
In this example, you will learn how to set different schedules on multiple checks.
PROBLEM
TheLook is a fictitious eCommerce clothing dataset developed by Looker team. The dataset contains information about customers, products, orders, logistics, web events and digital marketing campaigns.
To make better use of that data, and accurately predict operational issues so that we can resolve them before they negatively impact our business, we want to monitor various data quality aspects of the dataset:
- the availability of the
users
table every 30 minutes, - the uniqueness of customer IDs in the
id
column every 15 minutes past the hour, - the uniqueness and validity of customer email addresses in the
email
column daily at 8:00 AM - the gender (F or M) in the
gender
column daily at 8:00 AM - the age to be within the 10-80 range in the
age
column daily at 8:00 AM (set but temporarily disabled).
SOLUTION
We will set six data quality recurring checks on bigquery-public-data.thelook_ecommerce.users
dataset:
-
daily_table_availability check on
users
table with max failures thresholds levels:- warning: 1
- error: 5
- fatal: 10
-
daily_distinct_percent check on
id
column with minimum percent thresholds levels- warning: 99.0%
- error: 98.0%
- fatal: 95.0%
-
daily_distinct_percent check on
email
column with minimum percent thresholds levels- warning: 99.0%
- error: 98.0%
- fatal: 95.0%
-
daily_valid_email_percent check on
email
column with minimum percent thresholds levels- warning: 99.0%
- error: 98.0%
- fatal: 95.0%
-
daily_string_value_in_set_percent check on
gender
column with values parameters "F" and "M" and minimum percent thresholds levels- warning: 99.0%
- error: 98.0%
- fatal: 95.0%
-
daily_values_in_range_numeric_percent check on
age
column with values parameters "min_value=10" and "max_value=80" and minimum percent thresholds levels- warning: 99.0%
- error: 98.0%
- fatal: 95.0%
To set a schedule, we will create a connection level schedule that will apply to all enabled checks at 8:00 AM. For two checks that require more frequent runs, we will set individual check-level schedules that will override the connection level settings. The schedule for the daily_values_in_range_numeric_percent check will be temporarily disabled.
If you want to learn more about checks and threshold levels, please refer to the DQO concept section. You can read more about scheduling here.
VALUE
By setting up a schedule, data can be constantly monitored, and errors can be triggered to alert about potential issues with the data quality or operational processes.
Data structure
The following is a fragment of the bigquery-public-data.thelook_ecommerce.users
dataset. Some columns were omitted for clarity.
id | first_name | last_name | age | gender | created_at | |
---|---|---|---|---|---|---|
1543 | Bob | Caldwell | bobcaldwell@example.net | 61 | M | 3/9/2021 15:49:00 |
2586 | Brittany | Flores | brittanyflores@example.org | 40 | F | 1/24/2021 1:25:00 |
38699 | Christina | Johnson | christinajohnson@example.com | 63 | F | 6/7/2021 6:48:00 |
49224 | Latoya | Bird | latoyabird@example.com | 56 | F | 6/22/2019 5:54:00 |
57933 | Charles | Martin | charlesmartin@example.net | 28 | M | 7/25/2022 8:11:00 |
58471 | Amanda | Collins | amandacollins@example.com | 45 | F | 8/24/2019 4:56:00 |
73890 | Jose | Marsh | josemarsh@example.net | 39 | M | 11/21/2022 6:21:00 |
82328 | Diane | Leonard | dianeleonard@example.net | 62 | F | 12/12/2021 6:47:00 |
84680 | Mitchell | Williams | mitchellwilliams@example.net | 13 | M | 3/1/2019 12:36:00 |
YAML configuration file
The YAML connection configuration file stores data source configurations and allows setting connection-level schedule.
In the YAML data source configuration file below, the highlighted sections indicate the area where the cron expression for the schedule is set to run every day at 8:00 AM (0 8 * ).
If you want to learn more about cron formatting, please refer to the Working with DQO section.
# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/ConnectionYaml-schema.json
apiVersion: dqo/v1
kind: source
spec:
provider_type: bigquery
bigquery:
source_project_id: bigquery-public-data
authentication_mode: google_application_credentials
schedules:
recurring_daily:
cron_expression: 0 8 * * *
incident_grouping:
grouping_level: table_dimension_category
minimum_severity: warning
max_incident_length_days: 60
mute_for_days: 60
In the YAML table configuration file below, the highlighted section indicate that the table-level daily_table_availability
check is set with schedule_override section. This means that this check will be run every 30 min (/30 * * *).
The id
column also includes an updated schedule on a daily_distinct_percent check. This check will run 15 min past every hour.
The remaining checks will be run every day at 8:00 AM (0 8 * ) as indicated in the YAML connection configuration mentioned earlier.
It is important to keep in mind that the daily_values_in_range_numeric_percent check will not be run since the "disabled:"
parameter has been set to "true" in the "schedule_override" section.
# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
incremental_time_window:
daily_partitioning_recent_days: 7
monthly_partitioning_recent_months: 1
recurring_checks:
daily:
availability:
daily_table_availability:
schedule_override:
cron_expression: '*/30 * * * *'
warning:
max_failures: 1
error:
max_failures: 5
fatal:
max_failures: 10
columns:
id:
type_snapshot:
column_type: INT64
nullable: true
recurring_checks:
daily:
uniqueness:
daily_distinct_percent:
schedule_override:
cron_expression: 15 * * * *
warning:
min_percent: 99.0
error:
min_percent: 98.0
fatal:
min_percent: 95.0
first_name:
type_snapshot:
column_type: STRING
nullable: true
last_name:
type_snapshot:
column_type: STRING
nullable: true
email:
type_snapshot:
column_type: STRING
nullable: true
recurring_checks:
daily:
uniqueness:
daily_distinct_percent:
warning:
min_percent: 99.0
error:
min_percent: 98.0
fatal:
min_percent: 95.0
pii:
daily_valid_email_percent:
warning:
min_percent: 99.0
error:
min_percent: 98.0
fatal:
min_percent: 95.0
age:
type_snapshot:
column_type: INT64
nullable: true
recurring_checks:
daily:
numeric:
daily_values_in_range_numeric_percent:
schedule_override:
disabled: true
parameters:
min_value: 10.0
max_value: 80.0
warning:
min_percent: 99.0
error:
min_percent: 98.0
fatal:
min_percent: 95.0
gender:
type_snapshot:
column_type: STRING
nullable: true
recurring_checks:
daily:
strings:
daily_string_value_in_set_percent:
parameters:
values:
- F
- M
warning:
min_percent: 99.0
error:
min_percent: 98.0
fatal:
min_percent: 95.0
state:
type_snapshot:
column_type: STRING
nullable: true
street_address:
type_snapshot:
column_type: STRING
nullable: true
postal_code:
type_snapshot:
column_type: STRING
nullable: true
city:
type_snapshot:
column_type: STRING
nullable: true
country:
type_snapshot:
column_type: STRING
nullable: true
latitude:
type_snapshot:
column_type: FLOAT64
nullable: true
longitude:
type_snapshot:
column_type: FLOAT64
nullable: true
traffic_source:
type_snapshot:
column_type: STRING
nullable: true
created_at:
type_snapshot:
column_type: TIMESTAMP
nullable: true
Evaluating the schedule settings using the graphical interface
To evaluate schedule settings that have been configured in YAML files open the graphical interface
To view the connection-level schedule:
-
Go to the Data Source section and select the "thelook_ecommerce" connection from the tree view on the left.
-
In the main workspace select the Schedule tab and the Recurring Daily tab. Here, you can see that a schedule has been set to "Run every day at 8:00".
In the example, we want to verify the availability of the users
table every 30 minutes using the daily_table_availability check.
DQO allows to set check-level schedules and override the connection level settings.
To view and modify individual check-level schedules:
-
Go to the section with a list of checks. In our example, we have set recurring checks, so go to the Recurring Checks section.
-
In the main workspace select the table "users" from the tree view on the left.
-
Click on the Settings icon to the left of the daily_table_availability check name.
-
Select the Schedule Override tab to see that the check-level schedule is set to run every 30 minutes.
Starting a scheduler using the graphical interface
To start a scheduler using the graphical interface, simply select the Notification icon in the upper right corner and toggle the button next to the Jobs scheduler.
The scheduler has been initiated and will run checks based on the set schedules. Data synchronization will take place every 10 minutes.
Starting a scheduler using DQO Shell
To initiate a scheduler in the DQO Shell, simply enter the command scheduler start
. To stop the scheduler, use the
command scheduler stop
.
For further information on the scheduler
commands, please refer to the Command-line interface section.
Scheduler can also be started in a server mode that continuously run a job scheduler and synchronize the data every 10 minutes. To do this, simply enter the command below in your terminal:
To terminate dqo running in the background, simply use the Ctrl+C.For more information on the run
command, please refer to the Command-line interface section.
Running the checks in the example and evaluating the results
The detailed explanation of how to run the example is described here.
Even if we have started scheduler we can run all the checks to verify the results. To execute the check prepared in the example, run the following command in DQO Shell:
You should see the results as the one below. Results from all checks appear to be valid except for one that indicates a fatal error.
Check evaluation summary per table:
+-----------------+-----------------------+------+--------------+-------------+--------+------+------------+----------------+
|Connection |Table |Checks|Sensor results|Valid results|Warnings|Errors|Fatal errors|Execution errors|
+-----------------+-----------------------+------+--------------+-------------+--------+------+------------+----------------+
|thelook_ecommerce|thelook_ecommerce.users|6 |6 |5 |0 |0 |1 |0 |
+-----------------+-----------------------+------+--------------+-------------+--------+------+------------+----------------+
For a more detailed insight of how the check is run, you can initiate the check in debug mode by executing the following command, as in the other examples:
Next step
Now that you have set up a schedule and get first results, you can evaluate them on dashboards. You can find instructions on how to do this here.