Running data quality checks
DQOps supports running data quality checks from the DQOps command-line shell, user interface, Python client and triggered directly by a call to a REST API endpoint run_checks.
Data quality checks can be also scheduled to run in regular intervals to continuously monitor the data sources. DQOps uses an internal job scheduler that is based on a popular Quartz library. The schedules are defined as CRON expressions compatible with the Linux CRON format.
Targeting data sources and tables
The data quality checks are configured on tables in the .dqotable.yaml table specification files. The structure of a simplified DQOps user home folder with one data source sales-dwh and one configured table public.fact_sales is shown below. We will run all data quality checks configured on that table.
- The target data source defined as the nested folder name inside the $DQO_USER_HOME/sources folder.
- The data source configuration file.
- The configuration of the data quality checks for the public.fact_sales table. The target table is identified by the file name. The file extension .dqotable.yaml identifies a table data quality specification file.
This DQOps user home folder has one data source sales-dwh that is identified as a folder name inside the $DQO_USER_HOME/sources folder. The DQOps user home stores the configuration of one table public.fact_sales. The configuration of the data quality checks for this table is stored in the file with a .dqotable.yaml extension. The name of the target table is not defined inside the public.fact_sales.dqotable.yaml. Instead, the target table is identified by the file name, followed by the .dqotable.yaml file name extension.
- The target data source name. Supports patterns such as name_prefix_*.
- The target table name provided as a schema.table name. The name supports patterns both in the schema name part and in the table name part. To run checks in all tables inside a schema, use target_schema.*. Other examples of supported patterns are: schema_prefix_*.target_table, *.fact_*.
The parameters passed to the check run command are:
||Data source name, matches the folder name inside the $DQO_USER_HOME/sources folder.||--connection=sales-dwh|
||The target table name in the format:
Targeting multiple tables
The next example shows a DQOps user home with multiple tables, sharing the same table name prefix.
- The target data source
- The first table that matches the public.dim_* pattern
- The second table that matches the public.dim_* pattern
Running all data quality checks for all tables in the public schema that are dimension tables (table names beginning with a dim_ prefix) is shown below.
- The target table name as a pattern that matches only the public schema and all tables whose names begin with a dim_ name prefix.
Selecting checks to run
DQOps supports also running checks only on selected columns, running only one data quality check on multiple tables or columns. We can also target one type of checks, profiling, monitoring or partitioned.
The check run command for running checks on a column, running only one check or all checks from one type is shown below.
- The data quality check name, for example daily_nulls_percent.
- The column name, supports also patters such as column_name_prefix_*, *_column_name_suffix or prefix*suffix.
- The data quality check type, supported values are profiling, monitoring and partitioned.
- The time scale for monitoring and partitioned checks. Supported values are daily and monthly.
- The category of checks to run.
A selection of the most important targeting filters is shown in the following table. The list of targeting filters is not limited to the filters described below. Please consult the check run command-line command to see the full list of supported targeting filters.
||The check name. All data quality checks are described in the Checks section.||--check=daily_nulls_percent|
||The target column name. This parameter supports patterns such as *_id.||--column=customer_id|
||The check type to run only checks of that type. Supported values are profiling, monitoring and partitioned.||--check-type=monitoring|
||The time scale for monitoring and partitioned checks. Supported values are daily and monthly.||--time-scale=daily|
||The name of the category of checks to run.||--category=nulls|
Running table and column checks
The following .dqotable.yaml example shows a column on which we want to run all the data quality checks.
When a column name is provided to filter the checks, only column-level checks are executed.
The table-level check are ignored. The
--column= filter will select column(s) to analyze.
The filter supports also patterns such as
- The column name, supports patterns.
Targeting checks by name
The following .dqotable.yaml example shows two configured data quality checks.
- The name of the data quality check to run. Other check names are listed in the Checks reference.
Targeting checks by type
Instead of running checks one by one, we can run all checks from one type of checks.
Running profiling checks
The next example .dqotable.yaml shows a mixture of profiling and monitoring checks. We want to run only profiling checks.
The type of checks is filtered using the
--check-type= parameter. In order to run only
profiling checks, we will use the following command.
- Selects the profiling checks to run.
Running monitoring checks
Instead of profiling checks, we can run only monitoring checks. If the time scale is not configured to select only daily or monthly checks, both types are run if they are configured.
The following example highlights the group of checks that we want to run.
The following command line runs all monitoring checks.
- Run the monitoring checks only.
Running daily monitoring checks
We can limit running the checks only to the daily monitoring checks, by providing both the
--time-scale= parameter that targets only daily or monthly checks.
The following YAML example highlights the daily monitoring checks that we want to run.
check run command accepts a
--time-scale=daily parameter that filters only by daily checks.
- Selects the daily checks. Another supported value is the monthly checks.
Running partitioned checks
Running partitioned checks to analyze partitions is similar to running monitoring checks. We can also select the time scale.
The following example highlights the partitioned checks that will be run.
The daily partitioned checks are run with the following command.
- Selects the partitioned checks.
Run all daily checks
DQOps supports any combination of targeting. For example, we can run only daily checks, which will target both monitoring and partitioned checks. Because the profiling checks are not using a time scale, they will be excluded.
The following YAML example highlights the daily checks that will be run.
check run command will use only the
--time-scale=daily parameter to run the daily checks.
- All types of daily checks are run.
Running table and column checks
We will use the previous example that showed daily monitoring checks configured both at the table and column levels.
The previous command that runs only daily monitoring checks will run both the table and column level checks.
Targeting a category of checks
We can also run all checks in a category, skipping other categories.
Filtering by category and check type
Please be aware that the same categories are present also in other types of checks (profiling, monitoring or partitioned), so using just the category name may not be enough. The check name were unique in the previous example of targeting a single check by its name, but the check type and optionally the check time scale should be specified to avoid running unexpected data quality checks.
The following example shows both checks defined in the nulls and in the schema category on a column. The highlighted section shows only the schema checks that we want to run on all columns.
We will use the
--category=schema filter to run the schema checks. Additionally, following the note above,
only daily monitoring checks will be run.
- The check category filter.