Editing DQOps YAML files with Visual Studio Code
YAML schema overview
In DQOps, the configuration of data quality checks is defined in YAML files. YAML is a human-readable data serialization language that is often used for writing configuration files.
Defining data quality checks in YAML files makes it possible to store the data quality check configuration in a source code repository. The data quality checks can be versioned along with any other pipeline code or machine learning code.
When DQOps YAML files are edited in Visual Studio Code, syntax highlighting and code completion are supported.
To enable support for the DQOps file schema in Visual Studio Code, the first line of the YAML file must contain a reference to a YAML file schema that is published by DQOps, as shown in the following example. Visual Studio Code uses the schema file for validation and syntax highlighting.
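A minimal sketch of such a first line is shown here. The `# yaml-language-server: $schema=...` comment is the modeline form recognized by the Red Hat YAML extension. The schema URL and the `apiVersion` value are written from memory of the DQOps documentation and should be treated as assumptions to verify against the list of schema files.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
# The URL above is an assumption; take the exact value from the DQOps list of schema files.
apiVersion: dqo/v1   # assumed value, validated by DQOps
kind: table          # validated by DQOps
spec:
  columns: {}        # column definitions go here
```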
Because DQOps validates the apiVersion and the kind fields, the best way to create a new DQOps-compliant YAML file is to create it in the DQOps user interface and edit it in Visual Studio Code later.
List of schema files
The URLs of all supported schema files are listed below. The file paths are relative to the DQOps user home folder.
Preparing Visual Studio Code
Before Visual Studio Code can use the DQOps YAML schema for validation and syntax highlighting, a few extensions must be installed.
First, install the YAML extension by Red Hat, as shown in the following example.
An extension with Jinja2 syntax support is also worth installing, because DQOps sensors are defined as Jinja2 templates of SQL queries and are easier to edit with it.
The following screenshot shows how a .dqotable.yaml file appears in Visual Studio Code when the YAML schema support is enabled.
To add a new table-level node, insert an empty line and place the cursor at the column where a sibling or a child node should be added.
Press CTRL+Space to expand the code completion dialog, which shows the elements available at this level.
The above example shows how to add a comments section at the table level. Comments are used to track changes inside the YAML file and are presented in the DQOps user interface.
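As a rough illustration, a table-level comments section might look like the sketch below. The field names `date`, `comment_by`, and `comment`, and all of the values, are assumptions made for this example. Code completion in Visual Studio Code shows the names the schema actually accepts.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  comments:
    # Each list entry is one comment; field names are assumed, values are made up.
    - date: 2024-01-15T10:00:00
      comment_by: data_engineer
      comment: "Raised the row count threshold after the January backfill."
```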
Add a check category
Data quality checks are grouped in categories as described in the configuring checks article.
Categories are added inside the check type nodes. Those nodes are available both at the table root level, to configure table-level checks, and below a named column, to configure column-level checks.
When the cursor is within one of the nodes for check types, press CTRL+Space to expand the list of categories available for that check type.
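The nesting of check types and categories can be sketched as follows. The node names (`profiling_checks`, `monitoring_checks`, `daily`, `volume`, `nulls`) and the column name `customer_id` are assumptions used for illustration. The code completion list is the authoritative source for the names available at each level.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  profiling_checks:        # check type node at the table root level
    volume: {}             # category node, still empty
  monitoring_checks:
    daily:                 # assumed time scale node below the check type
      volume: {}
  columns:
    customer_id:           # hypothetical column name
      profiling_checks:    # the same check type node below a named column
        nulls: {}          # column-level category node
```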
Add a data quality check
The data quality checks are defined below the category nodes.
Place the cursor below the category node, adding the required indentation. Press CTRL+Space to see a list of checks available within that category.
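A check placed below a category node might look like this sketch. The check names `profile_row_count` and `profile_nulls_count` and the column name are assumptions. The checks are left empty here, so they carry no alerting thresholds yet.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  profiling_checks:
    volume:
      profile_row_count: {}        # table-level check inside the volume category (assumed name)
  columns:
    customer_id:                   # hypothetical column name
      profiling_checks:
        nulls:
          profile_nulls_count: {}  # column-level check inside the nulls category (assumed name)
```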
You may notice an extra node, custom_checks, that is suggested at this level. The custom_checks node is a dictionary of custom checks defined by the user within that category. When configuring a custom check, the element that is added below the custom_checks node is the name of the custom check. The name must match the name of a custom check defined in the Configuration section in the DQOps user interface.
Code completion for custom checks is limited to the structure. The possible values of the sensor's and rule's parameters are not validated.
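The shape of a custom check entry can be sketched as below. The check name `my_custom_row_check` and the rule parameter `min_count` are hypothetical. The real name must match a custom check defined in the Configuration section, and the parameters depend on the sensor and rule that the custom check uses.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  profiling_checks:
    volume:
      custom_checks:
        my_custom_row_check:   # hypothetical name of a user-defined check
          error:
            min_count: 1       # hypothetical rule parameter, not validated by the schema
```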
Adding an alerting threshold
The data quality issue alerting thresholds are defined by configuring data quality rules. The data quality rules to configure are:
warning: configures the least severe, warning severity rule
error: configures a regular, error severity rule
fatal: configures a fatal severity rule used to pause a data pipeline
Place the cursor below the check node, adding the required indentation. Press CTRL+Space to see a list of available nodes.
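The three severity levels can be combined on one check, as in the following sketch. The check name `profile_row_count`, the rule parameter `min_count`, and the threshold values are assumptions chosen for illustration.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  profiling_checks:
    volume:
      profile_row_count:     # assumed check name
        warning:
          min_count: 1000    # warning severity issue below 1000 rows
        error:
          min_count: 100     # error severity issue below 100 rows
        fatal:
          min_count: 1       # fatal severity issue when the table is empty
```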
Each data quality check has several parameters that allow further customization of the check. The following elements are supported, in addition to the data quality issue thresholds presented before.
comments: supports managing a list of comments that can be used to track changes within the file and in the DQOps UI.
data_grouping: supports configuring a custom data grouping for that check. Data grouping adds a GROUP BY clause to the SQL queries, capturing multiple check results, one for each group of rows.
disabled: temporarily disables a configured check from running, while preserving its configuration in the YAML file.
exclude_from_kpi: a boolean flag. When it is set to true, the reverse value, false, is stored in the include_in_kpi field of the check_results parquet table, and the result of this check is not counted in the data quality KPI.
include_in_sla: a boolean flag whose value is stored in the include_in_sla field of the check_results parquet table. Data quality SLAs can be used to group data quality checks that must pass to meet a Data Contract.
parameters: an important node that contains the data quality sensor's parameters. Not all sensors used by data quality checks have parameters, so this node does not always need to be configured.
quality_dimension: a text field used to override the default value of the data quality dimension stored in the parquet tables. Changing the default data quality dimension name makes it possible to report some issues under a different dimension.
schedule_override: a configuration of the CRON schedule for a single data quality check. The check can be configured to run on its own schedule, more or less frequently than the default scheduling configuration at the table or connection level.
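Several of these elements are combined on one check in the sketch below. The check name `daily_row_count`, the `cron_expression` field inside `schedule_override`, and all values are assumptions for illustration. The `data_grouping` and `parameters` nodes are left out because their content depends on the table and on the sensor.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  monitoring_checks:
    daily:
      volume:
        daily_row_count:                  # assumed check name
          disabled: true                  # keep the configuration, skip execution
          exclude_from_kpi: true          # stored as include_in_kpi = false
          include_in_sla: true            # stored in the include_in_sla field
          quality_dimension: Timeliness   # example override of the default dimension
          schedule_override:
            cron_expression: "0 6 * * *"  # assumed field name; runs daily at 06:00
          error:
            min_count: 1                  # assumed rule parameter
```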
Configuring rule parameters
Most data quality rules have parameters. The parameters are configured within the rule's node.
After pressing CTRL+Space inside the warning, error, or fatal nodes, Visual Studio Code will show the available parameters.
The following example shows how to configure rules for the table availability check.
The value assigned to the rule parameter is shown below.
When max_failures is set to 0 in the table availability check, DQOps raises a data quality issue as soon as an error is detected while running a special availability testing query on the table.
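A possible configuration of the table availability check is sketched here. The `max_failures` parameter comes from the text above, while the category name `availability`, the check name `profile_table_availability`, and the values for the error and fatal rules are assumptions.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  profiling_checks:
    availability:                   # assumed category name
      profile_table_availability:   # assumed check name
        warning:
          max_failures: 0           # raise a warning on the first failed availability query
        error:
          max_failures: 5           # illustrative value
        fatal:
          max_failures: 10          # illustrative value
```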
All nodes defined in the DQOps YAML schema are documented, which makes it possible to preview the definitions of the data quality checks and their parameters directly in Visual Studio Code.
The help is shown when the mouse cursor hovers over a node for a moment.
Invalid syntax highlighting
When a node added to a YAML file is invalid and not included in the DQOps YAML schema, Visual Studio Code will underline the invalid node as shown below.
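As an illustration, the sketch below contains a deliberately misspelled rule node, `warnin`, which is not a rule name described in this article. Assuming the schema only accepts the documented nodes at this level, Visual Studio Code would be expected to underline it. The check name and the rule parameter are assumptions.

```yaml
apiVersion: dqo/v1   # assumed value
kind: table
spec:
  profiling_checks:
    volume:
      profile_row_count:   # assumed check name
        warnin:            # misspelled on purpose; the intended node is "warning"
          min_count: 1     # assumed rule parameter
```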