Editing DQOps YAML files with Visual Studio Code

YAML schema overview

In DQOps, the configuration of data quality checks is defined in YAML files. YAML is a human-readable data serialization language that is often used for configuration files.

Defining data quality checks in YAML files makes it possible to store the data quality check configuration in a source code repository. The data quality checks can be versioned along with any other pipeline code or machine learning code.

When DQOps YAML files are edited in Visual Studio Code, syntax highlighting and code completion are supported.

To enable support for the DQOps file schema in Visual Studio Code, the first line of the YAML file must reference a YAML schema file published by DQOps, as shown in the following example. Visual Studio Code uses the schema file for validation and syntax highlighting.

# yaml-language-server: $schema=
apiVersion: dqo/v1
kind: table
              max_percent: 1.0
        - This is the column that is analyzed for data quality issues

Because DQOps validates the apiVersion and kind fields, the best way to create a new DQOps-compliant YAML file is to create it in the DQOps user interface and edit it in Visual Studio Code afterwards.
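The example above is abbreviated, so the following is a minimal sketch of what a complete table file might look like. It is illustrative only. The schema URL is left as a placeholder because it is not given here, and the `spec`, `columns`, and `labels` nodes, the column name `target_column`, the `nulls` category, and the `profile_nulls_percent` check are assumptions inferred from the indentation of the fragment, not values confirmed by this page.

```yaml
# yaml-language-server: $schema=<URL of the DQOps table schema>
apiVersion: dqo/v1
kind: table
spec:                              # assumed root node for the table definition
  columns:
    target_column:                 # hypothetical column name
      profiling_checks:
        nulls:                     # assumed check category
          profile_nulls_percent:   # assumed check name
            error:
              max_percent: 1.0
      labels:                      # assumed parent of the list item below
        - This is the column that is analyzed for data quality issues
```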

List of schema files

The URLs of all supported schema files are listed below. The file paths are relative to the DQOps user home folder.

Type                                  File location                                         YAML schema URL
Data source connection                sources/<connection>/connection.dqoconnection.yaml
Custom check definition               checks/**/*.dqocheck.yaml
Custom dashboard list                 settings/dashboardslist.dqodashboards.yaml
Default notification webhooks         settings/defaultnotifications.dqonotifications.yaml
Default data observability checks     settings/defaultchecks.dqochecks.yaml
Default CRON schedules                settings/defaultschedules.dqoschedules.yaml
Local settings                        .localsettings.dqosettings.yaml
Database specific sensor definition   sensors/**/*.dqoprovidersensor.yaml
Data quality rule definition          rules/*/*.dqorule.yaml
Sensor definition                     sensors/**/*.dqosensor.yaml
Table schema                          sources/<connection>/*.dqotable.yaml

Preparing Visual Studio Code

Before Visual Studio Code can use the DQOps YAML schema for validation and syntax highlighting, a few extensions must be installed.

First, install the YAML extension by Red Hat, as shown in the following screenshot.

YAML extension

If you intend to create custom data quality sensors, please also install the Better Jinja extension by Samuel Colvin.

DQOps sensors are defined as Jinja2 templates of SQL queries and are easier to edit with this extension.

Better Jinja extension

Code completion

The following screenshot shows how a .dqotable.yaml file appears in Visual Studio Code when YAML schema support is enabled.

Validated DQOps YAML file in Visual Studio Code

Table-level elements

To add a new table-level node, insert an empty line and place the cursor at the column where the sibling or child node should start.

Press CTRL+Space to expand the code completion dialog, showing available elements at this level.

Edit DQOps table YAML in VS Code

The above example shows how to add a comments section at the table level. Comments are used to track changes inside the YAML file and are presented in the DQOps user interface.
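As a rough sketch, a table-level comments section could look like the fragment below. The field names `date`, `comment_by`, and `comment`, the placement under `spec`, and all values are assumptions made for illustration. The code completion list in Visual Studio Code shows the names the schema actually accepts.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  comments:
    # Each list item is one comment entry; field names are assumed.
    - date: 2024-01-15T09:30:00
      comment_by: data_engineer        # hypothetical user name
      comment: "Raised the null percentage threshold after the source migration."
```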

Add a check category

Data quality checks are grouped in categories as described in the configuring checks article.

The check type nodes that contain categories are listed below. These nodes are available both at the table root level, to configure table-level checks, and below a named column, to configure column-level checks.

  • profiling_checks
  • monitoring_checks.daily
  • monitoring_checks.monthly
  • partitioned_checks.daily
  • partitioned_checks.monthly

When the cursor is within one of the nodes for check types, press CTRL+Space to expand the list of categories available for that check type.

Expanding list of DQOps check categories in VS Code
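The dotted names in the list above map to nested YAML nodes. The sketch below shows one possible layout with check types at both the table level and the column level. The column name `customer_id`, the category names `volume` and `nulls`, the check names, and the `min_count` rule parameter are assumptions used only to make the nesting visible.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  # Table-level check types sit at the table root.
  profiling_checks:
    volume:                          # assumed category
      profile_row_count:             # assumed check name
        warning:
          min_count: 1               # assumed rule parameter
  monitoring_checks:
    daily:                           # monitoring_checks.daily
      volume:
        daily_row_count:             # assumed check name
          warning:
            min_count: 1
  columns:
    customer_id:                     # hypothetical column
      # The same check type nodes are repeated below a named column.
      monitoring_checks:
        monthly:                     # monitoring_checks.monthly
          nulls:                     # assumed category
            monthly_nulls_percent:   # assumed check name
              error:
                max_percent: 1.0
```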

Add a data quality check

The data quality checks are defined below the category nodes.

Place the cursor below the category node, adding the required indentation. Press CTRL+Space to see a list of checks available within that category.

Adding data quality checks in DQOps using VS Code

You may notice an extra node, custom_checks, suggested at this level. The custom_checks node is a dictionary of custom checks defined by the user within that category. When configuring a custom check, the element added below custom_checks is the name of the custom check. The name must match the name of a custom check defined in the Configuration section of the DQOps user interface. Code completion for custom checks is limited to the structure. The possible parameter values of the sensor and the rule are not validated.
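To make the placement concrete, the sketch below puts one built-in check and one custom check inside the same category. The column name, the `nulls` category, the `profile_nulls_percent` check, the custom check name `my_custom_null_check`, and the `max_value` rule parameter are all hypothetical. A real custom check must use the name and rule parameters defined for it in DQOps.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  columns:
    customer_id:                     # hypothetical column
      profiling_checks:
        nulls:                       # assumed category
          profile_nulls_percent:     # assumed built-in check
            error:
              max_percent: 1.0
          custom_checks:
            my_custom_null_check:    # hypothetical custom check name
              error:
                max_value: 0         # hypothetical, not validated by the schema
```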

Adding an alerting threshold

The data quality issue alerting thresholds are defined by configuring data quality rules. The data quality rules to configure are:

  • warning configures the least severe, warning severity rule
  • error configures a regular, error severity rule
  • fatal configures a fatal severity rule used to pause a data pipeline

Place the cursor below the check node, adding the required indentation. Press CTRL+Space to see a list of available nodes.

Adding error severity rule in DQOps using VS Code
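The three severity levels can be configured side by side on one check, as in the sketch below. The `max_percent` parameter comes from the example at the top of this page, while the column name, the category, the check name, and the threshold values are assumptions chosen for illustration.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  columns:
    customer_id:                     # hypothetical column
      profiling_checks:
        nulls:                       # assumed category
          profile_nulls_percent:     # assumed check name
            warning:
              max_percent: 1.0       # least severe threshold
            error:
              max_percent: 5.0       # regular data quality issue
            fatal:
              max_percent: 30.0      # severe enough to pause a data pipeline
```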

Each data quality check has several parameters that customize the check further. The following elements are supported, in addition to the data quality issue thresholds presented above.

  • comments supports managing a list of comments that are usable to track changes within the file and in the DQOps UI
  • data_grouping supports configuring a custom data grouping for that check. Data grouping adds a GROUP BY clause to the SQL queries, capturing multiple check results, separately for each group of rows.
  • disabled temporarily prevents a configured check from running, while preserving its configuration in the YAML file
  • exclude_from_kpi is a boolean flag. When it is set to true, the inverse value, false, is stored in the include_in_kpi field of the check_results parquet table, and the result of this check is not counted in the data quality KPI.
  • include_in_sla is a boolean flag whose value is stored in the include_in_sla field of the check_results parquet table. Data quality SLAs can be used to group data quality checks that must pass to meet a Data Contract.
  • parameters is an important node that contains the data quality sensor's parameters. Not all sensors used by data quality checks have parameters, so the node does not always need to be configured.
  • quality_dimension is a text field used to override the default value of the data quality dimension stored in the parquet tables. Changing the default data quality dimension name makes it possible to report some issues under a different dimension.
  • schedule_override is a configuration of the CRON schedule for a single data quality check. The check could be configured to run using its own schedule, more or less frequently than the default scheduling configuration at the table or connection levels.
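The sketch below combines several of these elements on a single check. It is an illustration under assumptions, not a verified configuration: the check name, the grouping name `by_country`, the `cron_expression` field name, and all values are hypothetical, and `data_grouping` is assumed to take the name of a data grouping configured elsewhere. The `parameters` node is omitted because the sensor behind this assumed check is taken to have no parameters.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  columns:
    customer_id:                       # hypothetical column
      monitoring_checks:
        daily:
          nulls:                       # assumed category
            daily_nulls_percent:       # assumed check name
              disabled: false          # set to true to pause the check
              exclude_from_kpi: true   # stored as include_in_kpi = false
              include_in_sla: true
              quality_dimension: Validity      # overrides the default dimension
              data_grouping: by_country        # hypothetical grouping name
              schedule_override:
                cron_expression: "0 6 * * *"   # assumed field name
              error:
                max_percent: 5.0
```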

Configuring rule parameters

Most of the data quality rules have parameters. The list of parameters is expanded within the rule's node.

After pressing CTRL+Space inside the warning, error, or fatal nodes, Visual Studio Code will show the available parameters.

The following example shows how to configure rules for the table availability check.

Configuring data quality rule parameters in DQOps using VS Code

The value assigned to the rule parameter is shown below.

When max_failures is set to 0 in the table availability check, DQOps raises a data quality issue as soon as an error is detected while running a special availability testing query on the table.

Configuring table availability check in DQOps using VS Code
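In YAML form, the configuration described above might look like the following sketch. The `max_failures` parameter is taken from the text, while the `availability` category, the `daily_table_availability` check name, and the error and fatal thresholds are assumptions for illustration.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  monitoring_checks:
    daily:
      availability:                  # assumed category
        daily_table_availability:    # assumed check name
          warning:
            max_failures: 0          # raise an issue on the first failure
          error:
            max_failures: 1          # illustrative threshold
          fatal:
            max_failures: 5          # illustrative threshold
```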

Getting help

All nodes defined in the DQOps YAML schema are documented, so the definitions of the data quality checks and their parameters can be previewed directly in Visual Studio Code.

The help is shown when the mouse cursor hovers over a node for a moment.

Data quality check documentation preview in VS Code

Invalid syntax highlighting

When a node added to a YAML file is invalid and not included in the DQOps YAML schema, Visual Studio Code will underline the invalid node as shown below.

Syntax issues when configuring data quality check with DQOps using VS Code
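As an illustration of the kind of mistake that gets underlined, the fragment below contains a deliberately misspelled rule parameter. The file is still well-formed YAML, so only schema validation can catch the error. The column, category, and check names are assumptions.

```yaml
apiVersion: dqo/v1
kind: table
spec:
  columns:
    customer_id:                     # hypothetical column
      profiling_checks:
        nulls:                       # assumed category
          profile_nulls_percent:     # assumed check name
            error:
              max_percnt: 1.0        # intentional typo of max_percent, not in the schema
```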

