Editing DQOps YAML files with Visual Studio Code
YAML schema overview
In DQOps, the configuration of data quality checks is defined in YAML files. YAML is a human-readable data serialization language that is often used for writing configuration files.
Defining data quality checks in YAML files makes it possible to store the data quality check configuration in a source code repository. The data quality checks can be versioned along with any other pipeline code or machine learning code.
When DQOps YAML files are edited in Visual Studio Code, syntax highlighting and code completion are supported.
To enable support for the DQOps file schema in Visual Studio Code, the first line of the YAML file must reference a YAML file schema published by DQOps, as shown in the following example. Visual Studio Code uses the schema file for validation and syntax highlighting.
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    target_column:
      profiling_checks:
        nulls:
          nulls_percent:
            warning:
              max_percent: 1.0
      labels:
        - This is the column that is analyzed for data quality issues
Because DQOps validates the apiVersion and the kind fields, the best way to create a new DQOps-compliant YAML file is to create it in the DQOps user interface and edit it in Visual Studio Code later.
List of schema files
The URLs of all supported schema files are listed below. The file paths are relative to the DQOps user home folder.
Preparing Visual Studio Code
Before Visual Studio Code can use the DQOps YAML schema for validation and syntax highlighting, a few extensions must be installed.
First, please install the YAML extension by Red Hat as shown in the following example.
If you intend to create custom data quality sensors, please also install the Better Jinja extension by Samuel Colvin.
DQOps sensors are defined as Jinja2 templates of SQL queries and are easier to edit with this extension.
Code completion
The following screenshot shows how a .dqotable.yaml file looks in Visual Studio Code when the YAML schema support is enabled.
Table-level elements
To add a new table-level node, insert an empty line and place the cursor at the column where a sibling or child node should start. Press CTRL+Space to expand the code completion dialog, which shows the elements available at this level.
The above example shows how to add a comments section at the table level. Comments are used to track changes inside the YAML file and are presented in the DQOps user interface.
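As a rough illustration, a table-level comments section could look like the sketch below. The entry field names (date, comment_by, comment), the author name, and the comment text are assumptions made for this example. The code completion dialog in Visual Studio Code shows the names that the schema actually accepts.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  # Table-level comments node, a sibling of the columns node
  comments:
    # The field names and values below are assumed for illustration
    - date: 2024-01-15T10:00:00
      comment_by: data.steward
      comment: "Raised the nulls_percent warning threshold for target_column."
```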
Add a check category
Data quality checks are grouped in categories as described in the configuring checks article.
The check type nodes that contain categories are listed below. These nodes are available both at the table root level, to configure table-level checks, and below a named column, to configure column-level checks.
profiling_checks
monitoring_checks.daily
monitoring_checks.monthly
partitioned_checks.daily
partitioned_checks.monthly
When the cursor is within one of the check type nodes, press CTRL+Space to expand the list of categories available for that check type.
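The dotted names in the list above map to nested YAML nodes. The skeleton below is a minimal sketch of where those nodes sit at the table level and below a column. The empty `{}` values are placeholders that mark where categories would be added, and target_column is only an example column name.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  # Check type nodes at the table root level (table-level checks)
  profiling_checks: {}
  monitoring_checks:
    daily: {}
    monthly: {}
  partitioned_checks:
    daily: {}
    monthly: {}
  columns:
    target_column:
      # The same check type nodes below a named column (column-level checks)
      profiling_checks: {}
      monitoring_checks:
        daily: {}
        monthly: {}
      partitioned_checks:
        daily: {}
        monthly: {}
```

Pressing CTRL+Space inside any of the placeholder nodes, after replacing `{}` with an indented empty line, should list the categories for that check type.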
Add a data quality check
The data quality checks are defined below the category nodes.
Place the cursor below the category node, adding the required indentation. Press CTRL+Space to see a list of checks available within that category.
You may notice an extra node, custom_checks, suggested at this level. The custom_checks node is a dictionary of custom checks defined by the user within that category. When configuring a custom check, the element added below custom_checks is the name of the custom check. The name must match the name of a custom check defined in the Configuration section of the DQOps user interface.
Code completion for custom checks is limited to the structure. The possible values of the sensor and rule parameters are not validated.
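A custom check entry might be laid out as in the following sketch. The check name my_custom_null_check is hypothetical, and the max_percent rule parameter is an assumption, because the accepted parameters depend on the rule assigned to the custom check in DQOps.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    target_column:
      profiling_checks:
        nulls:
          custom_checks:
            # Hypothetical name; it must match a custom check defined in DQOps
            my_custom_null_check:
              warning:
                # Assumed parameter; not validated by the schema for custom checks
                max_percent: 1.0
```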
Adding an alerting threshold
The alerting thresholds for data quality issues are defined by configuring data quality rules. The data quality rules to configure are:
- warning: configures the least severe, warning severity rule
- error: configures a regular, error severity rule
- fatal: configures a fatal severity rule used to pause a data pipeline
Place the cursor below the check node, adding the required indentation. Press CTRL+Space to see a list of available nodes.
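The three severity levels can be combined on one check, as in the sketch below, which reuses the nulls_percent check and the max_percent parameter from the example at the top of this page. The threshold values are arbitrary and chosen only for illustration.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    target_column:
      profiling_checks:
        nulls:
          nulls_percent:
            # Least severe threshold
            warning:
              max_percent: 1.0
            # Regular severity threshold
            error:
              max_percent: 5.0
            # Most severe threshold, used to pause a data pipeline
            fatal:
              max_percent: 30.0
```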
Each data quality check has several parameters that customize the check further. The following elements are supported, in addition to the data quality issue thresholds presented above.
- comments: manages a list of comments that track changes within the file and are shown in the DQOps UI.
- data_grouping: configures a custom data grouping for that check. Data grouping adds a GROUP BY clause to the SQL queries, capturing multiple check results separately for each group of rows.
- disabled: temporarily disables a configured check from running while preserving its configuration in the YAML file.
- exclude_from_kpi: a boolean flag. When it is set to true, DQOps stores the reverse value, false, in the include_in_kpi field of the check_results parquet table, so the result of this check is not counted in the data quality KPI.
- include_in_sla: a boolean flag whose value is stored in the include_in_sla field of the check_results parquet table. Data quality SLAs can be used to group data quality checks that must pass to meet a Data Contract.
- parameters: an important node that contains the data quality sensor's parameters. Not all sensors used by data quality checks have parameters, in which case the node does not need to be configured.
- quality_dimension: a text field that overrides the default value of the data quality dimension stored in the parquet tables. Changing the default data quality dimension name makes it possible to report some issues under a different dimension.
- schedule_override: configures a CRON schedule for a single data quality check. The check can run on its own schedule, more or less frequently than the default scheduling configuration at the table or connection level.
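A few of these elements are combined in the sketch below. The dimension name Completeness and the cron_expression child node of schedule_override are assumptions made for this example. Code completion inside each node shows the names that the schema accepts.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    target_column:
      profiling_checks:
        nulls:
          nulls_percent:
            # Keep the configuration in the file, but do not run the check
            disabled: true
            # Stored as include_in_kpi = false in the check_results table
            exclude_from_kpi: true
            # Stored as include_in_sla = true in the check_results table
            include_in_sla: true
            # Assumed dimension name, used only for illustration
            quality_dimension: Completeness
            schedule_override:
              # Assumed child node name; runs this check daily at 06:00
              cron_expression: "0 6 * * *"
            warning:
              max_percent: 1.0
```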
Configuring rule parameters
Most data quality rules have parameters. The list of parameters is expanded within the rule's node. After pressing CTRL+Space inside the warning, error, or fatal nodes, Visual Studio Code will show the available parameters.
The following example shows how to configure rules for the table availability check.
The value assigned to the rule parameter is shown below.
When max_failures is set to 0 in the table availability check, DQOps raises a data quality issue as soon as an error is detected while running a special availability testing query on the table.
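In YAML form, that configuration could look like the sketch below. The category name availability and the check name table_availability are assumptions based on the check's title. Only the max_failures parameter comes from the text above, so the exact node names should be confirmed with code completion.

```yaml
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  # Table-level check, so it sits at the table root and not below a column
  profiling_checks:
    # Assumed category and check names
    availability:
      table_availability:
        warning:
          # Raise an issue on the first failed availability query
          max_failures: 0
```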
Getting help
All nodes defined in the DQOps YAML schema are documented, which makes it possible to preview the definitions of data quality checks and their parameters directly in Visual Studio Code.
The help is shown when the mouse cursor hovers over a node for a moment.
Invalid syntax highlighting
When a node added to a YAML file is invalid and not included in the DQOps YAML schema, Visual Studio Code will underline the invalid node as shown below.
What's next
- Learn how to configure data quality checks in DQOps YAML files.
- Read how the YAML files are organized in the DQOps user home folder.