# DQOps user home
DQOps depends on the local file system to store its configuration files and data files. The folder on disk where these files are stored is called the DQOps user home.

The five most important kinds of files stored in the DQOps user home are:
- YAML files with the configuration of data quality checks enabled on tables, data source connection settings, and several other configuration files used for defining custom sensors, rules, and checks.
- Shared credentials, which are regular files, both text and binary, that can be referenced in the YAML files. They are synchronized to the DQOps Cloud Data Lake, but they are listed in the .gitignore file to prevent committing secrets and passwords to Git.
- Parquet files with a local copy of the data quality data lake. Storing all historical sensor readouts (metrics captured by the data quality checks) enables running anomaly detection rules locally, without using any SaaS cloud resources.
- Jinja2 SQL templates, which are templates of the SQL queries executed on the data sources to capture metrics. The Jinja2 SQL templates are either custom sensor definitions or overrides of built-in sensors, shadowing the original sensor template by using the same file name as the sensor definition in the DQOps home sensors folder.
- Python rules, which are custom Python functions called to evaluate sensor readouts and decide whether a value is valid. The Python rules in the DQOps user home folder are either custom rule definitions or overrides of built-in rules, shadowing the original rules by using the same file name as the rule definition in the DQOps home rules folder.
## DQOps user home location
When DQOps is started as a Python module, the current working folder becomes the DQOps user home. Alternatively, the $DQO_USER_HOME environment variable can be set to point to the DQOps user home location. The $DQO_USER_HOME environment variable is used throughout this documentation whenever the DQOps user home folder is referenced. It is advised to create a new, empty folder that will serve as the DQOps user home before starting DQOps.

When DQOps is started as a Docker image for production use, the DQOps user home folder should be mounted to the /dqo/userhome folder inside the DQOps Docker image, as described in the DQOps Docker installation manual.
## DQOps user home structure
DQOps initializes the DQOps user home folder on startup, creating empty folders and adding default files. The following folders and root-level files are created.
```
$DQO_USER_HOME
├───.DQO_USER_HOME(1)
├───.gitignore(2)
├───.localsettings.dqosettings.yaml(3)
├───.credentials(4)
├───.data(5)
├───.index(6)
├───.logs(7)
├───checks(8)
├───rules(9)
├───sensors(10)
├───settings(11)
└───sources(12)
```
1. A marker file that is created only to identify the DQOps user home root and confirm that the folder was fully initialized.
2. Git ignore file that lists the files and folders that should not be stored in Git:
    - The .localsettings.dqosettings.yaml file is ignored because it contains the DQOps Cloud Pairing Key.
    - The .data folder is ignored because it contains Parquet data files that change frequently.
    - The .credentials folder is ignored because it contains secrets and passwords.
    - The .index and .logs folders are ignored because they are only required by a local DQOps instance.
3. The .localsettings.dqosettings.yaml file contains settings that are private to the current DQOps instance and should not be stored in the Git repository or shared with other DQOps instances. The most important parameters in the local settings file are the DQOps Cloud Pairing Key and a local instance key used to sign DQOps API keys.
4. The .credentials folder stores secrets and passwords as regular text or binary files. This folder should not be committed to Git.
5. The .data folder is a local copy of the data quality data lake, storing all current and historical data quality results, statistics, execution errors, and incidents. The content of this folder is replicated to the DQOps Cloud Data Lake as documented in the DQOps architecture. The content of the folder is described in the data storage concept manual.
6. The .index folder is used internally by DQOps to track the file synchronization status between the local DQOps user home folder and the DQOps Cloud Data Lake. The files in this folder should not be modified manually.
7. The .logs folder stores error logs locally. The files in the folder are rotated to save space. If an error is reported when running DQOps, the content of this folder should be sent to DQOps support. Please review the --logging.* and --dqo.logging.* parameters passed to DQOps as entry point parameters to learn how to configure logging.
8. The checks folder stores the definitions of custom data quality checks.
9. The rules folder stores the definitions of custom and overwritten data quality rules.
10. The sensors folder stores the definitions of custom and overwritten data quality sensors.
11. The settings folder stores shared settings that can be committed to Git. The shared settings include the list of custom data quality dashboards and the default configuration of data observability checks that are applied to all imported data sources.
12. The sources folder is the most important folder in the DQOps user home. It is where DQOps stores the connection parameters to the data sources and the data quality check configurations for all monitored tables.
The files stored directly in the DQOps user home folder, and all its subfolders, are described in the table below.
| File or folder name | Description | Stored in Git |
|---------------------|-------------|:-------------:|
| .DQO_USER_HOME | A marker file that is created only to identify the DQOps user home root and confirm that the folder was fully initialized. | Yes |
| .gitignore | Git ignore file that lists the files and folders that should not be stored in Git. | Yes |
| .localsettings.dqosettings.yaml | Contains settings that are private to the current DQOps instance and should not be stored in the Git repository or shared with other DQOps instances. The most important parameters in the local settings file are the DQOps Cloud Pairing Key and a local instance key used to sign DQOps API keys. | No |
| .credentials | Stores secrets and passwords as regular text or binary files. | No |
| .data | A local copy of the data quality data lake, storing all current and historical data quality results, statistics, execution errors, and incidents. The content of this folder is replicated to the DQOps Cloud Data Lake as documented in the DQOps architecture. The content of the folder is described in the data storage concept manual. | No |
| .index | Used internally by DQOps to track the file synchronization status between the local DQOps user home folder and the DQOps Cloud Data Lake. The files in this folder should not be modified manually. | No |
| .logs | Stores error logs locally. The files in the folder are rotated to save space. If an error is reported when running DQOps, the content of this folder should be sent to DQOps support. Please review the --logging.* and --dqo.logging.* parameters passed to DQOps as entry point parameters to learn how to configure logging. | No |
| checks | Stores the definitions of custom data quality checks. | Yes |
| rules | Stores the definitions of custom and overwritten data quality rules. | Yes |
| sensors | Stores the definitions of custom and overwritten data quality sensors. | Yes |
| settings | Stores shared settings that can be committed to Git. The shared settings include the list of custom data quality dashboards and the default configuration of data observability checks that are applied to all imported data sources. | Yes |
| sources | The most important folder in the DQOps user home. It is where DQOps stores the connection parameters to the data sources and the data quality check configurations for all monitored tables. | Yes |
## Data sources
The data sources are defined in the sources folder as shown below.
```
$DQO_USER_HOME
├───...
└───sources(1)
    ├───prod-landing-zone(2)
    │   ├───connection.dqoconnection.yaml(3)
    │   └───...
    ├───prod-data-lake
    │   ├───connection.dqoconnection.yaml
    │   └───...
    └─...
```
1. The sources folder stores data sources as nested folders.
2. Each folder inside the sources folder is named after the connection name of a data source.
3. Each data source's folder contains a connection.dqoconnection.yaml file that specifies the connection parameters to the data source.
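For orientation, the sketch below shows the general shape of a connection.dqoconnection.yaml file. The provider type and connection parameters are illustrative assumptions (a hypothetical PostgreSQL data source); consult the DQOps YAML reference for the authoritative schema.

```yaml
# sources/prod-landing-zone/connection.dqoconnection.yaml -- illustrative sketch
apiVersion: dqo/v1
kind: source
spec:
  provider_type: postgresql   # assumed database type
  postgresql:                 # hypothetical connection parameters
    host: db.example.com
    port: "5432"
    database: landing_zone
    user: dqops_reader
```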
## Monitored tables
DQOps uses one file per monitored table. The file names use the .dqotable.yaml file extension, and the file name format is <schema_name>.<table_name>.dqotable.yaml. DQOps also encodes selected characters (including \, /, and %) as %AB, where AB is the hexadecimal ASCII code of the character. For example, a hypothetical table named daily%snapshot would be stored as <schema_name>.daily%25snapshot.dqotable.yaml.
The following example shows a folder structure with several tables.
```
$DQO_USER_HOME
├───...
└───sources
    ├───prod-data-lake
    │   ├───connection.dqoconnection.yaml
    │   ├───country_codes.country_codes.dqotable.yaml
    │   ├───crypto_dogecoin.blocks.dqotable.yaml
    │   ├───crypto_dogecoin.inputs.dqotable.yaml
    │   ├───crypto_dogecoin.outputs.dqotable.yaml
    │   └───<schema_name>.<table_name>.dqotable.yaml(1)
    └─...
```
1. The .dqotable.yaml files are named <schema_name>.<table_name>.dqotable.yaml.
Storing the configuration of the data quality checks in a file named after the table simplifies migrating the configuration between table versions or between environments. When a similar table is present in another data source or within the current data source, the whole .dqotable.yaml file can simply be copied and renamed.
The following scenarios are supported by copying the .dqotable.yaml file manually:

- Table definitions are moved between environments (production, test, development). The file with the configuration of data quality checks on the development environment is copied to another data source folder that is connected to the production environment.
- A similar table is created in the same data source, but in a different schema. The .dqotable.yaml file is copied and renamed to replace the schema_name part of the file name.
- Another similar table is created in the same schema. The .dqotable.yaml file is copied and renamed to replace the table_name part of the file name.
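A .dqotable.yaml file follows the same apiVersion/kind/spec layout as the other DQOps YAML files. The sketch below is a minimal, hypothetical example; the check type, category, check, and rule names are assumptions used only to show the shape of the file.

```yaml
# sources/prod-data-lake/crypto_dogecoin.blocks.dqotable.yaml -- illustrative sketch
apiVersion: dqo/v1
kind: table
spec:
  profiling_checks:        # assumed check type section
    volume:                # assumed check category
      profile_row_count:   # assumed built-in check
        error:
          min_count: 1     # hypothetical rule threshold
```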
## Shared credentials
Shared credentials are secrets and passwords that should not be stored in Git. Shared credentials are referenced in the connection.dqoconnection.yaml files by using a special field value format ${credential://file_name}, where file_name is a file name inside the .credentials folder.
```
$DQO_USER_HOME
├───...
├───.credentials
│   ├───db_password.txt(1)
│   └───GCP_application_default_credentials.json(2)
├───...
```
1. A sample shared credential named db_password.txt. It can be referenced in the connection.dqoconnection.yaml file as ${credential://db_password.txt}.
2. The default GCP application credentials file that is used by the BigQuery connector when GCP application default credentials are not available otherwise. It should be a key generated for a GCP service account, and that service account must have the permissions required to run queries on the monitored BigQuery dataset. This file must be created manually; it is not created during the DQOps user home initialization.
The example db_password.txt credential can be referenced with a ${credential://db_password.txt} expression in the connection.dqoconnection.yaml file, as shown in the sketch below.
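The following fragment illustrates such a reference. The provider type and connection fields are hypothetical assumptions; only the ${credential://...} syntax is taken from this documentation.

```yaml
# sources/<connection_name>/connection.dqoconnection.yaml -- illustrative fragment
apiVersion: dqo/v1
kind: source
spec:
  provider_type: postgresql                    # assumed database type
  postgresql:
    host: db.example.com                       # hypothetical host
    user: dqops_reader                         # hypothetical user
    password: ${credential://db_password.txt}  # resolved from .credentials/db_password.txt
```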
## Custom sensors
Custom data quality sensors are defined in the sensors folder. The folder structure is not strict, but it is advised to follow this order:

- the target object: table or column
- the category of the sensor
- the short name of the sensor
DQOps uses the folder structure inside the sensors folder to identify data quality sensors. The sensor in the following example is named column/nulls/null_count according to the sensor naming convention. The highlighted lines show the minimum set of the sensor's configuration files.
```
$DQO_USER_HOME
├───...
├───sensors
│   └───column(1)
│       └───nulls(2)
│           └───null_count(3)
│               ├───bigquery.dqoprovidersensor.yaml(4)
│               ├───bigquery.sql.jinja2(5)
│               ├───mysql.dqoprovidersensor.yaml
│               ├───mysql.sql.jinja2
│               ├───oracle.dqoprovidersensor.yaml
│               ├───oracle.sql.jinja2
│               ├───postgresql.dqoprovidersensor.yaml
│               ├───postgresql.sql.jinja2
│               ├───redshift.dqoprovidersensor.yaml
│               ├───redshift.sql.jinja2
│               ├───sensordefinition.dqosensor.yaml(6)
│               ├───snowflake.dqoprovidersensor.yaml
│               ├───snowflake.sql.jinja2
│               ├───sqlserver.dqoprovidersensor.yaml
│               └───sqlserver.sql.jinja2
└───...
```
1. The sensor target; must be table or column.
2. The sensor category, which is a logical grouping of similar sensors.
3. The short sensor name within the category. This folder contains the sensor configuration files.
4. The database-specific configuration of the sensor.
5. The Jinja2 SQL template of the sensor.
6. The main sensor definition file that configures the list of the sensor's parameters shown on the DQOps check editor screen.
DQOps supports both creating custom sensors and changing the Jinja2 templates of built-in sensors. Updating built-in sensors has one limitation: the list of the sensor's parameters stored in the sensordefinition.dqosensor.yaml file cannot be modified by adding or changing the sensor's parameters.

Sensors that override built-in sensors are a copy of the sensor definition file from the DQOps home sensors folder. The easiest way to customize a built-in sensor is to edit the Jinja2 file on the Configuration -> Sensors screen in the DQOps user interface. DQOps will copy the default definition from its distribution to the sensors folder in the DQOps user home.
A custom sensor must have at least three files, as sketched after this list:

- a sensordefinition.dqosensor.yaml file that provides the list of parameters,
- a .dqoprovidersensor.yaml file named <database_type>.dqoprovidersensor.yaml, which confirms that there is a sensor definition (and a query template) for that database type,
- a Jinja2 SQL template of the sensor named <database_type>.sql.jinja2.
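The sketch below shows the general shape of the two YAML files for a hypothetical BigQuery variant of a sensor (the Jinja2 SQL template itself is omitted). All field names are illustrative assumptions; consult the sensor YAML reference for the authoritative schema.

```yaml
# File: sensors/column/nulls/null_count/sensordefinition.dqosensor.yaml (illustrative)
apiVersion: dqo/v1
kind: sensor
spec:
  fields:                # sensor parameters shown in the check editor (assumed layout)
  - field_name: filter   # hypothetical parameter
    data_type: string
---
# File: sensors/column/nulls/null_count/bigquery.dqoprovidersensor.yaml (illustrative)
apiVersion: dqo/v1
kind: provider_sensor
spec:
  type: sql_template     # assumption: the query template is bigquery.sql.jinja2
```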
## Custom rules
Custom data quality rules are defined as a pair of files: a .dqorule.yaml file with the rule parameters and configuration, and a Python module that must define an evaluate_rule function. Rule names also follow a naming convention but, contrary to sensors, multiple rules can be defined in a single rule category folder. The full rule name in the following example is comparison/max_count.
```
$DQO_USER_HOME
├───...
├───rules
│   ├───requirements.txt(1)
│   ├───comparison(2)
│   │   ├───max_count.dqorule.yaml(3)
│   │   └───max_count.py(4)
│   └─...
└───...
```
1. The requirements.txt file with a list of custom Python packages that should be installed when DQOps is started as a Docker container.
2. The rule category name where a custom or overridden rule is defined. Custom rules can be defined in any category.
3. The .dqorule.yaml rule definition file that specifies the list of rule parameters shown on the check editor screen, and the time window of historical sensor readouts required by rules that use historical values for change or anomaly detection.
4. The Python module with an evaluate_rule function that is called by DQOps to evaluate the sensor readout.
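A minimal .dqorule.yaml could look like the sketch below. The field names and the mode value are assumptions made for illustration; the authoritative schema is in the rule YAML reference.

```yaml
# rules/comparison/max_count.dqorule.yaml -- illustrative sketch
apiVersion: dqo/v1
kind: rule
spec:
  type: python              # the rule is evaluated by the max_count.py module
  mode: current_value       # assumption: no historical time window is required
  fields:                   # rule parameters shown on the check editor screen
  - field_name: max_count   # hypothetical threshold parameter
    data_type: integer
```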
Custom rules also support overriding built-in rules, which are copied from the DQOps home rules folder. Again, the easiest way to alter a built-in rule is to edit the Python rule file on the Configuration -> Rules screen in the DQOps user interface.

The same limitation applies: the list of parameters of built-in rules cannot be changed. If a new version of a rule with a different set of parameters is required, the built-in rule should be copied and altered as a custom rule. The Configuration -> Rules screen has a copy button that supports this use case.
Custom rules can also use additional Python packages from PyPI. The list of dependencies must be configured in the rules/requirements.txt file. It is important to understand where and when the additional packages are installed.

A DQOps instance that is started for development as a Python module by running python -m dqops will not use the rules/requirements.txt file at all. Instead, DQOps requires that all necessary packages are already installed in the Python system or virtual environment used to start the python -m dqops command.

A production DQOps instance started from Docker detects changes to the rules/requirements.txt file on startup and reinstalls the required packages.
## Custom checks
Custom data quality checks in DQOps are simply a pair of a sensor and a rule. A check can use any combination of custom and built-in sensors and rules.
The folder structure for custom data quality checks is strictly limited, because the folder names determine the check's location on the check editor screen in the DQOps user interface. Custom checks must be defined in a three-level deep folder structure. The folder names on the folder tree are:
- The target object; must be table or column.
- The type of checks; must be one of profiling, monitoring, or partitioned.
- The name of an existing check category within the built-in check structure. The folder structure is shown in the data quality check reference in this documentation. A custom check can be appended to an existing category of checks or added to the category named custom.
The following example shows two custom data quality checks: one in the custom category and another appended to the existing volume category.
```
$DQO_USER_HOME
├───...
├───checks
│   └───table(1)
│       └───profiling(2)
│           ├───custom(3)
│           │   └───custom_profile_row_count.dqocheck.yaml(4)
│           └───volume(5)
│               └───profile_max_row_count.dqocheck.yaml
└───...
```
1. The check target; table or column.
2. The check type; must be one of profiling, monitoring, or partitioned.
3. The custom check category for adding new custom checks.
4. The check definition file that contains the configuration of the check.
5. An existing category of checks where a custom check is appended.
Custom checks are defined in <check_name>.dqocheck.yaml files. The check names must be unique, even across categories; otherwise, the results shown on the data quality dashboards will not identify the correct check. The check_name used in the .dqocheck.yaml file name is the name used to run the check.

Unlike the customization of sensors and rules, it is not possible to overwrite a built-in check by creating a check with the same name in the checks folder.
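Because a custom check is simply a pairing of a sensor and a rule, its definition file is short. The sketch below is hypothetical; the sensor_name and rule_name fields and their values are assumptions used for illustration.

```yaml
# checks/table/profiling/custom/custom_profile_row_count.dqocheck.yaml -- illustrative sketch
apiVersion: dqo/v1
kind: check
spec:
  sensor_name: table/volume/row_count   # assumed built-in sensor name
  rule_name: comparison/max_count       # the custom rule from the previous example
```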
## Shared settings
Some configuration files are safe to store in Git and should be shared between on-premise and cloud DQOps instances in a hybrid deployment model. The default settings files are created when the DQOps user home is initialized for the first time. The following example shows the files in the settings folder.
```
$DQO_USER_HOME
├───...
├───settings
│   ├───dashboardslist.dqodashboards.yaml(1)
│   ├───defaultchecks.dqochecks.yaml(2)
│   ├───defaultnotifications.dqonotifications.yaml(3)
│   └───defaultschedules.dqoschedules.yaml(4)
└───...
```
1. A list (tree) of custom or overwritten data quality dashboards.
2. The configuration of the default data quality checks that are activated on imported tables and columns to detect common issues and observe the data source.
3. The configuration of the default incident notification webhooks.
4. The configuration of the default schedules for running data quality checks daily or monthly.
The default configuration files are listed below.
| File name | Description |
|-----------|-------------|
| dashboardslist.dqodashboards.yaml | The configuration of custom data quality dashboards. Adding custom dashboards is documented in the creating custom dashboards manual. |
| defaultchecks.dqochecks.yaml | The configuration of the default checks that are activated on imported tables and columns to detect common issues and observe the data source. |
| defaultnotifications.dqonotifications.yaml | The configuration of the webhooks to which incident notifications are POSTed when data quality incidents are created or reassigned. |
| defaultschedules.dqoschedules.yaml | The default configuration of CRON schedules for running data quality checks at regular intervals. NOTE: the CRON schedules defined in this file are copied to the connection.dqoconnection.yaml file when a new connection is imported into DQOps. Changes to this file will not change the schedules for running checks on already imported data sources. |