ConnectionYaml
MysqlParametersSpec
MySQL connection parameters.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
host | MySQL host name. Supports also a ${MYSQL_HOST} configuration with a custom environment variable. | string | |||
port | MySQL port number. The default port is 3306. Supports also a ${MYSQL_PORT} configuration with a custom environment variable. | string | |||
database | MySQL database name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
user | MySQL user name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
password | MySQL database password. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
options | MySQL connection 'options' initialization parameter. For example setting this to -c statement_timeout=5min would set the statement timeout parameter for this session to 5 minutes. Supports also a ${MYSQL_OPTIONS} configuration with a custom environment variable. | string | |||
sslmode | SslMode MySQL connection parameter. | enum | DISABLED PREFERRED VERIFY_IDENTITY VERIFY_CA REQUIRED |
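Put together, these properties form the `mysql` section of a connection file. The following fragment is an illustrative sketch only; the host, database, user, and environment variable names are placeholders, not values from this reference:

```yaml
# Illustrative mysql section; all values and variable names are placeholders.
mysql:
  host: ${MYSQL_HOST}      # resolved from a custom environment variable
  port: "3306"             # default MySQL port; note the data type is string
  database: ${SALES_DB_NAME}
  user: dq_reader
  password: ${MYSQL_PASSWORD}
  sslmode: REQUIRED        # DISABLED, PREFERRED, VERIFY_IDENTITY, VERIFY_CA or REQUIRED
```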
ConnectionSpec
Data source (connection) specification.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
provider_type | Database provider type (required). | enum | snowflake oracle postgresql redshift sqlserver mysql bigquery | ||
bigquery | BigQuery connection parameters. Specify parameters in the bigquery section. | BigQueryParametersSpec | |||
snowflake | Snowflake connection parameters. Specify parameters in the snowflake section or set the url (which is the Snowflake JDBC url). | SnowflakeParametersSpec | |||
postgresql | PostgreSQL connection parameters. Specify parameters in the postgresql section or set the url (which is the PostgreSQL JDBC url). | PostgresqlParametersSpec | |||
redshift | Redshift connection parameters. Specify parameters in the redshift section or set the url (which is the Redshift JDBC url). | RedshiftParametersSpec | |||
sqlserver | SQL Server connection parameters. Specify parameters in the sqlserver section or set the url (which is the SQL Server JDBC url). | SqlServerParametersSpec | |||
mysql | MySQL connection parameters. Specify parameters in the mysql section or set the url (which is the MySQL JDBC url). | MysqlParametersSpec | |||
oracle | Oracle connection parameters. Specify parameters in the oracle section or set the url (which is the Oracle JDBC url). | OracleParametersSpec | |||
parallel_runs_limit | The concurrency limit for the maximum number of parallel SQL queries executed on this connection. | integer | |||
default_grouping_configuration | Default data grouping configuration for all tables. The configuration may be overridden at the table, column and check level. Data groupings are configured in two cases: (1) the data in the table should be analyzed with a GROUP BY condition, to analyze different datasets using separate time series, for example when a table contains data from multiple countries and there is a 'country' column used for partitioning; (2) a static dimension is assigned to a table when the data is partitioned at the table level (similar tables store the same information, but for different countries, etc.). | DataGroupingConfigurationSpec | |||
schedules | Configuration of the job scheduler that runs data quality checks. The scheduler configuration is divided into types of checks that have different schedules. | RecurringSchedulesSpec | |||
incident_grouping | Configuration of data quality incident grouping. Configures how failed data quality checks are grouped into data quality incidents. | ConnectionIncidentGroupingSpec | |||
comments | Comments for change tracking. Please put comments in this collection because YAML comments may be removed when the YAML file is modified by the tool (serialization and deserialization will remove non tracked comments). | CommentsListSpec | |||
labels | Custom labels that were assigned to the connection. Labels are used for searching for tables when filtered data quality checks are executed. | LabelSetSpec |
SqlServerParametersSpec
Microsoft SQL Server connection parameters.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
host | SQL Server host name. Supports also a ${SQLSERVER_HOST} configuration with a custom environment variable. | string | |||
port | SQL Server port number. The default port is 1433. Supports also a ${SQLSERVER_PORT} configuration with a custom environment variable. | string | |||
database | SQL Server database name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
user | SQL Server user name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
password | SQL Server database password. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
options | SQL Server connection 'options' initialization parameter. For example setting this to -c statement_timeout=5min would set the statement timeout parameter for this session to 5 minutes. Supports also a ${SQLSERVER_OPTIONS} configuration with a custom environment variable. | string | |||
disable_encryption | Disable SSL encryption parameter. The default value is false. You may need to disable encryption when SQL Server is started in Docker. | boolean |
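As a sketch, a `sqlserver` section for a local SQL Server running in Docker might look like this; the host, database, and credentials are placeholder values:

```yaml
# Illustrative sqlserver section for a Docker-hosted instance; values are placeholders.
sqlserver:
  host: localhost
  port: "1433"             # default SQL Server port
  database: master
  user: sa
  password: ${SQLSERVER_PASSWORD}
  disable_encryption: true # often required when SQL Server is started in Docker
```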
DataGroupingDimensionSpec
Single data grouping dimension configuration. A data grouping dimension may be configured as a hardcoded value or a mapping to a column.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
source | The source of the data grouping dimension value. The default grouping dimension source is a tag. Assign a tag when there are multiple similar tables that store the same data for different areas (countries, etc.). This could be a country name if a table or partition stores information for that country. | enum | tag column_value | ||
tag | The value assigned to a data quality grouping dimension when the source is 'tag'. Assign a hardcoded (static) data grouping dimension value (tag) when there are multiple similar tables that store the same data for different areas (countries, etc.). This could be a country name if a table or partition stores information for that country. | string | |||
column | Column name that contains a dynamic data grouping dimension value (for dynamic data-driven data groupings). Sensor queries will be extended with a GROUP BY {data grouping level column name}; sensors (and alerts) will be calculated for each unique value of the specified column. A separate time series will also be tracked for each value. | column_name | |||
name | Data grouping dimension name. | string |
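The two dimension sources described above can be sketched side by side. This fragment assumes the dimension specs are nested under the `level_1`/`level_2` keys of a DataGroupingConfigurationSpec; 'US' and 'country' are placeholder values:

```yaml
# Illustrative dimensions; 'US' and 'country' are placeholders.
level_1:
  source: tag            # static value: the whole table belongs to one group
  tag: US
level_2:
  source: column_value   # dynamic value: groups come from a GROUP BY on the column
  column: country
```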
BigQueryParametersSpec
BigQuery connection parameters.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
source_project_id | Source GCP project ID. This is the project that has datasets that will be imported. | string | |||
billing_project_id | Billing GCP project ID. This is the project used as the default GCP project. The calling user must have a bigquery.jobs.create permission in this project. | string | |||
authentication_mode | Authentication mode to the Google Cloud. | enum | json_key_content json_key_path google_application_credentials | ||
json_key_content | JSON key content. Use an environment variable that contains the content of the key as ${KEY_ENV} or a name of a secret in the GCP Secret Manager: ${sm://key-secret-name}. Requires the authentication-mode: json_key_content. | string | |||
json_key_path | A path to the JSON key file. Requires the authentication-mode: json_key_path. | string | |||
quota_project_id | Quota GCP project ID. | string |
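A hypothetical `bigquery` section using a key stored in GCP Secret Manager could look as follows; the project IDs and secret name are placeholders:

```yaml
# Illustrative bigquery section; project IDs and the secret name are placeholders.
bigquery:
  source_project_id: analytics-data      # project whose datasets are imported
  billing_project_id: analytics-billing  # needs bigquery.jobs.create permission
  authentication_mode: json_key_content
  json_key_content: ${sm://bigquery-key-secret}  # key content from GCP Secret Manager
```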
RecurringSchedulesSpec
Container of all recurring schedules (cron expressions) for each type of check. Data quality checks are grouped by type (profiling, whole-table checks, time-period partitioned checks). Each group of checks can be further divided by time scale (daily, monthly, etc.). Each time scale has a different recurring schedule used by the job scheduler to run the checks. These schedules are defined in this object.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
profiling | Schedule for running profiling data quality checks. | RecurringScheduleSpec | |||
recurring_daily | Schedule for running daily recurring checks. | RecurringScheduleSpec | |||
recurring_monthly | Schedule for running monthly recurring checks. | RecurringScheduleSpec | |||
partitioned_daily | Schedule for running daily partitioned checks. | RecurringScheduleSpec | |||
partitioned_monthly | Schedule for running monthly partitioned checks. | RecurringScheduleSpec |
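A sketch of the `schedules` object follows. It assumes each nested RecurringScheduleSpec accepts a `cron_expression` field (that field is not documented in this excerpt); the cron expressions themselves are placeholders:

```yaml
# Illustrative schedules; the cron_expression field name is an assumption
# and the expressions are placeholders.
schedules:
  profiling:
    cron_expression: "0 1 * * *"    # every day at 01:00
  recurring_daily:
    cron_expression: "0 2 * * *"    # every day at 02:00
  partitioned_monthly:
    cron_expression: "0 3 1 * *"    # first day of each month at 03:00
```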
ConnectionIncidentGroupingSpec
Configuration of data quality incident grouping on a connection level. Defines how similar data quality issues are grouped into incidents.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
grouping_level | Grouping level of failed data quality checks for creating higher level data quality incidents. The default grouping level is by a table, a data quality dimension and a check category (i.e. a datatype data quality incident detected on a table X in the numeric checks category). | enum | table_dimension_category_type table_dimension table table_dimension_category table_dimension_category_name | ||
minimum_severity | Minimum severity level of data quality issues that are grouped into incidents. The default minimum severity level is 'warning'. Other supported severity levels are 'error' and 'fatal'. | enum | warning error fatal | ||
divide_by_data_groups | Create separate data quality incidents for each data group, creating different incidents for different groups of rows. By default, data groups are ignored for grouping data quality issues into data quality incidents. | boolean | |||
max_incident_length_days | The maximum length of a data quality incident in days. When a new data quality issue is detected more than max_incident_length_days days after a similar data quality issue was first seen, a new data quality incident is created that will capture all following data quality issues for the next max_incident_length_days days. The default value is 60 days. | integer | |||
mute_for_days | The number of days that all similar data quality issues are muted when a data quality incident is closed in the 'mute' status. | integer | |||
disabled | Disables data quality incident creation for failed data quality checks on the data source. | boolean | |||
webhooks | Configuration of Webhook URLs for new or updated incident notifications. | IncidentWebhookNotificationsSpec |
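An illustrative `incident_grouping` section combining these properties, with a placeholder webhook URL:

```yaml
# Illustrative incident grouping; the webhook URL is a placeholder.
incident_grouping:
  grouping_level: table_dimension_category   # the default grouping level
  minimum_severity: error                    # skip 'warning'-level issues
  max_incident_length_days: 60
  mute_for_days: 30
  webhooks:
    incident_opened_webhook_url: https://example.com/hooks/new-incident
```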
OracleParametersSpec
Oracle connection parameters.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
host | Oracle host name. Supports also a ${ORACLE_HOST} configuration with a custom environment variable. | string | |||
port | Oracle port number. The default port is 1521. Supports also a ${ORACLE_PORT} configuration with a custom environment variable. | string | |||
database | Oracle database name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
user | Oracle user name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
password | Oracle database password. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
options | Oracle connection 'options' initialization parameter. For example setting this to -c statement_timeout=5min would set the statement timeout parameter for this session to 5 minutes. Supports also a ${ORACLE_OPTIONS} configuration with a custom environment variable. | string | |||
initialization_sql | Custom SQL that is executed after connecting to Oracle. This SQL script can configure the default language, for example: alter session set NLS_DATE_FORMAT='YYYY-DD-MM HH24:MI:SS' | string |
IncidentWebhookNotificationsSpec
Configuration of Webhook URLs used for notifications about new or updated incidents. Specifies the URLs of webhooks where the notification messages are sent.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
incident_opened_webhook_url | Webhook URL where notification messages describing new incidents are pushed using an HTTP POST request. The format of the JSON message is documented in the IncidentNotificationMessage object. | string | |||
incident_acknowledged_webhook_url | Webhook URL where notification messages describing acknowledged incidents are pushed using an HTTP POST request. The format of the JSON message is documented in the IncidentNotificationMessage object. | string | |||
incident_resolved_webhook_url | Webhook URL where notification messages describing resolved incidents are pushed using an HTTP POST request. The format of the JSON message is documented in the IncidentNotificationMessage object. | string | |||
incident_muted_webhook_url | Webhook URL where notification messages describing muted incidents are pushed using an HTTP POST request. The format of the JSON message is documented in the IncidentNotificationMessage object. | string
PostgresqlParametersSpec
PostgreSQL connection parameters.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
host | PostgreSQL host name. Supports also a ${POSTGRESQL_HOST} configuration with a custom environment variable. | string | |||
port | PostgreSQL port number. The default port is 5432. Supports also a ${POSTGRESQL_PORT} configuration with a custom environment variable. | string | |||
database | PostgreSQL database name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
user | PostgreSQL user name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
password | PostgreSQL database password. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
options | PostgreSQL connection 'options' initialization parameter. For example setting this to -c statement_timeout=5min would set the statement timeout parameter for this session to 5 minutes. Supports also a ${POSTGRESQL_OPTIONS} configuration with a custom environment variable. | string | |||
sslmode | Sslmode PostgreSQL connection parameter. The default value is 'disable'. | enum | allow prefer disable require verify-full verify-ca |
RedshiftParametersSpec
Redshift connection parameters.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
host | Redshift host name. Supports also a ${REDSHIFT_HOST} configuration with a custom environment variable. | string | |||
port | Redshift port number. The default port is 5432. Supports also a ${REDSHIFT_PORT} configuration with a custom environment variable. | string | |||
database | Redshift database name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
user | Redshift user name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
password | Redshift database password. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
options | Redshift connection 'options' initialization parameter. For example setting this to -c statement_timeout=5min would set the statement timeout parameter for this session to 5 minutes. Supports also a ${REDSHIFT_OPTIONS} configuration with a custom environment variable. | string |
SnowflakeParametersSpec
Snowflake connection parameters.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
account | Snowflake account name, e.g. <account>, <account>-<locator>, <account>.<region> or <account>.<region>.<platform>. Supports also a ${SNOWFLAKE_ACCOUNT} configuration with a custom environment variable. | string | |||
warehouse | Snowflake warehouse name. Supports also a ${SNOWFLAKE_WAREHOUSE} configuration with a custom environment variable. | string | |||
database | Snowflake database name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
user | Snowflake user name. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
password | Snowflake database password. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution. | string | |||
role | Snowflake role name. Supports also ${SNOWFLAKE_ROLE} configuration with a custom environment variable. | string |
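An illustrative `snowflake` section; the account, warehouse, database, and user names are placeholders:

```yaml
# Illustrative snowflake section; all values are placeholders.
snowflake:
  account: xy12345.eu-central-1      # the <account>.<region> form
  warehouse: ${SNOWFLAKE_WAREHOUSE}
  database: ANALYTICS
  user: dq_agent
  password: ${SNOWFLAKE_PASSWORD}
  role: ${SNOWFLAKE_ROLE}
```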
LabelSetSpec
Collection of unique labels assigned to items (tables, columns, checks) that could be targeted for a data quality check execution.
DataGroupingConfigurationSpec
Configuration of the data groupings that are used to calculate data quality checks with a GROUP BY clause. Data grouping levels may be hardcoded if we have different (but similar) tables for different business areas (countries, product groups). We can also pull data grouping levels directly from the database if a table has a column that identifies a business area. Data quality results for new groups are dynamically identified in the database by the GROUP BY clause. Sensor values are extracted for each data group separately, and a time series is built for each data group separately.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
level_1 | Data grouping dimension level 1 configuration. | DataGroupingDimensionSpec | |||
level_2 | Data grouping dimension level 2 configuration. | DataGroupingDimensionSpec | |||
level_3 | Data grouping dimension level 3 configuration. | DataGroupingDimensionSpec | |||
level_4 | Data grouping dimension level 4 configuration. | DataGroupingDimensionSpec | |||
level_5 | Data grouping dimension level 5 configuration. | DataGroupingDimensionSpec | |||
level_6 | Data grouping dimension level 6 configuration. | DataGroupingDimensionSpec | |||
level_7 | Data grouping dimension level 7 configuration. | DataGroupingDimensionSpec | |||
level_8 | Data grouping dimension level 8 configuration. | DataGroupingDimensionSpec | |||
level_9 | Data grouping dimension level 9 configuration. | DataGroupingDimensionSpec |
ConnectionYaml
Connection definition for a data source connection that is covered by data quality checks.
The structure of this object is described below
Property name | Description | Data type | Enum values | Default value | Sample values |
---|---|---|---|---|---|
api_version | | string | |||
kind | | enum | table dashboards source sensor check rule file_index settings provider_sensor | ||
spec | | ConnectionSpec |
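Tying the pieces together, a complete connection file could be sketched as below. The `dqo/v1` api_version value is an assumption (check a file generated by your tool); the PostgreSQL details are placeholders:

```yaml
# Illustrative complete connection file.
# The api_version value is assumed; provider details are placeholders.
api_version: dqo/v1
kind: source
spec:
  provider_type: postgresql
  postgresql:
    host: ${POSTGRESQL_HOST}
    port: "5432"
    database: warehouse
    user: dq_agent
    password: ${POSTGRESQL_PASSWORD}
    sslmode: require
  parallel_runs_limit: 4
```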