Setting up data grouping
Data group is a group of rows that were loaded from a single or different sources and aggregated into one table.
In DQO, data grouping in a single table can be identified by a discriminator column containing data such as location, business units, vendor, data provider or subsidiary. Data from separate tables can also be aggregated into single data group by adding a tag with the name of the data source.
Data groups are used to calculate separate data quality KPI scores for different groups of rows.
For more information about data grouping, see data grouping in the DQO concepts section.
In DQO, you can set up data grouping globally at the data source level, in which case the configuration will be copied to the data grouping configuration of the tables that will be imported. Note that this configuration does not affect tables that have already been imported.
The data grouping should be configured individually for each table, which allows to select columns from the monitored table that will be used in the GROUP BY clause to identify each group of data coming from a different data source.
Set up data grouping using the graphical interface
To set up the default data grouping configuration template at the data source level, follow the steps below.
-
In DQO graphical interface, go to the Data Source section, select the data source of interest in the tree view on the left side, and select Default Grouping Template tab.
-
In the Default Data Stream Template tab, write a tag for a data source or select the discriminator column that would identify the data source. You can configure up to 9 different data grouping dimension levels.
-
Once you have set the data grouping configuration, click on the Save button to save your changes.
The default data grouping configuration added at the data source level will be copied to the data grouping configuration of all the tables that will be imported in the future. Note that this configuration does not affect tables that have already been imported.
To set up data grouping configuration at the table level:
-
In DQO graphical interface, go to the Data Source section, select the table of interest in the tree view on the left side, and select Data Grouping tab.
-
Click on the New data grouping configuration button.
-
Write a Data grouping name and select a tag for a data grouping or select the discriminator column that would identify the data source.
-
Once you have set the data grouping configuration, click on the Save button to save your changes.
-
After adding the new data grouping configuration you can set which one is the default one by clicking on the Make Default button. The default data grouping configuration is used for running all data quality checks on the table. If the data grouping configuration uses grouping by discriminator columns, all data quality checks started when the configuration is active will use a GROUP BY clause, calculating separate data quality results for each group of data.
On this screen you can also edit or delete data grouping configurations.
Set up data grouping configuration using the DQO Shell
Data quality checks are stored in YAML configuration files. YAML configuration files are located in the ./sources
folder.
The complete DQO YAML schema can be found here.
The YAML files in DQO support code completion in code editors such as Visual Studio Code. Remember to install the YAML extension by RedHat and Better Jinja by Samuel Colvin.
To set up a default data grouping configuration at the data source level, follow the steps below.
-
Run the following command in DQO Shell to edit YAML configuration file and define data stream.
-
Provide the connection name.
After providing the above data Visual Studio Code will be automatically launched.
-
Add the default data grouping configuration to the YAML file using Visual Studio Code editor and save the file.
- Add the
default_grouping_configuration:
parameter. - Add the level of the data groupings (dimensions) such as
level_1:
, and specify thesource:
astag
orcolumn_value
- Add a name for the
tag:
orcolumn:
parameters.
Below are examples of the YAML files showing a sample configuration of a default data grouping configuration with a tag or column.
- Add the
apiVersion: dqo/v1
kind: source
spec:
provider_type: bigquery
bigquery:
source_project_id: bigquery-public-data
authentication_mode: google_application_credentials
default_grouping_configuration:
level_1:
source: tag
tag: tag1
incident_grouping:
grouping_level: table_dimension_category
minimum_severity: warning
max_incident_length_days: 60
mute_for_days: 60
apiVersion: dqo/v1
kind: source
spec:
provider_type: bigquery
bigquery:
source_project_id: bigquery-public-data
authentication_mode: google_application_credentials
default_grouping_configuration:
level_1:
source: column_value
column: country
incident_grouping:
grouping_level: table_dimension_category
minimum_severity: warning
max_incident_length_days: 60
mute_for_days: 60
To set up a data grouping configuration at the table level
-
Run the following command in DQO Shell to edit YAML configuration file and define data stream.
-
Provide the connection name and full table name in a schema.table format.
After entering the above data, Visual Studio Code will be automatically launched. -
Add the data grouping configuration to the YAML file using Visual Studio Code editor and save the file.
- Add the
groupings:
parameter above thecolumn:
section. - Add a name of the data grouping configuration
- Add the level of the data grouping (dimension) such as
level_1:
, and specify thesource:
astag
orcolumn_value
- Add a name for the
tag:
orcolumn:
parameters.
Below is an examples of the YAML files showing a sample configuration of 2 data grouping configurations:
- "Column_data_stream" with data grouping set on category column
- "Tag_data_stream" with data grouping by assigning a tag
landing_zone
to all data quality results.
apiVersion: dqo/v1 kind: table spec: incremental_time_window: daily_partitioning_recent_days: 7 monthly_partitioning_recent_months: 1 data_streams: Column_data_stream: level_1: source: column_value column: category Tag_data_stream: level_1: source: tag tag: landing_zone columns: unique_key: type_snapshot: column_type: INT64 nullable: true address: type_snapshot: column_type: STRING nullable: true
- Add the