Last updated: July 22, 2025
DQOps error_samples parquet table schema
The parquet file schema for the error_samples table stored in the $DQO_USER_HOME/.data/error_samples folder in DQOps.
Table description
The error samples table stores sample column values that failed data quality checks that operate on rows (mostly Validity and Consistency checks). The error samples are stored in the error_samples table located in the $DQO_USER_HOME/.data/error_samples folder, which contains uncompressed parquet files. The table is partitioned using a Hive-compatible partitioning folder structure. When the $DQO_USER_HOME is not configured, it is the folder where DQOps was started (the DQOps user's home folder).
The folder partitioning structure for this table is: c=[connection_name]/t=[schema_name.table_name]/m=[first_day_of_month]/, for example: c=myconnection/t=public.analyzedtable/m=2023-01-01/. The date used for monthly partitioning is calculated from the collected_at column value.
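Because the folder layout is Hive-compatible, the files can be read directly with standard parquet tooling. The sketch below is a minimal, illustrative example (not part of DQOps), assuming Python with the pyarrow package installed; the connection, table, and month values in the filter reuse the example path above.

```python
# A minimal sketch of reading the error_samples parquet files directly with pyarrow.
import os

import pyarrow.dataset as ds

# Resolve the DQOps user home; when $DQO_USER_HOME is not set, DQOps uses the
# folder where it was started, so we fall back to the current directory here.
dqo_user_home = os.environ.get("DQO_USER_HOME", ".")
error_samples_path = os.path.join(dqo_user_home, ".data", "error_samples")

# "hive" partitioning exposes the c, t and m folder levels (connection,
# schema.table, first day of month) as columns usable for partition pruning.
dataset = ds.dataset(error_samples_path, format="parquet", partitioning="hive")

# Read a single monthly partition of a single monitored table.
# Partition values are compared as strings here; cast if pyarrow infers another type.
table = dataset.to_table(
    filter=(
        (ds.field("c") == "myconnection")
        & (ds.field("t") == "public.analyzedtable")
        & (ds.field("m") == "2023-01-01")
    )
)
print(table.num_rows)
```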
Parquet table schema
The columns of this table are described below.
Column name | Description | Hive data type |
---|---|---|
id | The check result id (primary key), it is a uuid of the check hash, collected at, sample index and the data grouping id. This value identifies a single row. | STRING |
collected_at | Column for the time when the error samples were captured. All error samples captured as part of the same error sampling session share the same time. The parquet files are time partitioned by this column. | TIMESTAMP |
scope | String column that says if the result is for a whole table (the "table" value) or for each data group separately (the "data_group" value). | STRING |
grouping_level_1 | Data group value at a single level. | STRING |
grouping_level_2 | Data group value at a single level. | STRING |
grouping_level_3 | Data group value at a single level. | STRING |
grouping_level_4 | Data group value at a single level. | STRING |
grouping_level_5 | Data group value at a single level. | STRING |
grouping_level_6 | Data group value at a single level. | STRING |
grouping_level_7 | Data group value at a single level. | STRING |
grouping_level_8 | Data group value at a single level. | STRING |
grouping_level_9 | Data group value at a single level. | STRING |
data_group_hash | The data grouping hash, it is a hash of the data grouping level values. | BIGINT |
data_group_name | The data grouping name, it is a concatenated name of the data grouping dimension values, created from [grouping_level_1] / [grouping_level_2] / ... | STRING |
data_grouping_configuration | The data grouping configuration name, it is the name of the named data grouping configuration that was used to run the data quality check. | STRING |
connection_hash | A hash calculated from the connection name (the data source name). | BIGINT |
connection_name | The connection name (the data source name). | STRING |
provider | The provider name, which is the type of the data source. | STRING |
table_hash | The table name hash. | BIGINT |
schema_name | The database schema name. | STRING |
table_name | The monitored table name. | STRING |
table_stage | The stage name of the table. This is a free-form text configured at the table level that can identify the layers of the data warehouse or a data lake, for example: "landing", "staging", "cleansing", etc. | STRING |
table_priority | The table priority value copied from the table's definition. The table priority can be used to sort tables according to their importance. | INTEGER |
column_hash | The hash of a column. | BIGINT |
column_name | The column name for which the results are stored. | STRING |
check_hash | The hash of a data quality check. | BIGINT |
check_name | The data quality check name. | STRING |
check_display_name | The user-configured display name for a data quality check, used when the user wants to use custom, user-friendly data quality check names. | STRING |
check_type | The data quality check type (profiling, monitoring, partitioned). | STRING |
time_gradient | The time gradient (daily, monthly) for monitoring checks (checkpoints) and partition checks. It is "milliseconds" for profiling checks. When the time gradient is daily or monthly, the time_period is truncated at the beginning of the time gradient. | STRING |
check_category | The data quality check category name. | STRING |
quality_dimension | The data quality dimension name. The popular dimensions are: Timeliness, Completeness, Consistency, Validity, Reasonableness, Uniqueness. | STRING |
table_comparison | The name of a table comparison configuration used for a data comparison (accuracy) check. | STRING |
sensor_name | The data quality sensor name. | STRING |
time_series_id | The time series id (uuid). Identifies a single time series. A time series is a combination of the check_hash and data_group_hash. | STRING |
result_type | The sample's result data type. | STRING |
result_string | The sample value when it is a string value. | STRING |
result_integer | The sample value when it is an integer value. It is a long (64 bit) value where we store all short, integer, long values. | BIGINT |
result_float | The sample value when it is a numeric value. It is a double value where we store all double, float, numeric and decimal values. | DOUBLE |
result_boolean | The sample value when it is a boolean value. | BOOLEAN |
result_date | The sample value when it is a local date value. | DATE |
result_date_time | The sample value when it is a local date time value. | TIMESTAMP |
result_instant | The sample value when it is an absolute (UTC timezone) instant. | TIMESTAMP |
result_time | The sample value when it is a time value. | INTERVAL |
sample_index | The 1-based index of the collected sample. | INTEGER |
sample_filter | The sample filtering formula that was used in the WHERE filter. | STRING |
row_id_1 | The value of the first configured ID column that identifies the sampled row in the source table. | STRING |
row_id_2 | The value of the second configured ID column that identifies the sampled row in the source table. | STRING |
row_id_3 | The value of the third configured ID column that identifies the sampled row in the source table. | STRING |
row_id_4 | The value of the fourth configured ID column that identifies the sampled row in the source table. | STRING |
row_id_5 | The value of the fifth configured ID column that identifies the sampled row in the source table. | STRING |
executed_at | The UTC timestamp when the data sensor was executed. | TIMESTAMP |
duration_ms | The sensor (query) execution duration in milliseconds. | INTEGER |
created_at | The timestamp when the row was created. | TIMESTAMP |
updated_at | The timestamp when the row was last updated. | TIMESTAMP |
created_by | The login of the user who created the row. | STRING |
updated_by | The login of the user who last updated the row. | STRING |
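Each sample value is stored in exactly one of the result_* columns, selected by the result_type column. As an illustration, the sketch below shows how a consumer might pick the matching column after loading a row into a Python dict; the literal result_type values used as keys are assumptions for illustration and should be verified against the actual data.

```python
# A minimal sketch (not DQOps code) of reading the typed sample value from one
# row loaded as a plain dict, e.g. via pandas DataFrame.to_dict("records").
def get_sample_value(row: dict):
    """Return the error sample value from the result_* column matching result_type."""
    # Assumed mapping of result_type literals to the typed storage columns.
    column_for_type = {
        "string": "result_string",
        "integer": "result_integer",
        "float": "result_float",
        "boolean": "result_boolean",
        "date": "result_date",
        "datetime": "result_date_time",
        "instant": "result_instant",
        "time": "result_time",
    }
    column = column_for_type.get(row.get("result_type"))
    return row.get(column) if column else None


# Example usage with a hypothetical integer sample from a validity check.
sample_row = {"result_type": "integer", "result_integer": 42, "result_string": None}
print(get_sample_value(sample_row))  # 42
```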
What's more
- You can find more information on how the Parquet files are partitioned in the data quality results storage concept.