Last updated: July 22, 2025
DQOps statistics parquet table schema
The parquet file schema for the statistics table stored in the $DQO_USER_HOME/.data/statistics folder in DQOps.
Table description
The basic profiling results (statistics) table that stores basic profiling statistical values. The statistics table is located in the $DQO_USER_HOME/.data/statistics folder, which contains uncompressed parquet files. The table is partitioned using a Hive compatible partitioning folder structure. When $DQO_USER_HOME is not configured, it defaults to the folder where DQOps was started (the DQOps user's home folder).
The folder partitioning structure for this table is: c=[connection_name]/t=[schema_name.table_name]/m=[first_day_of_month]/, for example: c=myconnection/t=public.analyzedtable/m=2023-01-01/.
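The folder layout above can be illustrated with a minimal Python sketch. The helper names `partition_folder` and `parse_partition_folder` are hypothetical and not part of DQOps. The sketch assumes that connection, schema and table names are written into the path verbatim (any escaping of special characters that DQOps may apply is not modeled) and that the month is derived from the collected_at timestamp.

```python
from datetime import date, datetime


def partition_folder(connection_name: str, schema_name: str, table_name: str,
                     collected_at: datetime) -> str:
    """Build the relative partition folder for one statistics row (hypothetical helper)."""
    # The month partition is keyed by the first day of the month of collected_at.
    first_day_of_month = date(collected_at.year, collected_at.month, 1)
    # Assumption: names are used verbatim, no escaping of special characters.
    return (
        f"c={connection_name}/"
        f"t={schema_name}.{table_name}/"
        f"m={first_day_of_month.isoformat()}/"
    )


def parse_partition_folder(folder: str) -> dict:
    """Split a partition folder back into its c=, t= and m= components."""
    parts = dict(segment.split("=", 1) for segment in folder.strip("/").split("/"))
    return {
        "connection_name": parts["c"],
        "schema_table": parts["t"],
        "month": date.fromisoformat(parts["m"]),
    }


example = partition_folder("myconnection", "public", "analyzedtable",
                           datetime(2023, 1, 17, 9, 30))
print(example)
```

Any timestamp within January 2023 maps to the same `m=2023-01-01` folder, which is why all statistics collected in one month end up in one partition.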
Parquet table schema
The columns of this table are described below.
Column name | Description | Hive data type |
---|---|---|
id |
Column for a statistics result id (primary key). It is a uuid generated by combining and hashing the connection_name, table_name, column name, collector_hash (which identifies the type of the basic profiling collector), data_group_hash (when the statistics used data grouping), and the executed_at value. This value identifies a single row. | STRING |
collected_at |
Column for the time when the statistics were captured. All statistics results started as part of the same statistics collection session will share the same time. The parquet files are time partitioned by this column. | TIMESTAMP |
status |
Column for a statistics collector status ('success' or 'error'). | STRING |
result_type |
Column for a statistics collector result data type. | STRING |
result_string |
Column for a statistics collector result when it is a string value. | STRING |
result_integer |
Column for a statistics collector result when it is an integer value. It is a long (64 bit) value where we store all short, integer, long values. | BIGINT |
result_float |
Column for a statistics collector result when it is a numeric value. It is a double value where we store all double, float, numeric and decimal values. | DOUBLE |
result_boolean |
Column for a statistics collector result when it is a boolean value. | BOOLEAN |
result_date |
Column for a statistics collector result when it is a local date value. | DATE |
result_date_time |
Column for a statistics collector result when it is a local date time value. | TIMESTAMP |
result_instant |
Column for a statistics collector result when it is an absolute (UTC timezone) instant. | TIMESTAMP |
result_time |
Column for a statistics collector result when it is a time value. | INTERVAL |
sample_index |
The index of the sample for statistics collectors that collect data samples. | INTEGER |
sample_count |
The count of the samples for statistics collectors that collect data samples. | BIGINT |
scope |
String column that says if the result is for a whole table (the "table" value) or for each data group separately (the "data_group" value). | STRING |
grouping_level_1 |
Data group value at a single level. | STRING |
grouping_level_2 |
Data group value at a single level. | STRING |
grouping_level_3 |
Data group value at a single level. | STRING |
grouping_level_4 |
Data group value at a single level. | STRING |
grouping_level_5 |
Data group value at a single level. | STRING |
grouping_level_6 |
Data group value at a single level. | STRING |
grouping_level_7 |
Data group value at a single level. | STRING |
grouping_level_8 |
Data group value at a single level. | STRING |
grouping_level_9 |
Data group value at a single level. | STRING |
data_group_hash |
Column for a data group hash. It is a hash of the data group level values. | BIGINT |
data_group_name |
The data group name. It is a concatenated name of the data group dimensions, created from [grouping_level_1] / [grouping_level_2] / ... | STRING |
data_grouping_configuration |
The data grouping configuration name. It is the name of the named data grouping configuration that was used to run the data quality check. | STRING |
connection_hash |
Column for a connection hash. | BIGINT |
connection_name |
Column for a connection name. | STRING |
provider |
Column for a provider name. | STRING |
table_hash |
Column for a table hash. | BIGINT |
schema_name |
Column for a table schema. | STRING |
table_name |
Column for a table name. | STRING |
table_stage |
Column for a table stage. | STRING |
column_hash |
Column for a column hash. | BIGINT |
column_name |
Column for a column name. | STRING |
collector_hash |
Column for a statistics collector hash. | BIGINT |
collector_name |
Column for a statistics collector name. | STRING |
collector_target |
Column for a statistics collector target (table, column). | STRING |
collector_category |
Column for a statistics collector category. | STRING |
sensor_name |
Column for a sensor name. | STRING |
time_series_id |
Column for a time series id (uuid). Identifies a single time series. A time series is a combination of the profiler_hash and data_group_hash. | STRING |
executed_at |
Column for a statistics collector executed at timestamp. | TIMESTAMP |
duration_ms |
Column for a sensor duration in milliseconds. | INTEGER |
error_message |
Column for an optional error message when the status is 'error'. | STRING |
created_at |
The timestamp when the row was created. | TIMESTAMP |
updated_at |
The timestamp when the row was last updated. | TIMESTAMP |
created_by |
The login of the user that created the row. | STRING |
updated_by |
The login of the user that updated the row. | STRING |
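As a rough illustration of how the typed result_* columns and the grouping_level_* columns might be consumed, the sketch below works on a single row represented as a plain Python dict. The functions `result_value` and `data_group_name` are hypothetical helpers, not DQOps APIs. The sketch assumes that at most one result_* column is populated per row and that missing grouping levels are skipped when the name is concatenated. Neither behavior is confirmed by the schema description above.

```python
from typing import Any, Mapping, Optional

# The typed result columns listed in the schema above.
RESULT_COLUMNS = (
    "result_string", "result_integer", "result_float", "result_boolean",
    "result_date", "result_date_time", "result_instant", "result_time",
)


def result_value(row: Mapping[str, Any]) -> Optional[Any]:
    """Return the collected value of a row, or None for failed collectors."""
    # Rows with status 'error' carry an error_message instead of a result.
    if row.get("status") != "success":
        return None
    # Assumption: at most one result_* column is non-null in a row.
    for column in RESULT_COLUMNS:
        value = row.get(column)
        if value is not None:  # keeps legitimate False and 0 values
            return value
    return None


def data_group_name(row: Mapping[str, Any]) -> str:
    """Join grouping_level_1..9 with ' / ' (missing levels skipped by assumption)."""
    levels = (row.get(f"grouping_level_{i}") for i in range(1, 10))
    return " / ".join(level for level in levels if level is not None)


row = {
    "status": "success",
    "result_type": "integer",
    "result_integer": 42,
    "grouping_level_1": "US",
    "grouping_level_2": "2023",
}
print(result_value(row), "|", data_group_name(row))
```

Checking `is not None` rather than truthiness matters here, because a collector may legitimately return `0` or `False`.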
What's more
- You can find more information on how the Parquet files are partitioned in the data quality results storage concept.