Last updated: July 22, 2025

DQOps statistics parquet table schema

The parquet file schema for the statistics table stored in the $DQO_USER_HOME/.data/statistics folder in DQOps.

Table description

The basic profiling results (statistics) table stores basic profiling statistical values. The statistics table is located in the $DQO_USER_HOME/.data/statistics folder, which contains uncompressed parquet files. The table is partitioned using a Hive-compatible partitioning folder structure. When $DQO_USER_HOME is not configured, it defaults to the folder where DQOps was started (the DQOps user's home folder).
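The folder lookup described above can be sketched in Python. This is a minimal illustration, not DQOps code: the function name `statistics_folder` and its parameters are hypothetical, and it assumes that "the folder where DQOps was started" corresponds to the working directory of the process.

```python
import os
from pathlib import Path


def statistics_folder(environ=None, start_dir=None) -> Path:
    """Return the folder that is expected to hold the statistics parquet files.

    Hypothetical helper: `environ` and `start_dir` exist only to make the
    lookup easy to test; by default the real environment and the current
    working directory are used.
    """
    environ = os.environ if environ is None else environ
    user_home = environ.get("DQO_USER_HOME")
    # Assumption: when DQO_USER_HOME is not set, fall back to the start folder.
    base = Path(user_home) if user_home else Path(start_dir or os.getcwd())
    return base / ".data" / "statistics"


if __name__ == "__main__":
    print(statistics_folder())
```

Passing the environment as a dictionary keeps the sketch free of side effects, so both the configured and the fallback case can be checked without changing the real environment.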

The folder partitioning structure for this table is: c=[connection_name]/t=[schema_name.table_name]/m=[first_day_of_month]/, for example: c=myconnection/t=public.analyzedtable/m=2023-01-01/.
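As a rough illustration of the partitioning pattern, the relative partition folder can be derived from the connection name, the schema and table names, and any date within the month. The function name `partition_folder` is hypothetical, and the sketch assumes that the names need no escaping; DQOps may encode special characters in folder names differently.

```python
from datetime import date


def partition_folder(connection_name: str, schema_name: str,
                     table_name: str, day: date) -> str:
    """Build the relative Hive-style partition folder for a statistics row.

    Hypothetical helper; assumes names contain no characters that would
    need escaping in a folder name.
    """
    # The m= partition holds the first day of the month.
    first_day_of_month = day.replace(day=1)
    return (
        f"c={connection_name}/"
        f"t={schema_name}.{table_name}/"
        f"m={first_day_of_month.isoformat()}/"
    )


if __name__ == "__main__":
    print(partition_folder("myconnection", "public", "analyzedtable",
                           date(2023, 1, 17)))
```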

Parquet table schema

The columns of this table are described below.

| Column name | Description | Hive data type |
|---|---|---|
| id | Column for a statistics result id (primary key). It is a uuid generated by combining and hashing the connection_name, table_name, column name, collector_hash (which identifies the type of the basic profiling collector), data_group_hash (when the statistics used data grouping), and the executed_at value. This value identifies a single row. | STRING |
| collected_at | Column for the time when the statistics were captured. All statistics results started as part of the same statistics collection session share the same time. The parquet files are time partitioned by this column. | TIMESTAMP |
| status | Column for a statistics collector status ('success' or 'error'). | STRING |
| result_type | Column for a statistics collector result data type. | STRING |
| result_string | Column for a statistics collector result when it is a string value. | STRING |
| result_integer | Column for a statistics collector result when it is an integer value. It is a long (64 bit) value where we store all short, integer, long values. | BIGINT |
| result_float | Column for a statistics collector result when it is a numeric value. It is a double value where we store all double, float, numeric and decimal values. | DOUBLE |
| result_boolean | Column for a statistics collector result when it is a boolean value. | BOOLEAN |
| result_date | Column for a statistics collector result when it is a local date value. | DATE |
| result_date_time | Column for a statistics collector result when it is a local date time value. | TIMESTAMP |
| result_instant | Column for a statistics collector result when it is an absolute (UTC timezone) instant. | TIMESTAMP |
| result_time | Column for a statistics collector result when it is a time value. | INTERVAL |
| sample_index | The index of the sample for statistics collectors that collect data samples. | INTEGER |
| sample_count | The count of the samples for statistics collectors that collect data samples. | BIGINT |
| scope | String column that indicates whether the result is for a whole table (the "table" value) or for each data group separately (the "data_group" value). | STRING |
| grouping_level_1 | Data group value at a single level. | STRING |
| grouping_level_2 | Data group value at a single level. | STRING |
| grouping_level_3 | Data group value at a single level. | STRING |
| grouping_level_4 | Data group value at a single level. | STRING |
| grouping_level_5 | Data group value at a single level. | STRING |
| grouping_level_6 | Data group value at a single level. | STRING |
| grouping_level_7 | Data group value at a single level. | STRING |
| grouping_level_8 | Data group value at a single level. | STRING |
| grouping_level_9 | Data group value at a single level. | STRING |
| data_group_hash | Column for a data group hash. It is a hash of the data group level values. | BIGINT |
| data_group_name | The data group name. It is a concatenated name of the data group dimensions, created from [grouping_level_1] / [grouping_level_2] / ... | STRING |
| data_grouping_configuration | The data grouping configuration name. It is the name of the named data grouping configuration that was used to run the data quality check. | STRING |
| connection_hash | Column for a connection hash. | BIGINT |
| connection_name | Column for a connection name. | STRING |
| provider | Column for a provider name. | STRING |
| table_hash | Column for a table hash. | BIGINT |
| schema_name | Column for a table schema. | STRING |
| table_name | Column for a table name. | STRING |
| table_stage | Column for a table stage. | STRING |
| column_hash | Column for a column hash. | BIGINT |
| column_name | Column for a column name. | STRING |
| collector_hash | Column for a statistics collector hash. | BIGINT |
| collector_name | Column for a statistics collector name. | STRING |
| collector_target | Column for a statistics collector target (table, column). | STRING |
| collector_category | Column for a statistics collector category. | STRING |
| sensor_name | Column for a sensor name. | STRING |
| time_series_id | Column for a time series id (uuid). Identifies a single time series. A time series is a combination of the profiler_hash and data_group_hash. | STRING |
| executed_at | Column for a statistics collector executed at timestamp. | TIMESTAMP |
| duration_ms | Column for a sensor duration in milliseconds. | INTEGER |
| error_message | Column for an optional error message when the status is 'error'. | STRING |
| created_at | The timestamp when the row was created. | TIMESTAMP |
| updated_at | The timestamp when the row was updated. | TIMESTAMP |
| created_by | The login of the user that created the row. | STRING |
| updated_by | The login of the user that updated the row. | STRING |
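Because each row stores its value in only one of the typed result_* columns, a reader usually has to find the populated one. The sketch below shows one way to do that for a row represented as a Python dictionary, and also rebuilds a data group name from the grouping_level_N columns. The helper names `result_value` and `data_group_name` are hypothetical, and two behaviors are assumptions rather than documented DQOps behavior: unused result columns are treated as null (None), and empty grouping levels are skipped when the name is joined with " / ".

```python
# The typed result columns from the schema, in the order they are listed.
RESULT_COLUMNS = (
    "result_string",
    "result_integer",
    "result_float",
    "result_boolean",
    "result_date",
    "result_date_time",
    "result_instant",
    "result_time",
)


def result_value(row: dict):
    """Return (column_name, value) for the first non-null result_* column.

    Assumption: unused result columns are stored as nulls (None).
    Returns (None, None) when no result column is populated.
    """
    for column in RESULT_COLUMNS:
        value = row.get(column)
        if value is not None:  # keeps legitimate False and 0 values
            return column, value
    return None, None


def data_group_name(row: dict) -> str:
    """Join grouping_level_1..9 with " / ", skipping missing levels.

    Assumption: skipping null levels is a simplification, the stored
    data_group_name column may be formatted differently.
    """
    levels = (row.get(f"grouping_level_{i}") for i in range(1, 10))
    return " / ".join(str(level) for level in levels if level is not None)


if __name__ == "__main__":
    sample = {"result_integer": 42, "grouping_level_1": "US"}
    print(result_value(sample), data_group_name(sample))
```

Comparing against None rather than relying on truthiness matters here, because a boolean result of False or an integer result of 0 is still a valid statistics value.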

What's more