Last updated: July 22, 2025

What are Data Quality Dimensions? Definitions, Examples and Best Practices

Data quality dimensions are standards for evaluating data quality and ensuring that data is suitable for its intended use. They are measured with data quality checks.

What is a data quality dimension?

Data quality dimension is a term adopted by the data quality field to identify those aspects of data that can be measured and through which data quality can be quantified. While different experts have proposed different data quality dimensions and there is no standardization of their names or descriptions, almost all of them include some version of accuracy, completeness, consistency, timeliness, uniqueness, and validity.

DQOps additionally uses Integrity, Availability, and Reasonableness data quality dimensions because they identify issues that affect unstructured data in data lakes.

Data quality dimension types

The following table explains data quality dimensions supported by DQOps, showing both the definition of the dimension, and examples of typical data quality issues.

Data quality dimension name Data quality dimension definition Potential data quality issues
Accuracy The degree of closeness of data values to real values, often measured by comparison with a known source of correct information (the source of truth). The data in a downstream data source does not match the original data from an upstream data source.
Aggregated values do not match similar values computed from a different data source (for example, financial data).
Availability The degree to which the data source is available for usage with no access issues. The table is missing.
Credentials to the data source expired.
Table is physically corrupted due to an unrecoverable hard disk failure.
Completeness The degree to which all required
- records in the dataset,
- data values
are present with no missing information.
Required columns are empty.
Tables are empty or too small.
The percentage of null values exceeds an accepted threshold.
Consistency The degree to which data values of two sets of attributes
- within a record,
- within a data file,
- between data files,
- within a record at different points in time
comply with a rule.
This dimension confirms that values remain consistent across time periods.
An abnormal value outside regular boundaries is found using anomaly detection.
The aggregated value (sum, min, max, avg) for one time period differs too much from an earlier period (yesterday, last month, etc.).
Integrity The degree to which relational data is structurally correct. Lookup by a foreign key value did not find a matching record in a dimension or dictionary table.
Reasonableness The degree to which data values are reasonable and make sense. A sum of values in an aggregable column is outside an accepted range. For example, the total revenue per day is not within reasonable boundaries.
Timeliness The degree to which the period between the time of creation of the real value and the time that the dataset is available is appropriate (the data is fresh). The data is not up-to-date. The most recent record is older than an accepted delay.
Uniqueness The degree to which records occur only once in a data set and are not duplicated. Duplicate values found in a key column that must contain only unique values.
Validity The degree to which data values comply with pre-defined business rules such as the format, patterns, type, and range. Invalid zip codes, e-mail addresses, or phone number formats. Values not matching regular expression patterns.

The list of data quality checks for each data quality dimension can be found in the sections below.

How data quality dimensions are measured

DQOps measures data quality by calculating a data quality KPI score, defined as the percentage of passed data quality checks out of all executed checks. The following formula calculates the score.

Data quality KPI formula simple
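
Restated as a formula, this simply expresses the definition above (the image shows the same calculation):

$$ \text{Data quality KPI} = \frac{\text{number of passed data quality checks}}{\text{number of executed data quality checks}} \times 100\% $$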

The data quality KPI is calculated by grouping data quality check results by the data quality dimension. The following image shows a data quality KPI dashboard from DQOps, showing the results for each measured data quality dimension.

Data quality KPIs divided by data quality dimensions

Best practices for monitoring data quality dimensions

Selecting data quality dimensions is one of the steps in the process of defining business needs for data quality monitoring. This is usually done by the Data Owner who understands the purpose of the data, the data model, and the business processes in their area of responsibility.

Before you can select the data quality dimensions that are relevant for your organization, you must first identify your current goals and the scope of data quality monitoring. This will make it easier to come up with metrics to measure data quality.

Next, you need to identify the data elements that are critical or required for a specific business process that needs to be monitored. This data is typically referred to as critical data elements (CDEs). The Data Owner should also collect and define the expectations of data consumers regarding the condition of the data that ensure its suitability for particular purposes.

It is also a good idea to review the list of previous data quality issues that the Data Owner would like to eliminate in the future.

Only after completing the previous steps can you assess which data quality dimensions are important for your organization. For example, if the data must arrive on time and without delays, the organization should prioritize timeliness. If it is more important that the data arrives in a certain format, the organization should prioritize validity.

You can learn more about defining data quality requirements and how to set up the whole data quality monitoring process in our eBook "A step-by-step guide to improve data quality".

The purpose of data quality dimensions

Measuring data quality dimensions ensures data meets its purpose, establishes benchmarks for improvement, and builds trust in data-driven decisions. By quantifying key aspects, you can identify and prioritize improvements, ensuring reliable insights and data-driven decision making. It also helps track progress and justify data quality investments, fostering a shared understanding across the organization.

In short, measuring data quality dimensions transforms a vague concept like "good data" into concrete, actionable metrics that are essential for ensuring the trustworthiness of your data-driven processes.

Assess the current state of data

Measuring data quality dimensions allows you to assess data fitness for use by uncovering issues and determining its suitability for specific purposes. Data quality dimensions help identify problems like missing values (completeness), misspellings and wrong values (validity), inconsistencies and data anomalies (consistency), and outdated information (timeliness).

You can also use these dimensions to determine if your data is good enough for a particular task, such as basic reporting versus complex predictive modeling, since different use cases may have varying requirements across data quality dimensions.

The following image shows the data quality KPI scores divided by data sources, schemas, and tables. When a data quality dimension is selected on the dashboard, the results are filtered to show the scores only for that dimension.

Data quality KPIs divided by data quality dimensions

Establish benchmark and track improvements

Data quality dimensions provide measurable attributes, transforming data quality assessments from subjective judgments to quantifiable metrics. This allows you to monitor progress over time and track whether your data cleansing initiatives, governance policies, or operational changes are positively impacting the quality of your data.

The following image shows the data quality KPI calculated for each data quality dimension. The initial scores (the benchmark) can be calculated during the data profiling phase.

Data quality KPI per data quality dimension with effort

We can also track data quality improvement for each data quality dimension over time. The following chart shows the data quality KPIs for each day of the month and data quality dimension.

Data quality KPI score per data quality dimension daily chart

Facilitate data-driven decision making

Measuring data quality dimensions empowers data-driven decision making in two key ways. First, high-quality data builds trust in the insights and visualizations derived from it. You can be confident that your conclusions are based on reliable information, not flawed data.

Second, understanding the strengths and limitations of your data across various dimensions allows you to make informed business decisions. Data quality insights guide your actions and prevent you from basing critical choices on potentially inaccurate information.

Prioritize data quality effort

By measuring data quality dimensions, you can identify the areas where data quality issues have the biggest impact. This allows you to focus your resources on making improvements that will yield the greatest benefits. Additionally, tracking changes in data quality over time helps you demonstrate the return on investment (ROI) of data quality initiatives. This can be crucial for justifying continued investment in maintaining and improving data quality.

The best method to prioritize the data cleansing process is to filter data quality results by a data quality dimension with a low KPI score. Then, cleanse the tables with the lowest data quality score or those whose KPI score has dropped since the previous month.

Priority of tables to fix data quality issues guided by the data quality KPI score

Enhance trust and collaboration

Clearly defined and measured data quality dimensions establish transparent standards and build trust among data consumers and those who manage it.
This creates a common vocabulary for discussing data quality, making collaboration between data producers and users more effective. Everyone involved has a clear understanding of what constitutes high-quality data, which fosters better communication and collaboration.

Data quality dimensions reference

The following chapters describe every data quality dimension in detail and include a list of data quality checks that measure that dimension.

Data Accuracy

Data accuracy definition

The following table provides an overview of the accuracy data quality dimension.

Dimension Description How DQOps measures data accuracy
Accuracy The data accuracy dimension measures how closely the data matches the source of truth.
Simply speaking, it is about comparing tables.
This process is also called data reconciliation.
DQOps detects data accuracy issues by comparing data across tables and data sources.
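
As an illustration of the reconciliation approach described above, the sketch below compares the row counts of a tested table and a reference table and computes the percentage difference, which is roughly what a check such as total_row_count_match_percent evaluates. The table names (landing_zone.customer, crm.customer) are hypothetical, and DQOps generates its own dialect-specific query, so treat this only as a minimal sketch.

```sql
-- Minimal sketch: compare the row count of a tested table with its source of truth.
-- A large relative difference indicates a potential accuracy (reconciliation) issue.
SELECT
    tested.row_count    AS tested_row_count,
    reference.row_count AS reference_row_count,
    100.0 * ABS(tested.row_count - reference.row_count)
        / NULLIF(reference.row_count, 0) AS row_count_diff_percent
FROM (SELECT COUNT(*) AS row_count FROM landing_zone.customer) AS tested
CROSS JOIN (SELECT COUNT(*) AS row_count FROM crm.customer) AS reference;
```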

Table accuracy checks

The following table lists data quality checks that detect accuracy issues on tables.

Data quality check name Friendly name Check category Description Standard check
total_row_count_match_percent Maximum percentage of difference in total row count between tables accuracy A table-level check that compares the row count of the current (tested) table with the row count of another table that is referenced. This check ensures that the difference between the row counts is below the maximum accepted percentage of difference. This check runs an SQL query with an INNER JOIN clause to join another (referenced) table that must be defined in the same database.
row_count_match Maximum percentage of difference between row count of compared tables comparisons Table level comparison check that compares the row count of the current (parent) table with the row count of the reference table.
column_count_match Maximum percentage of difference between column count of compared tables comparisons Table level comparison check that compares the column count of the current (parent) table with the column count of the reference table.

Column accuracy checks

The following table lists data quality checks that detect accuracy issues on columns. A sketch of the grouped comparison approach used by these checks follows the table.

Data quality check name Friendly name Check category Description Standard check
total_sum_match_percent The maximum difference in percent between the total sums of this column and the reference column accuracy A column-level check that ensures that the difference between the sum of all values in the tested column and the sum of values in another column in a referenced table is below a maximum accepted percentage of difference. This check runs an SQL query with an INNER JOIN clause to join another (referenced) table that must be defined in the same database.
total_min_match_percent The maximum difference in percent between the minimum value of this column and the reference column accuracy A column-level check that ensures that the difference between the minimum value in the tested column and the minimum value in another column in a referenced table is below a maximum accepted percentage of difference. This check runs an SQL query with an INNER JOIN clause to join another (referenced) table that must be defined in the same database.
total_max_match_percent The maximum difference in percent between the maximum value of this column and the reference column accuracy A column-level check that ensures that the difference between the maximum value in the tested column and the maximum value in another column in a referenced table is below a maximum accepted percentage of difference. This check runs an SQL query with an INNER JOIN clause to join another (referenced) table that must be defined in the same database.
total_average_match_percent The maximum difference in percent between the average value of this column and the reference column accuracy A column-level check that ensures that the difference between the average value in the tested column and the average value of another column in the referenced table is below the maximum accepted percentage of difference. This check runs an SQL query with an INNER JOIN clause to join another (referenced) table that must be defined in the same database.
total_not_null_count_match_percent The maximum difference in percent between the count of not null values of this column and the reference column accuracy A column-level check that ensures that the difference between the count of not null values in the tested column and the count of not null values in another column in a referenced table is below a maximum accepted percentage of difference. This check runs an SQL query with an INNER JOIN clause to join another (referenced) table that must be defined in the same database.
sum_match Maximum percentage of difference between sums of compared columns comparisons A column-level check that compares the sum of the values in the tested column to the sum of values in a reference column from the reference table. Compares the sum of values for each group of data. The data is grouped using a GROUP BY clause and groups are matched between the tested (parent) table and the reference table (the source of truth).
min_match Maximum percentage of difference between minimum values of compared columns comparisons A column-level check that compares the minimum value in the tested column to the minimum value in a reference column from the reference table. Compares the minimum values for each group of data. The data is grouped using a GROUP BY clause and groups are matched between the tested (parent) table and the reference table (the source of truth).
max_match Maximum percentage of difference between maximum values of compared columns comparisons A column-level check that compares the maximum value in the tested column to the maximum value in a reference column from the reference table. Compares the maximum values for each group of data. The data is grouped using a GROUP BY clause and groups are matched between the tested (parent) table and the reference table (the source of truth).
mean_match Maximum percentage of difference between mean (average) values of compared columns comparisons A column-level check that compares the mean (average) of the values in the tested column to the mean (average) of values in a reference column from the reference table. Compares the mean (average) value for each group of data. The data is grouped using a GROUP BY clause and groups are matched between the tested (parent) table and the reference table (the source of truth).
not_null_count_match Maximum percentage of difference between the counts of not null values in compared columns comparisons A column-level check that compares the count of not null values in the tested column to the count of not null values in a reference column from the reference table. Compares the count of not null values for each group of data. The data is grouped using a GROUP BY clause and groups are matched between the tested (parent) table and the reference table (the source of truth).
null_count_match Maximum percentage of difference between the counts of null values in compared columns comparisons A column-level check that compares the count of null values in the tested column to the count of null values in a reference column from the reference table. Compares the count of null values for each group of data. The data is grouped using a GROUP BY clause and groups are matched between the tested (parent) table and the reference table (the source of truth).
distinct_count_match Maximum percentage of difference between the counts of distinct values in compared columns comparisons A column-level check that compares the count of distinct values in the tested column to the count of distinct values in a reference column from the reference table. Compares the count of distinct values for each group of data. The data is grouped using a GROUP BY clause and groups are matched between the tested (parent) table and the reference table (the source of truth).
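
The comparison checks above (sum_match, min_match, max_match, mean_match, and the count-based variants) group both tables with a GROUP BY clause and compare the aggregates group by group. The sketch below shows the general shape of such a grouped comparison for a sum; the table names (analytics.orders, erp.orders), the country grouping column, and the amount column are hypothetical, and DQOps builds its own queries, so this is only an illustration.

```sql
-- Minimal sketch of a grouped comparison in the spirit of sum_match:
-- compare the sum of an amount column per country between the tested table
-- and the reference table (the source of truth).
SELECT
    COALESCE(t.country, r.country) AS country,
    t.total_amount                 AS tested_sum,
    r.total_amount                 AS reference_sum,
    100.0 * ABS(t.total_amount - r.total_amount)
        / NULLIF(r.total_amount, 0) AS sum_diff_percent
FROM (SELECT country, SUM(amount) AS total_amount
        FROM analytics.orders GROUP BY country) AS t
FULL OUTER JOIN
     (SELECT country, SUM(amount) AS total_amount
        FROM erp.orders GROUP BY country) AS r
    ON t.country = r.country;
```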

Data Availability

Data availability definition

The following table provides an overview of the availability data quality dimension.

Dimension Description How DQOps measures data availability
Availability The data availability dimension shows how reliable the data source is regarding accessibility.
Simply speaking, the table can be queried during business hours.
DQOps detects data availability issues by running a simple SQL query as part of the default configuration of data observability checks.
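
A minimal sketch of the kind of availability probe described above: a trivial query that succeeds only when the table exists, the credentials are valid, and the storage is readable. The table name warehouse.fact_sales is hypothetical and the LIMIT syntax varies by database; DQOps runs its own dialect-specific query.

```sql
-- Minimal availability probe: if this query returns without an error,
-- the table exists and is queryable.
SELECT COUNT(*) AS probe_result
FROM (
    SELECT 1 AS one
    FROM warehouse.fact_sales
    LIMIT 1
) AS probe;
```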

Table availability checks

The following table lists data quality checks that detect availability issues on tables.

Data quality check name Friendly name Check category Description Standard check
table_availability Table availability availability A table-level check that ensures a query can be successfully executed on a table without server errors. It also verifies that the table exists and is accessible (queryable). The actual value (the result of the check) indicates the number of failures. If the table is accessible and a simple query can be executed without errors, the result will be 0.0. A sensor result (the actual value) of 1.0 indicates that there is a failure. Any value greater than 1.0 is stored only in the check result table and represents the number of consecutive failures in the following days.

Data Completeness

Data completeness definition

The following table provides an overview of the completeness data quality dimension.

Dimension Description How DQOps measures data completeness
Completeness The data completeness dimension identifies missing data.
Simply speaking, tables are empty, or columns contain null values.
DQOps detects data completeness issues by running a row count query to detect empty or too small tables.
Column completeness is measured by detecting null values, as well as values that are effectively missing because a null placeholder text was stored in the column.
Common null placeholder values are 'n/a', 'null', 'undefined', 'None'.
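
The sketch below profiles the completeness of a single column by counting rows, null values, and the common null placeholder texts listed above. The table and column names (crm.customers, customer_email) are hypothetical; DQOps generates equivalent queries for checks such as nulls_count, nulls_percent, and null_placeholder_text_found.

```sql
-- Minimal completeness profile for one column: total rows, nulls,
-- and well-known null placeholder texts.
SELECT
    COUNT(*) AS row_count,
    SUM(CASE WHEN customer_email IS NULL THEN 1 ELSE 0 END) AS null_count,
    SUM(CASE WHEN LOWER(customer_email) IN ('n/a', 'null', 'undefined', 'none')
             THEN 1 ELSE 0 END) AS null_placeholder_count
FROM crm.customers;
```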

Table completeness checks

The following table lists data quality checks that detect completeness issues on tables.

Data quality check name Friendly name Check category Description Standard check
column_count Expected column count schema A table-level check that retrieves the metadata of the monitored table from the data source, counts the number of columns and compares it to an expected number of columns.
row_count Minimum row count (empty or too small table) volume This check detects empty or too-small tables. It captures the row count of a tested table. This check raises a data quality issue when the row count is below a minimum accepted value. The default value of the rule parameter min_count is 1 (row), which detects empty tables. When the data grouping is configured, this check will count rows using a GROUP BY clause and verify that each data grouping has an expected minimum number of rows.

Column completeness checks

The following table lists data quality checks that detect completeness issues on columns.

Data quality check name Friendly name Check category Description Standard check
nulls_count Maximum count of rows containing null values (incomplete column) nulls Detects incomplete columns that contain any null values. Counts the number of rows having a null value. Raises a data quality issue when the count of null values is above a max_count threshold.
nulls_percent Maximum percentage of rows containing null values (incomplete column) nulls Detects incomplete columns that contain any null values. Measures the percentage of rows having a null value. Raises a data quality issue when the percentage of null values is above a max_percent threshold. Configure this check to accept a given percentage of null values by setting the max_percent parameter.
nulls_percent_anomaly Abnormal change in percentage of null values. Measured as a percentile of anomalous measures. nulls Detects day-to-day anomalies in the percentage of null values. Measures the percentage of rows having a null value. Raises a data quality issue when the rate of null values increases or decreases too much.
not_nulls_count Minimum count of rows containing non-null values nulls Verifies that a column contains a minimum number of non-null values. The default value of the min_count parameter is 1 to detect at least one value in a monitored column.
not_nulls_percent Maximum percentage of rows containing non-null values nulls Verifies that a column contains some null values by measuring the maximum percentage of rows that have non-null values. Raises a data quality issue when the percentage of non-null values is above max_percentage, which means that a column that is expected to contain null values is fully (or almost fully) populated. The default value of the max_percentage parameter is 0.0, but DQOps supports setting a higher value to verify that the percentage of non-null values is not above a threshold.
empty_column_found Find an empty column nulls Detects empty columns that contain only null values. Counts the number of rows that have non-null values. Raises a data quality issue when the count of non-null values is below min_count. The default value of the min_count parameter is 1, but DQOps supports setting a higher number to assert that a column has at least that many non-null values.
nulls_percent_change Maximum percentage of change in the count of null values nulls Detects relative increases or decreases in the percentage of null values since the last measured percentage. Measures the percentage of null values for each day. Raises a data quality issue when the change in the percentage of null values is above max_percent of the previous percentage.
nulls_percent_change_1_day Maximum percentage of change in the count of null values vs 1 day ago nulls Detects relative increases or decreases in the percentage of null values since the previous day. Measures the percentage of null values for each day. Raises a data quality issue when the change in the percentage of null values is above max_percent of the previous percentage.
nulls_percent_change_7_days Maximum percentage of change in the count of null values vs 7 days ago nulls Detects relative increases or decreases in the percentage of null values since the last week (seven days ago). Measures the percentage of null values for each day. Raises a data quality issue when the change in the percentage of null values is above max_percent of the previous percentage.
nulls_percent_change_30_days Maximum percentage of change in the count of null values vs 30 days ago nulls Detects relative increases or decreases in the percentage of null values since the last month (30 days ago). Measures the percentage of null values for each day. Raises a data quality issue when the change in the percentage of null values is above max_percent of the previous percentage.
column_exists Verify if the column exists schema A column-level check that reads the metadata of the monitored table and verifies if the column still exists in the data source. The data quality sensor returns a value of 1.0 when the column is found or 0.0 when the column is not found.
empty_text_found Maximum count of empty text values whitespace This check detects empty texts that are not null. Empty texts have a length of zero. The database treats them as values different than nulls, and some databases allow the storage of both null and empty values. This check counts empty texts and raises a data quality issue when the number of empty values exceeds a max_count parameter value.
whitespace_text_found Maximum count of whitespace values whitespace This check detects empty texts containing only spaces and other whitespace characters. This check counts whitespace-only texts and raises a data quality issue when their count exceeds a max_count parameter value.
null_placeholder_text_found Maximum count of null placeholder values (i.e., None, n/a, null) whitespace This check detects text values that are well-known equivalents (placeholders) of a null value, such as null, None, n/a. This check counts null placeholder values and raises a data quality issue when their count exceeds a max_count parameter value.
empty_text_percent Maximum percentage of empty text values whitespace This check detects empty texts that are not null. Empty texts have a length of zero. This check measures the percentage of empty texts and raises a data quality issue when the rate of empty values exceeds a max_percent parameter value.
whitespace_text_percent Maximum percentage of whitespace values whitespace This check detects empty texts containing only spaces and other whitespace characters. This check measures the percentage of whitespace-only texts and raises a data quality issue when their rate exceeds a max_percent parameter value.
null_placeholder_text_percent Maximum percentage of null placeholder values (i.e., None, n/a, null) whitespace This check detects text values that are well-known equivalents (placeholders) of a null value, such as null, None, n/a. This check measures the percentage of null placeholder values and raises a data quality issue when their rate exceeds a max_percent parameter value.

Data Consistency

Data consistency definition

The following table provides an overview of the consistency data quality dimension.

Dimension Description How DQOps measures data consistency
Consistency The data consistency dimension ensures that the data stays consistent across different time periods.
Simply speaking, there are no anomalies detected by time series analysis.
The Britannica Dictionary provides the simplest definition of consistency, which says: "the quality or fact of staying the same at different times".
Data consistency is the most misunderstood data quality dimension. Every data quality vendor uses a different definition. According to some definitions, data consistency refers to the accuracy and uniformity of data stored across multiple data sources. However, this definition is too close to data accuracy, which is measured by comparing tables across data sources (data reconciliation).

Other versions define data consistency as representing data in the same format across time or different systems. This definition is also too close to the data validity dimension, which validates whether the data is in the right format.

DQOps has chosen to follow the dictionary definition of consistency, which fits well with the modern approach to data quality monitoring. DQOps uses time-series anomaly detection, a core technique of data observability, together with data comparisons within the same table, to compare measures across time periods. For example, it verifies that today's row count is close to yesterday's row count.
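
To make the time-series comparison concrete, the sketch below computes daily row counts and the relative change from the previous measured day, which is the kind of signal the row_count_change and row_count_anomaly checks evaluate (DQOps applies its own rules and anomaly detection on top of such measures). The table name sales.transactions and the created_at column are hypothetical.

```sql
-- Minimal sketch: day-to-day relative change in row count.
-- LAG() compares each day to the previous day that has data.
WITH daily_counts AS (
    SELECT CAST(created_at AS DATE) AS day, COUNT(*) AS row_count
    FROM sales.transactions
    GROUP BY CAST(created_at AS DATE)
)
SELECT
    day,
    row_count,
    LAG(row_count) OVER (ORDER BY day) AS previous_row_count,
    100.0 * ABS(row_count - LAG(row_count) OVER (ORDER BY day))
        / NULLIF(LAG(row_count) OVER (ORDER BY day), 0) AS change_percent
FROM daily_counts
ORDER BY day;
```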

Table consistency checks

The following table lists data quality checks that detect consistency issues on tables.

Data quality check name Friendly name Check category Description Standard check
column_count_changed Detect change of column count schema A table-level check that detects if the number of columns in the table has changed since the last time the check (checkpoint) was run. This check retrieves the metadata of the monitored table from the data source, counts the number of columns and compares it to the last known number of columns that was captured and is stored in the data quality check results database.
column_list_changed Detect if columns were added or removed schema A table-level check that detects if the list of columns has changed since the last time the check was run. This check will retrieve the metadata of a tested table and calculate a hash of the column names. The hash will not depend on the order of columns, only on the column names. A data quality issue will be detected if new columns were added or columns that existed during the previous test were dropped.
column_list_or_order_changed Detect if the column list or order has changed schema A table-level check that detects if the list of columns and the order of columns have changed since the last time the check was run. This check will retrieve the metadata of a tested table and calculate a hash of the column names. The hash will depend on the order of columns. A data quality issue will be detected if new columns were added, columns that existed during the previous test were dropped or the columns were reordered.
column_types_changed Detect if the column list or data type has changed schema A table-level check that detects if the column names or column types have changed since the last time the check was run. This check calculates a hash of the column names and all the components of the column's data type: the data type name, length, scale, precision and nullability. A data quality issue will be detected if the hash of the column data types has changed. This check does not depend on the order of columns, the columns can be reordered as long as all columns are still present and the data types match since the last time they were tested.
row_count_anomaly Abnormal change in row count since the last known value volume This check detects anomalies in the day-to-day changes to the table volume (the row count). It captures the row count for each day and compares the row count change (increase or decrease) since the previous day. This check raises a data quality issue when the change is in the top anomaly_percent percentage of the biggest day-to-day changes.
row_count_change Maximum relative change in the row count since the last known value volume This check compares the current table volume (the row count) to the last known row count. It raises a data quality issue when the change in row count (increase or decrease) exceeds a maximum accepted percentage of change.
row_count_change_1_day Maximum relative change in the row count vs 1 day ago volume This check compares the current table volume (the row count) to the row count from the previous day. It raises a data quality issue when the change in row count (increase or decrease) since yesterday exceeds a maximum accepted percentage of change.
row_count_change_7_days Maximum relative change in the row count vs 7 days ago volume This check compares the current table volume (the row count) to the row count seven days ago. This check compares the table volume to a value a week ago to overcome weekly seasonality and to compare Mondays to Mondays, Tuesdays to Tuesdays, etc. It raises a data quality issue when the change in row count (increase or decrease) since a week ago exceeds a maximum accepted percentage of change.
row_count_change_30_days Maximum relative change in the row count vs 30 days ago volume This check compares the current table volume (the row count) to the row count 30 days ago. This check compares the table volume to the value from a month ago to overcome monthly seasonality. It raises a data quality issue when the change in row count (increase or decrease) since 30 days ago exceeds a maximum accepted percentage of change.

Column consistency checks

The following table lists data quality checks that detect consistency issues on columns.

Data quality check name Friendly name Check category Description Standard check
sum_anomaly Abnormal change in the sum of numeric values. Measured as a percentile of anomalous values. anomaly This check calculates a sum of values in a numeric column and detects anomalies in a time series of previous sums. It raises a data quality issue when the sum is in the top anomaly_percent percentage of the most outstanding values in the time series. This data quality check uses a 90-day time window and requires a history of at least 30 days.
mean_anomaly Abnormal change in the mean (average) of numeric values. Measured as a percentile of anomalous values. anomaly This check calculates a mean (average) of values in a numeric column and detects anomalies in a time series of previous averages. It raises a data quality issue when the mean is in the top anomaly_percent percentage of the most outstanding values in the time series. This data quality check uses a 90-day time window and requires a history of at least 30 days.
median_anomaly Abnormal change in the median of numeric values. Measured as a percentile of anomalous values. anomaly This check calculates a median of values in a numeric column and detects anomalies in a time series of previous medians. It raises a data quality issue when the median is in the top anomaly_percent percentage of the most outstanding values in the time series. This data quality check uses a 90-day time window and requires a history of at least 30 days.
min_anomaly Abnormal change in the minimum of numeric values. Measured as a percentile of anomalous values. anomaly This check finds a minimum value in a numeric column and detects anomalies in a time series of previous minimum values. It raises a data quality issue when the current minimum value is in the top anomaly_percent percentage of the most outstanding values in the time series (it is a new minimum value, far from the previous one). This data quality check uses a 90-day time window and requires a history of at least 30 days.
max_anomaly Abnormal change in the maximum of numeric values. Measured as a percentile of anomalous values. anomaly This check finds a maximum value in a numeric column and detects anomalies in a time series of previous maximum values. It raises a data quality issue when the current maximum value is in the top anomaly_percent percentage of the most outstanding values in the time series (it is a new maximum value, far from the previous one). This data quality check uses a 90-day time window and requires a history of at least 30 days.
mean_change Maximum relative change in the mean (average) of numeric values since the last known value anomaly This check detects that the mean (average) of numeric values has changed more than max_percent from the last measured mean.
mean_change_1_day Maximum relative change in the mean (average) of numeric values vs 1 day ago anomaly This check detects that the mean (average) of numeric values has changed more than max_percent from the mean value measured one day ago (yesterday).
mean_change_7_days Maximum relative change in the mean (average) of numeric values vs 7 days ago anomaly This check detects that the mean (average) value of numeric values has changed more than max_percent from the mean value measured seven days ago. This check aims to overcome weekly seasonality and compare Mondays to Mondays, Tuesdays to Tuesdays, etc.
mean_change_30_days Maximum relative change in the mean (average) of numeric values vs 30 days ago anomaly This check detects that the mean (average) of numeric values has changed more than max_percent from the mean value measured thirty days ago. This check aims to overcome monthly seasonality and compare a value to a similar value a month ago.
median_change Maximum relative change in the median of numeric values since the last known value anomaly This check detects that the median of numeric values has changed more than max_percent from the last measured median.
median_change_1_day Maximum relative change in the median of numeric values vs 1 day ago anomaly This check detects that the median of numeric values has changed more than max_percent from the median value measured one day ago (yesterday).
median_change_7_days Maximum relative change in the median of numeric values vs 7 days ago anomaly This check detects that the median of numeric values has changed more than max_percent from the median value measured seven days ago. This check aims to overcome weekly seasonality and compare Mondays to Mondays, Tuesdays to Tuesdays, etc.
median_change_30_days Maximum relative change in the median of numeric values vs 30 days ago anomaly This check detects that the median of numeric values has changed more than max_percent from the median value measured thirty days ago. This check aims to overcome monthly seasonality and compare a value to a similar value a month ago.
sum_change Maximum relative change in the sum of numeric values since the last known value anomaly This check detects that the sum of numeric values has changed more than max_percent from the last measured sum.
sum_change_1_day Maximum relative change in the sum of numeric values vs 1 day ago anomaly This check detects that the sum of numeric values has changed more than max_percent from the sum measured one day ago (yesterday).
sum_change_7_days Maximum relative change in the sum of numeric values vs 7 days ago anomaly This check detects that the sum of numeric values has changed more than max_percent from the sum measured seven days ago. This check aims to overcome weekly seasonality and compare Mondays to Mondays, Tuesdays to Tuesdays, etc.
sum_change_30_days Maximum relative change in the sum of numeric values vs 30 days ago anomaly This check detects that the sum of numeric values has changed more than max_percent from the sum measured thirty days ago. This check aims to overcome monthly seasonality and compare a value to a similar value a month ago.
detected_datatype_in_text Detect and verify the data type of all values in a text column datatype A column-level check that scans all values in a text column and detects the data type of all values in a monitored column. The actual_value returned from the sensor can be one of eight codes: 1 - integers, 2 - floats, 3 - dates, 4 - datetimes, 5 - timestamps, 6 - booleans, 7 - strings, 8 - mixed data types. The check compares the data type detected in all non-null values to an expected data type. The rule compares the value using equals and requires values in the range 1..8, which are the codes of detected data types.
detected_datatype_in_text_changed Detect a change of the data type of all values in a text column datatype A column-level check that scans all values in a text column, detects the data type of all values, and detects when the data type changes. The actual_value returned from the sensor can be one of eight codes: 1 - integers, 2 - floats, 3 - dates, 4 - datetimes, 5 - timestamps, 6 - booleans, 7 - strings, 8 - mixed data types. The check compares the data type detected during the current run to the last known data type detected during a previous run. For daily monitoring checks, it compares the value to yesterday's value (or an earlier date). For partitioned checks, it compares the current data type to the data type in the previous daily or monthly partition. The last partition with data is used for comparison.
column_type_changed Verify if the column data type has changed schema A column-level check that detects if the data type of the column has changed since the last retrieval. This check calculates the hash of all the components of the column's data type: the data type name, length, scale, precision and nullability. A data quality issue will be detected if the hash of the column's data types has changed.
distinct_count_anomaly Abnormal change in the number of distinct values. Measured as a percentile of anomalous measures. uniqueness This check monitors the count of distinct values and detects anomalies in the changes of the distinct count. It monitors a 90-day time window. The check is configured by setting a desired percentage of anomalies to identify as data quality issues.
distinct_percent_anomaly Abnormal change in the percentage of distinct values. Measured as a percentile of anomalous measures. uniqueness This check monitors the percentage of distinct values and detects anomalies in the changes in this percentage. It monitors a 90-day time window. The check is configured by setting a desired percentage of anomalies to identify as data quality issues.
distinct_count_change Maximum relative change in the count of distinct values since the last known value uniqueness This check monitors the count of distinct values and compares it to the last known value. It raises a data quality issue when the change exceeds an accepted threshold.
distinct_count_change_1_day Maximum relative change in the count of distinct values vs 1 day ago uniqueness This check monitors the count of distinct values and compares it to the measure from the previous day. It raises a data quality issue when the change exceeds an accepted threshold.
distinct_count_change_7_days Maximum relative change in the count of distinct values vs 7 days ago uniqueness This check monitors the count of distinct values and compares it to the measure seven days ago to overcome the weekly seasonality impact. It raises a data quality issue when the change exceeds an accepted threshold.
distinct_count_change_30_days Maximum relative change in the count of distinct values vs 30 days ago uniqueness This check monitors the count of distinct values and compares it to the measure thirty days ago to overcome the monthly seasonality impact. It raises a data quality issue when the change exceeds an accepted threshold.
distinct_percent_change Maximum relative change in the percentage of distinct values since the last known value uniqueness This check monitors the percentage of distinct values and compares it to the last known value. It raises a data quality issue when the change exceeds an accepted threshold.
distinct_percent_change_1_day Maximum relative change in the percentage of distinct values vs 1 day ago uniqueness This check monitors the percentage of distinct values and compares it to the measure from the previous day. It raises a data quality issue when the change exceeds an accepted threshold.
distinct_percent_change_7_days Maximum relative change in the percentage of distinct values vs 7 days ago uniqueness This check monitors the percentage of distinct values and compares it to the measure seven days ago to overcome the weekly seasonality impact. It raises a data quality issue when the change exceeds an accepted threshold.
distinct_percent_change_30_days Maximum relative change in the percentage of distinct values vs 30 days ago uniqueness This check monitors the percentage of distinct values and compares it to the measure thirty days ago to overcome the monthly seasonality impact. It raises a data quality issue when the change exceeds an accepted threshold.
text_surrounded_by_whitespace_found Maximum count of text surrounded by whitespace characters whitespace This check detects text values that contain additional whitespace characters before or after the text. This check counts text values surrounded by whitespace characters (on any side) and raises a data quality issue when their count exceeds a max_count parameter value. Whitespace-surrounded texts should be trimmed before loading to another table.
text_surrounded_by_whitespace_percent Maximum percentage of text surrounded by whitespace characters whitespace This check detects text values that contain additional whitespace characters before or after the text. This check measures the percentage of text values surrounded by whitespace characters (on any side) and raises a data quality issue when their rate exceeds a max_percent parameter value.

Data Integrity

Data integrity definition

The following table provides an overview of the integrity data quality dimension.

Dimension Description How DQOps measures data integrity
Integrity The data integrity dimension ensures that relational data is structurally valid and no keys are missing. Traditional OLTP databases such as Oracle or SQL Server can enforce data integrity with foreign key constraints.
This simple and reliable method is unavailable for non-transactional data platforms, such as data lakes that store data in S3 or similar buckets.
DQOps measures data integrity by running value lookup joins that join the tested table and a reference table using a left outer join, in order to identify values that have no matching record in the reference table.
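
The sketch below shows the general shape of such a lookup: a left outer join between a tested table and a dictionary table, counting the rows whose key has no match. The table and column names (sales.orders, ref.countries, country_code) are hypothetical; DQOps builds the actual queries for the lookup_key_not_found and lookup_key_found_percent checks.

```sql
-- Minimal integrity lookup: count rows whose country_code is missing
-- from the reference (dictionary) table.
SELECT
    COUNT(*) AS analyzed_row_count,
    SUM(CASE WHEN dict.country_code IS NULL THEN 1 ELSE 0 END) AS missing_key_count
FROM sales.orders AS fact
LEFT OUTER JOIN ref.countries AS dict
    ON fact.country_code = dict.country_code;
```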

Column integrity checks

The following table lists data quality checks that detect integrity issues on columns.

Data quality check name Friendly name Check category Description Standard check
lookup_key_not_found Maximum count of rows containing values not found in a reference table (foreign key lookup) integrity This check detects invalid values that are not present in a dictionary table. The lookup uses an outer join query within the same database. This check counts the number of values not found in the dictionary table. It raises a data quality issue when too many missing keys are discovered.
lookup_key_found_percent Minimum percentage of rows containing values found in a reference table (foreign key lookup) integrity This check detects invalid values that are not present in a dictionary table. The lookup uses an outer join query within the same database. This check measures the percentage of valid keys found in the dictionary table. It raises a data quality issue when the percentage of valid keys is below a minimum accepted threshold.

Data Reasonableness

Data reasonableness definition

The following table provides an overview of the reasonableness data quality dimension.

Dimension Description How DQOps measures data reasonableness
Reasonableness The data reasonableness dimension verifies if the data makes sense and the values are within reasonable ranges. DQOps uses data quality range checks to assess the reasonableness of data.
For example, you can validate that the prices in the product table are within reasonable ranges.
You can validate the length of text fields or aggregated values such as a sum, mean, median, or percentile.
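
As a simple illustration of a range check on an aggregate, the sketch below verifies that the total daily revenue stays within an accepted range, similar in spirit to the sum_in_range check. The table name sales.orders, the revenue and order_date columns, and the 10,000 to 1,000,000 range are hypothetical.

```sql
-- Minimal reasonableness sketch: flag days whose total revenue falls
-- outside an accepted range.
SELECT
    CAST(order_date AS DATE) AS day,
    SUM(revenue)             AS total_revenue,
    CASE WHEN SUM(revenue) BETWEEN 10000 AND 1000000
         THEN 'reasonable'
         ELSE 'out of range'
    END AS assessment
FROM sales.orders
GROUP BY CAST(order_date AS DATE)
ORDER BY day;
```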

Table reasonableness checks

The following table lists data quality checks that detect reasonableness issues on tables.

Data quality check name Friendly name Check category Description Standard check
sql_aggregate_expression_on_table Custom aggregated SQL expression within range custom_sql A table-level check that calculates a given SQL aggregate expression on a table and verifies if the value is within a range of accepted values.

Column reasonableness checks

The following table lists data quality checks that detect reasonableness issues on columns.

Data quality check name Friendly name Check category Description Standard check
expected_text_values_in_use_count Maximum number of expected text values that are not in use accepted_values A column-level check that counts unique values in a text column and counts how many values out of a list of expected string values were found in the column. The check raises a data quality issue when the threshold for the maximum number of missing values has been exceeded (too many expected values were not found in the column). This check is useful for analyzing columns with a low number of unique values, such as status codes, to detect whether all status codes are used in any row.
expected_texts_in_top_values_count Verify that the most popular text values match the list of expected values accepted_values A column-level check that counts how many expected text values are among the TOP most popular values in the column. The check will first count the number of occurrences of each column's value and will pick the TOP X most popular values (configurable by the 'top' parameter). Then, it will compare the list of most popular values to the given list of expected values that should be most popular. This check will verify how many supposed most popular values (provided in the 'expected_values' list) were not found in the top X most popular values in the column. This check is helpful in analyzing string columns with frequently occurring values, such as country codes for countries with the most customers.
expected_numbers_in_use_count Maximum number of expected numeric values that are not in use accepted_values A column-level check that counts unique values in a numeric column and counts how many values out of a list of expected numeric values were found in the column. The check raises a data quality issue when the threshold for the maximum number of missing values has been exceeded (too many expected values were not found in the column). This check is useful for analyzing columns with a low number of unique values, such as status codes, to detect whether all status codes are used in any row.
true_percent Percentage of true values in bool column bool This check measures the percentage of true values in a boolean column. It raises a data quality issue when the measured percentage is outside the accepted range.
false_percent Percentage of false values in bool column bool This check measures the percentage of false values in a boolean column. It raises a data quality issue when the measured percentage is outside the accepted range.
sql_aggregate_expression_on_column Custom aggregated SQL expression within range custom_sql A column-level check that calculates a given SQL aggregate expression on a column and verifies if the value is within a range of accepted values.
number_in_range_percent Maximum percentage of rows containing numeric values in the range numeric This check verifies that values in a numeric column are within an accepted range. It measures the percentage of values within the valid range and raises a data quality issue when the rate of valid values is below a minimum accepted percentage.
integer_in_range_percent Maximum percentage of rows containing integer values in the range numeric This check verifies that numeric values are within a range of accepted values. It measures the percentage of values in the range and raises a data quality issue when the percentage of valid values is below an accepted rate.
min_in_range Minimum numeric value is in the range numeric This check finds a minimum value in a numeric column. It verifies that the minimum value is within the range of accepted values and raises a data quality issue when it is not within a valid range.
max_in_range Maximum numeric value is in the range numeric This check finds a maximum value in a numeric column. It verifies that the maximum value is within the range of accepted values and raises a data quality issue when it is not within a valid range.
sum_in_range Sum of numeric values is in the range numeric This check calculates a sum of numeric values. It verifies that the sum is within the range of accepted values and raises a data quality issue when it is not within a valid range.
mean_in_range Mean (average) of numeric values is in the range numeric This check calculates a mean (average) value in a numeric column. It verifies that the average value is within the range of accepted values and raises a data quality issue when it is not within a valid range.
median_in_range Median of numeric values is in the range numeric This check finds a median value in a numeric column. It verifies that the median value is within the range of accepted values and raises a data quality issue when it is not within a valid range.
percentile_in_range The selected percentile of numeric values is in the range numeric This check finds a requested percentile value of numeric values. The percentile is configured as a value in the range [0, 1]. This check verifies that the given percentile is within the range of accepted values and raises a data quality issue when it is not within a valid range.
percentile_10_in_range The selected 10th percentile of numeric values is in the range numeric This check finds the 10th percentile value in a numeric column. The 10th percentile is a value greater than 10% of the smallest values and smaller than the remaining 90% of other values. This check verifies that the 10th percentile is within the range of accepted values and raises a data quality issue when it is not within a valid range.
percentile_25_in_range The selected 25th percentile of numeric values is in the range numeric This check finds the 25th percentile value in a numeric column. The 25th percentile is a value greater than 25% of the smallest values and smaller than the remaining 75% of other values. This check verifies that the 25th percentile is within the range of accepted values and raises a data quality issue when it is not within a valid range.
percentile_75_in_range The selected 75th percentile of numeric values is in the range numeric This check finds the 75th percentile value in a numeric column. The 75th percentile is a value greater than 75% of the smallest values and smaller than the remaining 25% of other values. This check verifies that the 75th percentile is within the range of accepted values and raises a data quality issue when it is not within a valid range.
percentile_90_in_range The selected 90th percentile of numeric values is in the range numeric This check finds the 90th percentile value in a numeric column. The 90th percentile is a value greater than 90% of the smallest values and smaller than the remaining 10% of other values. This check verifies that the 90th percentile is within the range of accepted values and raises a data quality issue when it is not within a valid range.
sample_stddev_in_range Standard deviation of numeric values is in the range numeric This check calculates the standard deviation of numeric values. It verifies that the standard deviation is within the range of accepted values and raises a data quality issue when it is not within a valid range.
population_stddev_in_range Population standard deviation of numeric values is in the range numeric This check calculates the population standard deviation of numeric values. It verifies that the population standard deviation is within the range of accepted values and raises a data quality issue when it is not within a valid range.
sample_variance_in_range Variance of numeric values is in the range numeric This check calculates a sample variance of numeric values. It verifies that the sample variance is within the range of accepted values and raises a data quality issue when it is not within a valid range.
population_variance_in_range Population variance of numeric values is in the range numeric This check calculates a population variance of numeric values. It verifies that the population variance is within the range of accepted values and raises a data quality issue when it is not within a valid range.
text_min_length Verify that the minimum length of the text column is in the range text This check finds the length of the shortest text in a column. DQOps validates the shortest length using a range rule. DQOps raises an issue when the minimum text length is outside a range of accepted values.
text_max_length Verify that the maximum length of the text is in the range text This check finds the length of the longest text in a column. DQOps validates the maximum length using a range rule. DQOps raises an issue when the maximum text length is outside a range of accepted values.
text_mean_length Verify that the mean length of the text is in the range text This check calculates the average text length in a column. DQOps validates the mean length using a range rule. DQOps raises an issue when the mean text length is outside a range of accepted values.
text_length_below_min_length Find text values shorter than the minimum accepted length text This check finds texts that are shorter than the minimum accepted text length. It counts the number of texts that are too short and raises a data quality issue when too many invalid texts are found.
text_length_below_min_length_percent Measure the percentage of rows containing text values shorter than the minimum accepted length text This check finds texts that are shorter than the minimum accepted text length. It measures the percentage of too short texts and raises a data quality issue when too many invalid texts are found.
text_length_above_max_length Find text values longer than the maximum accepted length text This check finds texts that are longer than the maximum accepted text length. It counts the number of texts that are too long and raises a data quality issue when too many invalid texts are found.
text_length_above_max_length_percent Measure the percentage of rows containing text values longer than the maximum accepted length text This check finds texts that are longer than the maximum accepted text length. It measures the percentage of texts that are too long and raises a data quality issue when too many invalid texts are found.
text_length_in_range_percent Measure the percentage of rows containing text values in the expected range text This check verifies that the minimum and maximum lengths of text values are in the range of accepted values. It measures the percentage of texts with a valid length and raises a data quality issue when an insufficient number of texts have a valid length.
min_word_count Verify that the minimum word count of the text column is in the range text This check finds the lowest word count among the text values in a column. DQOps validates the minimum word count using a range rule. DQOps raises an issue when the minimum word count is outside a range of accepted values.
max_word_count Verify that the maximum word count of the text is in the range text This check finds the highest word count among the text values in a column. DQOps validates the maximum word count using a range rule. DQOps raises an issue when the maximum word count is outside a range of accepted values.
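The percentile checks listed above can be illustrated with plain SQL. The sketch below is a minimal, hypothetical example (the sales table, the price column, and the 100...500 range are made up, and PERCENTILE_CONT syntax is PostgreSQL-style); it only approximates the logic of a percentile_90_in_range check, because the actual query DQOps generates depends on the monitored database dialect.

```sql
-- Hypothetical example: check that the 90th percentile of the price column
-- stays within an accepted range of 100 ... 500.
SELECT
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY price) AS percentile_90,
    CASE
        WHEN PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY price) BETWEEN 100 AND 500
            THEN 'passed'
        ELSE 'data quality issue'
    END AS check_result
FROM sales;
```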

Data Timeliness

Data timeliness definition

The following table provides an overview of the timeliness data quality dimension.

Dimension Description How DQOps measures data timeliness
Timeliness The data timeliness dimension verifies if the data source contains fresh data and if the data processing lag is acceptable. DQOps captures the most recent timestamps from the monitored table and measures the time between now and the timestamp of the most recent record.
The data timeliness dimension is divided into:
- data freshness tracks how old the data is
- data staleness tracks when the data was loaded for the last time
- ingestion delay measures the time between the most recent record and the time when it was loaded, showing the time consumed by data processing.
Check out the data timeliness formulas to see how they differ; a simplified SQL sketch is also shown below.
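The following sketch shows how the three timeliness measures could be expressed in plain SQL. The orders table and the created_at (event timestamp) and loaded_at (ingestion timestamp) columns are hypothetical stand-ins for the configured timestamp columns, and the timestamp arithmetic is PostgreSQL-style; DQOps renders dialect-specific sensor queries from the timestamp_columns configuration.

```sql
-- Hypothetical example: the three timeliness measures computed from two timestamp columns.
SELECT
    CURRENT_TIMESTAMP - MAX(created_at) AS data_freshness,   -- age of the newest record
    CURRENT_TIMESTAMP - MAX(loaded_at)  AS data_staleness,    -- time since the last load
    MAX(loaded_at) - MAX(created_at)    AS ingestion_delay    -- time spent in the pipeline
FROM orders;
```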

Table timeliness checks

The following table lists data quality checks that detect timeliness issues on tables.

Data quality check name Friendly name Check category Description Standard check
data_freshness Data freshness (Maximum age of the most recent row) timeliness A table-level check that calculates the time difference between the most recent row in the table and the current time. The timestamp column that is used for comparison is defined as the timestamp_columns.event_timestamp_column on the table configuration. This check is also known as "Data Freshness".
data_freshness_anomaly Data freshness anomaly (Abnormal delay in data delivery) timeliness This check calculates the time difference between the most recent row in the table and the current time, and detects anomalies in the time series of previously measured delays. The timestamp column that is used for comparison is defined as the timestamp_columns.event_timestamp_column on the table configuration. It raises a data quality issue when the current delay is in the top anomaly_percent percentage of the most outstanding values in the time series. This data quality check uses a 90-day time window and requires a history of at least 30 days.
data_staleness Data staleness (Maximum number of days since the recent ingestion) timeliness A table-level check that calculates the time difference between the last timestamp when any data was loaded into a table and the current time. This check can only be used when a data pipeline, ETL process, or trigger in the data warehouse fills an extra column with the timestamp when the data loading job was executed. The ingestion column used for comparison is defined as the timestamp_columns.ingestion_timestamp_column on the table configuration. This check is also known as "Data Staleness".
data_ingestion_delay Data ingestion delay (Maximum number of days between the last record has been created and loaded) timeliness A table-level check that calculates the time difference between the most recent row in the table and the most recent timestamp when the last row was loaded into the data warehouse or data lake. To identify the most recent row, the check finds the maximum value of the timestamp column that should contain the last modification timestamp from the source. The timestamp when the row was loaded is identified by the most recent (maximum) value of a timestamp column that was filled by the data pipeline, for example: "loaded_at", "updated_at", etc. This check requires that the data pipeline is filling an extra column with the timestamp when the data loading job has been executed. The names of both columns used for comparison should be specified in the "timestamp_columns" configuration entry on the table.
reload_lag Reload lag (Maximum delay to load the last record for each partition) timeliness A table-level check that calculates the maximum difference in days between ingestion timestamp and event timestamp values on any row. This check should be executed only as a partitioned check because it finds the longest delay between the time that a row was created in the data source and the timestamp when the row was loaded into its daily or monthly partition. This check detects that a daily or monthly partition was reloaded, because a reload also sets the most recent timestamps in the created_at, loaded_at, inserted_at, or other similar columns filled by the data pipeline or an ETL process during data loading.
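For partitioned checks such as reload_lag, the same idea is applied per daily partition. The sketch below is a simplified illustration reusing the hypothetical orders table with created_at and loaded_at columns; it is not the exact DQOps sensor query.

```sql
-- Hypothetical example: the longest creation-to-load delay for each daily partition.
SELECT
    CAST(created_at AS DATE)    AS daily_partition,
    MAX(loaded_at - created_at) AS reload_lag
FROM orders
GROUP BY CAST(created_at AS DATE)
ORDER BY daily_partition;
```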

Data Uniqueness

Data uniqueness definition

The following table provides an overview of the uniqueness data quality dimension.

Dimension Description How DQOps measures data uniqueness
Uniqueness The data uniqueness dimension verifies that no duplicate records exist in the dataset. DQOps measures the number and the percentage of distinct and duplicate values in a column.

Table uniqueness checks

The following table lists data quality checks that detect uniqueness issues on tables.

Data quality check name Friendly name Check category Description Standard check
duplicate_record_count Maximum count of duplicate records uniqueness This check counts duplicate records. It raises a data quality issue when the number of duplicates exceeds the accepted threshold. The default configuration detects any duplicate rows by setting the accepted count of duplicates to zero.
duplicate_record_percent Maximum percentage of duplicate records uniqueness This check measures the percentage of duplicate records. It raises a data quality issue when the percentage of duplicates exceeds the accepted threshold. The default threshold is 0% duplicate records.
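A duplicate-record check boils down to a grouping query over the compared columns. The sketch below is a hypothetical illustration (the customers table and its columns are made up); DQOps builds an equivalent, dialect-specific query automatically from the table metadata.

```sql
-- Hypothetical example: count how many extra (duplicate) records exist when rows
-- are compared on all columns of a made-up customers table.
SELECT COALESCE(SUM(extra_copies), 0) AS duplicate_record_count
FROM (
    SELECT COUNT(*) - 1 AS extra_copies
    FROM customers
    GROUP BY customer_id, first_name, last_name, email
    HAVING COUNT(*) > 1
) duplicated_groups;
```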

Column uniqueness checks

The following table lists data quality checks that detect uniqueness issues on columns.

Data quality check name Friendly name Check category Description Standard check
distinct_count Verify that the count of distinct values is in the expected range uniqueness This check counts distinct values and verifies that the distinct count is within an accepted range. It raises a data quality issue when the distinct count is outside the accepted range.
distinct_percent The minimum ratio of distinct values to the count of non null values (detect duplicate values) uniqueness This check measures the percentage of distinct values among all non-null values. It verifies that the percentage of distinct values is within the accepted minimum and maximum thresholds. The default value of 100% distinct values ensures the column has no duplicate values.
duplicate_count Maximum count of duplicate values uniqueness This check counts duplicate values. It raises a data quality issue when the number of duplicates exceeds the accepted threshold. The default configuration detects any duplicate values by setting the accepted count of duplicates to zero.
duplicate_percent Maximum percentage of duplicate values uniqueness This check measures the percentage of duplicate values in all non-null values. It raises a data quality issue when the percentage of duplicates is above an accepted threshold. The default threshold is 0% duplicate values.
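The column-level uniqueness checks compare distinct and total counts of non-null values. The sketch below illustrates the idea for distinct_percent and duplicate_count on a hypothetical email column; it is only an approximation, as the exact counting logic of the DQOps sensors may differ slightly per check and database.

```sql
-- Hypothetical example: distinct percentage and duplicate count for an email column.
-- COUNT(email) counts only non-null values, matching how the checks ignore nulls.
SELECT
    100.0 * COUNT(DISTINCT email) / NULLIF(COUNT(email), 0) AS distinct_percent,
    COUNT(email) - COUNT(DISTINCT email)                     AS duplicate_count
FROM customers;
```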

Data Validity

Data validity definition

The following table provides an overview of the validity data quality dimension.

Dimension Description How DQOps measures data validity
Validity The data validity dimension verifies that values conform to agreed data formats, and that low-cardinality columns (columns containing only known values) store only values from an accepted data dictionary. DQOps uses regular expressions to validate values that should match a pattern, such as an email or a phone number.
Columns that should contain only values from a data dictionary are validated with an IN SQL condition.
DQOps provides several built-in dictionaries, such as country and currency codes.
Additional dictionaries can be uploaded as CSV files.

Table validity checks

The following table lists data quality checks that detect validity issues on tables.

Data quality check name Friendly name Check category Description Standard check
sql_condition_failed_on_table Maximum count of rows that failed SQL conditions custom_sql A table-level check that uses a custom SQL expression on each row to verify (assert) that all rows pass a custom condition defined as an SQL condition. Use the {alias} token to reference the tested table. This data quality check can be used to compare columns on the same table. For example, the condition can verify that the value in the col_price column is higher than the col_tax column using an SQL expression: `{alias}.col_price > {alias}.col_tax`. Use an SQL expression that returns a true value for valid values and a false one for invalid values, because it is an assertion.
sql_condition_passed_percent_on_table Minimum percentage of rows that passed SQL condition custom_sql A table-level check that ensures that a minimum percentage of rows passed a custom SQL condition (expression). Measures the percentage of rows passing the condition. Raises a data quality issue when the percent of valid rows is below the min_percent parameter.
sql_invalid_record_count_on_table Custom SELECT SQL that returns invalid records custom_sql A table-level check that uses a custom SQL query that returns invalid records from the table. Use the {table} token to reference the tested table. This data quality check can be used to compare columns on the same table. For example, when this check is applied to an age column, the condition can find invalid records in which the age is lower than 18 using an SQL query: `SELECT age FROM {table} WHERE age < 18`.
import_custom_result_on_table Import custom data quality results on table custom_sql A table-level check that uses a custom SQL SELECT statement to retrieve a result of running a custom data quality check that was hardcoded in the data pipeline, and the result was stored in a separate table. The SQL query that is configured in this external data quality results importer must be a complete SELECT statement that queries a dedicated table (created by the data engineers) that stores the results of custom data quality checks. The SQL query must return a severity column with values: 0 - data quality check passed, 1 - warning issue, 2 - error severity issue, 3 - fatal severity issue.
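To make the token substitution used by sql_condition_failed_on_table more concrete, the sketch below shows one possible rendered form of the condition `{alias}.col_price > {alias}.col_tax` after the {alias} token is replaced. The sales_orders table and the analyzed_table alias are hypothetical, and the actual alias and counting logic used by DQOps may differ.

```sql
-- Hypothetical rendered form of the assertion; rows failing the condition are counted.
SELECT COUNT(*) AS failed_condition_count
FROM sales_orders analyzed_table
WHERE NOT (analyzed_table.col_price > analyzed_table.col_tax);
```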

Column validity checks

The following table lists data quality checks that detect validity issues on columns.

Data quality check name Friendly name Check category Description Standard check
text_found_in_set_percent Minimum percentage of rows having text values outside the list of expected values accepted_values A column-level check that calculates the percentage of rows for which the tested text column contains a value from a set of expected values. Rows with null values are also counted as passing values (the sensor assumes that a 'null' is also an expected and accepted value). The check raises a data quality issue when the percentage of rows with a non-null column value that is expected (one of the values in the expected_values set) is below the expected threshold. For example, 99% of rows should have values from the defined domain. This data quality check is useful for checking text columns that have a small number of unique values, and all the values should come from a set of expected values. For example, testing country, state, currency, gender, type, and department columns whose expected values are known.
number_found_in_set_percent Minimum percentage of rows having numeric values outside the list of expected values accepted_values A column-level check that calculates the percentage of rows for which the tested numeric column contains a value from a set of expected values. Rows with null values are also counted as passing values (the sensor assumes that a 'null' is also an expected and accepted value). The check raises a data quality issue when the percentage of rows with a non-null column value that is expected (one of the values in the expected_values set) is below the expected threshold. For example, 99% of rows should have values from the defined domain. This data quality check is useful for checking numeric columns that store numeric codes (such as status codes) to see if the only values found in the column are from the set of expected values.
text_valid_country_code_percent Minimum percentage of rows containing valid country codes accepted_values This check measures the percentage of text values that are valid two-letter country codes. It raises a data quality issue when the percentage of valid country codes (excluding null values) falls below a minimum accepted rate.
text_valid_currency_code_percent Minimum percentage of rows containing valid currency codes accepted_values This check measures the percentage of text values that are valid currency names. It raises a data quality issue when the percentage of valid currency names (excluding null values) falls below a minimum accepted rate.
text_parsable_to_boolean_percent Minimum percentage of rows containing a text parsable to a boolean value conversions Verifies that values in a text column are convertible to a boolean value. Texts are convertible to a boolean value when they are one of the well-known boolean placeholders: '0', '1', 'true', 'false', 'yes', 'no', 'y', 'n'. This check measures the percentage of valid values and raises a data quality issue when the percentage of valid values is below an accepted rate.
text_parsable_to_integer_percent Minimum percentage of rows containing a text parsable to integer conversions Verifies that values in a text column can be parsed and converted to an integer type. This check measures the percentage of valid values and raises a data quality issue when the percentage of valid values is below an accepted rate.
text_parsable_to_float_percent Minimum percentage of rows containing a text parsable to float conversions Verifies that values in a text column can be parsed and converted to a float (or numeric) type. This check measures the percentage of valid values and raises a data quality issue when the percentage of valid values is below an accepted rate.
text_parsable_to_date_percent Minimum percentage of rows containing a text parsable to date conversions Verifies that values in a text column can be parsed and converted to a date type. This check measures the percentage of valid values and raises a data quality issue when the percentage of valid values is below an accepted rate.
sql_condition_failed_on_column Maximum count of rows that failed SQL conditions custom_sql A column-level check that uses a custom SQL expression evaluated on each row to verify (assert) that all rows pass a custom condition defined as an SQL expression. Use the {alias} token to reference the tested table, and the {column} token to reference the column that is tested. This data quality check can be used to compare columns on the same table. For example, when this check is applied on a col_price column, the condition can verify that the col_price is higher than the col_tax using an SQL expression: `{alias}.{column} > {alias}.col_tax`. Use an SQL expression that returns a true value for valid values and false for invalid values, because it is an assertion.
sql_condition_passed_percent_on_column Minimum percentage of rows that passed SQL condition custom_sql A column-level check that ensures that a minimum percentage of rows passed a custom SQL condition (expression). Measures the percentage of rows passing the condition. Raises a data quality issue when the percent of valid rows is below the min_percent parameter.
sql_invalid_value_count_on_column Custom SELECT SQL that returns invalid values custom_sql A column-level check that uses a custom SQL query that returns invalid values from a column. This check can run dedicated testing queries or reuse existing (legacy) SQL queries that users already run in their own systems. Use the {table} token to reference the tested table, and the {column} token to reference the column that is tested. For example, when this check is applied to an age column, the query can find invalid values lower than 18 using an SQL query: `SELECT {column} FROM {table} WHERE {column} < 18`.
import_custom_result_on_column Import custom data quality results on column custom_sql A column-level check that uses a custom SQL SELECT statement to retrieve the result of running a custom data quality check on a column, hardcoded in the data pipeline. The result is retrieved by querying a separate logging table, whose schema is not fixed. The logging table should have columns that identify a table and a column for which they store custom data quality check results, and a severity column of the data quality issue. The SQL query that is configured in this external data quality results importer must be a complete SELECT statement that queries a dedicated logging table, created by the data engineering team.
date_values_in_future_percent Maximum percentage of rows containing dates in future datetime Detects dates in the future in date, datetime and timestamp columns. Measures a percentage of dates in the future. Raises a data quality issue when too many future dates are found.
date_in_range_percent Minimum percentage of rows containing dates within an expected range datetime Verifies that the dates in date, datetime, or timestamp columns are within a reasonable range of dates. The default configuration detects fake dates such as 1900-01-01 and 2099-12-31. Measures the percentage of valid dates and raises a data quality issue when too many invalid dates are found.
text_match_date_format_percent Minimum percentage of rows containing text values that match a date format datetime Verifies that the values in text columns match one of the predefined date formats, such as an ISO 8601 date. Measures the percentage of valid date strings and raises a data quality issue when too many invalid date strings are found.
number_below_min_value Find invalid numeric values below an expected minimum value numeric This check finds numeric values smaller than the minimum accepted value. It counts the values that are too small. This check raises a data quality issue when the count of too small values exceeds the maximum accepted count.
number_above_max_value Find invalid numeric values above an expected maximum value numeric This check finds numeric values bigger than the maximum accepted value. It counts the values that are too big. This check raises a data quality issue when the count of too big values exceeds the maximum accepted count.
negative_values Maximum count of negative values numeric This check finds and counts negative values in a numeric column. It raises a data quality issue when the count of negative values is above the maximum accepted count.
negative_values_percent Maximum percentage of negative values numeric This check finds negative values in a numeric column. It measures the percentage of negative values and raises a data quality issue when the rate of negative values exceeds the maximum accepted percentage.
number_below_min_value_percent Maximum percentage of numeric values below the minimum value numeric This check finds numeric values smaller than the minimum accepted value. It measures the percentage of values that are too small. This check raises a data quality issue when the percentage of values that are too small exceeds the maximum accepted percentage.
number_above_max_value_percent Maximum percentage of numeric values above the maximum value numeric This check finds numeric values bigger than the maximum accepted value. It measures the percentage of values that are too big. This check raises a data quality issue when the percentage of values that are too big exceeds the maximum accepted percentage.
invalid_latitude Maximum count of rows containing invalid latitude values numeric This check finds numeric values that are not valid latitude coordinates. A valid latitude coordinate is in the range -90...90. It counts the values outside a valid range for a latitude. This check raises a data quality issue when the count of invalid values exceeds the maximum accepted count.
valid_latitude_percent Minimum percentage of rows containing valid latitude values numeric This check verifies that numeric values are valid latitude coordinates. A valid latitude coordinate is in the range -90...90. It measures the percentage of values within a valid range for a latitude. This check raises a data quality issue when the rate of valid values is below the minimum accepted percentage.
invalid_longitude Maximum count of rows containing invalid longitude values numeric This check finds numeric values that are not valid longitude coordinates. A valid longitude coordinate is in the range -180...180. It counts the values outside a valid range for a longitude. This check raises a data quality issue when the count of invalid values exceeds the maximum accepted count.
valid_longitude_percent Minimum percentage of rows containing valid longitude values numeric This check verifies that numeric values are valid longitude coordinates. A valid longitude coordinate is in the range -180...180. It measures the percentage of values within a valid range for a longitude. This check raises a data quality issue when the rate of valid values is below the minimum accepted percentage.
non_negative_values Maximum count of rows containing non-negative values numeric This check finds and counts non-negative values in a numeric column. It raises a data quality issue when the count of non-negative values is above the maximum accepted count.
non_negative_values_percent Maximum percentage of rows containing non-negative values numeric This check finds non-negative values in a numeric column. It measures the percentage of non-negative values and raises a data quality issue when the rate of non-negative values exceeds the maximum accepted percentage.
text_not_matching_regex_found Maximum count of rows containing text values not matching regex patterns This check validates text values using a pattern defined as a regular expression. It counts the number of invalid values and raises a data quality issue when the number exceeds a threshold.
texts_not_matching_regex_percent Maximum percent of rows containing text values not matching regex patterns This check validates text values using a pattern defined as a regular expression. It measures the percentage of invalid values and raises a data quality issue when the rate is above a threshold.
invalid_email_format_found Maximum count of rows containing emails in invalid format patterns This check detects invalid email addresses in text columns using a regular expression. It counts the number of invalid emails and raises a data quality issue when the number is above a threshold.
invalid_email_format_percent Minimum percent of rows containing emails in invalid format patterns This check detects invalid email addresses in text columns using a regular expression. It calculates the percentage of invalid emails and raises a data quality issue when the percentage is above a threshold.
text_not_matching_date_pattern_found Maximum count of rows containing texts not matching an expected date pattern patterns This check detects dates in the wrong format inside text columns using a regular expression. It counts the number of incorrectly formatted dates and raises a data quality issue when the number exceeds a threshold.
text_not_matching_date_pattern_percent Maximum percentage of rows containing texts not matching an expected date pattern patterns This check validates the date format of dates stored in text columns. It measures the percentage of incorrectly formatted dates and raises a data quality issue when the rate is above a threshold.
text_not_matching_name_pattern_percent Maximum percentage of rows not containing texts that are names (e.g. numeric) patterns This check verifies if values stored in a text column contain only letters and are usable as literal identifiers. It measures the percentage of invalid literal identifiers and raises a data quality issue when the rate is above a threshold.
invalid_uuid_format_found Maximum count of rows containing invalid UUID values patterns This check detects invalid UUID identifiers in text columns using a regular expression. It counts the number of invalid UUIDs and raises a data quality issue when the number is above a threshold.
invalid_uuid_format_percent Maximum percentage of rows containing invalid UUID values patterns This check validates the format of UUID values in text columns. It measures the percentage of invalid UUIDs and raises a data quality issue when the rate is above a threshold.
invalid_ip4_address_format_found Maximum count of rows containing invalid IP4 address values patterns This check detects invalid IP4 internet addresses in text columns using a regular expression. It counts the number of invalid addresses and raises a data quality issue when the number is above a threshold.
invalid_ip6_address_format_found Maximum count of rows containing invalid IP6 address values patterns This check detects invalid IP6 internet addresses in text columns using a regular expression. It counts the number of invalid addresses and raises a data quality issue when the number is above a threshold.
invalid_usa_phone_format_found Maximum count of rows containing invalid USA phone number values patterns This check validates the format of USA phone numbers inside text columns. It counts the number of invalid phone numbers and raises a data quality issue when too many rows contain invalid phone numbers.
invalid_usa_zipcode_format_found Maximum count of rows containing invalid USA zip code values patterns This check validates the format of USA zip codes inside text columns. It counts the number of invalid zip codes and raises a data quality issue when the count is above a threshold.
invalid_usa_phone_format_percent Maximum percentage of rows containing invalid USA phone number values patterns This check validates the format of USA phone numbers inside text columns. It measures the percentage of rows containing invalid phone numbers and raises a data quality issue when the rate is above a threshold.
invalid_usa_zipcode_format_percent Maximum percentage of rows containing invalid USA zip code values patterns This check validates the format of USA zip codes inside text columns. It measures the percentage of rows containing invalid zip codes and raises a data quality issue when the rate is above a threshold.
contains_usa_phone_percent Detect USA phone numbers inside text columns pii This check detects USA phone numbers inside text columns. It measures the percentage of rows containing a phone number and raises a data quality issue when too many rows contain phone numbers.
contains_email_percent Detect emails inside text columns pii This check detects emails inside text columns. It measures the percentage of rows containing an email and raises a data quality issue when too many rows contain emails.
contains_usa_zipcode_percent Detect USA zip codes inside text columns pii This check detects USA zip codes inside text columns. It measures the percentage of rows containing a zip code and raises a data quality issue when too many rows contain zip codes.
contains_ip4_percent Detect IP4 addresses inside text columns pii This check detects IP4 addresses inside text columns. It measures the percentage of rows containing an IP4 address and raises a data quality issue when too many rows contain IP4 addresses.
contains_ip6_percent Detect IP6 addresses inside text columns pii This check detects IP6 addresses inside text columns. It measures the percentage of rows containing an IP6 address and raises a data quality issue when too many rows contain IP6 addresses.
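Many of the pattern checks above boil down to measuring the share of rows that fail a regular expression. The sketch below shows the idea for an invalid-email percentage using PostgreSQL regex syntax; the customers table, the email column, and the simplified pattern are hypothetical, and the exact expressions used by DQOps differ per database.

```sql
-- Hypothetical example (PostgreSQL regex syntax): percentage of rows whose non-null
-- email value does not match a simplified email pattern.
SELECT
    100.0 * COUNT(*) FILTER (
        WHERE email IS NOT NULL
          AND email !~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
    ) / NULLIF(COUNT(*), 0) AS invalid_email_format_percent
FROM customers;
```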

FAQ

The following knowledge base answers the most popular questions about data quality dimensions.

What are data quality dimensions?

Data quality dimensions are like a checklist for your data. They're a set of standards you use to see if your data is good enough to use. These dimensions include things like accuracy (is the data correct?), completeness (is any data missing?), consistency (is the same data stored the same way everywhere?), timeliness (is the data up-to-date?), and validity (does the data match the rules it should follow?). By checking your data against these dimensions, you can make sure it's reliable and useful for your needs.

What are data quality dimensions, with examples?

Data quality dimensions are standards like accuracy, completeness, and timeliness used to measure how good your data is. For example, completeness is measured by counting NULL values, timeliness by calculating the age of the data, and validity by checking if the values match patterns and ranges. DQOps provides a mapping of 150+ checks to dimensions, giving you a complete view of your data's health.

What are the 6 dimensions of data quality?

The 6 core dimensions of data quality are accuracy (data is correct), completeness (no data is missing), consistency (data is the same across all systems), timeliness (data is up-to-date), validity (data conforms to defined rules), and uniqueness (no duplicate data exists). These dimensions help ensure your data is reliable and usable.

Which dimension for measuring data quality means that the data conforms to a set of predefined standards and definitions such as type and format?

Validity is a data quality dimension that ensures data conforms to a set of predefined standards and definitions, such as data type, format, range, or business rules. For example, a valid email address must have an "@" symbol and a domain name, while a valid date must follow a specific format.

How to measure data quality?

Data quality is measured by testing your data with checks designed for each core dimension. These checks might count invalid records (like how many email addresses are formatted incorrectly) or measure how many rules your data passes.
The results of these checks give you a data quality score, which tells you how healthy your data is overall.
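As a simplified illustration, assuming check results were collected into a hypothetical check_results table with one row per executed check and a boolean passed column, an overall score could be computed as the share of passed checks. This is only a sketch; DQOps stores check results in its own format and derives its quality KPIs in a comparable way.

```sql
-- Hypothetical example: a data quality score as the percentage of passed checks,
-- assuming a made-up check_results table with a boolean "passed" column.
SELECT
    100.0 * SUM(CASE WHEN passed THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS data_quality_score
FROM check_results;
```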

What is data completeness?

Data completeness is a data quality dimension that measures if your dataset is complete. This means all required columns contain values, those values aren't empty (like blanks or NULLs), the dataset has a minimum acceptable size, and related fields are present (for example, if you have a "country" field, you should also have a "state" field for relevant entries).
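A minimal sketch of what such a completeness measurement can look like in SQL, assuming a hypothetical customers table with a required email column:

```sql
-- Hypothetical example: null percentage of a required column plus the overall row count.
SELECT
    100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS null_percent,
    COUNT(*) AS row_count  -- compare against a minimum accepted dataset size
FROM customers;
```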

What are data quality measures?

Data quality measures are metrics that quantify the quality of your data for different dimensions, like completeness and validity. They provide numeric scores (often expressed as percentages) that are easy to understand and communicate to others. These measures can be tracked over time, allowing you to quickly notice if the overall data quality is decreasing and identify potential issues.

How to measure data integrity?

Data integrity issues often pop up when you have related data in different tables. Imagine one table lists customer orders and another lists customer details. If an order references a customer that doesn't exist in the customer table, that's a data integrity problem. These problems are common in data lakes where the structure isn't as strict as in traditional databases. To catch these issues, you need special data quality checks that look for invalid references between tables.
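A sketch of such a check, assuming hypothetical orders and customers tables: a left join finds order rows whose customer_id has no matching entry in the customer table.

```sql
-- Hypothetical example: count orders that reference a non-existent customer.
SELECT COUNT(*) AS orphaned_order_count
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE o.customer_id IS NOT NULL
  AND c.customer_id IS NULL;
```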

What is data quality consistency?

Data quality consistency checks whether different copies of similar datasets match over time. This is important because data is often stored in multiple places, and you need to make sure those copies are the same. Consistency is tested with data reconciliation checks, which compare data across different systems or time periods. Additionally, modern data observability platforms can detect anomalies – unexpected changes in data patterns – which are another type of consistency issue.
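As a simplified example of a reconciliation check, assuming the same sales table is kept in a warehouse schema and a lake schema (all names are hypothetical), comparing row counts and a key aggregate reveals copies that have drifted apart:

```sql
-- Hypothetical example: compare row counts and total revenue between two copies
-- of the same sales table kept in a warehouse schema and a lake schema.
SELECT
    (SELECT COUNT(*)     FROM warehouse.sales) AS warehouse_row_count,
    (SELECT COUNT(*)     FROM lake.sales)      AS lake_row_count,
    (SELECT SUM(revenue) FROM warehouse.sales) AS warehouse_total_revenue,
    (SELECT SUM(revenue) FROM lake.sales)      AS lake_total_revenue;
```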

What are data quality parameters?

Data quality parameters are like settings that define what "good" data looks like for your needs. They're used to configure your expectations for data quality checks. There are two main types:

  • Data quality rule parameters set overall expectations for your data, like a minimum row count or a maximum number of allowed errors.
  • Data quality check parameters provide specific instructions for individual checks, like a list of valid country codes or a regular expression to validate email formats.

By using these parameters, you can fine-tune your data quality checks and ensure they're accurately measuring the health of your data.

What are the types of data quality checks?

Data quality check types provide a more detailed categorization of data quality checks than dimensions. They focus on specific data quality issues and the methods used to detect them. Some common types include:

  • Pattern matching: Verifying data conforms to expected formats (e.g., email addresses, phone numbers).
  • Null value checks: Counting the number of missing values in a dataset.
  • Dataset comparison: Reconciling data between different sources or time periods.
  • Value range checks: Ensuring data falls within acceptable boundaries.

DQOps provides detailed documentation for 21 distinct types of data quality checks, allowing you to address a wide range of potential data quality issues.

How to determine data quality?

You determine data quality by profiling your datasets and testing them with data quality checks. This process, called data quality assessment, helps you establish a data quality baseline, which is like an initial health check for your data.

What are data quality checks?

Data quality checks are tests that verify the health of your data. They make sure your dataset is error-free and meets your data quality requirements. For example, a healthy dataset should have no missing or incorrect values. These checks are usually run by data quality tools, which automate the process and provide detailed reports on your data's health.

What are examples of data quality consistency?

Data consistency ensures that different copies of similar datasets match. Data consistency is typically tested with data reconciliation checks, which compare data across different systems. Modern data observability platforms also support anomaly detection, which helps ensure data consistency over time by identifying unexpected changes or patterns.

What does the consistency dimension refer to when data quality is being measured?

The consistency dimension in data quality ensures that different copies of your data match. This means if you have the same data in multiple places, like a data warehouse and a data lake, they should be identical. It also means that data behaves consistently over time, without unexpected changes or anomalies. To test consistency, you can use data reconciliation to compare different copies of data and anomaly detection to identify unusual patterns in time series data.

How to ensure data quality dimensions measurement strategy management and governance?

Ensuring effective data quality dimension measurement requires a clear strategy and robust governance. This involves a few key steps:

  • Choose your dimensions: Determine which data quality dimensions are most critical for your needs. This might depend on the type of data you're working with and how it's being used.

  • Define your measurement method: Outline how you'll measure each dimension. This could involve using data quality tools, implementing data observability solutions, or a combination of approaches.

  • Establish data governance policies: Your data governance policies should clearly define which databases and dataset types must be measured for data quality compliance. This ensures consistency and accountability across your organization.

By combining a well-defined strategy with strong governance, you can ensure that your data quality measurement process is effective and sustainable.
