Data quality monitoring for Data Operations

Detect potential issues with data pipelines

How often does invalid data in one table spread to multiple downstream tables, leaving a full refresh as the only fix?

Data monitoring is the process of observing data quality metrics for all source and target tables. Detect data quality issues in the source tables and ensure that the data pipeline has generated a target table that meets the requirements.

Source Data Quality Rules

Monitor data quality for source data in one place. Detect problems and instability in data sources before they affect the entire data warehouse or data lake.

DQO stores the data quality definitions for tables as simple YAML files. All data quality rules for a source table can be edited in one place, with code completion in the most popular text editors. To monitor the quality of another similar table, just copy the data quality definition file and make small changes.

  • The data quality of source tables is easy to define
  • All data quality rules for all source tables may be defined in the same way
  • Adding new tables to be observed is as simple as copying a YAML file

All downstream tables are always correct

Detect data transformation issues in a data pipeline when it generates a target table that does not meet the data quality requirements.

Data quality metrics can also be defined for target tables. DQO will check daily or after each data load to ensure that the tables are not missing data and that they meet the requirements.

  • Detect data pipeline issues by monitoring target tables
  • Ensure daily that your target tables meet the requirements
  • Release your data pipelines with data quality rules monitored by DQO to make sure your pipelines are working as expected

Cross-checks across tables

Detect discrepancies between source and target tables to uncover unexpected issues in the data pipelines.

Define summary queries that extract the primary metrics of the source and target tables. Compare these metrics to detect discrepancies. The number of rows in the target table should not be less than the number of rows in the source table for each partition.

  • Compare summary metrics between related tables in the data lineage
  • Detect missing data at a partition level
  • Detect value mismatches between related tables by comparing additive (aggregable) columns
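
The partition-level comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DQO's implementation: the function name `find_partition_gaps` and the sample row counts are hypothetical, and the per-partition counts are assumed to come from summary queries that you run yourself.

```python
from typing import Dict, List


def find_partition_gaps(source_counts: Dict[str, int],
                        target_counts: Dict[str, int]) -> List[str]:
    """Return the partitions where the target has fewer rows than the source."""
    gaps = []
    for partition, source_rows in sorted(source_counts.items()):
        # A partition that is absent from the target counts as zero rows.
        target_rows = target_counts.get(partition, 0)
        if target_rows < source_rows:
            gaps.append(partition)
    return gaps


# Hypothetical row counts per daily partition (made up for this sketch).
source = {"2023-01-01": 1000, "2023-01-02": 1200, "2023-01-03": 900}
target = {"2023-01-01": 1000, "2023-01-02": 1150}

gaps = find_partition_gaps(source, target)
print(gaps)
```

Here the second partition lost rows and the third was never loaded, so both are reported. The same comparison could be applied to sums of additive columns instead of row counts.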

All data is up to date

Monitor data ingestion delay, data freshness and data staleness for all tables.

DQO comes with a verified set of timeliness checks to monitor the data freshness (the number of days since the most recent event timestamp), staleness (the time difference in days between the current date and the most recent data ingestion timestamp) and ingestion delay (the time difference in days between the most recent event timestamp and the most recent ingestion timestamp).

  • Detect tables that were not refreshed recently
  • Detect missing time ranges if an incremental data load missed a few days of data
  • Learn which tables are receiving updates inconsistently with a variable delay
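
The three timeliness metrics defined above reduce to simple timestamp arithmetic. The sketch below shows that arithmetic only; the function name `timeliness_in_days` and the sample timestamps are assumptions made for illustration, and DQO's own checks may compute the values differently.

```python
from datetime import datetime
from typing import Dict

SECONDS_PER_DAY = 86400.0


def timeliness_in_days(now: datetime,
                       latest_event: datetime,
                       latest_ingestion: datetime) -> Dict[str, float]:
    """Compute freshness, staleness and ingestion delay, all in days."""
    return {
        # Days since the most recent event timestamp.
        "freshness": (now - latest_event).total_seconds() / SECONDS_PER_DAY,
        # Days since the most recent ingestion timestamp.
        "staleness": (now - latest_ingestion).total_seconds() / SECONDS_PER_DAY,
        # Days between the most recent event and the most recent ingestion.
        "ingestion_delay": (latest_ingestion - latest_event).total_seconds() / SECONDS_PER_DAY,
    }


# Hypothetical timestamps for one table.
metrics = timeliness_in_days(
    now=datetime(2023, 1, 10, 0, 0),
    latest_event=datetime(2023, 1, 7, 0, 0),
    latest_ingestion=datetime(2023, 1, 8, 12, 0),
)
print(metrics)
```

With these inputs the table is 3 days behind on events, was last loaded 1.5 days ago, and the last load arrived 1.5 days after the newest event it contains.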

Database availability and response times

Verify that all tables have data and that the response time of typical queries meets the KPIs.

Define data quality checks for availability. DQO can run simple queries on the database and data lake to ensure that tables are present and populated. Define typical queries that you run from the dashboards to check the database response time.

  • Ensure that all tables are available
  • Check that tables are populated with data
  • Monitor database response times for popular queries to ensure real-time dashboards are responsive to users
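
An availability check of this kind can be as small as one timed query. The sketch below uses Python's built-in `sqlite3` module with an in-memory database so that it runs anywhere; the function name `check_table`, the table names and the 5-second threshold are hypothetical, and a real setup would point at your own database and KPI.

```python
import sqlite3
import time
from typing import Dict


def check_table(conn: sqlite3.Connection, table: str, max_seconds: float) -> Dict[str, bool]:
    """Run a simple row count and report availability, population and speed."""
    started = time.perf_counter()
    try:
        # The table name is interpolated directly, so pass trusted names only.
        row = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    except sqlite3.OperationalError:
        # SQLite raises OperationalError when the table does not exist.
        return {"available": False, "populated": False, "fast_enough": False}
    elapsed = time.perf_counter() - started
    return {
        "available": True,
        "populated": row[0] > 0,
        "fast_enough": elapsed <= max_seconds,
    }


# Demo database with one populated table and one empty table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("INSERT INTO orders VALUES (1)")
conn.execute("CREATE TABLE empty_table (id INTEGER)")

orders_status = check_table(conn, "orders", max_seconds=5.0)
empty_status = check_table(conn, "empty_table", max_seconds=5.0)
missing_status = check_table(conn, "no_such_table", max_seconds=5.0)
print(orders_status, empty_status, missing_status)
```

To measure dashboard responsiveness, the row count could be replaced with the actual query that the dashboard runs.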

Downstream tables never corrupted

Monitor the data quality of source tables to detect issues that may affect downstream tables if the pipelines are not stopped in time. It is better to stop the data pipeline than to run a full refresh later.

Dependencies between source and target tables are defined along with data quality rules. Your data pipeline can simply check whether there are any unresolved data quality issues before they corrupt the target table.

  • Data lineage defined with the quality rules
  • Get a list of downstream tables affected by data quality issues
  • Track issues across databases and deep data lineage trees
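
Listing the downstream tables affected by an issue amounts to walking the lineage graph. The sketch below assumes the lineage is available as a plain mapping from each table to its direct downstream tables; the table names and the function `affected_downstream` are hypothetical and do not reflect how DQO stores lineage.

```python
from collections import deque
from typing import Dict, Iterable, List


def affected_downstream(lineage: Dict[str, List[str]],
                        failing_tables: Iterable[str]) -> List[str]:
    """Breadth-first walk that collects every table fed by a failing table."""
    affected = set()
    queue = deque(failing_tables)
    while queue:
        table = queue.popleft()
        for child in lineage.get(table, []):
            if child not in affected:  # visit each table once, even in deep trees
                affected.add(child)
                queue.append(child)
    return sorted(affected)


# Hypothetical lineage: each key feeds the tables listed as its value.
lineage = {
    "raw.orders": ["stage.orders"],
    "stage.orders": ["dw.fact_sales", "dw.dim_customer"],
    "raw.products": ["dw.dim_product"],
    "dw.fact_sales": ["mart.daily_revenue"],
}

affected = affected_downstream(lineage, ["raw.orders"])
print(affected)
```

A pipeline step could run such a check before loading and skip the load whenever its target table appears in the result.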

Data discrepancies

Detect inconsistent behavior of source tables, which may indicate other problems.

The DQO data monitoring framework monitors table metrics such as data latency, daily row count changes and averages of additive columns (fact table measures), and watches for unexpected changes in them. Such changes may indicate missing partitions, or data loaded in the wrong order when the data pipeline skips some source data.

  • Detect possibly missing data by monitoring row counts
  • Detect problems by monitoring anomalies, such as a sharp drop in the number of rows
  • Learn about the dynamics of the source tables such as their growth rate
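
A sharp drop in the number of rows can be flagged by comparing each day's count with a recent baseline. The sketch below uses a trailing 7-day mean and a 50% drop threshold; both values, the function name `detect_sharp_drops` and the sample counts are assumptions chosen for illustration, not DQO defaults.

```python
from statistics import mean
from typing import List


def detect_sharp_drops(daily_counts: List[int],
                       window: int = 7,
                       max_drop: float = 0.5) -> List[int]:
    """Return indexes of days whose row count fell far below the trailing mean."""
    drops = []
    for i in range(window, len(daily_counts)):
        # Baseline is the mean of the `window` days before day i.
        baseline = mean(daily_counts[i - window:i])
        if baseline > 0 and daily_counts[i] < baseline * (1 - max_drop):
            drops.append(i)
    return drops


# Hypothetical daily row counts; the ninth day (index 8) loaded far fewer rows.
counts = [1000, 1020, 990, 1010, 1005, 995, 1015, 1000, 400, 1010]

drops = detect_sharp_drops(counts)
print(drops)
```

The first `window` days have no baseline and are never flagged, so the check needs at least a week of history before it can report anything.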