Data quality monitoring for the data lake
Bring Data Governance to the Data Lake
How many tables in your data lake have inconsistent data formats?
Define data quality rules for data in the data lake. Continuously monitor data quality metrics to detect discrepancies.
Data Quality Monitoring
Define data quality rules for external tables built on flat files (CSV, Parquet) to detect when files that violate those rules are loaded.
Define an external table partitioned by a date column and ingest files into a folder named after the current date. DQO data quality rules can then be executed per date partition to detect days with invalid source files, as in the sketch after the list below.
- Detect days with invalid files
- Detect days with files that do not match data format, uniqueness, nullability or range checks
- Detect missing days with completeness tests
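The following is a minimal sketch of this per-partition pattern, not the DQO engine itself: it assumes a local, Hive-style date-partitioned Parquet folder and uses DuckDB as the query engine; the path, the customer_id column, and the 5% threshold are hypothetical.

```python
# Per-day-partition null-rate check over a Hive-style layout such as
# data/events/date=2024-05-01/*.parquet (paths and threshold are assumptions).
import duckdb

LAKE_PATH = "data/events/*/*.parquet"  # assumed date-partitioned folder
MAX_NULL_PERCENT = 5.0                 # assumed data quality rule threshold

con = duckdb.connect()

# Group by the "date" partition column so every ingestion day gets its own score.
rows = con.execute(f"""
    SELECT
        "date" AS partition_date,
        COUNT(*) AS row_count,
        100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_percent
    FROM read_parquet('{LAKE_PATH}', hive_partitioning = true)
    GROUP BY "date"
    ORDER BY "date"
""").fetchall()

for partition_date, row_count, null_percent in rows:
    status = "PASS" if null_percent <= MAX_NULL_PERCENT else "FAIL"
    print(f"{partition_date}: rows={row_count}, null%={null_percent:.2f} -> {status}")
```

Days that report FAIL point at the folders whose source files should be inspected or reloaded.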
Unhealthy partitions
Detect partitions in a data lake that are corrupted by invalid files or unavailable HDFS nodes.
DQO runs full table scan queries on all partitions to detect unreadable files. Availability data quality checks executed for each partition detect unavailable partitions that must be repaired, as illustrated in the sketch after the list below.
- Detect partitions that are unavailable due to corrupted parquet files
- Detect tables and partitions whose files are stored on offline or corrupted HDFS nodes
- Make sure that a core set of tables in the data lake is always usable
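A minimal sketch of such an availability check, assuming a local date= partition layout and DuckDB as the scanning engine (this is an illustration, not the DQO implementation):

```python
# Force every partition to be opened and flag the ones that cannot be read.
import glob
import duckdb

PARTITIONS = sorted(glob.glob("data/events/date=*"))  # assumed partition layout
con = duckdb.connect()

unhealthy = []
for partition in PARTITIONS:
    try:
        # COUNT(*) makes DuckDB open every Parquet file in the partition;
        # corrupted or truncated files raise an error here.
        con.execute(f"SELECT COUNT(*) FROM read_parquet('{partition}/*.parquet')").fetchone()
    except Exception as error:
        unhealthy.append((partition, str(error)))

for partition, error in unhealthy:
    print(f"UNHEALTHY {partition}: {error}")
```

The list of unhealthy partitions is the repair backlog: those folders need to be reloaded or their storage nodes brought back online.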
Trusted Data Lake Tables
Identify tables in the data lake that are trusted and usable for analytics and data science by defining and checking data quality rules for these tables.
Define DQO data quality rules only for the external tables that are considered the source of truth. The rules document the quality requirements, and running them on a daily basis keeps the data trustworthy, as in the sketch after the list below.
- Document the data quality checks that are ensured for trustworthy tables
- Verify the data quality rules for important tables
- Let the data scientists and analysts use only verified tables
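A minimal sketch of documenting and running such rules on a daily schedule; the table paths, columns, and rules are hypothetical, and DuckDB stands in for the query engine:

```python
# Quality rules for "source of truth" tables; the rules double as documentation.
import duckdb

TRUSTED_TABLE_RULES = {
    "data/customers/*.parquet": [  # assumed trusted table
        ("customer_id is never null",
         "SELECT COUNT(*) = COUNT(customer_id) FROM read_parquet('{path}')"),
        ("customer_id is unique",
         "SELECT COUNT(*) = COUNT(DISTINCT customer_id) FROM read_parquet('{path}')"),
    ],
}

def run_daily_checks() -> bool:
    con = duckdb.connect()
    all_passed = True
    for path, rules in TRUSTED_TABLE_RULES.items():
        for description, query in rules:
            passed = bool(con.execute(query.format(path=path)).fetchone()[0])
            print(f"{path} | {description}: {'PASS' if passed else 'FAIL'}")
            all_passed = all_passed and passed
    return all_passed

if __name__ == "__main__":
    # Schedule this script once a day; a non-zero exit code means the table
    # should no longer be treated as trusted.
    raise SystemExit(0 if run_daily_checks() else 1)
```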
File format checks
Detect when files with a wrong format or missing columns are loaded into the data lake.
Define consistency checks that analyze the behavior and average values of key columns. An unexpected column count or an increase in the percentage of null values indicates that columns have been reordered or are missing from the new file; a sketch follows the list below.
- Detect missing columns in new files
- Detect when columns are reordered or missing in CSV files, which would cause new data to be loaded into the wrong columns
- Ensure that the external table always meets the data format and data range checks
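The following is a minimal sketch of a file format check; the file path, expected column list, baseline null rate, and tolerance are assumptions, and DuckDB is used only as an example engine:

```python
# Verify the column list of a newly loaded CSV and watch the null percentage of a
# key column for jumps that suggest reordered or missing columns.
import duckdb

NEW_FILE = "landing/orders_2024_05_01.csv"                              # assumed new file
EXPECTED_COLUMNS = ["order_id", "customer_id", "amount", "order_date"]  # assumed contract
BASELINE_NULL_PERCENT = 0.5                                             # assumed historical average for customer_id

con = duckdb.connect()

# 1. Column count and column order check: the header must match the contract exactly.
actual_columns = [row[0] for row in con.execute(
    f"DESCRIBE SELECT * FROM read_csv_auto('{NEW_FILE}')").fetchall()]
if actual_columns != EXPECTED_COLUMNS:
    print(f"FAIL: columns {actual_columns} do not match expected {EXPECTED_COLUMNS}")

# 2. Null-rate drift check: a sudden rise usually means values landed in the wrong column.
null_percent = con.execute(f"""
    SELECT 100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*)
    FROM read_csv_auto('{NEW_FILE}')
""").fetchone()[0]
if null_percent > BASELINE_NULL_PERCENT * 10:  # assumed tolerance
    print(f"FAIL: customer_id null rate jumped to {null_percent:.2f}%")
```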
Data Observability at Petabyte Scale
Monitor petabyte-scale tables by analyzing only new or modified data.
DQO was built with partitioning in mind. Analyze data per time partition, or build custom data quality checks that analyze only partitions with new data. Identify new data by reading data processing logs, as in the sketch after the list below.
- Observe data quality at a petabyte scale
- Analyze only new or modified data to avoid putting pressure on the data lake or incurring high query processing costs
- Use your custom logs as a source to identify modified partitions that should be analyzed
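A minimal sketch of log-driven incremental analysis; the log format, checkpoint, and paths are assumptions about how such custom logs might look:

```python
# Read a data processing log, find partitions modified since the last run,
# and execute quality checks only on those partitions.
import json
import duckdb

PROCESSING_LOG = "logs/ingestion_log.jsonl"  # assumed log lines: {"partition": "date=2024-05-01", "modified_at": "..."}
LAST_RUN_AT = "2024-05-01T00:00:00"          # assumed checkpoint saved by the scheduler

# Collect only the partitions touched since the previous data quality run
# (ISO-8601 timestamps compare correctly as strings).
modified_partitions = set()
with open(PROCESSING_LOG) as log:
    for line in log:
        entry = json.loads(line)
        if entry["modified_at"] > LAST_RUN_AT:
            modified_partitions.add(entry["partition"])

con = duckdb.connect()
for partition in sorted(modified_partitions):
    # Run checks against the changed partition only, never the whole petabyte-scale table.
    row_count = con.execute(
        f"SELECT COUNT(*) FROM read_parquet('data/events/{partition}/*.parquet')"
    ).fetchone()[0]
    print(f"{partition}: rows={row_count}")
```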