Data observability for Data Ingestion
Ensure that your data sources are pulled correctly
How often do you hear that data received from an external source is wrong, even though it was correct in the past?
Data Observability is a way to define Data Quality rules that monitor your ingestion tables: detect schema changes, data format changes, missing data, or inconsistent delays in data delivery.
Data format and ranges
Detect data format and data range issues in source data before the data pipeline fails on a transformation step.
Validity checks such as data format, not-null, data range, and uniqueness checks are defined for each source table. Data Quality checks are executed after the source data has been loaded into your ingestion tables, and any Data Quality issues are easy to understand. Did the order of columns in a CSV file change, so the data was loaded into the wrong columns? Data format or distinct column count checks will detect it.
- Define data format and data range rules for source data
- Run Data Quality checks at the end of the data ingestion pipeline
- Detect data format issues by detecting unexpected changes to distinct counts of column values
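The validity checks above can be sketched in a few lines of Python. This is a minimal illustration, not DQO.ai's implementation; the `orders` rows, column names, and thresholds are hypothetical.

```python
import re

# Hypothetical rows loaded into an ingestion table.
rows = [
    {"order_id": "1001", "email": "a@example.com", "amount": 25.0},
    {"order_id": "1002", "email": "not-an-email", "amount": -5.0},
    {"order_id": "1002", "email": "b@example.com", "amount": 30.0},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def run_validity_checks(rows):
    """Return (row_index, issue) pairs for format, range, and uniqueness failures."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row["order_id"] in seen_ids:            # uniqueness check
            issues.append((i, "duplicate order_id"))
        seen_ids.add(row["order_id"])
        if not EMAIL_RE.match(row["email"]):       # data format check
            issues.append((i, "invalid email format"))
        if not (0 <= row["amount"] <= 10_000):     # data range check
            issues.append((i, "amount out of range"))
    return issues

issues = run_validity_checks(rows)
```

Here the second row fails the format and range checks and the third row reuses an `order_id`, so three issues are reported against two rows.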
Schema change detection
Monitor unexpected changes to the table schema that will affect downstream tables. Find out that a column size was increased and some rows may not load into downstream tables.
DQO.ai captures the table schema when the table metadata is retrieved for the first time. The list of columns is hashed to detect schema changes. Additionally, column data types are compared every time the schema checks are executed.
- Detect column data type changes
- Detect column count changes, added or removed columns
- Detect a change to the order of columns that may break SELECT * queries
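The hash-the-column-list idea can be sketched as follows. This is an assumption-laden illustration of the technique, not DQO.ai's actual code; the table schemas are hypothetical.

```python
import hashlib

def schema_hash(columns):
    """Hash the ordered (name, type) pairs so any added, removed, reordered,
    or retyped column changes the hash."""
    canonical = "|".join(f"{name}:{dtype}" for name, dtype in columns)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def schema_changes(baseline, current):
    """Column-by-column diff, reported only when the hashes differ."""
    if schema_hash(baseline) == schema_hash(current):
        return []
    changes = []
    for (old_name, old_type), (new_name, new_type) in zip(baseline, current):
        if old_name != new_name:
            changes.append(f"column changed: {old_name} -> {new_name}")
        elif old_type != new_type:
            changes.append(f"{old_name}: type changed {old_type} -> {new_type}")
    if len(current) != len(baseline):
        changes.append(f"column count changed: {len(baseline)} -> {len(current)}")
    return changes

# A widened VARCHAR column is caught by the type comparison.
baseline = [("id", "BIGINT"), ("name", "VARCHAR(50)"), ("created_at", "TIMESTAMP")]
current = [("id", "BIGINT"), ("name", "VARCHAR(100)"), ("created_at", "TIMESTAMP")]
```

Comparing a single hash is cheap enough to run on every metadata refresh; the detailed column diff only runs when the hashes disagree.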
Data delays and stale tables
Analyze growth (row count increases) and data lag (the age of the most recent rows) to detect tables that have not been refreshed recently.
DQO.ai can analyze the timeliness of the data to measure the average data lag. When your rows do not have a timestamp column, DQO.ai can analyze row count growth by measuring the row count every day and learning the table's average growth rate.
- Detect tables that have not been refreshed recently
- Find out which tables are not updated frequently in their data sources (inconsistent refresh)
- Detect tables that receive less data than average
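Both timeliness signals can be sketched briefly. This is a simplified illustration under assumed data; the snapshots, timestamps, and the 50%-of-average staleness threshold are all hypothetical, not DQO.ai defaults.

```python
from datetime import datetime
from statistics import mean

# Lag check, when rows carry a timestamp: how old is the newest row?
most_recent_row = datetime(2023, 5, 1, 6, 0)
now = datetime(2023, 5, 2, 12, 0)
lag_hours = (now - most_recent_row).total_seconds() / 3600
delayed = lag_hours > 24  # flag if the newest data is more than a day old

# Growth check, when rows have no timestamp: daily row-count snapshots.
snapshots = [100_000, 105_200, 110_100, 115_300, 115_350]  # last day barely grew
daily_growth = [b - a for a, b in zip(snapshots, snapshots[1:])]
avg_growth = mean(daily_growth[:-1])   # learn the typical daily growth rate
latest_growth = daily_growth[-1]
stale = latest_growth < 0.5 * avg_growth  # flag growth far below average
```

With these numbers the table learned an average growth of 5,100 rows/day but gained only 50 rows in the last snapshot, so it is flagged as stale, and the 30-hour lag also trips the delay check.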
Ground Truth Checks
Compare the source data with other trusted sources to ensure that your database shows accurate data.
DQO.ai can run accuracy checks that compare a table with real-world reference data. Just define another table or a query that returns the same aggregated data (sum, count, etc.) grouped by the same business dimension. You can also compare the data with flat files loaded into the Data Quality database.
- Compare the data with real-world reference data
- Detect issues at a business relevant dimension (date, country, city, department, state, etc.)
- Ensure that you really trust the source data and can prove it
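An accuracy check of this kind reduces to comparing two aggregates grouped by the same dimension. The sketch below assumes hypothetical per-country revenue sums and a 1% tolerance; it illustrates the technique, not DQO.ai's API.

```python
# Hypothetical sums of revenue per country: ingested table vs. trusted source.
ingested = {"US": 12_500.0, "DE": 8_300.0, "FR": 4_100.0}
reference = {"US": 12_500.0, "DE": 8_300.0, "FR": 4_950.0}

TOLERANCE = 0.01  # allow up to 1% relative difference per dimension value

def accuracy_issues(ingested, reference, tolerance):
    """Return the dimension values whose aggregates disagree beyond tolerance."""
    issues = []
    for key in sorted(set(ingested) | set(reference)):
        ours = ingested.get(key, 0.0)
        theirs = reference.get(key, 0.0)
        # max(..., 1.0) avoids division-by-zero-style blowups for tiny aggregates
        if abs(ours - theirs) > tolerance * max(abs(theirs), 1.0):
            issues.append(key)
    return issues

mismatches = accuracy_issues(ingested, reference, TOLERANCE)
```

Grouping by a business dimension (date, country, department) localizes the discrepancy: here only the FR aggregate disagrees, which narrows the investigation to one slice of the data instead of the whole table.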