Last updated: July 22, 2025
How to detect referential integrity issues and missing keys, examples
Read this guide to learn how to detect referential integrity issues, such as missing keys in dictionary tables or wrong foreign keys.
The data quality checks that detect missing keys are configured in the integrity
category in DQOps.
What is referential integrity
Referential integrity is used in relational databases to describe a relation between two tables. One table is a dictionary of entries such as products, countries, and customers. Each entry is identified by a unique key, which is also a primary key of the table. The other table contains references to the dictionary entries by storing a key that identifies that entry, called a foreign key.
Traditional, transactional OLTP databases such as Oracle, SQL Server, PostgreSQL, or MySQL can enforce relational integrity in the storage engine, ensuring that storing invalid foreign key values is not possible.
The problem of broken referential integrity is specific to data warehouses and data lakes using unstructured or semi-structured data, such as Parquet files.
When referential integrity is broken
We can identify several sources of broken referential integrity.
-
Data that is stored in a data lake as Parquet or CSV files. The data lake cannot enforce foreign key constraints, allowing the upload of Parquet files containing invalid keys.
-
Data loaded or updated without an ongoing transaction.
-
Data loaded to a table that didn't have an active foreign key constraint.
-
Data received in the landing zone as flat files.
-
Invalid transformation code in the data pipeline.
-
The columns were in a different order in some source files, such as a country name and state code were reversed.
In all these cases, post-load validation using data quality checks is the last line of defense to detect referential integrity issues.
Example of referential integrity
The following example shows the first ten rows of a public dataset containing the number of COVID-19 cases for each country and date.
date | country_code | country_name | iso_3166_1_alpha_2 | iso_3166_1_alpha_3 | new_confirmed |
---|---|---|---|---|---|
2022-02-27 | AR | Argentina | AR | ARG | 1818 |
2022-01-23 | CO | Colombia | CO | COL | 91 |
2022-07-11 | CZ | Czech Republic | CZ | CZE | 2036 |
2022-05-12 | CZ | Czech Republic | CZ | CZE | 521 |
2022-02-05 | ES | Spain | ES | ESP | 27997 |
2022-03-19 | FI | Finland | FI | FIN | 0 |
2022-05-21 | FI | Finland | FI | FIN | 0 |
2022-05-28 | FI | Finland | FI | FIN | 0 |
2022-01-11 | FR | France | FR | FRA | 368149 |
2022-02-13 | FR | France | FR | FRA | 86562 |
The data was retrieved using the following query.
SELECT country_code, country_name, iso_3166_1_alpha_2, iso_3166_1_alpha_3, new_confirmed
FROM `bigquery-public-data`.`covid19_open_data`.`covid19_open_data`
WHERE `date` >= '2022-01-01'
LIMIT 10
The dictionary table with a list of all country codes is also public. We can retrieve the list of countries using the following query.
We aim to verify that all country codes stored in the country_code column of the COVID-19 cases table have a matching two-letter country code in the dictionary of all countries.
The list of the first ten countries is shown below.
country_name | alpha_2_code | alpha_3_code |
---|---|---|
Afghanistan | AF | AFG |
Albania | AL | ALB |
Algeria | DZ | DZA |
American Samoa | AS | ASM |
Andorra | AD | AND |
Angola | AO | AGO |
Anguilla | AI | AIA |
Antarctica | AQ | ATA |
Antigua and Barbuda | AG | ATG |
Argentina | AR | ARG |
... | ... | ... |
Missing keys
The following example shows a possible referential integrity issue we want to identify. The country_code column may contain a full country name instead of an ISO 3166 country code, as shown below.
date | country_code | country_name | iso_3166_1_alpha_2 | iso_3166_1_alpha_3 | new_confirmed |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
2022-01-11 | FR | France | FR | FRA | 368149 |
2022-02-13 | FRANCE <- unknown key | France | FR | FRA | 86562 |
Detecting missing keys
DQOps detects missing keys by running an SQL query that joins the tested table with the dictionary table using a left outer join. Both tables must be on the data source because the join is performed inside the analyzed database. DQOps does not pull the data from tested data sources to detect missing keys.
The following two data checks detect missing keys.
-
The lookup_key_not_found check detects missing keys and counts the number of records containing invalid values.
-
The lookup_key_found_percent check measures the percentage of rows containing missing keys.
If the dictionary table is not present in the same database, DQOps offers two other options to detect missing keys in the table.
-
Upload the list of keys to DQOps as a CSV file, stored as a data dictionary. Then, use the text_found_in_set_percent check described in the guide for validating accepted values.
-
Use the table comparison checks to reconcile the tested table and the dictionary table. Configure the grouping of both tables by the key column. It will be the foreign key column in the tested table and the primary key column in the dictionary table. Then, set up a comparison of a MAX value for both the key columns.
DQOps will run a query using a GROUP BYclause on both tables. The table reconciliation report will reveal missing keys.
Configure key lookup checks in UI
The referential integrity checks are present in the integrity category. The following example shows how to enable the lookup_key_not_found and lookup_key_found_percent checks from the data quality check editor.
The checks use two parameters to identify the dictionary table and the key column in that table.
-
The foreign_table parameter should be a fully qualified table name. For the purpose of this example, we are using the `bigquery-public-data`.`country_codes`.`country_codes` table.
-
The foreign_column parameter is the name of the primary key column in the dictionary table.
The lookup_key_not_found check also takes a max_count rule parameter, and the lookup_key_found_percent check takes a max_percent rule parameter.
Both data quality checks failed, showing that 1982 records use unknown country codes., which accounts for 0.0087% of the dataset. Our target of 100% rows containing valid country codes was missed by 0.0087%.
Key lookup checks error sampling in UI
To assist with identifying the root cause of errors and cleaning up the data, DQOps offers error sampling for this check. You can view representative examples of data that do not meet the specified data quality criteria by clicking on the Error sampling tab in the results section.
For additional information about error sampling, please refer to the Data Quality Error Sampling documentation.
Configure key lookup check in YAML
The configuration of the lookup_key_not_found check is straightforward. The most important parameters are highlighted.
The configuration of the lookup_key_found_percent check is similar.
Use cases
Name of the example | Description |
---|---|
Detect missing keys between different tables | This example shows how to check the referential integrity of a column against a column in another table using lookup_key_found_percent check. |
List of integrity checks at a column level
Data quality check name | Friendly name | Data quality dimension | Description | Standard check |
---|---|---|---|---|
lookup_key_not_found | Maximum count of rows containing values not found in a reference table (foreign key lookup) | Integrity | This check detects invalid values that are not present in a dictionary table. The lookup uses an outer join query within the same database. This check counts the number of values not found in the dictionary table. It raises a data quality issue when too many missing keys are discovered. | |
lookup_key_found_percent | Minimum percentage of rows containing values not found in a reference table (foreign key lookup) | Integrity | This check detects invalid values that are not present in a dictionary table. The lookup uses an outer join query within the same database. This check measures the percentage of valid keys found in the dictionary table. It raises a data quality issue when a percentage of valid keys is below a minimum accepted threshold. |
Reference and samples
The full list of all data quality checks in this category is located in the column/integrity reference. The reference section provides YAML code samples that are ready to copy-paste to the .dqotable.yaml files, the parameters reference, and samples of data source specific SQL queries generated by data quality sensors that are used by those checks.
What's next
- Learn how to run data quality checks filtering by a check category name
- Learn how to configure data quality checks and apply alerting rules
- Read the definition of data quality dimensions used by DQOps