Skip to content

Last updated: July 22, 2025

How to detect referential integrity issues and missing keys, examples

Read this guide to learn how to detect referential integrity issues, such as missing keys in dictionary tables or wrong foreign keys.

The data quality checks that detect missing keys are configured in the integrity category in DQOps.

What is referential integrity

Referential integrity is used in relational databases to describe a relation between two tables. One table is a dictionary of entries such as products, countries, and customers. Each entry is identified by a unique key, which is also a primary key of the table. The other table contains references to the dictionary entries by storing a key that identifies that entry, called a foreign key.

Traditional, transactional OLTP databases such as Oracle, SQL Server, PostgreSQL, or MySQL can enforce relational integrity in the storage engine, ensuring that storing invalid foreign key values is not possible.

The problem of broken referential integrity is specific to data warehouses and data lakes using unstructured or semi-structured data, such as Parquet files.

When referential integrity is broken

We can identify several sources of broken referential integrity.

  • Data that is stored in a data lake as Parquet or CSV files. The data lake cannot enforce foreign key constraints, allowing the upload of Parquet files containing invalid keys.

  • Data loaded or updated without an ongoing transaction.

  • Data loaded to a table that didn't have an active foreign key constraint.

  • Data received in the landing zone as flat files.

  • Invalid transformation code in the data pipeline.

  • The columns were in a different order in some source files, such as a country name and state code were reversed.

In all these cases, post-load validation using data quality checks is the last line of defense to detect referential integrity issues.

Example of referential integrity

The following example shows the first ten rows of a public dataset containing the number of COVID-19 cases for each country and date.

date country_code country_name iso_3166_1_alpha_2 iso_3166_1_alpha_3 new_confirmed
2022-02-27 AR Argentina AR ARG 1818
2022-01-23 CO Colombia CO COL 91
2022-07-11 CZ Czech Republic CZ CZE 2036
2022-05-12 CZ Czech Republic CZ CZE 521
2022-02-05 ES Spain ES ESP 27997
2022-03-19 FI Finland FI FIN 0
2022-05-21 FI Finland FI FIN 0
2022-05-28 FI Finland FI FIN 0
2022-01-11 FR France FR FRA 368149
2022-02-13 FR France FR FRA 86562

The data was retrieved using the following query.

SELECT country_code, country_name, iso_3166_1_alpha_2, iso_3166_1_alpha_3, new_confirmed
FROM `bigquery-public-data`.`covid19_open_data`.`covid19_open_data`
WHERE `date` >= '2022-01-01'
LIMIT 10

The dictionary table with a list of all country codes is also public. We can retrieve the list of countries using the following query.

SELECT * 
FROM `bigquery-public-data`.`country_codes`.`country_codes`
LIMIT 10

We aim to verify that all country codes stored in the country_code column of the COVID-19 cases table have a matching two-letter country code in the dictionary of all countries.

The list of the first ten countries is shown below.

country_name alpha_2_code alpha_3_code
Afghanistan AF AFG
Albania AL ALB
Algeria DZ DZA
American Samoa AS ASM
Andorra AD AND
Angola AO AGO
Anguilla AI AIA
Antarctica AQ ATA
Antigua and Barbuda AG ATG
Argentina AR ARG
... ... ...

Missing keys

The following example shows a possible referential integrity issue we want to identify. The country_code column may contain a full country name instead of an ISO 3166 country code, as shown below.

date country_code country_name iso_3166_1_alpha_2 iso_3166_1_alpha_3 new_confirmed
... ... ... ... ... ...
2022-01-11 FR France FR FRA 368149
2022-02-13 FRANCE <- unknown key France FR FRA 86562

Detecting missing keys

DQOps detects missing keys by running an SQL query that joins the tested table with the dictionary table using a left outer join. Both tables must be on the data source because the join is performed inside the analyzed database. DQOps does not pull the data from tested data sources to detect missing keys.

The following two data checks detect missing keys.

If the dictionary table is not present in the same database, DQOps offers two other options to detect missing keys in the table.

  • Upload the list of keys to DQOps as a CSV file, stored as a data dictionary. Then, use the text_found_in_set_percent check described in the guide for validating accepted values.

  • Use the table comparison checks to reconcile the tested table and the dictionary table. Configure the grouping of both tables by the key column. It will be the foreign key column in the tested table and the primary key column in the dictionary table. Then, set up a comparison of a MAX value for both the key columns.
    DQOps will run a query using a GROUP BY clause on both tables. The table reconciliation report will reveal missing keys.

Configure key lookup checks in UI

The referential integrity checks are present in the integrity category. The following example shows how to enable the lookup_key_not_found and lookup_key_found_percent checks from the data quality check editor.

The checks use two parameters to identify the dictionary table and the key column in that table.

  • The foreign_table parameter should be a fully qualified table name. For the purpose of this example, we are using the `bigquery-public-data`.`country_codes`.`country_codes` table.

  • The foreign_column parameter is the name of the primary key column in the dictionary table.

The lookup_key_not_found check also takes a max_count rule parameter, and the lookup_key_found_percent check takes a max_percent rule parameter.

Detecting missing keys using a lookup data quality check in DQOps

Both data quality checks failed, showing that 1982 records use unknown country codes., which accounts for 0.0087% of the dataset. Our target of 100% rows containing valid country codes was missed by 0.0087%.

Key lookup checks error sampling in UI

To assist with identifying the root cause of errors and cleaning up the data, DQOps offers error sampling for this check. You can view representative examples of data that do not meet the specified data quality criteria by clicking on the Error sampling tab in the results section.

Detecting missing keys using a lookup data quality check in DQOps - error sampling

For additional information about error sampling, please refer to the Data Quality Error Sampling documentation.

Configure key lookup check in YAML

The configuration of the lookup_key_not_found check is straightforward. The most important parameters are highlighted.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    country_code:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          integrity:
            daily_lookup_key_not_found:
              parameters:
                foreign_table: '`bigquery-public-data`.`country_codes`.`country_codes`'
                foreign_column: alpha_2_code
              error:
                max_count: 0

The configuration of the lookup_key_found_percent check is similar.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    country_code:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          integrity:
            daily_lookup_key_found_percent:
              parameters:
                foreign_table: '`bigquery-public-data`.`country_codes`.`country_codes`'
                foreign_column: alpha_2_code
              error:
                min_percent: 100.0

Use cases

Name of the example Description
Detect missing keys between different tables This example shows how to check the referential integrity of a column against a column in another table using lookup_key_found_percent check.

List of integrity checks at a column level

Data quality check name Friendly name Data quality dimension Description Standard check
lookup_key_not_found Maximum count of rows containing values not found in a reference table (foreign key lookup) Integrity This check detects invalid values that are not present in a dictionary table. The lookup uses an outer join query within the same database. This check counts the number of values not found in the dictionary table. It raises a data quality issue when too many missing keys are discovered.
lookup_key_found_percent Minimum percentage of rows containing values not found in a reference table (foreign key lookup) Integrity This check detects invalid values that are not present in a dictionary table. The lookup uses an outer join query within the same database. This check measures the percentage of valid keys found in the dictionary table. It raises a data quality issue when a percentage of valid keys is below a minimum accepted threshold.

Reference and samples

The full list of all data quality checks in this category is located in the column/integrity reference. The reference section provides YAML code samples that are ready to copy-paste to the .dqotable.yaml files, the parameters reference, and samples of data source specific SQL queries generated by data quality sensors that are used by those checks.

What's next