Last updated: July 22, 2025
How to Detect PII Data? Examples and Best Practices
Read this guide to learn how to detect the presence of Personally Identifiable Information, such as emails or phone numbers, in tables.
The data quality checks that detect PII values are configured in the pii category in DQOps.
What is PII data
Personally Identifiable Information (PII) is any information that permits the identity of the individual to whom it applies to be inferred, directly or indirectly.
Typical information that can reveal the identity of an individual includes:
- Phone number.
- Email address.
- IP address.
Additionally, an individual's identity can be inferred by combining multiple elements that are not identifying on their own, such as a ZIP code, with other attributes.
Accidental exposure of PII data can have severe consequences for an organization. Companies that operate in the European Union must comply with the GDPR. Exposing sensitive information, or simply sharing it with a third party without the individual's explicit consent, is considered a data leak. An organization that does not protect the personal data of its customers and employees can be fined up to EUR 20 million or 4% of its worldwide annual revenue, whichever is higher.
Even if an organization is not subject to the GDPR, exposing sensitive information by mistake may cost it the trust of its customers and business partners.
When sensitive data can leak
In the age of data mesh and data sharing across organizations, many data assets are shared without access controls. Tables that any user can query without restriction should not contain sensitive information.
The best way to ensure that no PII data is present is by running data quality checks that use patterns to find possible emails or phone numbers in columns that are not supposed to store these values. Personal information is often found in free-text columns, such as comments or descriptions.
It is advisable to scan the following types of tables and columns for possible exposure of PII data:
- Tables that are shared across the organization with unlimited access to run queries.
- Tables that are shared with external vendors, suppliers, or customers.
- Columns containing comments and descriptions.
- Columns that store information captured in free-form fields. Staff sometimes use these fields to store additional comments.
How DQOps finds PII data
DQOps supports several dedicated data quality checks that search for patterns inside text fields.
The following table shows the possible content of a comments column with sensitive information.
Comment |
---|
The customer requested to be notified by email john.smith@gmail.com when the package is shipped. |
The customer left his private phone number for the courier: 123-456-7890 |
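The kind of pattern search involved can be sketched in Python against the two sample comments above. The regular expressions and the `find_pii` helper are simplified assumptions made for illustration. They are not the expressions that DQOps runs on the data source, and they will miss or over-match some real-world formats.

```python
import re

# Simplified, illustrative patterns (assumptions, not the DQOps expressions).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Ten digits grouped 3-3-4, with optional parentheses, spaces, dots, or dashes.
USA_PHONE_PATTERN = re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)")


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every email-like and US-phone-like substring found in text."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "usa_phones": USA_PHONE_PATTERN.findall(text),
    }


comments = [
    "The customer requested to be notified by email john.smith@gmail.com "
    "when the package is shipped.",
    "The customer left his private phone number for the courier: 123-456-7890",
]

for comment in comments:
    print(find_pii(comment))
```

The first comment yields one email match and the second yields one phone match, which is the signal a PII check would turn into a percentage of affected rows.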
Data profiling
The selection of columns to monitor for sensitive data begins on the column's profiling screen. The following screenshot shows samples from a public dataset of 311 service requests in Austin. We are looking at the complaint_description column, a free-form field that could contain sensitive data.
So far, the data samples do not show anything sensitive, but we will verify this later with a data quality check.
The next sample comes from the incident_address column. We can instantly notice addresses and phone numbers. Please be aware that this table is public, and we do not see any visible proof of anonymization.
Enabling PII detection in DQOps
DQOps contains several data quality checks that run SQL queries to identify the most common PII values.
- contains_usa_phone_percent check detects US phone numbers.
- contains_email_percent check detects emails.
- contains_usa_zipcode_percent detects US zip codes.
- contains_ip4_percent detects IP4 internet addresses.
- contains_ip6_percent detects common forms of IP6 internet addresses.
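To make the last two checks in the list more concrete, the following Python sketch finds IP addresses embedded in free text. It deliberately swaps the technique: instead of a regular expression in SQL, it splits the text on whitespace and lets the standard library `ipaddress` module decide whether a token is a valid address. The `find_ip_addresses` helper is a hypothetical name, and the approach is an assumption for illustration rather than the way DQOps implements these checks.

```python
import ipaddress


def find_ip_addresses(text: str) -> dict[str, list[str]]:
    """Collect whitespace-separated tokens that parse as IPv4 or IPv6 addresses."""
    found: dict[str, list[str]] = {"ip4": [], "ip6": []}
    for token in text.split():
        # Drop punctuation that commonly surrounds an address in prose.
        candidate = token.strip(".,;()[]\"'")
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # the token is not an IP address
        found["ip4" if address.version == 4 else "ip6"].append(candidate)
    return found


print(find_ip_addresses("Login from 192.168.1.10, then from 2001:db8::1."))
```

Because the parser validates each token, strings that only look like addresses, such as a version number or an octet above 255, are not reported. An address glued to other characters (for example, followed by a colon and a port) is missed by this simple tokenization.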
Activate PII checks in UI
The data quality check editor in DQOps shows the Personally Identifiable Information checks in the PII category. To display non-standard checks, such as contains_usa_zipcode_percent, select the Show advanced checks checkbox at the top left of the Check editor table.
The following example shows the result of detecting phone numbers, emails, and zip codes in the complaint_description column. DQOps did not detect sensitive data inside any value stored in the column.
The next example shows the result of running the same checks on the incident_address column. We can see that this public dataset contains a few phone numbers and emails.
PII checks error sampling in UI
To assist with identifying the root cause of errors and cleaning up the data, DQOps offers error sampling for PII checks. You can view representative examples of data that do not meet the specified data quality criteria by clicking on the Error sampling tab in the results section.
For additional information about error sampling, please refer to the Data Quality Error Sampling documentation.
Activate PII checks in YAML
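The PII checks are configured by setting the max_percent parameter, which is the maximum percentage of rows that may contain the detected pattern before a data quality issue is raised.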
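How a percentage measure and a `max_percent` threshold interact can be sketched in Python. The helper names, the simplified email pattern, and the decision to ignore null values are assumptions made for illustration. DQOps computes the measure with a SQL query on the data source, and its exact handling of nulls and empty tables may differ.

```python
import re
from typing import Optional, Sequence

# Simplified email pattern (an assumption, not the DQOps expression).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def contains_percent(values: Sequence[Optional[str]], pattern: re.Pattern) -> float:
    """Percentage (0-100) of non-null values that contain at least one match."""
    non_null = [value for value in values if value is not None]
    if not non_null:
        return 0.0  # assumption: an empty column counts as 0%
    matching = sum(1 for value in non_null if pattern.search(value))
    return 100.0 * matching / len(non_null)


def passes_max_percent(measured_percent: float, max_percent: float) -> bool:
    """The check passes while the measured percentage stays at or below the limit."""
    return measured_percent <= max_percent


rows = ["call me", "mail a@b.com", None, "nothing", "x@y.org please"]
percent = contains_percent(rows, EMAIL_PATTERN)
print(percent, passes_max_percent(percent, max_percent=0.0))
```

Setting `max_percent` to 0 expresses that the column must not contain the pattern at all, so any match fails the check. A higher value tolerates a small share of false positives in noisy free-text columns.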
Detecting other forms of PII data
DQOps provides built-in data quality checks only for the five most common and reliably detectable PII patterns. If you need to detect other types of PII data, please follow the manual for configuring custom data quality checks. You can make a copy of one of the built-in PII checks and adapt the regular expression to find alternative patterns.
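Before putting an adapted regular expression into a custom check, it can help to prototype it on a few sample values. The sketch below does this in Python for a hypothetical pattern that matches values shaped like US Social Security numbers (NNN-NN-NNNN). Both the pattern and the `matching_samples` helper are illustrative assumptions, and the regular expression syntax of the target database may differ from Python's.

```python
import re

# Hypothetical pattern: three digits, two digits, four digits, separated by
# dashes, and not embedded in a longer run of digits.
SSN_LIKE_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def matching_samples(samples: list[str], pattern: re.Pattern) -> list[str]:
    """Return the sample values that contain at least one match."""
    return [sample for sample in samples if pattern.search(sample)]


samples = [
    "SSN on file: 123-45-6789",
    "order 123-456-7890 shipped",
    "no numbers here",
]

print(matching_samples(samples, SSN_LIKE_PATTERN))
```

Including near misses in the samples, such as the phone-like number in the second value, shows whether the adapted pattern is specific enough before it is run against a full table.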
Use cases
Name of the example | Description |
---|---|
Percentage of rows containing USA zip codes | This example shows how to detect USA zip codes in text columns by measuring the percentage of rows containing a zip code using the contains_usa_zipcode_percent check. |
List of PII checks at a column level
Data quality check name | Friendly name | Data quality dimension | Description | Standard check |
---|---|---|---|---|
contains_usa_phone_percent | Detect USA phone numbers inside text columns | Validity | This check detects USA phone numbers inside text columns. It measures the percentage of rows containing a phone number and raises a data quality issue when too many rows contain phone numbers. | |
contains_email_percent | Detect emails inside text columns | Validity | This check detects emails inside text columns. It measures the percentage of rows containing an email and raises a data quality issue when too many rows contain emails. | |
contains_usa_zipcode_percent | Detect USA zip codes inside text columns | Validity | This check detects USA zip codes inside text columns. It measures the percentage of rows containing a zip code and raises a data quality issue when too many rows contain zip codes. | |
contains_ip4_percent | Detect IP4 addresses inside text columns | Validity | This check detects IP4 addresses inside text columns. It measures the percentage of rows containing an IP4 address and raises a data quality issue when too many rows contain IP4 addresses. | |
contains_ip6_percent | Detect IP6 addresses inside text columns | Validity | This check detects IP6 addresses inside text columns. It measures the percentage of rows containing an IP6 address and raises a data quality issue when too many rows contain IP6 addresses. | |
Reference and samples
The full list of data quality checks in this category is located in the column/pii reference. The reference section provides YAML code samples that are ready to copy and paste into the .dqotable.yaml files, the parameters reference, and samples of the data source specific SQL queries generated by the data quality sensors that these checks use.
What's next
- Learn how to run data quality checks filtering by a check category name
- Learn how to configure data quality checks and apply alerting rules
- Read the definition of data quality dimensions used by DQOps