The purpose of data quality validation
Validating data quality means making sure our data is accurate, complete, and useful. It’s like checking your ingredients before baking a cake – you want everything to be fresh and measured correctly to get a delicious result. In the same way, ensuring good data quality is vital for businesses to make smart decisions, provide excellent customer service, and avoid costly mistakes. If we don’t catch data problems early on, they can snowball into bigger issues down the line, leading to wasted time, lost revenue, and even damaged reputations.
Imagine a customer enters their email address incorrectly when placing an order. If this error slips through, they won’t receive order confirmations or shipping updates, causing frustration and potentially lost business. Catching this mistake early, while the customer is still on the website, allows for a quick fix. But if the bad data makes its way into other systems, it becomes much harder to correct. Someone, like a data steward, might have to manually track down the customer and update their information, taking up valuable time and resources. This is why it’s important to build data quality checks into every step of our data journey – from the moment it’s entered to when it’s stored and used.
Types of data architectures
Depending on where we are in the process of handling data, it flows through different layers, much like water flowing through a series of pipes and filters. Each layer presents unique opportunities to add data quality checks, ensuring our data stays clean and reliable throughout its journey.
A typical setup involves several components working together:
- Business Application Layer: This is where users interact with the system, like through a website or desktop app. It includes both the frontend (what the user sees) and backend (the behind-the-scenes logic).
- Data Communication Layer: This layer manages how different parts of the system talk to each other, often using APIs, event streaming, or message queues.
- Data Platform Layer: This is where the data is ultimately stored, usually in a data lake or warehouse. This data is then used for reporting, analytics, and machine learning projects.
Let’s take a quick look at these architectures and where we can add data quality validation:
- Business Application Layer: Here, we can catch errors early, even as the user is entering data. Think of those helpful pop-up messages that tell you if you’ve entered an invalid email address or forgotten a required field.
- Data Communication Layer: In this layer, we can add validation steps between different components, like extra checks before data is sent from one service to another.
- Data Platform Layer: In this final stage, we focus on validating data in batches or using data observability tools to monitor data quality over time. This is important because the data here is often used for critical business decisions.
These architectures are shown on the cheat sheet below.
Business Application Layer
Adding data validation right on the data entry screens lets us involve the user in the quality process. This way, we can stop bad data from even entering the system in the first place.
A typical business application is built in three layers: frontend, backend, and data storage. Let’s look at three places where we can add data quality checks in this layer:
- Client-Side Data Validation: This means putting the validation rules directly into the frontend code, like the JavaScript that runs in your web browser. It’s great for giving users immediate feedback, but it can be tricky to update the rules without redeploying the whole application. Also, it’s not the most secure option, as hackers can potentially bypass these checks.
- Server-Side Data Validation: This approach puts the validation logic in the backend, often within the API controllers that handle data requests. This gives us more flexibility to change the rules without affecting the frontend, but sending error messages back to the user can be a bit more complex. Still, it’s a very common and effective way to ensure data quality; a minimal sketch of this approach follows this list.
- Data Observability of the OLTP Database: This involves using a separate data observability platform to monitor the database where the application stores its data. This allows us to catch issues that might slip through the other checks, but it’s more of a reactive approach. The errors won’t be visible to the user right away, and someone like a data steward will need to review and address them later.
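To make the server-side option more concrete, here is a minimal sketch of backend validation using FastAPI and Pydantic. The endpoint, field names, and rules are illustrative assumptions rather than a recommendation for any specific application, and EmailStr requires the optional email-validator package.

```python
# Minimal server-side validation sketch (assumed endpoint, field names, and rules).
from fastapi import FastAPI
from pydantic import BaseModel, EmailStr, Field

app = FastAPI()

class OrderRequest(BaseModel):
    customer_email: EmailStr                       # rejects malformed email addresses
    product_code: str = Field(min_length=1, max_length=50)
    quantity: int = Field(gt=0)                    # quantity must be positive

@app.post("/orders")
def create_order(order: OrderRequest):
    # FastAPI returns an HTTP 422 response with field-level error messages when
    # validation fails, so only clean data reaches the business logic and the database.
    return {"status": "accepted", "customer_email": order.customer_email}
```

Because the rules live in the backend, they can be updated and redeployed independently of the frontend, which is exactly the flexibility described above.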
Data Communication Layer
The data communication layer acts as a bridge, connecting different parts of the backend system and allowing our business application to share data with other platforms or applications that need it.
This layer uses various technologies to move data around. Each technology requires a slightly different approach to data quality validation to ensure every action, message, or event is checked for accuracy.
Let’s explore a few common data architectures in this layer:
- REST API Gateway: This acts like a central hub in a data mesh, where different applications and components expose their APIs (ways for others to interact with them). A well-designed API mesh uses an API Gateway to manage access and routing. To add data quality validation here, we can create special “validating proxy” microservices. These proxies sit in front of the real APIs, check incoming JSON messages, and only forward them if the data is valid. A sketch of such a validating proxy appears after this list.
- Event Streaming Architectures: These handle massive volumes of messages, like those generated by IoT devices. They often use technologies like Apache Kafka and Apache Spark Structured Streaming. To validate data in this fast-paced environment, we need to integrate the validation logic directly into the data streaming pipeline, ensuring data is checked in real-time and in the correct order. A Spark Structured Streaming sketch of this approach also appears after this list.
- Enterprise Application Integration (EAI) Architectures: These use message queues to reliably send data between systems. The advantage here is that we can easily add more systems to receive copies of the messages. To validate data in an EAI setup, we typically add a new component to the message bus. This component reads incoming messages, validates them, and then routes them to two different queues: one for valid messages that continue on their journey, and another “dead-letter queue” for invalid messages that need to be fixed and resent. A small sketch of this routing pattern follows the list as well.
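As a sketch of the “validating proxy” idea, the snippet below shows a small service that checks an incoming JSON payload and only forwards it to the downstream API when it is valid. The downstream URL, payload schema, and route are assumptions for illustration.

```python
# Sketch of a validating proxy placed in front of a downstream API
# (the downstream URL and payload schema are hypothetical).
import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, EmailStr

app = FastAPI()
DOWNSTREAM_URL = "http://orders-service.internal/api/orders"  # hypothetical target API

class OrderMessage(BaseModel):
    customer_email: EmailStr
    total_amount: float

@app.post("/orders")
async def proxy_order(message: OrderMessage):
    # The payload is forwarded only after it has passed validation; invalid messages
    # are rejected with an HTTP 422 error and never reach the target API.
    async with httpx.AsyncClient() as client:
        response = await client.post(DOWNSTREAM_URL, json=message.model_dump())
    if response.status_code >= 400:
        raise HTTPException(status_code=502, detail="downstream service rejected the request")
    return response.json()
```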
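For event streaming, the hedged sketch below uses Apache Spark Structured Streaming to parse events from a Kafka topic and split them into valid and invalid streams. The broker address, topic name, schema, and validation rule are assumptions for illustration, and the job needs the Spark Kafka connector package.

```python
# Sketch of in-stream validation with Spark Structured Streaming
# (broker, topic, schema, and rules are hypothetical; requires the spark-sql-kafka connector).
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, from_json, lit
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("stream-validation").getOrCreate()

event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("temperature", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
    .option("subscribe", "sensor-events")                  # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("event"))
    .select("event.*")
)

# Validation rule: every event needs a device id and a plausible temperature.
is_valid = col("device_id").isNotNull() & col("temperature").between(-50.0, 150.0)
valid_events = events.where(is_valid)
# Rows where the rule evaluates to NULL (e.g. a missing temperature) are treated as invalid.
invalid_events = events.where(~coalesce(is_valid, lit(False)))

# Valid events continue through the pipeline; invalid_events could be written to a
# separate quarantine topic or table for review instead of the console sink used here.
query = valid_events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```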
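The dead-letter pattern can be sketched with a small consumer that reads messages from the incoming queue, validates them, and republishes each one to either a “valid” queue or a dead-letter queue. The queue names and the validation rule are illustrative assumptions, shown here with RabbitMQ and the pika client.

```python
# Sketch of a validating component on a message bus that routes messages to a
# valid queue or a dead-letter queue (queue names and rules are hypothetical).
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
for queue in ("incoming-orders", "valid-orders", "orders-dead-letter"):
    channel.queue_declare(queue=queue, durable=True)

def is_valid(order: dict) -> bool:
    # Hypothetical rule: every order needs a customer email and a positive amount.
    return bool(order.get("customer_email")) and order.get("total_amount", 0) > 0

def on_message(ch, method, properties, body):
    try:
        order = json.loads(body)
        target_queue = "valid-orders" if is_valid(order) else "orders-dead-letter"
    except json.JSONDecodeError:
        target_queue = "orders-dead-letter"
    # Route the message to the queue that matches its validation result, then acknowledge it.
    ch.basic_publish(exchange="", routing_key=target_queue, body=body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="incoming-orders", on_message_callback=on_message)
channel.start_consuming()
```

Messages that land in the dead-letter queue can then be inspected, fixed, and resent, which is exactly the recovery path described above.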
Data Storage Layer
The data storage layer is like a vast archive where we keep our data safe and sound. It’s used for long-term storage, ensuring we comply with regulations, and providing a rich source of information for analytics and future AI projects.
This layer typically relies on a data lake or a relational database that supports SQL queries, making it easy to access and analyze the data. Data pipelines are the software components responsible for moving incoming data into this storage area.
These data pipelines offer several opportunities for data quality validation:
- Data Contracts: Think of these as agreements between the business application owner and the data platform owner, clearly outlining the expected data quality. These contracts are written in a machine-readable format (like a YAML file) and can be checked by the data pipeline to ensure incoming data meets the agreed-upon standards. This approach promotes collaboration and clarity around data quality expectations. A simplified sketch of checking a batch against such a contract appears after this list.
- Data Validation Steps in the Data Pipelines: Even without a formal data contract, or when additional checks are needed, we can build validation steps directly into the data pipeline. These checks can happen before data is ingested, after it’s loaded into a temporary landing zone, and after each transformation step. This helps catch errors early and prevents new issues from being introduced by bugs in the data processing code. For instance, if the code accidentally uses the wrong data type for a field, causing customer names to get cut off, these validation steps would flag that problem. A sketch of such a post-transformation check also follows the list.
- Data Observability of the Data Platform: This acts as our final safety net. A data observability platform monitors all data sources and storage layers, looking for changes, anomalies, and data quality issues. Data teams and data stewards can configure specific checks to ensure the health of each data asset, like a table. This approach provides ongoing monitoring and helps identify problems that might have slipped through earlier checks.
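As a sketch of how a pipeline might enforce a data contract, the snippet below loads a deliberately simplified YAML contract and checks a batch of records against it before ingestion. The contract format, dataset, and column names are assumptions for illustration; real data contract specifications and tooling define much richer schemas.

```python
# Sketch of checking a batch of data against a simplified YAML data contract
# (the contract format and column names are hypothetical).
import pandas as pd
import yaml

CONTRACT_YAML = """
dataset: customer_orders
columns:
  customer_email:
    required: true
  total_amount:
    required: true
    min_value: 0
"""

def validate_against_contract(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return a list of contract violations found in the batch."""
    violations = []
    for column, rules in contract["columns"].items():
        if column not in df.columns:
            violations.append(f"missing column: {column}")
            continue
        if rules.get("required") and df[column].isna().any():
            violations.append(f"null values in required column: {column}")
        if "min_value" in rules and (df[column] < rules["min_value"]).any():
            violations.append(f"values below {rules['min_value']} in column: {column}")
    return violations

contract = yaml.safe_load(CONTRACT_YAML)
batch = pd.DataFrame({"customer_email": ["a@example.com", None], "total_amount": [19.99, -5.0]})
problems = validate_against_contract(batch, contract)
if problems:
    # In a real pipeline this would stop the load or route the batch for review.
    print(f"data contract violations: {problems}")
```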
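A validation step placed after a transformation might look like the hedged sketch below, which checks that the step did not silently truncate customer names or change a column’s data type. The column names, the length comparison, and the sample data are assumptions for illustration.

```python
# Sketch of a validation step that runs after a transformation in a data pipeline
# (column names and sample data are hypothetical).
import pandas as pd

def check_transformation_output(source: pd.DataFrame, transformed: pd.DataFrame) -> list[str]:
    issues = []
    # Row counts should match unless the transformation intentionally filters rows.
    if len(source) != len(transformed):
        issues.append(f"row count changed: {len(source)} -> {len(transformed)}")
    # Detect truncated customer names, e.g. caused by loading into a column that is too short.
    if transformed["customer_name"].str.len().max() < source["customer_name"].str.len().max():
        issues.append("customer_name values appear to be truncated")
    # Detect an accidental data type change, such as amounts loaded as strings.
    if transformed["total_amount"].dtype != source["total_amount"].dtype:
        issues.append("total_amount data type changed during transformation")
    return issues

source = pd.DataFrame({"customer_name": ["Alexandra Smith"], "total_amount": [19.99]})
transformed = pd.DataFrame({"customer_name": ["Alexandra "], "total_amount": [19.99]})
print(check_transformation_output(source, transformed))  # flags the truncated name
```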
What is Data Observability
Sometimes, validating data quality in real-time just isn’t possible. Maybe we’re dealing with a pre-built business application that we can’t easily modify, or perhaps the sheer volume of data makes real-time checks impractical. In these cases, our best bet is to thoroughly test all data quality requirements using dedicated data quality checks.
But there’s another category of data quality issues that real-time checks can’t catch: data anomalies. These are unexpected patterns or behaviors in the data that emerge over time and often require sophisticated analysis to detect. For example, imagine a data transformation process suddenly starts misinterpreting decimal separators due to a change in server settings. The value “1,001” (intended to be close to 1.0) might get loaded as “1001” (one thousand and one). This kind of error wouldn’t trigger a typical validation rule, but it would wreak havoc on any analysis relying on that data.
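To make the decimal separator example concrete, the short sketch below shows how the same text value parses to very different numbers depending on which convention the loading code assumes.

```python
# The same text value parsed under two different decimal separator conventions.
value = "1,001"

# Comma treated as a decimal separator (common in many European locales)
as_decimal = float(value.replace(",", "."))   # -> 1.001

# Comma treated as a thousands separator (common in US-style formatting)
as_thousands = float(value.replace(",", ""))  # -> 1001.0

print(as_decimal, as_thousands)
```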
This is where data observability platforms shine. They keep a watchful eye on all our data stores – databases, data lakes, data warehouses, even flat files – looking for anomalies, schema changes (like new columns or altered data types), and delays in data processing that might indicate validation steps were skipped. By combining time-series analysis with AI, these platforms can spot unusual trends and alert us to potential data quality problems that we might otherwise miss.
What is the DQOps Data Quality Operations Center
DQOps is a data observability platform designed to monitor data and measure a data quality trust score based on data quality KPIs. DQOps provides extensive support for configuring data quality checks, applying check configurations through data quality policies, detecting anomalies, and managing the data quality incident workflow.
DQOps supports multiple forms of integration, so it fits well within many data quality architectures. It exposes both a Python client library and an extensive REST API for validating data from data pipelines. Its data monitoring features run data quality checks on various data sources, and the user interface lets users inspect data quality issues and review error samples of the invalid records that must be fixed.
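As an illustration of how a data pipeline step could call a data quality service over REST, the hedged sketch below posts a request to trigger checks on a table and inspects the response. The base URL, endpoint path, payload, and response fields are placeholders rather than the actual DQOps API; consult the DQOps REST API and Python client documentation for the real operations and signatures.

```python
# Hedged sketch of calling a data quality REST API from a data pipeline step.
# The base URL, endpoint path, payload, and response structure are placeholders;
# refer to the DQOps REST API documentation for the actual operations.
import requests

DQ_API_BASE_URL = "http://localhost:8888"  # placeholder address of the data quality service

def run_table_checks(connection_name: str, table_name: str) -> bool:
    """Trigger data quality checks for a table and return True if no issues were raised."""
    response = requests.post(
        f"{DQ_API_BASE_URL}/api/run-checks",  # placeholder endpoint path
        json={"connection": connection_name, "table": table_name},
        timeout=300,
    )
    response.raise_for_status()
    result = response.json()
    # Placeholder response field: assume the service reports the highest issue severity found.
    return result.get("highest_severity", "none") in ("none", "valid")

if __name__ == "__main__":
    if not run_table_checks("data_lake", "sales.customer_orders"):
        raise SystemExit("Data quality checks failed - stopping the pipeline")
```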
You can set up DQOps locally or in your on-premises environment to see how it monitors data sources and ensures data quality within a data platform. Follow the DQOps getting started guide in the documentation to install it locally and try it out.
You may also be interested in our free eBook, “A step-by-step guide to improve data quality.” The eBook documents our proven process for managing data quality issues and ensuring a high level of data quality over time.