Architectures for Integrating Data Quality into Data Pipelines – Examples and Best Practices

The need for data quality checks

In today’s data-driven world, businesses rely on accurate information to make decisions. The data that fuels these decisions often comes from various sources, each with its own quirks and potential problems. Unfortunately, data isn’t always perfect. It can arrive with errors, inconsistencies, or even missing pieces. When this “bad data” makes its way into our systems, it can lead to faulty analyses, incorrect conclusions, and costly mistakes.

To safeguard against these issues, we need to be proactive. Think of it like a quality control check on a production line. Before data reaches its final destination – whether it’s a dashboard, a report, or a machine learning model – it passes through a data pipeline. This pipeline is responsible for retrieving the data, transforming it into a usable format, and loading it into a central location. By building data quality checks directly into this pipeline, we can catch problems early on. This not only prevents bad data from contaminating our systems but also saves us the headache of fixing issues later in the downstream data platforms.


Data quality integration options

To protect the data platform from receiving data affected by data quality issues, we have to run data quality checks directly from the data pipeline code. A pipeline acting as a data consumer can validate the data quality status of a data asset before reading and transforming the source. Alternatively, a pipeline acting as a data publisher can validate the target table right after it has been loaded. We have created a sequence diagram that shows the process of integrating data quality checks into a data pipeline and provides more details on this subject.
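To illustrate the first pattern, here is a minimal sketch of a pipeline acting as a data consumer that asks for the current data quality status of a source table before reading it. The get_table_quality_status function and the dq_status module are hypothetical placeholders for whatever status API the chosen data quality solution exposes.

```python
# Hypothetical status lookup; the real call depends on the data quality solution in use.
from dq_status import get_table_quality_status  # hypothetical helper

def ingest_customers(source_table: str) -> None:
    """Consumer-side pattern: verify the source's data quality status before reading it."""
    status = get_table_quality_status(source_table)  # e.g. "valid", "warning", "error"

    if status == "error":
        # Skip the load rather than propagate known-bad data downstream.
        raise RuntimeError(f"Source {source_table} failed its data quality checks; aborting ingestion.")

    # ... read, transform, and load the source only when its quality status is acceptable ...
```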

The data architecture choice that must be made is how to integrate data quality checks into the data pipeline. There are two options: hardcode the data quality checks into the pipeline by calling a Python library that evaluates dataframes directly, or separate the data quality check configuration into data contract files managed by a data quality platform and trigger the checks through the platform's client interface.

The first option is faster to implement, but it causes problems after the data platform is handed over from the data engineering team to the data operations team for maintenance, because any change to the data quality check rules requires a code change to the data pipeline.

Data quality checks embedded into a data pipeline

One approach to integrating data quality checks into a data pipeline is through direct embedding. This involves leveraging a Python library, either an open-source tool like Great Expectations or a custom-built solution. The library is incorporated directly into the pipeline code, with the specific data quality check configurations (e.g., validation rules, thresholds) hardcoded alongside the data transformation logic. This approach creates a tightly coupled system where data validation and transformation occur seamlessly within the same pipeline.
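The following is a minimal sketch of the embedded approach. It uses plain pandas assertions rather than a specific library, and the file path, column names, and thresholds are illustrative; a real pipeline would typically delegate the validation logic to a library such as Great Expectations.

```python
import pandas as pd

def load_daily_orders(source_path: str, target_table: str) -> None:
    """Illustrative pipeline step with data quality checks embedded in the code."""
    df = pd.read_csv(source_path)

    # Data quality checks hardcoded next to the transformation logic.
    # Changing any rule or threshold requires redeploying the whole pipeline.
    assert not df["order_id"].isnull().any(), "order_id must not contain nulls"
    assert df["order_id"].is_unique, "order_id must be unique"
    assert len(df) <= 1_000_000, "daily row count exceeded the expected maximum"

    # Transformation and loading happen in the same code path as the checks.
    df["order_date"] = pd.to_datetime(df["order_date"])
    df.to_parquet(f"{target_table}.parquet", index=False)
```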

While this embedded method offers simplicity and direct control, it comes with certain trade-offs. Any modifications to data quality checks – whether adding new ones, retiring old ones, or adjusting thresholds – necessitate redeploying the entire pipeline. Due to the tight coupling, even minor configuration changes often require re-executing both the data loading and quality check phases to assess their impact. Additionally, maintaining these checks requires familiarity with the pipeline’s codebase, which may limit the involvement of non-technical stakeholders, such as data stewards who might lack coding expertise but possess valuable domain knowledge.

Data quality checks separated into a data quality platform

A contrasting approach to data quality integration involves a clear separation between the data pipeline’s core responsibilities (retrieval, transformation, loading) and data quality validation. The code handling the data movement remains within the pipeline, while the configuration for data quality checks is externalized. This configuration is typically stored in YAML files, a human-readable format well-suited for version control systems like Git. This separation simplifies modifications and promotes reproducibility across different environments, from development to production.

To execute these checks, a standalone data quality platform is employed as a separate component within the data architecture. The data pipeline interacts with this platform through a thin integration layer, often provided as a client library. This library acts as a bridge, triggering data quality checks without requiring modifications to the pipeline’s core logic. The only code change within the pipeline is usually limited to updating the client library when the data quality platform is upgraded. This architecture promotes maintainability and flexibility, allowing data quality configurations to evolve independently of the pipeline’s primary functions.
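A minimal sketch of the externalized approach is shown below. The DataQualityClient class, its run_checks method, and the platform URL are hypothetical placeholders for whatever client library the chosen data quality platform provides; the point is that the pipeline only triggers the checks and reacts to the outcome, while the check definitions live in externally managed YAML files.

```python
# Hypothetical thin integration layer; the actual client class and method
# names depend on the data quality platform in use.
from dq_platform_client import DataQualityClient  # hypothetical client library

def publish_daily_orders(target_table: str) -> None:
    """Publisher-side pattern: load the target table, then ask the platform to validate it."""
    # ... data retrieval, transformation, and loading happen here ...

    client = DataQualityClient(base_url="https://dq-platform.internal")  # hypothetical endpoint
    # The checks themselves are defined in YAML files managed by the platform,
    # not in this pipeline, so they can change without redeploying the pipeline.
    result = client.run_checks(table=target_table, wait_for_completion=True)

    if result.highest_severity in ("error", "fatal"):
        # Stop downstream consumers from reading a table that failed validation.
        raise RuntimeError(f"Data quality checks failed for {target_table}: {result.summary}")
```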

The role of a data quality platform

Elevating data quality checks to a standalone platform unlocks a range of possibilities beyond what’s achievable with embedded solutions. First and foremost, it democratizes data quality management by providing user-friendly interfaces that empower non-technical users, such as data stewards or analysts, to define and configure checks without needing to write code. This fosters collaboration and ensures that domain expertise is directly involved in maintaining data reliability.

Furthermore, a dedicated platform can store historical data quality results, enabling trend analysis and anomaly detection over time. This historical perspective is crucial for identifying shifts in data patterns, like recent schema changes, unexpected data volume fluctuations, or numerical outliers that might signal deeper issues. The platform can also extend its capabilities beyond the data loading cycle by implementing data observability techniques. This involves continuously monitoring data health and proactively alerting on potential problems, even when the pipeline isn’t actively running.
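As a small illustration of the kind of trend analysis such a platform enables, the sketch below flags a daily row count that deviates strongly from recent history. Keeping the history in a plain list and using a three-sigma rule are assumptions made for the example; a real platform persists the results and may apply more sophisticated anomaly detection.

```python
import statistics

def is_row_count_anomaly(history: list[int], todays_count: int, sigmas: float = 3.0) -> bool:
    """Flag today's row count if it deviates from recent history by more than `sigmas` standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return todays_count != mean
    return abs(todays_count - mean) > sigmas * stdev

# Example: recent daily row counts captured by the data quality platform.
recent_counts = [980_000, 1_010_000, 995_000, 1_005_000, 990_000]
print(is_row_count_anomaly(recent_counts, todays_count=1_450_000))  # True: likely an anomaly
```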

This expanded role is particularly important when dealing with comprehensive data quality suites. As the number of checks grows, so does the time required to execute them. Decoupling the platform from the pipeline allows for asynchronous validation, preventing data quality checks from becoming a bottleneck in the data loading process.


Data quality check maintenance

The choice of data quality architecture has significant ramifications when transitioning ownership from data engineering to data operations teams. Data engineers, proficient in coding, can swiftly update embedded data quality checks within the pipeline. However, operational teams, often comprising global support specialists with varying technical backgrounds, might struggle with code-based configurations.

This becomes particularly problematic when addressing false-positive alerts. Data quality checks, initially calibrated to specific thresholds, might start triggering alerts as the data evolves. For instance, a check limiting daily records to 1 million might be valid at first, but begin to fail as the company grows and data sources expand, even though nothing is actually wrong with the data. Resolving this would require raising the threshold to, say, 2 million.

In an embedded architecture, such a change necessitates code modifications within the pipeline, often exceeding the capabilities of support teams. This can lead to delays, potential errors, and increased reliance on engineering resources. Conversely, a standalone platform empowers operational teams to modify thresholds through user interfaces, facilitating rapid adjustments without code changes. This autonomy improves responsiveness, reduces the burden on engineers, and ultimately enhances the overall maintainability of the data platform.
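To make this concrete, the sketch below shows how such a threshold can live in an externalized configuration file that is read at run time. The file name and keys are illustrative; the point is that raising the limit from 1 million to 2 million is an edit to configuration, not to pipeline code, so it does not require redeploying the pipeline.

```python
import yaml  # PyYAML; the configuration file layout and keys below are illustrative

# checks/daily_orders.yaml might contain:
#   row_count:
#     max_rows: 2000000   # raised from 1000000 after a false-positive alert
with open("checks/daily_orders.yaml") as f:
    config = yaml.safe_load(f)

max_rows = config["row_count"]["max_rows"]

def validate_row_count(actual_rows: int) -> bool:
    """Compare the actual row count against the externally configured limit."""
    return actual_rows <= max_rows
```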

The following diagram compares these two architectures, showing the impact of a false-positive data quality issue that required reconfiguring a data quality check rule. The scope of components that must be redeployed is shown as a red dashed line.

Comparison of embedded and externalized data architectures for integrating data quality checks into data pipelines

What is the DQOps Data Quality Operations Center

DQOps is a data quality platform designed to simplify the transition of data platforms from the development phase to the maintenance phase.

You can set up DQOps locally or in your on-premises environment to learn how a data quality platform that combines integration with data pipelines and no-code data quality management for non-technical users can help shorten the time to resolve data quality issues.

DQOps is an end-to-end data quality management solution that supports continuous monitoring, data profiling, and data quality incident management to facilitate issue resolution. Its unique feature is customizability, which allows organizations to define custom and fully reusable data quality checks to detect data quality issues from a business perspective.

Follow the DQOps documentation, go through the DQOps getting started guide to learn how to set up DQOps locally, and try it.

You may also be interested in our free eBook “A step-by-step guide to improve data quality”. The eBook documents our proven process for managing data quality issues and ensuring high data quality over time.
