What is Data Quality
In the current digital age, organizations are constantly collecting vast amounts of information. This data, residing in databases, files, and data lakes, is crucial for daily operations and strategic decision-making. However, the sheer volume and diversity of data can be overwhelming. When it’s time to utilize this data, be it for analysis or machine learning, ensuring its reliability and suitability for the purpose is paramount. This is where the concept of data quality comes into play.
Data quality simply refers to how usable your data is for its intended purpose. When data is well-formatted, complete, up-to-date, and relevant to the use case, we say it’s of good quality.
This high-quality data is vital for data analysts to extract valuable insights and for data scientists to construct accurate and impactful AI models. Conversely, inaccurate or incomplete data can lead to incorrect conclusions and poor business decisions.
Furthermore, data quality is not just a matter of best practices. Numerous industries are subject to regulations that mandate the use of high-quality data. Examples include the EU’s NIS 2 Directive, which aims to improve incident reporting by requiring organizations to maintain accurate and complete data on cybersecurity incidents, and the forthcoming AI Act, which will likely impose stringent data quality standards for AI model training and deployment.
In conclusion, data quality forms the bedrock for any organization that depends on data to drive its operations and achieve its objectives. By guaranteeing that data is accurate, complete, and timely, organizations can make informed decisions, build effective models, and adhere to regulatory requirements.
You can analyze data quality without coding for free
Before you continue reading: the DQOps Data Quality Operations Center is a data quality platform that enables non-technical users to find data quality issues automatically and to configure data quality checks without coding.
Please refer to the DQOps documentation to learn how to detect data quality issues for free.
How to Ensure Data Quality
Ensuring data quality is a multi-faceted process that involves rigorous testing and validation. Traditionally, organizations have relied on dedicated Data Governance or Data Quality teams to spearhead these efforts. These teams collaborate with business users, data analysts, and data scientists to gather comprehensive data quality requirements.
The specialists responsible for verifying data quality are often called Data Stewards, though many organizations are now creating dedicated Data Quality Engineer roles. These experts utilize data quality platforms to configure and execute data quality checks without the need for coding. They analyze data assets like tables to identify any quality issues and measure the overall health of the data, often by calculating a data quality KPI score.
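As a point of reference, a data quality KPI is commonly expressed as the percentage of executed data quality checks that passed during a given period; the exact formula and any weighting differ between platforms. A minimal sketch of that idea in Python (the function name and the behavior when no checks ran are assumptions made for illustration):

```python
def data_quality_kpi(passed_checks: int, executed_checks: int) -> float:
    """Return the share of passed data quality checks as a percentage (0-100)."""
    if executed_checks == 0:
        return 100.0  # assumption: with no executed checks, nothing has failed
    return 100.0 * passed_checks / executed_checks


# Example: 96 of 100 checks passed in the last month -> KPI of 96.0%
print(data_quality_kpi(passed_checks=96, executed_checks=100))
```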
Recently, another method of data quality validation has gained traction due to the rising popularity of automation in data management and the adoption of a “Shift Left” approach to data engineering. This approach emphasizes early testing by running data quality checks directly within data pipelines. While this allows for the early detection of data quality issues, coding expertise is required to implement the checks.
This article compares these two approaches—using a dedicated data quality platform versus embedding checks in code—and offers insights into which method might be better suited for various use cases. By understanding the strengths and weaknesses of each approach, organizations can make informed decisions to enhance their data quality management practices.
Data Quality and Data Management
Data Management encompasses a broad range of practices implemented by data teams to manage their data assets, databases, and data movement. Data Quality is an integral component of data management, with its influence extending across numerous activities. If processes like change management or incident management neglect data quality considerations, the overall integrity of data can deteriorate over time.
Several key data management processes need to be meticulously designed to preserve data quality:
- Change Management: When schema or configuration changes occur in data sources, it’s crucial that these changes are reflected in updates to data quality checks. This ensures that the checks remain effective and aligned with the evolving data landscape.
- Incident Management: When dealing with incidents involving invalid data, the process should include steps to review existing data quality checks, identify other potentially affected tables by tracking downstream dependencies along the data lineage, and update data quality check configurations as needed.
- Requirement and Business Process Changes: Changes in requirements or business processes should trigger updates to both the data pipeline code and the configuration of data quality checks. This guarantees that data quality remains aligned with the evolving needs of the business.
By integrating data quality considerations into these core data management processes, organizations can proactively maintain data integrity, ensure data remains fit for its intended purpose, and minimize the risk of errors or inconsistencies that can impact decision-making and operations.
Code vs Low-Code Data Quality
Two distinct approaches to establishing data quality checks are currently gaining traction. The traditional Low-Code Data Quality method, favored by Data Stewards and less technical specialists responsible for data quality, relies on user-friendly data quality platforms. These platforms offer an intuitive interface for managing the metadata of the tables being tested and for configuring data quality checks and their associated rule thresholds. While this approach may still involve a small amount of coding, it is limited to defining custom data quality checks as SQL snippets, eliminating the need for complex integration or scheduling of queries. Since SQL was designed so that business users could retrieve the data they need, this approach can aptly be termed “Low-Code” data quality.
On the other hand, the increasing adoption of open-source data orchestration platforms like Apache Airflow, as opposed to traditional ETL tools with visual interfaces, has given rise to another approach. Data engineering teams responsible for building data pipelines that ingest, transform, and load data into data lakes or warehouses are applying software development best practices, including unit testing. The concept of embedding data quality checks directly into the code and executing them before data processing steps is gaining momentum. Data engineering teams often prefer to code all tasks in a single programming language, usually Python or Scala. They may opt for Python libraries such as Great Expectations to perform basic data quality tests. In this approach, data quality validation code becomes an integral part of the data pipeline code and requires deployment whenever changes are made to data quality rule parameters.
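To illustrate the pattern, the sketch below embeds a validation step in a pipeline and fails fast before any further processing. It uses plain pandas assertions rather than a specific library; libraries such as Great Expectations offer richer, declarative equivalents, and their APIs differ between versions. The table name, column names, and thresholds are hypothetical.

```python
import pandas as pd


def validate_orders(df: pd.DataFrame) -> None:
    """Run basic data quality checks inside the pipeline, before loading."""
    errors = []
    if len(df) == 0:
        errors.append("orders table is empty")
    if df["order_id"].isna().any():
        errors.append("order_id contains null values")
    if not df["order_id"].is_unique:
        errors.append("order_id contains duplicate values")
    if (df["price"] <= 0).any():
        errors.append("price contains non-positive values")
    if errors:
        # Failing here stops the pipeline and prevents bad data from propagating downstream.
        raise ValueError(f"Data quality checks failed: {errors}")


def run_pipeline(source_path: str) -> None:
    df = pd.read_csv(source_path)  # ingest
    validate_orders(df)            # validate before transforming or loading
    # ... transformation and load steps would follow here ...
```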
No-Code Data Quality
Beyond the Low-Code approach that occasionally necessitates SQL queries, there is another avenue for ensuring data quality: No-Code Data Quality. This approach enables data quality testing without writing any SQL; user interaction is limited to configuring or reviewing proposed data quality rule thresholds, which are the parameters of data quality checks, such as a minimum row count or the number of unique values.
Here are three ways to achieve No-Code Data Quality:
- Templated Data Quality Checks: Utilize a data quality platform offering pre-defined data quality checks with a user interface to input only the desired parameter values. For example, you might enter “100” in the configuration of a check that verifies the minimum number of unique values in a column.
- Data Quality Rule Mining: Delegate data quality check configuration to an engine that analyzes data structure and sample values to propose a suitable list of checks adhering to best practices. This eliminates the need for manual configuration; a simplified sketch of the idea follows this list.
- Data Observability: Leverage a Data Observability platform without in-depth data quality rule configuration. These tools excel when a table’s data quality is generally good, and degradation might only occur due to unexpected schema changes or shifts in data distribution. They monitor tables, detect schema changes, and identify anomalies that could potentially impact data quality.
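To make the rule-mining idea from the list above more concrete, here is a deliberately simplified sketch of what such an engine might do: profile a column and propose check thresholds with a margin around the observed values. Production engines analyze far more signals (types, formats, distributions, historical trends); the 10% margin and the specific proposed checks here are arbitrary illustrations.

```python
import pandas as pd


def propose_checks(df: pd.DataFrame, column: str) -> dict:
    """A toy rule-mining heuristic that derives check thresholds from observed data."""
    series = df[column]
    proposed = {
        "min_row_count": int(len(df) * 0.9),                 # tolerate ~10% fewer rows
        "max_null_percent": round(float(series.isna().mean()) * 100, 2),
        "min_distinct_count": int(series.nunique() * 0.9),   # tolerate ~10% fewer distinct values
    }
    if pd.api.types.is_numeric_dtype(series):
        proposed["min_value"] = float(series.min())
        proposed["max_value"] = float(series.max())
    return proposed
```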
Implementing Data Quality Checks in Code
Incorporating data quality checks directly into your data pipeline code, often using Python libraries, offers several advantages:
- Flexibility and Customization: You have complete control over how you define and implement your checks. This flexibility allows for tailoring checks to highly specific or complex data quality requirements that may not be easily addressed by pre-built checks in a no-code platform (see the sketch after this list).
- Tight Integration: The checks become an integral part of your data pipeline, ensuring data quality is validated at every stage of the process. This can help identify and rectify issues early on, preventing them from propagating downstream.
- Cost-Effectiveness: For smaller projects or organizations with strong programming skills, a code-based approach can be more cost-effective than investing in a separate data quality platform.
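As an example of that flexibility, the hypothetical check below validates a cross-column business rule that generic, pre-built checks rarely cover. The column names and tolerance are assumptions made for illustration.

```python
import pandas as pd


def check_invoice_consistency(df: pd.DataFrame, tolerance: float = 0.01) -> pd.DataFrame:
    """Custom rule: net_amount + tax_amount must equal gross_amount within a tolerance.

    Returns the offending rows so they can be logged or quarantined rather than
    silently dropped.
    """
    difference = (df["net_amount"] + df["tax_amount"] - df["gross_amount"]).abs()
    return df[difference > tolerance]


# Usage sketch inside a pipeline step:
# bad_rows = check_invoice_consistency(invoices_df)
# if not bad_rows.empty:
#     raise ValueError(f"{len(bad_rows)} invoices violate the amount consistency rule")
```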
However, there are also some drawbacks to consider:
- Requires Coding Expertise: Implementing and maintaining code-based checks demands proficiency in programming languages like Python and familiarity with data quality libraries. This can pose a challenge for organizations without dedicated data engineering teams.
- Potential for Increased Complexity: As your data pipelines and quality requirements grow, the codebase for data quality checks can become increasingly complex and difficult to manage. This can lead to technical debt and increased maintenance overhead.
- Limited Visibility for Non-Technical Users: Code-based checks can be opaque to business users and other non-technical stakeholders, hindering collaboration and understanding of data quality rules.
Overall, embedding data quality checks in code is a powerful approach that offers flexibility and control, but it requires technical expertise and careful management to avoid complexity and ensure transparency.
Leveraging Low-Code Data Quality Platforms
Adopting a low-code data quality platform for defining and managing data quality checks brings forth its own set of advantages:
- Accessibility and Ease of Use: Low-code platforms provide intuitive, user-friendly interfaces that empower non-technical users, such as data stewards and business analysts, to actively participate in the data quality process. This democratizes data quality management and reduces reliance on specialized technical skills.
- Centralized Management and Collaboration: These platforms offer a centralized repository for data quality rules, promoting consistency and visibility across the organization. They often include features that facilitate collaboration and knowledge sharing among teams, fostering a data-driven culture.
- Rapid Implementation and Iteration: Pre-built checks, templates, and automated rule suggestions accelerate the implementation of data quality checks. The visual nature of low-code and no-code platforms also simplifies rule adjustments and iterations, enabling agile responses to evolving data requirements.
- Proactive Monitoring and Issue Detection: Many low-code platforms incorporate built-in monitoring and anomaly detection capabilities. These features help identify potential data quality issues early, allowing for timely intervention and prevention of downstream impacts.
- Clear and Comprehensive Reporting: Low-code platforms often provide interactive dashboards and visualizations that present data quality metrics, health scores, and historical trends in a clear and accessible manner. This facilitates communication and data-driven decision-making across the organization.
However, low-code platforms may also have some limitations:
- Potential for Customization Constraints: While these platforms offer a wide range of pre-built checks and templates, they might not cater to every unique or highly specialized data quality requirement. In some cases, achieving granular control or addressing complex scenarios might necessitate custom code or integration efforts.
- Upfront Investment: Implementing a low-code data quality platform typically involves an initial financial investment. Organizations should carefully evaluate the long-term costs and implications before committing to a specific platform. This obstacle can be overcome by using open-source data quality platforms.
Overall, low-code data quality platforms present a compelling solution for organizations seeking to democratize data quality management, foster collaboration, and accelerate the implementation of data quality checks. However, it’s important to weigh the potential limitations and ensure the chosen platform aligns with the organization’s specific needs and future growth plans.
These two approaches are compared in the infographic below.
Limitations of Testing Data Quality in Code
While embedding data quality checks directly into data pipeline code may seem like a swift and efficient approach, its limitations often become apparent within a few months of production use. Typically, data engineers initially receive only basic requirements regarding data types, valid value ranges, and field formats. These checks, while easy to implement in code, primarily validate data integrity, not its true business usability.
As business users, data analysts, data scientists, and data stewards start utilizing the data, they inevitably uncover missing or incorrect information. They then request updates to the data platform and additional data quality checks that assess the data from a usage perspective. These checks, now intertwined with the business logic, are added to the pipeline code and deployed alongside other platform changes. The pipeline now encompasses both data integrity checks and business-driven validations.
However, data formats and types change infrequently, usually only when upstream data sources are modified. Business requirements, such as “the maximum product price must be below $1000,” can fluctuate as the company evolves, enters new markets, or adjusts its product offerings. The first instance of a product exceeding this price threshold triggers a failure in the code-embedded check, halting data processing for all other products. The data engineering or operations team is then alerted to the pipeline failure and must sift through logs to identify the specific check and table causing the issue.
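To make this limitation concrete, a business rule embedded in pipeline code often ends up as a hard-coded constant, as in the hypothetical fragment below. Raising the limit requires a code change and a redeployment, and a single offending row halts the entire load.

```python
import pandas as pd

MAX_PRODUCT_PRICE = 1000  # hard-coded business threshold; changing it requires a redeployment


def validate_products(df: pd.DataFrame) -> None:
    too_expensive = df[df["price"] >= MAX_PRODUCT_PRICE]
    if not too_expensive.empty:
        # The exception halts the pipeline for ALL products, not just the offending rows.
        raise ValueError(
            f"{len(too_expensive)} products exceed the {MAX_PRODUCT_PRICE} price limit"
        )
```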
To rectify the situation, a business user must request that the data engineers update the price limit and redeploy the code. This highlights another limitation of the code-based approach: non-technical users, including data platform owners and data stewards, lack the autonomy to configure new checks or directly review data quality issues without relying on data engineers.
In contrast, a low-code approach circumvents this limitation. Data engineers focus on their core responsibilities of data ingestion, transformation, and loading, while also integrating calls to a data quality platform within their pipelines. Non-technical teams can then leverage a user-friendly interface to configure and manage data quality checks at scale, reviewing results in real-time without waiting for code deployments. This fosters greater collaboration and agility in maintaining data quality throughout the data lifecycle.
Combining Code and Low-Code Data Quality Together
A hybrid approach that merges the strengths of both code-based and low-code data quality methods is also possible. This synergy is achieved when the data quality platform provides a Python client library that can be seamlessly integrated into data pipelines. In this scenario, the data pipeline code leverages the library to interact with the data quality platform and trigger the execution of data quality checks.
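The integration pattern looks roughly like the sketch below. The client library, function names, and parameters here are made up for illustration; a real platform client (DQOps ships its own Python client, for example) exposes its own API, but the division of responsibilities is the same: the rules live in the platform, while the pipeline only triggers their execution and reacts to the outcome.

```python
from my_dq_platform_client import run_checks  # hypothetical client library, not a real package


def load_orders_step() -> None:
    # ... ingestion and transformation code for the sales.orders table goes here ...

    # Trigger the checks that non-technical users configured in the platform's UI.
    # Rule thresholds can be changed there at any time without redeploying this code.
    result = run_checks(connection="warehouse", table="sales.orders")

    if result.has_errors:
        # The pipeline only reacts to the outcome; the rules themselves live in the platform.
        raise RuntimeError("Data quality checks failed for sales.orders")

    # ... load code goes here ...
```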
This hybrid model offers the best of both worlds:
- Data Engineers retain control: Data engineers can still embed data quality checks directly within their pipelines, maintaining control over the validation process and ensuring tight integration with data transformations.
- Empowerment for non-technical users: Simultaneously, non-coding users like data stewards can continue to utilize the user-friendly interface of the data quality platform to define, configure, and modify data quality checks at any time.
- Real-time updates: Changes to the data quality check configuration on the platform can be applied instantly, eliminating the need for time-consuming code redeployments and minimizing disruptions to data pipelines.
This combined approach fosters collaboration between technical and non-technical teams, streamlines data quality management, and allows for greater agility in responding to evolving data requirements. It’s a testament to the fact that code-based and low-code approaches are not mutually exclusive and can be leveraged together to create a robust and adaptable data quality framework.
What is the DQOps Data Quality Operations Center
DQOps is a data observability platform designed to monitor data and assess the data quality trust score with data quality KPIs. DQOps provides extensive support for configuring data quality checks, applying configuration by data quality policies, detecting anomalies, and managing the data quality incident workflow.
DQOps combines all three approaches to data quality. It provides a no-code data quality user interface that uses a data quality rule mining engine to propose data quality checks. Low-code data quality management is supported by letting users define custom data quality checks as templated SQL queries. Finally, DQOps provides an extensive Python client library that can run data quality checks and automate the whole platform from data pipelines.
You can set up DQOps locally or in your on-premises environment to see how it monitors data sources and ensures data quality within a data platform. Follow the DQOps documentation and the getting started guide to install the platform and try it.
You may also be interested in our free eBook, “A step-by-step guide to improve data quality.” The eBook documents our proven process for managing data quality issues and ensuring a high level of data quality over time. This is a great resource to learn about data quality.