What is a Data Contract? Definition, Validation and Best Practices

A data contract defines the structure, meaning, and data quality validations for data exchanged between a data publisher and a data consumer. In a world of data-driven decision-making, ensuring data quality and a clear, shared understanding of each dataset's role is essential for smooth cooperation between the parties exchanging data. This is where data contracts come into play. A data contract is a formal agreement, written in a machine-readable format, that defines the structure, purpose, and expectations surrounding a specific dataset. It acts as a bridge between data publishers (those who create and provide the data) and data consumers (those who use it).

By establishing clear guidelines and expectations, data contracts promote trust and transparency, enabling efficient and effective data utilization across an organization.

A sample data contract in YAML format is shown in the infographic below.

Infographic: a sample data contract, its components, and their descriptions (by DQOps)


You can define and validate data contracts for free

Before you continue reading: DQOps Data Quality Operations Center is a data quality platform that supports data contracts. You define the data quality requirements for each table in a YAML file, which becomes a data contract. You can then use the data quality client library to evaluate the contract from your data pipelines.

Please refer to the DQOps documentation to learn how to start validating data contracts.

The purpose of Data Contracts

Data contracts play a pivotal role in modern data management strategies. Their primary purpose is twofold:

  • Ensuring Data Quality: By explicitly defining the structure, format, and constraints of data, data contracts establish a clear benchmark for data quality. This empowers data consumers to confidently rely on the data’s accuracy and completeness, minimizing the risk of errors and inconsistencies that can derail decision-making processes.
  • Fostering Understanding: Data contracts provide valuable context and metadata about data assets. This helps data consumers understand the purpose, origin, and intended use of the data, enabling them to make informed choices about which datasets to leverage for their specific needs. It also facilitates collaboration and knowledge sharing across teams and departments.

In essence, data contracts act as a “single source of truth” about data, promoting data governance, streamlining data integration efforts, and ultimately driving greater value from data assets. They apply the principles of contracts to data exchange between parties, defining their obligations related to the agreed structure, schema, and quality metrics of the exchanged data.

How Data Contracts are Defined

To be truly effective, data contracts must be machine-readable. This ensures they can be seamlessly interpreted and enforced by automated data validation mechanisms within the data flow, maintaining data integrity and consistency. However, it’s equally important for data contracts to be human-readable.

Utilizing plain-text formats that are easy to inspect, modify, and write offers significant advantages. It simplifies the process of defining, correcting, and evolving data contracts as requirements change or new data sources are added. Moreover, it eliminates the need for specialized tools to read and visualize the contract’s definition. One of the most convenient and widely adopted text formats for storing data contracts is YAML, known for its clear structure and readability.

When a user encounters a dataset accompanied by a data contract file, they can instantly review it to gain a comprehensive understanding of the structure and definition of all data elements. This accessibility fosters transparency and collaboration, enabling users to leverage the data with confidence.
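To make this concrete, here is a minimal sketch in Python that parses such a contract with the PyYAML library. The contract keys (dataset, owner, columns) are hypothetical and do not follow the schema of DQOps or any other specific standard; the point is only that the same plain-text file a person can read is also trivial to load programmatically.

```python
# A minimal sketch: the same plain-text contract that a person can read
# can be parsed with standard tooling, no specialized software required.
# The field names below are illustrative, not a specific contract standard.
import yaml  # PyYAML

CONTRACT_YAML = """
dataset: crm.customers
owner: crm-data-team@example.com
columns:
  customer_id: {type: integer, nullable: false}
  email:       {type: string, format: email}
  created_at:  {type: timestamp}
"""

contract = yaml.safe_load(CONTRACT_YAML)
print(contract["dataset"])                          # crm.customers
print(list(contract["columns"]))                    # ['customer_id', 'email', 'created_at']
print(contract["columns"]["customer_id"]["type"])   # integer
```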

Components of a Data Contract

A well-defined data contract should be established for each distinct type of tabular data, whether it’s a traditional database table or a more complex dataset. This ensures clarity and consistency across all data assets. Let’s delve into the key components that comprise a robust data contract:

  1. Data Schema
    • Think of this as the blueprint of your table. It provides a comprehensive definition, listing every column, its purpose (what kind of information it holds), and the type of data it stores (like numbers, text, or dates).
  2. Data Format
    • This specifies how the data within each column should be structured. For instance, it might define that email addresses must follow a specific pattern or that phone numbers should include country codes.
  3. Identity and Relations
    • This highlights the unique identifiers (primary keys) within the table and how it connects to other tables (foreign keys). It’s like mapping the relationships in your data family.
  4. Constraints
    • These are the rules your data must follow. For example, certain columns might be required to always have a value (not null), or some values might need to be unique across all rows.
  5. Ownership
    • This clearly states who’s responsible for the data. If you have questions or encounter issues, you’ll know who to contact.
  6. Sensitivity
    • Some data is sensitive and needs extra protection. This section identifies columns that might contain personal or confidential information, requiring special handling or anonymization.
  7. Criticality
    • Not all data is created equal. This indicates how important the table is. If problems arise with multiple tables, this helps prioritize which ones to fix first.
  8. Data Quality Rules
    • These are checks the data publisher performs to ensure the data is accurate. It might include things like making sure values fall within a certain range or that there are no duplicates.
  9. Service Level Objectives (SLOs)
    • This sets expectations for data quality and availability. It defines things like how fresh the data should be, how often it should be available, and what percentage of it needs to pass quality checks.
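Because the infographic's example may not be visible in this text form, the sketch below assembles the nine components into a single illustrative YAML document and loads it in Python with the PyYAML library. The keys and layout are hypothetical; they are not the YAML schema of DQOps or of any formal data contract standard.

```python
# Illustrative only: the keys below are hypothetical and do not follow the
# YAML schema of DQOps or any other specific data contract standard.
import yaml  # PyYAML

CONTRACT_YAML = """
dataset: sales.orders
owner:                        # 5. Ownership
  team: sales-data-engineering
  contact: sales-data@example.com
sensitivity:                  # 6. Sensitivity
  pii_columns: [customer_email]
criticality: high             # 7. Criticality
schema:                       # 1. Data Schema and 2. Data Format
  order_id:       {type: integer}
  customer_email: {type: string, format: email}
  order_date:     {type: date, format: "YYYY-MM-DD"}
  amount:         {type: decimal}
identity:                     # 3. Identity and Relations
  primary_key: [order_id]
  foreign_keys:
    - {column: customer_email, references: crm.customers.email}
constraints:                  # 4. Constraints
  not_null: [order_id, order_date, amount]
  unique: [order_id]
quality_rules:                # 8. Data Quality Rules
  - {column: amount, check: min_value, value: 0}
slo:                          # 9. Service Level Objectives
  freshness_hours: 24
  min_rows_passing_checks_percent: 99.0
"""

contract = yaml.safe_load(CONTRACT_YAML)
assert contract["identity"]["primary_key"] == ["order_id"]
print(contract["slo"]["freshness_hours"])  # 24
```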

Validating Data Contracts

Data contracts are not just static documents; they are actively enforced to guarantee the integrity and quality of your data. This validation process occurs at multiple stages throughout the data lifecycle:

  • Validation by the Publisher: Before data is even shared, the publisher (the source of the data) checks it against the data contract. This ensures that only data that meets the defined standards is released. It’s like a quality control check before a product leaves the factory.
  • Validation within the Infrastructure: As data travels through your systems, it’s continuously monitored. Any data that doesn’t conform to the contract’s rules is flagged and set aside for correction. Think of this as an inspection line within the data pipeline, catching any defects before they cause problems downstream.
  • Validation after Publishing: Even after the data is available, monitoring continues. Specialized tools keep an eye on the data, looking for any inconsistencies or issues that might have slipped through. This is like having a watchdog that alerts you if anything unexpected happens.
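As a minimal sketch of the first stage, the Python snippet below shows a publisher-side gate that checks an outgoing batch against the contract before releasing it. The contract layout matches the hypothetical YAML used earlier in this article, and the validation logic is only illustrative, not the mechanism of any particular tool.

```python
# A minimal sketch of publisher-side validation: compare an outgoing batch
# against the contract before releasing it. The contract layout matches the
# illustrative (hypothetical) YAML used earlier in this article.
import yaml  # PyYAML

CONTRACT_YAML = """
schema:
  order_id:       {type: integer}
  customer_email: {type: string}
  amount:         {type: decimal}
constraints:
  not_null: [order_id, amount]
"""

def validate_before_publishing(batch, contract):
    """Return a list of violations; an empty list means the batch can be published."""
    declared = set(contract["schema"])
    required = set(contract["constraints"]["not_null"])
    violations = []
    for i, record in enumerate(batch):
        undeclared = set(record) - declared
        missing = required - {key for key, value in record.items() if value is not None}
        if undeclared:
            violations.append(f"record {i}: undeclared columns {sorted(undeclared)}")
        if missing:
            violations.append(f"record {i}: required columns missing or null {sorted(missing)}")
    return violations

contract = yaml.safe_load(CONTRACT_YAML)
batch = [
    {"order_id": 1, "customer_email": "a@example.com", "amount": 10.5},
    {"order_id": 2, "customer_email": None, "amount": None},  # amount violates not_null
]
problems = validate_before_publishing(batch, contract)
if problems:
    print("Refusing to publish:")
    for problem in problems:
        print(" -", problem)
```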


By implementing validation at these different levels, you create a robust system that helps maintain data quality and trust. It’s like having multiple layers of security to protect your valuable data assets.

Data Contract Validation Levels

Data quality requirements should be classified into three distinct levels to ensure effective validation and appropriate responses:

  1. Hard Dataset Constraints:
    These constraints pertain to the overall structure and schema of the dataset. Any mismatch between the record’s schema and the schema defined in the data contract should trigger an immediate halt in processing. This prevents potential data corruption in downstream systems that rely on the data’s integrity.
  2. Hard Record-Level Constraints:
    These constraints apply to individual records within the dataset. Violations of these constraints, such as null values in mandatory fields or values in incorrect formats, should result in the record being sent to a dead-letter queue. This allows for manual inspection, correction, and reprocessing of the problematic record without disrupting the overall data flow.
  3. Service Level Objectives (Quality Expectations):
    These are metrics and thresholds that represent the desired quality level of the data, but are not necessarily hard constraints. Examples include the percentage of anomalies or data timeliness. These metrics should be continuously monitored using a data observability tool on the output dataset. Deviations from the expected levels can trigger alerts and prompt further investigation but do not necessarily require immediate intervention or halting of data processing.

By categorizing data quality requirements into these three levels, organizations can implement a more nuanced and effective approach to data contract validation, ensuring both data integrity and efficient data processing workflows.
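The sketch below shows one way a pipeline step might react at each of the three levels, reusing the hypothetical contract layout from the earlier sketches: a schema mismatch halts processing, failing records go to a dead-letter queue (a plain list here), and a missed SLO only raises an alert.

```python
# Illustrative handling of the three validation levels; the contract layout
# is the same hypothetical YAML structure used earlier in this article.
import yaml  # PyYAML

CONTRACT_YAML = """
schema:
  order_id: {type: integer}
  amount:   {type: decimal}
constraints:
  not_null: [order_id, amount]
slo:
  min_rows_passing_checks_percent: 99.0
"""

class SchemaMismatchError(Exception):
    """Hard dataset constraint: stop processing immediately."""

def process_batch(batch, contract):
    declared = set(contract["schema"])
    required = set(contract["constraints"]["not_null"])

    # 1. Hard dataset constraint: the batch schema must match the contract.
    observed = set().union(*(record.keys() for record in batch))
    if observed != declared:
        raise SchemaMismatchError(f"schema drift on columns: {sorted(observed ^ declared)}")

    # 2. Hard record-level constraints: route bad records to a dead-letter queue.
    dead_letter_queue, accepted = [], []
    for record in batch:
        if any(record.get(column) is None for column in required):
            dead_letter_queue.append(record)
        else:
            accepted.append(record)

    # 3. Service level objective: monitor the pass rate, alert but do not halt.
    pass_rate = 100.0 * len(accepted) / len(batch)
    if pass_rate < contract["slo"]["min_rows_passing_checks_percent"]:
        print(f"SLO alert: only {pass_rate:.1f}% of rows passed record-level checks")

    return accepted, dead_letter_queue

contract = yaml.safe_load(CONTRACT_YAML)
good, rejected = process_batch(
    [{"order_id": 1, "amount": 9.99}, {"order_id": 2, "amount": None}],
    contract,
)
print(len(good), len(rejected))  # 1 1
```

In a production setting, the dead-letter queue would typically be a durable store or message queue, and the SLO alert would be routed to a data observability tool rather than printed to standard output.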


What is the DQOps Data Quality Operations Center

DQOps is a data quality and observability platform designed to monitor data and assess the data quality trust score with data quality KPIs. DQOps provides extensive support for configuring data quality checks, applying configuration through data quality policies, detecting anomalies, and managing the data quality incident workflow.

DQOps stores the configuration of data quality checks and the table schema in YAML files, which have a structure that is compatible with the idea of data contracts. The data quality requirements for each table are defined in a separate YAML file that can be stored in a version control system, such as Git. Python and REST API clients allow performing data validation at every stage of the data pipeline.

You can set up DQOps locally or in your on-premises environment to see how it monitors data sources and ensures data quality within a data platform. Follow the DQOps getting started guide in the documentation to install the platform locally and try it out.

You may also be interested in our free eBook, “A step-by-step guide to improve data quality.” The eBook documents our proven process for managing data quality issues and ensuring a high level of data quality over time.
