Data product definition
In today’s data-driven landscape, the concept of a data product has emerged as a crucial component of effective data management and utilization. As organizations strive to extract value from their vast data reserves, data products offer a structured and focused approach to delivering actionable insights and driving business outcomes. Unlike traditional data projects that often involve complex, monolithic structures, data products encapsulate specific data domains, solving one distinct use case at a time.
This granular approach aligns with the principles of data mesh, a modern architectural paradigm that promotes decentralized data ownership and domain-driven design. Data products within a data mesh have a clear ownership scope, ensuring accountability and efficient management throughout the entire lifecycle. They are designed to be discoverable, accessible, and interoperable, enabling seamless integration and collaboration across different teams and departments. By understanding the essence of data products and their role in the data ecosystem, organizations can unlock the true potential of their data assets and pave the way for data-driven innovation.
Data product in a Data Mesh architecture
Within a data mesh architecture, data products play a central role in organizing and delivering value from data assets. A common approach is to organize the data lake using a medallion architecture, consisting of bronze, silver, and gold layers. Each layer represents a distinct stage of data refinement, and within each layer, specific types of data products can be defined.
At the ingestion level, source-oriented data products handle the ingestion of raw data from various sources. These data products are responsible for ensuring data quality, consistency, and availability as the data enters the lake.
In the middle layer, aggregation-oriented data products manage the cleansed and standardized data. They aggregate data from different sources, resolve inconsistencies, and apply business rules to create a validated repository of data assets.
At the top, consumer-oriented data products transform the aggregated data into formats suitable for consumption by specific use cases. This could involve creating data marts, exposing APIs, or generating reports and dashboards for business intelligence tools.
Each data product encapsulates a single, well-defined role, aligning with the principles of Domain-Driven Design (DDD). This creates a bounded context for each data product, with clear ownership, responsibility, inputs, outputs, and schema. By adhering to DDD principles, data products become more manageable, maintainable, and adaptable to evolving business needs.
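The split into source-, aggregation-, and consumer-oriented data products can be pictured as three small, independently owned pipeline stages, each with one responsibility and a clear input/output schema. The sketch below is a minimal illustration only; the record structure, function names, and the orders example are assumptions, not part of any specific platform.

```python
from typing import Iterable

# Bronze: a source-oriented data product ingests raw records and guards
# basic quality and consistency at the boundary of the data lake.
def ingest_orders(raw_records: Iterable[dict]) -> list[dict]:
    return [r for r in raw_records if r.get("order_id") is not None]

# Silver: an aggregation-oriented data product cleanses and standardizes
# data from one or more sources and applies shared business rules.
def standardize_orders(bronze_orders: list[dict]) -> list[dict]:
    return [
        {**o, "currency": o.get("currency", "USD").upper()}
        for o in bronze_orders
    ]

# Gold: a consumer-oriented data product reshapes validated data into a
# format that a specific use case (a report, an API, a dashboard) can consume.
def revenue_per_customer(silver_orders: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = {}
    for o in silver_orders:
        totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]
    return totals

# Each stage is a bounded context with a single, well-defined role.
raw = [{"order_id": 1, "customer_id": "c1", "amount": 10.0, "currency": "usd"}]
print(revenue_per_customer(standardize_orders(ingest_orders(raw))))
```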
The core concepts of a data product are shown in the following image.
Federated Computational Governance
In the era of data warehousing, centralized data teams were often responsible for building and managing data solutions. However, this approach lacked the agility and domain-specific knowledge required to build data products at scale. As a result, many business departments opted to build their own data solutions internally, often leading to failures due to incompatibilities, duplicated costs, and inefficient resource utilization.
Data mesh introduces the concept of federated computational governance as a solution to these challenges. It establishes a governing and advisory body that provides policies and guidelines for shared concepts, such as data access management, logging, data quality, and compliance. This ensures consistency and adherence to best practices across all data products within the organization.
Furthermore, federated computational governance is responsible for providing shared resources, including data storage, compute engines, orchestration tools, data quality platforms, data catalogs, and metadata standards. These resources are essential for building and maintaining data products, and their centralized provision ensures efficient utilization and cost optimization.
To streamline the process and avoid errors, provisioning these resources for each data product should be fully automated through a self-service portal. This portal ensures that all data products comply with established policies and are correctly registered within the broader data platform. By implementing federated computational governance, organizations can achieve a balance between centralized control and decentralized autonomy, fostering a collaborative and efficient data ecosystem.
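As a rough illustration, a self-service portal might accept a small data product descriptor and check it against the governance policies before provisioning any shared resources. Everything below, including the descriptor fields, policy names, and functions, is a hypothetical sketch rather than an existing tool.

```python
# Policies mandated by federated computational governance (illustrative names).
REQUIRED_POLICIES = {"access_management", "logging", "data_quality", "compliance"}

def validate_request(descriptor: dict) -> list[str]:
    """Return a list of policy violations for a provisioning request."""
    missing = REQUIRED_POLICIES - set(descriptor.get("policies", []))
    return [f"missing policy: {p}" for p in sorted(missing)]

def provision(descriptor: dict) -> None:
    """Provision shared resources and register the data product in the platform."""
    violations = validate_request(descriptor)
    if violations:
        raise ValueError("; ".join(violations))
    # In a real platform these steps would call the shared storage, orchestration,
    # data catalog, and monitoring services provided by the central team.
    print(f"provisioning storage and compute for {descriptor['name']}")
    print(f"registering {descriptor['name']} in the data catalog")

provision({
    "name": "orders_curated",
    "owner": "sales-analytics-team",
    "policies": ["access_management", "logging", "data_quality", "compliance"],
})
```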
Data product components
A data product, while deployed on a shared data and compute infrastructure, maintains ownership of its code, data, and interfaces. The code component of a data product encompasses four key areas:
- Core Logic: This includes the code responsible for the core functionality of the data pipeline, encompassing data ingestion, transformation, aggregation, and publishing.
- Metadata Sharing: This code facilitates access to the metadata and schema of the data product, integrating it with data catalogs, data quality tools, and logging mechanisms. It ensures that relevant information about the data product is discoverable and accessible to other components within the data ecosystem.
- Shared Concerns: This code addresses cross-cutting concerns mandated by the policies defined as part of federated computational governance. It includes aspects like security, logging, observability, and traceability, ensuring compliance and adherence to best practices across all data products.
- Data Quality Rules: This component defines data quality checks that must be satisfied by the data sources ingested by the data product. It also defines data quality rules for the output data produced by the data product, ensuring the reliability and trustworthiness of the data throughout its lifecycle.
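One way to picture these four areas is a single repository per data product, with each concern isolated in its own part of the code. The skeleton below is an illustrative assumption, not a prescribed structure; the metadata fields and quality rules are made up for the example.

```python
import json
import logging

log = logging.getLogger("orders_curated")          # shared concern: logging/observability

PRODUCT_METADATA = {                                # metadata sharing: what gets pushed
    "name": "orders_curated",                       # to the data catalog and quality tools
    "owner": "sales-analytics-team",
    "output_schema": {"customer_id": "string", "revenue": "double"},
}

QUALITY_RULES = [                                   # data quality rules for the inputs
    {"column": "order_id", "check": "not_null"},    # and outputs of the data product
    {"column": "amount", "check": "min_value", "value": 0},
]

def run_pipeline(raw_records: list[dict]) -> list[dict]:
    """Core logic: ingest, transform, aggregate, and publish."""
    log.info("ingested %d records", len(raw_records))
    return [r for r in raw_records if r.get("order_id") is not None]

def publish_metadata() -> str:
    """Expose the product's metadata and schema to the rest of the data ecosystem."""
    return json.dumps(PRODUCT_METADATA)
```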
Beyond the code, a data product owns all the data it processes and the interfaces that expose this data to consumers. Data products can publish data using various data delivery architectures, including:
- Batch Processing: Producing and publishing datasets that are regularly updated.
- Real-time Streaming: Sending real-time events to connected data consumers.
- On-demand API: Exposing a consumable API for on-demand data delivery, providing a queryable interface.
A data product is not limited to a single data delivery channel and can leverage multiple methods simultaneously to cater to diverse consumer needs.
The interfaces for receiving source data and publishing output data are known as input ports and output ports, respectively. The schema accepted and published by these ports is defined within the data product, and a preferred format is a machine-readable data contract. This data contract can be published to data consumers, making it discoverable and facilitating seamless integration with other data products and applications.
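A machine-readable data contract for an output port can be as small as a versioned document describing the schema, the delivery channel, and the quality expectations. The sketch below serializes such a document with PyYAML; the field names follow no particular contract standard and are assumptions used only for illustration.

```python
import yaml  # PyYAML

# Illustrative data contract for one output port of a data product.
contract = {
    "data_product": "orders_curated",
    "output_port": "revenue_per_customer",
    "version": "1.2.0",
    "delivery": "batch",                      # batch, streaming, or on-demand API
    "schema": [
        {"name": "customer_id", "type": "string", "nullable": False},
        {"name": "revenue", "type": "double", "nullable": False},
    ],
    "quality": {"freshness_hours": 24, "minimum_row_count": 1},
}

# Publishing the contract as YAML makes it easy to register in a data catalog
# and to validate consumer expectations automatically.
print(yaml.safe_dump(contract, sort_keys=False))
```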
Data product ownership
Clear ownership is a cornerstone of successful data products. In a decentralized organization, where data is distributed across different domains, having designated data product owners becomes crucial for unlocking the full potential of data assets. Data product owners are individuals who possess in-depth knowledge of the business area managed by the data product. They understand the intricacies of the business processes and the data model of the IT systems that generate the data.
The success of a data product hinges on applying product thinking principles to the data domain it manages. A well-designed data product should be easy to discover and consume, and it should provide visible value to its users. Data product owners play a pivotal role in achieving this value by:
- Staying abreast of changes: Data product owners must actively monitor and adapt to changes in business processes and organizational structures that may impact the data product. This involves updating data transformation code, metadata, and data contracts to ensure the data product remains relevant and aligned with evolving business needs.
- Promoting the data product: A key responsibility of data product owners is to communicate and promote the data product within the organization. This includes creating comprehensive documentation, providing easy access to relevant data through preferred interfaces, and ensuring the data is in a format that minimizes the need for further processing by consumers.
By fostering trust in the data product and showcasing its value, data product owners can drive adoption and utilization across the organization. This, in turn, leads to better decision-making, improved operational efficiency, and ultimately, increased business value derived from data assets.
Data product principles
To achieve widespread adoption and deliver value at an organizational scale, a data product must adhere to several fundamental principles:
- Discoverability: A data product must be easily discoverable by potential data consumers. This involves publishing its metadata, including descriptions, schemas, and data quality metrics, in a centralized data catalog. Consumers should be able to search for and access relevant data products through this catalog, facilitating data discovery and collaboration.
- Addressability: Each data product should have a unique address within the organization, allowing its output ports to be accessed by other data products in a standardized and easily consumable format. This promotes interoperability and seamless integration of data products across different domains.
- Trustworthiness: Data quality is paramount for any data product. Continuous monitoring and assessment of data quality are essential, and a reliable, up-to-date data quality score should be published alongside the data assets. This instills confidence in data consumers, ensuring they can trust the accuracy and reliability of the data.
- Self-Describing Semantics: Data products should be described using self-explanatory semantics, making them easily understandable by independent consumers who discover them. Clear and concise documentation, including data dictionaries and usage guidelines, should be readily available to facilitate seamless integration and interpretation of the data.
- Security and Compliance: Data products must adhere to global standards for security, access management, privacy, and compliance. This involves implementing robust security measures to protect data from unauthorized access, ensuring compliance with relevant regulations, and maintaining transparent data governance practices.
By adhering to these principles, data products become valuable assets that drive innovation, empower decision-making, and contribute to the overall success of the organization.
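Addressability and discoverability become concrete once every output port has a stable, unique address registered in a searchable catalog. The URN convention and the in-memory registry below are assumptions used only to illustrate the idea, not an actual catalog API.

```python
# A simple, stable addressing convention: domain, data product, output port.
def port_urn(domain: str, product: str, port: str) -> str:
    return f"urn:dataproduct:{domain}:{product}:{port}"

# A stand-in for a centralized data catalog keyed by those addresses.
catalog: dict[str, dict] = {}

def register(urn: str, metadata: dict) -> None:
    catalog[urn] = metadata

register(
    port_urn("sales", "orders_curated", "revenue_per_customer"),
    {"description": "Monthly revenue per customer", "quality_kpi": 98.7},
)

# Any consumer can now resolve the output port by its unique address.
print(catalog["urn:dataproduct:sales:orders_curated:revenue_per_customer"])
```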
Self-service provisioning
Manual configuration of infrastructure components for each new data product is not a scalable approach. A successful data product implementation requires a self-service platform that enables automated creation and setup of all necessary components. This process should be fully automated, generating configurations for data storage, orchestration, security, access rights, data quality rules, data catalog registration, and integration with monitoring, logging, and alerting systems.
In the absence of a fully-fledged self-service provisioning platform, organizations can still embrace the concept of data contracts by publishing discoverable templates. These templates would leverage automation scripts, such as those based on Terraform, to provision the required infrastructure when connected to a CI/CD server. This approach provides a stepping stone towards full automation and ensures consistent, compliant deployment of data products.
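Even without a full self-service platform, a short automation script run from a CI/CD pipeline can feed a data product descriptor into a shared Terraform template. The sketch below assumes such a template already exists in the working directory; the descriptor fields are illustrative.

```python
import json
import subprocess

# Descriptor filled in by the team requesting a new data product.
descriptor = {
    "product_name": "orders_curated",
    "owner": "sales-analytics-team",
    "storage_tier": "silver",
}

# Terraform loads *.auto.tfvars.json files automatically, so the descriptor
# can be passed straight to a shared provisioning template.
with open("product.auto.tfvars.json", "w") as f:
    json.dump(descriptor, f, indent=2)

# A CI/CD job would then run the standard Terraform workflow.
subprocess.run(["terraform", "init"], check=True)
subprocess.run(["terraform", "apply", "-auto-approve"], check=True)
```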
Ensuring trust in data products
The trustworthiness of a data product is fundamentally linked to its data quality. To foster trust among data consumers, a data product must ensure a high level of data quality and document this level transparently. When data consumers search for data assets to fulfill their tasks, they should be able to discover the relevant asset along with comprehensive documentation. This documentation should outline the purpose of the data, how it was collected, its structure, access methods, publishing format, and, crucially, its data quality status.
To achieve this, a data quality platform integrated into the shared infrastructure continuously monitors the data assets within the data product. This platform evaluates all data quality checks defined within the data product’s configuration code. Assessing data quality involves two key perspectives:
- Current Data Quality Status: This perspective assesses the current state of the data asset. It determines whether the data stored within the data product is free from severe data quality issues at the present moment. A data asset is considered to have acceptable data quality if it passes all or most of the defined checks.
- Overall Trustworthiness and Stability: This perspective measures the long-term reliability and stability of the data asset. It is evaluated by calculating a data quality KPI score, which represents the percentage of passed data quality checks over a recent period, such as the current or last month. A high KPI score indicates a consistently trustworthy data asset, while a lower score may suggest potential issues that require attention.
By continuously monitoring and publishing data quality metrics, data product owners can establish transparency and build trust among data consumers. This, in turn, encourages wider adoption and utilization of the data product, ultimately driving more value for the organization.
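The data quality KPI described above is simply the share of passed checks over a recent window. Here is a minimal sketch of that calculation, using made-up check results rather than output from any particular quality platform.

```python
from datetime import date

# Each entry is one executed data quality check: when it ran and whether it passed.
check_results = [
    {"executed_on": date(2024, 5, 1), "passed": True},
    {"executed_on": date(2024, 5, 2), "passed": True},
    {"executed_on": date(2024, 5, 3), "passed": False},
    {"executed_on": date(2024, 5, 4), "passed": True},
]

def quality_kpi(results: list[dict], since: date) -> float:
    """Percentage of data quality checks passed since the given date."""
    recent = [r for r in results if r["executed_on"] >= since]
    if not recent:
        return 0.0
    return 100.0 * sum(r["passed"] for r in recent) / len(recent)

# 3 of 4 checks passed in the current month, giving a KPI score of 75%.
print(quality_kpi(check_results, date(2024, 5, 1)))
```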
What is the DQOps Data Quality Operations Center
DQOps is a data quality platform designed to monitor data and assess the data quality trust score with data quality KPIs. DQOps provides extensive support for configuring data quality checks, clustering issues into incidents, and managing the data quality incident workflow. DQOps is designed in a way that makes it a natural fit for monitoring data quality in data products: its dual-interface model stores the data quality configuration in YAML files, which are well suited to automated provisioning. Once the data quality configuration files are published, DQOps automatically discovers them and provides an alternative, no-code user interface for data consumers, data stewards, and data operations teams.
The extensive API provided by DQOps allows full automation of every aspect of the platform, from data discovery, data profiling, and data quality testing to data observability, data quality incident management, and data quality reporting with 50+ data quality dashboards.
You can set up DQOps locally or in your on-premises environment to see how it monitors data sources and ensures data quality within data products. Follow the DQOps documentation and the getting started guide to learn how to set up DQOps locally and try it yourself.
You may also be interested in our free eBook, “A step-by-step guide to improve data quality.” The eBook documents our proven process for managing data quality issues and ensuring a high level of data quality over time.