It’s true that table schemas are not static; they evolve over time. This evolution is often driven by changing business requirements or the need to accommodate new information not initially anticipated in the original design. However, it’s important to remember that schema changes can impact any data format, not just relational tables.
For instance, schema changes frequently occur in flat files and semi-structured formats such as CSV and JSON. In these cases, the structure relies on an implicit agreement between the data publisher (the exporter) and the data consumer (typically a data team responsible for loading the data). Consequently, seemingly minor alterations, such as removing or reordering columns or modifying data types, can break this agreement and lead to incompatibilities that prevent data ingestion into a data platform.
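To make this concrete, here is a minimal sketch of what an explicit version of that agreement can look like on the consumer side. The field names and types are hypothetical, and in practice a schema registry or a JSON Schema validator would typically replace these hand-rolled checks.

```python
# A minimal sketch of making the publisher/consumer agreement explicit.
# The expected field names and types are illustrative assumptions.

EXPECTED_SCHEMA = {
    "customer_id": int,
    "email": str,
    "signup_date": str,   # ISO-8601 string in this hypothetical feed
}

def validate_record(record: dict) -> list[str]:
    """Return a list of contract violations for a single JSON record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"type mismatch on {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    for field in record:
        if field not in EXPECTED_SCHEMA:
            problems.append(f"unexpected field: {field}")
    return problems

# Example: the publisher renamed 'email' to 'email_address' and added 'office_phone'.
record = {"customer_id": 42, "email_address": "a@b.com",
          "signup_date": "2024-01-15", "office_phone": "555-0100"}
print(validate_record(record))
# ['missing field: email', 'unexpected field: email_address', 'unexpected field: office_phone']
```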
What is a Schema Change
Schema changes, also called schema drifts, occur when the structure of a database table or data file is modified. These modifications can be categorized in various ways, each with a different potential impact on data quality. In some cases, schema changes can even lead to data loss. Here are some common types:
- Column order changed: Altering the order of columns in a table might seem trivial, but it can break SQL queries that rely on a specific column order (e.g., those using SELECT *). It can also cause problems when loading CSV files that lack a header row, as the data will be silently mapped to the wrong columns (illustrated in the sketch after this list).
- Column size changes: If the size of a text column is decreased, existing data might be truncated during transformation, leading to data loss. Increasing the size is generally less problematic but can impact storage and query performance.
- Column type changed: Changing the data type of a column (e.g., from an integer to a date) can lead to data conversion errors if existing values cannot be converted to the new type. Careful consideration is needed to ensure compatibility.
- Column nullability changes: Modifying whether a column allows null (missing) values has implications for data integrity. If a column becomes nullable, the system must be able to handle missing data. Conversely, if a column becomes non-nullable, it can cause errors if the data source contains null values.
- New column added: Adding a new column requires adjustments to data pipelines, including ETL (Extract, Transform, Load) processes, to incorporate and potentially process the new data. Until then, the new data might be ignored.
- Column removed: Removing a column can cause transformations and other data operations to fail if they attempt to access the now non-existent column.
- Table removed: This is the most disruptive type of schema change, as it invalidates all code and processes that depend on the table, leading to widespread failures and potential data loss.
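The column-order case is easy to reproduce. The sketch below, with made-up column names and data, loads a header-less CSV export against a fixed column list; note that nothing fails outright, the values are simply assigned to the wrong columns.

```python
import csv
import io

# A minimal sketch of the column-order hazard described above. The file has no
# header row, so the loader assumes a fixed column order; the column names are
# illustrative.
ASSUMED_ORDER = ["customer_id", "email", "country"]

# The publisher silently swapped 'email' and 'country' in a new export.
reordered_export = "42,US,alice@example.com\n43,DE,bob@example.com\n"

for row in csv.reader(io.StringIO(reordered_export)):
    record = dict(zip(ASSUMED_ORDER, row))
    # No error is raised -- the values are simply mapped to the wrong columns.
    print(record)
# {'customer_id': '42', 'email': 'US', 'country': 'alice@example.com'}
# {'customer_id': '43', 'email': 'DE', 'country': 'bob@example.com'}
```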
Why Schema Drifts Happen
Schema drifts are inevitable in data management. This is largely because data platforms must constantly adapt to evolving business requirements. For example, if a business process changes, requiring the collection of additional customer information like an office phone number, the application storing this data must be updated. This triggers a chain reaction: data pipelines need modification to ingest the new data, and schemas in data lakes and data warehouses must also change. This creates a ripple effect that impacts the entire data lineage.
But it’s not just internal changes that cause schema drifts. Organizations often rely on third-party software, like SaaS solutions, which undergo their own evolution. Vendors regularly release updates with new features, often accompanied by schema changes. While these changes are usually documented in release notes, this information may not always reach the data engineering or data analytics teams. This communication gap can lead to unexpected schema drifts and integration challenges.
How Schema Changes Affect Data Reliability
Unexpected schema changes in data sources can significantly impact data teams responsible for data ingestion and processing. Even seemingly minor changes require careful review to ensure they don’t lead to data quality issues, such as data loss or data corruption.
For instance, a simple column rename in a source table can cause data loss if the data ingestion pipeline is not updated to reflect this change. Similarly, modifications to data types can create incompatibilities, making it impossible to load new data into existing tables. Such issues can even cause data pipelines to crash, disrupting the ingestion of data from other sources. Therefore, continuous monitoring and proactive review of schema changes are crucial for maintaining data reliability and preventing disruptions to data pipelines.
Furthermore, the repercussions of schema changes extend beyond data ingestion and transformation. When tables used by data analysts and data scientists undergo schema modifications without prior notification, it can disrupt their workflows. Dashboards may fail to render correctly, and machine learning training pipelines can encounter errors, hindering analysis and model development.
How to Monitor Schema Changes
Effectively monitoring schema changes requires proactive measures. Data observability tools are invaluable in this regard. These tools maintain continuous connections to data sources, enabling them to read metadata, including the list of tables, columns, and their data types. By comparing the current schema with the last known schema, these tools can automatically detect any discrepancies.
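Conceptually, that comparison can be as simple as diffing two snapshots of column metadata. The sketch below illustrates the idea rather than how any particular tool implements it; in a real setup the current snapshot would be read from the database's metadata (for example, information_schema.columns), while here both snapshots are hard-coded.

```python
# A minimal sketch of the compare-against-last-known-schema approach.
# Both snapshots map column name -> data type and are illustrative.

def diff_schemas(last_known: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return human-readable schema drift findings for one table."""
    findings = []
    for column, dtype in current.items():
        if column not in last_known:
            findings.append(f"new column added: {column} ({dtype})")
        elif last_known[column] != dtype:
            findings.append(f"column type changed: {column} {last_known[column]} -> {dtype}")
    for column in last_known:
        if column not in current:
            findings.append(f"column removed: {column}")
    return findings

last_known = {"customer_id": "INTEGER", "email": "VARCHAR(100)", "created_at": "DATE"}
current    = {"customer_id": "INTEGER", "email": "VARCHAR(50)",  "office_phone": "VARCHAR(20)"}

for finding in diff_schemas(last_known, current):
    print(finding)
# column type changed: email VARCHAR(100) -> VARCHAR(50)
# new column added: office_phone (VARCHAR(20))
# column removed: created_at
```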
Upon detecting a schema change, data observability tools typically trigger a data quality incident and initiate a notification workflow. This ensures that the data owner or data engineering team is promptly informed and can assess the severity of the change. Moreover, many data observability tools offer data lineage tracking capabilities. This means they understand the relationships between source and target tables across the entire data transformation pipeline. Consequently, these platforms can identify the potential impact radius of a schema change, helping teams prioritize and address the most critical issues.
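The impact-radius idea can also be illustrated with a few lines of code. The lineage graph below is an invented example, and a real data observability platform builds this map automatically from query logs or pipeline metadata, but the traversal logic is essentially the same.

```python
from collections import deque

# A minimal sketch of estimating the impact radius of a schema change using
# lineage metadata. The lineage graph (source table -> downstream tables) is
# an illustrative assumption, not output from any real tool.
LINEAGE = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["dwh.dim_customer", "dwh.fact_orders"],
    "dwh.dim_customer": ["reporting.customer_dashboard"],
    "dwh.fact_orders": ["reporting.sales_dashboard", "ml.churn_features"],
}

def impact_radius(changed_table: str) -> set[str]:
    """Breadth-first traversal of all tables downstream of the changed table."""
    affected, queue = set(), deque([changed_table])
    while queue:
        table = queue.popleft()
        for downstream in LINEAGE.get(table, []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected

print(sorted(impact_radius("raw.customers")))
# ['dwh.dim_customer', 'dwh.fact_orders', 'ml.churn_features',
#  'reporting.customer_dashboard', 'reporting.sales_dashboard', 'staging.customers']
```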
Best Practices for Monitoring Schema Changes
Managing schema changes effectively requires a proactive approach. Here are some key best practices:
- Utilize data observability tools: Employ data observability tools to continuously monitor schema drifts across all source and target tables within your data environment.
- Establish clear ownership: Define roles and responsibilities to ensure a clear notification and escalation path for schema change alerts. This helps ensure timely responses to critical changes.
- Investigate unexpected changes: Always investigate unexpected schema changes, as they may be indicative of underlying data issues or errors that require immediate attention.
What is the DQOps Data Quality Operations Center
DQOps is a data quality and observability platform designed to monitor data sources and detect possible data quality issues. DQOps provides extensive support for monitoring schema drifts, configuring data quality checks, applying configuration through data quality policies, detecting anomalies, and managing the data quality incident workflow.
DQOps combines the functionality of a data quality platform, used to assess the quality of data assets, with a complete data observability platform that monitors data and measures data quality metrics at the table level, expressing the health of each table with data quality KPIs.
You can set up DQOps locally or in your on-premises environment to see how it monitors data sources and ensures data quality within a data platform. The DQOps getting started guide in the documentation walks you through the local setup so you can try it yourself.
You may also be interested in our free eBook, “A step-by-step guide to improve data quality.” The eBook documents our proven process for managing data quality issues and ensuring a high level of data quality over time. This is a great resource to learn about data quality.