Managing Schema Evolution in Spark SQL and dbt
When working with data, change is inevitable. Event producers adjust formats, add attributes, or modify data types as systems evolve. While this flexibility drives progress, it can quickly introduce challenges for analytical pipelines that rely on schema consistency. In these situations, manual schema evolution can provide a more transparent and reliable way to keep historical and newly ingested data aligned.
Automatic schema evolution and its limits
Modern table formats support automatic schema evolution, so teams can add or rename columns without rewriting entire datasets. At the metadata level, this is extremely powerful. However, automatic schema evolution primarily guarantees structural compatibility. It does not automatically ensure a consistent meaning across historical and newly ingested data.
In real production systems, producers rarely introduce breaking changes in a single step. Instead, they often migrate gradually. A data type may change while legacy identifiers are still supported. New fields may be introduced alongside old ones. Multiple versions may coexist temporarily to preserve backward compatibility for downstream consumers.
In such scenarios, automatic schema evolution keeps the table readable, but it does not define how teams should interpret historical and new records together. When nested structures and type transitions are involved, manual schema alignment becomes a safer and more transparent approach.
This post examines why automatic schema evolution can be insufficient in practice and why manual schema alignment is sometimes necessary. We then show how to handle these changes explicitly in Spark SQL and manage them reproducibly in dbt using Jinja templates.
Why manual schema evolution is sometimes necessary
Automatic schema evolution works well with simple, flat tables. Even though Delta and Iceberg support nested schema evolution, some use cases remain hard to solve automatically.
If a new column is added at the top level, the table metadata can be updated safely, and historical rows simply contain NULL for the new column. Structural compatibility is preserved.
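As a minimal sketch of this simple case, assuming a Delta Lake table (Iceberg uses the singular `ADD COLUMN`), a top-level addition is a pure metadata operation:

```sql
-- Adds a nullable top-level column without rewriting existing files.
-- Historical rows simply read back as NULL for the new column.
ALTER TABLE user_movie_activity ADD COLUMNS (rating FLOAT);
```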
Nested structures introduce additional complexity. Changes may occur inside arrays or deeply nested structs. Type transitions may affect fields that are part of a larger composite structure. In such cases, schema evolution preserves the structural shape of the table, but it does not guarantee a consistent interpretation of historical and new records.
Consider the dataset from the previous post, user_movie_activity. In its original structure, the movies_watched array contains the following fields:
- movie_id (INTEGER)
- title (STRING)
- duration_seconds (LONG)
- completed (BOOLEAN)

After an update to the event producer, two changes appear in the incoming data:

- The movie_id field changes from INTEGER to STRING because the producer begins using globally unique identifiers instead of numeric codes.
- A new attribute rating (FLOAT) is added to capture user feedback.
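To make the two shapes concrete, the versions described above could be declared as follows. This is illustrative DDL; the types of user_id and session_id are assumptions, since only the nested fields are specified in the example:

```sql
-- v1: numeric movie identifiers, no rating
CREATE TABLE user_movie_activity_v1 (
  user_id STRING,
  session_id STRING,
  movies_watched ARRAY<STRUCT<
    movie_id: INT,
    title: STRING,
    duration_seconds: BIGINT,
    completed: BOOLEAN
  >>
);

-- v2: string movie identifiers, rating added inside the struct
CREATE TABLE user_movie_activity_v2 (
  user_id STRING,
  session_id STRING,
  movies_watched ARRAY<STRUCT<
    movie_id: STRING,
    title: STRING,
    duration_seconds: BIGINT,
    completed: BOOLEAN,
    rating: FLOAT
  >>
);
```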
A gradual migration scenario
In practice, producers may introduce such changes gradually. For example, they may temporarily publish both the old and new identifier formats, or introduce overlapping fields to ease downstream transitions. Even if the final state is structurally compatible, intermediate states may contain multiple representations of the same concept.
With automatic schema evolution enabled, the table schema may merge without errors. However, inconsistencies can arise:
- Historical records still contain INTEGER identifiers.
- New records contain STRING identifiers.
- Downstream joins may behave differently depending on implicit casts.
- Nested arrays may contain structurally valid but semantically inconsistent records.
The table remains readable, but the data no longer follows a single, clearly defined contract.
At this point, manual schema alignment effectively defines that contract. All historical and newly ingested data must satisfy a unified schema before being combined. When consistency across historical data is critical, explicitly evolving the schema and documenting each transformation step improves reliability and transparency.
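One way to make that contract concrete is to write down the unified target schema explicitly, so every alignment step has a fixed destination. A sketch, assuming the aligned output is one row per watched movie as in the query that follows:

```sql
-- Unified contract after alignment:
-- movie_id is always STRING; rating is nullable for pre-rating records.
CREATE TABLE user_movie_activity_unified (
  user_id STRING,
  session_id STRING,
  movie_id STRING,
  title STRING,
  duration_seconds BIGINT,
  completed BOOLEAN,
  rating FLOAT
);
```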
Manual schema alignment with Spark SQL
The goal of manual schema evolution is to align old and new data before combining them. This ensures that all records, past and present, can be queried under a single, unified schema.
Assume the older data is in user_movie_activity_v1, and the updated data is in user_movie_activity_v2. The schemas can be aligned manually in Spark SQL as follows:
```sql
WITH movies_v1 AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity_v1
),
movies_v1_aligned AS (
  SELECT
    user_id,
    session_id,
    CAST(mw.movie_id AS STRING) AS movie_id,
    mw.title,
    mw.duration_seconds,
    mw.completed,
    NULL AS rating
  FROM movies_v1
),
movies_v2 AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity_v2
),
movies_v2_aligned AS (
  SELECT
    user_id,
    session_id,
    mw.movie_id,
    mw.title,
    mw.duration_seconds,
    mw.completed,
    mw.rating
  FROM movies_v2
)
SELECT * FROM movies_v1_aligned
UNION ALL
SELECT * FROM movies_v2_aligned;
```

This approach preserves historical data while adding new attributes and aligning existing fields to a common data type. Explicit casting and placeholder columns, such as NULL AS rating, ensure that every record conforms to the same schema.
Using UNION ALL retains all records without aggregation or data loss. The result is not merely a structurally valid table, but a semantically aligned dataset with a clearly defined contract.
In practice, this pattern is often combined with partition pruning or incremental models to keep performance predictable as data volume grows.
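As a sketch of that combination, the union can be restricted to recent partitions. This assumes a hypothetical `event_date` partition column that is carried through the aligned CTEs; it does not appear in the example above:

```sql
-- Only reprocess recent partitions; previously aligned data stays untouched.
SELECT * FROM movies_v1_aligned
WHERE event_date >= DATE_SUB(CURRENT_DATE(), 7)
UNION ALL
SELECT * FROM movies_v2_aligned
WHERE event_date >= DATE_SUB(CURRENT_DATE(), 7);
```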
Managing schema evolution in dbt
As data models evolve, maintaining manual schema alignment directly in SQL can become repetitive and error-prone. dbt, together with Jinja templates, provides a structured and scalable way to manage these changes.
In dbt, parameterised models can adapt automatically to schema differences across multiple data shapes. The same logic expressed with Jinja templating could look like this:
```sql
{%- set versions = ['v1', 'v2'] -%}

WITH
{%- for version in versions %}
movies_{{ version }} AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM {{ ref('user_movie_activity_' ~ version) }}
),
movies_{{ version }}_aligned AS (
  SELECT
    user_id,
    session_id,
    CAST(mw.movie_id AS STRING) AS movie_id,
    mw.title,
    mw.duration_seconds,
    mw.completed,
    {%- if version == 'v1' %}
    NULL AS rating
    {%- else %}
    mw.rating
    {%- endif %}
  FROM movies_{{ version }}
)
{%- if not loop.last %},{% endif %}
{%- endfor %}

{%- for version in versions %}
SELECT * FROM movies_{{ version }}_aligned
{%- if not loop.last %}
UNION ALL
{%- endif %}
{%- endfor %}
```

This Jinja pattern dynamically generates SQL for each schema variant and applies the appropriate transformation logic. (Note that the `if`/`else`/`endif` tags trim only leading whitespace; trimming trailing whitespace as well would fuse the rating expression with the FROM clause in the rendered SQL.) When a new version appears, extending the versions list and defining the required adjustments is sufficient.
This keeps models readable, reproducible, and easy to evolve over time, while preserving explicit control over how schema changes are interpreted.
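As the number of versions grows, the per-version conditionals can be lifted into a mapping so the loop body stays generic. A sketch under the same assumptions as above; the per-version expressions are illustrative:

```sql
{%- set versions = {
    'v1': {'movie_id': 'CAST(mw.movie_id AS STRING)', 'rating': 'NULL'},
    'v2': {'movie_id': 'mw.movie_id',                 'rating': 'mw.rating'}
} -%}

{%- for version, cols in versions.items() %}
SELECT
  user_id,
  session_id,
  {{ cols['movie_id'] }} AS movie_id,
  mw.title,
  mw.duration_seconds,
  mw.completed,
  {{ cols['rating'] }} AS rating
FROM (
  SELECT user_id, session_id, EXPLODE(movies_watched) AS mw
  FROM {{ ref('user_movie_activity_' ~ version) }}
)
{%- if not loop.last %}
UNION ALL
{%- endif %}
{%- endfor %}
```

With this layout, adding a v3 means adding one dictionary entry rather than editing conditional branches inside the query.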
Takeaways
- Automatic schema evolution works well for flat data and structural metadata changes.
- Nested structures and type transitions introduce challenges to semantic alignment.
- Manual schema alignment defines an explicit contract across historical and new data.
- Using UNION ALL allows multiple versions to coexist under a unified schema.
- dbt and Jinja make schema alignment scalable without sacrificing readability.
- Explicit schema management results in predictable, traceable, and easier-to-debug data pipelines.
What’s next
So far, we focused on aligning schemas and retaining historical data as structures evolve. The next step is ensuring that these contracts remain valid over time.
In the next post, we extend this idea of explicit schema contracts beyond structural alignment. We examine how to validate evolving schemas, test key assumptions, and enforce visibility guarantees in practice using dbt and Spark SQL. This includes catching unexpected changes early, safeguarding downstream consumers, and managing access constraints such as masking sensitive fields while preserving existing data structures.
