Managing Schema Evolution in Spark SQL and dbt



When working with data, change is inevitable. Event producers adjust formats, add attributes, or modify data types as systems evolve. While this flexibility drives progress, it can quickly introduce challenges for analytical pipelines that rely on schema consistency. In these situations, manual schema evolution can provide a more transparent and reliable way to keep historical and newly ingested data aligned.

Automatic schema evolution and its limits

Modern table formats support automatic schema evolution, so teams can add or rename columns without rewriting entire datasets. At the metadata level, this is extremely powerful. However, automatic schema evolution primarily guarantees structural compatibility. It does not automatically ensure a consistent meaning across historical and newly ingested data.

In real production systems, producers rarely introduce breaking changes in a single step. Instead, they often migrate gradually. A data type may change while legacy identifiers are still supported. New fields may be introduced alongside old ones. Multiple versions may coexist temporarily to preserve backward compatibility for downstream consumers.

In such scenarios, automatic schema evolution keeps the table readable, but it does not define how teams should interpret historical and new records together. When nested structures and type transitions are involved, manual schema alignment becomes a safer and more transparent approach.

This post examines why automatic schema evolution can be insufficient in practice and why manual schema alignment is sometimes necessary. We then show how to handle these changes explicitly in Spark SQL and manage them reproducibly in dbt using Jinja templates.

Why manual schema evolution is sometimes necessary

Automatic schema evolution works well with simple, flat tables. Even though Delta and Iceberg support nested schema evolution, some use cases remain hard to solve automatically.

If a new column is added at the top level, the table metadata can be updated safely, and historical rows simply contain NULL for the new column. Structural compatibility is preserved.

Nested structures introduce additional complexity. Changes may occur inside arrays or deeply nested structs. Type transitions may affect fields that are part of a larger composite structure. In such cases, schema evolution preserves the structural shape of the table, but it does not guarantee a consistent interpretation of historical and new records.

Consider the dataset from the previous post, user_movie_activity. In its original structure, the movies_watched array contains the following fields:

movie_id (INTEGER) 

title (STRING) 

duration_seconds (LONG) 

completed (BOOLEAN) 

After an update to the event producer, two changes appear in the incoming data:

  1. The movie_id field changes from INTEGER to STRING because the producer begins using globally unique identifiers instead of numeric codes.
  2. A new attribute rating (FLOAT) is added to capture user feedback.
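For orientation, the two shapes can be sketched as Spark SQL DDL. The table and nested field names follow the post; the top-level column types are assumptions for illustration:

```sql
-- v1: original shape of the nested array
CREATE TABLE user_movie_activity_v1 (
  user_id        STRING,          -- top-level column types assumed here
  session_id     STRING,
  movies_watched ARRAY<STRUCT<
    movie_id:         INT,
    title:            STRING,
    duration_seconds: BIGINT,
    completed:        BOOLEAN
  >>
);

-- v2: movie_id becomes STRING, rating is added
CREATE TABLE user_movie_activity_v2 (
  user_id        STRING,
  session_id     STRING,
  movies_watched ARRAY<STRUCT<
    movie_id:         STRING,
    title:            STRING,
    duration_seconds: BIGINT,
    completed:        BOOLEAN,
    rating:           FLOAT
  >>
);
```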

A gradual migration scenario

In practice, producers may introduce such changes gradually. For example, they may temporarily publish both the old and new identifier formats, or introduce overlapping fields to ease downstream transitions. Even if the final state is structurally compatible, intermediate states may contain multiple representations of the same concept.

With automatic schema evolution enabled, the table schema may merge without errors. However, inconsistencies can arise:

  • Historical records still contain INTEGER identifiers.
  • New records contain STRING identifiers.
  • Downstream joins may behave differently depending on implicit casts.
  • Nested arrays may contain structurally valid but semantically inconsistent records.

The table remains readable, but the data no longer follows a single, clearly defined contract.
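The join pitfall can be made concrete with a small sketch. The table names below (exploded_activity, movie_dim) are illustrative; the point is that an explicit cast pins the comparison to one declared type instead of relying on engine-specific implicit coercion:

```sql
-- Fragile: old rows carry INTEGER keys, the dimension carries STRING keys,
-- so the comparison depends on the engine's implicit coercion rules
SELECT a.user_id, d.genre
FROM exploded_activity a
JOIN movie_dim d
  ON a.movie_id = d.movie_id;

-- Safer: state the intended key type explicitly on the join condition
SELECT a.user_id, d.genre
FROM exploded_activity a
JOIN movie_dim d
  ON CAST(a.movie_id AS STRING) = d.movie_id;
```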

At this point, manual schema alignment effectively defines that contract. All historical and newly ingested data must satisfy a unified schema before being combined. When consistency across historical data is critical, explicitly evolving the schema and documenting each transformation step improves reliability and transparency.

Manual schema alignment with Spark SQL

The goal of manual schema evolution is to align old and new data before combining them. This ensures that all records, past and present, can be queried under a single, unified schema.

Assume the older data is in user_movie_activity_v1, and the updated data is in user_movie_activity_v2. The schemas can be aligned manually in Spark SQL as follows:

WITH movies_v1 AS ( 

  SELECT 
    user_id, 
    session_id, 
    EXPLODE(movies_watched) AS mw 
  FROM user_movie_activity_v1 

), 

movies_v1_aligned AS ( 

  SELECT 
    user_id, 
    session_id, 
    CAST(mw.movie_id AS STRING) AS movie_id, 
    mw.title, 
    mw.duration_seconds, 
    mw.completed, 
    NULL AS rating 
  FROM movies_v1 

), 

movies_v2 AS ( 

  SELECT 
    user_id, 
    session_id, 
    EXPLODE(movies_watched) AS mw 
  FROM user_movie_activity_v2 

), 

movies_v2_aligned AS ( 

  SELECT 
    user_id, 
    session_id, 
    mw.movie_id, 
    mw.title, 
    mw.duration_seconds, 
    mw.completed, 
    mw.rating 
  FROM movies_v2 

) 

SELECT * FROM movies_v1_aligned 
UNION ALL 
SELECT * FROM movies_v2_aligned; 

This approach preserves historical data while adding new attributes and aligning existing fields to a common data type. Explicit casting and placeholder columns, such as NULL AS rating, ensure that every record conforms to the same schema.

Using UNION ALL retains all records without aggregation or data loss. The result is not merely a structurally valid table, but a semantically aligned dataset with a clearly defined contract.

In practice, this pattern is often combined with partition pruning or incremental models to keep performance predictable as data volume grows.

Managing schema evolution in dbt

As data models evolve, maintaining manual schema alignment directly in SQL can become repetitive and error-prone. dbt, together with Jinja templates, provides a structured and scalable way to manage these changes.

In dbt, parameterised models can adapt automatically to schema differences across multiple data shapes. The same logic expressed with Jinja templating could look like this:

{%- set versions = ['v1', 'v2'] -%} 

WITH 
{%- for version in versions %} 

movies_{{ version }} AS ( 

  SELECT 
    user_id, 
    session_id, 
    EXPLODE(movies_watched) AS mw 
  FROM {{ ref('user_movie_activity_' ~ version) }} 

), 

movies_{{ version }}_aligned AS ( 

  SELECT 
    user_id, 
    session_id, 
    CAST(mw.movie_id AS STRING) AS movie_id, 
    mw.title, 
    mw.duration_seconds, 
    mw.completed, 
    {%- if version == 'v1' %}
      NULL AS rating
    {%- else %}
      mw.rating
    {%- endif %}
  FROM movies_{{ version }} 

) 

{%- if not loop.last %},{% endif -%}
 
{%- endfor %} 

{%- for version in versions %} 

SELECT * FROM movies_{{ version }}_aligned 
{%- if not loop.last %} 
UNION ALL 
{%- endif %} 

{%- endfor %} 

This Jinja pattern dynamically generates SQL for each schema variant and applies the appropriate transformation logic. When a new version appears, extending the versions list and defining the required adjustments is sufficient.
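For instance, if a hypothetical v3 started publishing duration_ms instead of duration_seconds, only the versions list and one branch inside the aligned CTE would change (v3 and duration_ms are invented here for illustration):

```sql
{%- set versions = ['v1', 'v2', 'v3'] -%}

-- inside the *_aligned CTE, branch only on the affected column:
    {%- if version == 'v3' %}
    CAST(mw.duration_ms / 1000 AS BIGINT) AS duration_seconds,
    {%- else %}
    mw.duration_seconds,
    {%- endif %}
```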

This keeps models readable, reproducible, and easy to evolve over time, while preserving explicit control over how schema changes are interpreted.
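The pattern also composes with the incremental models mentioned earlier. A minimal sketch, assuming a model named user_movie_activity_aligned that holds the unioned output and an event_date partition column (both invented here):

```sql
{{ config(materialized='incremental') }}

SELECT user_id, session_id, movie_id, title, duration_seconds, completed, rating
FROM {{ ref('user_movie_activity_aligned') }}  -- hypothetical upstream model
{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what the target
  -- already contains; event_date is an assumed partition column
  WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
```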

Takeaways

  • Automatic schema evolution works well for flat data and structural metadata changes.
  • Nested structures and type transitions introduce challenges to semantic alignment.
  • Manual schema alignment defines an explicit contract across historical and new data.
  • Using UNION ALL allows multiple versions to coexist under a unified schema.
  • dbt and Jinja make schema alignment scalable without sacrificing readability.
  • Explicit schema management results in predictable, traceable, and easier-to-debug data pipelines.

What’s next

So far, we have focused on aligning schemas and retaining historical data as structures evolve. The next step is ensuring that these contracts remain valid over time.

In the next post, we extend this idea of explicit schema contracts beyond structural alignment. We examine how to validate evolving schemas, test key assumptions, and enforce visibility guarantees in practice using dbt and Spark SQL. This includes catching unexpected changes early, safeguarding downstream consumers, and managing access constraints such as masking sensitive fields while preserving existing data structures.