Managing Schema Evolution in Spark SQL and dbt



When working with data, change is inevitable. Event producers adjust formats, add attributes, or modify data types as systems evolve. While this flexibility drives progress, it can quickly introduce challenges for analytical pipelines that rely on schema consistency. In these situations, manual schema evolution can provide a more transparent and reliable way to keep historical and newly ingested data aligned.

Automatic schema evolution and its limits

Modern table formats support automatic schema evolution, so teams can add or rename columns without rewriting entire datasets. At the metadata level, this is extremely powerful. However, automatic schema evolution primarily guarantees structural compatibility. It does not automatically ensure a consistent meaning across historical and newly ingested data.

In real production systems, producers rarely introduce breaking changes in a single step. Instead, they often migrate gradually. A data type may change while legacy identifiers are still supported. New fields may be introduced alongside old ones. Multiple versions may coexist temporarily to preserve backward compatibility for downstream consumers.

In such scenarios, automatic schema evolution keeps the table readable, but it does not define how teams should interpret historical and new records together. When nested structures and type transitions are involved, manual schema alignment becomes a safer and more transparent approach.

This post examines why automatic schema evolution can be insufficient in practice and why manual schema alignment is sometimes necessary. We then show how to handle these changes explicitly in Spark SQL and manage them reproducibly in dbt using Jinja templates.

Why manual schema evolution is sometimes necessary

Automatic schema evolution works well with simple, flat tables. Even though Delta and Iceberg support nested schema evolution, some use cases remain hard to solve automatically.

If a new column is added at the top level, the table metadata can be updated safely, and historical rows simply contain NULL for the new column. Structural compatibility is preserved.

Nested structures introduce additional complexity. Changes may occur inside arrays or deeply nested structs. Type transitions may affect fields that are part of a larger composite structure. In such cases, schema evolution preserves the structural shape of the table, but it does not guarantee a consistent interpretation of historical and new records.

Consider the dataset from the previous post, user_movie_activity. In its original structure, the movies_watched array contains the following fields:

  • movie_id (INTEGER)
  • title (STRING)
  • duration_seconds (LONG)
  • completed (BOOLEAN)

After an update to the event producer, two changes appear in the incoming data:

  1. The movie_id field changes from INTEGER to STRING because the producer begins using globally unique identifiers instead of numeric codes.
  2. A new attribute rating (FLOAT) is added to capture user feedback.
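To make the two changes concrete, here is a small Python sketch contrasting one element of the movies_watched array before and after the producer update. The field names follow the post; the values are invented for illustration.

```python
# Illustrative sample records; values are made up, field names follow the post.
movie_v1 = {
    "movie_id": 42,             # INTEGER identifier (numeric code)
    "title": "Inception",
    "duration_seconds": 8880,
    "completed": True,
    # no "rating" field exists yet in v1
}

movie_v2 = {
    "movie_id": "mv-9f8e7d6c",  # STRING identifier (globally unique)
    "title": "Inception",
    "duration_seconds": 8880,
    "completed": True,
    "rating": 4.5,              # new FLOAT attribute for user feedback
}

# The same conceptual field now carries two incompatible types:
print(type(movie_v1["movie_id"]), type(movie_v2["movie_id"]))
```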

A gradual migration scenario

In practice, producers may introduce such changes gradually. For example, they may temporarily publish both the old and new identifier formats, or introduce overlapping fields to ease downstream transitions. Even if the final state is structurally compatible, intermediate states may contain multiple representations of the same concept.

With automatic schema evolution enabled, the table schema may merge without errors. However, inconsistencies can arise:

  • Historical records still contain INTEGER identifiers.
  • New records contain STRING identifiers.
  • Downstream joins may behave differently depending on implicit casts.
  • Nested arrays may contain structurally valid but semantically inconsistent records.

The table remains readable, but the data no longer follows a single, clearly defined contract.
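The join hazard can be illustrated outside of Spark with a few lines of Python: without an explicit cast, historical integer identifiers never match their string counterparts, even when both denote the same movie. The identifier values below are hypothetical.

```python
# Hypothetical identifiers for the same movies before and after the migration.
historical_ids = [42, 43, 44]       # INTEGER identifiers from v1 records
incoming_ids = ["42", "45", "46"]   # STRING identifiers from v2 records

# A naive "join" on raw values finds no overlap, although 42 and "42"
# refer to the same movie.
naive_matches = set(historical_ids) & set(incoming_ids)

# Aligning both sides to a common type (here: string) restores the match.
aligned_matches = {str(i) for i in historical_ids} & set(incoming_ids)

print(naive_matches)    # empty set: no rows would join
print(aligned_matches)  # {'42'}
```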

At this point, manual schema alignment effectively defines that contract. All historical and newly ingested data must satisfy a unified schema before being combined. When consistency across historical data is critical, explicitly evolving the schema and documenting each transformation step improves reliability and transparency.

Manual schema alignment with Spark SQL

The goal of manual schema evolution is to align old and new data before combining them. This ensures that all records, past and present, can be queried under a single, unified schema.

Assume the older data is in user_movie_activity_v1, and the updated data is in user_movie_activity_v2. The schemas can be aligned manually in Spark SQL as follows:

WITH movies_v1 AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity_v1
),

movies_v1_aligned AS (
  SELECT
    user_id,
    session_id,
    CAST(mw.movie_id AS STRING) AS movie_id,
    mw.title,
    mw.duration_seconds,
    mw.completed,
    CAST(NULL AS FLOAT) AS rating
  FROM movies_v1
),

movies_v2 AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity_v2
),

movies_v2_aligned AS (
  SELECT
    user_id,
    session_id,
    mw.movie_id,
    mw.title,
    mw.duration_seconds,
    mw.completed,
    mw.rating
  FROM movies_v2
)

SELECT * FROM movies_v1_aligned
UNION ALL
SELECT * FROM movies_v2_aligned;

This approach preserves historical data while adding new attributes and aligning existing fields to a common data type. Explicit casting and placeholder columns, such as the typed NULL rating for v1 records, ensure that every record conforms to the same schema.

Using UNION ALL retains all records without aggregation or data loss. The result is not merely a structurally valid table, but a semantically aligned dataset with a clearly defined contract.

In practice, this pattern is often combined with partition pruning or incremental models to keep performance predictable as data volume grows.
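The alignment logic itself can also be prototyped and unit-tested outside of Spark. The Python sketch below mirrors the two _aligned CTEs on in-memory records, assuming (as in the post) that v1 records carry integer identifiers and lack rating; it is an illustration of the contract, not the production implementation.

```python
def align_movie(record: dict, version: str) -> dict:
    """Normalise one exploded movies_watched element to the unified schema."""
    return {
        # INTEGER -> STRING for v1 records; a no-op for v2 records.
        "movie_id": str(record["movie_id"]),
        "title": record["title"],
        "duration_seconds": record["duration_seconds"],
        "completed": record["completed"],
        # Placeholder for v1, mirroring the NULL rating column in the SQL.
        "rating": record.get("rating") if version == "v2" else None,
    }

# Hypothetical sample rows for both versions.
v1_rows = [{"movie_id": 42, "title": "Inception",
            "duration_seconds": 8880, "completed": True}]
v2_rows = [{"movie_id": "mv-1", "title": "Dune",
            "duration_seconds": 9300, "completed": False, "rating": 4.0}]

# UNION ALL equivalent: concatenate the aligned record sets.
unified = ([align_movie(r, "v1") for r in v1_rows]
           + [align_movie(r, "v2") for r in v2_rows])

# Every record now satisfies the same contract.
assert all(isinstance(r["movie_id"], str) for r in unified)
```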

Managing schema evolution in dbt

As data models evolve, maintaining manual schema alignment directly in SQL can become repetitive and error-prone. dbt, together with Jinja templates, provides a structured and scalable way to manage these changes.

In dbt, parameterised models can adapt automatically to schema differences across multiple data shapes. The same logic expressed with Jinja templating could look like this:

{%- set versions = ['v1', 'v2'] -%}

WITH
{%- for version in versions %}

movies_{{ version }} AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM {{ ref('user_movie_activity_' ~ version) }}
),

movies_{{ version }}_aligned AS (
  SELECT
    user_id,
    session_id,
    CAST(mw.movie_id AS STRING) AS movie_id,
    mw.title,
    mw.duration_seconds,
    mw.completed,
    {% if version == 'v1' %}CAST(NULL AS FLOAT) AS rating{% else %}mw.rating{% endif %}
  FROM movies_{{ version }}
){%- if not loop.last %},{% endif %}
{%- endfor %}

{% for version in versions %}
SELECT * FROM movies_{{ version }}_aligned
{%- if not loop.last %}
UNION ALL
{%- endif %}
{%- endfor %}

This Jinja pattern dynamically generates SQL for each schema variant and applies the appropriate transformation logic. When a new version appears, extending the versions list and defining the required adjustments is sufficient.

This keeps models readable, reproducible, and easy to evolve over time, while preserving explicit control over how schema changes are interpreted.
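Outside dbt, the same generation pattern can be sketched with plain Python string formatting. This hypothetical helper shows the shape of SQL that a per-version loop produces, and why extending the version list is all that a new variant requires; it is a simplified illustration, not the rendered dbt model.

```python
VERSIONS = ["v1", "v2"]

def aligned_select(version: str) -> str:
    """Build the aligned SELECT for one schema version (sketch, not dbt)."""
    # v1 lacks the rating field, so it gets a NULL placeholder column.
    rating = "NULL AS rating" if version == "v1" else "mw.rating"
    return (
        f"SELECT user_id, session_id, "
        f"CAST(mw.movie_id AS STRING) AS movie_id, "
        f"mw.title, mw.duration_seconds, mw.completed, {rating} "
        f"FROM movies_{version}"
    )

# Adding a new version only means extending VERSIONS,
# exactly as with the Jinja versions list.
query = "\nUNION ALL\n".join(aligned_select(v) for v in VERSIONS)
print(query)
```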

Takeaways

  • Automatic schema evolution works well for flat data and structural metadata changes.
  • Nested structures and type transitions introduce challenges to semantic alignment.
  • Manual schema alignment defines an explicit contract across historical and new data.
  • Using UNION ALL allows multiple versions to coexist under a unified schema.
  • dbt and Jinja make schema alignment scalable without sacrificing readability.
  • Explicit schema management results in predictable, traceable, and easier-to-debug data pipelines.

What’s next

So far, we have focused on aligning schemas and retaining historical data as structures evolve. The next step is ensuring that these contracts remain valid over time.

In the next post, we extend this idea of explicit schema contracts beyond structural alignment. We examine how to validate evolving schemas, test key assumptions, and enforce visibility guarantees in practice using dbt and Spark SQL. This includes catching unexpected changes early, safeguarding downstream consumers, and managing access constraints such as masking sensitive fields while preserving existing data structures.