Schema Con­tracts in Prac­tice: Test­ing and Mas­king Nes­ted Data in Spark SQL and dbt 



Pre­vious posts in this series addres­sed two recur­ring chal­lenges when working with nes­ted data in Spark SQL. This post focu­ses on nes­ted data mas­king, show­ing how to pro­tect sen­si­tive fields inside STRUCTs while pre­ser­ving the over­all data shape. Ear­lier posts covered flat­tening STRUCT and ARRAY fields and expli­citly alig­ning old and new data using UNION ALL ins­tead of rely­ing solely on auto­ma­tic schema evolution.

Both topics focus pri­ma­rily on struc­tu­ral cor­rect­ness. Fields must exist, data types have to align, and his­to­ri­cal data must remain acces­si­ble over time. In prac­tice, schema con­tracts also define attri­bute visi­bi­lity. They spe­cify which con­su­mers are allo­wed to view which fields and under what con­di­ti­ons, while pre­ser­ving the over­all data structure. 

In this post, we extend the idea of schema con­tracts to include test­ing and nes­ted data mas­king, focu­sing on situa­tions where data flat­tening is not desirable.

The Mas­king Pro­blem with Nes­ted STRUCTs

Even when sche­mas are pro­perly ali­gned, down­stream expec­ta­ti­ons can still be vio­la­ted. Sen­si­tive attri­bu­tes may be expo­sed unin­ten­tio­nally, access rules can dif­fer bet­ween con­su­mers, or nes­ted fields may need to remain pre­sent even when their values must be hidden. 

A com­mon mas­king approach is to flat­ten the data, apply column-level mas­king, and optio­nally recon­s­truct the ori­gi­nal struc­ture. While this approach is simple, it intro­du­ces seve­ral trade-offs. The shape of the data chan­ges, schema ali­gnment beco­mes more com­plex, and the resul­ting SQL grows lar­ger and har­der to reason about. 

In some cases, mas­king must be applied while kee­ping the data nes­ted. In these sce­na­rios, the STRUCT shape must be pre­ser­ved exactly, only spe­ci­fic leaf fields should be mas­ked, and down­stream queries must con­ti­nue to work wit­hout modi­fi­ca­tion. Spark SQL sup­ports this pat­tern, but it requi­res expli­cit recon­s­truc­tion of the structure. 

Shape Pre­ser­ving Mas­king in Spark SQL 

Con­sider a nes­ted struc­ture in which ses­sion details include device infor­ma­tion, and brow­ser meta­data is deeply nes­ted within it. If browser.major must be mas­ked for all but pri­vi­le­ged users; the key requi­re­ment is that session_details reta­ins its full struc­ture regard­less of masking. 

This can be achie­ved by expli­citly rebuil­ding the STRUCT and app­ly­ing mas­king only at the leaf level: 

SELECT 

  user_id, 
  session_id, 
  named_struct( 
    'device_info', 
    named_struct( 
      'os', session_details.device_info.os, 
      'browser', 
      named_struct( 
        'name', session_details.device_info.browser.name, 
        'major', 
        CASE 
          WHEN is_member('analytics_full_access') 
          THEN session_details.device_info.browser.major 
          ELSE NULL 
        END 
      ) 
    ) 
  ) AS session_details 

FROM user_movie_activity; 

Each level of the struc­ture is deli­bera­tely recon­s­truc­ted, with mas­king applied only where requi­red. The resul­ting data­set pre­ser­ves its ori­gi­nal shape, main­ta­ins field pre­sence and data types, and enforces mas­king during query exe­cu­tion. For down­stream con­su­mers, the only dif­fe­rence is the visi­bi­lity of the mas­ked value. 

This approach is inten­tio­nally ver­bose. By embed­ding the mas­king beha­viour directly in the query, the con­tract remains expli­cit and easier to manage as sche­mas evolve. 

Mas­king Enforce­ment in Unity Cata­log and Hive Metastore 

It is important to distin­gu­ish bet­ween decla­ring a field as sen­si­tive and enfor­cing mas­king to pre­vent unaut­ho­ri­sed users from acces­sing the under­ly­ing value. While Spark SQL can express mas­king logic, the enforce­ment strength depends on the gover­nance layer. 

In Dat­ab­ricks envi­ron­ments, Unity Cata­log pro­vi­des cen­tra­li­sed gover­nance and sup­ports dyna­mic mas­king poli­cies at the cata­lo­gue level. Mas­king rules can be defi­ned once and applied con­sis­t­ently across note­books, SQL end­points, and down­stream tools, even when mul­ti­ple access paths exist. Simi­lar gover­nance capa­bi­li­ties exist in other plat­forms, but the prin­ci­ples remain the same. 

Nes­ted data com­pli­ca­tes enforce­ment even in cata­lo­gue-gover­ned envi­ron­ments. Mas­king is simp­lest when applied to columns at the outer­most level. When sen­si­tive attri­bu­tes are deeply nes­ted, it is often still neces­sary to expose gover­ned views that expli­citly recon­s­truct the struc­ture and mask leaf fields. As a result, shape-pre­ser­ving mas­king fre­quently remains an expli­cit SQL concern. 

Inside envi­ron­ments based on the Hive metastore, enforce­ment is typi­cally less cen­tra­li­sed. Access con­trol is usually limi­ted to data­ba­ses or tables, and dyna­mic field-level mas­king is not available. Mas­king is the­r­e­fore enforced by con­ven­tion. Raw tables are rest­ric­ted to tech­ni­cal users, and con­su­mers access only mas­ked views or models. In this con­text, expli­cit STRUCT recon­s­truc­tion in SQL beco­mes the pri­mary enforce­ment mecha­nism, pro­vi­ded that raw access is pro­perly restricted. 

Test­ing Mas­ked Nes­ted Fields with dbt 

Mas­king chan­ges the seman­tic gua­ran­tees of a data­set. A mas­ked field remains part of the schema and is inten­tio­nally nullable. For con­su­mers, mas­ked values are indis­tin­gu­is­ha­ble from legi­ti­mate nulls unless this beha­viour is tes­ted and enforced. For this reason, mas­king must be trea­ted as part of the schema con­tract and vali­da­ted continuously. 

dbt does not enforce mas­king directly, but it pro­vi­des a con­ve­ni­ent frame­work for coor­di­na­ting con­tract checks. The enforce­ment logic lives in Spark SQL, while dbt sup­plies struc­ture, repea­ta­bi­lity, and early feedback. 

A simple SQL-based test can verify that mas­ked fields are not visi­ble to users wit­hout the requi­red privileges: 

SELECT * 
FROM {{ ref('user_movie_activity_masked') }} 
WHERE 
  session_details.device_info.browser.major IS NOT NULL 
  AND NOT is_member('analytics_full_access'); 

If this query returns rows, the mas­king con­tract has been vio­la­ted. Addi­tio­nal tests can con­firm that expec­ted struc­tures remain pre­sent and that new nes­ted attri­bu­tes do not appear unex­pec­tedly. Tog­e­ther, these checks act as early war­ning signals and help block unin­ten­ded chan­ges from pro­pa­ga­ting downstream. 

Decla­ring Schema Con­tracts in dbt YAML 

In addi­tion to SQL-based tests, schema con­tracts can be declared expli­citly in dbt YAML files. This makes expec­ta­ti­ons visi­ble at the model boun­dary and pro­vi­des a clear refe­rence for review­ers and down­stream consumers. 

When contract.enforced: true is enab­led, dbt vali­da­tes the final out­put of a model against the declared schema. This applies equally to models that expose nes­ted STRUCT columns. Con­tracts can the­r­e­fore reco­g­nise issues such as miss­ing or ren­a­med columns, unex­pec­ted new columns, or incom­pa­ti­ble top-level type chan­ges, even when mas­king logic is present. 

What con­tracts verify is the shape of the mas­ked model out­put, not the raw data under­ly­ing it. Mas­king logic is applied inside the SQL model, and the con­tract eva­lua­tes the result of that logic. In this sense, mas­king beco­mes part of the con­tract implicitly. 

Howe­ver, con­tracts do not fully vali­date the inter­nal struc­ture of a STRUCT. They can­not assert that a spe­ci­fic nes­ted field exists, that it is mas­ked cor­rectly, or that con­di­tio­nal visi­bi­lity beha­ves as inten­ded. Those gua­ran­tees must still be imple­men­ted expli­citly in Spark SQL and veri­fied using tar­ge­ted tests. 

A Laye­red View of Con­tracts and Masking 

It is useful to think about schema con­tracts and mas­king as a set of com­ple­men­tary lay­ers rather than a sin­gle mechanism. 

At the lowest layer, SQL trans­for­ma­ti­ons define beha­viour. They recon­s­truct nes­ted struc­tures, apply mas­king rules, and pre­serve data shape. 

On top of that, dbt con­tracts lock down the published inter­face of a model. They ensure that expec­ted columns remain pre­sent, that the over­all schema does not drift unex­pec­tedly, and that mas­ked models con­ti­nue to expose a sta­ble boundary. 

Finally, dbt tests vali­date assump­ti­ons that can­not be expres­sed decla­ra­tively. They search for mas­king beha­viour, the pre­sence of nes­ted fields, and other seman­tic gua­ran­tees that mat­ter to down­stream consumers. 

Each layer addres­ses a dif­fe­rent class of fail­ure. Tog­e­ther, they pre­sent a prac­ti­cal and resi­li­ent approach to enfor­cing schema con­tracts in com­plex, nes­ted data systems. 

Inter­ac­tion with Schema Evolution 

Mas­king and schema evo­lu­tion are clo­sely lin­ked. When new nes­ted fields are intro­du­ced, they must be added to the struc­ture recon­s­truc­tion, eva­lua­ted for mas­king, ali­gned across schema ver­si­ons, and covered by tests. Trea­ting mas­ked attri­bu­tes like any other deve­lo­ping field, expli­citly defi­ned, typed, and vali­da­ted, helps keep the schema con­tract clear and mana­geable over time. 

Con­clu­sion

Across this series, we have loo­ked at three clo­sely rela­ted pro­blems when working with nes­ted data: flat­tening com­plex struc­tures for ana­ly­sis, expli­citly alig­ning evol­ving sche­mas, and enfor­cing visi­bi­lity gua­ran­tees while pre­ser­ving data shape. 

Tog­e­ther, these tech­ni­ques pro­vide a con­sis­tent approach to nes­ted data mas­king and schema con­tracts in evol­ving data sys­tems. Expli­cit SQL trans­for­ma­ti­ons make intent visi­ble, schema ali­gnment keeps his­to­ri­cal data usable, and shape-pre­ser­ving mas­king ensu­res sen­si­tive attri­bu­tes remain pro­tec­ted as sys­tems evolve. 

Hand­led deli­bera­tely, nes­ted data need not be fra­gile or opaque. With clear con­tracts and expli­cit trans­for­ma­ti­ons, it can remain both fle­xi­ble and relia­ble over time.