Schema Con­tracts in Prac­tice: Test­ing and Mask­ing Nes­ted Data in Spark SQL and dbt 



Pre­vi­ous posts in this series addressed two recur­ring chal­lenges when work­ing with nes­ted data in Spark SQL. This post focuses on nes­ted data mask­ing, show­ing how to pro­tect sens­it­ive fields inside STRUCTs while pre­serving the over­all data shape. Earlier posts covered flat­ten­ing STRUCT and ARRAY fields and expli­citly align­ing old and new data using UNION ALL instead of rely­ing solely on auto­matic schema evolution.

Both top­ics focus primar­ily on struc­tural cor­rect­ness. Fields must exist, data types have to align, and his­tor­ical data must remain access­ible over time. In prac­tice, schema con­tracts also define attrib­ute vis­ib­il­ity. They spe­cify which con­sumers are allowed to view which fields and under what con­di­tions, while pre­serving the over­all data structure. 

In this post, we extend the idea of schema con­tracts to include test­ing and nes­ted data mask­ing, focus­ing on situ­ations where data flat­ten­ing is not desirable.

The Mask­ing Prob­lem with Nes­ted STRUCTs

Even when schemas are prop­erly aligned, down­stream expect­a­tions can still be viol­ated. Sens­it­ive attrib­utes may be exposed unin­ten­tion­ally, access rules can dif­fer between con­sumers, or nes­ted fields may need to remain present even when their val­ues must be hidden. 

A com­mon mask­ing approach is to flat­ten the data, apply column-level mask­ing, and option­ally recon­struct the ori­ginal struc­ture. While this approach is simple, it intro­duces sev­eral trade-offs. The shape of the data changes, schema align­ment becomes more com­plex, and the res­ult­ing SQL grows lar­ger and harder to reason about. 

In some cases, mask­ing must be applied while keep­ing the data nes­ted. In these scen­arios, the STRUCT shape must be pre­served exactly, only spe­cific leaf fields should be masked, and down­stream quer­ies must con­tinue to work without modi­fic­a­tion. Spark SQL sup­ports this pat­tern, but it requires expli­cit recon­struc­tion of the structure. 

Shape Pre­serving Mask­ing in Spark SQL 

Con­sider a nes­ted struc­ture in which ses­sion details include device inform­a­tion, and browser metadata is deeply nes­ted within it. If browser.major must be masked for all but priv­ileged users; the key require­ment is that session_details retains its full struc­ture regard­less of masking. 

This can be achieved by expli­citly rebuild­ing the STRUCT and apply­ing mask­ing only at the leaf level: 

SELECT 

  user_id, 
  session_id, 
  named_struct( 
    'device_info', 
    named_struct( 
      'os', session_details.device_info.os, 
      'browser', 
      named_struct( 
        'name', session_details.device_info.browser.name, 
        'major', 
        CASE 
          WHEN is_member('analytics_full_access') 
          THEN session_details.device_info.browser.major 
          ELSE NULL 
        END 
      ) 
    ) 
  ) AS session_details 

FROM user_movie_activity; 

Each level of the struc­ture is delib­er­ately recon­struc­ted, with mask­ing applied only where required. The res­ult­ing data­set pre­serves its ori­ginal shape, main­tains field pres­ence and data types, and enforces mask­ing dur­ing query exe­cu­tion. For down­stream con­sumers, the only dif­fer­ence is the vis­ib­il­ity of the masked value. 

This approach is inten­tion­ally verb­ose. By embed­ding the mask­ing beha­viour dir­ectly in the query, the con­tract remains expli­cit and easier to man­age as schemas evolve. 

Mask­ing Enforce­ment in Unity Cata­log and Hive Metastore 

It is import­ant to dis­tin­guish between declar­ing a field as sens­it­ive and enfor­cing mask­ing to pre­vent unau­thor­ised users from access­ing the under­ly­ing value. While Spark SQL can express mask­ing logic, the enforce­ment strength depends on the gov­ernance layer. 

In Dat­ab­ricks envir­on­ments, Unity Cata­log provides cent­ral­ised gov­ernance and sup­ports dynamic mask­ing policies at the cata­logue level. Mask­ing rules can be defined once and applied con­sist­ently across note­books, SQL end­points, and down­stream tools, even when mul­tiple access paths exist. Sim­ilar gov­ernance cap­ab­il­it­ies exist in other plat­forms, but the prin­ciples remain the same. 

Nes­ted data com­plic­ates enforce­ment even in cata­logue-gov­erned envir­on­ments. Mask­ing is simplest when applied to columns at the out­er­most level. When sens­it­ive attrib­utes are deeply nes­ted, it is often still neces­sary to expose gov­erned views that expli­citly recon­struct the struc­ture and mask leaf fields. As a res­ult, shape-pre­serving mask­ing fre­quently remains an expli­cit SQL concern. 

Inside envir­on­ments based on the Hive metastore, enforce­ment is typ­ic­ally less cent­ral­ised. Access con­trol is usu­ally lim­ited to data­bases or tables, and dynamic field-level mask­ing is not avail­able. Mask­ing is there­fore enforced by con­ven­tion. Raw tables are restric­ted to tech­nical users, and con­sumers access only masked views or mod­els. In this con­text, expli­cit STRUCT recon­struc­tion in SQL becomes the primary enforce­ment mech­an­ism, provided that raw access is prop­erly restricted. 

Test­ing Masked Nes­ted Fields with dbt 

Mask­ing changes the semantic guar­an­tees of a data­set. A masked field remains part of the schema and is inten­tion­ally nul­lable. For con­sumers, masked val­ues are indis­tin­guish­able from legit­im­ate nulls unless this beha­viour is tested and enforced. For this reason, mask­ing must be treated as part of the schema con­tract and val­id­ated continuously. 

dbt does not enforce mask­ing dir­ectly, but it provides a con­veni­ent frame­work for coordin­at­ing con­tract checks. The enforce­ment logic lives in Spark SQL, while dbt sup­plies struc­ture, repeat­ab­il­ity, and early feedback. 

A simple SQL-based test can verify that masked fields are not vis­ible to users without the required privileges: 

SELECT * 
FROM {{ ref('user_movie_activity_masked') }} 
WHERE 
  session_details.device_info.browser.major IS NOT NULL 
  AND NOT is_member('analytics_full_access'); 

If this query returns rows, the mask­ing con­tract has been viol­ated. Addi­tional tests can con­firm that expec­ted struc­tures remain present and that new nes­ted attrib­utes do not appear unex­pec­tedly. Together, these checks act as early warn­ing sig­nals and help block unin­ten­ded changes from propagat­ing downstream. 

Declar­ing Schema Con­tracts in dbt YAML 

In addi­tion to SQL-based tests, schema con­tracts can be declared expli­citly in dbt YAML files. This makes expect­a­tions vis­ible at the model bound­ary and provides a clear ref­er­ence for review­ers and down­stream consumers. 

When contract.enforced: true is enabled, dbt val­id­ates the final out­put of a model against the declared schema. This applies equally to mod­els that expose nes­ted STRUCT columns. Con­tracts can there­fore recog­nise issues such as miss­ing or renamed columns, unex­pec­ted new columns, or incom­pat­ible top-level type changes, even when mask­ing logic is present. 

What con­tracts verify is the shape of the masked model out­put, not the raw data under­ly­ing it. Mask­ing logic is applied inside the SQL model, and the con­tract eval­u­ates the res­ult of that logic. In this sense, mask­ing becomes part of the con­tract implicitly. 

How­ever, con­tracts do not fully val­id­ate the internal struc­ture of a STRUCT. They can­not assert that a spe­cific nes­ted field exists, that it is masked cor­rectly, or that con­di­tional vis­ib­il­ity behaves as inten­ded. Those guar­an­tees must still be imple­men­ted expli­citly in Spark SQL and veri­fied using tar­geted tests. 

A Layered View of Con­tracts and Masking 

It is use­ful to think about schema con­tracts and mask­ing as a set of com­ple­ment­ary lay­ers rather than a single mechanism. 

At the low­est layer, SQL trans­form­a­tions define beha­viour. They recon­struct nes­ted struc­tures, apply mask­ing rules, and pre­serve data shape. 

On top of that, dbt con­tracts lock down the pub­lished inter­face of a model. They ensure that expec­ted columns remain present, that the over­all schema does not drift unex­pec­tedly, and that masked mod­els con­tinue to expose a stable boundary. 

Finally, dbt tests val­id­ate assump­tions that can­not be expressed declar­at­ively. They search for mask­ing beha­viour, the pres­ence of nes­ted fields, and other semantic guar­an­tees that mat­ter to down­stream consumers. 

Each layer addresses a dif­fer­ent class of fail­ure. Together, they present a prac­tical and resi­li­ent approach to enfor­cing schema con­tracts in com­plex, nes­ted data systems. 

Inter­ac­tion with Schema Evolution 

Mask­ing and schema evol­u­tion are closely linked. When new nes­ted fields are intro­duced, they must be added to the struc­ture recon­struc­tion, eval­u­ated for mask­ing, aligned across schema ver­sions, and covered by tests. Treat­ing masked attrib­utes like any other devel­op­ing field, expli­citly defined, typed, and val­id­ated, helps keep the schema con­tract clear and man­age­able over time. 

Con­clu­sion

Across this series, we have looked at three closely related prob­lems when work­ing with nes­ted data: flat­ten­ing com­plex struc­tures for ana­lysis, expli­citly align­ing evolving schemas, and enfor­cing vis­ib­il­ity guar­an­tees while pre­serving data shape. 

Together, these tech­niques provide a con­sist­ent approach to nes­ted data mask­ing and schema con­tracts in evolving data sys­tems. Expli­cit SQL trans­form­a­tions make intent vis­ible, schema align­ment keeps his­tor­ical data usable, and shape-pre­serving mask­ing ensures sens­it­ive attrib­utes remain pro­tec­ted as sys­tems evolve. 

Handled delib­er­ately, nes­ted data need not be fra­gile or opaque. With clear con­tracts and expli­cit trans­form­a­tions, it can remain both flex­ible and reli­able over time.