Schema Contracts in Practice: Testing and Masking Nested Data in Spark SQL and dbt
Previous posts in this series addressed two recurring challenges when working with nested data in Spark SQL: flattening STRUCT and ARRAY fields, and explicitly aligning old and new data using UNION ALL instead of relying solely on automatic schema evolution. This post focuses on nested data masking, showing how to protect sensitive fields inside STRUCTs while preserving the overall data shape.
Both topics focus primarily on structural correctness. Fields must exist, data types have to align, and historical data must remain accessible over time. In practice, schema contracts also define attribute visibility. They specify which consumers are allowed to view which fields and under what conditions, while preserving the overall data structure.
In this post, we extend the idea of schema contracts to include testing and nested data masking, focusing on situations where data flattening is not desirable.
The Masking Problem with Nested STRUCTs
Even when schemas are properly aligned, downstream expectations can still be violated. Sensitive attributes may be exposed unintentionally, access rules can differ between consumers, or nested fields may need to remain present even when their values must be hidden.
A common masking approach is to flatten the data, apply column-level masking, and optionally reconstruct the original structure. While this approach is simple, it introduces several trade-offs. The shape of the data changes, schema alignment becomes more complex, and the resulting SQL grows larger and harder to reason about.
In some cases, masking must be applied while keeping the data nested. In these scenarios, the STRUCT shape must be preserved exactly, only specific leaf fields should be masked, and downstream queries must continue to work without modification. Spark SQL supports this pattern, but it requires explicit reconstruction of the structure.
Shape-Preserving Masking in Spark SQL
Consider a nested structure in which session details include device information, with browser metadata deeply nested within it. Suppose browser.major must be masked for all but privileged users; the key requirement is that session_details retains its full structure regardless of masking.
This can be achieved by explicitly rebuilding the STRUCT and applying masking only at the leaf level:
SELECT
  user_id,
  session_id,
  named_struct(
    'device_info',
    named_struct(
      'os', session_details.device_info.os,
      'browser',
      named_struct(
        'name', session_details.device_info.browser.name,
        'major',
        CASE
          WHEN is_member('analytics_full_access')
            THEN session_details.device_info.browser.major
          ELSE NULL
        END
      )
    )
  ) AS session_details
FROM user_movie_activity;

Each level of the structure is deliberately reconstructed, with masking applied only where required. The resulting dataset preserves its original shape, maintains field presence and data types, and enforces masking during query execution. For downstream consumers, the only difference is the visibility of the masked value.
This approach is intentionally verbose. By embedding the masking behaviour directly in the query, the contract remains explicit and easier to manage as schemas evolve.
Masking Enforcement in Unity Catalog and Hive Metastore
It is important to distinguish between declaring a field as sensitive and enforcing masking to prevent unauthorised users from accessing the underlying value. While Spark SQL can express masking logic, the enforcement strength depends on the governance layer.
In Databricks environments, Unity Catalog provides centralised governance and supports dynamic masking policies at the catalogue level. Masking rules can be defined once and applied consistently across notebooks, SQL endpoints, and downstream tools, even when multiple access paths exist. Similar governance capabilities exist in other platforms, but the principles remain the same.
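As a point of comparison, a Unity Catalog column mask can express the same rule declaratively, although only for a top-level column. The sketch below assumes a hypothetical flattened table user_movie_activity_flat with a browser_major column, reusing the group name from the earlier example:

```sql
-- Hypothetical masking function: returns the value only to privileged users.
CREATE OR REPLACE FUNCTION browser_major_mask(major STRING)
RETURNS STRING
RETURN CASE
  WHEN is_member('analytics_full_access') THEN major
  ELSE NULL
END;

-- Attach the mask so it is enforced on every access path governed by the catalogue.
ALTER TABLE user_movie_activity_flat
ALTER COLUMN browser_major SET MASK browser_major_mask;
```

Because the mask is bound to the column in the catalogue rather than to one query, every notebook, endpoint, or BI tool reading the table receives the masked value.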
Nested data complicates enforcement even in catalogue-governed environments. Masking is simplest when applied to columns at the outermost level. When sensitive attributes are deeply nested, it is often still necessary to expose governed views that explicitly reconstruct the structure and mask leaf fields. As a result, shape-preserving masking frequently remains an explicit SQL concern.
Inside environments based on the Hive metastore, enforcement is typically less centralised. Access control is usually limited to databases or tables, and dynamic field-level masking is not available. Masking is therefore enforced by convention. Raw tables are restricted to technical users, and consumers access only masked views or models. In this context, explicit STRUCT reconstruction in SQL becomes the primary enforcement mechanism, provided that raw access is properly restricted.
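Under this convention, enforcement typically takes the form of a governed view over the restricted raw table. A minimal sketch, reusing the reconstruction from the earlier example; the database and view names are illustrative, and since per-user functions such as is_member are not available here, the consumer-facing view masks unconditionally while privileged users query a separately restricted object:

```sql
CREATE OR REPLACE VIEW analytics.user_movie_activity_masked AS
SELECT
  user_id,
  session_id,
  named_struct(
    'device_info',
    named_struct(
      'os', session_details.device_info.os,
      'browser',
      named_struct(
        'name', session_details.device_info.browser.name,
        -- masked unconditionally for this audience; type preserved via CAST
        'major', CAST(NULL AS STRING)
      )
    )
  ) AS session_details
FROM raw.user_movie_activity;
```

The CAST keeps the masked leaf's data type stable, so the view's STRUCT schema matches the raw table's.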
Testing Masked Nested Fields with dbt
Masking changes the semantic guarantees of a dataset. A masked field remains part of the schema and is intentionally nullable. For consumers, masked values are indistinguishable from legitimate nulls unless this behaviour is tested and enforced. For this reason, masking must be treated as part of the schema contract and validated continuously.
dbt does not enforce masking directly, but it provides a convenient framework for coordinating contract checks. The enforcement logic lives in Spark SQL, while dbt supplies structure, repeatability, and early feedback.
A simple SQL-based test can verify that masked fields are not visible to users without the required privileges:
SELECT *
FROM {{ ref('user_movie_activity_masked') }}
WHERE
session_details.device_info.browser.major IS NOT NULL
  AND NOT is_member('analytics_full_access');

If this query returns rows, the masking contract has been violated. Additional tests can confirm that expected structures remain present and that new nested attributes do not appear unexpectedly. Together, these checks act as early warning signals and help block unintended changes from propagating downstream.
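One such structural check can be expressed as another dbt singular test, following the model and field names from the example above. Note that if the nested path is removed or renamed upstream, the query fails to compile, which itself surfaces the contract break:

```sql
-- Fails when masking collapses more than the intended leaf:
-- the browser struct must remain present even when major is NULL.
SELECT *
FROM {{ ref('user_movie_activity_masked') }}
WHERE
  session_details IS NOT NULL
  AND session_details.device_info.browser IS NULL
```

As with all dbt tests, any returned row is a failure, so the query is written to select exactly the rows that violate the expectation.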
Declaring Schema Contracts in dbt YAML
In addition to SQL-based tests, schema contracts can be declared explicitly in dbt YAML files. This makes expectations visible at the model boundary and provides a clear reference for reviewers and downstream consumers.
When contract.enforced: true is enabled, dbt validates the final output of a model against the declared schema. This applies equally to models that expose nested STRUCT columns. Contracts can therefore detect issues such as missing or renamed columns, unexpected new columns, or incompatible top-level type changes, even when masking logic is present.
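A sketch of such a declaration, assuming the model from the earlier examples; the data types are illustrative and would need to match the actual Spark schema:

```yaml
models:
  - name: user_movie_activity_masked
    config:
      contract:
        enforced: true
    columns:
      - name: user_id
        data_type: bigint
      - name: session_id
        data_type: string
      - name: session_details
        data_type: "struct<device_info:struct<os:string,browser:struct<name:string,major:string>>>"
```

Note that the contract pins the top-level STRUCT type as a whole string; it does not individually assert the nested fields inside it.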
What contracts verify is the shape of the masked model output, not the raw data underlying it. Masking logic is applied inside the SQL model, and the contract evaluates the result of that logic. In this sense, masking becomes part of the contract implicitly.
However, contracts do not fully validate the internal structure of a STRUCT. They cannot assert that a specific nested field exists, that it is masked correctly, or that conditional visibility behaves as intended. Those guarantees must still be implemented explicitly in Spark SQL and verified using targeted tests.
A Layered View of Contracts and Masking
It is useful to think about schema contracts and masking as a set of complementary layers rather than a single mechanism.
At the lowest layer, SQL transformations define behaviour. They reconstruct nested structures, apply masking rules, and preserve data shape.
On top of that, dbt contracts lock down the published interface of a model. They ensure that expected columns remain present, that the overall schema does not drift unexpectedly, and that masked models continue to expose a stable boundary.
Finally, dbt tests validate assumptions that cannot be expressed declaratively. They check masking behaviour, the presence of nested fields, and other semantic guarantees that matter to downstream consumers.
Each layer addresses a different class of failure. Together, they present a practical and resilient approach to enforcing schema contracts in complex, nested data systems.
Interaction with Schema Evolution
Masking and schema evolution are closely linked. When new nested fields are introduced, they must be added to the structure reconstruction, evaluated for masking, aligned across schema versions, and covered by tests. Treating masked attributes like any other evolving field, explicitly defined, typed, and validated, helps keep the schema contract clear and manageable over time.
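For instance, if a hypothetical device_info.locale attribute were introduced upstream, the inner level of the earlier reconstruction would be extended with an explicit decision for the new field:

```sql
-- Inner device_info reconstruction, extended for the new attribute.
named_struct(
  'os', session_details.device_info.os,
  -- new field: reviewed, classified as non-sensitive, passed through as-is
  'locale', session_details.device_info.locale,
  'browser',
  named_struct(
    'name', session_details.device_info.browser.name,
    'major',
    CASE
      WHEN is_member('analytics_full_access')
        THEN session_details.device_info.browser.major
      ELSE NULL
    END
  )
)
```

Because the reconstruction enumerates every field, a new attribute cannot silently flow to consumers; it appears only once someone adds it here and records a masking decision.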
Conclusion
Across this series, we have looked at three closely related problems when working with nested data: flattening complex structures for analysis, explicitly aligning evolving schemas, and enforcing visibility guarantees while preserving data shape.
Together, these techniques provide a consistent approach to nested data masking and schema contracts in evolving data systems. Explicit SQL transformations make intent visible, schema alignment keeps historical data usable, and shape-preserving masking ensures sensitive attributes remain protected as systems evolve.
Handled deliberately, nested data need not be fragile or opaque. With clear contracts and explicit transformations, it can remain both flexible and reliable over time.
