Working with Nested Data in Spark SQL: From Dot-Walking to Exploding



As a data engineer, data analyst, or Spark SQL practitioner, you have probably encountered nested data in Spark SQL in the form of objects nested within objects or lists of objects. This design keeps related information together and makes ingestion flexible; however, querying it is another matter.

A single record may contain several layers of structs and arrays, and what appears elegant in storage can quickly turn into lengthy SQL paths, unreadable queries, and difficult-to-debug errors.
Simple questions such as “How many movies did each user watch per session?” or “Which devices are most common among active users?” can suddenly require multiple joins, aliasing, or even nested queries. Although nested data is powerful, it requires clear strategies for flattening and naming to ensure that data pipelines remain maintainable.

This post explains how to work with nested data in Spark SQL using the STRUCT and ARRAY constructs, and concludes with a real-world example that combines both techniques. These methods will help you flatten complex data, avoid subtle join errors, and create clean, queryable tables that are scalable for analytics while remaining easy to understand.

The user_movie_activity Dataset

All examples use an example table called user_movie_activity, which represents streaming session data containing both device information and viewing behaviour.

Field | Type | Description
user_id | STRING | Unique user identifier
session_id | STRING | Session identifier
session_details.device_info | STRUCT | Device and browser information
movies_watched | ARRAY<STRUCT> | List of movies watched during a session
movies_watched.viewing_segments | ARRAY<STRUCT> | Nested viewing segments within each movie

This table illustrates how nested data commonly appears in analytical systems: structs that hold structured attributes and arrays that capture repeated actions.

Full Schema

user_movie_activity 
user_id STRING 
session_id STRING 
session_details STRUCT< 
  device_info STRUCT< 
    os STRUCT< 
      name STRING, 
      version STRING 
    >, 
    browser STRUCT< 
      name STRING, 
      major STRING 
    > 
  > 
> 
movies_watched ARRAY< 
  STRUCT< 
    movie_id INT, 
    title STRING, 
    duration_seconds LONG, 
    completed BOOLEAN, 
    viewing_segments ARRAY< 
      STRUCT< 
        segment_index INT, 
        start_second INT, 
        end_second INT, 
        label STRING 
      > 
    > 
  > 
> 

The following sections will reference individual paths inside this structure.

STRUCT<STRUCT>: session_details.device_info

A STRUCT groups multiple related fields into one composite object. For example, operating system and browser details naturally belong together in a single struct. For analysis or reporting, however, each attribute usually needs to become its own column with a clear name that will not conflict later in joins or unions. To access individual attributes inside a struct, use dot notation to specify the full path, such as session_details.device_info.os.name.

When querying nested structures in SQL, it is helpful to use a Common Table Expression (CTE) to make the process easier to read and maintain. A CTE is a named subquery introduced with the WITH clause that keeps each transformation step explicit and reusable.

WITH device_flat AS (

SELECT
   user_id,
   session_id,
   session_details.device_info.os.name AS os_name,
   session_details.device_info.os.version AS os_version,
   session_details.device_info.browser.name AS browser_name,
   session_details.device_info.browser.major AS browser_major
FROM user_movie_activity

)

SELECT * FROM device_flat;

The CTE device_flat flattens the inner structure and assigns contextual aliases (os_name, browser_name) that remain unambiguous throughout the pipeline. CTEs make queries self-documenting, as each step has a clear purpose and can be reused or extended without requiring the rewriting of nested subqueries.
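The dot-walking that the query performs can be mirrored in plain Python with a nested dictionary. This is a minimal sketch, not Spark code: the record and its values are made up, and chained key lookups stand in for SQL dot notation.

```python
# A single record shaped like the session_details struct above
# (all values are hypothetical).
record = {
    "user_id": "u1",
    "session_id": "s1",
    "session_details": {
        "device_info": {
            "os": {"name": "iOS", "version": "17.4"},
            "browser": {"name": "Safari", "major": "17"},
        }
    },
}

device = record["session_details"]["device_info"]

# Each chained lookup corresponds to one dot in the SQL path;
# the aliases keep the two "name" fields distinguishable.
flat = {
    "user_id": record["user_id"],
    "session_id": record["session_id"],
    "os_name": device["os"]["name"],
    "os_version": device["os"]["version"],
    "browser_name": device["browser"]["name"],
    "browser_major": device["browser"]["major"],
}
print(flat["os_name"], flat["browser_name"])  # iOS Safari
```

The flat dictionary plays the role of one row of device_flat: every nested attribute becomes a top-level, unambiguously named column.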

Why naming conventions matter

In the query above, two different attributes share the same field name, name. If both were flattened without renaming, they would both appear as name, and Spark SQL or a BI tool would treat them as identical columns. Later operations, such as JOIN or UNION, could silently overwrite one column with the other, producing incorrect or ambiguous results.
To pre­vent this:

  • Give columns contextual aliases, such as os_name and browser_name.
  • Follow a consistent naming convention across models.
  • Avoid generic names like id, name, or value unless the scope is clear.

Ensuring attribute names remain consistent and descriptive throughout the pipeline prevents ambiguity and maintains clarity in downstream queries.
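The collision is easy to demonstrate outside SQL. In this hypothetical Python sketch, merging two flattened structs that each expose a name field behaves like an unaliased join: the second value silently replaces the first.

```python
# Two structs that each carry a generic "name" field (made-up values).
os_fields = {"name": "iOS", "version": "17.4"}
browser_fields = {"name": "Safari", "major": "17"}

# Without renaming, the second "name" silently overwrites the first,
# much as an unaliased JOIN or UNION can overwrite a column.
collided = {**os_fields, **browser_fields}
print(collided["name"])  # Safari -- the OS name is gone

# Contextual aliases keep both values visible.
aliased = {
    "os_name": os_fields["name"],
    "browser_name": browser_fields["name"],
}
```

No error is raised in either case, which is exactly why such collisions are dangerous: the data is wrong, but nothing fails loudly.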

Takeaways

  • Use dot-walking to access fields inside structs.
  • Flatten early to simplify joins and avoid ambiguous references.
  • Use CTEs to make complex transformations transparent and maintainable.
  • Consistent naming prevents silent column collisions and keeps your schema self-documenting.

ARRAY<STRUCT>: movies_watched

Next, let’s explore how to work with arrays, the other major nested data type in Spark SQL. In this dataset, the movies_watched array contains multiple movie structs within a single session. Each struct includes fields such as movie_id, title, and duration_seconds. Arrays are efficient for storage, but not ideal for analysis, because analysts typically require one row per array element.

To achieve this, you must explode the array, which repeats the parent record for every element in the list. EXPLODE takes an array and creates a new row for each element of that array. If one record contains three movies in the movies_watched array, EXPLODE produces three separate rows, one for each movie, while repeating all other columns (such as user_id and session_id). This transformation turns repeated data within a single record into a flat, tabular format that is easier to query and aggregate.

WITH movies_flat AS (

SELECT
   user_id,
   session_id,
   EXPLODE(movies_watched) AS mw
FROM user_movie_activity

),

movies_parsed AS (

SELECT
   user_id,
   session_id,
   mw.movie_id,
   mw.title,
   mw.duration_seconds,
   mw.completed
FROM movies_flat

)

SELECT * FROM movies_parsed;

Using two small CTEs keeps the transformation clear. The first expands the array, and the second extracts its fields.
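The row-multiplication that EXPLODE performs can be sketched in plain Python. This is an illustration with made-up values, not Spark code: a list comprehension over the array plays the role of EXPLODE, repeating the parent columns on every output row.

```python
# One input record whose movies_watched array holds three elements
# (hypothetical values).
record = {
    "user_id": "u1",
    "session_id": "s1",
    "movies_watched": [
        {"movie_id": 1, "title": "Alpha"},
        {"movie_id": 2, "title": "Beta"},
        {"movie_id": 3, "title": "Gamma"},
    ],
}

# One output row per array element, with user_id and session_id
# repeated on each row -- the essence of EXPLODE.
rows = [
    {"user_id": record["user_id"], "session_id": record["session_id"], **mw}
    for mw in record["movies_watched"]
]
print(len(rows))  # 3 rows from a single input record
```

One record in, three rows out: the same fan-out the movies_flat CTE produces.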

Takeaways

  • Use EXPLODE to convert arrays into rows, producing one row per element.
  • Keep identifier fields (user_id, session_id) in every step to maintain context.
  • Use EXPLODE_OUTER if arrays can be empty, but you still want to retain parent rows.
  • After exploding, select only the necessary fields to maintain a stable and readable schema.
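The difference between EXPLODE and EXPLODE_OUTER on empty arrays can also be sketched in plain Python. The records below are hypothetical; the second comprehension substitutes a single NULL-like element for an empty array, which is what EXPLODE_OUTER does to keep the parent row.

```python
# Two sessions, one of which watched no movies (made-up values).
records = [
    {"session_id": "s1", "movies_watched": [{"movie_id": 1}]},
    {"session_id": "s2", "movies_watched": []},  # empty array
]

# EXPLODE semantics: sessions with empty arrays vanish from the result.
exploded = [
    {"session_id": r["session_id"], "mw": mw}
    for r in records
    for mw in r["movies_watched"]
]

# EXPLODE_OUTER semantics: an empty array still yields one row,
# with a NULL (None) in place of the element.
exploded_outer = [
    {"session_id": r["session_id"], "mw": mw}
    for r in records
    for mw in (r["movies_watched"] or [None])
]
print(len(exploded), len(exploded_outer))  # 1 2
```

If downstream metrics count sessions, that silently dropped s2 row is exactly the kind of bug EXPLODE_OUTER exists to prevent.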

Combining STRUCTs and ARRAYs: movies_watched.viewing_segments

In real datasets, structs and arrays often appear together. In this example, each movie has its own array of viewing segments, representing how a user interacted with it, such as when they paused, skipped, or resumed playback. Flattening this complex structure for downstream analysis requires combining both dot-walking and exploding, one level at a time.

WITH movies AS (

SELECT
   user_id,
   session_id,
   EXPLODE(movies_watched) AS mw
FROM user_movie_activity

),

segments_flat AS (

SELECT
   user_id,
   session_id,
   mw.movie_id,
   EXPLODE(mw.viewing_segments) AS seg -- combining explode and dot-walking
FROM movies

),

segments_parsed AS (

SELECT
   user_id,
   session_id,
   movie_id,
   seg.segment_index,
   seg.start_second,
   seg.end_second,
   seg.label
FROM segments_flat

)

SELECT * FROM segments_parsed;

This example illustrates real-world data that requires multiple levels of nesting to be flattened systematically.
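The two-stage flattening can be mirrored in plain Python with nested comprehensions: the outer loop plays the role of the first EXPLODE, the inner loop the second, and the identifiers are carried along at each level. The record and its values are made up for illustration.

```python
# One session with two movies; the movies have two and one
# viewing segments respectively (hypothetical values).
record = {
    "user_id": "u1",
    "session_id": "s1",
    "movies_watched": [
        {
            "movie_id": 1,
            "viewing_segments": [
                {"segment_index": 0, "label": "play"},
                {"segment_index": 1, "label": "pause"},
            ],
        },
        {
            "movie_id": 2,
            "viewing_segments": [{"segment_index": 0, "label": "play"}],
        },
    ],
}

# Outer loop = first EXPLODE (movies), inner loop = second EXPLODE
# (segments); user_id, session_id, and movie_id ride along.
segments = [
    {
        "user_id": record["user_id"],
        "session_id": record["session_id"],
        "movie_id": mw["movie_id"],
        **seg,
    }
    for mw in record["movies_watched"]
    for seg in mw["viewing_segments"]
]
print(len(segments))  # 3 segment rows from one input record
```

The result has one row per viewing segment, exactly the shape segments_parsed produces.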

Takeaways

  • Combine techniques by exploding the outer array first and the inner one second.
  • Keep hierarchical identifiers (user_id, session_id, movie_id) at each step.
  • Avoid stacking multiple EXPLODE functions in a single SELECT statement, as this can obscure logic and unintentionally multiply rows.

From Nested to Readable

Handling nested data is an essential skill in Spark SQL. By mastering the two common structures, STRUCT<STRUCT> for dot-walking and ARRAY<STRUCT> for exploding, you can confidently handle even complex, multi-level combinations. Consistent naming, early flattening, and clear CTEs make queries more readable, reliable, and maintainable regardless of data complexity.

What’s Next

So far, the schema of the data has remained constant. In the next part of this series, we will explore schema evolution, which refers to managing structural changes in nested fields when new attributes are added, removed, or modified. We will also explain why built-in schema evolution in Iceberg or Delta might not always be sufficient and show how to manually align changing schemas while keeping all historical data using UNION ALL.

This post is Part 1 of a three-part series on working with nested data in Spark SQL.
Part 2 covers schema evolution and manual alignment.
Part 3 focuses on testing and masking nested data while preserving structure.