Working with Nested Data in Spark SQL: From Dot-Walking to Exploding



As a data engineer, data analyst, or Spark SQL practitioner, you have probably encountered nested data in Spark SQL in the form of objects nested within objects or lists of objects. This design keeps related information together and makes ingestion flexible; however, querying it is another matter.

A single record may contain several layers of structs and arrays, and what appears elegant in storage can quickly turn into lengthy SQL paths, unreadable queries, and difficult-to-debug errors.
Simple questions such as “How many movies did each user watch per session?” or “Which devices are most common among active users?” can suddenly require multiple joins, aliasing, or even nested queries. Although nested data is powerful, it requires clear strategies for flattening and naming to ensure that data pipelines remain maintainable.

This post explains how to work with nested data in Spark SQL using the STRUCT and ARRAY constructs, and concludes with a real-world example that combines both techniques. These methods will help you flatten complex data, avoid subtle join errors, and create clean, queryable tables that scale for analytics while remaining easy to understand.

The user_movie_activity Dataset

All examples use an exemplary table called user_movie_activity, which represents streaming session data containing both device information and viewing behaviour.

Field                               Type           Description
user_id                             STRING         Unique user identifier
session_id                          STRING         Session identifier
session_details.device_info         STRUCT         Device and browser information
movies_watched                      ARRAY<STRUCT>  List of movies watched during a session
movies_watched.viewing_segments     ARRAY<STRUCT>  Nested viewing segments within each movie

This table illustrates how nested data commonly appears in analytical systems: structs that hold structured attributes and arrays that capture repeated actions.

Full Schema

user_movie_activity 
user_id STRING 
session_id STRING 
session_details STRUCT< 
  device_info STRUCT< 
    os STRUCT< 
      name STRING, 
      version STRING 
    >, 
    browser STRUCT< 
      name STRING, 
      major STRING 
    > 
  > 
> 
movies_watched ARRAY< 
  STRUCT< 
    movie_id INT, 
    title STRING, 
    duration_seconds LONG, 
    completed BOOLEAN, 
    viewing_segments ARRAY< 
      STRUCT< 
        segment_index INT, 
        start_second INT, 
        end_second INT, 
        label STRING 
      > 
    > 
  > 
> 
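To make the schema concrete, a single (hypothetical) record matching it could look like the following JSON; all values are illustrative:

```json
{
  "user_id": "u1",
  "session_id": "s1",
  "session_details": {
    "device_info": {
      "os": {"name": "iOS", "version": "17.2"},
      "browser": {"name": "Safari", "major": "17"}
    }
  },
  "movies_watched": [
    {
      "movie_id": 101,
      "title": "Example Movie",
      "duration_seconds": 5400,
      "completed": true,
      "viewing_segments": [
        {"segment_index": 0, "start_second": 0, "end_second": 300, "label": "intro"}
      ]
    }
  ]
}
```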

The following sections will reference individual paths inside this structure.

STRUCT<STRUCT>: session_details.device_info

A STRUCT groups multiple related fields into one composite object. For example, operating system and browser details naturally belong together in a single struct. For analysis or reporting, however, each attribute usually needs to become its own column with a clear name that will not conflict later in joins or unions. To access individual attributes inside a struct, use dot notation to specify the full path, such as session_details.device_info.os.name.

When querying nested structures in SQL, it is helpful to use a Common Table Expression (CTE) to make the process easier to read and maintain. A CTE is a named subquery introduced with the WITH clause that keeps each transformation step explicit and reusable.

WITH device_flat AS (

SELECT
   user_id,
   session_id,
   session_details.device_info.os.name AS os_name,
   session_details.device_info.os.version AS os_version,
   session_details.device_info.browser.name AS browser_name,
   session_details.device_info.browser.major AS browser_major
FROM user_movie_activity

)

SELECT * FROM device_flat;

The CTE device_flat flattens the inner structure and assigns contextual aliases (os_name, browser_name) that remain unambiguous throughout the pipeline. CTEs make queries self-documenting, as each step has a clear purpose and can be reused or extended without rewriting nested subqueries.
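The same dot-walking can be mimicked outside Spark with plain nested dictionaries, which makes the mechanics easy to see. This is only an illustrative sketch: the record values and the `dot_get` helper are hypothetical, not part of Spark's API.

```python
# A nested dict mirroring the session_details struct (sample values are made up).
record = {
    "user_id": "u1",
    "session_id": "s1",
    "session_details": {
        "device_info": {
            "os": {"name": "iOS", "version": "17.2"},
            "browser": {"name": "Safari", "major": "17"},
        }
    },
}

def dot_get(row, path):
    """Walk a dotted path through nested dicts, like struct dot notation in SQL."""
    value = row
    for key in path.split("."):
        value = value[key]
    return value

# Flatten with contextual aliases, like the device_flat CTE.
device_flat = {
    "user_id": record["user_id"],
    "session_id": record["session_id"],
    "os_name": dot_get(record, "session_details.device_info.os.name"),
    "os_version": dot_get(record, "session_details.device_info.os.version"),
    "browser_name": dot_get(record, "session_details.device_info.browser.name"),
    "browser_major": dot_get(record, "session_details.device_info.browser.major"),
}
print(device_flat["os_name"], device_flat["browser_name"])  # iOS Safari
```

Note how the two conflicting name fields end up as distinct, self-explanatory keys.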

Why naming conventions matter

In the query above, two different attributes share the same name, name. If both were flattened without renaming, they would both appear as name, and Spark SQL or a BI tool would treat them as identical columns. Later operations, such as JOIN or UNION, could silently overwrite one column with the other, producing incorrect or ambiguous results.
To prevent this:

  • Give columns contextual aliases, such as os_name and browser_name.
  • Follow a consistent naming convention across models.
  • Avoid generic names like id, name, or value unless the scope is clear.

Ensuring attribute names remain consistent and descriptive throughout the pipeline prevents ambiguity and maintains clarity in downstream queries.

Takeaways

  • Use dot-walking to access fields inside structs.
  • Flatten early to simplify joins and avoid ambiguous references.
  • Use CTEs to make complex transformations transparent and maintainable.
  • Consistent naming prevents silent column collisions and keeps your schema self-documenting.

ARRAY<STRUCT>: movies_watched

Next, let’s explore how to work with arrays, the other major nested data type in Spark SQL. In this dataset, the movies_watched array contains multiple movie structs within a single session. Each struct includes fields such as movie_id, title, and duration_seconds. Arrays are efficient for storage but not ideal for analysis, because analysts typically require one row per array element.

To achieve this, you must explode the array, which repeats the parent record for every element in the list. EXPLODE takes an array and creates a new row for each element of that array. If one record contains three movies in the movies_watched array, EXPLODE produces three separate rows, one for each movie, while repeating all other columns (such as user_id and session_id). This transformation turns repeated data within a single record into a flat, tabular format that is easier to query and aggregate.

WITH movies_flat AS (

SELECT
   user_id,
   session_id,
   EXPLODE(movies_watched) AS mw
FROM user_movie_activity

),

movies_parsed AS (

SELECT
   user_id,
   session_id,
   mw.movie_id,
   mw.title,
   mw.duration_seconds,
   mw.completed
FROM movies_flat

)

SELECT * FROM movies_parsed;

Using two small CTEs keeps the transformation clear. The first expands the array, and the second extracts its fields.
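EXPLODE's row-multiplying behaviour can be simulated in plain Python with a nested comprehension; the sample record below is hypothetical and exists only to show the shape of the output:

```python
# One input record whose movies_watched array holds three movie structs.
rows = [
    {
        "user_id": "u1",
        "session_id": "s1",
        "movies_watched": [
            {"movie_id": 1, "title": "Alpha", "duration_seconds": 5400, "completed": True},
            {"movie_id": 2, "title": "Beta", "duration_seconds": 3600, "completed": False},
            {"movie_id": 3, "title": "Gamma", "duration_seconds": 7200, "completed": True},
        ],
    }
]

# EXPLODE analogue: one output row per array element, parent columns repeated.
movies_parsed = [
    {
        "user_id": row["user_id"],
        "session_id": row["session_id"],
        "movie_id": mw["movie_id"],
        "title": mw["title"],
        "duration_seconds": mw["duration_seconds"],
        "completed": mw["completed"],
    }
    for row in rows
    for mw in row["movies_watched"]
]

print(len(movies_parsed))  # 3 rows from a single input record
```

Each of the three output rows carries the same user_id and session_id, exactly as EXPLODE repeats the parent columns.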

Takeaways

  • Use EXPLODE to convert arrays into rows, producing one row per element.
  • Keep identifier fields (user_id, session_id) in every step to maintain context.
  • Use EXPLODE_OUTER if arrays can be empty but you still want to retain parent rows.
  • After exploding, select only the necessary fields to maintain a stable and readable schema.
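The difference between EXPLODE and EXPLODE_OUTER matters when an array can be empty. A small sketch of the two semantics, using made-up rows and helper names:

```python
# Two rows: one with a movie, one whose array is empty.
rows = [
    {"session_id": "s1", "movies_watched": [{"movie_id": 1}]},
    {"session_id": "s2", "movies_watched": []},  # empty array
]

def explode(rows, column):
    """EXPLODE analogue: rows with an empty array produce no output."""
    return [
        {**{k: v for k, v in row.items() if k != column}, column: element}
        for row in rows
        for element in row[column]
    ]

def explode_outer(rows, column):
    """EXPLODE_OUTER analogue: keep the parent row with a NULL element."""
    out = []
    for row in rows:
        for element in (row[column] or [None]):
            out.append({**{k: v for k, v in row.items() if k != column}, column: element})
    return out

print([r["session_id"] for r in explode(rows, "movies_watched")])        # ['s1']
print([r["session_id"] for r in explode_outer(rows, "movies_watched")])  # ['s1', 's2']
```

With plain EXPLODE, session s2 silently disappears from the result; EXPLODE_OUTER keeps it with a NULL element.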

Combining STRUCTs and ARRAYs: movies_watched.viewing_segments

In real datasets, structs and arrays often appear together. In this example, each movie has its own array of viewing segments, representing how a user interacted with it, such as when they paused, skipped, or resumed playback. Flattening this complex structure for downstream analysis requires combining both dot-walking and exploding, one level at a time.

WITH movies AS (

SELECT
   user_id,
   session_id,
   EXPLODE(movies_watched) AS mw
FROM user_movie_activity

),

segments_flat AS (

SELECT
   user_id,
   session_id,
   mw.movie_id,
   EXPLODE(mw.viewing_segments) AS seg -- combining explode and dot-walking
FROM movies

),

segments_parsed AS (

SELECT
   user_id,
   session_id,
   movie_id,
   seg.segment_index,
   seg.start_second,
   seg.end_second,
   seg.label
FROM segments_flat

)

SELECT * FROM segments_parsed;

This example illustrates how real-world data with multiple levels of nesting can be flattened systematically.
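The two-step flattening above can be sketched in plain Python as a nested comprehension, exploding the outer array first and the inner one second; the sample record is hypothetical:

```python
# One session containing two movies with two and one viewing segments.
rows = [
    {
        "user_id": "u1",
        "session_id": "s1",
        "movies_watched": [
            {
                "movie_id": 1,
                "viewing_segments": [
                    {"segment_index": 0, "start_second": 0, "end_second": 300, "label": "intro"},
                    {"segment_index": 1, "start_second": 300, "end_second": 900, "label": "main"},
                ],
            },
            {
                "movie_id": 2,
                "viewing_segments": [
                    {"segment_index": 0, "start_second": 0, "end_second": 120, "label": "trailer"},
                ],
            },
        ],
    }
]

segments_parsed = [
    {
        "user_id": row["user_id"],        # keep hierarchical identifiers
        "session_id": row["session_id"],
        "movie_id": mw["movie_id"],
        **seg,
    }
    for row in rows
    for mw in row["movies_watched"]       # outer EXPLODE
    for seg in mw["viewing_segments"]     # inner EXPLODE
]

print(len(segments_parsed))  # 3 segment rows across 2 movies
```

As in the SQL version, each segment row still carries user_id, session_id, and movie_id, so nothing about the hierarchy is lost.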

Takeaways

  • Combine techniques by exploding the outer array first and the inner one second.
  • Keep hierarchical identifiers (user_id, session_id, movie_id) at each step.
  • Avoid stacking multiple EXPLODE functions in a single SELECT statement, as this can obscure logic and unintentionally multiply rows.

From Nested to Readable

Handling nested data is an essential skill in Spark SQL. By mastering the two common structures, STRUCT<STRUCT> for dot-walking and ARRAY<STRUCT> for exploding, you can confidently handle even complex, multi-level combinations. Consistent naming, early flattening, and clear CTEs make queries more readable, reliable, and maintainable regardless of data complexity.

What’s Next

So far, the schema of the data has remained constant. In the next part of this series, we will explore schema evolution, which refers to managing structural changes in nested fields when new attributes are added, removed, or modified. We will also explain why built-in schema evolution in Iceberg or Delta might not always be sufficient and show how to manually align changing schemas while keeping all historical data using UNION ALL.

This post is Part 1 of a three-part series on working with nested data in Spark SQL.
Part 2 covers schema evolution and manual alignment.
Part 3 focuses on testing and masking nested data while preserving structure.