Working with Nested Data in Spark SQL: From Dot-Walking to Exploding
As a data engineer, data analyst, or Spark SQL practitioner, you have probably encountered nested data in Spark SQL in the form of objects nested within objects or lists of objects. This design keeps related information together and makes ingestion flexible; however, querying it is another matter.
A single record may contain several layers of structs and arrays, and what appears elegant in storage can quickly turn into lengthy SQL paths, unreadable queries, and difficult-to-debug errors.
Simple questions such as “How many movies did each user watch per session?” or “Which devices are most common among active users?” can suddenly require multiple joins, aliasing, or even nested queries. Although nested data is powerful, it requires clear strategies for flattening and naming to ensure that data pipelines remain maintainable.
This post explains how to work with nested data in Spark SQL using the STRUCT and ARRAY constructs, and concludes with a real-world example that combines both techniques. These methods will help you flatten complex data, avoid subtle join errors, and create clean, queryable tables that are scalable for analytics while remaining easy to understand.
The user_movie_activity Dataset
All examples use an example table called user_movie_activity, which represents streaming session data containing both device information and viewing behaviour.
| Field | Type | Description |
| --- | --- | --- |
| user_id | STRING | Unique user identifier |
| session_id | STRING | Session identifier |
| session_details.device_info | STRUCT | Device and browser information |
| movies_watched | ARRAY<STRUCT> | List of movies watched during a session |
| movies_watched.viewing_segments | ARRAY<STRUCT> | Nested viewing segments within each movie |
This table illustrates how nested data commonly appears in analytical systems: structs that hold structured attributes and arrays that capture repeated actions.
Full Schema
```sql
-- user_movie_activity
user_id          STRING
session_id       STRING
session_details  STRUCT<
  device_info STRUCT<
    os STRUCT<
      name STRING,
      version STRING
    >,
    browser STRUCT<
      name STRING,
      major STRING
    >
  >
>
movies_watched   ARRAY<
  STRUCT<
    movie_id INT,
    title STRING,
    duration_seconds LONG,
    completed BOOLEAN,
    viewing_segments ARRAY<
      STRUCT<
        segment_index INT,
        start_second INT,
        end_second INT,
        label STRING
      >
    >
  >
>
```

The following sections reference individual paths inside this structure.
STRUCT<STRUCT>: session_details.device_info
A STRUCT groups multiple related fields into one composite object. For example, operating system and browser details naturally belong together in a single struct. For analysis or reporting, however, each attribute usually needs to become its own column with a clear name that will not conflict later in joins or unions. To access individual attributes inside a struct, use dot notation to specify the full path, such as session_details.device_info.os.name.
When querying nested structures in SQL, it is helpful to use a Common Table Expression (CTE) to make the process easier to read and maintain. A CTE is a named subquery introduced with the WITH clause that keeps each transformation step explicit and reusable.
```sql
WITH device_flat AS (
  SELECT
    user_id,
    session_id,
    session_details.device_info.os.name       AS os_name,
    session_details.device_info.os.version    AS os_version,
    session_details.device_info.browser.name  AS browser_name,
    session_details.device_info.browser.major AS browser_major
  FROM user_movie_activity
)
SELECT * FROM device_flat;
```
The CTE device_flat flattens the inner structure and assigns contextual aliases (os_name, browser_name) that remain unambiguous throughout the pipeline. CTEs make queries self-documenting, as each step has a clear purpose and can be reused or extended without requiring the rewriting of nested subqueries.
Why naming conventions matter
In the query above, two different attributes share the same leaf name, name. If both were flattened without renaming, they would both appear as name, and Spark SQL or a BI tool would treat them as identical columns. Later operations, such as JOIN or UNION, could silently overwrite one column with the other, producing incorrect or ambiguous results.
To prevent this:
- Give columns contextual aliases, such as device_name and movie_name.
- Follow a consistent naming convention across models.
- Avoid generic names like id, name, or value unless the scope is clear.
Ensuring attribute names remain consistent and descriptive throughout the pipeline prevents ambiguity and maintains clarity in downstream queries.
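As a quick illustration against the example schema, here is a sketch of the ambiguous and the aliased variants side by side. Without aliases, both leaf fields come back under the same column name:

```sql
-- Ambiguous: both projected columns are called `name`
SELECT
  session_details.device_info.os.name,       -- column: name
  session_details.device_info.browser.name   -- column: name
FROM user_movie_activity;

-- Unambiguous: contextual aliases keep the columns distinct
SELECT
  session_details.device_info.os.name      AS os_name,
  session_details.device_info.browser.name AS browser_name
FROM user_movie_activity;
```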
Takeaways
- Use dot-walking to access fields inside structs.
- Flatten early to simplify joins and avoid ambiguous references.
- Use CTEs to make complex transformations transparent and maintainable.
- Consistent naming prevents silent column collisions and keeps your schema self-documenting.
ARRAY<STRUCT>: movies_watched
Next, let’s explore how to work with arrays, the other major nested data type in Spark SQL. In this dataset, the movies_watched array contains multiple movie structs within a single session. Each struct includes fields such as movie_id, title, and duration_seconds. Arrays are efficient for storage, but not ideal for analysis, because analysts typically require one row per array element.
To achieve this, you must explode the array, which repeats the parent record for every element in the list. EXPLODE takes an array and creates a new row for each element of that array. If one record contains three movies in the movies_watched array, EXPLODE produces three separate rows, one for each movie, while repeating all other columns (such as user_id and session_id). This transformation turns repeated data within a single record into a flat, tabular format that is easier to query and aggregate.
```sql
WITH movies_flat AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity
),
movies_parsed AS (
  SELECT
    user_id,
    session_id,
    mw.movie_id,
    mw.title,
    mw.duration_seconds,
    mw.completed
  FROM movies_flat
)
SELECT * FROM movies_parsed;
```

Using two small CTEs keeps the transformation clear: the first expands the array, and the second extracts its fields.
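Once the array is flattened, the question from the introduction, "How many movies did each user watch per session?", reduces to an ordinary aggregation. A sketch reusing the first CTE:

```sql
-- Movies watched per user and session
WITH movies_flat AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity
)
SELECT
  user_id,
  session_id,
  COUNT(mw.movie_id) AS movies_watched_count
FROM movies_flat
GROUP BY user_id, session_id;
```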
Takeaways
- Use EXPLODE to convert arrays into rows, producing one row per element.
- Keep identifier fields (user_id, session_id) in every step to maintain context.
- Use EXPLODE_OUTER if arrays can be empty, but you still want to retain parent rows.
- After exploding, select only the necessary fields to maintain a stable and readable schema.
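The last two points are worth seeing in action: EXPLODE drops sessions whose movies_watched array is empty or NULL, whereas EXPLODE_OUTER keeps them as a single row with a NULL element. A sketch against the example table:

```sql
-- Sessions with no movies survive as one row with mw = NULL
SELECT
  user_id,
  session_id,
  EXPLODE_OUTER(movies_watched) AS mw
FROM user_movie_activity;
```

This matters for metrics such as session counts, where silently losing movie-free sessions would skew the denominator.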
Combining STRUCTs and ARRAYs: movies_watched.viewing_segments
In real datasets, structs and arrays often appear together. In this example, each movie has its own array of viewing segments, representing how a user interacted with it, such as when they paused, skipped, or resumed playback. Flattening this complex structure for downstream analysis requires combining both dot-walking and exploding, one level at a time.
```sql
WITH movies AS (
  SELECT
    user_id,
    session_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity
),
segments_flat AS (
  SELECT
    user_id,
    session_id,
    mw.movie_id,
    EXPLODE(mw.viewing_segments) AS seg  -- combining explode and dot-walking
  FROM movies
),
segments_parsed AS (
  SELECT
    user_id,
    session_id,
    movie_id,
    seg.segment_index,
    seg.start_second,
    seg.end_second,
    seg.label
  FROM segments_flat
)
SELECT * FROM segments_parsed;
```

This example illustrates how real-world data with multiple levels of nesting can be flattened systematically.
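As an alternative to chaining CTEs, Spark SQL's LATERAL VIEW syntax can express the same two-level flattening in one query. Which style to prefer is largely a readability choice; a sketch of the equivalent query:

```sql
-- Same flattening with LATERAL VIEW instead of CTEs
SELECT
  user_id,
  session_id,
  mw.movie_id,
  seg.segment_index,
  seg.start_second,
  seg.end_second,
  seg.label
FROM user_movie_activity
LATERAL VIEW EXPLODE(movies_watched) m AS mw
LATERAL VIEW EXPLODE(mw.viewing_segments) s AS seg;
```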
Takeaways
- Combine techniques by exploding the outer array first and the inner one second.
- Keep hierarchical identifiers (user_id, session_id, movie_id) at each step.
- Avoid stacking multiple EXPLODE functions in a single SELECT statement, as this can obscure logic and unintentionally multiply rows.
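With the segments flattened, per-movie engagement metrics also become plain aggregations. A sketch, assuming viewing segments do not overlap:

```sql
-- Total seconds actually viewed per movie
WITH movies AS (
  SELECT
    user_id,
    EXPLODE(movies_watched) AS mw
  FROM user_movie_activity
),
segments_flat AS (
  SELECT
    user_id,
    mw.movie_id,
    EXPLODE(mw.viewing_segments) AS seg
  FROM movies
)
SELECT
  movie_id,
  SUM(seg.end_second - seg.start_second) AS total_seconds_viewed
FROM segments_flat
GROUP BY movie_id;
```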
From Nested to Readable
Handling nested data is an essential skill in Spark SQL. By mastering the two common structures, STRUCT<STRUCT> for dot-walking and ARRAY<STRUCT> for exploding, you can confidently handle even complex, multi-level combinations. Consistent naming, early flattening, and clear CTEs make queries more readable, reliable, and maintainable regardless of data complexity.
What’s Next
So far, the schema of the data has remained constant. In the next part of this series, we will explore schema evolution, which refers to managing structural changes in nested fields when new attributes are added, removed, or modified. We will also explain why built-in schema evolution in Iceberg or Delta might not always be sufficient and show how to manually align changing schemas while keeping all historical data using UNION ALL.
This post is Part 1 of a three-part series on working with nested data in Spark SQL.
Part 2 covers schema evolution and manual alignment.
Part 3 focuses on testing and masking nested data while preserving structure.
