Optimizing Databricks Performance: Liquid Clustering and Deletion Vectors in Practice

Introduction

Teams running Databricks in production face the classic triangle: performance, cost, and operational effort. Traditional tuning approaches like partitioning strategies, periodic OPTIMIZE, ZORDER BY, cluster sizing, and shuffle tuning still work – but they require continuous rework as data volumes grow and query patterns change.

Databricks keeps pushing Delta Lake performance forward. While Photon has been around for years (it became the default for newly created Databricks SQL endpoints back in 2021), two features are particularly relevant for most modern lakehouse projects:

Liquid Clustering is a modern data layout optimization that replaces manual partitioning and ZORDER, allowing your layout to evolve without full rewrites.

Deletion Vectors (DV) provide “soft-delete” metadata that helps avoid rewriting entire Parquet files for row-level changes and powers faster updates on Photon-enabled compute.

What Problems Do Liquid Clustering and Deletion Vectors Solve?

In practice, these features address different challenges:

  1. Slow selective reads: filters scan too many files or row groups
  2. Expensive updates, deletes, and merges: small changes rewrite large files

Liquid Clustering mainly targets problem (1), while Deletion Vectors target problem (2) – and you’ll often use both together.

Feature Overview

Liquid Clustering (LC)

Liquid Clustering organizes data based on clustering keys and improves data skipping for filters on those keys. It is designed to replace partitioning and ZORDER, and it’s not compatible with them. Liquid Clustering is generally available for Delta tables with Databricks Runtime 15.2+ and can be used for all new tables, including streaming tables and materialized views.

How it works: Liquid Clustering groups related data physically on storage so selective queries scan fewer files. Unlike static partitioning, clustering keys can be adjusted later without completely rewriting the existing data structure [1].
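To build intuition for why this layout matters, here is a minimal, hypothetical sketch (plain Python, not Databricks internals) of file-level data skipping: each file carries min/max statistics per column, and any file whose range cannot contain matching rows is never opened. Clustering keeps related values together, so these ranges stay narrow and pruning stays effective.

```python
# Illustrative sketch only -- toy file statistics, not the Delta log format.
# Each "file" records the (min, max) range it holds for customer_id.
files = [
    {"name": "part-0", "customer_id": (0, 24_999)},
    {"name": "part-1", "customer_id": (25_000, 49_999)},
    {"name": "part-2", "customer_id": (50_000, 74_999)},
    {"name": "part-3", "customer_id": (75_000, 99_999)},
]

def files_to_scan(files, column, value):
    """Keep only files whose [min, max] range can contain `value`."""
    return [f["name"] for f in files if f[column][0] <= value <= f[column][1]]

print(files_to_scan(files, "customer_id", 12345))  # → ['part-0']
```

With well-clustered data, a selective filter touches one file out of four; with random placement, every file's range would overlap the predicate and nothing could be pruned.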

Deletion Vectors (DV)

Deletion Vectors store row-level changes as metadata (deleted or updated rows) rather than immediately rewriting every affected data file. Databricks uses DVs to power Predictive I/O for updates on Photon-enabled compute.

The principle: instead of rewriting the entire Parquet file when you DELETE, UPDATE, or MERGE a row, a DV marks the row as modified. The current table state is then resolved during reads by applying the modifications recorded in the deletion vector.
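The principle can be sketched in a few lines of plain Python (an illustrative model, not the actual Delta Lake format): the deletion vector is a set of row positions, DELETE only adds positions to it, and reads filter those positions out on the fly.

```python
# Illustrative sketch only -- a deletion vector modeled as a position bitmap.
rows = [{"id": i, "amount": i * 10} for i in range(6)]  # the immutable "file"
deletion_vector = set()                                 # soft-deleted positions

def delete_where(predicate):
    """Record matching row positions in the vector -- no file rewrite."""
    for pos, row in enumerate(rows):
        if predicate(row):
            deletion_vector.add(pos)

def read_table():
    """Resolve the current table state at read time by applying the vector."""
    return [r for pos, r in enumerate(rows) if pos not in deletion_vector]

delete_where(lambda r: r["id"] >= 4)
print(len(read_table()))  # → 4 visible rows; the underlying data is untouched
```

The cost of the delete is proportional to the number of marked rows, not to the size of the files they live in – the same asymmetry the benchmark below makes visible.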

Important: Enabling DV upgrades the table protocol, so older clients may not be able to read the table [2].

Photon

Photon is Databricks’ native vectorized query engine for faster SQL and DataFrame execution. While not “new,” it’s critical because several modern optimizations (like Predictive I/O on Azure Databricks) are explicitly Photon-based.

Note: Predictive I/O is Databricks’ umbrella term for Photon-only runtime optimizations that make data interactions faster. It covers (1) accelerated reads, which speed up scanning and filtering, and (2) accelerated updates, which reduce full file rewrites for DELETE, UPDATE, and MERGE by leveraging Deletion Vectors on supported Photon-enabled compute.

Hands-on Lab: Liquid Clustering and Deletion Vectors in Practice

Prerequisites

  • Workspace with DBR 15.2+ available
  • Photon-enabled compute option:
    • Databricks SQL Warehouse (Photon typically enabled), or
    • Cluster with Photon enabled

Creating Schema and Dataset

First, let’s create a demo schema and generate a reproducible dataset with PySpark. The dataset simulates a fact table with:

  • High-cardinality customer_id
  • Time filtering on event_date
  • Numeric amount value
CREATE SCHEMA IF NOT EXISTS performance_demo;
USE performance_demo;

from pyspark.sql.functions import rand, date_add, lit
from datetime import datetime

# Generate a 200M-row dataset (fixed seeds make the data reproducible)
spark.range(0, 200_000_000) \
  .withColumn("customer_id", (rand(seed=1) * 100_000).cast("int")) \
  .withColumn("event_date",
    date_add(lit(datetime(2024, 1, 1)), (rand(seed=2) * 365).cast("int"))) \
  .withColumn("amount", (rand(seed=3) * 1000).cast("decimal(10,2)")) \
  .write.format("delta").mode("overwrite") \
  .saveAsTable("sales_unclustered")
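To get a baseline feel for the table’s physical layout before any optimization, Delta’s DESCRIBE DETAIL command reports, among other things, the file count (numFiles) and total size (sizeInBytes):

```sql
-- Inspect file count and size of the freshly written table
DESCRIBE DETAIL sales_unclustered;
```

These numbers are the reference point for the pruning and rewrite metrics compared later in the lab.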

Liquid Clustering in Practice

What Is Liquid Clustering and When Should You Use It?

Liquid Clustering replaces partitioning and ZORDER and allows you to redefine clustering keys without completely rewriting the existing data layout. It’s explicitly not compatible with partitioning or ZORDER.

Databricks enables LC via the CLUSTER BY (...) clause:

CREATE TABLE sales_lc
CLUSTER BY (customer_id, event_date)
AS SELECT * FROM sales_unclustered;
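Because clustering keys are metadata rather than a physical directory structure, they can be changed later without an immediate rewrite of existing data – subsequent OPTIMIZE runs apply the new layout incrementally. A sketch using the documented ALTER TABLE syntax:

```sql
-- Re-cluster on a different key; existing files are reorganized lazily
ALTER TABLE sales_lc CLUSTER BY (event_date);

-- Or turn clustering off entirely
ALTER TABLE sales_lc CLUSTER BY NONE;
```

This is the main operational advantage over static partitioning, where changing the partition column means rewriting the table.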

Triggering Clustering with OPTIMIZE

Liquid Clustering is incremental and typically applied via OPTIMIZE:

OPTIMIZE sales_unclustered;
OPTIMIZE sales_lc;

Running a Selective Query and Comparing Query Performance

Let’s choose a realistic predicate (high-cardinality customer + narrow date range):

SELECT SUM(amount) AS total_amount
FROM sales_unclustered
WHERE customer_id = 12345 
  AND event_date BETWEEN '2024-06-01' AND '2024-06-30';

SELECT SUM(amount) AS total_amount
FROM sales_lc
WHERE customer_id = 12345 
  AND event_date BETWEEN '2024-06-01' AND '2024-06-30';

The same query on sales_unclustered vs. sales_lc shows clear differences:

Databricks Liquid Clustering performance comparison between sales_unclustered and sales_lc:

Metric                           sales_unclustered    sales_lc (Liquid Clustering)
Wall-clock duration              8 s 158 ms           806 ms
% of bytes pruned during scan    0%                   88%
Bytes read                       1,020 MB             291 MB
Rows read                        3                    3
Files read                       12                   2
% of files pruned during scan    0%                   82%

Observation: Liquid Clustering accelerated the query by more than 10x (from 8.2 seconds down to 806 milliseconds). The improvement stems directly from aggressive data pruning: 88% of bytes and 82% of files were skipped during the scan phase. The query scanned only 291 MB across 2 files instead of 1,020 MB across 12 files – exactly the behavior you want for selective predicates with high-cardinality filters like customer_id combined with date ranges. The same 3 rows were returned, but at a fraction of the I/O cost.

Deletion Vectors in Practice

What Are Deletion Vectors?

Deletion Vectors record row-level changes as metadata and apply them physically later during rewrite or maintenance operations (e.g., OPTIMIZE).

Creating Two Identical Tables: DV Off vs. DV On

To enable DV, we set the Delta property delta.enableDeletionVectors:

-- DV OFF
CREATE TABLE dv_off
AS SELECT * FROM sales_unclustered
LIMIT 200000000;

-- DV ON
CREATE TABLE dv_on
TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
AS SELECT * FROM sales_unclustered
LIMIT 200000000;

Compacting Tables before Testing

OPTIMIZE dv_off;
OPTIMIZE dv_on;

Running a Small DELETE and Comparing Query Performance

Let’s execute a small DELETE operation:

DELETE FROM dv_off
WHERE customer_id > 123456 AND event_date = '2024-01-07';

DELETE FROM dv_on
WHERE customer_id > 123456 AND event_date = '2024-01-07';

The operation metrics show clear differences:

Deletion Vectors in Databricks: comparing operation metrics with and without DV enabled.

Metric                       dv_off (Deletion Vectors Off)    dv_on (Deletion Vectors On)
Total wall-clock duration    16 s 684 ms                      3 s 333 ms
Rows read                    200,534,415                      1
Bytes read                   3.09 GB                          763 MB
Bytes written                2.25 GB                          0
Files read                   24                               12
Files written                24                               0
Rows written                 199,465,585                      0
Deletion vectors added       0                                1

Observation: With DV OFF, the delete operation took 16.7 seconds and required significant data rewriting (24 files written, 2.25 GB of data written, nearly 200 million rows rewritten). With DV ON, the same delete completed in just 3.3 seconds – approximately 5x faster – with zero files or bytes written. Instead of rewriting data files, Databricks recorded the deletion via a single deletion vector, avoiding the expensive rewrite operation entirely while still processing the deletion logic (evident in the 763 MB read for scanning).

Advantages and Limitations

Advantages

Liquid Clustering:

  • Dynamic adaptation of data layout without full rewrites
  • Significantly reduced I/O through improved data skipping
  • Simplified maintenance compared to manual partitioning + ZORDER
  • Compatible with streaming tables and materialized views

Deletion Vectors:

  • Faster row-level changes since not every change forces immediate full-file rewrites
  • Enables/boosts update acceleration features on Photon compute (Predictive I/O on Azure Databricks)
  • Drastic reduction in bytes rewritten for DELETE/UPDATE/MERGE operations

Limitations

Liquid Clustering:

  • Not compatible with traditional partitioning or ZORDER
  • Requires DBR 15.2+ for production use
  • Initial setup must carefully choose clustering keys (though they can be adjusted later)

Deletion Vectors:

  • Enabling DV upgrades the table protocol – older clients or applications may not be able to read the table
  • Manifest generation limitations exist when DV is present (requires purge first – REORG TABLE ... APPLY (PURGE))
  • Platform-specific restrictions can apply

Note: A manifest generation workflow is the process used to make Delta tables readable by query engines that don’t understand Delta’s transaction log.
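If such an external-reader workflow is required on a DV-enabled table, the deletion vectors must be physically applied first. A sketch of the documented commands:

```sql
-- Rewrite affected files so soft-deleted rows are physically removed
REORG TABLE dv_on APPLY (PURGE);

-- Then generate the manifest for engines reading via symlink manifests
GENERATE symlink_format_manifest FOR TABLE dv_on;
```

Note that the purge temporarily reintroduces the rewrite cost that DVs normally defer, so it is best scheduled as a maintenance operation.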

Conclusion

Liquid Clustering and Deletion Vectors are two powerful features in the Databricks platform that each address different performance challenges. Liquid Clustering optimizes selective reads through intelligent data layout, while Deletion Vectors accelerate expensive row-level operations.

For production environments, a combined strategy is often sensible: Liquid Clustering for frequently filtered columns and Deletion Vectors for tables with regular updates or deletes. The practical benchmarks show that both features deliver significant performance improvements – Liquid Clustering achieved roughly a 70–90% reduction in scanned data, while Deletion Vectors reduced delete time by about 80%.

The investment in properly configuring these features pays off long-term through lower costs, better performance, and reduced operational effort. With Databricks Runtime 15.2+, data engineering teams have a modern toolbox to elevate lakehouse performance to a new level.

References

[1] How Liquid Clustering Actually Works – Data Engineer Wiki

[2] Deletion Vectors Documentation – Databricks