Optimizing Databricks Performance: Liquid Clustering and Deletion Vectors in Practice



Introduction

Teams running Databricks in production face the classic triangle: performance, cost, and operational effort. Traditional tuning approaches like partitioning strategies, periodic OPTIMIZE, ZORDER BY, cluster sizing, and shuffle tuning still work – but they require continuous rework as data volumes grow and query patterns change.

Databricks keeps pushing Delta Lake performance forward. While Photon has been around for years (it became the default for newly created Databricks SQL endpoints back in 2021), two features are particularly relevant for most modern lakehouse projects:

Liquid Clustering is a modern data layout optimization that replaces manual partitioning and ZORDER, allowing your layout to evolve without full rewrites.

Deletion Vectors (DV) provide "soft-delete" metadata that helps avoid rewriting entire Parquet files for row-level changes and powers faster updates on Photon-enabled compute.

What Problems do Liquid Clustering and Deletion Vectors Solve?

In practice, these features address different challenges:

  1. Slow selective reads: Filters scan too many files or row groups
  2. Expensive updates, deletes, and merges: Small changes rewrite large files

Liquid Clustering mainly targets problem (1), while Deletion Vectors target problem (2) – and you’ll often use both together.

Feature Overview

Liquid Clustering (LC)

Liquid Clustering organizes data based on clustering keys and improves data skipping for filters on those keys. It’s designed to replace partitioning and ZORDER, and it’s not compatible with them. Liquid Clustering is generally available for Delta tables with Databricks Runtime 15.2+ and can be used for all new tables, including streaming tables and materialized views.

How it works: Liquid Clustering groups related data physically on storage so selective queries scan fewer files. Unlike static partitioning, clustering keys can be adjusted later without completely rewriting the existing data structure [1].
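
To build intuition for why co-locating related data helps, here is a deliberately simplified pure-Python sketch – not the Databricks implementation. Data skipping prunes files whose min/max statistics for a filter column cannot match the predicate; clustering tightens those per-file ranges. The file names and customer_id ranges below are hypothetical.

```python
def prune_files(file_stats, lo, hi):
    """Keep only files whose [min, max] range can contain values in [lo, hi]."""
    return [f for f in file_stats if not (f["max"] < lo or f["min"] > hi)]

# Clustered layout: tight, mostly non-overlapping key ranges per file.
clustered = [
    {"name": "part-0", "min": 0,      "max": 9_999},
    {"name": "part-1", "min": 10_000, "max": 19_999},
    {"name": "part-2", "min": 20_000, "max": 29_999},
]

# Unclustered layout: every file spans almost the full key range.
unclustered = [
    {"name": "part-0", "min": 3, "max": 99_997},
    {"name": "part-1", "min": 1, "max": 99_999},
    {"name": "part-2", "min": 0, "max": 99_995},
]

# Predicate: customer_id = 12345
print(len(prune_files(clustered, 12_345, 12_345)))    # 1 file must be scanned
print(len(prune_files(unclustered, 12_345, 12_345)))  # 3 files must be scanned
```

The same predicate touches one file in the clustered layout but every file in the unclustered one – which is exactly the pruning effect measured later in the hands-on lab.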

Deletion Vectors (DV)

Deletion Vectors store row-level changes as metadata (deleted or updated rows) rather than immediately rewriting every affected data file. Databricks uses DV to power Predictive I/O for updates on Photon-enabled compute.

The principle: Instead of rewriting the entire Parquet file when you DELETE, UPDATE, or MERGE a row, DV marks the row as modified. The current table state is then resolved during reads by applying modifications from the deletion vector.
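
The read-time resolution can be sketched in a few lines of pure Python – an illustrative model, not Delta’s actual on-disk format, which stores deletion vectors as compressed bitmaps alongside the transaction log:

```python
# Immutable data file (stands in for a Parquet file that is never rewritten).
data_file = ["row-0", "row-1", "row-2", "row-3", "row-4"]

# Deletion vector: the set of row positions marked as deleted.
deletion_vector = set()

def delete_rows(positions):
    # A DELETE only appends metadata; the data file itself is untouched.
    deletion_vector.update(positions)

def read_table():
    # Readers resolve the current state by filtering out deleted positions.
    return [row for i, row in enumerate(data_file) if i not in deletion_vector]

delete_rows({1, 3})
print(read_table())  # ['row-0', 'row-2', 'row-4']
print(len(data_file))  # 5 – the underlying file still holds all rows
```

Deleting two rows here changes only the metadata set; the expensive file rewrite is deferred until maintenance runs.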

Important: Enabling DV upgrades the table protocol, so older clients may not be able to read the table [2].

Photon

Photon is Databricks’ native vectorized query engine for faster SQL and DataFrame execution. While not "new," it’s critical because several modern optimizations (like Predictive I/O on Azure Databricks) are explicitly Photon-based.

Note: Predictive I/O is Databricks’ umbrella term for Photon-only runtime optimizations that make data interactions faster. It covers (1) accelerated reads, which speed up scanning and filtering, and (2) accelerated updates, which reduce full file rewrites for DELETE, UPDATE, and MERGE by leveraging Deletion Vectors on supported Photon-enabled compute.

Hands-on Lab: Liquid Clustering and Deletion Vectors in Practice

Prerequisites

  • Workspace with DBR 15.2+ available
  • Photon-enabled compute option:
    • Databricks SQL Warehouse (Photon typically enabled), or
    • Cluster with Photon enabled

Creating Schema and Dataset

First, let’s create a demo schema and generate a reproducible dataset with PySpark. The dataset simulates a fact table with:

  • High-cardinality customer_id
  • Time filtering on event_date
  • Numeric amount value

CREATE SCHEMA IF NOT EXISTS performance_demo;
USE performance_demo;

Then generate the dataset with PySpark. Fixed seeds for rand() keep the dataset reproducible across runs:

from datetime import datetime

from pyspark.sql.functions import date_add, lit, rand

# Generate a 200-million-row fact table
spark.range(0, 200_000_000) \
  .withColumn("customer_id", (rand(seed=1) * 100_000).cast("int")) \
  .withColumn("event_date",
    date_add(lit(datetime(2024, 1, 1)), (rand(seed=2) * 365).cast("int"))) \
  .withColumn("amount", (rand(seed=3) * 1000).cast("decimal(10,2)")) \
  .write.format("delta").mode("overwrite") \
  .saveAsTable("sales_unclustered")

Liquid Clustering in Practice

What is Liquid Clustering and When Should You Use It?

Liquid Clustering replaces partitioning and ZORDER and allows you to redefine clustering keys without completely rewriting the existing data layout. It’s explicitly not compatible with partitioning or ZORDER.

Databricks enables LC via the CLUSTER BY (...) clause:

CREATE TABLE sales_lc
CLUSTER BY (customer_id, event_date)
AS SELECT * FROM sales_unclustered;

Triggering Clustering with OPTIMIZE

Liquid Clustering is incremental and typically applied via OPTIMIZE. We also run OPTIMIZE on the unclustered table so that both tables are compacted before the comparison (without clustering keys, OPTIMIZE performs plain compaction):

OPTIMIZE sales_unclustered;
OPTIMIZE sales_lc;
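
"Incremental" here means that a subsequent OPTIMIZE run only needs to process data written since the last run, instead of rewriting the whole table. A deliberately simplified pure-Python sketch of that behavior (not the actual Delta mechanism, which tracks this via file-level metadata):

```python
def optimize(files):
    """Cluster only the files that are not yet clustered; return how many were rewritten."""
    rewritten = 0
    for f in files:
        if not f["clustered"]:
            f["clustered"] = True  # stand-in for rewriting the file in clustered order
            rewritten += 1
    return rewritten

# Two files already clustered by an earlier run, one newly ingested file.
files = [{"clustered": True}, {"clustered": True}, {"clustered": False}]
print(optimize(files))  # 1 – only the new file is rewritten
print(optimize(files))  # 0 – a second run has nothing left to do
```

This is why scheduling OPTIMIZE regularly stays cheap on clustered tables: steady-state runs touch only the fresh data.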

Running a Selective Query and Comparing Query Performance

Let’s choose a realistic predicate (high-cardinality customer + narrow date range):

SELECT SUM(amount) AS total_amount
FROM sales_unclustered
WHERE customer_id = 12345 
  AND event_date BETWEEN '2024-06-01' AND '2024-06-30';

SELECT SUM(amount) AS total_amount
FROM sales_lc
WHERE customer_id = 12345 
  AND event_date BETWEEN '2024-06-01' AND '2024-06-30';

The same query on sales_unclustered vs. sales_lc shows clear differences:

Databricks Liquid Clustering performance comparison between sales_unclustered and sales_lc:

Metric                        | sales_unclustered | sales_lc (Liquid Clustering)
Wall-clock duration           | 8 s 158 ms        | 806 ms
% of bytes pruned during scan | 0%                | 88%
Bytes read                    | 1,020 MB          | 291 MB
Rows read                     | 3                 | 3
Files read                    | 12                | 2
% of files pruned during scan | 0%                | 82%

Observation: Liquid Clustering didn’t just reduce runtime – it dramatically accelerated query performance by over 10x (from 8.2 seconds down to 806 milliseconds). This improvement stems directly from aggressive data pruning: 88% of bytes and 82% of files were skipped during the scan phase. The query scanned only 291 MB across 2 files instead of 1,020 MB across 12 files, demonstrating exactly the behavior you want for selective predicates with high-cardinality filters like customer_id combined with date ranges. The same 3 rows were returned, but with a fraction of the I/O cost.

Deletion Vectors in Practice

What Are Deletion Vectors?

Deletion Vectors record row-level changes as metadata and apply them physically later during rewrite or maintenance operations (e.g., OPTIMIZE).
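
Continuing the simplified model from above: the "apply them physically later" step can be sketched as a compaction that rewrites the file without the deleted rows and drops the now-redundant vector. Again an illustrative pure-Python sketch, not the actual Delta maintenance code:

```python
def compact(data_file, deletion_vector):
    """Materialize soft deletes: write a new file without deleted rows, drop the vector."""
    new_file = [row for i, row in enumerate(data_file) if i not in deletion_vector]
    return new_file, set()  # fresh data file, empty deletion vector

old_file = ["row-0", "row-1", "row-2", "row-3", "row-4"]
dv = {1, 3}  # rows previously soft-deleted

new_file, dv = compact(old_file, dv)
print(new_file)  # ['row-0', 'row-2', 'row-4']
print(dv)        # set() – no vector needs to be applied at read time anymore
```

After compaction, reads no longer pay the cost of applying the vector – which is why maintenance like OPTIMIZE (or an explicit purge) still matters on DV-enabled tables.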

Creating Two Identical Tables: DV Off vs. DV On

To control DV, we set the Delta property delta.enableDeletionVectors:

-- DV OFF (disabled explicitly, since newer runtimes may enable DV by default)
CREATE TABLE dv_off
TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')
AS SELECT * FROM sales_unclustered
LIMIT 200000000;

-- DV ON
CREATE TABLE dv_on
TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
AS SELECT * FROM sales_unclustered
LIMIT 200000000;

Compacting Tables before Testing

OPTIMIZE dv_off;
OPTIMIZE dv_on;

Running a Small DELETE and Comparing Operation Metrics

Let’s execute a small DELETE operation:

DELETE FROM dv_off
WHERE customer_id > 123456 AND event_date = '2024-01-07';

DELETE FROM dv_on
WHERE customer_id > 123456 AND event_date = '2024-01-07';

The operation metrics show clear differences:

Deletion Vectors in Databricks: operation metrics with and without DV enabled:

Metric                    | dv_off (Deletion Vectors Off) | dv_on (Deletion Vectors On)
Total wall-clock duration | 16 s 684 ms                   | 3 s 333 ms
Rows read                 | 200,534,415                   | 1
Bytes read                | 3.09 GB                       | 763 MB
Bytes written             | 2.25 GB                       | 0
Files read                | 24                            | 12
Files written             | 24                            | 0
Rows written              | 199,465,585                   | 0
Deletion vectors added    | 0                             | 1

Observation: With DV OFF, the delete operation took 16.7 seconds and required significant data rewriting (24 files written, 2.25 GB of data written, nearly 200 million rows rewritten). With DV ON, the same delete completed in just 3.3 seconds – approximately 5x faster – with zero files or bytes written. Instead of rewriting data files, Databricks recorded the deletion via a single deletion vector, avoiding the expensive rewrite operation entirely while still processing the deletion logic (evident in the 763 MB read for scanning).

Advantages and Limitations

Advantages

Liquid Clustering:

  • Dynamic adaptation of data layout without full rewrites
  • Significantly reduced I/O through improved data skipping
  • Simplified maintenance compared to manual partitioning + ZORDER
  • Compatible with streaming tables and materialized views

Deletion Vectors:

  • Faster row-level changes since not every change forces immediate full-file rewrites
  • Enables/boosts update acceleration features on Photon compute (Predictive I/O on Azure Databricks)
  • Drastic reduction in bytes rewritten for DELETE/UPDATE/MERGE operations

Limitations

Liquid Clustering:

  • Not compatible with traditional partitioning or ZORDER
  • Requires DBR 15.2+ for production use
  • Initial setup must carefully choose clustering keys (though they can be adjusted later)

Deletion Vectors:

  • Enabling DV upgrades the table protocol – older clients or applications may not be able to read the table
  • Manifest generation limitations exist when DV is present (requires a purge first – REORG TABLE ... APPLY (PURGE))
  • Platform-specific restrictions can apply

Note: A manifest generation workflow is the process used to make Delta tables readable by query engines that don’t understand Delta’s transaction log.

Conclusion

Liquid Clustering and Deletion Vectors are two powerful features in the Databricks platform that each address different performance challenges. Liquid Clustering optimizes selective reads through intelligent data layout, while Deletion Vectors accelerate expensive row-level operations.

For production environments, a combined strategy is often worthwhile: Liquid Clustering for frequently filtered columns and Deletion Vectors for tables with regular updates or deletes. The practical benchmarks above show that both features deliver significant improvements – Liquid Clustering pruned 88% of bytes during the scan, while Deletion Vectors cut the delete’s runtime by roughly 80%.

The investment in properly configuring these features pays off long-term through lower costs, better performance, and reduced operational effort. With Databricks Runtime 15.2+, data engineering teams have a modern toolbox to elevate lakehouse performance to a new level.

References

[1] How Liquid Clustering Actually Works – Data Engineer Wiki

[2] Deletion Vectors Documentation – Databricks