In today's constantly evolving tech landscape, we have witnessed tremendous progress in data architectures for data management. This journey started with the structured confines of data warehousing, offering a systematic approach to organising and managing structured data. But due to the coupling of storage and compute, storage costs were high, and what's more, you couldn't store semi-structured or unstructured data. To overcome these limitations, the data lake was born, offering a cost-effective, scalable solution for all types of data. Nevertheless, challenges persisted in terms of reliability, consistency, and performance.

The solution to these challenges was the lakehouse, a concept presented by Databricks in their paper Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, now broadly adopted across the industry. We have covered this concept in many of our previous blog posts and webinars, so it may well be familiar to you! In this blog post, we are going to see how Delta Lake is emerging as a key player in the lakehouse revolution.

What is Delta Lake?

Delta Lake, from Databricks, is an open-source project with a mission to address the shortcomings of traditional data lakes. It first emerged to tackle issues such as data reliability, performance optimisation, and transaction consistency, effectively becoming a cornerstone of the lakehouse concept.

Delta Lake introduces groundbreaking concepts in data lakes, incorporating ACID transactions for robustness, scalable metadata handling for efficiency, as well as compatibility with both batch and streaming data.

Delta Lake is an open format built upon standard data formats stored in any data lake solution from any cloud provider. It seamlessly integrates with the medallion architecture, improving the structure and quality of data as it flows through each layer, providing reliable, clean data for users and applications to perform analytics or machine learning.

Figure 1: Delta Lake Diagram

The endorsement of Delta Lake by industry giants like Databricks, and its integration into Microsoft Fabric, highlight its importance. As organisations seek solutions to enhance their data infrastructure, Delta Lake stands out as a preferred option thanks to its broad range of integrations and capabilities.

Figure 2: Delta Lake Integrations

In the following section, we'll delve into the intricacies of Delta Lake, exploring some of the features and capabilities that make it a game-changer in the realm of data management.

Delta Lake Features

Delta Lake merges open storage formats (like Parquet) for storing structured, semi-structured, and unstructured data, with a data management layer that includes a transaction log in a folder named _delta_log, where individual commits are stored as JSON files and periodic checkpoints as Parquet files. Delta Lake also supports both batch and streaming operations.
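As a minimal sketch of what this log looks like (assuming a Delta table written to a hypothetical local path), you can simply list and parse the commit files:

import json
import os

# Hypothetical example: inspect the transaction log of a Delta table stored on a local path
log_dir = "path/to/delta-table/_delta_log"

# Commit files are numbered sequentially, e.g. 00000000000000000000.json, 00000000000000000001.json, ...
for file_name in sorted(os.listdir(log_dir)):
    print(file_name)

# Each commit file is newline-delimited JSON containing actions such as commitInfo, metaData, add and remove
with open(os.path.join(log_dir, "00000000000000000000.json")) as f:
    for line in f:
        print(list(json.loads(line).keys()))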

Figure 3: Databricks: Unpacking The Transaction Log

Tracking transactions allows Delta Lake to offer a wide range of features. Let's check some of them out!

ACID Transactions

As mentioned before, Delta Lake introduces ACID transactions to the underlying open formats. This means that, unlike in formats such as Parquet, where updating or deleting specific records means rewriting the entire dataset, in Delta Lake these operations are seamlessly integrated, just like in a data warehouse.

UPDATE delta_table SET col = 'Synvert' WHERE col IS NULL
DELETE FROM delta_table WHERE col IS NULL

Besides updating and deleting, Delta Lake also enables you to perform upserts using the MERGE statement, which merges data from a source table, view, or DataFrame into a target Delta table. This operation enables inserts, updates, and deletes, automatically ensuring that the source schema is compatible with the target. It compares records on both sides based on a key, and depending on the match, the following operations can be performed:

  • whenMatched: Updates or deletes target data when keys match.
  • whenNotMatched: Inserts source data into the target when keys don't match (the target does not have that record).
  • whenNotMatchedBySource: Updates or deletes target data when keys don't match (the source does not have that record).

The MERGE operation is a powerful tool for performing transformations that require marking previous key values as old and adding new rows as the latest values. Such use cases are common in Silver and Gold layers, and especially in Slowly Changing Dimensions (SCD) Type 2 scenarios.
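As a minimal sketch of such an upsert (assuming a running SparkSession named spark, the delta-spark Python package, and a hypothetical updates_df DataFrame that shares an "id" key and a "name" column with the target table), the PySpark API looks like this:

from delta.tables import DeltaTable

# Hypothetical example: upsert updates_df into an existing Delta table by key
target = DeltaTable.forPath(spark, "path/to/delta-table")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdate(set={"name": "s.name"})   # keys match: update the target row
    .whenNotMatchedInsertAll()                   # key only in source: insert the new row
    .execute())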

This function has seen continuous improvements since its introduction, including data skipping for matched-only records, and in Delta Lake 3.0 its performance has been increased by up to 56%.

Schema Enforcement and Evolution

Schema enforcement ensures that data written to a table adheres to a predefined schema, preventing data corruption by rejecting writes that deviate from what's specified.

When writing data to a Delta table, Delta Lake verifies the incoming data's schema against the table's predefined schema. By default, Delta Lake raises an error if there is a mismatch, preventing the write operation and maintaining data integrity. For instance:

# Write data to Delta Lake with initial schema
data_initial_schema = spark.createDataFrame([(1, "John"), (2, "Alice")], ["id", "name"])
data_initial_schema.write.format("delta").save("path/to/delta-table")


# Attempt to write data with a mismatched schema
data_mismatched_schema = spark.createDataFrame([(1, "John", "Male")], ["id", "name", "gender"])
data_mismatched_schema.write.format("delta").mode("append").save("path/to/delta-table")

In this example, we define a schema for the Delta table with only the "id" and "name" columns. When attempting to append data that contains an additional "gender" column, Delta Lake enforces the schema and triggers an AnalysisException error.

Schema evolution refers to the process of modifying your data's structure over time. In a dynamic data environment, where requirements change and new data attributes need to be accommodated, a system that supports schema evolution is essential.

Delta Lake supports schema evolution when writing to a Delta table with the mergeSchema option set to true. When you append new data to a Delta table with a modified schema, Delta Lake automatically adjusts to accommodate the changes without requiring you to restructure the entire dataset: you can add new columns (or in some cases change the data types of existing columns) without affecting the existing data.

# Evolve the schema and append new column “gender”
data_evolved_schema = spark.createDataFrame([(3, "Bob", "Male"), (4, "Eve", "Female")], ["id", "name", "gender"])
data_evolved_schema.write.option("mergeSchema", "true").format("delta").mode("append").save("path/to/delta-table")

In this code, we evolve the schema by adding a new column, "gender", and append data with the modified structure. Delta Lake effortlessly handles this schema evolution when the option("mergeSchema", "true") write option or the spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true") configuration is enabled.

Change Data Feed (CDF)

The concept of CDF in Delta Lake refers to an improved Change Data Capture (CDC) mechanism that tracks row-level changes made to a table.

In Delta Lake, CDF is not enabled by default, but it is recommended for your tables in the Silver and Gold layers, particularly when you publish changes to external tools, or for audit purposes. Basically, it is recommended whenever you need to track changes in your data.

Once enabled, any altered data resulting from UPDATE, DELETE, and MERGE operations is recorded in a folder named _change_data. INSERT-only statements and full-partition deletes are handled more efficiently and do not write to this folder, as their changes can be computed directly from the transaction log. All these changes can be queried using the table_changes SQL function.
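As a minimal sketch (assuming a running SparkSession named spark and a hypothetical table called delta_table), you could enable CDF on an existing table and then read its changes like this:

# Hypothetical example: enable the Change Data Feed on an existing table
spark.sql("ALTER TABLE delta_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read the recorded row-level changes from version 1 onwards
# (equivalent to SELECT * FROM table_changes('delta_table', 1))
changes_df = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("delta_table"))

changes_df.show()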

Figure 4: Databricks: CDF Diagram

Time Travel

Changing data always brings challenges such as auditing, reproducing experiments during model training, or rolling back after bad data has been stored.

Delta Lake's time travel feature simplifies all these use cases by storing snapshots of your data at each change operation, allowing you to read or restore older versions. For example, you can read an older version by specifying the version number using this SQL statement:

SELECT * FROM delta_table VERSION AS OF 2

To roll back undesired changes, use the RESTORE command:

RESTORE TABLE delta_table TO VERSION AS OF 2

To visualise the different versions corresponding to the applied changes, use this command:

DESCRIBE HISTORY delta_table

This command shows each version number, the operation executed, and additional information such as who executed it and when.
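Time travel is also available from the DataFrame API; as a minimal sketch (assuming a running SparkSession named spark and a hypothetical table path), you can pin a read to a specific version or timestamp:

# Hypothetical example: read an older snapshot of a Delta table by version or by timestamp
df_v2 = (spark.read.format("delta")
    .option("versionAsOf", 2)
    .load("path/to/delta-table"))

df_past = (spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01")
    .load("path/to/delta-table"))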

Stream Processing

Stream processing presents some challenges, such as balancing low latency with high throughput, a constant trade-off. What's more, events in a stream may not always arrive in the order they were produced, and failures and crashes might lead to duplicate or missing messages.

Delta Lake supports streaming sources and sinks, enabling you to read data from streaming platforms and to write streaming data into Delta tables. This is typically done using Apache Spark Structured Streaming, a scalable and fault-tolerant stream processing engine, which uses checkpoints to keep track of the streaming progress and to ensure fault tolerance. Moreover, the Delta transaction log guarantees exactly-once processing, even when there are other streams or batch queries running concurrently against the table.

And finally, when reading from a Delta table as a stream, you can specify a version or a timestamp to determine the starting point of the streaming source, without needing to process the entire table.
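As a minimal sketch (assuming a running SparkSession named spark and hypothetical source, target, and checkpoint paths), streaming from one Delta table into another looks like this:

# Hypothetical example: use a Delta table as a streaming source and another as the sink
stream_df = (spark.readStream
    .format("delta")
    .option("startingVersion", 5)   # start from a specific version instead of the whole table
    .load("path/to/source-delta-table"))

query = (stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "path/to/checkpoints")   # tracks progress and enables exactly-once processing
    .outputMode("append")
    .start("path/to/target-delta-table"))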

Optimisations (Partitioning, Compaction, Clustering)

The traditional optimisation technique to enhance queries over tables in lakehouses is data skipping with partition pruning. This is not a new concept, and it's supported by Delta Lake. However, as this format is already efficient, it is not recommended to create partitions unless each partition is expected to contain at least 1 GB of data. Furthermore, table and partition information, including maximum and minimum values for each column, is stored to automatically improve query performance through data skipping, without the need for any configuration.
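As a minimal sketch (assuming a hypothetical events_df DataFrame with an event_date column), partitioning itself is declared at write time:

# Hypothetical example: write a Delta table partitioned by a date column
(events_df.write
    .format("delta")
    .partitionBy("event_date")   # only worthwhile when each partition holds roughly 1 GB or more
    .mode("overwrite")
    .save("path/to/partitioned-delta-table"))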

Another common issue with partitioning is the accumulation of small files over time, especially when data is added in small batches; this can end up affecting the performance of your table. A typical way of tackling this problem is to reprocess your data with the Spark repartition function to specify the number of files you want in each partition (or across the entire table, if unpartitioned).
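As a minimal sketch of this approach (assuming an unpartitioned Delta table at a hypothetical path), the table can be rewritten into a fixed number of files:

# Hypothetical example: compact an unpartitioned Delta table into a fixed number of larger files
(spark.read.format("delta")
    .load("path/to/delta-table")
    .repartition(8)                    # target number of output files
    .write
    .format("delta")
    .option("dataChange", "false")     # marks the rewrite as a layout-only change for downstream readers
    .mode("overwrite")
    .save("path/to/delta-table"))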

Delta Lake also offers the OPTIMIZE command for compaction, which effectively merges small files into larger ones, enhancing data management and access:

OPTIMIZE '/path/to/delta/table'

Along with OPTIMIZE, you can also specify Z-Ordering, a technique that colocates related information in the same set of files:

OPTIMIZE '/path/to/delta/table' ZORDER BY (column)

Z-Ordering is advisable if you expect a high-cardinality column to be commonly used in your queries.

It is worth mentioning that in Databricks Runtime 13.3 LTS and above, the introduction of the liquid clustering feature eliminates the need for partitioning and Z-Ordering, allowing Databricks to autonomously manage all aspects of optimisation:

CREATE TABLE delta_table(col string) USING DELTA CLUSTER BY (col)

Other Features

It is impossible to cover all the Delta Lake features in this blog post given the extensive range of capabilities on offer, such as generated columns, converting Parquet files to Delta, and a wealth of utility commands to interact with and manage Delta tables, like renaming them, cloning them, and removing old files with VACUUM, amongst others.

You can get more information in the Delta Lake documentation or at the Databricks Academy, where you'll also find interesting material about Databricks itself.

Conclusion

As we close this article on Delta Lake, we invite you to explore, experiment, and integrate its cutting-edge features into your own data ecosystems. The lakehouse era is here, and Delta Lake is at the forefront, driving us towards a future where data management is not just a challenge, but a strategic advantage.

Here at ClearPeaks, we have ample experience in building and managing data platforms with lakehouse architectures, as well as with technologies like Databricks and Microsoft Fabric that leverage the Delta Lake format. Don't miss this opportunity to transform your data strategy from a mere necessity to a competitive advantage by contacting us today and letting us guide you through the intricacies of optimising your data architecture for unparalleled efficiency, performance and results!