Clash of ACID Data Lakes: Comparing Data Lakehouse Formats



Introduction

Apache Hudi, Apache Iceberg, and Delta Lake are state-of-the-art big data storage technologies. These technologies bring ACID transactions to your data lake. ACID is an acronym that refers to the four properties that define a transaction:

Atomicity: Every transaction is treated as a single unit.

Consistency: Every transaction's result is predictable and there are no unexpected results.

Isolation: Concurrent access to and changes of the table are managed so that concurrent transactions do not interfere with each other.

Durability: The changes are saved even if the system fails.

We have developed a configurable framework where we can test the three technologies purely through parameters, without needing to change the code. The benchmarking of these three technologies was done on AWS EMR and the data was stored on AWS S3.

The analytics engine used in our project is Apache Spark. The open-source versions of these technologies were used: Apache Spark 3.1.2, Apache Hudi 0.10.1, Apache Iceberg 0.13.1, and Delta Lake 1.0.0.
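For illustration, the three formats can be enabled side by side in a single Spark 3.1 session through packages and SQL extensions. The sketch below reflects the versions listed above; the exact package coordinates, the Iceberg catalog name, and the S3 warehouse path are indicative rather than the precise configuration of our framework.

```python
from pyspark.sql import SparkSession

# Illustrative only: one Spark 3.1 session with all three formats enabled.
spark = (
    SparkSession.builder
    .appName("lakehouse-format-benchmark")
    .config("spark.jars.packages", ",".join([
        "org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1",
        "org.apache.iceberg:iceberg-spark-runtime-3.1_2.12:0.13.1",
        "io.delta:delta-core_2.12:1.0.0",
    ]))
    # SQL extensions for each format
    .config("spark.sql.extensions", ",".join([
        "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "io.delta.sql.DeltaSparkSessionExtension",
    ]))
    # Delta plugs into the built-in session catalog
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Iceberg gets its own catalog backed by S3 (placeholder name and path)
    .config("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg_catalog.type", "hadoop")
    .config("spark.sql.catalog.iceberg_catalog.warehouse", "s3://my-bucket/iceberg/")
    # Hudi requires Kryo serialization
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```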

Apache Hudi (Hadoop Upserts Deletes and Incrementals)

Apache Hudi was initially developed by Uber in 2016 and open-sourced in 2017. It was contributed to the Apache Software Foundation in 2019. It fully supports Spark in reading and writing operations. Hudi is the trickiest technology that we used in our project. The tuning of Hudi table options is crucial. The choice of these options can boost performance or decrease it drastically.

There are mainly two different types of Hudi tables:

Copy-on-Write tables (CoW): Data is stored in the Parquet format (a columnar file format) and each update creates new files during the write. This type of table is optimized for read-heavy workloads.

Merge-on-Read tables (MoR): Data is stored in a combination of Parquet and Avro formats. Avro is a row-based file format. Writes are faster because changes land in the Avro log files, but reads are slower since those logs have to be merged with the base files at query time.

By default, Hudi uses CoW tables, but this can be changed easily. Even though writing performance is better with MoR tables, we used CoW tables in our project: when trying to update or delete a lot of rows in a Hudi MoR table, we noticed some irregular behavior, which we will describe below.
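For illustration, the table type is selected with a single write option. The following is a minimal sketch, assuming a DataFrame df with placeholder key, precombine, and partition columns (id, ts, and date) and a placeholder S3 path:

```python
# Minimal sketch of a Hudi write; table name, path, and column names are placeholders.
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "date",
    # COPY_ON_WRITE is the default; MERGE_ON_READ favours writes over reads
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save("s3://my-bucket/hudi/events"))
```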

There are two operations to insert records into a Hudi table: a normal insert and a bulk insert. The latter scales very well for huge loads of data.
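Switching between the two insert paths is again just a matter of one option. A minimal sketch, reusing the placeholder options from the previous snippet:

```python
# Only the write operation changes; "bulk_insert" scales better for large initial loads.
hudi_options["hoodie.datasource.write.operation"] = "bulk_insert"  # or "insert" / "upsert"

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save("s3://my-bucket/hudi/events"))
```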

The metadata is stored in a folder generally called .hoodie. In fact, all the information about savepoints, commits, and logs is stored there.

Apache Iceberg

Iceberg was initially released by Netflix and was later donated to the Apache Software Foundation. The version that we used fully supports Spark. Older versions had performance problems with Spark, so we recommend using the latest version of Iceberg. Additionally, Iceberg keeps getting faster: compared to older versions, it has become much quicker lately.

Iceberg does not change data files in place. When a file needs to change, a new one is created with the changes. Iceberg tracks individual data files in a table: every snapshot tracks the files that belong to it and ignores the other files. The state of the table is saved in a metadata file, and with every change to the table a new metadata file is created that replaces the old one. This file stores a lot of information about the table, such as the schema and other properties and configurations. The snapshots are listed in the metadata file, which also points to the snapshot that is currently in use. The files that a snapshot tracks are recorded in manifest files, and every snapshot has its own manifests. This provides total isolation between snapshots.
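Both the snapshot list and time travel are accessible from Spark. The sketch below assumes the Iceberg catalog configured earlier, a placeholder table name, and a placeholder snapshot id:

```python
# Placeholder table in the Iceberg catalog configured earlier.
table = "iceberg_catalog.db.events"

# Every commit creates a snapshot; the snapshots metadata table lists them all.
spark.sql(f"SELECT committed_at, snapshot_id, operation FROM {table}.snapshots").show()

# Time travel: read the table as it was at a given snapshot (placeholder id).
old_state = (spark.read.format("iceberg")
             .option("snapshot-id", 1234567890123456789)
             .load(table))
```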

Delta Lake

Delta Lake was initially maintained by Databricks. It was donated to the Linux Foundation and was recently open-sourced completely. It is an open-source project that provides deep integration with Spark for reading and writing operations, since Databricks was founded by the original creators of Spark. Delta uses metadata just like Iceberg and Hudi. The metadata is called the delta log: for every commit, a JSON file is created in which the operations done in that commit are recorded, so they can be reverted easily while time traveling. After a certain number of commits (generally 10), a Parquet checkpoint file is created, and the previously generated JSON files are rewritten into that Parquet file to optimize access to the logs.
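A minimal sketch of this mechanism, with a placeholder path: every write adds a commit to the delta log, the resulting history can be inspected with SQL, and any earlier version can be read back through time travel.

```python
# Placeholder path; df stands for any DataFrame being written.
delta_path = "s3://my-bucket/delta/events"
df.write.format("delta").mode("overwrite").save(delta_path)

# Each commit is a JSON file under <delta_path>/_delta_log; the history built
# from those commits can be inspected and used for time travel.
spark.sql(f"DESCRIBE HISTORY delta.`{delta_path}`").show()
first_version = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
```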

Benchmarking

We have used an emr-6.6.0 cluster with one master node and two worker nodes, all of them m5.2xlarge. We have used the Spark History Server to do the benchmarking and calculate the execution times. We generated some random data, stored it in JSON format in an S3 bucket, and used these files for the benchmarking.
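The benchmark itself boils down to reading the generated JSON files and writing them out in each format. The snippet below is only a simplified sketch with placeholder paths; the real runs were driven through configuration, and the execution times were taken from the Spark History Server rather than from wall-clock timers.

```python
import time

# Simplified sketch of the benchmark loop; paths and table names are placeholders.
source = spark.read.json("s3://my-bucket/generated/json/")

def timed(label, action):
    start = time.time()
    action()
    print(f"{label}: {time.time() - start:.1f}s")

timed("hudi", lambda: source.write.format("hudi").options(**hudi_options)
      .mode("overwrite").save("s3://my-bucket/hudi/events"))
timed("delta", lambda: source.write.format("delta")
      .mode("overwrite").save("s3://my-bucket/delta/events"))
timed("iceberg", lambda: source.writeTo("iceberg_catalog.db.events").createOrReplace())
```

We obtained the following results: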

Sizes of Resulting Tables
Figure 1: The resulting table sizes per file size (table types: Hudi, Delta, Iceberg, Hudi bulk insert)

The resulting sizes of the Iceberg tables are the smallest, and this will lead to a huge optimization in terms of storage in the long run. Delta and Hudi (bulk insert version) are very close in terms of size, with Delta having a small advantage. The biggest tables are Hudi tables using the normal insert operation; that operation also throws errors and exceptions under a heavy workload.

Insert Operation
Figure 2: The runtime of the creation of the tables per file size (table types: Hudi, Delta, Iceberg, Hudi bulk insert)

Iceberg and Delta give the best performance when creating the tables, with Iceberg a little bit faster. Hudi is the slowest with both of its insert variants, and the normal-insert Hudi tables have the same problems mentioned above.

Update Operation

We have conducted the update benchmarking in two steps.

Figure 3: The runtime of updating 10% of the records, per total number of records (table types: Delta, Iceberg, Hudi bulk insert)

First, we tried updating only 10% of the records. Iceberg is incredibly faster than Hudi and Delta. We used Hudi tables that were created using the bulk insert operation, since we had problems creating them using the normal insert. At first, Hudi was faster than Delta, but with the increase in data size, Delta became slightly faster.

Figure 4: The runtime of updating half of the records, per total number of records (table types: Delta, Iceberg, Hudi bulk insert)

After noticing that Hudi becomes slower as the data grows, we decided to test updating 50% of the records. Iceberg is again much faster than Hudi and Delta. We noticed that with a high workload Hudi becomes slower; in this case, Delta is faster than Hudi. Hudi's runtime for updating half of the records grows exponentially, and it throws errors beyond a certain threshold.
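For reference, the sketch below shows one way such updates can be expressed in each format; the predicate and the updated column are placeholders and not the exact statements generated by our framework. Iceberg and Delta accept SQL UPDATE through their session extensions, while Hudi expects the changed rows to be written back with the upsert operation.

```python
from pyspark.sql.functions import lit

# Sketch only: the predicate and the "status" column are placeholders.
spark.sql("UPDATE iceberg_catalog.db.events SET status = 'updated' WHERE id % 10 = 0")
spark.sql("UPDATE delta.`s3://my-bucket/delta/events` SET status = 'updated' WHERE id % 10 = 0")

# Hudi updates are expressed as an upsert of the changed rows.
changed = (spark.read.format("hudi").load("s3://my-bucket/hudi/events")
           .drop("_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
                 "_hoodie_partition_path", "_hoodie_file_name")  # drop Hudi meta columns
           .filter("id % 10 = 0")
           .withColumn("status", lit("updated")))
(changed.write.format("hudi")
    .options(**{**hudi_options, "hoodie.datasource.write.operation": "upsert"})
    .mode("append")
    .save("s3://my-bucket/hudi/events"))
```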

Remove Operation

Just like the update operation, we did two tests for the remove operation.

Figure 5: The runtime of removing 10% of the records, per total number of records (table types: Delta, Iceberg, Hudi bulk insert)

We removed 10% of the records of every table and noticed that, as always, Iceberg is far more performant than Delta and Hudi. In this case, Hudi is much faster than Delta.

Figure 6: The runtime of removing half of the records, per total number of records (table types: Delta, Iceberg, Hudi bulk insert)

We wanted to check Hudi's behavior when removing half of the records. As expected, and just like with the update operation, Hudi became the slowest of the three technologies while Iceberg stayed the fastest. Hudi's runtime grew rapidly until it threw exceptions, just as it did for updates.
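The deletes follow the same pattern, sketched below with a placeholder predicate: SQL DELETE for Iceberg and Delta, and a write with the delete operation for Hudi.

```python
# Sketch only, with a placeholder predicate and placeholder column names.
spark.sql("DELETE FROM iceberg_catalog.db.events WHERE id % 10 = 0")
spark.sql("DELETE FROM delta.`s3://my-bucket/delta/events` WHERE id % 10 = 0")

# Hudi deletes the records whose keys are written back with the delete operation.
doomed = (spark.read.format("hudi").load("s3://my-bucket/hudi/events")
          .filter("id % 10 = 0")
          .select("id", "date", "ts"))  # key, partition, and precombine columns
(doomed.write.format("hudi")
    .options(**{**hudi_options, "hoodie.datasource.write.operation": "delete"})
    .mode("append")
    .save("s3://my-bucket/hudi/events"))
```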

Conclusion

We have found that Iceberg, with the configurations we used, is the most performant technology. The resulting Iceberg tables are the smallest, which will lead to massive storage savings in the long run. Additionally, insert, upsert, and delete operations are the fastest with Iceberg: it is considerably faster than Hudi and Delta.

The performance of Hudi and Delta is close. We noticed that under a very heavy workload Hudi throws errors and exceptions related to timeouts, which may be linked to the complex configuration Hudi needs in order to work well. The bulk insert operation throws fewer exceptions than the normal insert operation in Hudi.

These results depend heavily on how these technologies are configured, especially Apache Hudi, and they can vary a lot. They should be read as insights into what to expect from each technology.

Our project was based on a data processing framework, developed by us, that offers a pattern for separating the extraction, transformation, and loading logic of an ETL process. We will write a blog post about it soon!

References:

Comparison of Data Lake Table Formats (Iceberg, Hudi, and Delta Lake)

Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared

Using Iceberg tables – Amazon Athena

Apache Iceberg

Hello from Apache Hudi | Apache Hudi

Delta Lake