Which Data Lakehouse Technologies Do Customers Want?



We have already written about Data Lakehouse technologies and compared the most prominent Data Lake Table Formats. All of them have their strengths and weaknesses, but they are ultimately exciting tools that enable us to maintain and utilize data lakes more efficiently.

Here we want to give a rough overview of our customers’ needs and wants when it comes to Data Lake Table Formats and File Formats. It offers a unique perspective on the direction in which the market is currently moving.

Customers and numbers

Synvert has worked with over 250 customers, with success stories in many different industries. Our goal was to take a peek at the preferences behind our customers’ analytical stacks, especially the trend toward emerging lakehouse technologies. So, what are the results of our questionnaire?

Figure 1: What are the customers currently using?
Figure 2: What do the customers plan to use in the future?

Delta Lake had the biggest push among customers, mostly because of its integration with Databricks and Microsoft, who have been using the technology as the default on their platforms. Customers that want to use those platforms in the future, especially Databricks and Spark, tend to go with Delta Lake. Where customers stopped using Delta Lake, it was mostly because of overhead: Spark was more than they needed, and a shift to a modern data warehouse solution was more than enough for their use case. Still, Delta Lake is the most widely used lakehouse storage format, and since the announcement of Delta Lake 3.0 and its compatibility with Apache Iceberg and Apache Hudi, its footprint might only get larger.

Iceberg has big future potential, as it is recommended for Cloudera, Dataiku, and Dremio environments. As an Apache project, it has gathered support from many top companies, especially in the open source community. Many customers plan and want to use Iceberg in future projects (Figure 1, Figure 2).

Hudi is another Apache project, and compared with the other two members of the lakehouse trinity, it is losing popularity among customers (Figure 2). That doesn’t mean it is going away anytime soon. Hudi is integrated in EMR and is still the first choice for many AWS customers.

Other table formats and alternatives include first-generation lakehouses like Hive together with the ORC file format, as well as proprietary implementations and modern data warehouse solutions repurposed for the job, like Redshift, BigQuery, and Snowflake.

Figure 3: What are the customers interested in?
Figure 4: GitHub Star History

In that context, it makes sense that a lot of customers are considering using some of the modern lakehouse solutions in the future. Currently, customers are mostly guided by their existing stack: which solution fits it best and has the lowest maintenance costs.

It is great to see that big tech is supporting all of the main table formats. Especially noteworthy has been the rise of complementary projects like lakeFS, a data version control system for data lakes. Looking at the GitHub Star History (Figure 4) for the projects, it seems to mirror our customers’ interest (Figure 3) for now. There is always the question: will the industry converge? Will Delta Lake leave the Apache projects behind, or will one of them manage to outperform it?