How to Choose Your Data Plat­form – Part 1



In the ever-evolving landscape of data management, choosing the right data platform technology stack is a crucial step for businesses aiming to fully harness the power of their data. Most organisations now consider the cloud to be the best place for their data platforms, thanks to the flexibility and scalability it offers. With numerous options on the market, including AWS, Azure, GCP and Oracle Cloud, as well as more specialised products like Cloudera, Snowflake and Databricks, the decision-making process can be daunting. In this mini-series we’ll shed some light on how best to make that decision.

Here at ClearPeaks we are experts in all these technologies, so don’t hesitate to contact us for further guidance if you are deciding which is the best way to go for your organisation. We also recommend you check out our webinar series “Journey to the Cloud”, where we run through all the above-mentioned technologies, talk about real-life projects that are using them, and present best practices and recommendations.

Data Plat­form Archi­tec­ture Evolution

Let’s start at the beginning. Where do data platforms come from? How have they evolved, and how will they continue to evolve? Let’s take a look at the diagram below, loosely inspired by an image Databricks used in one of their blog posts.

[Figure: evolution of data platform architectures: Data Warehouse, Data Lake + Data Warehouse, Data Lakehouse]

We can see three data plat­form archi­tec­tures: the data ware­house, the data lake plus data ware­house, and the data lake­house. We are not going to go into the details of each of them because there is already plenty of mater­ial out there cov­er­ing them, like the Dat­ab­ricks blog we just ref­er­enced. Please note that although the above dia­gram might lead you to think that data ware­houses are no longer in fash­ion, in real­ity they are still used a lot, and in some scen­arios we would still recom­mend them.

The dia­gram rep­res­ents the evol­u­tion of data plat­forms over the last few dec­ades; how­ever, these plat­forms have not stopped evolving and they will con­tinue to do so. In the last couple of years, data plat­form vendors have star­ted using new terms to talk about their pro­posed plat­form archi­tec­ture, now offer­ing even more fea­tures than the data lakehouse.

In this blog post and the next, we’re going to focus on the architecture and the technical aspects and features of these platforms, not on the data ownership and governance model (centralised vs decentralised, i.e. whether to go for a Data Mesh or not; check out our previous blog post for more information on this).

Data Fab­ric

One of the terms vendors have started using recently is “data fabric” (or variations on it). However, as often happens with new terms, each vendor has a slightly different understanding of what a data fabric is. We discussed the data fabric architecture in a previous blog post, and we summarise it below:

[Figure: Data Fabric architecture: Metadata Activation, Data Integration Layer, AI-Powered Data Governance]

A data fabric is essentially a data lakehouse or virtualisation layer, augmented with AI, streaming, and advanced data governance capabilities. These enhancements go beyond traditional data cataloguing by incorporating AI and metadata activation. Ideally, a data fabric also includes DevOps/DataOps, data lineage, and data quality mechanisms, and supports multi-cloud or hybrid (on-prem and cloud) deployments.

Enter­prise Data Platform

In addi­tion to a data fab­ric, there is another term that is also seen a lot when describ­ing mod­ern data plat­forms – Enter­prise Data Plat­form (or slight vari­ations on it). As with data fab­ric, dif­fer­ent vendors use the term to describe slightly dif­fer­ent things. At ClearPeaks we have been using this term and also “Big Data Plat­form” for a while now, and the pro­posed archi­tec­ture is shown below:

[Figure: Enterprise Data Platform architecture, including DataOps, Data Governance & Security]

As you can see, there’s quite an over­lap between a data fab­ric and the Enter­prise Data Plat­form. A data fab­ric puts more focus on Gov­ernance and AI enhance­ments, whilst an Enter­prise Data Plat­form also includes the present­a­tion layer, from stand­ard report­ing and dash­board­ing to exec­ut­ive and cus­tom exper­i­ences (check out our Observation Deck).

Choos­ing Your Data Plat­form Technology

Whether you’re implementing a data fabric or an enterprise data platform, with either centralised or decentralised ownership and governance, we’ll simply call it a ‘data platform’ from here on. Choosing the right technology stack can be challenging, so we’ve conducted a base assessment that first identifies the key aspects that a data platform should offer, and then evaluates how different technology stacks perform across these aspects.

In this first instalment of our mini-series, we’ll unveil the first part of our base assessment, focusing on the essential aspects a data platform must encompass. In the second instalment, we’ll present some tech vendors suitable for an implementation. However, we won’t be sharing the second part of the assessment itself, which examines how each tech vendor aligns with these aspects. This is to prevent any potential misguidance, as the optimal choice for an organisation is heavily influenced by its unique requirements. That being said, we’d be happy to help you! Leveraging our base assessment, we offer a customised assessment for your organisation, tailored to your specific needs and guiding you towards the right decision. Contact us for more information!

Aspects of a Data Platform

Below you can find the twenty aspects that a data plat­form must be able to cover and that we used in our base assessment:

  1. Data Inges­tion: How the plat­form can bring in data from vari­ous sources, includ­ing both real-time inges­tion and batch ingestion.
  2. Batch Data Engineering: The ability to process large volumes of data in batches in a scalable and simple way, using SQL, Python, or even graphical user interfaces.
  3. Stream­ing: Real-time data pro­cessing cap­ab­il­it­ies includ­ing cach­ing mech­an­isms for events (queue sys­tems) and com­plex event pro­cessing via easy-to-use inter­faces (SQL or even graphical).
  4. Data Store Uni­fic­a­tion: The abil­ity to unify how and where data lay­ers are stored. Does the plat­form need to use dif­fer­ent internal tech­no­lo­gies to store the data depend­ing on its layer (raw vs cur­ated; bronze vs sil­ver vs gold; etc.), or is all the data stored and handled in the same way?
  5. Orches­tra­tion: Man­aging and coordin­at­ing data work­flows seamlessly.
  6. Semantic Mod­el­ling: Cre­at­ing mean­ing­ful rela­tion­ships between dif­fer­ent data entit­ies so that end con­sumers can inter­act with the data in a simple and under­stand­able way.
  7. BI Query Performance Optimisation: How the platform optimises query performance with little or no data engineering. Are mechanisms such as database table indexing, partitioning, or bucketing automatically taken care of by the underlying engine?
  8. Data Visu­al­isa­tion: The tools and cap­ab­il­it­ies to present data in a visu­ally com­pel­ling man­ner, includ­ing reports and inter­act­ive dashboards.
  9. AI Cap­ab­il­it­ies: The integ­ra­tion of arti­fi­cial intel­li­gence and machine learn­ing ser­vices, includ­ing the abil­ity to cre­ate ML mod­els and deploy them (via APIs), or to inter­act with exist­ing mod­els like Large Lan­guage Models.
  10. DevOps: Sup­port for col­lab­or­at­ive and auto­mated soft­ware devel­op­ment pro­cesses applied to both data engin­eer­ing and ware­housing (DataOps) and data sci­ence (MLOps).
  11. Ease of Use: User-friendly inter­faces and intu­it­ive workflows.
  12. Ease of Admin­is­tra­tion: Sim­pli­fy­ing the man­age­ment and main­ten­ance of the platform.
  13. Access Con­trols: Robust secur­ity meas­ures for access not only to data itself but also to the vari­ous entit­ies in a data plat­form (data pipelines, com­pute engines, etc.). This is espe­cially rel­ev­ant with decent­ral­ised ownership.
  14. Data Gov­ernance – Cata­logue: The abil­ity to cre­ate and man­age a com­pre­hens­ive data cata­logue, as well as other entit­ies like data pipelines.
  15. Data Governance – Lineage: Automatically tracking and documenting the origins and transformations of data within the platform, and enabling smart navigation through this metadata.
  16. Data Governance – Data Quality: Automatically ensuring data accuracy and reliability through rules or checks that are created and executed via easy-to-use interfaces.
  17. Multi-Cloud and Hybrid: Support for multi-cloud and hybrid (cloud plus on-prem) architectures.
  18. Open­ness: The extent to which the plat­form sup­ports and pro­motes open-source tech­no­lo­gies and interoperability.
  19. Robust­ness: The matur­ity level of the under­ly­ing tech­no­logy and its components.
  20. Price: Cost-effect­ive­ness and trans­par­ent pricing.
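
One common way to turn a list of aspects like the one above into a comparison is a weighted scoring matrix. The sketch below is only an illustration and is not the ClearPeaks assessment method: the 0 to 5 scoring scale, the default weight of 1.0, the function names `weighted_score` and `rank_stacks`, and the placeholder stacks "Stack A" and "Stack B" with their scores are all assumptions made up for the example.

```python
# The twenty aspects from the list above, in the same order.
ASPECTS = [
    "Data Ingestion", "Batch Data Engineering", "Streaming",
    "Data Store Unification", "Orchestration", "Semantic Modelling",
    "BI Query Performance Optimisation", "Data Visualisation",
    "AI Capabilities", "DevOps", "Ease of Use", "Ease of Administration",
    "Access Controls", "Data Governance - Catalogue",
    "Data Governance - Lineage", "Data Governance - Data Quality",
    "Multi-Cloud and Hybrid", "Openness", "Robustness", "Price",
]


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted average of the per-aspect scores.

    Aspects missing from `weights` get a default weight of 1.0 (assumption).
    Aspects missing from `scores` raise KeyError, so gaps are not hidden.
    """
    total_weight = sum(weights.get(aspect, 1.0) for aspect in ASPECTS)
    if total_weight == 0:
        raise ValueError("at least one aspect needs a non-zero weight")
    weighted_sum = sum(scores[a] * weights.get(a, 1.0) for a in ASPECTS)
    return weighted_sum / total_weight


def rank_stacks(
    stacks: dict[str, dict[str, float]], weights: dict[str, float]
) -> list[tuple[str, float]]:
    """Rank technology stacks from highest to lowest weighted score."""
    ranked = [(name, weighted_score(s, weights)) for name, s in stacks.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder data only: these stacks and scores are invented for illustration.
example_stacks = {
    "Stack A": {aspect: 3.0 for aspect in ASPECTS},
    "Stack B": {**{aspect: 3.0 for aspect in ASPECTS}, "Streaming": 5.0, "Price": 2.0},
}

# Hypothetical organisation that cares most about streaming and price.
example_weights = {"Streaming": 3.0, "Price": 2.0}

for name, score in rank_stacks(example_stacks, example_weights):
    print(f"{name}: {score:.2f}")
```

The weights are where an organisation's own requirements would enter: the same per-aspect scores can produce a different ranking once the weights change, which is why a generic ranking can be misleading.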

Com­ing Next

In this blog post, we’ve explored the evol­u­tion of data plat­form archi­tec­tures at a high level, tra­cing the jour­ney from data ware­houses to the cur­rent trends of data fab­rics and enter­prise data plat­forms. We’ve also intro­duced our base assess­ment, designed to assist organ­isa­tions in select­ing the most suit­able tech­no­logy for their par­tic­u­lar needs, and shared our list of the twenty essen­tial aspects that a data plat­form must cover. In our next blog post, we will show­case vari­ous tech vendors and their solu­tions for build­ing cloud data plat­forms, focus­ing on how they address these twenty aspects. Stay tuned for more insights, and drop us a line if there’s any­thing we can help you with!