Lakehouse architecture has become a solid option for designing your enterprise data platform. This data management framework seamlessly integrates the cost-efficiency and near-limitless scalability of data lakes with the advanced capabilities of data warehousing, including ACID transactions, data governance, and a powerful querying engine. This combination makes Lakehouse data platforms one of the best solutions for handling your BI and ML use cases.

When talking about Lakehouse platforms, the first provider that comes to mind is Databricks, as they introduced the concept in their 2020 paper Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.

In this blog post, we are going to explain the main factors you should consider and the best practices to follow when building a Databricks Lakehouse data platform.

Figure 1: Databricks Data Intelligence Platform

Considerations When Architecting a Data Platform

In our previous two-part post about how to choose your data platform, we explained the variables to consider when designing a data platform. In this section, we are going to delve into the various aspects required to build a Databricks-based data platform properly. However, we should first look at some key points.

The cloud provider you choose to implement your data platform is one of the first considerations. Most companies already have a preferred cloud provider to host most of their IT infrastructure. With Databricks this choice is flexible, as it is multi-cloud and can be used with any, or all, of the public cloud providers (AWS, Azure, and GCP). In this blog post, we'll be using Azure to illustrate certain aspects and to link to official documentation.

One of the first things to consider is which regions your company operates in and where you are going to store and process the data generated in each region. For legal reasons, it is usually obligatory to store private information generated in a specific region in a datacentre within that same region. The number of regions will impact the number of Databricks workspaces needed and how you share information between regions; see Azure geographies to find the available regions.

Another key topic regarding data platforms is the number of environments for developing and productionising your pipelines and data products. It is recommended to have more than one environment – two (dev/pro) or three (dev/pre/pro) – to isolate the productive workloads, where end users interact, from the development carried out by the data teams. Nevertheless, it is common practice to start small with a few use cases in a single environment. Once the team has become familiar with the platform, you can add the remaining environments. In either case, whether starting small or with all the environments ready, it is important to define a naming convention that specifies which environment each resource belongs to.
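
As a purely illustrative sketch, such a convention can be encoded in a small helper so that every team names resources the same way. The prefix-environment-region-name pattern and the codes used here are assumptions, not a Databricks requirement:

```python
# Hypothetical naming convention helper; the pattern and codes are assumptions.
ENVIRONMENTS = {"dev", "pre", "pro"}

def resource_name(resource_prefix: str, env: str, region: str, name: str) -> str:
    """Build a resource name such as 'dbw-pro-weu-sales' for a workspace."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {env}")
    return f"{resource_prefix}-{env}-{region}-{name}".lower()

print(resource_name("dbw", "pro", "weu", "sales"))     # dbw-pro-weu-sales
print(resource_name("adls", "dev", "weu", "landing"))  # adls-dev-weu-landing
```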

A key concept in data management is resource segregation, which involves separating data resources across different environments. Typically, this means replicating the same resources in each environment, albeit with variations in configuration, such as reducing compute power in a development environment compared to production. However, when you have a large team of data professionals and your company is organised into business units (e.g., departments or domains in a Data Mesh), it is often beneficial to mirror this division within your data platform.

Although many factors come into play when designing your data platform, today's post will concentrate on the fundamentals. Let's delve into the key aspects to consider when defining a Databricks platform!

Workspaces

One of the core components in Databricks is the workspace, where Databricks assets are organised and users can access the data objects and computational resources. Designing the right workspace layout is essential for efficient resource management, access control, and troubleshooting. As mentioned above, it is common practice to have one workspace per environment or business unit.

Besides the number of workspaces, the configuration and connectivity of each individual workspace is crucial for security and legal compliance. Although networking per se is a common data platform component, in Databricks there are a couple of configurations worth mentioning:

  • Secure cluster connectivity: No open ports and no public IP addresses (No Public IP/NPIP) in the data plane.
  • VNet injection: Instead of letting Databricks manage the data plane network, use your existing VNet configuration to have full control over it. When defining CIDR ranges for a Databricks workspace, the "rule of 4" should be applied: take the maximum number of concurrent nodes you expect to run, multiply it by 2 for the public and private subnets required, multiply by 2 again to cover instances that are starting up or shutting down, and add the 5 addresses that Azure reserves for each subnet. The resulting number gives you the private IPs required for your Databricks workspace; see the worked example after this list.
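
As a rough illustration of the arithmetic, the sketch below applies the rule of 4 to an assumed peak of 200 concurrent nodes, reading the rule as 5 reserved addresses for each of the two subnets. It is a sizing aid under those assumptions, not a substitute for the official guidance:

```python
import math

def required_private_ips(max_concurrent_nodes: int) -> int:
    # Rule of 4 as described above: 2 subnets (public/private) per node,
    # times 2 again to cover instances spinning up or shutting down,
    # plus the 5 addresses Azure reserves in each of the two subnets.
    return max_concurrent_nodes * 2 * 2 + 2 * 5

def smallest_cidr_prefix(ip_count: int) -> int:
    # Smallest address space (expressed as a /prefix) covering the required IPs.
    return 32 - math.ceil(math.log2(ip_count))

nodes = 200  # assumed peak of concurrent cluster nodes
ips = required_private_ips(nodes)
print(f"{ips} private IPs -> at least a /{smallest_cidr_prefix(ips)} address space")
# 810 private IPs -> at least a /22 address space
```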

Using these configurations is considered best practice, as they effectively increase the security of your Databricks network by blocking public access to your data plane as well as providing full control over the network.

Furthermore, to restrict public access to your Databricks control plane, Databricks provides an IP access lists feature that lets you limit access to your account and workspaces based on specific IP addresses. In conjunction with these capabilities, Azure Network Security Groups allow you to set additional rules to further restrict public access to your resources.
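
As a minimal sketch, assuming a placeholder workspace URL, an admin token, and a documentation-range CIDR, an allow-list could be created through the IP access lists REST API along the following lines (the feature also has to be enabled on the workspace before lists take effect):

```python
import requests

DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<admin-personal-access-token>"  # placeholder; keep real tokens in a secret store

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/ip-access-lists",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "label": "corporate-vpn",            # assumed label
        "list_type": "ALLOW",                # ALLOW or BLOCK
        "ip_addresses": ["203.0.113.0/24"],  # documentation-range CIDR as an example
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```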

Figure 2: Databricks NPIP + VNet injection

For companies where these security measures are not enough to meet their standards or legal obligations, it is also possible to fully restrict public access to your Databricks account by enabling private connectivity between users and Databricks. You can configure your Databricks workspace with Azure Private Link to privately connect the Databricks control plane and the data plane, as well as user access to the workspace. This configuration allows you to dispense with IP access lists and NSG rules, as all connectivity occurs within the private virtual network. However, it is a complex setup, so we won't explore it further here; you can find more information in the official documentation, or you can reach out to us directly to discuss your situation in detail.

Data Governance with the Unity Catalog

Data governance in Databricks is handled with the Unity Catalog, a unified governance solution for managing the access controls, auditing, and lineage of all data assets across workspaces from a central place. Managed tables are registered in the Unity Catalog metastore and are essentially Delta Lake tables stored in cloud storage such as Azure Data Lake Storage Gen2. Note that Databricks supports only one metastore per region.

Unlike the Hive metastore, which has a two-level namespace (schema and table) and lacks access control capabilities, the Unity Catalog metastore introduces a three-level namespace (catalogue, schema, and table), providing greater flexibility in access control.

Typically, catalogues are used to segment data between environments and business units, much like workspaces, by assigning each workspace to its corresponding catalogue. For example, the development workspace might have access to read and write in the development catalogue but be unable to access the production catalogue.
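
A minimal sketch of what such environment-scoped permissions might look like, using assumed catalogue names (dev, prod) and an assumed group (data_engineers), run by an administrator with the relevant privileges:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Development team: full read/write access within the development catalogue.
spark.sql("GRANT USE CATALOG, USE SCHEMA, CREATE SCHEMA ON CATALOG dev TO `data_engineers`")
spark.sql("GRANT SELECT, MODIFY ON CATALOG dev TO `data_engineers`")

# The same group can only read curated data in the production catalogue.
spark.sql("GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG prod TO `data_engineers`")
```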

Figure 3: Unity Catalog three-level namespace

Another governance topic is data sharing. Many businesses have to share data with partners and suppliers who do not necessarily use Databricks, or share data between isolated business units within the same company. To avoid having to replicate your data, Databricks has released the open-source Delta Sharing protocol, in which you define which tables you want to share (shares) and who can consume them with fine-grained access control (recipients).

Delta Sharing is natively integrated with the Databricks Unity Catalog, enabling you to centrally manage and audit your shares. It is particularly useful for sharing specific tables between regions, which are stored in different metastores, without duplicating the data.
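
A minimal sketch of a Databricks-to-Databricks share, with assumed share, table, and recipient names, run from the provider side by a user with the relevant privileges:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Define what to share.
spark.sql("CREATE SHARE IF NOT EXISTS emea_sales_share")
spark.sql("ALTER SHARE emea_sales_share ADD TABLE prod.sales.orders")

# 2. Define who can consume it; for Databricks-to-Databricks sharing the
#    recipient is identified by the sharing identifier of its metastore.
spark.sql(
    "CREATE RECIPIENT IF NOT EXISTS us_region_recipient "
    "USING ID 'azure:eastus:<metastore-uuid>'"  # placeholder sharing identifier
)

# 3. Grant the recipient access to the share.
spark.sql("GRANT SELECT ON SHARE emea_sales_share TO RECIPIENT us_region_recipient")
```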

Figure 4: Delta Sharing diagram

Security

Beyond the networking layout mentioned above and the fine-grained access control of the Unity Catalog, Databricks provides a wide range of mechanisms to secure your data platform. Databricks supports SCIM provisioning using identity providers such as Microsoft Entra ID (formerly Azure Active Directory) for authentication. It is best practice to use SCIM provisioning for users and groups in your Databricks account and to assign them to the corresponding workspaces with identity federation enabled. By default, users can't access any workspace unless access is granted by an account administrator or a workspace administrator. Note that in Azure, users and groups assigned the Contributor role or above on the Resource Group containing an Azure Databricks workspace automatically become workspace administrators.

In addition to users and groups, Databricks supports service principals for accessing external data sources and running production jobs within Databricks workspaces. Although users can run workloads with their own credentials, it is always recommended to use service principals, since users might leave the company or move to another department, making their credentials obsolete, inappropriate, or insecure.

Compute (Clusters and Optimisations)

Databricks is renowned for its compute component, not only due to its Spark-based architecture, arguably the best on the market, but also because of the various compute types tailored to different use cases. Databricks offers the following compute types:

  • All-purpose/interactive clusters: Provisioned clusters to explore and analyse your data in notebooks. Although it is possible to attach an interactive cluster to a Databricks workload, this is not recommended for productionised workloads.
  • Job compute clusters: Clusters created specifically to run automated jobs. This is the best option for running automated production workloads, as the clusters are created to execute the job in an isolated environment and are discarded once the job is completed, saving costs because clusters can't be left running in the background.
  • Instance pools: Ready-to-use compute instances, ideal for workloads that need to reduce their start-up and autoscaling times, ensuring quicker and more efficient performance.
  • SQL warehouses: Provisioned clusters to explore and query data on Databricks. Serverless capability can be enabled to avoid start-up times. They are also used to query the data behind Databricks Lakeview dashboards.

Choosing the correct cluster types is key to optimising your data platform costs. On top of this, you can further enhance resource utilisation and cut costs by right-sizing clusters, leveraging autoscaling, and employing spot instances for non-critical workloads.
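
As an illustration, a job cluster specification combining autoscaling and Azure spot instances with on-demand fallback might look like the sketch below. The runtime version, VM size, and worker counts are assumptions, not recommendations:

```python
# Passed as the "new_cluster" block of a job definition, whether through the
# Jobs API, Terraform, or asset bundles.
job_cluster_spec = {
    "spark_version": "15.4.x-scala2.12",      # assumed LTS runtime
    "node_type_id": "Standard_D4ds_v5",       # assumed Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        # Run workers on spot VMs and fall back to on-demand capacity if evicted.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "first_on_demand": 1,                 # keep the driver on on-demand capacity
        "spot_bid_max_price": -1,             # pay at most the on-demand price
    },
}
```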

What's more, Databricks allows you to set up cluster policies to restrict cluster creation or to impose standard configuration options on created clusters, helping to prevent excessive usage, control costs, and simplify the user experience.
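
For illustration, a cluster policy definition (the JSON document you provide when creating a policy) might constrain VM size, cluster size, auto-termination, and runtime versions along these lines; all values are assumptions to adapt to your own standards:

```python
import json

policy_definition = {
    # Pin clusters to a single approved VM size.
    "node_type_id": {"type": "fixed", "value": "Standard_D4ds_v5"},
    # Cap autoscaling to limit cost.
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    # Force idle interactive clusters to terminate within an hour.
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    # Only allow long-term-support runtimes.
    "spark_version": {"type": "allowlist", "values": ["15.4.x-scala2.12", "14.3.x-scala2.12"]},
}

print(json.dumps(policy_definition, indent=2))
```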

ML and AI

Although Databricks stands out for its ML and AI capabilities, we won't be discussing them today, as they do not require specific infrastructure configuration when architecting your Lakehouse platform. Nonetheless, you can read our blog post about MLOps in Databricks, where we explain why you should consider Databricks for your use cases and how to build an MLOps pipeline to create, test, and promote your models.

Conclusions

Databricks is one of the best alternatives available for developing your Lakehouse data platform, unifying all your data, analytics, and ML workloads. Governance across all Databricks assets is ensured by the Unity Catalog, whilst a correct workspace setup secures connectivity to the platform. By leveraging Databricks, businesses can achieve greater efficiency, scalability, and innovation in their data strategies.

This blog post has only covered some of the aspects that must be considered when designing a Databricks Lakehouse data platform. Each organisation has its own intricacies and requirements, so contact us for expert guidance and support in building out your own platform!