An Insight into Autoscaling in Cosmos DB



Let’s begin with the name: Cosmos… a word that conjures thoughts of something complex, yet orderly; individual, yet part of a system; different, yet known.

Today’s applications must be highly responsive and always online. To achieve low latency and high availability, instances of these applications need to be deployed in data centers that are close to their users. Applications need to respond in real time to large changes in usage at peak hours, store ever-increasing volumes of data, and make this data available to users in milliseconds.

To achieve such robustness, Microsoft Azure has come up with a globally distributed, multi-model database service called Cosmos DB.

There are a number of benefits, which include:

  • 24/7 availability
  • Elastic scalability of throughput and storage, worldwide
  • Guaranteed low latency at the 99th percentile, worldwide
  • Precisely defined, multiple consistency choices
  • No schema or index management
  • Widely available
  • Secure by default and enterprise ready
  • Significant savings
  • …and the list goes on.

However, Cosmos DB does have a limitation with respect to autoscaling: it can only scale down to 10% of the maximum throughput set by the user, which can hinder cost optimization.

In our case, this built-in autoscaling behaviour did not meet our cost-optimization needs, so we implemented our own scaling logic, which is described later in this post.

Data Lake Architecture in Microsoft Azure

Figure 1: Data Lake architecture in Microsoft Azure

To understand this better, let us have a look at the use case above. It explains the relationship between a number of data lake services and their interactions.

  1. First, the data is ingested into the cloud from numerous sources.
  2. The data is then pushed into the Stronghold in ADLS through Data Factory.
  3. Data Factory runs a Databricks notebook, which encrypts the data (a minimal sketch of this step follows the list).
  4. The encrypted data is then written to the raw folder, and the hash table in Cosmos DB is updated.
  5. An Azure Function scans the folder for any new ingestion and registers it in the Data Catalog.
  6. A data scientist signs in to Databricks using their AD credentials. They can then work with the raw data, transform it as desired, and put it into the APP folder.
  7. The data in the APP folder can be used and analyzed. The code is then pushed to GitLab running in the cloud.
  8. Jenkins, which also runs in the cloud, builds the new code, packages it into a container, and pushes the container to the Azure Container Registry.
  9. Jenkins updates the deployment in Azure Kubernetes Service, which in turn pulls the new container image from the Azure Container Registry.
  10. Log Analytics collects log information from the resources running in the cloud.
  11. Monitoring alerts are also available to notify operations teams whenever a configured alert fires.
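
The encryption performed by the Databricks notebook in step 3 can be pictured with a minimal sketch. It is only illustrative: the helper names, the inline Fernet key, and the use of the BPID as the hashed identifier are assumptions, and a real pipeline would pull its key from a secret store such as Azure Key Vault rather than generate it inline.

    import hashlib

    from cryptography.fernet import Fernet  # symmetric encryption, used here purely for illustration

    # Hypothetical key handling: generated inline only to keep the sketch self-contained.
    fernet = Fernet(Fernet.generate_key())

    def hash_bpid(bpid: str) -> str:
        """Produce a deterministic, non-reversible key suitable for the Hashing container."""
        return hashlib.sha256(bpid.encode("utf-8")).hexdigest()

    def encrypt_field(value: str) -> str:
        """Encrypt a single field value before it is written to the raw folder."""
        return fernet.encrypt(value.encode("utf-8")).decode("utf-8")

    record = {"bpid": "12345", "email": "user@example.com"}
    protected = {
        "id": hash_bpid(record["bpid"]),          # hashed BPID key kept in Cosmos DB
        "email": encrypt_field(record["email"]),  # encrypted field value written to the raw folder
    }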

For our purposes, the fourth point is the most pertinent, since it deals with an interaction with Cosmos DB. Basically, two different Cosmos DB containers were used: the Hashing container and the Anonymization container.

Inside the Hashing container of Cosmos DB

This container holds the hashed BPID keys and encrypted field values ingested into Cosmos DB. The ingested data was first verified against the keys already available in Cosmos DB (which required a read operation before each write operation).
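
A minimal sketch of that read-before-write check, using the azure-cosmos Python SDK, might look as follows. The account URL, key, database and container names, and the assumption that the container is partitioned on /id are all placeholders, not details from the original implementation.

    from azure.cosmos import CosmosClient, exceptions

    # Placeholder connection details; the container is assumed to be partitioned on /id.
    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("datalake").get_container_client("hashing")

    def upsert_if_new(hashed_bpid: str, encrypted_fields: dict) -> None:
        """Write the hashed key only if it is not already present (read before write)."""
        try:
            # Point read: succeeds when the hashed BPID already exists in the container.
            container.read_item(item=hashed_bpid, partition_key=hashed_bpid)
        except exceptions.CosmosResourceNotFoundError:
            container.create_item({"id": hashed_bpid, **encrypted_fields})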

Inside the Anonymization container of Cosmos DB

This collection was used to check whether a record had already been deleted.

The data of around 2 million customers was ingested into Cosmos DB. To attain optimum performance, it was necessary to maintain a system in which the throughput could be increased before the reading starts and decreased after successful completion of the reading. As explained earlier, this is the part where we check the customer records already available in Cosmos DB before pushing a new one.
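
That scale-up-then-scale-down pattern can be sketched with the azure-cosmos Python SDK as below. The RU values, names, and the read function are placeholders; this is an illustration of the pattern, not the actual implementation.

    from azure.cosmos import CosmosClient

    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("datalake").get_container_client("hashing")

    HIGH_RU = 4000  # provisioned just before the bulk read starts
    LOW_RU = 400    # dropped back to once the read has completed

    def read_with_temporary_scale_up(read_fn):
        """Raise the container throughput for a heavy read, then scale back down."""
        container.replace_throughput(HIGH_RU)
        try:
            return read_fn(container)
        finally:
            container.replace_throughput(LOW_RU)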

This is because reading 2 million records takes around 7–10 minutes, which is much too long. Leaving the throughput fixed is not optimal in all cases either, since the number of records is inconsistent and can increase or decrease in the short term.

Azure Cosmos DB allows us to set provisioned throughput on databases and containers. There are two types of provisioned throughput: standard (manual) and autoscale. In Azure Cosmos DB, provisioned throughput is measured in request units per second (RU/s). RUs measure the cost of both read and write operations against your Cosmos DB container, as shown in the following image:

Figure 2: Provisioned throughput on databases and containers in Azure Cosmos DB
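
For reference, the two provisioning modes can be sketched with the azure-cosmos Python SDK. All names and RU values here are illustrative, and ThroughputProperties-based autoscale provisioning is only available in recent versions of the SDK.

    from azure.cosmos import CosmosClient, PartitionKey, ThroughputProperties

    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    database = client.get_database_client("datalake")

    # Standard (manual) throughput: a fixed number of RU/s.
    database.create_container_if_not_exists(
        id="hashing-manual",
        partition_key=PartitionKey(path="/id"),
        offer_throughput=400,
    )

    # Autoscale throughput: scales between 10% of the maximum and the maximum itself.
    database.create_container_if_not_exists(
        id="hashing-autoscale",
        partition_key=PartitionKey(path="/id"),
        offer_throughput=ThroughputProperties(auto_scale_max_throughput=4000),
    )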

The throughput provisioned for a container is evenly distributed among its physical partitions, and, assuming a good partition key (one that distributes the logical partitions evenly among the physical partitions), the throughput is also distributed evenly across all logical partitions of the container. One cannot selectively specify the throughput for logical partitions. This is because one or more logical partitions of a container can be hosted by a physical partition, and the physical partitions belong exclusively to the container (and support the throughput provisioned on the container).

If the workload running on a logical partition consumes more than the throughput allocated to the underlying physical partition, it is possible that your operations will be rate-limited. What is known as a hot partition occurs when one logical partition receives disproportionately more requests than other partition key values.

When rate limiting occurs, one can either increase the provisioned throughput for the entire container or retry the operations. One should also ensure a choice of partition key that evenly distributes storage and request volumes.
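
When retrying, the usual pattern is to back off on HTTP 429 responses. The sketch below is illustrative only (the SDK also performs its own retries); the helper name and parameters are not from the original post.

    import time

    from azure.cosmos import exceptions

    def read_with_retry(container, item_id: str, partition_key: str, attempts: int = 5):
        """Retry a point read when the request is rate-limited (HTTP 429)."""
        for attempt in range(attempts):
            try:
                return container.read_item(item=item_id, partition_key=partition_key)
            except exceptions.CosmosHttpResponseError as err:
                if err.status_code != 429 or attempt == attempts - 1:
                    raise
                time.sleep(2 ** attempt)  # simple exponential backoff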

The following image shows how a physical partition hosts one or more logical partitions of a container:

Figure 3: A physical partition hosts one or more logical partitions of a container

The problem here is that defining throughput must be done manually, and even if one selects the autoscaling option, this simply scales the throughput between the assigned maximum and 10% of that value. The minimum throughput cannot be reached directly, which may not be that cost-effective.

The reason for this is related to the parallelization of reading across partitions. The underlying physical partitions are completely managed by Cosmos DB and cannot be influenced, which means their number remains the same no matter how we try to play around with the partition key. For example, a partition key consisting of a random number between 0 and 32 was used to see the impact on the number of partitions, but it resulted in the same number: 4.
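
In code, that experiment amounts to attaching a synthetic partition key value at ingestion time. A minimal sketch follows; the property name pk is an assumption, while the 0-to-32 range mirrors the description above.

    import random

    def with_synthetic_partition_key(document: dict) -> dict:
        """Attach a random bucket between 0 and 32 as the document's partition key value."""
        document["pk"] = str(random.randint(0, 32))
        return document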

Reading in parallel from Cosmos DB takes about 7–10 minutes, with one executor on each partition and the throughput shared equally between the partitions. A throughput of 4000 RU means that each partition gets around 1000 RU. On the high end, the throughput value can be set to any number, which means we can up-scale as much as necessary to make computations faster; on the low end, however, the throughput value will still be set to 400 RU.
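
The parallel read itself ran from Spark. The rough sketch below uses the Azure Cosmos DB Spark 3 OLTP connector; the connector choice, option names, and values are assumptions, since the original setup is not shown in this post.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Each Spark task reads from one Cosmos DB partition, so the provisioned RU/s
    # are effectively shared across the parallel readers.
    cosmos_config = {
        "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
        "spark.cosmos.accountKey": "<key>",
        "spark.cosmos.database": "datalake",
        "spark.cosmos.container": "hashing",
    }

    df = spark.read.format("cosmos.oltp").options(**cosmos_config).load()
    print(df.rdd.getNumPartitions())  # roughly one Spark partition per Cosmos DB partition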

Now, one might think that a minimum of 400 RU should be fine, and that we just need to concentrate on the high end, where the most significant performance gains can be made. But this is not how Microsoft tools work. Down-scaling is not completely independent either; it depends on the data volume. On top of this, down-scaling is as crucial as up-scaling: it saves huge amounts of money by reducing the expense of under-utilized resources.

The best solution, at least for the above implementation, was not especially complicated. It was simply conscious optimization applied using common sense. A small code snippet was executed on the Spark cluster. The job of this code was to check whether the minimum value (that being 400 RU) is accepted or not. Should this value not be accepted, an error is thrown and the value is increased by 100 RU. This continues until an optimized down-scaling parameter is finally accepted. The code, of course, takes the data volume into account. For up-scaling, no such problem was encountered, and hence no workaround was required.
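
A minimal sketch of that down-scaling loop, using the azure-cosmos Python SDK: the probing logic (start at 400 RU, step up by 100 RU until the service accepts the value) follows the description above, while the connection details, names, and upper bound are placeholders.

    from azure.cosmos import CosmosClient, exceptions

    client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("datalake").get_container_client("hashing")

    def scale_down_as_far_as_possible(start_ru: int = 400, step_ru: int = 100, max_ru: int = 4000) -> int:
        """Probe downward scaling: try 400 RU first, stepping up by 100 RU until accepted."""
        target = start_ru
        while target <= max_ru:
            try:
                container.replace_throughput(target)
                return target  # accepted: the lowest throughput the container currently allows
            except exceptions.CosmosHttpResponseError:
                # The service rejects throughput below its current minimum, which
                # depends on the stored data volume, so probe the next step up.
                target += step_ru
        raise RuntimeError("no throughput value up to max_ru was accepted")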

So, how did Microsoft Azure perform when tested with 2 million records?

Figure 4: Microsoft Azure performance test with 2 million records

I personally think that, seeing as our solution required field-level data encryption and role-based access to the database, PostgreSQL or Microsoft SQL Server could have been good alternatives. From a complexity point of view, PostgreSQL and Microsoft SQL Server are simpler in terms of writing queries on unencrypted data. But since a data storage API was required for workflows to write back to legacy systems, Cosmos DB showed more potential at the key level in terms of being a robust database.

Thank you for reading. I hope this blog opened doors to new solutions. Ciao…

Acronyms:

  • CISO: Chief Information Security Officer
  • Cosmos DB: A globally distributed, multi-model database service from Azure
  • RU: Request Unit
  • ADLS: Azure Data Lake Storage
  • BPID: Business Process ID
  • LKP: Lookup table