Solving performance issues for customer-facing applications at scale



When creating a platform designed to handle large amounts of data with quickly changing schemas and structures, NoSQL databases are crucial for providing the adaptability necessary to solve performance issues.

Each tool has a distinct role, and choosing the option best suited to your specific circumstances and requirements is vital, given that each tool possesses its own set of advantages and disadvantages. In situations with a high volume of write operations and minimal read activity, a database type that prioritizes write performance and scalability is typically favored. NoSQL databases are commonly deemed more appropriate than conventional relational databases in such contexts, especially when it comes to solving performance issues.

In this specific situation, a substantial write volume is essential given the dynamic characteristics of the data. NoSQL databases address our write-volume challenge, yet they can also present significant challenges when the data grows in size and must be fetched and filtered for the applications' end users. In this particular scenario, search engines can serve as invaluable allies.

This article elucidates strategies to address a significant challenge: response times during the execution of extensive filtering, data retrieval, and sorting operations using NoSQL databases.

We will go through:
  1. Understanding the Challenge
  2. Why you need to rely on data
  3. A/B testing with real usage traffic
  4. Load testing

Understanding the Challenge

Individuals frequently depend on assumptions rather than facts. We engineers are trained to address challenges by implementing enduring solutions rather than temporary patches. Therefore, prior to engaging in practical tasks, it is imperative to grasp the underlying problem comprehensively. In this regard, data visualization and factual information play a crucial role.

The recently onboarded project involves a significant ETL process tailored for a prominent corporate client headquartered in Europe, specializing in the automotive industry. The primary objective of this initiative is to extract vehicle data from various sources, perform necessary transformations, and deliver it to upstream systems for online retail purposes, while also making it available through APIs for other consumers. It is imperative for the team to ensure the continuous availability and functionality of the system, as any disruptions could significantly affect the client's sales on the online shops operating across multiple markets in the European Union.

This project has encountered significant challenges regarding response times during data retrieval, primarily attributable to the continuous growth in data size over time. The dataset has undeniably expanded, yet forecasts regarding its future scale remain absent. This lack of information significantly influenced our approach to problem-solving, necessitating consideration of this unpredictable variable.

These response times have increased in proportion to the level of data filtering applied by users, resulting in an average response time of 1 minute and 4 seconds, with a peak of 3 minutes and 56 seconds. A prolonged wait of approximately 4 minutes for users to receive data after adjusting a filter is suboptimal; in reality, the server is likely to time out before the user receives the response. Upon analyzing these metrics, it became apparent that applying more filters would likely push outcomes toward the higher percentiles.

Rely on data, not assumptions

Engineers face significant challenges when relying on assumptions, as our problem-solving approach is grounded in addressing tangible issues through data-driven decisions, not assumptions.

Our source of truth for the data essential to solving the problem was our Application Performance Monitor (APM), which provided us with comprehensive metrics such as average response times, latency, maximum values, and percentiles. This tool furnished us with the insights needed to pinpoint the root cause of the issue effectively. Additionally, APM platforms typically offer visibility into time-consuming processes, enabling us to swiftly recognize the bottleneck located within our database.

Upon analyzing our database monitoring data, it became evident that vertical scaling, which involves increasing resources, would not resolve the performance issues. Nevertheless, we implemented a temporary vertical scaling measure to illustrate and document that scaling alone would not address the underlying problem.

The percentiles served as the initial metric for establishing a basis of comparison. Before delving further, we pinpointed potential user queries whose response times corresponded to the percentiles (P50, P75, P90, and P99).

Given that the issue arose from the ongoing expansion of data volume, we opted to scale the data during collection by factors of 2 and 5. This initiative aimed to assess the system's performance under the increased data loads expected in the future.
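
As an illustration of this step, the sketch below clones a collection at a given scale factor. It is a minimal sketch only: the database, collection, and connection details (`catalog`, `vehicles`, the local URI) are assumptions rather than the project's real names, and a production copy would stream documents through a cursor instead of loading them into memory.

```typescript
// scale-dataset.ts -- hypothetical sketch: clone a collection at 2x / 5x volume.
import { MongoClient, ObjectId } from "mongodb";

async function scaleCollection(uri: string, factor: number): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const db = client.db("catalog");
    const source = db.collection("vehicles");
    const target = db.collection(`vehicles_${factor}x`);

    // Insert `factor` copies of the original data, each with fresh _ids,
    // so the scaled collection behaves like organically grown data.
    // (For brevity this loads everything into memory; stream with a cursor in practice.)
    const docs = await source.find().toArray();
    for (let copy = 0; copy < factor; copy++) {
      await target.insertMany(docs.map((doc) => ({ ...doc, _id: new ObjectId() })));
    }
  } finally {
    await client.close();
  }
}

await scaleCollection("mongodb://localhost:27017", 2);
await scaleCollection("mongodb://localhost:27017", 5);
```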

Figure 1: Response times with 2x and 5x the data size

Upon examination of the figure above, it is evident that the dataset's size directly correlates with an increase in response time. Due to the data-caching behavior of our database engine, we adopted a methodology in which each query is executed 10 times, with the initial two iterations analyzed separately from the subsequent eight.
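
A minimal sketch of that measurement loop follows, assuming the same hypothetical `catalog`/`vehicles` names as above; the 100-document limit is likewise an illustrative assumption.

```typescript
// benchmark-query.ts -- hypothetical sketch of the 10-run methodology described above.
import { MongoClient, type Document, type Filter } from "mongodb";

async function timedRuns(uri: string, filter: Filter<Document>, runs = 10) {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    const collection = client.db("catalog").collection("vehicles");
    const timings: number[] = [];
    for (let i = 0; i < runs; i++) {
      const start = performance.now();
      await collection.find(filter).limit(100).toArray();
      timings.push(performance.now() - start);
    }
    // The first two runs hit a cold cache; the remaining eight reflect cached behavior.
    return { cold: timings.slice(0, 2), warm: timings.slice(2) };
  } finally {
    await client.close();
  }
}
```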

Figure 2: Response times with 2x and 5x the data size and vertical scaling

As previously stated, we conducted identical tests in a scaled environment, demonstrating that vertical resource scaling would not provide a lasting solution to our issue.

In the course of this analysis, we diligently optimized our database indexes to their fullest extent. However, we reached a threshold where sustaining all the indexes needed to cover the numerous applicable data filters became unfeasible.
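
The arithmetic behind that threshold is simple: with independently combinable filters, the number of candidate indexes grows exponentially. The snippet below illustrates the point; the field names are hypothetical, chosen purely for illustration.

```typescript
// index-combinatorics.ts -- illustrative only: why one index per filter combination is unfeasible.
const filterFields = ["make", "model", "year", "price", "fuelType", "transmission"];

// Even limiting ourselves to one compound index per unordered subset of fields,
// there are 2^n - 1 non-empty subsets, each a potential access pattern.
const candidateIndexes = 2 ** filterFields.length - 1; // 63 for 6 fields
console.log(`${filterFields.length} filter fields -> up to ${candidateIndexes} candidate indexes`);
```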

Given that scaling resources and reviewing indexes did not resolve our performance issues, we opted to explore alternative solutions. Our focus shifted towards comprehending the behavior of user-applied filters when executed within a search engine.

A/B testing with real usage traffic

In order to guarantee that our solution effectively addressed the issue at hand, we opted to devise two distinct scenarios utilizing different search engines. This approach aimed to determine which search engine best aligned with our requirements.

To attain optimal outcomes while preserving consistency, the most effective approach was to implement A/B testing with real usage traffic. This testing methodology involved creating two distinct endpoints within our backend system, each utilizing a different search engine. After configuring the endpoints to retrieve data from their respective search engines, we modified our frontend interface to consistently route requests to all available solutions. This change ensured that each filter or action initiated by our end users yielded results from both engines that we could analyze later.

Figure 3: A/B testing implementation with real usage traffic

In the figure provided, it is evident that each user request for data retrieval and filtering was duplicated across the two distinct solutions. This approach was adopted to offer our client the flexibility of choosing between two scenarios, accompanied by a comprehensive delineation of the respective advantages and drawbacks of each option.
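
A minimal sketch of this request fan-out is shown below. The endpoint paths and response shape are assumptions for illustration; the real routes and error handling would differ.

```typescript
// Hypothetical frontend fan-out: every user action queries both engines.
interface SearchResult {
  items: unknown[];
}

async function searchBothEngines(filters: Record<string, string>): Promise<SearchResult> {
  const query = new URLSearchParams(filters).toString();

  // Fire the same request at both backends so every real user action
  // produces a comparable data point for each engine on the APM side.
  const [atlas, typesense] = await Promise.allSettled([
    fetch(`/api/search/atlas?${query}`).then((r) => r.json() as Promise<SearchResult>),
    fetch(`/api/search/typesense?${query}`).then((r) => r.json() as Promise<SearchResult>),
  ]);

  // Render from whichever engine is currently designated primary; the duplicate
  // call exists purely to collect comparable metrics.
  if (atlas.status === "fulfilled") return atlas.value;
  if (typesense.status === "fulfilled") return typesense.value;
  throw new Error("Both search backends failed");
}
```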

We opted to employ Typesense and the Atlas Search engine to present two distinct solutions, despite both addressing the same issue. Atlas Search, included in our client's MongoDB subscription, eliminates the need for service management. Conversely, we suggested Typesense as a self-hosted alternative, requiring oversight of the service's infrastructure and availability by the infrastructure team.
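
For context, the two backend endpoints issued queries along the following lines. This is a sketch under stated assumptions: the index, collection, field names, and environment variables are illustrative, not the project's actual configuration.

```typescript
// Hypothetical equivalent queries against both engines.
import { MongoClient } from "mongodb";
import Typesense from "typesense";

// Atlas Search: a $search aggregation stage against a search index.
const mongo = new MongoClient(process.env.MONGODB_URI!);
await mongo.connect();
const atlasResults = await mongo
  .db("catalog")
  .collection("vehicles")
  .aggregate([
    { $search: { index: "default", text: { query: "hybrid", path: "fuelType" } } },
    { $limit: 20 },
  ])
  .toArray();

// Typesense: the equivalent search against a self-hosted node.
const typesense = new Typesense.Client({
  nodes: [{ host: "localhost", port: 8108, protocol: "http" }],
  apiKey: process.env.TYPESENSE_API_KEY!,
});
const typesenseResults = await typesense
  .collections("vehicles")
  .documents()
  .search({ q: "hybrid", query_by: "fuelType", per_page: 20 });
```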

Upon implementing and deploying this solution in the relevant environments, we revisited the analysis after a few days to review the outcomes from the past three days. The results were remarkable and aligned precisely with our expectations.

The average latency of the requests decreased by approximately 60%, with a reduction of 64.36% using Atlas Search and 56.44% using Typesense. Concurrently, the filters applied by clients falling within the 75th percentile saw a decrease of around 57%, with figures of 57.14% for Atlas Search and 56.57% for Typesense. Notably, a significant enhancement in response time was observed for requests containing intricate details and complex filters, resulting in a remarkable reduction of nearly 98%: specifically, 98.24% using Atlas Search and 97.13% using Typesense.

Figure 4: A/B testing with real usage traffic during 3 days

Load testing

The primary issue stemmed from the client's lack of foresight regarding data growth over time. It was imperative for the team not only to address the immediate concern but also to devise a sustainable solution capable of accommodating future data expansion autonomously. The A/B testing instilled confidence in the team regarding the potential of the solution to address the challenges at hand. Given our existing engagement with this subject, we opted to delve deeper into the investigation to gain a more comprehensive understanding of the system's behavior, taking into consideration potential data growth over time.

To deepen our understanding of this topic, we built on our earlier research into vertical scaling of databases, referenced above. We then indexed the data in the collections we had created with factors of 2 and 5. We sought to comprehend not only the behavior of the database but also that of our entire infrastructure. Consequently, we opted to conduct load testing on the API that served the data.

The methodology mirrored the practices already mentioned in this article, employing suitable requests and filters across all percentiles to analyze metric results encompassing average, minimum, and maximum response times for each potential solution. Subsequently, the dataset was enlarged and requests per second were increased to evaluate system performance under heightened data volume and user load.

Figure 5: Load testing using Atlas Search with 2x the data size
Figure 6: Load testing using Typesense with 2x the data size
Figure 7: Load testing using Atlas Search with 5x the data size
Figure 8: Load testing using Typesense with 5x the data size

The load-testing analysis was conducted using k6, a tool developed by Grafana Labs. Grafana Labs is an active participant in the CNCF ecosystem, and k6 integrates seamlessly with cloud-native environments. This robust integration instilled confidence within the team to delve deeper into the tool's capabilities, thereby enhancing the overall investigative process.
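
A minimal k6 script of the kind used for such a test might look like the sketch below. The target URL, stages, and thresholds are assumptions for illustration; k6 executes JavaScript, and this plain-JS sketch is also valid TypeScript.

```typescript
// load-test.ts -- hypothetical k6 scenario; tune stages and thresholds to your SLOs.
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "2m", target: 50 }, // ramp up to 50 virtual users
    { duration: "5m", target: 50 }, // sustain the load
    { duration: "1m", target: 0 },  // ramp down
  ],
  thresholds: {
    // Fail the run if the 90th/99th percentile response times regress.
    http_req_duration: ["p(90)<500", "p(99)<1500"],
  },
};

export default function () {
  const res = http.get("https://api.example.com/search/atlas?fuelType=hybrid");
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}
```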

Final thoughts

Following an extensive investigation, we successfully identified a viable long-term solution and presented various options to the client, enabling informed decision-making based on data rather than assumptions.

Opting for the Atlas Search solution was deemed optimal due to its inclusion in the default MongoDB subscription, eliminating the need to manage self-hosted infrastructure. While Typesense also provides a SaaS offering, adopting it would have increased monthly expenses.

It is imperative to note that both solutions are robust and dependable. When selecting a solution for a similar scenario, it is crucial to assess its suitability for your specific use case, rather than relying solely on the choice of Atlas Search made in this article.

At synvert, we have demonstrated a history of pinpointing successful long-term IT solutions, enabling informed decision-making through data-driven options, and streamlining infrastructure management, making us a reliable choice for addressing your current challenges.