Observability in Cloud Native



Cloud Native is transforming the way applications are designed and developed, leveraging the benefits of cloud computing infrastructure and services. Cloud Native applications are often built on a microservices architecture, which makes them flexible, scalable, and self-healing.

At synvert, our team was tasked with improving the developer experience and resource optimization of a developer platform. The solution is a multi-tenant, multi-cloud Kubernetes platform for a leading automotive company. The project stack includes the usual suspects: CI/CD, database as a service, and observability, all built with cloud-native approaches.

Observability is a critical component of modern systems management: it gives organizations a comprehensive view of system performance and behavior and lets them quickly resolve issues as they arise. However, centralized observability in Cloud Native environments has been a persistent challenge.

In this article, we’ll examine the challenges of observability in Cloud Native, introduce the solution offered by OpenTelemetry, and show how it can be used to achieve centralized observability. Join us as we explore the innovative technologies and practices shaping the future of Cloud Native development!

Setting up observability is hard

Teams typically build observability for their systems and applications from the monitoring features embedded in a variety of proprietary tools and solutions. This approach is challenging because each service ends up collecting and reporting data with different tools and methods, which decentralizes the data and makes it harder to get a comprehensive view of the overall performance and behavior of a system. It also makes it almost impossible to correlate data from different systems and to quickly identify and troubleshoot issues.

Observability setup is hard, no question about it. Our goal, as an additional feature of the developer platform we are building, is to offer customers easy access to telemetry data through a centralized portal. To achieve this, we use tools like Fluent Bit, Prometheus, and OpenSearch to collect, transform, and load data. But tools alone aren’t enough: an observability strategy has to be in place.

Let’s review two common challenges.

1. Non-standardized data

To feed a centralized observability platform, customers need to implement /metrics endpoints and output logs in a specific format. This is a challenging task for big teams in decentralized environments. In organizations that are still on the journey to DevOps, Ops and Platform teams usually have to rely on the customer’s application language, frameworks, and libraries, none of which they control, to extract data.
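As an illustration of what "logs in a specific format" can mean in practice, here is a minimal sketch that uses only Python's standard logging module to emit one JSON object per log line. The field names (timestamp, level, service, message) and the service name "checkout" are hypothetical; a real platform would publish its own required schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object.

    The field names below are assumptions made for this sketch,
    not a format required by any particular platform.
    """

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # record.created is a Unix timestamp; emit it as ISO 8601 in UTC.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger
```

Calling build_logger("checkout").info("order %s accepted", 42) would then write a single JSON line that a log shipper can parse without per-application rules. Every team still has to adopt such a formatter in its own language and framework, which is exactly the burden described above.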

2. Hard to correlate data

Being able to correlate the three observability data streams (logging, monitoring, and tracing) is very important when developing a Cloud Native system. These are the three pillars of observability; if they can’t be correlated, we lose the biggest benefit of having them centralized. The setups we usually see run into exactly this problem, because each stream of data is treated differently. Non-standardized data makes correlation even harder.

These problems keep teams from improving. For our developer platform, we took a different approach to the observability features offered to customers: part of the strategy was to find a solution that ensures the three pillars (logging, monitoring, and tracing) can be correlated.

Observability strategy with standardization

OpenTelemetry

OpenTelemetry is an open standard for collecting and reporting telemetry data. It provides a standard, vendor-neutral way to collect data from various systems and services, regardless of the programming language or codebase used. This is possible because OpenTelemetry defines a consistent data model for telemetry data and provides libraries and SDKs for various programming languages that can be used to instrument code and collect data.

With OpenTelemetry, organizations can standardize their metrics, logs, and traces by using a common schema, format, and protocol. This makes it possible to analyze data from various sources using the same tools and processes, and to correlate data from different sources to gain a more comprehensive understanding of the system’s overall performance.

Single braid and mul­tiple integrations

OpenTelemetry solves the major obstacles hindering the improvement of our product. Many well-known APM platforms, such as Datadog, take an approach similar to OpenTelemetry’s, offering language-specific proprietary SDKs that automatically transmit telemetry data to their services.

OpenTelemetry can be divided into two main high-level components:

1. Instrumentation

OpenTelemetry provides official SDKs for the most popular languages. By including the SDK in their codebase, developers can start instrumenting their applications. For languages like Java and PHP, which rely on a runtime, it is even possible to instrument an application without changing its code, using an OpenTelemetry add-on/extension.

2. Collector

The Collector is a central hub responsible for receiving and processing telemetry data from various sources, such as applications or libraries. It performs tasks such as data validation, aggregation, and routing before forwarding the data to a back end for storage or analysis. The OTel Collector is composed of three kinds of components: receivers, processors, and exporters.
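The receiver, processor, and exporter flow described above can be pictured with a small toy model. This is a conceptual sketch in plain Python, not the real Collector or its configuration: the Pipeline class, the two processors, and the cluster name "dev-eu-1" are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]  # one telemetry item (log, metric, or span)
Processor = Callable[[Record], Optional[Record]]  # returning None drops the record
Exporter = Callable[[List[Record]], None]


@dataclass
class Pipeline:
    """Toy model of receivers -> processors -> exporters."""

    processors: List[Processor] = field(default_factory=list)
    exporters: List[Exporter] = field(default_factory=list)

    def receive(self, batch: Iterable[Record]) -> None:
        """Receiver entry point: run each record through every processor
        in order, then hand the surviving records to every exporter."""
        processed: List[Record] = []
        for record in batch:
            current: Optional[Record] = record
            for process in self.processors:
                current = process(current)
                if current is None:
                    break  # a processor rejected this record
            if current is not None:
                processed.append(current)
        for export in self.exporters:
            export(processed)


def require_service(record: Record) -> Optional[Record]:
    """Validation processor: drop records that lack a service name."""
    return record if record.get("service") else None


def add_cluster(record: Record) -> Record:
    """Enrichment processor: attach a (hypothetical) cluster name."""
    return {**record, "cluster": "dev-eu-1"}
```

For example, with stored = [] and Pipeline(processors=[require_service, add_cluster], exporters=[stored.extend]), calling receive() on a batch leaves only the validated, enriched records in stored. Swapping the exporter for one that writes to a different back end would not require touching the applications that send the data, which is the main appeal of putting a collector in the middle.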

The 5 Commandments of OpenTelemetry

After gathering OpenTelemetry best practices, we built a framework for applying OpenTelemetry to our applications. The 5 Commandments of OpenTelemetry outline a 5-step process for achieving centralized observability: instrumenting applications, collecting data from them, storing it in a central location, analyzing the data, and correlating data from different sources.

1. Instrumentation - The first step is to instrument the applications with OpenTelemetry libraries and/or SDKs. These libraries and SDKs provide a consistent way to collect metrics, logs, and traces from the applications, regardless of the technology or vendor. We created custom documentation so that our clients can adapt their applications to use the OpenTelemetry SDKs.

2. Collection - Once the applications are instrumented, the next step is to collect the data using an OpenTelemetry collector or agent. The collector or agent receives the data from the applications and sends it to a central location for storage and analysis. Once telemetry data arrives at the collector, we enrich it with data that comes from other sources.

3. Storage - The collected data is then stored in a central location, such as a time series database, log management system, or monitoring platform. This allows teams to access and analyze the data from a single location.

4. Analysis - Once the data is stored, users can analyze it with OpenTelemetry-compatible tools. These tools provide a consistent way to visualize, search, and analyze the data, regardless of the technology or vendor. Tools like OpenSearch and Prometheus can be used for this purpose.

5. Correlation - With the data from different services stored in a central location, we can now correlate data from different sources to gain a more comprehensive understanding of the applications’ past and current behavior.
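To make the correlation step more concrete, here is a minimal sketch of joining spans and log records on a shared trace identifier. OpenTelemetry's data model lets log records carry the trace and span IDs of the request they belong to; the flat dictionaries, the trace_id key, and the sample values below are simplifications assumed for this example, not the actual wire format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Record = Dict[str, object]


def correlate_by_trace(
    spans: Iterable[Record], logs: Iterable[Record]
) -> Dict[str, Dict[str, List[Record]]]:
    """Group spans and logs that share the same trace_id.

    Logs without a trace_id cannot be correlated and are skipped.
    """
    traces: Dict[str, Dict[str, List[Record]]] = defaultdict(
        lambda: {"spans": [], "logs": []}
    )
    for span in spans:
        traces[str(span["trace_id"])]["spans"].append(span)
    for log in logs:
        trace_id = log.get("trace_id")
        if trace_id is not None:
            traces[str(trace_id)]["logs"].append(log)
    return dict(traces)


# Hypothetical sample data: one slow request and the logs around it.
spans = [{"trace_id": "t1", "name": "GET /cart", "duration_ms": 930}]
logs = [
    {"trace_id": "t1", "message": "db timeout"},
    {"trace_id": "t2", "message": "ok"},
    {"message": "no trace context"},
]
result = correlate_by_trace(spans, logs)
```

Here result["t1"] holds both the slow span and the "db timeout" log line, so the two can be read side by side. The join only works because both signals carry the same identifier, which is what standardized instrumentation provides.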

By following these steps, teams can achieve centralized observability with OpenTelemetry: they get a consistent way to collect and report data from various systems and services, and they can analyze that data with a single set of tools. The result is a comprehensive understanding of the performance and behavior of a system, and the ability to quickly identify and troubleshoot issues.

Final Thoughts

It’s worth noting that OpenTelemetry is still a relatively new and evolving standard, but it has already been adopted by many companies and organizations. The OpenTelemetry community is constantly improving the standard and adding new features, such as support for new languages, protocols, and platforms, which will make it even more powerful and easier to use.

There’s still a lot to be developed in centralized observability, but an important aspect of OpenTelemetry is that it’s designed to work with other observability tools and standards, such as OpenSearch and Prometheus, which makes it easy to integrate with existing observability solutions like ours. This allows us to leverage the strengths of different tools and use the best tool for the job, while still being able to correlate data from different sources and quickly identify and troubleshoot issues.

Centralized observability is a critical capability for any modern high-performing team, and OpenTelemetry provides a powerful solution for achieving it.

At synvert, we help customers define an observability strategy and implement it in complex environments and organizations.