Observability in Diverse Cloud Environments (4/4) – Ingestion into Splunk



In this last part of the blog post series, we will look at how we ingested our telemetry outputs into Splunk, review our final achievements, and finish with a personal summary on the topic.

Logs and Log-Based Metrics in Splunk

There are multiple ways to ingest log signals into Splunk Observability Cloud, to mention a few:

  • you can set up a Splunk-compatible OpenTelemetry Collector to which applications offload logs (and potentially metrics and traces)
  • you can send your logs via CloudWatch log subscription filters to a Lambda function that internally calls the Splunk API endpoints
  • you can instrument your applications to directly export logs (and potentially metrics and traces) against the Splunk API endpoints

Out of those three options, option one is probably the most scalable and sensible one. It does however require that your applications have been instrumented with the OpenTelemetry SDKs, which our application landscape was not (again, because parts of the OpenTelemetry specification were still experimental back then).

Option three can quickly lead you into a vendor lock-in, where changing the observability backend implies refactoring application logic. This doesn't make sense and also results in additional API latency, as well as forcing application code to implement retry mechanisms against third-party API endpoints, and so on.

Option two has thus been the most sensible option in our scenario, since we didn't want to run into a vendor lock-in and using an OpenTelemetry Collector was not really an option for us.

Back then, Splunk offered a prebuilt Lambda function called aws-log-collector, which contained the business logic to process batches of log events and ingest them into Splunk Observability Cloud via HTTP requests against one of the older SignalFx API endpoints. In addition to the prebuilt Lambda function, Splunk provided prebuilt templates that allowed you to deploy these Lambda functions automatically using AWS CloudFormation.

Albeit I am not a big fan of mixing different IaC frameworks (in this case Terraform and AWS CloudFormation), this was at least pretty easy to set up. You just had to attach your log groups to the Lambda function via log subscription filters and have your logs pop up in Splunk Observability Cloud. Actually quite similar to how we set up our persistence layer mentioned earlier 🤷‍♀ 🤷‍♂
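For reference, attaching a log group to such a collector Lambda boils down to a single subscription filter call. Here is a minimal boto3 sketch, assuming the aws-log-collector function has already been deployed and permits invocation by CloudWatch Logs (the log group name and ARN are placeholders):

import boto3

logs = boto3.client("logs")

# Forward all events of one log group to the (already deployed) collector Lambda.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/some-application",
    filterName="to-splunk-aws-log-collector",
    filterPattern="",  # empty pattern = forward every log event
    destinationArn="arn:aws:lambda:eu-central-1:123456789012:function:aws-log-collector",
)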

As soon as we had our logs ingested into Splunk Observability Cloud, we immediately started to challenge Splunk with things we had already accomplished in AWS. This was primarily the log interfaces we had previously set up using CloudWatch Logs Insights. The counterpart in Splunk is called Splunk Log Observer, and at first glance, it gave us everything we needed.

We could easily reproduce the real-time log tables that we had seen before in CloudWatch.

[Image: real-time log tables in CloudWatch]

Another thing that felt way more intuitive than CloudWatch is that Log Observer provides a lot more click interactions, like hovering over a structured log entry and adding fields to filter expressions via drop-down menus 😍

[Image: hovering over a structured log entry and adding fields to filter expressions via drop-down menus]

While Log Observer felt way more intuitive, it also felt a bit less mature from an automation perspective compared to CloudWatch Logs Insights.

Managing log queries is kind of awkward. Yes, you can save your log queries to review them later, but there are some quirks to be prepared for:

  • First, when you edit an existing query and save it, you essentially create a copy of your query, which has the same name and can only be differentiated by an internal ID that is not visible in the UI 🙄 This really feels awkward
  • Second, there appears to be no way to share a saved query with colleagues. Coop mode? Disabled
  • Third, there is no way to automate the definition of such queries using the official signalfx Terraform provider

The only workaround for all of these issues is to instead create a dashboard holding your log query, managed by Terraform, to support a continuous development flow. Managing dashboards via Terraform, as well as the way that Log Observer and dashboards interact with each other, also has a lot of flaws (as I might point out later or in another blog post), but at least there is a path to automation.

Intermediate verdict at this point? More intuitive and user friendly, but also less mature in some aspects.

The next thing we wanted to inspect was the way Splunk deals with log-based metrics, a concept called metricization rules in Splunk Observability Cloud. The concept might sound compelling, but there were three reasons that instantly blocked the usage of this feature:

  • no path to automation — the signalfx Terraform provider simply does not support managing these rules declaratively via Terraform. I might be a hardliner on that one, but if there's no path to automation, then such a feature may at most be classified as experimental IMHO.
  • restrictive quotas — you can define at most 128 of these rules per organization
  • inappropriate visualization results — in our experiments with the metricization rules, we constantly saw patterns where the resulting metrics would show blocks of data that simply should not exist, or blocks of data that simply wouldn't aggregate to an expected number. Keep in mind that we had a very accurate cross-validation system in place: the Glue Data Catalog, where we stored all of our custom logs and metrics. The resulting metrics from Splunk simply couldn't be validated against the Glue Data Catalog.

However, finding number one alone was already sufficient reason not to opt in to this feature. Seriously, this was a big letdown. But then again, we were already quite proficient in implementing workarounds for non-functional cloud provider (or, in this case, third-party provider) features.

Instead, we built our own Lambda function that would push logs and log-based metrics to Splunk. The resulting stack looked somewhat like this:

[Image: Lambda function pushing logs and log-based metrics to Splunk]

The Lambda function would in essence be triggered with batches of log events received from various log groups via log subscription filters, derive log-based metrics dynamically using generic configuration patterns, and send

  • logs to the SignalFx API endpoint https://ingest.<region>.signalfx.com/v1/log
  • log-based metrics to the SignalFx API endpoint https://ingest.<region>.signalfx.com/v2/datapoint

While extracting metrics from logs and sending them to the v2/datapoint endpoint was more or less trivial, sending logs to the v1/log endpoint was kind of annoying. That is because the API endpoint was not documented (anymore?), so we had to reverse engineer the open-source Lambda function provided by Splunk to see how the endpoint is called. The rest was experimentation and trial and error. This isn't a very satisfying experience, but it could be one of those artifacts that occur when one company is taken over by another one; I've talked about that before.
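For the documented v2/datapoint endpoint, the call itself is straightforward. Here is a minimal sketch of the metric part of such a Lambda, assuming the ingest realm and an org access token are provided as environment variables (both placeholders); the datapoint shape mirrors the examples further below:

import os

import requests

INGEST_URL = f"https://ingest.{os.environ['INGEST_REALM']}.signalfx.com/v2/datapoint"

def send_datapoints(counters, gauges):
    """Push derived counter and gauge datapoints to Splunk Observability Cloud."""
    response = requests.post(
        INGEST_URL,
        headers={"X-SF-Token": os.environ["SFX_TOKEN"]},
        json={"counter": counters, "gauge": gauges},
        timeout=10,
    )
    # Beware: as described below, 2xx does not guarantee the payload made sense.
    response.raise_for_status()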

Another thing that felt kind of awkward was the fact that some of the API endpoints would return 2xx status codes even for malformed requests. That made experimentation really tricky. Honestly, I would have expected to receive 4xx status codes for bad requests, but that isn't the case.

In essence, you could send both endpoints (IIRC, but this is at least still the case for the v2/datapoint endpoint) completely arbitrary JSON payloads, such as this:

{
  "this": {
    "makes": {
      "no": "sense"
    }
  }
}

And you would still see a 200 OK response. In case of invalid requests (against an undocumented API endpoint when it comes to logs), you simply wouldn't see data in Splunk Observability Cloud. That, IMHO, is a pretty awkward API design, but perhaps I'm missing some points here that might be common practice somehow?

Thankfully, we could at least deprecate the undocumented legacy v1 log endpoint some months later in favor of a new endpoint, https://<Splunk instance that runs HEC>/services/collector. This endpoint, however, does not ingest logs into Splunk Observability Cloud directly, but instead into Splunk Enterprise / Splunk Cloud Platform (don't ask me how the licensing model works). To bridge both products, Splunk built an adapter called Log Observer Connect, which essentially connects the two with each other while also promising more performance.
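Sending logs to the HEC endpoint then follows the standard HEC event format. A minimal sketch, with the HEC URL and token as placeholders and one of our structured log events as the payload:

from datetime import datetime

import requests

HEC_URL = "https://<splunk-instance-that-runs-hec>/services/collector"
HEC_TOKEN = "<hec-token>"  # assumption: a token scoped to the target index

def send_log_event(event: dict):
    """Forward one structured log event to Splunk via the HTTP Event Collector."""
    response = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json={
            "event": event,          # the structured log record itself
            "sourcetype": "_json",   # assumption: generic JSON sourcetype
            # HEC expects epoch time; our events carry an ISO 8601 timestamp
            "time": datetime.fromisoformat(event["timestamp"]).timestamp(),
        },
        timeout=10,
    )
    response.raise_for_status()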

What was the result of our metric extraction? We essentially derived two types of metrics dynamically: one counter metric for every log event, and one gauge metric for every key-value pair in the metrics section of a log (given it contained one at all).

Take this as an example again:

{
  "id": "43748eba-b877-4343-9ada-edf98cd1e919",
  "timestamp": "2024-05-10T12:13:05.368848+00:00",
  "resource": {
    "namespace": "processing",
    "name": "ingest"
  },
  "instrumentation_scope": {
    "name": "some_lib",
    "version": "1.5.0"
  },
  "type": "WriteSucceeded",
  "severity": "INFO",
  "labels": {
    "use_case": "fancy-ml",
    "workflow_name": "foo",
    "workflow_version": "1.0.0"
  },
  "body": {
    "metrics": {
      "duration_seconds": 15.7,
      "row_count": 123412.0
    },
    "target": "s3://some-bucket/some/key"
  },
  "traceback": null,
  "location": {
    "file_system_path": "/foo/bar/foo.py",
    "module": "foo",
    "lineno": 20
  }
}

This would result in a counter metric like this (simplified):

{
  "metric": "<some-prefix>.event_count",
  "dimensions": {
    "type": "WriteSucceeded",
    "severity": "INFO",
    "labels.use_case": "fancy-ml",
    "labels.workflow_name": "foo",
    "labels.workflow_version": "1.0.0"
  },
  "value": 1,
  "timestamp": 1715343185369
}

And two gauge metrics like these (simplified):

[
  {
    "metric": "<some-prefix>.duration_seconds",
    "dimensions": {
      "type": "WriteSucceeded",
      "severity": "INFO",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 15.7,
    "timestamp": 1715343185369
  },
  {
    "metric": "<some-prefix>.row_count",
    "dimensions": {
      "type": "WriteSucceeded",
      "severity": "INFO",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 123412.0,
    "timestamp": 1715343185369
  }
]

This wasn't too difficult to set up, and we finally had log-based metrics extracted from structured log events available in Splunk. As you might see, we made sure that the dimensionality of the log event was attached to each metric as well, which was key to setting up generic dashboard filters or variables such as labels.use_case. We also benefited from this strategy later on, when we automated alerts and notifications. But more on that later.
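To make the derivation tangible, here is a minimal sketch of that logic, stripped of the generic configuration patterns we actually used; METRIC_PREFIX and the selected dimensions are simplified assumptions:

from datetime import datetime

METRIC_PREFIX = "<some-prefix>"

def derive_metrics(log_event: dict):
    """Derive one counter per log event and one gauge per body.metrics entry."""
    ts_ms = int(datetime.fromisoformat(log_event["timestamp"]).timestamp() * 1000)
    dimensions = {
        "type": log_event["type"],
        "severity": log_event["severity"],
        **{f"labels.{k}": v for k, v in log_event.get("labels", {}).items()},
    }
    # one counter metric for every processed log event
    counters = [{
        "metric": f"{METRIC_PREFIX}.event_count",
        "dimensions": dimensions,
        "value": 1,
        "timestamp": ts_ms,
    }]
    # one gauge metric for every key-value pair in the optional metrics section
    gauges = [
        {
            "metric": f"{METRIC_PREFIX}.{name}",
            "dimensions": dimensions,
            "value": value,
            "timestamp": ts_ms,
        }
        for name, value in (log_event.get("body") or {}).get("metrics", {}).items()
    ]
    return counters, gauges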

And there we go! We could now start setting up dashboards that would visualize all our structured logs and log-based metrics from a business perspective 🚀

One thing that really stood out was the definition of a generic counter metric derived dynamically for each processed log event. This pattern felt so powerful that it easily made up 30–40% of all our dashboard features. It's a pattern I can only recommend setting up for any platform that issues structured logs.

Want to count the number of ingestion jobs created (SignalFlow syntax incoming)?

data(
    "<some-prefix>.event_count",
    filter=filter(
        {
            "resource.namespace": "ingestion",
            "type": "JobCreated",
            # ...
        }
    )
).publish(label="Jobs Created")

Want to count the overall number of errors originating from any ingestion-related component?

data(
    "<some-prefix>.event_count",
    filter=filter(
        {
            "resource.namespace": "ingestion",
            "severity": "ERROR",
            # ...
        }
    )
).publish(label="# Ingestion Errors")

You get the point. This generic log counter metric enabled us to visualize so many different patterns with a single metric 😲 I would definitely do the same thing again in upcoming projects.

So, what are we still missing? Ah, right! AWS managed metrics, such as CPU and memory utilization, provided by AWS for different services out of the box. Stay with me 🍿

AWS Managed Metrics in Splunk

As with logs, there are multiple ways to ingest AWS CloudWatch metrics into Splunk Observability Cloud, of which we've tested two:

  • pull based — where a scheduled process inside Splunk Observability Cloud regularly pulls new metric datapoints from CloudWatch
  • push based — where you set up a CloudWatch Metric Stream that is triggered event-based as new metric datapoints occur. The stream can deliver the metrics directly via a Firehose delivery stream to a known AWS Partner destination endpoint. Amongst those partners is Splunk Observability Cloud (a minimal setup sketch follows below this list).
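Setting up the push-based variant is mostly plumbing. A minimal boto3 sketch, assuming a Firehose delivery stream pointing at the Splunk Observability Cloud partner destination already exists; the stream name, ARNs, and the included namespaces are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Stream only the namespaces we care about towards the existing Firehose delivery stream.
cloudwatch.put_metric_stream(
    Name="splunk-o11y-metric-stream",
    FirehoseArn="arn:aws:firehose:eu-central-1:123456789012:deliverystream/splunk-o11y",
    RoleArn="arn:aws:iam::123456789012:role/metric-stream-to-firehose",
    OutputFormat="json",
    IncludeFilters=[
        {"Namespace": "Glue"},
        {"Namespace": "/aws/sagemaker/Endpoints"},
    ],
)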

Personally, I would always favor a push-based ingestion approach over a pull-based approach for such scenarios. I cannot imagine how a scheduled pull-based approach could ever be as scalable and cost efficient as an event-based push approach backed by serverless functions. But the pull-based approach was somewhat easier to set up, so we went down this route first.

To begin with, we didn't really need much more than a proof of concept showing that metrics from AWS could somehow be sent to Splunk Observability Cloud. For the time being, this didn't have to be perfectly automated. We just needed AWS CloudWatch metrics available in Splunk to start experimenting and get a hands-on feeling for how well we could combine the AWS managed metrics with our custom logs and log-based metrics.

This is where a couple of issues immediately surfaced, all of which have their root cause in CloudWatch or the underlying AWS services, and which fall into one of the following categories that we will look at shortly.

Missing metric business dimensions

This one is generally applicable to every AWS service (and perhaps other cloud providers as well): none of the AWS provided service metrics (like CPU and memory utilization of SageMaker endpoints, Glue jobs, etc.) contain any business metadata. They were not aware of the fact that a given SageMaker training job was part of a larger ML workflow that belonged to a specific use case. And by not aware, I mean that the metrics do not contain any additional dimensionality that would allow one to filter or aggregate them from a business perspective.

Let's take a look at the memory and CPU utilization metrics of AWS Glue. These can be found under the CloudWatch metric namespace Glue.

Measuring memory and CPU usage across your job's workers could be achieved, for instance, using one of these metrics (more on the broken naming in a second 🤦‍♂):

  • glue.<executor-id>.jvm.heap.usage
  • glue.<executor-id>.system.cpuSystemLoad

However, the only dimensions associated with these metrics are:

  • JobName
  • JobRunId
  • Type

The same goes for other AWS services. There is a set of predefined technical dimensions that you can use to filter and analyze those metrics, but you cannot look at them from a business perspective. Filtering an ingestion dashboard with expressions such as labels.use_case=… consequently had no impact on these CloudWatch metrics, because they are not aware of these business labels.

So far, so good: you cannot blame AWS for not knowing what kind of business model is applied to your cloud stack. What you can blame AWS for is the fact that there appears to be no technical interface whatsoever to enrich prebuilt AWS managed metrics with additional business dimensions on a per-resource level. Again, see the earlier chapter about standardized metadata propagation and how critical that is from an observability perspective.

As a result, all prebuilt AWS managed metrics were more or less useless to us 🤔

Again: what did we want to achieve? We wanted to visualize all of our custom logs and metrics alongside all prebuilt AWS managed metrics in centralized, Terraform-managed dashboards that could all be queried and filtered from a business perspective. Sounds easy, right? Nope.

Broken Metric Names

Let's make this short. What's a nice way to drive consumers of your metrics to decent levels of anger and madness? Right: include dynamic identifiers of any type in a metric name that is (as far as I have seen) always referenced statically 🤡

  • glue.driver.jvm.heap.usage
  • glue.ALL.jvm.heap.usage
  • glue.1.jvm.heap.usage
  • glue.2.jvm.heap.usage
  • glue.3.jvm.heap.usage
  • glue.n.jvm.heap.usage
  • glue.1.system.cpuSystemLoad
  • glue.2.system.cpuSystemLoad
  • glue.3.system.cpuSystemLoad
  • glue.n.system.cpuSystemLoad

Please! Stuff like executor-id belongs into metric dimensions, not into the metric name 🙈 Over the course of the platform's development history, we created, executed, and destroyed about ten thousand Glue job definitions. There is no way you can define a dynamic, generic dashboard with these metrics. How do you build a generic ingestion dashboard to monitor all your ingestion jobs over the past two weeks? The painful answer is: you can't, not even in CloudWatch. Well, that is, unless you know how many workers a given job started with, along with the other Spark configuration options that govern the number of executors the job will use. Well, but then comes autoscaling…

But even then, you're running into a situation where you need to write a dashboard chart that essentially plots metrics with a range expression (e.g. glue.1.jvm.heap.usage -> glue.100.jvm.heap.usage) just to make sure you included all executors. Again, something like this is usually not possible, since metric names are static. I haven't seen any observability tool yet that lets you dynamically compose metric names and plot them 🤔 This is bad metric design. A far better design would be to rename the metric to something like glue.jvm.heap.usage and attach an additional dimension (named e.g. spark_process or something like that) with dimension values like executor-1, executor-2, executor-n, ALL, driver.

This would enable a single metric expression that is simple to filter down to executor usage only and then group by executor instance:

data(
    "glue.jvm.heap.usage",
    filter=(not filter('spark_process', ['driver', 'ALL']))
).mean(by=['spark_process']).publish(label="% Heap Usage")

In the absence of a sanitized metric, you are doomed to run into madness like this:

data("glue.jvm.1.heap.usage").publish(label="% CPU Usage (executor 1)")
data("glue.jvm.2.heap.usage").publish(label="% CPU Usage (executor 2)")
data("glue.jvm.3.heap.usage").publish(label="% CPU Usage (executor 3)")
data("glue.jvm.4.heap.usage").publish(label="% CPU Usage (executor 4)")
data("glue.jvm.5.heap.usage").publish(label="% CPU Usage (executor 5)")
data("glue.jvm.6.heap.usage").publish(label="% CPU Usage (executor 6)")
data("glue.jvm.7.heap.usage").publish(label="% CPU Usage (executor 7)")
data("glue.jvm.8.heap.usage").publish(label="% CPU Usage (executor 8)")
data("glue.jvm.9.heap.usage").publish(label="% CPU Usage (executor 9)")
data("glue.jvm.10.heap.usage").publish(label="% CPU Usage (executor 10)")
data("glue.jvm.11.heap.usage").publish(label="% CPU Usage (executor 11)")
data("glue.jvm.12.heap.usage").publish(label="% CPU Usage (executor 12)")
data("glue.jvm.13.heap.usage").publish(label="% CPU Usage (executor 13)")
data("glue.jvm.14.heap.usage").publish(label="% CPU Usage (executor 14)")
data("glue.jvm.15.heap.usage").publish(label="% CPU Usage (executor 15)")
data("glue.jvm.16.heap.usage").publish(label="% CPU Usage (executor 16)")
data("glue.jvm.17.heap.usage").publish(label="% CPU Usage (executor 17)")
data("glue.jvm.18.heap.usage").publish(label="% CPU Usage (executor 18)")
data("glue.jvm.19.heap.usage").publish(label="% CPU Usage (executor 19)")
# ...
data("glue.jvm.n.heap.usage").publish(label="% CPU Usage (executor n)")

This is essentially the same madness that AWS is forced into with the prebuilt Glue job dashboards:

[Image: prebuilt Glue job dashboards]

I really hope this does not need more explanation or proof: never include dynamic identifiers (or literally any high-cardinality data) in static metric names! Actually, I'm quite shocked that this apparently needs to be discussed at all.

Broken Metric Values

Spoiler alert: this is one of the issues we haven't fixed to this day, mainly because we became kind of exhausted from fixing issues in cloud provider metrics.

But this is another showcase of broken metrics that you might run into, featuring SageMaker again this time. Sorry AWS, but the service just offers too big a target for criticism, and we had many situations where we were very tempted to write a User Epic with the title Deprecate Amazon Sagemaker.

Let's make this quick: if you spin up an inference endpoint, you get CPU and memory metrics out of the box. That's pretty nice! What's not so nice is how the CPUUtilization metric for real-time inference endpoints is defined, as taken from the official AWS docs:

The sum of each individual CPU core's utilization. The CPU utilization of each core range is 0–100. For example, if there are four CPUs, the CPUUtilization range is 0%–400%

Do I really need to say more? What if you want to create an alert that notifies users about resource over-provisioning for endpoints whose CPU utilization is below a threshold of, say, 20% over 10 minutes? The painful answer is: you can't. Well, that is, unless you are willing to write metric processing logic that (see the sketch after this list):

  • detects the endpoint a given metric point belongs to
  • infers the instance type associated with that endpoint
  • looks up the vCPU count using some mapping from instance type → vCPU count (which hypothetically is also possible dynamically using the AWS Pricing API)
  • and then divides the metric point's value by the number of cores
  • which returns a sanitized metric with a clean range of 0..100 percent
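Below is a minimal sketch of that normalization logic; the instance type → vCPU mapping is a tiny illustrative excerpt (and could instead be derived via the AWS Pricing API), and in practice the describe calls would need caching:

import boto3

sagemaker = boto3.client("sagemaker")

# assumption: a maintained mapping of instance type -> vCPU count
VCPUS_BY_INSTANCE_TYPE = {
    "ml.m5.xlarge": 4,
    "ml.m5.2xlarge": 8,
    "ml.c5.4xlarge": 16,
}

def normalized_cpu_utilization(endpoint_name: str, raw_value: float) -> float:
    """Scale a 0..(100 * vCPUs) CPUUtilization value back to a 0..100 range."""
    endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    config = sagemaker.describe_endpoint_config(
        EndpointConfigName=endpoint["EndpointConfigName"]
    )
    instance_type = config["ProductionVariants"][0]["InstanceType"]
    return raw_value / VCPUS_BY_INSTANCE_TYPE[instance_type]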

Again, over the course of the platform's development history we deployed, scored, and destroyed more than 1,000 such endpoints, each with a potentially individually specified ML instance type, of which there are dozens of different types, each with its own vCPU count.

Yes, there is a technical way to fix this, but at some point you really need to look at ROI. At this point we simply accepted that we would not implement alerting on SageMaker endpoint over-provisioning. As a result, the CPU utilization metrics for SageMaker inference endpoints still look somewhat like this (obviously with varying ranges like 0..400, 0..800, 0..1600 percent, depending on instance types):

[Image: CPU utilization metrics for SageMaker inference endpoints]

Yes, there is a new concept in SageMaker called inference components, which also includes a sanitized metric named CPUUtilizationNormalized. But then again, adopting new SageMaker features is not at all the direction this ML platform was headed. The opposite is the case: over the course of the platform's development history we gradually replaced more and more SageMaker features with custom solutions and implementations due to an overall lack of standards and stability, so I don't think anyone contributing to the platform really feels like testing new SageMaker features.

Fixes

Okay, so what could we do to fix at least some of these issues? First of all, the existing ingestion approach for AWS managed metrics had to be adjusted, since we needed a way to inject custom processing logic, which isn't possible with the pull-based approach managed by Splunk.

Using the push-based approach, however, you could add a custom transformer Lambda function, right? Yes!
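One common way to inject such a transformer into the push-based pipeline is a Firehose data transformation Lambda; this is a hedged sketch of that wiring (not necessarily our exact setup), assuming the JSON output format of the metric stream and ignoring error handling:

import base64
import json

def transform(metric_point: dict) -> dict:
    # placeholder for the enrichment and renaming logic sketched further below
    return metric_point

def handler(event, context):
    """Firehose data transformation handler: decode, transform, re-encode each record."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode()
        # metric stream records in JSON output format are newline-delimited JSON objects
        points = [transform(json.loads(line)) for line in raw.splitlines() if line]
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                "\n".join(json.dumps(p) for p in points).encode()
            ).decode(),
        })
    return {"records": output}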

The final stack looked somewhat like this:

[Image: custom transformer Lambda function]

This is in essence what the transformer Lambda did:

  • look up and cache missing telemetry attributes for metrics via the AWS Tagging API
    — this only worked because, whenever a resource was created on the fly out of the execution graph of an ML workflow, we tried to attach the telemetry labels to each resource as resource tags as well (where that was possible; not all AWS resources support tags)
    — moreover, this approach also only works for metrics that somehow contain a pointer to the resource in their dimensions. This was luckily the case for the metrics we were interested in, e.g. the JobName dimension for Glue job metrics, Host for SageMaker training job metrics, EndpointName for inference endpoints, etc.
  • enrich all metric points with the missing metadata
  • transform broken metric names with dynamic, configurable patterns (sketched below)
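The transform() placeholder from the sketch above could then look roughly like the following; the regex, the target metric names, and the tag keys are illustrative assumptions:

import re

import boto3

tagging = boto3.client("resourcegroupstaggingapi")
_tag_cache = {}

# assumption: broken Glue metric names always look like glue.<spark-process>.<suffix>
GLUE_METRIC_PATTERN = re.compile(
    r"^glue\.(?P<spark_process>driver|ALL|\d+)\.(?P<suffix>jvm\.heap\.usage|system\.cpuSystemLoad)$"
)
RENAMED = {
    "system.cpuSystemLoad": "aws.glue.jobs.cpu_utilization",
    "jvm.heap.usage": "aws.glue.jobs.memory_utilization",
}
TELEMETRY_TAG_KEYS = ("labels.use_case", "labels.workflow_name", "labels.workflow_version")

def glue_job_tags(job_name: str) -> dict:
    """Look up (and cache) the resource tags of a Glue job via the AWS Tagging API."""
    if not job_name:
        return {}
    if job_name not in _tag_cache:
        _tag_cache[job_name] = {}
        pages = tagging.get_paginator("get_resources").paginate(ResourceTypeFilters=["glue:job"])
        for page in pages:
            for mapping in page["ResourceTagMappingList"]:
                if mapping["ResourceARN"].endswith(f":job/{job_name}"):
                    _tag_cache[job_name] = {t["Key"]: t["Value"] for t in mapping["Tags"]}
    return _tag_cache[job_name]

def transform(point: dict) -> dict:
    """Rename broken Glue metric names and enrich dimensions with telemetry labels."""
    match = GLUE_METRIC_PATTERN.match(point["metric_name"])
    if match:
        point["metric_name"] = RENAMED[match.group("suffix")]
        point["dimensions"]["spark_process"] = match.group("spark_process")
    tags = glue_job_tags(point["dimensions"].get("JobName", ""))
    point["dimensions"].update({k: v for k, v in tags.items() if k in TELEMETRY_TAG_KEYS})
    # the final re-mapping to the Splunk datapoint shape (metric/value/timestamp) is omitted
    return point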

Let's use the following CloudWatch metric points as an example:

[
  {
    "namespace": "Glue",
    "metric_name": "glue.1.system.cpuSystemLoad",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz"
    },
    "timestamp": 1715343185369,
    "unit": "Count",
    "value": {
      "count": 1.0,
      "sum": 0.35,
      "max": 0.35,
      "min": 0.35
    }
  },
  {
    "namespace": "Glue",
    "metric_name": "glue.driver.system.cpuSystemLoad",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz"
    },
    "timestamp": 1715343185369,
    "unit": "Count",
    "value": {
      "count": 1.0,
      "sum": 0.35,
      "max": 0.35,
      "min": 0.35
    }
  }
]

The transformer Lambda would in essence:

  • look up and cache the AWS resource tags of the Glue job identified by JobName = foo
  • use the cache to enrich all matching metric points with the telemetry labels attached as resource tags
  • clean up broken metric names, e.g. rename glue.1.system.cpuSystemLoad to aws.glue.jobs.cpu_utilization, and add an additional dimension spark_process with values such as 1, ALL, driver

The results would look somewhat like this:

[
  {
    "metric": "aws.glue.jobs.cpu_utilization",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz",
      "spark_process": "1",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 0.35,
    "timestamp": 1715343185369
  },
  {
    "metric": "aws.glue.jobs.cpu_utilization",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz",
      "spark_process": "driver",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 0.35,
    "timestamp": 1715343185369
  }
]

Phew, we finally had logs, log-based metrics, and AWS managed metrics in a usable and consistently labeled format in Splunk 😍 🚀

[Image: log-based metrics and AWS managed metrics in a usable and consistently labeled format in Splunk]

This finally enabled us to visualize all telemetry signals, regardless of source and ownership, in centralized dashboards, using business criteria to filter and aggregate all signals.

Summary

All these investments finally brought us to a point where we could create Terraform-managed Splunk dashboards that, for instance, allowed us to monitor batch predictions at a single glance, including

  • CPU and memory utilization metrics from batch prediction jobs
  • CPU and memory utilization metrics from inference endpoints
  • inference latencies (broken down into deserialization, prediction, and serialization time)
  • total number of rows predicted, total job duration, and derived velocity metrics (e.g. rows predicted per invested hour over two weeks)
  • total number of log warnings and errors from any predict-related component (MWAA, Glue, SageMaker, API Gateway, Lambda, …) over two weeks
  • as well as log table charts, displaying
    — all predict-related logs
    — configuration options for deployed inference endpoints (e.g. ML model name and version, instance type and count, etc.) as extracted from structured log messages
    — configuration options for deployed batch prediction jobs (e.g. worker type and count, etc.) as extracted from structured log messages

And all of these charts could be filtered easily using predefined dashboard variables such as

  • the ML use case
  • the ML workflow and its version
  • etc.

Moreover, we could write dashboard-specific SQL reports that helped us cross-validate all plotted metrics against the Glue Data Catalog. In fact, at some point I almost felt tempted to write automated dashboard tests that would compare metric values inside the Glue Data Catalog against metric values displayed in Splunk. In theory this might even be possible via the SignalFlow API endpoints. But we haven't really followed this path (yet).

Our telemetry outputs and the design philosophy behind our signals also allowed for easily manageable, use case specific alerts that would route notifications to use case specific channels. These alerts were essentially also pretty generic:

errors = data(
    "<some-prefix>.event_count",
    filter=filter(
        {
            "severity": "ERROR",
            "labels.use_case": "<use-case-name>",
            # ...
        }
    ),
    rollup="sum"
).sum(
    by=[
        "type",
        "exc_type",
        "exc_value",
        # ...
    ]
).publish(label="errors")
detect(when(errors > 0, "1s")).publish(label="errors")

And there you go: all log errors belonging to one use case, regardless of the source (MWAA, Glue, SageMaker training jobs, SageMaker inference endpoints, ECS tasks, Lambda functions, …), covered by one alert definition that you just had to create once per known use case using the usual Terraform for_each logic and a map of use case → Splunk integration ID 🚀

Want to create alerts based on Glue memory usage? Same procedure:

glue_memory_usage = data(
    "aws.glue.jobs.memory_utilization",
    filter=filter(
        {
            "JobRunId": "ALL",
            "labels.use_case": "<use-case-name>",
            # ...
        }
    )
    and not filter({"spark_process": ["ALL"]}),
    rollup="max"
).scale(100).max(
    by=[
        "JobName",
        "spark_process",
        "labels.workflow_name",
        # ...
    ]
).fill(0).publish(label="glue_memory_usage")
detect(when(glue_memory_usage > 80, '1m')).publish(label='max glue memory usage > 80% over 1m')

I don't want to claim that bug hunting in this platform was delightful, but it definitely became a no-brainer and brought us to the point where we were usually able to spot the root cause of a bug within seconds or minutes. More importantly though, for most bugs that originated from misconfiguration or other user-input related errors, we (Engineering/DevOps) weren't even contacted by Data Scientists in the first place, since they usually understood the root cause of a process interruption themselves quickly enough.

We had finally established an MLOps platform that involved little to no Engineering or DevOps resources for the onboarding of new Data Scientists, the maintenance of existing ML use cases, as well as bootstrapping, thanks to investments in culture and standards, testing, documentation, automation, and lastly, observability.

I hope this blog post helped a bit to understand what observability really means beyond pure terminology.

My final verdict on the topic?

  • observability is a massive boost to operational excellence, if not an absolute requirement
  • observability is not something you can quickly opt in to, though. It's not like you can click a button in the cloud and suddenly your cloud stack becomes magically observable. It takes decent investments, it takes culture and standards, it might take a lot of bug fixes and workarounds, and it takes a lot of DevOps and Engineering work. OpenTelemetry has the potential, though, to apply observability in a standardized way, which may very well eliminate the need to do the same research and proof of concept work again and again for different customers and different cloud providers.
  • It may be for these reasons that OpenTelemetry is on its way to becoming a global standard (judging by the broad list of vendors that claim to support it)
  • We need standards for observability, from specification to API to SDK, which is exactly what OpenTelemetry aims to achieve. Though, from our most recent proof of concept work in AWS (experimenting with ADOT and other OpenTelemetry Collector distributions, as well as different collector deployment modes and their integrations with different AWS services), I personally think it will take another couple of years until OpenTelemetry is usable in AWS in diverse cloud stacks. The current AWS implementation state is, again, 100% limited to the serverless microservice realm (EC2, ECS, EKS, Lambda, …) 🙄. Another thing that still speaks against OpenTelemetry is its huge technical complexity. As of the time of writing, I would personally say that OpenTelemetry is written by and for observability experts. It's definitely going to cause nightmares for a non-technical audience, and would in its current state presumably be rejected by such an audience.
  • platforms like Pydantic Logfire may very well have the potential, though, to provide the layer of simplification needed to be usable by a less technical audience

What I personally dream of?

  • observability becoming a first-class citizen of every cloud provider service and every open source library that is known to run at scale, regardless of its domain (ELT, ETL, AI, ML, serverless/microservice architectures, …)
  • the OpenTelemetry logs SDK for Python finally becoming stable 💸
  • library owners becoming more aware of the topic (e.g. apache-airflow, apache-spark, mlflow, terraform, …)
  • a general shift away from legacy unstructured log prose to structured logging
  • a general shift in mindset that observability is a topic that all cloud stacks need, not just serverless microservice stacks
  • in general, a broader implementation of the OpenTelemetry stack in library code throughout different software stacks and programming languages
  • an awareness of the need for interfaces that allow injecting custom telemetry metadata into processes, for both library code and managed cloud provider services
  • an awareness of observability from a cloud service design perspective (think about SageMaker endpoint log groups again, albeit that could perhaps also be of minor importance when using an OpenTelemetry Collector with dedicated exporters)
  • an awareness of best practices when it comes to metric design

Who knows? Perhaps one day even GitHub workflows will emit OpenTelemetry-compatible logs, metrics, and traces, paired with a reworked UI that simplifies centrally analyzing and monitoring such telemetry signals in a distributed CI architecture? 👀

I hope you enjoyed this. Thanks a lot for reading 🍻

Cheers!