In part 3 of this blog post series, we will look at the implementation of our persistent layer and how we started to lay out our volatile layer.

Offline Reporting

Why even send your structured telemetry data to a SQL-like backend in the first place? Well, because every monitoring solution out there will hit its limits with respect to some analysis use cases — be it only due to limitations in the stream processing language such tools often offer. Another common limitation is the time range that can be analyzed and visualized, because realtime monitoring solutions usually put a TTL on data. Don't get me wrong, this is good from a performance and cost perspective, but you do lose insights into your telemetry outputs after some time. Being able to query your structured telemetry outputs via SQL over the full timeline of your project gives you far more analysis power than any monitoring solution I have come across so far. Think of joins, unions, and arbitrarily complex filter expressions on nested but well-structured and versioned data.

Personally, this is the difference between monitoring and reporting in my head — two different approaches that are both valid and can very well supplement each other (as we will see later).

It might sound a bit weird that we tackled this persistent layer first, but if you have worked with structured logging on GCP before and ever used a log sink that effortlessly routes all your structured logs to a BigQuery dataset, you know how powerful such a feature is.

However, the platform was built and hosted on AWS, and alas, AWS has nothing to offer that competes with GCP on that point. We were pretty confident that we would be able to set up some monitoring solution with dashboards and alarms, but less confident that we could establish a SQL-based interface to our telemetry outputs — so we tackled this layer first.

So we actually tried to build such a generic log sink stack ourselves, kind of replicating what GCP presumably did — just with different cloud services, obviously.

The core building blocks for this stack were:

  • CloudWatch log groups with subscription filters
  • a generic Lambda function
  • S3 and the Glue Data Catalog (databases, tables, schema registries)
  • Athena as the SQL query interface

If there's one thing that is always possible in AWS, it's hooking some Lambda function to some process — and AWS did not fail us here 😃

To put it simply, we wanted:

  • to create a generic Lambda function that would ingest log events received via log subscription filters into the Glue Data Catalog, where they would be queryable via Athena
  • and then gradually attach all of our log groups to this function via Terraform-managed subscription filters, as sketched below (thankfully, we had already fixed the design issues of SageMaker inference endpoints upfront by sending our own structured log events to one centralized log group)
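
For illustration, here is a minimal boto3 sketch of what such an attachment boils down to. In practice we declared this via Terraform's aws_cloudwatch_log_subscription_filter resource, and all names and ARNs below are placeholders:

import boto3

logs = boto3.client("logs")
lambda_client = boto3.client("lambda")

LOG_GROUP = "/platform/central-structured-logs"  # placeholder log group name
SINK_FUNCTION_ARN = "arn:aws:lambda:eu-central-1:123456789012:function:glue-log-sink"  # placeholder

# The sink function must allow CloudWatch Logs to invoke it.
lambda_client.add_permission(
    FunctionName=SINK_FUNCTION_ARN,
    StatementId="allow-cloudwatch-logs",
    Action="lambda:InvokeFunction",
    Principal="logs.amazonaws.com",
    SourceArn=f"arn:aws:logs:eu-central-1:123456789012:log-group:{LOG_GROUP}:*",
)

# Forward every event of the log group to the sink function.
logs.put_subscription_filter(
    logGroupName=LOG_GROUP,
    filterName="glue-log-sink",
    filterPattern="",  # empty pattern = forward all events
    destinationArn=SINK_FUNCTION_ARN,
)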

Before we talk about the challenges we faced, it might be worth looking at some of the implementation boundaries that we set up upfront.

First, we did not set up a schema registry for log events where every application (→ training jobs, processing jobs, microservices, … → a diverse bunch) would register its log messages upfront, e.g. as part of a software release process, before log events are actually emitted from that application. Why not? To put it simply, when we started to develop the stack, we weren't sure about the outcome and its benefits, so we didn't want to invest that early into such a registry and the required CI overhead. Don't get me wrong, we needed a schema registry for log events for this stack anyway, but we chose against populating it upfront.

Second, we quickly agreed on keeping Glue Crawlers out of the stack. This is because using Glue Crawlers on nested JSON data isn't as comfortable as one might expect. Consider this simple JSON object:

{"foo": [], "bar": {}, "baz": null}

If you run a Glue Crawler with default configuration on such a JSON snippet, you end up with a table schema like this:

[
  {
    "Name": "foo",
    "Type": "array<string>"
  },
  {
    "Name": "bar",
    "Type": "string"
  },
  {
    "Name": "baz",
    "Type": "string"
  }
]

Honestly? None of these derived data types makes any sense whatsoever. You simply cannot derive a data type for an undefined JSON object. Declaring that $.foo would be an array of strings? What if, once the value is populated for the first time, it contains data like [{"hello": ["world"]}]? Declaring that the empty struct located at $.bar would be a string 😬? Declaring that the completely undefined object located at $.baz would be a string 😲? There is a huge chance that once a real value appears in any of these sections, it requires a breaking change to that column's data type definition.

Aside from questionable design decisions in the crawler internals such as those listed above, there was another big point that spoke against crawlers. We wanted to catch incompatible events before they are even routed to a given target table, to ensure data is always queryable via Athena. If you route an incompatible event to that table and only verify afterwards… that's essentially too late. Good luck spotting those incompatible events in S3 and getting rid of them — I hope you're good at binary search 😅

Why not use a classical relational database with support for nested data structures then? Good question! Well, that is mainly for two reasons:

  • cost efficiency — again, since we weren't sure about the outcome and its benefits, we didn't want to spin up a scalable relational database service that might end up permanently up and running (in case it didn't support scaling down to zero) and thus generate costs. Having your data in S3 only produces storage costs, but no compute costs as long as you don't issue queries against it. And the costs for resources such as Glue databases, tables and the like are pretty much negligible.
  • implementation complexity — writing a Lambda function that permanently needs to manage database sessions, fire SQL queries against the database's schema/metadata tables to fetch table schemata and update those where needed is way more complex than calling structured RESTful APIs that don't force you to constantly await and parse query results

These implementation boundaries essentially created the need to write a Lambda function (a condensed sketch follows this list) that would

  • process batches of structured log events from various origins
  • perform event cleanup, schema inference and schema evolution (for minor changes) on the fly
  • push log events to S3 using standardized URL path conventions
  • and manage all related Glue resources on the fly (databases, tables, schemata, …)
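
To make this a bit more concrete, here is a heavily condensed sketch of what such a handler can look like, assuming the original CloudWatch Logs to Lambda wiring (before the SQS decoupling described further down). The actual processing steps are only hinted at in comments:

import base64
import gzip
import json


def handler(event, context):
    """Condensed sketch of the log sink function (processing steps trimmed)."""
    # CloudWatch Logs delivers batches as gzipped, base64-encoded JSON.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    for log_event in payload["logEvents"]:
        record = json.loads(log_event["message"])  # one structured log event

        # 1. event cleanup plus schema inference/evolution happens here; incompatible
        #    events are rejected *before* anything lands in the target table
        # 2. the cleaned event is written to S3 under a standardized path convention,
        #    e.g. s3://<sink-bucket>/<database>/<table>/dt=<YYYY-MM-DD>/<uuid>.json.gz
        # 3. Glue databases, tables and schema versions are created or updated on the
        #    fly to match the (possibly evolved) schema
        print(json.dumps(record))  # placeholder for the actual processing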

This created the need to capture and manage table schemata somewhere, and Glue tables themselves are a very poor solution for doing so, as awkward as that might sound. There is not a single AWS API endpoint I'm aware of that returns the schema of a Glue table as a fully structured object. The glue.get_table API call does return an array of columns including the data type of each column, but the data type definition of that column is simply returned as a Hive data type string (e.g. struct<namespace:string,name:string>), which is not easily machine-parseable — especially if you operate on highly nested data.
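
To illustrate the point, this is roughly what you get back via boto3 (the database and table names are reused from the query example further down; the printed types are just an example of the shape):

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="some_glue_db", Name="some_glue_table")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], "->", column["Type"])

# Prints raw Hive type strings, for example:
#   labels -> struct<use_case:string,workflow_name:string>
#   body   -> struct<model:struct<name:string,version:string>,python_version:string>
# i.e. something you have to parse yourself before you can reason about nested
# fields programmatically.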

As a workaround, we used Glue Schema Registries instead, which at the time only supported AVRO (not my personal favorite), in conjunction with Glue tables. But at least we had a structured way of managing table schemata.
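
A hedged sketch of what registering a new schema version can look like via boto3 (registry and schema names are hypothetical, the AVRO definition is trimmed to a fraction of the outer log event structure, and the schema itself is assumed to have been created beforehand, e.g. via glue.create_schema or Terraform):

import json

import boto3

glue = boto3.client("glue")

# Trimmed AVRO definition mirroring the outer structure of our log events.
avro_schema = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {
            "name": "labels",
            "type": {
                "type": "record",
                "name": "Labels",
                "fields": [
                    {"name": "use_case", "type": "string"},
                    {"name": "workflow_name", "type": "string"},
                ],
            },
        },
        # body, timestamps, … trimmed for brevity
    ],
}

# The registry validates the new version against the schema's compatibility
# setting (e.g. BACKWARD), so incompatible changes surface here instead of at
# query time.
glue.register_schema_version(
    SchemaId={"RegistryName": "log-sink-registry", "SchemaName": "some_glue_table"},
    SchemaDefinition=json.dumps(avro_schema),
)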

At this point we would have had a working and configurable Glue Data Catalog log sink stack, had we not run into one more issue: API throttling. The Glue APIs throttled our Lambda function so hard that even 10 configured retries couldn't solve the issue.

To fix this, we decoupled our log groups and the Lambda function via an SQS queue, which allowed us to heavily increase the batch size of processed log events.
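
For reference, a minimal sketch of the consuming side of that queue using a standard Lambda/SQS event source mapping (queue ARN, function name and the concrete batch settings are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# With SQS in between, the sink function consumes much larger batches, which
# spreads the Glue API calls out and keeps us below the throttling limits.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-central-1:123456789012:log-sink-queue",
    FunctionName="glue-log-sink",
    BatchSize=500,                      # much larger than a single log subscription delivery
    MaximumBatchingWindowInSeconds=60,  # wait up to a minute to fill a batch
)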

The final scalable stack looked somewhat like this:

[Architecture diagram: log groups decoupled from the Lambda function via an SQS queue]

And it could easily grow. Want to attach additional log sources? Just extend the list of Terraform-managed log subscription filters! SageMaker again? Thankfully, we had already fixed that.

Yeah! We essentially brought SQS and Data Catalog log sinks to AWS 🚀

Once we had the stack up and running, we were able to write queries like these:

SELECT
    labels.use_case
  , body.model.name
  , body.model.version
  , body.python_version
  , body.python_packages
  /* ... */
FROM "some_glue_db"."some_glue_table"
/* ... */
;

And suddenly, you have access to the Python version and the full set of installed Python packages of every model and version ever trained 🚀 That is quite handy if you think of deprecation processes such as:

  • dropping support for Python 3.7, 3.8, 3.9, …
  • dropping platform-wide support for some third-party library such as pydantic v1

Thinking of a consistent logs data model? Nice: every database table has the same outer structure inherited from the shared data model. That makes union operations way easier to write 🚀

Thinking of consistent telemetry metadata? That made join operations so much easier to write:

INNER JOIN log_db_1.log_table_1 ON x.labels.workflow_name = ...
INNER JOIN log_db_2.log_table_2 ON x.labels.workflow_name = ...
/* ... */
INNER JOIN log_db_n.log_table_n ON x.labels.workflow_name = ...

Having all of our logs and custom metrics queryable via SQL was also a big lifesaver when writing and, more importantly, validating our monitoring dashboards — see the next chapters.

Realtime Monitoring

Setting up the volatile layer to establish realtime monitoring is probably where a good 30% to 40% of our overall investment went, and it came with a decent amount of frustration and pain — although the final results were definitely worth it.

What did we want to achieve? We wanted to be able to visualize all of our custom logs and metrics alongside all prebuilt AWS-managed metrics in centralized, Terraform-managed dashboards that could be queried and filtered from a business perspective — sounds easy, right?

An example could be a data ingest dashboard that visualizes all ingest-related custom logs, log-based metrics and AWS-managed CloudWatch metrics (e.g. memory and CPU utilization) across all executed ingestion jobs over a given time period, and that could be drilled down using filter expressions like labels.use_case='fancy-ml' to only look at ingestion logs and metrics belonging to that particular use case.

Choosing an observability tool

Despite some of our previous painful experiences with CloudWatch, or more precisely with the way specific AWS services are integrated with CloudWatch, we first evaluated whether we could use the out-of-the-box cloud-native service provided by AWS to achieve our goals.

After all, we also had some good experiences with CloudWatch, especially CloudWatch Logs Insights — although it would definitely be nice to add some click interactions to the service.

However, after building our first CloudWatch dashboards, we immediately ran into several issues, such as:

  • missing support for some chart types
  • missing customization options
  • restrictive service quotas, like the limitation that the CloudWatch metric query language can only query at most 3 hours of data 🤔

Alongside a (from our perspective) pretty unintuitive definition and usage of metrics, dimensions and namespaces, CloudWatch quickly lost us as a customer. This decision was even easier since we already had a competitor at hand — Splunk Observability Cloud, which was ready to use 🤷‍♀🤷‍♂

Sorry to put it this way, but based on our previous experiences we didn't feel like spending much more time on customizing CloudWatch dashboards with custom Lambda functions or anything like that, so we pretty much instantly stopped our tests with CloudWatch and continued to challenge Splunk Observability Cloud instead.

Some notes about the product upfront:

  • Splunk Observability Cloud was added to Splunk's product landscape as a result of the acquisition of SignalFx (October 2019, if I'm not mistaken)
  • Splunk itself was recently acquired by Cisco (March 2024, if I'm not mistaken)

As a consequence:

  • We already noticed a lot of movement and work in progress while building our solutions around Splunk Observability Cloud (parts of the documentation slowly vanishing or moving to other places, API docs no longer being available or being moved, API endpoints being completely deprecated in favor of new endpoints, etc.)
  • This movement and work in progress might or might not increase even more now that Splunk has been purchased by Cisco
  • No offense meant, but some of the links contained in the upcoming chapters might become outdated and return 404s. I will provide them nonetheless, if just for the sake of completeness.

In the last part, we will look at ingesting logs, proprietary metrics and AWS-managed metrics into Splunk. We will look at the issues we faced and how we solved them.

So stay tuned one last time 😉