In part 3 of this blog post series, we will look at the implementation of our persistent layer and how we started to lay out our volatile layer.

Offline Reporting

Why even send your structured telemetry data to a SQL-like backend in the first place? Well, because every monitoring solution out there will hit its limits with respect to some analysis use cases — be it only due to limitations in the stream processing language such tools often offer. Another common limitation is the time range that can be analyzed and visualized, because realtime monitoring solutions usually put a TTL on data. Don't get me wrong, this is good from a performance and cost perspective, but you do lose insights into your telemetry outputs after some time. Being able to query your structured telemetry outputs via SQL over the full timeline of your project gives you far more analysis power than any monitoring solution I have come across so far. Think of joins, unions, and arbitrarily complex filter expressions on nested but well-structured and versioned data.

Personally, this is the difference between monitoring and reporting in my head — two different approaches that are both valid and can very well supplement each other (as we will see later).

It might sound a bit weird that we tackled this persistent layer first, but if you have worked with structured logging on GCP before and ever used a log sink that effortlessly routes all your structured logs to a BigQuery dataset, you know how powerful such a feature is.

However, the platform was built and hosted on AWS, and alas, AWS has nothing to offer that competes with GCP on that point. We were pretty confident that we would be able to set up some monitoring solution with dashboards and alarms, but less confident that we could establish a SQL-based interface to our telemetry outputs — so we tackled this layer first.

So we actually tried to build such a generic log sink stack ourselves, kind of replicating what GCP presumably did — just with different cloud services, obviously.

The core building blocks for this stack were:

  • CloudWatch log groups with subscription filters
  • a generic Lambda function
  • S3 and the Glue Data Catalog (databases, tables, schema registries)
  • Athena as the SQL query interface

If there's one thing that is always possible in AWS, it's hooking some Lambda function to some process — and AWS did not fail us here 😃

To put it simply, we wanted:

  • to create a generic Lambda function that would ingest log events received via log subscription filters into the Glue Data Catalog, where they would be queryable via Athena
  • and then gradually attach all of our log groups to this function via Terraform-managed subscription filters, as sketched below (thankfully, we had already fixed the design issues of SageMaker inference endpoints upfront by sending our own structured log events to one centralized log group)
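
For illustration, here is a minimal boto3 sketch of what such an attachment boils down to. In practice we declared this via Terraform's aws_cloudwatch_log_subscription_filter resource, and all names and ARNs below are placeholders:

import boto3

logs = boto3.client("logs")
lambda_client = boto3.client("lambda")

LOG_GROUP = "/platform/central-structured-logs"  # placeholder log group name
SINK_FUNCTION_ARN = "arn:aws:lambda:eu-central-1:123456789012:function:glue-log-sink"  # placeholder

# The sink function must allow CloudWatch Logs to invoke it.
lambda_client.add_permission(
    FunctionName=SINK_FUNCTION_ARN,
    StatementId="allow-cloudwatch-logs",
    Action="lambda:InvokeFunction",
    Principal="logs.amazonaws.com",
    SourceArn=f"arn:aws:logs:eu-central-1:123456789012:log-group:{LOG_GROUP}:*",
)

# Forward every event of the log group to the sink function.
logs.put_subscription_filter(
    logGroupName=LOG_GROUP,
    filterName="glue-log-sink",
    filterPattern="",  # empty pattern = forward all events
    destinationArn=SINK_FUNCTION_ARN,
)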

Before we talk about the challenges we faced, it might be worth looking at some of the implementation boundaries that we set up upfront.

First, we did not set up a schema registry for log events where every application (→ training jobs, processing jobs, microservices, … → a diverse bunch) would register its log messages upfront, e.g. as part of a software release process, before log events are actually emitted from that application. Why not? To put it simply, when we started to develop the stack, we weren't sure about the outcome and its benefits, so we didn't want to invest that early into such a registry and the required CI overhead. Don't get me wrong, we needed a schema registry for log events for this stack anyway, but we chose against populating it upfront.

Second, we quickly agreed on keeping Glue Crawlers out of the stack. This is because using Glue Crawlers on nested JSON data isn't as comfortable as one might expect. Consider this simple JSON object:

{"foo": [], "bar": {}, "baz": null}

If you run a Glue Crawler with default configuration on such a JSON snippet, you end up with a table schema like this:

[
  {
    "Name": "foo",
    "Type": "array<string>"
  },
  {
    "Name": "bar",
    "Type": "string"
  },
  {
    "Name": "baz",
    "Type": "string"
  }
]

Honestly? None of these derived data types makes any sense whatsoever. You simply cannot derive a data type for an undefined JSON object. Declaring that $.foo would be an array of strings? What if, once the value is populated for the first time, it contains data like [{"hello": ["world"]}]? Declaring that the empty struct located at $.bar would be a string 😬? Declaring that the completely undefined object located at $.baz would be a string 😲? There is a huge chance that once a real value appears in any of these sections, it requires a breaking change to that column's data type definition.

Aside from questionable design decisions in the crawler internals such as those listed above, there was another big point that spoke against crawlers. We wanted to catch incompatible events before they are even routed to a given target table, to ensure data is always queryable via Athena. If you route an incompatible event to that table and only verify afterwards… that's essentially too late. Good luck spotting those incompatible events in S3 and getting rid of them — I hope you're good at binary search 😅

Why not use a classical relational database with support for nested data structures then? Good question! Well, that is mainly for two reasons:

  • cost efficiency — again, since we weren't sure about the outcome and its benefits, we didn't want to spin up a scalable relational database service that might end up permanently up and running (in case it didn't support scaling down to zero) and thus generate costs. Having your data in S3 only produces storage costs, but no compute costs as long as you don't issue queries against it. And the costs for resources such as Glue databases, tables and the like are pretty much negligible.
  • implementation complexity — writing a Lambda function that permanently needs to manage database sessions, fire SQL queries against the database's schema/metadata tables to fetch table schemata and update those where needed is way more complex than calling structured RESTful APIs that don't force you to constantly await and parse query results

These implementation boundaries essentially created the need to write a Lambda function (a condensed sketch follows this list) that would

  • process batches of structured log events from various origins
  • perform event cleanup, schema inference and schema evolution (for minor changes) on the fly
  • push log events to S3 using standardized URL path conventions
  • and manage all related Glue resources on the fly (databases, tables, schemata, …)
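
To make this a bit more concrete, here is a heavily condensed sketch of what such a handler can look like, assuming the original CloudWatch Logs to Lambda wiring (before the SQS decoupling described further down). The actual processing steps are only hinted at in comments:

import base64
import gzip
import json


def handler(event, context):
    """Condensed sketch of the log sink function (processing steps trimmed)."""
    # CloudWatch Logs delivers batches as gzipped, base64-encoded JSON.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    for log_event in payload["logEvents"]:
        record = json.loads(log_event["message"])  # one structured log event

        # 1. event cleanup plus schema inference/evolution happens here; incompatible
        #    events are rejected *before* anything lands in the target table
        # 2. the cleaned event is written to S3 under a standardized path convention,
        #    e.g. s3://<sink-bucket>/<database>/<table>/dt=<YYYY-MM-DD>/<uuid>.json.gz
        # 3. Glue databases, tables and schema versions are created or updated on the
        #    fly to match the (possibly evolved) schema
        print(json.dumps(record))  # placeholder for the actual processing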

This created the need to capture and manage table schemata somewhere, and Glue tables themselves are a very poor solution for doing so, as awkward as that might sound. There is not a single AWS API endpoint I'm aware of that returns the schema of a Glue table as a fully structured object. The glue.get_table API call does return an array of columns including the data type of each column, but the data type definition of that column is simply returned as a Hive data type string (e.g. struct<namespace:string,name:string>), which is not easily machine-parseable — especially if you operate on highly nested data.
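
To illustrate the point, this is roughly what you get back via boto3 (the database and table names are reused from the query example further down; the printed types are just an example of the shape):

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="some_glue_db", Name="some_glue_table")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], "->", column["Type"])

# Prints raw Hive type strings, for example:
#   labels -> struct<use_case:string,workflow_name:string>
#   body   -> struct<model:struct<name:string,version:string>,python_version:string>
# i.e. something you have to parse yourself before you can reason about nested
# fields programmatically.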

As a workaround, we used Glue Schema Registries instead, which at the time only supported AVRO (not my personal favorite), in conjunction with Glue tables. But at least we had a structured way of managing table schemata.
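
A hedged sketch of what registering a new schema version can look like via boto3 (registry and schema names are hypothetical, the AVRO definition is trimmed to a fraction of the outer log event structure, and the schema itself is assumed to have been created beforehand, e.g. via glue.create_schema or Terraform):

import json

import boto3

glue = boto3.client("glue")

# Trimmed AVRO definition mirroring the outer structure of our log events.
avro_schema = {
    "type": "record",
    "name": "LogEvent",
    "fields": [
        {
            "name": "labels",
            "type": {
                "type": "record",
                "name": "Labels",
                "fields": [
                    {"name": "use_case", "type": "string"},
                    {"name": "workflow_name", "type": "string"},
                ],
            },
        },
        # body, timestamps, … trimmed for brevity
    ],
}

# The registry validates the new version against the schema's compatibility
# setting (e.g. BACKWARD), so incompatible changes surface here instead of at
# query time.
glue.register_schema_version(
    SchemaId={"RegistryName": "log-sink-registry", "SchemaName": "some_glue_table"},
    SchemaDefinition=json.dumps(avro_schema),
)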

At this point we would have had a working and configurable Glue Data Catalog log sink stack, had we not run into one more issue: API throttling. The Glue APIs throttled our Lambda function so hard that even 10 configured retries couldn't solve the issue.

To fix this, we decoupled our log groups and the Lambda function via an SQS queue, which allowed us to heavily increase the batch size of processed log events.
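
For reference, a minimal sketch of the consuming side of that queue using a standard Lambda/SQS event source mapping (queue ARN, function name and the concrete batch settings are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# With SQS in between, the sink function consumes much larger batches, which
# spreads the Glue API calls out and keeps us below the throttling limits.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-central-1:123456789012:log-sink-queue",
    FunctionName="glue-log-sink",
    BatchSize=500,                      # much larger than a single log subscription delivery
    MaximumBatchingWindowInSeconds=60,  # wait up to a minute to fill a batch
)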

The final scalable stack looked somewhat like this:

[Architecture diagram: log groups decoupled from the Lambda function via an SQS queue]

And it could easily grow. Want to attach additional log sources? Just extend the list of Terraform-managed log subscription filters! SageMaker again? Thankfully, we had already fixed that.

Yeah! We essentially brought SQS and Data Catalog log sinks to AWS 🚀

Once we had the stack up and running, we were able to write queries like these:

SELECT
    labels.use_case
  , body.model.name
  , body.model.version
  , body.python_version
  , body.python_packages
  /* ... */
FROM "some_glue_db"."some_glue_table"
/* ... */
;

And suddenly, you have access to the Python version and the full set of installed Python packages of every model and version ever trained 🚀 That is quite handy if you think of deprecation processes such as:

  • dropping support for Python 3.7, 3.8, 3.9, …
  • dropping platform-wide support for some third-party library such as pydantic v1

Thinking of a consistent logs data model? Nice: every database table has the same outer structure inherited from the shared data model. That makes union operations way easier to write 🚀

Thinking of consistent telemetry metadata? That made join operations so much easier to write:

INNER JOIN log_db_1.log_table_1 ON x.labels.workflow_name = ...
INNER JOIN log_db_2.log_table_2 ON x.labels.workflow_name = ...
/* ... */
INNER JOIN log_db_n.log_table_n ON x.labels.workflow_name = ...

Having all of our logs and custom metrics queryable via SQL was also a big lifesaver when writing and, more importantly, validating our monitoring dashboards — see the next chapters.

Realtime Monitoring

Setting up the volatile layer to establish realtime monitoring is probably where a good 30% to 40% of our overall investment went, and it came with a decent amount of frustration and pain — although the final results were definitely worth it.

What did we want to achieve? We wanted to be able to visualize all of our custom logs and metrics alongside all prebuilt AWS-managed metrics in centralized, Terraform-managed dashboards that could be queried and filtered from a business perspective — sounds easy, right?

An example could be a data ingest dashboard that visualizes all ingest-related custom logs, log-based metrics and AWS-managed CloudWatch metrics (e.g. memory and CPU utilization) across all executed ingestion jobs over a given time period, and that could be drilled down using filter expressions like labels.use_case='fancy-ml' to only look at ingestion logs and metrics belonging to that particular use case.

Choosing an observability tool

Despite some of our previous painful experiences with CloudWatch, or more precisely with the way specific AWS services are integrated with CloudWatch, we first evaluated whether we could use the out-of-the-box cloud-native service provided by AWS to achieve our goals.

After all, we also had some good experiences with CloudWatch, especially CloudWatch Logs Insights — although it would definitely be nice to add some click interactions to the service.

However, after building our first CloudWatch dashboards, we immediately ran into several issues, such as:

  • missing support for some chart types
  • missing customization options
  • restrictive service quotas, like the limitation that the CloudWatch metric query language can only query at most 3 hours of data 🤔

Alongside a (from our perspective) pretty unintuitive definition and usage of metrics, dimensions and namespaces, CloudWatch quickly lost us as a customer. This decision was even easier since we already had a competitor at hand — Splunk Observability Cloud, which was ready to use 🤷‍♀🤷‍♂

Sorry to put it this way, but based on our previous experiences we didn't feel like spending much more time on customizing CloudWatch dashboards with custom Lambda functions or anything like that, so we pretty much instantly stopped our tests with CloudWatch and continued to challenge Splunk Observability Cloud instead.

Some notes about the product upfront:

  • Splunk Observability Cloud was added to Splunk's product landscape as a result of the acquisition of SignalFx (October 2019, if I'm not mistaken)
  • Splunk itself was recently acquired by Cisco (March 2024, if I'm not mistaken)

As a consequence:

  • We already noticed a lot of movement and work in progress while building our solutions around Splunk Observability Cloud (parts of the documentation slowly vanishing or moving to other places, API docs no longer being available or being moved, API endpoints being completely deprecated in favor of new endpoints, etc.)
  • This movement and work in progress might or might not increase even more now that Splunk has been purchased by Cisco
  • No offense meant, but some of the links contained in the upcoming chapters might become outdated and return 404s. I will provide them nonetheless, if just for the sake of completeness.

In the last part, we will look at ingesting logs, proprietary metrics and AWS-managed metrics into Splunk. We will look at the issues we faced and how we solved them.

So stay tuned one last time 😉