Observability in Diverse Cloud Environments (4/4) – Ingestion into Splunk



In this last part of the blog post series, we will look at how we ingested our telemetry outputs into Splunk, review our final achievements, and finish with a personal summary on the topic.

Logs and Log-Based Metrics in Splunk

There are multiple ways to ingest log signals into Splunk Observability Cloud, to mention a few:

  • you can set up a Splunk-compatible OpenTelemetry Collector to which applications offload logs (and potentially metrics and traces)
  • you can send your logs via CloudWatch log subscription filters to a Lambda function that internally calls the Splunk API endpoints
  • you can instrument your applications to directly export logs (and potentially metrics and traces) against the Splunk API endpoints

Out of those three options, option one is probably the most scalable and sensible one. It does however require that your applications have been instrumented with the OpenTelemetry SDKs, which our application landscape was not (again, because parts of the OpenTelemetry specification were still experimental back then).

Option three can quickly lead you into a vendor lock-in, where changing the observability backend implies refactoring application logic. This doesn't make sense and also results in additional API latency, as well as forcing application code to implement retry mechanisms against third-party API endpoints, and so on.

Option two has thus been the most sensible option in our scenario, since we didn't want to run into a vendor lock-in and using an OpenTelemetry Collector was not really an option for us.

Back then, Splunk offered a prebuilt Lambda function called aws-log-collector, which contained the business logic to process batches of log events and ingest them into Splunk Observability Cloud via HTTP requests against one of the older SignalFx API endpoints. In addition to the prebuilt Lambda function, Splunk provided prebuilt templates that allowed you to deploy these Lambda functions automatically using AWS CloudFormation.

Albeit I am not a big fan of mixing different IaC frameworks (in this case Terraform and AWS CloudFormation), this was at least pretty easy to set up. You just had to attach your log groups to the Lambda function via log subscription filters and have your logs pop up in Splunk Observability Cloud. Actually quite similar to how we set up our persistence layer mentioned earlier 🤷‍♀ 🤷‍♂
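For reference, attaching a log group to such a collector Lambda boils down to a single subscription filter call. Here is a minimal boto3 sketch, assuming the aws-log-collector function has already been deployed and permits invocation by CloudWatch Logs (the log group name and ARN are placeholders):

import boto3

logs = boto3.client("logs")

# Forward all events of one log group to the (already deployed) collector Lambda.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/some-application",
    filterName="to-splunk-aws-log-collector",
    filterPattern="",  # empty pattern = forward every log event
    destinationArn="arn:aws:lambda:eu-central-1:123456789012:function:aws-log-collector",
)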

As soon as we had our logs ingested into Splunk Observability Cloud, we immediately started to challenge Splunk with things we had already accomplished in AWS. This was primarily the log interfaces we had previously set up using CloudWatch Logs Insights. The counterpart in Splunk is called Splunk Log Observer, and at first glance, it gave us everything we needed.

We could easily reproduce the real-time log tables that we had seen before in CloudWatch.

[Image: real-time log tables in CloudWatch]

Another thing that felt way more intuitive than CloudWatch is that Log Observer provides a lot more click interactions, like hovering over a structured log entry and adding fields to filter expressions via drop-down menus 😍

[Image: hovering over a structured log entry and adding fields to filter expressions via drop-down menus]

While Log Observer felt way more intuitive, it also felt a bit less mature from an automation perspective compared to CloudWatch Logs Insights.

Managing log queries is kind of awkward. Yes, you can save your log queries to review them later, but there are some quirks to be prepared for:

  • First, when you edit an existing query and save it, you essentially create a copy of your query, which has the same name and can only be differentiated by an internal ID that is not visible in the UI 🙄 This really feels awkward
  • Second, there appears to be no way to share a saved query with colleagues. Coop mode? Disabled
  • Third, there is no way to automate the definition of such queries using the official signalfx Terraform provider

The only workaround for all of these issues is to instead create a dashboard holding your log query, managed by Terraform, to support a continuous development flow. Managing dashboards via Terraform, as well as the way that Log Observer and dashboards interact with each other, also has a lot of flaws (as I might point out later or in another blog post), but at least there is a path to automation.

Intermediate verdict at this point? More intuitive and user friendly, but also less mature in some aspects.

The next thing we wanted to inspect was the way Splunk deals with log-based metrics, a concept called metricization rules in Splunk Observability Cloud. The concept might sound compelling, but there were three reasons that instantly blocked the usage of this feature:

  • no path to automation — the signalfx Terraform provider simply does not support managing these rules declaratively via Terraform. I might be a hardliner on that one, but if there's no path to automation, then such a feature may at most be classified as experimental IMHO.
  • restrictive quotas — you can define at most 128 of these rules per organization
  • inappropriate visualization results — in our experiments with the metricization rules, we constantly saw patterns where the resulting metrics would show blocks of data that simply should not exist, or blocks of data that simply wouldn't aggregate to an expected number. Keep in mind that we had a very accurate cross-validation system in place: the Glue Data Catalog, where we stored all of our custom logs and metrics. The resulting metrics from Splunk simply couldn't be validated against the Glue Data Catalog.

However, finding number one alone was already sufficient reason not to opt in to this feature. Seriously, this was a big letdown. But then again, we were already quite proficient in implementing workarounds for non-functional cloud provider (or, in this case, third-party provider) features.

Instead, we built our own Lambda function that would push logs and log-based metrics to Splunk. The resulting stack looked somewhat like this:

[Image: Lambda function pushing logs and log-based metrics to Splunk]

The Lambda function would in essence be triggered with batches of log events received from various log groups via log subscription filters, derive log-based metrics dynamically using generic configuration patterns, and send

  • logs to the SignalFx API endpoint https://ingest.<region>.signalfx.com/v1/log
  • log-based metrics to the SignalFx API endpoint https://ingest.<region>.signalfx.com/v2/datapoint

While extracting metrics from logs and sending them to the v2/datapoint endpoint was more or less trivial, sending logs to the v1/log endpoint was kind of annoying. That is because the API endpoint was not documented (anymore?), so we had to reverse engineer the open-source Lambda function provided by Splunk to see how the endpoint is called. The rest was experimentation and trial and error. This isn't a very satisfying experience, but it could be one of those artifacts that occur when one company is taken over by another one; I've talked about that before.
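For the documented v2/datapoint endpoint, the call itself is straightforward. Here is a minimal sketch of the metric part of such a Lambda, assuming the ingest realm and an org access token are provided as environment variables (both placeholders); the datapoint shape mirrors the examples further below:

import os

import requests

INGEST_URL = f"https://ingest.{os.environ['INGEST_REALM']}.signalfx.com/v2/datapoint"

def send_datapoints(counters, gauges):
    """Push derived counter and gauge datapoints to Splunk Observability Cloud."""
    response = requests.post(
        INGEST_URL,
        headers={"X-SF-Token": os.environ["SFX_TOKEN"]},
        json={"counter": counters, "gauge": gauges},
        timeout=10,
    )
    # Beware: as described below, 2xx does not guarantee the payload made sense.
    response.raise_for_status()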

Another thing that felt kind of awkward was the fact that some of the API endpoints would return 2xx status codes even for malformed requests. That made experimentation really tricky. Honestly, I would have expected to receive 4xx status codes for bad requests, but that isn't the case.

In essence, you could send both endpoints (IIRC, but this is at least still the case for the v2/datapoint endpoint) completely arbitrary JSON payloads, such as this:

{
  "this": {
    "makes": {
      "no": "sense"
    }
  }
}

And you would still see a 200 OK response. In case of invalid requests (against an undocumented API endpoint when it comes to logs), you simply wouldn't see data in Splunk Observability Cloud. That, IMHO, is a pretty awkward API design, but perhaps I'm missing some points here that might be common practice somehow?

Thankfully, we could at least deprecate the undocumented legacy v1 log endpoint some months later in favor of a new endpoint, https://<Splunk instance that runs HEC>/services/collector. This endpoint, however, does not ingest logs into Splunk Observability Cloud directly, but instead into Splunk Enterprise / Splunk Cloud Platform (don't ask me how the licensing model works). To bridge both products, Splunk built an adapter called Log Observer Connect, which essentially connects the two with each other while also promising more performance.
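Sending logs to the HEC endpoint then follows the standard HEC event format. A minimal sketch, with the HEC URL and token as placeholders and one of our structured log events as the payload:

from datetime import datetime

import requests

HEC_URL = "https://<splunk-instance-that-runs-hec>/services/collector"
HEC_TOKEN = "<hec-token>"  # assumption: a token scoped to the target index

def send_log_event(event: dict):
    """Forward one structured log event to Splunk via the HTTP Event Collector."""
    response = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json={
            "event": event,          # the structured log record itself
            "sourcetype": "_json",   # assumption: generic JSON sourcetype
            # HEC expects epoch time; our events carry an ISO 8601 timestamp
            "time": datetime.fromisoformat(event["timestamp"]).timestamp(),
        },
        timeout=10,
    )
    response.raise_for_status()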

What was the result of our metric extraction? We essentially derived two types of metrics dynamically: one counter metric for every log event, and one gauge metric for every key-value pair in the metrics section of a log (given it contained one at all).

Take this as an example again:

{
  "id": "43748eba-b877-4343-9ada-edf98cd1e919",
  "timestamp": "2024-05-10T12:13:05.368848+00:00",
  "resource": {
    "namespace": "processing",
    "name": "ingest"
  },
  "instrumentation_scope": {
    "name": "some_lib",
    "version": "1.5.0"
  },
  "type": "WriteSucceeded",
  "severity": "INFO",
  "labels": {
    "use_case": "fancy-ml",
    "workflow_name": "foo",
    "workflow_version": "1.0.0"
  },
  "body": {
    "metrics": {
      "duration_seconds": 15.7,
      "row_count": 123412.0
    },
    "target": "s3://some-bucket/some/key"
  },
  "traceback": null,
  "location": {
    "file_system_path": "/foo/bar/foo.py",
    "module": "foo",
    "lineno": 20
  }
}

This would result in a counter metric like this (simplified):

{
  "metric": "<some-prefix>.event_count",
  "dimensions": {
    "type": "WriteSucceeded",
    "severity": "INFO",
    "labels.use_case": "fancy-ml",
    "labels.workflow_name": "foo",
    "labels.workflow_version": "1.0.0"
  },
  "value": 1,
  "timestamp": 1715343185369
}

And two gauge metrics like these (simplified):

[
  {
    "metric": "<some-prefix>.duration_seconds",
    "dimensions": {
      "type": "WriteSucceeded",
      "severity": "INFO",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 15.7,
    "timestamp": 1715343185369
  },
  {
    "metric": "<some-prefix>.row_count",
    "dimensions": {
      "type": "WriteSucceeded",
      "severity": "INFO",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 123412.0,
    "timestamp": 1715343185369
  }
]

This wasn't too difficult to set up, and we finally had log-based metrics extracted from structured log events available in Splunk. As you might see, we made sure that the dimensionality of the log event was attached to each metric as well, which was key to setting up generic dashboard filters or variables such as labels.use_case. We also benefited from this strategy later on, when we automated alerts and notifications. But more on that later.
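To make the derivation tangible, here is a minimal sketch of that logic, stripped of the generic configuration patterns we actually used; METRIC_PREFIX and the selected dimensions are simplified assumptions:

from datetime import datetime

METRIC_PREFIX = "<some-prefix>"

def derive_metrics(log_event: dict):
    """Derive one counter per log event and one gauge per body.metrics entry."""
    ts_ms = int(datetime.fromisoformat(log_event["timestamp"]).timestamp() * 1000)
    dimensions = {
        "type": log_event["type"],
        "severity": log_event["severity"],
        **{f"labels.{k}": v for k, v in log_event.get("labels", {}).items()},
    }
    # one counter metric for every processed log event
    counters = [{
        "metric": f"{METRIC_PREFIX}.event_count",
        "dimensions": dimensions,
        "value": 1,
        "timestamp": ts_ms,
    }]
    # one gauge metric for every key-value pair in the optional metrics section
    gauges = [
        {
            "metric": f"{METRIC_PREFIX}.{name}",
            "dimensions": dimensions,
            "value": value,
            "timestamp": ts_ms,
        }
        for name, value in (log_event.get("body") or {}).get("metrics", {}).items()
    ]
    return counters, gauges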

And there we go! We could now start setting up dashboards that would visualize all our structured logs and log-based metrics from a business perspective 🚀

One thing that really stood out was the definition of a generic counter metric derived dynamically for each processed log event. This pattern felt so powerful that it easily made up 30–40% of all our dashboard features. It's a pattern I can only recommend setting up for any platform that issues structured logs.

Want to count the number of ingestion jobs created (SignalFlow syntax incoming)?

data(
    "<some-prefix>.event_count",
    filter=filter(
        {
            "resource.namespace": "ingestion",
            "type": "JobCreated",
            # ...
        }
    )
).publish(label="Jobs Created")

Want to count the overall number of errors originating from any ingestion-related component?

data(
    "<some-prefix>.event_count",
    filter=filter(
        {
            "resource.namespace": "ingestion",
            "severity": "ERROR",
            # ...
        }
    )
).publish(label="# Ingestion Errors")

You get the point. This generic log counter metric enabled us to visualize so many different patterns with a single metric 😲 I would definitely do the same thing again in upcoming projects.

So, what are we still missing? Ah, right! AWS managed metrics, such as CPU and memory utilization, provided by AWS for different services out of the box. Stay with me 🍿

AWS Managed Metrics in Splunk

As with logs, there are multiple ways to ingest AWS CloudWatch metrics into Splunk Observability Cloud, of which we've tested two:

  • pull based — where a scheduled process inside Splunk Observability Cloud regularly pulls new metric datapoints from CloudWatch
  • push based — where you set up a CloudWatch Metric Stream that is triggered event-based as new metric datapoints occur. The stream can deliver the metrics directly via a Firehose delivery stream to a known AWS Partner destination endpoint. Amongst those partners is Splunk Observability Cloud (a minimal setup sketch follows below this list).
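Setting up the push-based variant is mostly plumbing. A minimal boto3 sketch, assuming a Firehose delivery stream pointing at the Splunk Observability Cloud partner destination already exists; the stream name, ARNs, and the included namespaces are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Stream only the namespaces we care about towards the existing Firehose delivery stream.
cloudwatch.put_metric_stream(
    Name="splunk-o11y-metric-stream",
    FirehoseArn="arn:aws:firehose:eu-central-1:123456789012:deliverystream/splunk-o11y",
    RoleArn="arn:aws:iam::123456789012:role/metric-stream-to-firehose",
    OutputFormat="json",
    IncludeFilters=[
        {"Namespace": "Glue"},
        {"Namespace": "/aws/sagemaker/Endpoints"},
    ],
)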

Personally, I would always favor a push-based ingestion approach over a pull-based approach for such scenarios. I cannot imagine how a scheduled pull-based approach could ever be as scalable and cost efficient as an event-based push approach backed by serverless functions. But the pull-based approach was somewhat easier to set up, so we went down this route first.

To begin with, we didn't really need much more than a proof of concept showing that metrics from AWS could somehow be sent to Splunk Observability Cloud. For the time being, this didn't have to be perfectly automated. We just needed AWS CloudWatch metrics available in Splunk to start experimenting and get a hands-on feeling for how well we could combine the AWS managed metrics with our custom logs and log-based metrics.

This is where a couple of issues immediately surfaced, all of which have their root cause in CloudWatch or the underlying AWS services, and which fall into one of the following categories that we will look at shortly.

Missing metric business dimensions

This one is generally applicable to every AWS service (and perhaps other cloud providers as well): none of the AWS provided service metrics (like CPU and memory utilization of SageMaker endpoints, Glue jobs, etc.) contain any business metadata. They were not aware of the fact that a given SageMaker training job was part of a larger ML workflow that belonged to a specific use case. And by not aware, I mean that the metrics do not contain any additional dimensionality that would allow one to filter or aggregate them from a business perspective.

Let's take a look at the memory and CPU utilization metrics of AWS Glue. These can be found under the CloudWatch metric namespace Glue.

Measuring memory and CPU usage across your job's workers could be achieved, for instance, using one of these metrics (more on the broken naming in a second 🤦‍♂):

  • glue.<executor-id>.jvm.heap.usage
  • glue.<executor-id>.system.cpuSystemLoad

However, the only dimensions associated with these metrics are:

  • JobName
  • JobRunId
  • Type

The same goes for other AWS services. There is a set of predefined technical dimensions that you can use to filter and analyze those metrics, but you cannot look at them from a business perspective. Filtering an ingestion dashboard with expressions such as labels.use_case=… consequently had no impact on these CloudWatch metrics, because they are not aware of these business labels.

So far, so good: you cannot blame AWS for not knowing what kind of business model is applied to your cloud stack. What you can blame AWS for is the fact that there appears to be no technical interface whatsoever to enrich prebuilt AWS managed metrics with additional business dimensions on a per-resource level. Again, see the earlier chapter about standardized metadata propagation and how critical that is from an observability perspective.

As a result, all prebuilt AWS managed metrics were more or less useless to us 🤔

Again: what did we want to achieve? We wanted to visualize all of our custom logs and metrics alongside all prebuilt AWS managed metrics in centralized, Terraform-managed dashboards that could all be queried and filtered from a business perspective. Sounds easy, right? Nope.

Broken Metric Names

Let's make this short. What's a nice way to drive consumers of your metrics to decent levels of anger and madness? Right: include dynamic identifiers of any type in a metric name that is (as far as I have seen) always referenced statically 🤡

  • glue.driver.jvm.heap.usage
  • glue.ALL.jvm.heap.usage
  • glue.1.jvm.heap.usage
  • glue.2.jvm.heap.usage
  • glue.3.jvm.heap.usage
  • glue.n.jvm.heap.usage
  • glue.1.system.cpuSystemLoad
  • glue.2.system.cpuSystemLoad
  • glue.3.system.cpuSystemLoad
  • glue.n.system.cpuSystemLoad

Please! Stuff like executor-id belongs into metric dimensions, not into the metric name 🙈 Over the course of the platform's development history, we created, executed, and destroyed about ten thousand Glue job definitions. There is no way you can define a dynamic, generic dashboard with these metrics. How do you build a generic ingestion dashboard to monitor all your ingestion jobs over the past two weeks? The painful answer is: you can't, not even in CloudWatch. Well, that is, unless you know how many workers a given job started with, along with the other Spark configuration options that govern the number of executors the job will use. Well, but then comes autoscaling…

But even then, you're running into a situation where you need to write a dashboard chart that essentially plots metrics with a range expression (e.g. glue.1.jvm.heap.usage -> glue.100.jvm.heap.usage) just to make sure you included all executors. Again, something like this is usually not possible, since metric names are static. I haven't seen any observability tool yet that lets you dynamically compose metric names and plot them 🤔 This is bad metric design. A far better design would be to rename the metric to something like glue.jvm.heap.usage and attach an additional dimension (named e.g. spark_process or something like that) with dimension values like executor-1, executor-2, executor-n, ALL, driver.

This would enable a single metric expression that is simple to filter down to executor usage only and then group by executor instance:

data(
    "glue.jvm.heap.usage",
    filter=(not filter('spark_process', ['driver', 'ALL']))
).mean(by=['spark_process']).publish(label="% Heap Usage")

In the absence of a sanitized metric, you are doomed to run into madness like this:

data("glue.jvm.1.heap.usage").publish(label="% CPU Usage (executor 1)")
data("glue.jvm.2.heap.usage").publish(label="% CPU Usage (executor 2)")
data("glue.jvm.3.heap.usage").publish(label="% CPU Usage (executor 3)")
data("glue.jvm.4.heap.usage").publish(label="% CPU Usage (executor 4)")
data("glue.jvm.5.heap.usage").publish(label="% CPU Usage (executor 5)")
data("glue.jvm.6.heap.usage").publish(label="% CPU Usage (executor 6)")
data("glue.jvm.7.heap.usage").publish(label="% CPU Usage (executor 7)")
data("glue.jvm.8.heap.usage").publish(label="% CPU Usage (executor 8)")
data("glue.jvm.9.heap.usage").publish(label="% CPU Usage (executor 9)")
data("glue.jvm.10.heap.usage").publish(label="% CPU Usage (executor 10)")
data("glue.jvm.11.heap.usage").publish(label="% CPU Usage (executor 11)")
data("glue.jvm.12.heap.usage").publish(label="% CPU Usage (executor 12)")
data("glue.jvm.13.heap.usage").publish(label="% CPU Usage (executor 13)")
data("glue.jvm.14.heap.usage").publish(label="% CPU Usage (executor 14)")
data("glue.jvm.15.heap.usage").publish(label="% CPU Usage (executor 15)")
data("glue.jvm.16.heap.usage").publish(label="% CPU Usage (executor 16)")
data("glue.jvm.17.heap.usage").publish(label="% CPU Usage (executor 17)")
data("glue.jvm.18.heap.usage").publish(label="% CPU Usage (executor 18)")
data("glue.jvm.19.heap.usage").publish(label="% CPU Usage (executor 19)")
# ...
data("glue.jvm.n.heap.usage").publish(label="% CPU Usage (executor n)")

This is essentially the same madness that AWS is forced into with the prebuilt Glue job dashboards:

[Image: prebuilt Glue job dashboards]

I really hope this does not need more explanation or proof: never include dynamic identifiers (or literally any high-cardinality data) in static metric names! Actually, I'm quite shocked that this apparently needs to be discussed at all.

Broken Metric Values

Spoiler alert: this is one of the issues we haven't fixed to this day, mainly because we became kind of exhausted from fixing issues in cloud provider metrics.

But this is another showcase of broken metrics that you might run into, featuring SageMaker again this time. Sorry AWS, but the service just offers too big a target for criticism, and we had many situations where we were very tempted to write a User Epic with the title Deprecate Amazon Sagemaker.

Let's make this quick: if you spin up an inference endpoint, you get CPU and memory metrics out of the box. That's pretty nice! What's not so nice is how the CPUUtilization metric for real-time inference endpoints is defined, as taken from the official AWS docs:

The sum of each individual CPU core's utilization. The CPU utilization of each core range is 0–100. For example, if there are four CPUs, the CPUUtilization range is 0%–400%

Do I really need to say more? What if you want to create an alert that notifies users about resource over-provisioning for endpoints whose CPU utilization is below a threshold of, say, 20% over 10 minutes? The painful answer is: you can't. Well, that is, unless you are willing to write metric processing logic that (see the sketch after this list):

  • detects the endpoint a given metric point belongs to
  • infers the instance type associated with that endpoint
  • looks up the vCPU count using some mapping from instance type → vCPU count (which hypothetically is also possible dynamically using the AWS Pricing API)
  • and then divides the metric point's value by the number of cores
  • which returns a sanitized metric with a clean range of 0..100 percent
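Below is a minimal sketch of that normalization logic; the instance type → vCPU mapping is a tiny illustrative excerpt (and could instead be derived via the AWS Pricing API), and in practice the describe calls would need caching:

import boto3

sagemaker = boto3.client("sagemaker")

# assumption: a maintained mapping of instance type -> vCPU count
VCPUS_BY_INSTANCE_TYPE = {
    "ml.m5.xlarge": 4,
    "ml.m5.2xlarge": 8,
    "ml.c5.4xlarge": 16,
}

def normalized_cpu_utilization(endpoint_name: str, raw_value: float) -> float:
    """Scale a 0..(100 * vCPUs) CPUUtilization value back to a 0..100 range."""
    endpoint = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    config = sagemaker.describe_endpoint_config(
        EndpointConfigName=endpoint["EndpointConfigName"]
    )
    instance_type = config["ProductionVariants"][0]["InstanceType"]
    return raw_value / VCPUS_BY_INSTANCE_TYPE[instance_type]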

Again, over the course of the platform's development history we deployed, scored, and destroyed more than 1,000 such endpoints, each with a potentially individually specified ML instance type, of which there are dozens of different types, each with its own vCPU count.

Yes, there is a technical way to fix this, but at some point you really need to look at ROI. At this point we simply accepted that we would not implement alerting on SageMaker endpoint over-provisioning. As a result, the CPU utilization metrics for SageMaker inference endpoints still look somewhat like this (obviously with varying ranges like 0..400, 0..800, 0..1600 percent, depending on instance types):

[Image: CPU utilization metrics for SageMaker inference endpoints]

Yes, there is a new concept in SageMaker called inference components, which also includes a sanitized metric named CPUUtilizationNormalized. But then again, adopting new SageMaker features is not at all the direction this ML platform was headed. The opposite is the case: over the course of the platform's development history we gradually replaced more and more SageMaker features with custom solutions and implementations due to an overall lack of standards and stability, so I don't think anyone contributing to the platform really feels like testing new SageMaker features.

Fixes

Okay, so what could we do to fix at least some of these issues? First of all, the existing ingestion approach for AWS managed metrics had to be adjusted, since we needed a way to inject custom processing logic, which isn't possible with the pull-based approach managed by Splunk.

Using the push-based approach, however, you could add a custom transformer Lambda function, right? Yes!
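One common way to inject such a transformer into the push-based pipeline is a Firehose data transformation Lambda; this is a hedged sketch of that wiring (not necessarily our exact setup), assuming the JSON output format of the metric stream and ignoring error handling:

import base64
import json

def transform(metric_point: dict) -> dict:
    # placeholder for the enrichment and renaming logic sketched further below
    return metric_point

def handler(event, context):
    """Firehose data transformation handler: decode, transform, re-encode each record."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode()
        # metric stream records in JSON output format are newline-delimited JSON objects
        points = [transform(json.loads(line)) for line in raw.splitlines() if line]
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                "\n".join(json.dumps(p) for p in points).encode()
            ).decode(),
        })
    return {"records": output}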

The final stack looked somewhat like this:

[Image: custom transformer Lambda function]

This is in essence what the transformer Lambda did:

  • look up and cache missing telemetry attributes for metrics via the AWS Tagging API
    — this only worked because, whenever a resource was created on the fly out of the execution graph of an ML workflow, we tried to attach the telemetry labels to each resource as resource tags as well (where that was possible; not all AWS resources support tags)
    — moreover, this approach also only works for metrics that somehow contain a pointer to the resource in their dimensions. This was luckily the case for the metrics we were interested in, e.g. the JobName dimension for Glue job metrics, Host for SageMaker training job metrics, EndpointName for inference endpoints, etc.
  • enrich all metric points with the missing metadata
  • transform broken metric names with dynamic, configurable patterns (sketched below)
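The transform() placeholder from the sketch above could then look roughly like the following; the regex, the target metric names, and the tag keys are illustrative assumptions:

import re

import boto3

tagging = boto3.client("resourcegroupstaggingapi")
_tag_cache = {}

# assumption: broken Glue metric names always look like glue.<spark-process>.<suffix>
GLUE_METRIC_PATTERN = re.compile(
    r"^glue\.(?P<spark_process>driver|ALL|\d+)\.(?P<suffix>jvm\.heap\.usage|system\.cpuSystemLoad)$"
)
RENAMED = {
    "system.cpuSystemLoad": "aws.glue.jobs.cpu_utilization",
    "jvm.heap.usage": "aws.glue.jobs.memory_utilization",
}
TELEMETRY_TAG_KEYS = ("labels.use_case", "labels.workflow_name", "labels.workflow_version")

def glue_job_tags(job_name: str) -> dict:
    """Look up (and cache) the resource tags of a Glue job via the AWS Tagging API."""
    if not job_name:
        return {}
    if job_name not in _tag_cache:
        _tag_cache[job_name] = {}
        pages = tagging.get_paginator("get_resources").paginate(ResourceTypeFilters=["glue:job"])
        for page in pages:
            for mapping in page["ResourceTagMappingList"]:
                if mapping["ResourceARN"].endswith(f":job/{job_name}"):
                    _tag_cache[job_name] = {t["Key"]: t["Value"] for t in mapping["Tags"]}
    return _tag_cache[job_name]

def transform(point: dict) -> dict:
    """Rename broken Glue metric names and enrich dimensions with telemetry labels."""
    match = GLUE_METRIC_PATTERN.match(point["metric_name"])
    if match:
        point["metric_name"] = RENAMED[match.group("suffix")]
        point["dimensions"]["spark_process"] = match.group("spark_process")
    tags = glue_job_tags(point["dimensions"].get("JobName", ""))
    point["dimensions"].update({k: v for k, v in tags.items() if k in TELEMETRY_TAG_KEYS})
    # the final re-mapping to the Splunk datapoint shape (metric/value/timestamp) is omitted
    return point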

Let's use the following CloudWatch metric points as an example:

[
  {
    "namespace": "Glue",
    "metric_name": "glue.1.system.cpuSystemLoad",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz"
    },
    "timestamp": 1715343185369,
    "unit": "Count",
    "value": {
      "count": 1.0,
      "sum": 0.35,
      "max": 0.35,
      "min": 0.35
    }
  },
  {
    "namespace": "Glue",
    "metric_name": "glue.driver.system.cpuSystemLoad",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz"
    },
    "timestamp": 1715343185369,
    "unit": "Count",
    "value": {
      "count": 1.0,
      "sum": 0.35,
      "max": 0.35,
      "min": 0.35
    }
  }
]

The transformer Lambda would in essence:

  • look up and cache the AWS resource tags of the Glue job identified by JobName = foo
  • use the cache to enrich all matching metric points with the telemetry labels attached as resource tags
  • clean up broken metric names, e.g. rename glue.1.system.cpuSystemLoad to aws.glue.jobs.cpu_utilization, and add an additional dimension spark_process with values such as 1, ALL, driver

The results would look somewhat like this:

[
  {
    "metric": "aws.glue.jobs.cpu_utilization",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz",
      "spark_process": "1",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 0.35,
    "timestamp": 1715343185369
  },
  {
    "metric": "aws.glue.jobs.cpu_utilization",
    "dimensions": {
      "JobName": "foo",
      "JobRunId": "bar",
      "Type": "baz",
      "spark_process": "driver",
      "labels.use_case": "fancy-ml",
      "labels.workflow_name": "foo",
      "labels.workflow_version": "1.0.0"
    },
    "value": 0.35,
    "timestamp": 1715343185369
  }
]

Phew, we finally had logs, log-based metrics, and AWS managed metrics in a usable and consistently labeled format in Splunk 😍 🚀

[Image: log-based metrics and AWS managed metrics in a usable and consistently labeled format in Splunk]

This finally enabled us to visualize all telemetry signals, regardless of source and ownership, in centralized dashboards, using business criteria to filter and aggregate all signals.

Summary

All these investments finally brought us to a point where we could create Terraform-managed Splunk dashboards that, for instance, allowed us to monitor batch predictions at a single glance, including

  • CPU and memory utilization metrics from batch prediction jobs
  • CPU and memory utilization metrics from inference endpoints
  • inference latencies (broken down into deserialization, prediction, and serialization time)
  • total number of rows predicted, total job duration, and derived velocity metrics (e.g. rows predicted per invested hour over two weeks)
  • total number of log warnings and errors from any predict-related component (MWAA, Glue, SageMaker, API Gateway, Lambda, …) over two weeks
  • as well as log table charts, displaying
    — all predict-related logs
    — configuration options for deployed inference endpoints (e.g. ML model name and version, instance type and count, etc.) as extracted from structured log messages
    — configuration options for deployed batch prediction jobs (e.g. worker type and count, etc.) as extracted from structured log messages

And all of these charts could be filtered easily using predefined dashboard variables such as

  • the ML use case
  • the ML workflow and its version
  • etc.

Moreover, we could write dashboard-specific SQL reports that helped us cross-validate all plotted metrics against the Glue Data Catalog. In fact, at some point I almost felt tempted to write automated dashboard tests that would compare metric values inside the Glue Data Catalog against metric values displayed in Splunk. In theory this might even be possible via the SignalFlow API endpoints. But we haven't really followed this path (yet).

Our telemetry outputs and the design philosophy behind our signals also allowed for easily manageable, use case specific alerts that would route notifications to use case specific channels. These alerts were essentially also pretty generic:

errors = data(
    "<some-prefix>.event_count",
    filter=filter(
        {
            "severity": "ERROR",
            "labels.use_case": "<use-case-name>",
            # ...
        }
    ),
    rollup="sum"
).sum(
    by=[
        "type",
        "exc_type",
        "exc_value",
        # ...
    ]
).publish(label="errors")
detect(when(errors > 0, "1s")).publish(label="errors")

And there you go: all log errors belonging to one use case, regardless of the source (MWAA, Glue, SageMaker training jobs, SageMaker inference endpoints, ECS tasks, Lambda functions, …), covered by one alert definition that you just had to create once per known use case using the usual Terraform for_each logic and a map of use case → Splunk integration ID 🚀

Want to create alerts based on Glue memory usage? Same procedure:

glue_memory_usage = data(
    "aws.glue.jobs.memory_utilization",
    filter=filter(
        {
            "JobRunId": "ALL",
            "labels.use_case": "<use-case-name>",
            # ...
        }
    )
    and not filter({"spark_process": ["ALL"]}),
    rollup="max"
).scale(100).max(
    by=[
        "JobName",
        "spark_process",
        "labels.workflow_name",
        # ...
    ]
).fill(0).publish(label="glue_memory_usage")
detect(when(glue_memory_usage > 80, '1m')).publish(label='max glue memory usage > 80% over 1m')

I don't want to claim that bug hunting in this platform was delightful, but it definitely became a no-brainer and brought us to the point where we were usually able to spot the root cause of a bug within seconds or minutes. More importantly though, for most bugs that originated from misconfiguration or other user-input related errors, we (Engineering/DevOps) weren't even contacted by Data Scientists in the first place, since they usually understood the root cause of a process interruption themselves quickly enough.

We had finally established an MLOps platform that involved little to no Engineering or DevOps resources for the onboarding of new Data Scientists, the maintenance of existing ML use cases, as well as bootstrapping, thanks to investments in culture and standards, testing, documentation, automation, and lastly, observability.

I hope this blog post helped a bit to understand what observability really means beyond pure terminology.

My final verdict on the topic?

  • observability is a massive boost to operational excellence, if not an absolute requirement
  • observability is not something you can quickly opt in to, though. It's not like you can click a button in the cloud and suddenly your cloud stack becomes magically observable. It takes decent investments, it takes culture and standards, it might take a lot of bug fixes and workarounds, and it takes a lot of DevOps and Engineering work. OpenTelemetry has the potential, though, to apply observability in a standardized way, which may very well eliminate the need to do the same research and proof of concept work again and again for different customers and different cloud providers.
  • It may be for these reasons that OpenTelemetry is on its way to becoming a global standard (judging by the broad list of vendors that claim to support it)
  • We need standards for observability, from specification to API to SDK, which is exactly what OpenTelemetry aims to achieve. Though, from our most recent proof of concept work in AWS (experimenting with ADOT and other OpenTelemetry Collector distributions, as well as different collector deployment modes and their integrations with different AWS services), I personally think it will take another couple of years until OpenTelemetry is usable in AWS in diverse cloud stacks. The current AWS implementation state is, again, 100% limited to the serverless microservice realm (EC2, ECS, EKS, Lambda, …) 🙄. Another thing that still speaks against OpenTelemetry is its huge technical complexity. As of the time of writing, I would personally say that OpenTelemetry is written by and for observability experts. It's definitely going to cause nightmares for a non-technical audience, and would in its current state presumably be rejected by such an audience.
  • platforms like Pydantic Logfire may very well have the potential, though, to provide the layer of simplification needed to be usable by a less technical audience

What I personally dream of?

  • observability becoming a first-class citizen of every cloud provider service and every open source library that is known to run at scale, regardless of its domain (ELT, ETL, AI, ML, serverless/microservice architectures, …)
  • the OpenTelemetry logs SDK for Python finally becoming stable 💸
  • library owners becoming more aware of the topic (e.g. apache-airflow, apache-spark, mlflow, terraform, …)
  • a general shift away from legacy unstructured log prose to structured logging
  • a general shift in mindset that observability is a topic that all cloud stacks need, not just serverless microservice stacks
  • in general, a broader implementation of the OpenTelemetry stack in library code throughout different software stacks and programming languages
  • an awareness of the need for interfaces that allow injecting custom telemetry metadata into processes, for both library code and managed cloud provider services
  • an awareness of observability from a cloud service design perspective (think about SageMaker endpoint log groups again, albeit that could perhaps also be of minor importance when using an OpenTelemetry Collector with dedicated exporters)
  • an awareness of best practices when it comes to metric design

Who knows? Perhaps one day even GitHub workflows will emit OpenTelemetry-compatible logs, metrics, and traces, paired with a reworked UI that simplifies centrally analyzing and monitoring such telemetry signals in a distributed CI architecture? 👀

I hope you enjoyed this. Thanks a lot for reading 🍻

Cheers!