A Scalable Use Case Architecture

In this blog article I want to talk about the importance of culture and standards in cross functional, agile working environments, and why both are critical for creating scalable use cases from a business perspective.
The most prominent examples of such environments from my recent project history are ML and AI environments, where Data Scientists, Data Engineers and DevOps Engineers work together to enable and maintain an ever growing number of business use cases.
Over the past years, I’ve contributed to several such environments in different industries, and I want to share my observations on why I believe some of them had more success than others. In all of these projects, the business desire was to be able to scale out to new use cases quickly.
By scaling, I am explicitly referring to the ability:
- to keep maintenance costs of existing use cases low
- while being able to quickly bootstrap and bring new use cases to production
By use case, I am referring to things like:
- building a recommender which is used to provide personalized product recommendations in your web shop based on a trained machine learning model
- the personalization of email or social media marketing campaigns based on a machine learning model
- the optimization of marketing and sales strategies based on sales forecasts provided by a machine learning model
- and many more
Over the course of this article, I want to share my thoughts on why this business desire can only be fulfilled with culture and standards, and what the consequences are in their absence.
Observable patterns
For this blog post, I will continuously use cross functional ML and AI environments as an example, but the patterns I will describe can probably be generalized to any cross functional environment. Just replace any of the above-mentioned personas with Data Analyst, Software Engineer, DWH Engineer, or whatever terminology applies best.
When businesses start to invest in ML and AI, the initial situation often looks somewhat like this:

- Through internal recruiting or hiring of external resources, there is a pool of Data Scientists, Data and DevOps Engineers available
- An initial use case is defined that is supposed to generate business value and help get started on the topic
- The requirements of this use case are communicated to the cross functional environment (often using Data Scientists as the interface)
- These requirements are translated into a POC
- The POC is eventually promoted into an MVP
- The MVP might find its way into productive usage and, eventually, it might even generate the desired business value
Now, once that first use case has somehow found its way into productive use, you can often observe a business expectation that it should now be possible to easily scale out to additional use cases, right?

Well, depending on how this initial use case was executed and what happened both inside (between Engineering, Data Science and DevOps) and around this cross functional environment, that might indeed become true. However, it might just as well be that the following situation occurs:

This might be the worst case scenario, where additional use cases implicitly create the need for an additional process step: recruiting. In this case, scalability is completely gone, not because HR departments tend to do a poor job at hiring new people, but because market demand is high and recruiting experienced professionals takes time (although this may differ across regions and countries). Regardless, this also increases staff costs for internal and/or external resources and ultimately lowers the return on investment.
A better, yet still not optimal, scenario occurs if you can at least offload multiple use cases to a couple of environments:

So what actually happens inside and around this cross functional environment, and what are successful patterns to a scalable use case architecture?
Culture
Obviously, the first thing that has to be respected is that cross functional environments consist of personas coming from completely different backgrounds.

I do well remember my first interaction (as a Data Engineer) with such environments and the immense buzzword storm that I had to make sense of — with terms like MLOps, AIOps, training, hosting, serving, experimentation, experiment tracking, model registries, feature selection, feature engineering, hyperparameters, hyperparameter tuning and optimization, retraining, infrastructure as code, terraform, terragrunt, cloudformation, DevOps, SecOps, and many many more.
The same must be true from a Data Scientist’s perspective when you’re confronted for the first time with topics like RESTful API design, software engineering best practices, testing strategies, observability, infrastructure as code, etc.
And obviously, a fraction of both worlds must apply to the DevOps perspective as well.
Now optimally, as these different personas start to lay out and implement a road map of features that contribute to a given use case’s success, this will create a bidirectional knowledge flow between all domains, where terminology from each domain enters a shared knowledge domain. This can happen naturally with the help of the usual agile routines such as dailies, refinements, plannings, retrospectives, etc. It actually only requires an open social attitude from all personas, as well as an implicit understanding that a use case can only be implemented successfully with knowledge from all domains.
What am I talking about? Take the serving of a machine learning model as an example. I would personally claim that none of the participating personas have a complete detailed understanding of the process. Instead, everyone presumably knows a fraction of the complete truth.
I would claim that only Data Science really understands the internals and the business logic contained in the served ML model. However, Data Scientists will probably not completely understand the infrastructure stack and its internals that are required to have the model up and running.
Realistically, the complete truth about the infrastructure stack only exists in the DevOps domain.
Data Scientists will probably also not completely understand the API wrapper that is put in front of the served model because that presumably is knowledge that resides in the Data Engineering domain.
Optimally, the truths from each individual domain have an overlapping area. This intersection can implicitly create an abstracted perspective of a given process, such as serving and scoring of an ML model. The bigger this intersection, the clearer the abstraction across the team!
This will slowly start to grow a common ground that I would refer to as cross functional knowledge domain. This cross functional domain is tightly coupled to standardization that can lead to automation, testing, documentation, security and observability — all of which contribute to scalability.

Even for personas with years of experience in such environments, this generates a decent amount of cognitive and emotional overhead and stress to balance. The first thing to respect from a business and management perspective is:
This process takes time
One of the things that can go wrong this early in the project phase is a monodirectional knowledge flow, where a given domain only ever emits its outputs into a process without also receiving inputs:

One observation I have made again and again over the past years is that a request for a new feature often requires contributions from all, or at least multiple, domains. This is only achievable with a culture of sharing and pairing and a bidirectional knowledge flow between all domains. Examples of anti-patterns include:
- Data Engineers who only ever offload their requirements to DevOps without gaining an understanding of the design principles and boundaries that are in place for tooling like terraform
- Data Scientists who only ever emit their requirements to DevOps and Data Engineering without the ability to understand that sometimes the best business solution might make no technical sense whatsoever, or might even cause security vulnerabilities
- etc.
In an optimal world, this shouldn’t really need to be discussed, but sometimes you do run into difficult social personalities that can hinder the development of a sane cross functional knowledge domain.
Standards
Establishing a sane cross functional culture is the very foundation needed to establish standards. Why do we need standards? Because in the absence of either public or proprietary standards, you can easily run into an overall lack of testing, documentation, automation, observability and security, which will ultimately lead to a lack of scalability.
I want to back this up with an observation (short technical deep dive incoming). Take this as an example: the link points to one of the Sagemaker SDKs that is supposed to ease the process of:
training and deploying machine learning models on Amazon SageMaker
More precisely, the link points to the place inside the SDK where Amazon stores references to a large number of training and/or serving images, specific to ML framework, framework version and python version, that are available in specific AWS regions and their respective accounts.
I’ve tried to extract a simplified fraction of this configuration for demonstration purposes:
- pytorch, 0.4.0, py2, py3
- pytorch, 1.0.0, py2, py3
- pytorch, 1.1.0, py2, py3
- pytorch, 1.2.0, py2, py3
- pytorch, 1.3.1, py2, py3
- pytorch, 1.4.0, py2, py3, py36
- pytorch, 1.5.0, py3, py36
- pytorch, 1.6.0, py3, py36
- pytorch, 1.7.1, py3, py36
- pytorch, 1.8.0, py3, py36
- pytorch, 1.8.1, py3, py36
- pytorch, 1.9.0, py38
- pytorch, 1.9.1, py38
- pytorch, 1.10.0, py38
- pytorch, 1.10.2, py38
- pytorch, 1.11.0, py38
- pytorch, 1.12.0, py38
- pytorch, 1.12.1, py38
- pytorch, 1.13.1, py39
- pytorch, 2.0.0, py310
- pytorch, 2.0.1, py310
- pytorch, 2.1.0, py310
- pytorch, 2.2.0, py310
- mxnet, 0.12.1, py2, py3
- mxnet, 1.0.0, py2, py3
- mxnet, 1.1.0, py2, py3
- mxnet, 1.2.1, py2, py3
- mxnet, 1.3.0, py2, py3
- mxnet, 1.4.0, py2, py3
- mxnet, 1.6.0, py2, py3
- mxnet, 1.7.0, py3
- mxnet, 1.8.0, py37
- mxnet, 1.9.0, py38
- pytorch-neuron, 1.11.0, py38
- xgboost, 0.90-1, py3
- xgboost, 0.90-2, py3
- xgboost, 1.0-1, py3
- autogluon, 0.3.1, py37
- autogluon, 0.3.2, py38
- autogluon, 0.4.0, py38
- autogluon, 0.4.2, py38
- autogluon, 0.4.3, py38
- autogluon, 0.5.2, py38
- autogluon, 0.6.1, py38
- autogluon, 0.6.2, py38
- autogluon, 0.7.0, py39
- autogluon, 0.8.2, py39
- autogluon, 1.0.0, py310
- autogluon, 1.1.0, py310
- pytorch-smp, 2.0.1, py310
- pytorch-smp, 2.1.2, py310
- pytorch-smp, 2.2.0, py310
- pytorch-smp, 2.3.0, py310
- pytorch-smp, 2.3.1, py310
- pytorch-training-compiler, 1.12.0, py38
- pytorch-training-compiler, 1.13.1, py39
- sklearn, 0.20.0, py3
- sklearn, 0.23-1, py3
- sklearn, 1.0-1, py3
- sklearn, 1.2-1, py3
- tensorflow, 1.4.1, py2
- tensorflow, 1.5.0, py2
- tensorflow, 1.6.0, py2
- tensorflow, 1.7.0, py2
- tensorflow, 1.8.0, py2
- tensorflow, 1.9.0, py2
- tensorflow, 1.10.0, py2
- tensorflow, 1.11.0, py2, py3
- tensorflow, 1.12.0, py2, py3
- tensorflow, 1.14.0, py2, py3
- tensorflow, 1.15.0, py2, py3
- tensorflow, 1.15.2, py2, py3, py37
- tensorflow, 1.15.3, py2, py3, py37
- tensorflow, 1.15.4, py3, py36, py37
- tensorflow, 1.15.5, py3, py36, py37
- tensorflow, 2.0.0, py2, py3
- tensorflow, 2.0.1, py2, py3
- tensorflow, 2.0.2, py2, py3
- tensorflow, 2.0.3, py3, py36
- tensorflow, 2.0.4, py3, py36
- tensorflow, 2.1.0, py2, py3
- tensorflow, 2.1.1, py2, py3
- tensorflow, 2.1.2, py3, py36
- tensorflow, 2.1.3, py3, py36
- tensorflow, 2.2.0, py37
- tensorflow, 2.2.1, py37
- tensorflow, 2.2.2, py37
- tensorflow, 2.3.0, py37
- tensorflow, 2.3.1, py37
- tensorflow, 2.3.2, py37
- tensorflow, 2.4.1, py37
- tensorflow, 2.4.3, py37
- tensorflow, 2.5.0, py37
- tensorflow, 2.5.1, py37
- tensorflow, 2.6.0, py38
- tensorflow, 2.6.2, py38
- tensorflow, 2.6.3, py38
- tensorflow, 2.7.1, py38
- tensorflow, 2.8.0, py39
- tensorflow, 2.9.2, py39
- tensorflow, 2.10.1, py39
- tensorflow, 2.11.0, py39
- tensorflow, 2.12.0, py310
- tensorflow, 2.13.0, py310
- tensorflow, 2.14.1, py310
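For context, this registry is roughly what the SDK resolves when user code asks for a framework image. A minimal sketch of such a lookup, assuming a recent 2.x version of the sagemaker SDK where this goes through the image_uris module (exact parameters and supported combinations may differ across SDK versions and regions):

```python
# Minimal sketch: resolving one of the framework images listed above.
# Assumes sagemaker>=2.x; parameter names may differ in other SDK versions.
from sagemaker import image_uris

image_uri = image_uris.retrieve(
    framework="pytorch",          # one of the frameworks listed above
    version="2.2.0",              # framework version
    py_version="py310",           # python version specifier
    region="eu-west-1",           # images are region- and account-specific
    instance_type="ml.m5.xlarge", # influences CPU vs. GPU image selection
    image_scope="training",       # or "inference"
)
print(image_uri)
```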
Now, I know nothing about how the Sagemaker SDKs and all of the above images are managed internally (despite the code being open), but from hands-on experience with the Sagemaker service over more than 2.5 years:
- I do observe a severe lack of testing. Some of the images listed above are supposed to be compatible with python 3.10, or with literally any python 3 version if I understand the py3 version specifier correctly.
  - The sagemaker-training toolkit (commonly used in Sagemaker training jobs and thus in the images listed above) is, however, only tested against python 3.8 and 3.9.
  - The sagemaker-inference toolkit (commonly used in Sagemaker inference endpoints and thus in the images listed above) is only tested against 3.8, 3.9 and 3.10, although you could hypothetically request a tensorflow serving image with 3.13, or even Python 2?
- Over the past 2.5 years, I was also able to observe an overall lack of observability best practices (structured logging, metrics, tracing, opentelemetry?).
- I was also able to observe a lack of concise documentation that is consumable by Data Scientists, Data Engineers and DevOps Engineers alike, with artifacts such as missing, scattered, redundant or incorrect documentation.
- I can also see a lack of security, given that it is apparently still possible to request training or serving images that run on python versions that no longer receive security fixes, such as python 3.7 or even python 2.
- And lastly, I cannot see a way for this to scale well, since every new framework or framework version needs a new image or a new image version, which has to be bootstrapped and maintained by someone and needs additional documentation, testing, etc.
In a nutshell, it appears that no (observable, obvious) standards are applied whatsoever. This results in an implementation explosion of docker images for an ever growing list of ML frameworks (and their versions), python versions and other parameters. And in fact, if you start to browse the GitHub issues of some of the sagemaker repositories such as MMS, you can easily get the impression that some of these tools are more or less unmaintained, or at least that no one is reacting to bug or feature reports anymore.
Don’t get me wrong, I am not claiming that standardization in the world of ML and AI is trivial. The opposite is the case. I just want to showcase what can happen in the absence of either public or proprietary standards.
My personal hope is that the ML and AI hype will at some point reach its peak, hopefully followed by a phase of standardization and the establishment of well documented best practices. But as of writing this, I cannot yet observe any standardization whatsoever in that area; the list of tools and frameworks, already in the hundreds if not thousands, is still growing. Obviously, it is very hard to create such standards in a phase of rapid growth, but at some point I would definitely like to see an overall stabilization of the whole ML and AI stack.
Until we get there, there will probably not be any public standards whatsoever that define:
- how models are trained
- how models are serialized and deserialized
- how models are scored (model signature)
- and many more things
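To make at least one of these points tangible, here is a purely hypothetical sketch of what a shared scoring contract and model signature could look like. The names ModelSignature and Scorable are invented for illustration only; as argued above, no such public standard exists today:

```python
# Purely hypothetical sketch of a scoring contract plus model signature.
# Both class names are invented for illustration; no public standard exists.
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class ModelSignature:
    """Describes the inputs a model expects and the outputs it returns."""
    input_names: Sequence[str]
    input_dtypes: Sequence[str]
    output_names: Sequence[str]
    output_dtypes: Sequence[str]


class Scorable(Protocol):
    """Anything that can be served behind a scoring endpoint."""
    signature: ModelSignature

    def predict(self, rows: Sequence[dict]) -> Sequence[dict]:
        """Score a batch of records and return one result per record."""
        ...
```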
Still questioning the importance of standards? Just try to imagine for a brief second what a world without the Internet Protocol would look like 🤡
In the absence of existing public standards, we need to establish proprietary standards, otherwise we run into a non-scalable implementation explosion. Obviously, that is much harder from the perspective of a globally acting cloud provider. But the artifacts of a lack of standardization are hopefully obvious by now.
So, how do we define proprietary standards in the absence of existing public standards? Start simple, and expand. Defining such standards isn’t actually that hard if you can control scope, and could start as simple as this:
- source code for training routines must be managed in dedicated python packages (as opposed to notebooks)
- source code packages must be managed with a dedicated package manager X (e.g. poetry, hatch, pdm)
- source code must be annotated with type hints and must be validated using e.g. mypy
  - to support this, we will establish coaching programs for less experienced personas
- source code should (where possible) be covered with unit tests which are executed with e.g. pytest
  - to support this, we will establish coaching programs for less experienced personas
- source code should be properly documented with the numpy or rst doc style
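As a hedged illustration of these starter standards, a compliant source file plus test might look as simple as this (the module, function and test names are made up for demonstration purposes):

```python
# churn_model/features.py: hypothetical module following the standards above,
# i.e. packaged code, type hints, numpy-style docstrings, unit-testable logic.
from __future__ import annotations


def scale_minmax(values: list[float]) -> list[float]:
    """Scale values to the range [0, 1].

    Parameters
    ----------
    values : list of float
        Raw feature values; must contain at least two distinct values.

    Returns
    -------
    list of float
        The min-max scaled values.
    """
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("values must contain at least two distinct values")
    return [(v - low) / (high - low) for v in values]


# tests/test_features.py: executed with pytest in CI
def test_scale_minmax_bounds() -> None:
    scaled = scale_minmax([1.0, 2.0, 3.0])
    assert scaled[0] == 0.0 and scaled[-1] == 1.0
```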
These simple standards alone can already get you to the point where you are able to create generic, reusable CI patterns that make it easy to run type checks, unit tests, autodoc builds and package builds for arbitrary source code packages with minimal effort.
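A minimal sketch of such a generic pattern, assuming mypy, pytest and a PEP 517 build (via the third-party build package) as the agreed tools; any package that follows the standards above could be checked by the exact same script:

```python
# ci/run_checks.py: hypothetical generic CI entry point that works for any
# package following the standards above. Tool choices are assumptions.
import subprocess
import sys


def run_checks(package_dir: str) -> None:
    """Run type checks, unit tests and a package build for one source package."""
    steps = [
        ["mypy", package_dir],
        ["pytest", package_dir],
        [sys.executable, "-m", "build", package_dir],
    ]
    for step in steps:
        print("running:", " ".join(step))
        subprocess.run(step, check=True)  # fail the pipeline on the first error


if __name__ == "__main__":
    run_checks(sys.argv[1])
```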
From there on, you may at some point proceed to more advanced topics such as:
- models must be trained in containerized environments, and the containers are managed in repository x where the following release and versioning concept is in place […]
- Training jobs must run on cloud service X
- models must be stored in model registry X
- when models are stored, we require the presence of specific custom proprietary metadata that helps to automate the serving process
- …
This might lead to a situation where you can at some point completely automate both the training and the serving process with generic, reusable building blocks.
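As a purely illustrative sketch, such proprietary registration metadata could be as simple as a small, strictly typed record that the serving automation can rely on; every field name here is invented, and the placeholders are just that:

```python
# Hypothetical sketch of proprietary model metadata that downstream serving
# automation could rely on; all field names are invented for illustration.
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRegistration:
    model_name: str
    model_version: str
    framework: str          # e.g. "pytorch"
    framework_version: str  # e.g. "2.2.0"
    python_version: str     # e.g. "py310"
    serving_image_uri: str  # container the serving automation should deploy
    signature_uri: str      # where the model signature document is stored


registration = ModelRegistration(
    model_name="webshop-recommender",
    model_version="1.4.0",
    framework="pytorch",
    framework_version="2.2.0",
    python_version="py310",
    serving_image_uri="<account>.dkr.ecr.<region>.amazonaws.com/serving:1.4.0",
    signature_uri="s3://<bucket>/models/webshop-recommender/1.4.0/signature.json",
)
```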
As you might already see, these standards gradually build on each other, which sooner or later leads to a network of interconnected public and proprietary standards (depicted in different colors below).

In some cases, the application of a proprietary standard may enable the usage of a public standard, and vice versa.
One positive thing I repeatedly observed over the past years:
Standards that result in successful automation and enjoyable features establish trust automatically and pave the path to additional standardization
However, defining these standards requires the cross functional knowledge domain and an implicit understanding of ‘who we are’ and ‘what we do’. If this is done correctly, standardization will also further sharpen the abstraction of processes such as ‘training an ML model’ or ‘serving an ML model’, which is a good thing!
Obviously, it may happen that you end up with bad standards. In a network of interconnected standards, a bad standard also negatively impacts downstream standards:

While this cannot be completely avoided, one thing that helps to mitigate it is to identify strong social, visionary and technical lead personas in each domain who are backed by the trust of their respective domain. Find them, and bring them together! The earlier this is done, the better.
In my opinion, this is the most important task from a management perspective. Not doing it at all, or picking the wrong candidates for the job, is going to have a huge negative impact on your cross functional environment and will heavily limit the scalability of your use case architecture. Of all the mistakes you can make, this one might have the biggest impact. Do not underestimate the importance of this decision!
Another thing to be aware of is the difference between exactly-one and one-of standards. Sometimes you need to allow variance, and sometimes that is a bad idea. Some obvious examples where an exactly-one standard makes no sense:
- training routines must always use xgboost — other ML frameworks are not supported
- data transformations must always be done via SQL
- or the contrary: data transformations must always be done via Spark
Just as exactly-one standards can be bad, the same is true for one-of standards. Let’s simulate this with a realistic, yet perhaps exaggerated, example:
Source code for training routines can be managed in the following ways:
- as a notebook
- as a single main.py file
- as a python package
Moreover, we support all of the following package managers:
- poetry
- pdm
- uv
- hatch
This already puts you in a situation where you need to support CI/CD patterns for 6 different ways of managing source code artifacts for training routines (notebooks were listed for demonstration purposes only; shipping the source code of notebook cells to some cloud provider and running tests, type checks or autodoc builds on them might not even be technically possible):

Let’s make things worse by adding an additional one-of standard on top:
- CI/CD patterns can be implemented either via GitHub Workflows or via AWS CodeBuild

And lastly:
- training jobs can either run on Amazon Sagemaker or on AWS Glue (using spark ml)

That escalated quickly — there are now already 24 implementation approaches to automate the process of shipping training source code to one of the supported cloud services.
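The arithmetic behind that number is simply the product of all allowed choices, 6 source layouts times 2 CI/CD systems times 2 compute services, which a few lines of Python make painfully obvious:

```python
# Sanity check of the combinatorial explosion described above:
# 6 source layouts x 2 CI/CD systems x 2 compute targets = 24 approaches.
from itertools import product

source_layouts = ["notebook", "main.py", "poetry", "pdm", "uv", "hatch"]
ci_systems = ["GitHub Workflows", "AWS CodeBuild"]
compute_targets = ["Amazon Sagemaker", "AWS Glue"]

combinations = list(product(source_layouts, ci_systems, compute_targets))
print(len(combinations))  # 24
```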
Supporting this from a use case overarching platform perspective is no longer feasible. What’s the result of such “standardization”? You will more or less be forced to offload the implementation and maintenance burden of CI/CD patterns onto each individual use case.
Even worse, such a stack of one-of standards can easily result in an architecture where a requested generic feature becomes completely infeasible to implement.
As a general rule of thumb:
- the lower you are in the hierarchy of your standards
- the fewer one-of standards you should have!
Things like source code management are so fundamental that you should try to avoid variance wherever possible.
A better way of standardization looks like this:
- source code for training routines must always be managed in packages (public standard)
- python packages must always be managed with package manager X (proprietary standard)
- training jobs can either be run on AWS Glue or Amazon Sagemaker (proprietary standard)
This way, you end up with 2 instead of 24 implementation approaches that can easily be implemented and maintained centrally:

A Scalable Use Case Architecture
If we put all pieces together, the optimal big picture looks like this:

A culture of pairing and sharing is used to define standards which, depending on their quality, can result in successful testing, documentation, automation, observability and security of a platform. This results in a scalable use case architecture, because features that have been developed once can be reused across use cases; there is far less need to support dozens of different approaches to tackle the same technical or business request.
- Culture is the very foundation of success here
- without culture, there won’t be acceptance for standards
- without standards, there is no scalability, because you’re forced to plan and implement against unbounded challenges
Recovery
So what can we do if things have evolved completely differently? Bad news upfront: over the past 10 years I have seen different industries where shared culture and standards were not considered in the first place, and I haven’t seen a single case where they were successfully introduced retrospectively.
In the absence of shared culture and standards across teams, I’ve continuously observed patterns such as:
- this one recommender on prod starts to produce awkward predictions and no one can fix it, because the only person who knows how the model is trained and deployed to prod is on holiday. No one else can fix it, because no one understands what the scalable use case architecture looks like, and it is done differently compared to other use cases
- the same situation occurs if a person leaves the company, becomes sick or is gone temporarily for other reasons
- there might be some sort of platform team that was initially founded to support the different scalable use cases; however, this platform team can barely handle the huge amount of different support requests, because every scalable use case has its own way of handling things
- bringing new scalable use cases to production is a time consuming task, simply because there are no existing standards you can reuse and every new scalable use case needs to find its own culture and standards
- etc.
Instead of creating a shared culture, you have thus created a universe of different micro cultures across a variety of teams. That is because people (generally speaking, at least) have a very archaic desire for culture that appears to have been part of our very being since the beginning of human history. If there is no company wide culture they can attach to, they will create their own use case or product internal micro culture. Within these isolated use case cultures, standards will grow anyway; this appears to be another human desire, to break down complexity by applying some sort of framing that makes things easier to comprehend. Don’t get me wrong, cultural diversity is a beautiful thing, but rather than letting lots of isolated micro cultures grow, you should use this cultural diversity to influence a bigger, use case overarching culture.
Now, there is one approach I can promise will fail: attempting to enforce new use case spanning standards from a management perspective in a push down approach. Why is this guaranteed to fail?
Well, let’s create a realistic example. You have different use cases or products. In some use cases, people use python packages and dedicated package managers. Other use cases just use notebooks. And yet others just write main.py modules with 1000 lines of code and perhaps a requirements.txt. Some use cases write tests, others do not. Some use cases use type hints, others do not. Some use cases are documented, others are not. Some use cases might document their work using sphinx, others write markdown, and yet others use a company wiki. So what happens if you want to enforce that all use cases must now, for instance, manage their source code in python packages with package manager X? You will cause a massive disruption, which ultimately can be classified as an attempt to enforce culture. And every such attempt in human history that I can think of was sooner or later doomed to run into resistance, in most cases coupled with catastrophic results and violence. Now, at least that last point is hopefully not to be expected in a business environment (🤞), but you will see resistance. Standards that are enforced in a top down approach are going to be rejected, because you are causing disruption for a micro culture that has defined its own culture and standards.
You cannot enforce standards because you cannot enforce culture
Although I have not personally seen this yet, I believe the only way to fix this is if you can sense a growing bottom up desire for a shared culture. Such a desire presumably only grows if individual micro cultures start to sense that “the way we do things” is inefficient and becomes a personal burden and pain.
If you sense such a bottom up desire, you can support it from a management perspective by:
- identifying and promoting social and technical lead personas into a position where they are able to form a scalable use case overarching culture
- providing the budgets to implement a new standardized use case platform
In a nutshell, doing the things that should have been done in the first place.
There is another thing I can confidently predict: even if you get the chance to correct things, you are going to be confronted with a huge migration, which could take weeks, months or even years to complete, depending primarily on the complexity, variety and number of participating use cases.
Summary
Working in cross functional environments is a huge cognitive and emotional challenge. For some reason though, it’s these environments that I love working in most. There’s no place to learn faster than in such environments. Over the past 10 years, it’s been cross functional environments that provided the most enriching input to my personal career.
Working in such environments can become a nightmare though if businesses fail to understand the importance of culture and standards and how to provide a proper framework for a use case spanning culture. And you’re not necessarily always in a project position to shape such a use case spanning culture yourself. It’s a management responsibility to identify these lead personas, put them in the right positions and provide the budgets to form such a culture and standards.
My ending note? I don’t think anything I wrote here is revolutionary or new. In fact, even though I didn’t do the research myself, I would expect that there is quite some literature on the topic. Writing this down, though, definitely helps me digest and remember the patterns that went well and badly in past projects.
I hope you enjoyed reading this — cheers!
P.S. Do you know a good tool that can help to write, link and visualize standards in a graph like structure? Drop me a link in the comments 🙂 I haven’t found a good one yet.