A Scalable Use Case Architecture



In this blog article I want to talk about the importance of culture and standards in cross-functional agile working environments, and why both are critical from a business scaling perspective when you want to create scalable use cases.

The most predominant examples of such environments from my recent project history are presumably ML and AI environments, where Data Scientists, Data Engineers and DevOps Engineers work together to enable and maintain an ever growing number of business use cases.

Over the past years, I've contributed to multiple such environments in different industries, and I want to share my observations about why I believe some of them were more successful than others. In all of these projects, the business wanted to be able to scale out to new use cases quickly.

By scaling, I am explicitly referring to the ability:

  • to keep maintenance costs of existing use cases low
  • while being able to quickly bootstrap and bring new use cases to production

By use case, I am referring to things like:

  • building a recommender which is used to provide personalized product recommendations in your web shop based on a trained machine learning model
  • the personalization of email or social media marketing campaigns based on a machine learning model
  • the optimization of marketing and sales strategies based on sales forecasts provided by a machine learning model
  • and many more

Over the course of this article, I want to share my thoughts on why this business desire can only be fulfilled with culture and standards, and what the consequences are in their absence.

Observable patterns

For this blog post, I will continuously use cross-functional ML and AI environments as an example, but the patterns I describe can probably be abstracted to any cross-functional environment. Just replace any of the above-mentioned personas with Data Analyst, Software Engineer, DWH Engineer, or whatever terminology applies best.

When businesses start to invest in ML and AI, the initial situation often looks somewhat like this:

[Figure: Business → Use Case → Engineering, Data Science, DevOps]
  • Through internal recruiting or hiring of external resources, there is a pool of Data Scientists, Data Engineers and DevOps Engineers available
  • An initial use case is defined that is supposed to generate business value and help to get started on the topic
  • The requirements of this use case are communicated to the cross-functional environment (often using Data Scientists as the interface)
  • These requirements are translated into a POC
  • The POC is eventually promoted into an MVP
  • The MVP might find its way into productive usage and, eventually, it might even generate the desired business value

Now, once that first use case has somehow found its way into productive use, you can often observe the business expectation that it should now be possible to easily scale out to additional use cases, right?

[Figure: Business → Use Cases → Engineering, Data Science, DevOps]

Well, depending on how this initial use case was executed and what happened both inside (between Engineering, Data Science and DevOps) and around this cross-functional environment, that might as well become true. However, it might just as well be that the following situation occurs:

[Figure: Business → Recruiting → Use Cases → Engineering, Data Science, DevOps]

This might be the worst case scenario, where additional use cases implicitly create the need for an additional process step — recruiting. In this case, scalability is completely gone, not because HR departments tend to do a poor job in hiring new people, but because market demand is high and recruiting professional resources takes time — albeit this situation may also differ across regions and countries. Regardless, this also increases staff costs for internal and/or external resources and ultimately lowers the return on investment.

A better, yet not completely optimal, scenario occurs if you can at least offload multiple use cases to a couple of environments.

So what actually happens inside and around this cross-functional environment, and what are successful patterns for a scalable use case architecture?

Culture

Obviously, the first thing that has to be respected is that cross-functional environments consist of personas coming from completely different backgrounds.

[Figure: a shared culture spanning Engineering, Data Science and DevOps]

I do well remember my first interaction (as a Data Engineer) with such environments and the immense buzzword storm that I had to make sense of — with terms like MLOps, AIOps, training, hosting, serving, experimentation, experiment tracking, model registries, feature selection, feature engineering, hyperparameters, hyperparameter tuning and optimization, retraining, infrastructure as code, Terraform, Terragrunt, CloudFormation, DevOps, SecOps, and many many more.

The same must be true from a Data Scientist perspective if you're confronted for the first time with topics like RESTful API design, software engineering best practices, testing strategies, observability, infrastructure as code, etc.

And obviously, a fraction of both worlds must apply to the DevOps perspective as well.

Now, optimally, as these different personas start to lay out and implement a road map for the different features that contribute to a given use case's success, this will create a bidirectional knowledge flow between all domains, where terminology from each domain enters a shared knowledge domain. This can happen naturally with the help of the usual agile routines such as dailies, refinements, plannings, retrospectives etc. It actually only requires a social attitude from all personas, as well as an implicit understanding that a use case can only be implemented successfully with knowledge from all domains.

What am I talking about? Take the serving of a machine learning model as an example. I would personally claim that none of the participating personas has a complete, detailed understanding of the process. Instead, everyone presumably knows a fraction of the complete truth.

I would claim that only Data Science really understands the internals and the business logic contained in the served ML model. However, Data Scientists will probably not completely understand the infrastructure stack and its internals that are required to have the model up and running.

Realistically, the complete truth about the infrastructure stack only exists in the DevOps domain.

Data Scientists will probably also not completely understand the API wrapper that is put in front of the served model, because that presumably is knowledge that resides in the Data Engineering domain.

Optimally, the truths from each individual domain have an overlapping area. This intersection can implicitly create an abstracted perspective of a given process, such as the serving and scoring of an ML model. The bigger this intersection, the clearer the abstraction across the team!

This will slowly start to grow a common ground that I would refer to as the cross-functional knowledge domain. This cross-functional domain is tightly coupled to standardization, which can lead to automation, testing, documentation, security and observability — all of which contribute to scalability.

[Figure: the cross-functional knowledge domain shared by Engineering, DevOps and Data Science]

Even for personas with years of experience in such environments, this generates a decent amount of cognitive and emotional overhead and stress to balance. The first thing to respect from a business and management perspective is:

This process takes time

One of the things that can go wrong this early in the project phase is a monodirectional knowledge flow, where a given domain only ever emits its outputs into a process without also receiving inputs:

[Figure: the cross-functional knowledge domain with a monodirectional knowledge flow]

One observation I have made again and again in past years is that a request for a new feature often actually involves contributions from all, or at least multiple, domains. This is only achievable with a culture of sharing and pairing and a bidirectional knowledge flow between all domains. Examples of anti-patterns include:

  • Data Engineers who only ever offload their requirements to DevOps without gaining an understanding of the design principles and boundaries that are in place for tooling like Terraform
  • Data Scientists who only ever emit their requirements to DevOps and Data Engineering without the ability to understand that sometimes the best business solution might make no technical sense whatsoever, or might even cause security vulnerabilities
  • etc.

In an optimal world, this shouldn't really need to be discussed, but sometimes you do run into difficult social personalities that can hinder the development of a sane cross-functional knowledge domain.

Standards

Establishing a sane cross-functional culture is the very foundation that is needed to establish standards. Why do we need standards? Because in the absence of either public or proprietary standards, you can easily run into an overall lack of testing, documentation, automation, observability and security that will ultimately lead to an overall lack of scalability.

I want to back this up with an observation (short technical deep dive coming in). Take this as an example. The link points to one of the SageMaker SDKs that is supposed to ease the process of:

training and deploying machine learning models on Amazon SageMaker

More precisely, the link points to the place inside the SDK where Amazon stores a reference to a large number of training and/or serving images, specific to ML framework, framework version and Python version, that are available in specific AWS Regions and their respective accounts.

I’ve tried to extract a sim­pli­fied frac­tion of this con­fig­ur­a­tion for demon­stra­tion purposes:

- pytorch, 0.4.0, py2, py3
- pytorch, 1.0.0, py2, py3
- pytorch, 1.1.0, py2, py3
- pytorch, 1.2.0, py2, py3
- pytorch, 1.3.1, py2, py3
- pytorch, 1.4.0, py2, py3, py36
- pytorch, 1.5.0, py3, py36
- pytorch, 1.6.0, py3, py36
- pytorch, 1.7.1, py3, py36
- pytorch, 1.8.0, py3, py36
- pytorch, 1.8.1, py3, py36
- pytorch, 1.9.0, py38
- pytorch, 1.9.1, py38
- pytorch, 1.10.0, py38
- pytorch, 1.10.2, py38
- pytorch, 1.11.0, py38
- pytorch, 1.12.0, py38
- pytorch, 1.12.1, py38
- pytorch, 1.13.1, py39
- pytorch, 2.0.0, py310
- pytorch, 2.0.1, py310
- pytorch, 2.1.0, py310
- pytorch, 2.2.0, py310
- mxnet, 0.12.1, py2, py3
- mxnet, 1.0.0, py2, py3
- mxnet, 1.1.0, py2, py3
- mxnet, 1.2.1, py2, py3
- mxnet, 1.3.0, py2, py3
- mxnet, 1.4.0, py2, py3
- mxnet, 1.6.0, py2, py3
- mxnet, 1.7.0, py3
- mxnet, 1.8.0, py37
- mxnet, 1.9.0, py38
- pytorch-neuron, 1.11.0, py38
- xgboost, 0.90-1, py3
- xgboost, 0.90-2, py3
- xgboost, 1.0-1, py3
- autogluon, 0.3.1, py37
- autogluon, 0.3.2, py38
- autogluon, 0.4.0, py38
- autogluon, 0.4.2, py38
- autogluon, 0.4.3, py38
- autogluon, 0.5.2, py38
- autogluon, 0.6.1, py38
- autogluon, 0.6.2, py38
- autogluon, 0.7.0, py39
- autogluon, 0.8.2, py39
- autogluon, 1.0.0, py310
- autogluon, 1.1.0, py310
- pytorch-smp, 2.0.1, py310
- pytorch-smp, 2.1.2, py310
- pytorch-smp, 2.2.0, py310
- pytorch-smp, 2.3.0, py310
- pytorch-smp, 2.3.1, py310
- pytorch-training-compiler, 1.12.0, py38
- pytorch-training-compiler, 1.13.1, py39
- sklearn, 0.20.0, py3
- sklearn, 0.23-1, py3
- sklearn, 1.0-1, py3
- sklearn, 1.2-1, py3
- tensorflow, 1.4.1, py2
- tensorflow, 1.5.0, py2
- tensorflow, 1.6.0, py2
- tensorflow, 1.7.0, py2
- tensorflow, 1.8.0, py2
- tensorflow, 1.9.0, py2
- tensorflow, 1.10.0, py2
- tensorflow, 1.11.0, py2, py3
- tensorflow, 1.12.0, py2, py3
- tensorflow, 1.14.0, py2, py3
- tensorflow, 1.15.0, py2, py3
- tensorflow, 1.15.2, py2, py3, py37
- tensorflow, 1.15.3, py2, py3, py37
- tensorflow, 1.15.4, py3, py36, py37
- tensorflow, 1.15.5, py3, py36, py37
- tensorflow, 2.0.0, py2, py3
- tensorflow, 2.0.1, py2, py3
- tensorflow, 2.0.2, py2, py3
- tensorflow, 2.0.3, py3, py36
- tensorflow, 2.0.4, py3, py36
- tensorflow, 2.1.0, py2, py3
- tensorflow, 2.1.1, py2, py3
- tensorflow, 2.1.2, py3, py36
- tensorflow, 2.1.3, py3, py36
- tensorflow, 2.2.0, py37
- tensorflow, 2.2.1, py37
- tensorflow, 2.2.2, py37
- tensorflow, 2.3.0, py37
- tensorflow, 2.3.1, py37
- tensorflow, 2.3.2, py37
- tensorflow, 2.4.1, py37
- tensorflow, 2.4.3, py37
- tensorflow, 2.5.0, py37
- tensorflow, 2.5.1, py37
- tensorflow, 2.6.0, py38
- tensorflow, 2.6.2, py38
- tensorflow, 2.6.3, py38
- tensorflow, 2.7.1, py38
- tensorflow, 2.8.0, py39
- tensorflow, 2.9.2, py39
- tensorflow, 2.10.1, py39
- tensorflow, 2.11.0, py39
- tensorflow, 2.12.0, py310
- tensorflow, 2.13.0, py310
- tensorflow, 2.14.1, py310
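
For a bit of context on how this matrix is consumed: lookups along exactly these dimensions typically surface through the SageMaker Python SDK roughly as follows. This is a minimal sketch; the parameter values are illustrative and the exact call may differ across SDK versions.

    # Minimal sketch of how the framework / framework-version / Python-version
    # matrix above is typically consumed via the SageMaker Python SDK.
    # Parameter values are illustrative only.
    from sagemaker import image_uris

    # Resolve the ECR URI of a framework-specific training image for a given
    # region, framework version and Python version.
    uri = image_uris.retrieve(
        framework="pytorch",
        region="eu-central-1",
        version="2.2.0",
        py_version="py310",
        image_scope="training",
        instance_type="ml.m5.xlarge",
    )
    print(uri)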

Now, I know nothing about how the SageMaker SDKs and all of the above images are managed internally (albeit the code being open source), but through hands-on experience with the SageMaker service over more than 2.5 years:

  • I do observe a severe lack of testing. Some of the images listed above are supposed to be compatible with Python 3.10, or literally any Python 3 version if I understand the py3 version specifier correctly.
    — The sagemaker-training toolkit (commonly used in SageMaker Training Jobs and thus in the images listed above) is, however, only tested against Python 3.8 and 3.9.
    — The sagemaker-inference toolkit (commonly used in SageMaker Inference endpoints and thus in the images listed above) is, however, only tested against 3.8, 3.9 and 3.10, albeit you could hypothetically request a tensorflow serving image with 3.13, or even Python 2?
  • Over the past 2.5 years, I was also able to detect an overall lack of observability best practices (structured logging, metrics, tracing, OpenTelemetry?).
  • I was also able to observe a lack of concise documentation that is consumable by Data Scientists, Data Engineers and DevOps Engineers alike, with artifacts such as missing, distributed and redundant, or incorrect documentation.
  • I can also see a lack of security, given that it is apparently still possible to request training or serving images that run on Python versions that won't get any security fixes anymore, such as Python 3.7 or even Python 2

And lastly, I cannot possibly see a way for this to scale well, since every new framework or framework version needs either a new image or a new image version, which needs to be bootstrapped and maintained by someone and needs additional documentation, testing, etc. In a nutshell, it appears that there are no (observable, obvious) standards applied whatsoever. This results in an implementation explosion of Docker images for an ever growing list of varying ML frameworks (and their versions), Python versions and other parameters. And in fact, if you start to browse the GitHub issues of some of the sagemaker repositories such as MMS, one can easily get the impression that some of these tools are more or less unmaintained — or at least it appears no one is reacting to bug or feature reports anymore.

Don’t get me wrong, I am not claim­ing that stand­ard­iz­a­tion in the world of ML and AI is trivial. The oppos­ite is the case. I just want to show­case what can hap­pen in the absence of either pub­lic or pro­pri­et­ary standards.

My personal hope is that the ML and AI hype will at some point reach its peak, which hopefully is followed by a phase of standardization and the establishment of well documented best practices. But as of writing this, I cannot yet observe any standardization whatsoever in that area, with hundreds if not thousands of different tools and frameworks, a list that is still growing even as I write. Obviously, it is very hard to create such standards in a phase of rapid growth, but at some point I would definitely like to see an overall stabilization of the whole ML and AI stack.

Until we are there, there will probably not be any public standards whatsoever that would help to state:

  • how models are trained
  • how models are serialized and deserialized
  • how models are scored (model signature, sketched below)
  • and many more things
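
To make the model signature point slightly more concrete: such a standard would, at minimum, pin down the inputs a model expects and the outputs it produces, independent of the framework that trained it. The following is a purely hypothetical sketch; every name and field is made up for illustration.

    # Purely hypothetical sketch of a "model signature" declaration.
    # All names and fields are illustrative, not an existing standard.
    from dataclasses import dataclass, field

    @dataclass
    class FieldSpec:
        name: str          # feature / column name
        dtype: str         # e.g. "int64", "float64", "string"
        nullable: bool = False

    @dataclass
    class ModelSignature:
        inputs: list[FieldSpec] = field(default_factory=list)
        outputs: list[FieldSpec] = field(default_factory=list)

    # Example: a recommender that scores a (user, product) pair.
    signature = ModelSignature(
        inputs=[FieldSpec("user_id", "int64"), FieldSpec("product_id", "int64")],
        outputs=[FieldSpec("score", "float64")],
    )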

Still questioning the importance of standards? Just try to imagine for a brief second what a world without the Internet Protocol would look like 🤡

In the absence of existing public standards, we need to establish proprietary standards, otherwise we run into a non-scalable implementation explosion. Obviously that is much harder from the perspective of a globally acting cloud provider. But the artifacts of a lack of standardization are hopefully obvious.

So, how do we define proprietary standards in the absence of existing public standards? Start simple, and expand. Defining such standards isn't actually that hard if you can control scope, and it could start as simply as this:

  • source code for training routines must be managed in dedicated Python packages (as opposed to notebooks)
  • source code packages must be managed with a dedicated package manager X (e.g. poetry, hatch, pdm)
  • source code must be annotated with type hints and must be validated using e.g. mypy
    — to support this, we will establish coaching programs for less experienced personas
  • source code should (where possible) be covered with unit tests which are executed with e.g. pytest
    — to support this, we will establish coaching programs for less experienced personas
  • source code should be properly documented with the numpy or rst doc style

These simple standards can already lead to the point where you are able

  • to create generic reusable CI patterns that make it easy to run type checks, unit tests, autodoc builds and package builds for arbitrary source code packages with minimal effort (see the sketch below)
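
As a rough illustration of such a generic CI pattern: the following sketch runs the same set of quality gates against any package directory. The tools (mypy, pytest, sphinx) follow the standards listed above; the script itself and its arguments are hypothetical, not an existing CLI.

    # Hypothetical sketch of a reusable CI step: run the same quality gates
    # (type checks, unit tests, docs build) against any source code package.
    import subprocess
    import sys
    from pathlib import Path

    CHECKS = [
        ["mypy", "."],                            # type checks
        ["pytest", "tests"],                      # unit tests
        ["sphinx-build", "docs", "docs/_build"],  # autodoc build
    ]

    def run_checks(package_dir: str) -> int:
        """Run every quality gate inside the given package directory."""
        for cmd in CHECKS:
            result = subprocess.run(cmd, cwd=Path(package_dir))
            if result.returncode != 0:
                return result.returncode
        return 0

    if __name__ == "__main__":
        sys.exit(run_checks(sys.argv[1] if len(sys.argv) > 1 else "."))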

From there on, you may at some point proceed to more advanced topics such as:

  • models must be trained in containerized environments, and the containers are managed in repository X where the following release and versioning concept is in place […]
  • training jobs must run on cloud service X
  • models must be stored in model registry X
  • when models are stored, we require the presence of specific custom proprietary metadata that helps to automate the serving process

This might lead to a situation where you can at some point completely automate both the training and the serving process with generic reusable building blocks, as sketched below.
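
For instance, a proprietary metadata standard attached at model registration time could look roughly like this. This is a minimal sketch under stated assumptions; the keys, values and the register_model() helper are made up for illustration, not an existing API.

    # Hypothetical sketch: proprietary metadata required at model registration
    # time so that a generic serving pipeline can act on any model without
    # use-case-specific glue code.
    REQUIRED_KEYS = {
        "use_case",
        "framework",
        "framework_version",
        "python_version",
        "serving_entrypoint",
    }

    def register_model(artifact_uri: str, metadata: dict) -> None:
        """Reject registrations that miss metadata the serving automation needs."""
        missing = REQUIRED_KEYS - metadata.keys()
        if missing:
            raise ValueError(f"missing required model metadata: {sorted(missing)}")
        # ... hand the artifact and metadata over to the actual model registry ...

    register_model(
        artifact_uri="s3://models/webshop-recommender/model.tar.gz",
        metadata={
            "use_case": "webshop-recommender",
            "framework": "pytorch",
            "framework_version": "2.2.0",
            "python_version": "py310",
            "serving_entrypoint": "recommender.serve:predict",
        },
    )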

As you might see already, these standards gradually pile up, which sooner or later leads to a network of interconnected standards that build on each other — either public or proprietary ones (as depicted with different colors below).

[Figure: a network of public and proprietary standards on top of the cross-functional knowledge domain shared by Engineering, DevOps and Data Science]

In some cases, applying a proprietary standard may enable the usage of a public standard, and vice versa.

One positive thing I have repeatedly observed over the past years:

Standards that result in successful automation and enjoyable features establish trust automatically and pave the path to additional standardization

However, defining these standards needs the cross-functional knowledge domain and an implicit understanding of 'who we are' and 'what we do'. If this is done correctly, standardization will also further sharpen the abstraction of processes such as 'training an ML model' or 'serving an ML model' — which is positive!

Obviously, it may happen that you end up with bad standards. In a network of interconnected standards, a bad standard also negatively impacts downstream standards:

[Figure: connected dots, where one bad standard affects its downstream standards]

While this cannot be completely avoided, one thing that helps to mitigate it somewhat is to identify strong social, visionary and technical lead personas in each domain who are backed by trust from their respective domain. Find them, and bring them together! The earlier this is done, the better.

In my opinion, this is the most important task from a management perspective. Not doing it at all, or picking the wrong candidates for the job, is going to have a huge negative impact on your cross-functional environment and will heavily impact the scalability of the scalable use case architecture. It might even be that, of all the mistakes you may make, this one has the biggest impact. Do not underestimate the importance of this decision!

Another thing to be aware of is the distinction between exactly-one and one-of standards. Sometimes you need to allow variance, whereas sometimes that can be a bad idea. Some obvious examples where an exactly-one standard makes no sense:

  • training routines must always use xgboost — other ML frameworks are not supported
  • data transformations must always be done via SQL
  • or the contrary: data transformations must always be done via Spark

In the same way that exactly-one standards can be bad, one-of standards can be too. Let's simulate this with a realistic, yet perhaps exaggerated, example:

Source code for training routines can be managed in the following ways:

  • as a notebook
  • as a single main.py file
  • as a Python package

Moreover, we support all of the following package managers:

  • poetry
  • pdm
  • uv
  • hatch

This already puts you in a situation where you need to support CI/CD patterns for 6 different ways of managing source code artifacts for training routines (notebooks were listed for demonstrative purposes only — shipping the source code of notebook cells to some cloud provider, and running tests, type checks or autodoc builds on them, might not even be possible technically):

[Figure: CI/CD patterns fanning out over poetry, pdm, uv and hatch packages, notebooks and main.py files]

Let’s make things worse by adding an addi­tional one-of stand­ard on top:

  • CI/CD patterns can be implemented either via GitHub Workflows or via AWS CodeBuild

[Figure: GitHub Workflows and AWS CodeBuild combined with poetry, pdm, uv and hatch packages, notebooks and main.py files]

And lastly:

  • training jobs can either run on Amazon SageMaker or on AWS Glue (using Spark ML)

[Figure: SageMaker and Glue combined with GitHub Workflows, AWS CodeBuild, poetry, pdm, uv and hatch packages, notebooks and main.py files]

That escalated quickly — there are now already 24 implementation approaches to automate the process of shipping training source code to one of the supported cloud services.
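
The arithmetic behind that number, as a quick sketch (the option lists simply mirror the example above):

    # Quick sketch of the combinatorial explosion from the example above:
    # 6 source code layouts x 2 CI/CD systems x 2 training services = 24.
    from itertools import product

    source_layouts = ["notebook", "main.py", "poetry package",
                      "pdm package", "uv package", "hatch package"]
    ci_systems = ["GitHub Workflows", "AWS CodeBuild"]
    training_services = ["Amazon SageMaker", "AWS Glue"]

    combinations = list(product(source_layouts, ci_systems, training_services))
    print(len(combinations))  # 24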

This isn’t feas­ible any­more to sup­port from a use case over­arch­ing plat­form per­spect­ive. What’s the res­ult of such “stand­ard­iz­a­tion”? You will more or less be forced to off­load the imple­ment­a­tion and main­ten­ance bur­den of CI/CD pat­terns down to each indi­vidual use case.

Even worse, such a stack of one-of standards can easily result in an architecture where a given requested generic feature might become completely unfeasible to implement.

As a general rule of thumb:

  • the lower you are in the hierarchy of your standards
  • the fewer one-of standards you should have!

Things like source code management are so fundamental that you should try to avoid variance wherever possible.

A better way of standardization looks like this:

  • source code for training routines must always be managed in packages (public standard)
  • Python packages must always be managed with package manager X (proprietary standard)
  • training jobs can either be run on AWS Glue or Amazon SageMaker (proprietary standard)

This way, you end up with 2 instead of 24 implementation approaches, which can easily be implemented and maintained centrally:

[Figure: SageMaker and Glue on top of GitHub, package manager X and standardized packages]

A Scalable Use Case Architecture

If we put all the pieces together, the optimal big picture looks like this:

[Figure: a scalable, tested, documented, automated, observable and secure platform on top of the cross-functional knowledge domain]

A culture of pairing and sharing is used to define standards which, depending on their quality, can result in successful testing, documentation, automation, observability and security of a platform. This results in a scalable use case architecture, because features that have been developed once can be reused across use cases — there is less need to support dozens of different approaches to tackle the same technical or business request.

  • Culture is the very foundation of success here
  • without culture, there won't be acceptance for standards
  • without standards there is no scalability, because you're forced to plan and implement against unbounded challenges

Recovery

So what can we do if things have evolved completely differently? Bad news upfront: over the past 10 years I have seen different industries where shared culture and standards were not considered in the first place, and I haven't seen a single case where they were successfully implemented retrospectively.

In the absence of shared culture and standards across teams, I've continuously observed patterns such as:

  • this one recommender on prod starts to produce awkward predictions and no one can fix it, because the only person who knows how the model is trained and deployed to prod is on holiday. No one else can fix it, because no one understands what the scalable use case architecture looks like, and it's done differently compared to other use cases
  • the same situation occurs if a person leaves the company, becomes sick or is gone temporarily for other reasons
  • there might be some sort of platform team that was initially founded to support different scalable use cases, however this platform team can barely handle the huge amount of different support requests, because every scalable use case has its own way of handling things
  • bringing new scalable use cases to production is a time consuming task, simply because there are no existing standards you can re-use and every new scalable use case needs to find its own culture and standards
  • etc.

Instead of creating a shared culture, you have thus created a universe of different micro cultures implemented across a variety of different teams. That is because people (generally speaking at least) have a very archaic desire for culture that appears to have been part of our very being since the beginning of human history. If there is no company wide culture they can attach to, they will create their own use case or product internal micro culture. Within these isolated use case cultures, standards will grow anyway. This appears to be another human desire — to break down complexity by applying some sort of framing which makes things easier to comprehend. Don't get me wrong, cultural diversity is a beautiful thing, but rather than letting lots of isolated micro cultures grow, you should use this cultural diversity to influence a bigger, use case overarching culture.

Now, there is one approach I can promise will fail: attempting to enforce new use case spanning standards from a management perspective in a push down approach. Why is this guaranteed to fail?

Well, let’s cre­ate a real­istic example. You have dif­fer­ent use cases or products. In some use cases, people use python pack­ages and ded­ic­ated pack­age man­agers. Other use cases just use note­books. And yet oth­ers just write main.py mod­ules with 1000 lines of code and per­haps some requirements.txt. Some use case write tests, oth­ers do not. Some use cases use type hints, oth­ers not. Some use cases are doc­u­mented whereas oth­ers are not. Some use cases might doc­u­ment their work using sphinx, oth­ers write mark­down and yet oth­ers use a com­pany wiki. So what hap­pens if you want to enforce that all use cases must now f.i. man­age their source code in python pack­ages with pack­age manger x? You will cause a massive dis­rup­tion which ulti­mately can be clas­si­fied as an approach to enforce cul­ture. And any attempt in human his­tory that I can cur­rently think of in doing so was sooner or later doomed to run into res­ist­ance — in most cases coupled with cata­strophic res­ults and viol­ence. Now at least the last point is hope­fully not to be expec­ted in a busi­ness envir­on­ment (🤞)— but you will see res­ist­ance. Stand­ards that are attemp­ted to be enforced in a top down approach are going to be rejec­ted — because you are caus­ing dis­rup­tion for a micro cul­ture that has defined it’s own cul­ture and standards.

You cannot enforce standards because you cannot enforce culture

Albeit I have not seen this personally yet, I believe the only way to fix this is if you can sense a growing bottom up desire for a shared culture. Such a desire presumably only grows if individual micro cultures start to sense that "the way we do things" is inefficient and becomes a personal burden and pain.

If you sense such a bottom up desire, you can support it from a management perspective by:

  • identifying and promoting social and technical lead personas into a position where they are able to form a scalable, use case overarching culture
  • providing the budgets to implement a new standardized use case platform

In a nutshell, doing the things that should have been done in the first place.

There is another thing I am optimistic to predict: even if you get the chance to correct things, you are going to be confronted with a huge scale migration, which could take weeks, months or even years to complete, primarily depending on the complexity, variety and number of different use cases participating.

Résumé

Working in cross-functional environments is a huge cognitive and emotional challenge. For some reason though, it's these environments that I love working in most. There's no place to learn faster than in such environments. Over the past 10 years, it's been cross-functional environments that provided the most enriching input to my personal career.

Working in such environments can become a nightmare though, if businesses fail to understand the importance of culture and standards and how to provide a proper framework for a use case spanning culture. And you're not necessarily always in the project position to shape such a use case spanning culture yourself. It's a management responsibility to identify and empower these lead personas and to provide the budgets to form such a culture and standards.

My ending note? I don't think that anything I wrote here is revolutionary or new. In fact, even though I didn't do the research myself, I would expect that there is quite some literature on the topic. Writing this down, though, definitely helps to digest and remember the patterns that I saw go well and badly in past projects.

I hope you enjoyed reading this — cheers!

P.S. Do you know a good tool that can help to write, link and visualize standards in a graph like structure? Drop me a link in the comments 🙂 I haven't yet found a good one.