ML Engineering



It has always been a big challenge to efficiently support continuous development and integration for ML in production. These days, Data Science and ML are becoming basic ingredients for solving complex real-world problems and delivering tangible value. In general, this is what we have:

  • Large datasets
  • Inexpensive on-demand compute resources
  • ML accelerators in the cloud
  • Rapid advances in different ML research fields (such as computer vision, natural language processing, and recommendation systems)

However, we are missing the ability to automate and monitor all steps of ML system construction. In short, the real challenge is not building an ML model, but creating an integrated ML system and continuously operating it in production. Ultimately, ML code is just one part of a real-world ML ecosystem, and there are innumerable complicated steps that surround and support it: configuration, automation, data collection, data verification, testing/debugging, resource management, model analysis, process/metadata management, serving infrastructure, and monitoring.

Before implementing any ML use case, it is useful to consider the following:

  • Is this really a problem that requires ML? Is there no way to tackle it with traditional tools and algorithms?
  • Design and implement evaluation tools so you can properly track whether you are moving in the right direction.
  • Try to use ML as a helping hand rather than as a complex necessity.

So, all in all, a well-defined ML workflow can be represented in three phases:

Phase 1: The first pipeline

  • Keep the model simple and think carefully about the right infrastructure. This means defining the correct method of moving data to the learning algorithm, as well as implementing well-managed model integration and versioning.
  • Have a test infrastructure that is independent of the model. This should include tests to verify that data is successfully fed into the algorithm, that a model is successfully output by the algorithm, and that the statistical properties of the data inside the pipeline match those of the data outside it (a minimal sketch of such checks follows this list).
  • Usually, the problems that machine learning is trying to solve are not completely new. There generally exists some system for ranking, classifying, or whatever problem you are trying to solve, which means there is already a set of rules and heuristics. A heuristic is a series of approximate steps that help you model the data, and these same heuristics can give you an edge when applying machine learning. Try to turn heuristics into useful data: the transition to a machine-learned system will be smoother, since heuristics may contain a lot of the intuition about the system that you don't want to throw away.
  • Now comes the monitoring part. Depending on the use case, it is possible that performance may decrease after a day, a week, or perhaps longer. It makes sense to have an alerting system continuously watching the model and triggering retraining when needed.
  • Use an appropriate evaluation metric for your model. For example, know when to use an ROC curve versus when to use accuracy.
  • Watch for salient failures, which provide exceptionally useful information to the ML algorithm.
  • Often, one may not have properly quantified the true objective, or the objective may change as the project advances. Further, different team members may have different understandings of the objectives; in fact, there is often no single "true" objective. So train on a simple ML objective, and add a "policy layer" on top that allows one to add additional logic and rank ML models as needed.
  • Using simple pipelines makes debugging easier.
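
The tests themselves can stay very small. Below is a minimal sketch (not a prescribed implementation) of the three checks from the test-infrastructure point above; the data arrays and the model object are assumed to come from your own loaders and training code.

```python
# Model-independent pipeline checks: data goes in, a model comes out,
# and the data statistics inside the pipeline match the data outside it.
import numpy as np

def check_data_is_fed(X, y):
    # Verify the pipeline actually delivers training examples to the algorithm.
    assert len(X) > 0 and len(X) == len(y), "no training examples reached the algorithm"

def check_model_is_produced(model):
    # Verify that training actually emitted a usable model object.
    assert model is not None and hasattr(model, "predict"), "training did not output a model"

def check_statistics_match(X_outside, X_inside, tol=0.05):
    # Compare per-feature means of data outside vs. inside the pipeline;
    # in practice variances, missing-value rates, and category frequencies
    # would be compared as well.
    drift = np.abs(np.mean(X_outside, axis=0) - np.mean(X_inside, axis=0))
    assert np.all(drift < tol), "data statistics changed inside the pipeline"
```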

In the first phase of the lifecycle of a machine learning system, the important thing is to push the training data into the learning algorithm, evaluate any metrics of interest, and create a serving infrastructure that can be built upon. After that, Phase 2 begins.
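
To make Phase 1 concrete, here is a minimal sketch of such a first pipeline, assuming a simple binary-classification task on synthetic, imbalanced data (all names and numbers are illustrative): a simple model, an explicit evaluation step, and a serving stub that later phases can build on. With a 95/5 class split, accuracy is misleading, which is why ROC AUC is also reported.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Move data to the learning algorithm (here: a synthetic, imbalanced dataset).
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# 2. Keep the model simple.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Evaluate the metrics of interest; with a 95/5 class split, accuracy
#    alone is misleading, so ROC AUC is the more informative choice.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("ROC AUC :", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# 4. A serving stub the rest of the system can be built on.
def serve(features: np.ndarray) -> float:
    """Return the model's predicted probability for one example."""
    return float(model.predict_proba(features.reshape(1, -1))[0, 1])
```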

Phase 2: Feature Engineering

In the second phase, there is a lot of low-hanging fruit. Feature combination and tweaking can generate improvements, and a rise in performance is generally easy to visualize.

  • Be sure to employ model versioning as the model is trained and upgraded.
  • As ML models train, they try to find the lowest value of the loss function, which in theory should minimize error. However, this function may be complex, and one can end up stuck in a different local minimum with each run. This can make it hard to determine whether a change to the system is meaningful or not. By creating a model without deep, complex features, you can get an excellent baseline performance. After the baseline, more esoteric approaches can be tried and tested, such as combining features to make more complex ones.
  • Explore features that generalize across different data contexts.
  • Specific feature use may result in better optimization. The reason is that, with a lot of data, it is simpler to learn many simple features than a few complex ones. Regularization can come in handy to eliminate features that apply to only a few examples (a minimal sketch follows this list).
  • Apply transformations to combine and modify existing features to create new features in human-understandable ways.
  • It is important to understand that the number of feature weights that can be learned in a linear model is roughly proportional to the amount of data available. The key is to scale the number of features and their respective complexities to the size of the data.
  • Features that are no longer required should be discarded.
  • One should apply human analysis to the system. This requires calculating the delta between models, and being aware of any changes when new data (or a new user) is introduced to a model in production.
  • New features can be created from patterns observed in measurable quantities (metrics). Hence, it is a good idea to have an interface to visualize training and performance.
  • Quantifying undesirable observed behaviour can help in analyzing the properties of the system that are not captured by the existing loss function.
  • It is not always true that short-term behaviour is an indication of long-term behaviour. Models sometimes need to be frequently tuned.
  • Study the training-serving skew. This is the difference between performance during training and performance during testing/serving. The reasons for this skew can be:
    • A discrepancy due to differences in data handling between training and testing/serving.
    • A change in the data between these steps.
    • The presence of feedback loops between the model and your training algorithm.
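
To illustrate the regularization point above, here is a small sketch on synthetic data (the dataset shape and hyperparameters are assumptions): an L1 penalty drives the weights of features that are active on only a handful of examples to zero, so they can be discarded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
# A few simple, broadly useful features...
X_common = rng.normal(size=(n, 5))
# ...plus many "rare" features, each non-zero for only a few examples.
X_rare = np.zeros((n, 50))
for j in range(50):
    idx = rng.choice(n, size=3, replace=False)
    X_rare[idx, j] = 1.0
X = np.hstack([X_common, X_rare])
y = (X_common[:, 0] + 0.5 * X_common[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# The L1 penalty pushes the weights of rarely-active features to exactly zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_[0])
print("features with non-zero weight:", kept)  # typically only the common features survive
```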

One solution is to monitor training and testing/serving explicitly so that changes in the system or data do not introduce unnoticed skew.
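
One way to make that monitoring explicit is to compare, feature by feature, the distribution seen during training with the distribution seen at serving time. The sketch below assumes serving-time feature values are logged somewhere, and uses a two-sample Kolmogorov-Smirnov test as the comparison.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_skew(X_train: np.ndarray, X_serving: np.ndarray, alpha: float = 0.01):
    """Flag features whose training and serving distributions differ significantly."""
    skewed = []
    for j in range(X_train.shape[1]):
        stat, p_value = ks_2samp(X_train[:, j], X_serving[:, j])
        if p_value < alpha:
            skewed.append((j, stat))
    return skewed

# Hypothetical usage with logged serving features:
# skewed = detect_feature_skew(X_train, X_logged_at_serving)
# if skewed: raise an alert / trigger retraining
```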

Phase 3: Optimization refinement and complex models

There will be certain indicators that suggest the end of Phase 2. One may observe that monthly gains start to diminish, and there will be trade-offs between metrics: some rise while others fall across experiments. This is where one notices the need for model sophistication, as gains become harder to achieve.

  • Take a better look at the objective. If unaligned objectives are an issue, don't waste time on new features. As stated before, if product goals are not covered by the existing algorithmic objectives, one needs to change either the objectives or the product goals.
  • Keep ensembles simple: each model should either be an ensemble (taking only the outputs of other models as input) or a base model (taking many features), but never both (a minimal sketch follows this list).
  • Once performance plateaus, looking for qualitatively new sources of information can be more useful than refining existing signals.
  • When dealing with content, one may be interested in predicting popularity (e.g. the number of clicks a post on social media receives). In training a model, one may add features that allow the system to personalize (features representing how interested a user is), diversify (features quantifying whether the current social media post is similar to other posts liked by a user), and measure relevance (the appropriateness of a query result). However, one may find that these features are weighted less heavily by the ML system than expected. This doesn't mean that diversity, personalization, or relevance aren't valuable.
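
A minimal sketch of the "keep ensembles simple" rule, using scikit-learn's stacking as one possible illustration (the base models and dataset are assumptions): the base models see the raw features, while the ensemble layer sees only their predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models: each takes the full set of raw features.
base_models = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Ensemble layer: consumes only the base models' outputs, never the raw features.
ensemble = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    passthrough=False,  # keep raw features out of the ensemble layer
)
ensemble.fit(X_train, y_train)
print("held-out accuracy:", ensemble.score(X_test, y_test))
```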

With all these steps in mind, it is clear that one cannot simply go about implementing standalone ML code. One needs a sophisticated ML architecture to address the complications and improvisations that come with developing an ML environment.

[Diagram: ML engineering pipeline]

As can be seen in the diagram above, the pipeline includes the following stages:

  • Source control
  • Test and build services
  • Deployment services
  • Model registry
  • Feature store
  • ML metadata store
  • ML pipeline orchestrator

These stages can be analyzed in more detail in the following diagram:

[Diagram: detailed view of the ML pipeline stages]

Let's take an example task: churn prediction. The idea is to determine the number of people leaving a given workplace by using various parameters. The goal is to implement CI/CD integration when deploying, and Kubernetes is used as the environment to support the various processes involved in the integration.

[Diagram: churn prediction deployment pipeline on Kubernetes]
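
As an illustration only, the churn model at the centre of this example could be as simple as the sketch below; the dataset path, column names, and model choice are assumptions. The point is that the training step produces a versioned artifact that the CI/CD pipeline can test and deploy.

```python
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")         # hypothetical dataset
X = df.drop(columns=["churned"])      # hypothetical label column
y = df["churned"]
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
print("validation ROC AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Persist a versioned artifact for the deployment stage of the pipeline.
joblib.dump(model, "churn-model-v1.joblib")
```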

Once the model is deployed, the following things need to be kept in mind:

Evaluation: measuring the quality of predictions (offline evaluation, online evaluation, evaluation using business tools, and evaluation using statistical tools)

Monitoring: tracking quality over time

Management: improving the deployed model with feedback → redeployment
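
A minimal sketch of what that monitoring step might look like, with an assumed quality threshold: evaluate each batch of labelled predictions offline, track the metric over time, and alert (or trigger retraining and redeployment) when the recent average drops too low.

```python
from sklearn.metrics import roc_auc_score

ALERT_THRESHOLD = 0.75  # assumed minimum acceptable quality

def evaluate_batch(y_true, y_scores):
    """Offline evaluation of one batch of labelled predictions."""
    return roc_auc_score(y_true, y_scores)

def check_quality(metric_history, window=7):
    """Alert if the recent average quality falls below the threshold."""
    recent = metric_history[-window:]
    avg = sum(recent) / len(recent)
    if avg < ALERT_THRESHOLD:
        print(f"ALERT: average ROC AUC over last {len(recent)} batches is {avg:.3f}; "
              "consider retraining and redeploying the model.")
    return avg
```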

So, in conclusion, the need for automation and monitoring of all steps of ML system construction is clear. A well-engineered ML solution won't simply make the development process easier; it will also make it coherent and resilient.