Introduction

MLOps combines machine learning lifecycle management with the infrastructure code and processes of DevOps. It bridges the gap between data science and operations by automating the ML lifecycle, enabling continuous integration, deployment, and monitoring that lead to iterative improvements of ML models. This standardizes and streamlines the machine learning lifecycle, ensuring reproducibility and scalability of ML workflows. It also accelerates the journey of ML models from development to production, reducing time-to-market for AI-driven solutions. Finally, MLOps ensures model reliability, governance, and compliance, which are vital in regulated industries and mission-critical applications.

Challenges

Without MLOps, organizations face various challenges that impact the productivity and outcomes of their ML projects. One significant issue is the extensive time spent on manual processes, from data preprocessing to model deployment and monitoring. These manual steps are not only time-consuming but also prone to errors. Moreover, without automated tools, scaling and managing data pipelines becomes inefficient, as the complexity of operations grows with the increasing volume and variety of data. Embracing MLOps is therefore a critical step toward agility and maintaining a competitive edge in the market.

Our Proposed Solution

We focus on Azure ML Service and MLflow, and on how our automated solution built on them can overcome the above challenges in data science projects. The goal of this project is to automate the production of an MLOps pipeline that enables generic reusability across various ML use cases. Using MLflow with Azure ML allows us to leverage MLflow's tracking capabilities within the Azure ecosystem: we can track Azure ML experiments and log metrics such as accuracy and precision, which provides a transparent view of our models' performance and lets us compare different applications with ease. Our proposed solution consists of two parts.
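
As a minimal sketch of this integration, the snippet below points MLflow at the Azure ML workspace so that logged metrics land there; it assumes azureml-core and the azureml-mlflow plugin are installed, a workspace config.json is available, and the experiment name and metric values are placeholders:

```python
# Route MLflow tracking to the Azure ML workspace (requires azureml-mlflow).
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()                           # reads config.json
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())  # send MLflow logs to AML
mlflow.set_experiment("demo-classifier")               # placeholder experiment name

with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.92)    # visible in the AML workspace UI
    mlflow.log_metric("precision", 0.89)
```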

Infrastructure

In the first part, we use Terraform to create the required infrastructure for deploying ML use cases in Azure.

Figure 1: Overview of Infrastructure Resources
Azure ML Pipeline

This part consists of the MLOps pipeline built with the Azure ML Python SDK. We start by configuring the AML Workspace to interact with Azure services. Then we set up the environment by defining a conda specification file, enabling consistent package installations across different compute targets. The next step is to configure the compute target, using GPU-based instances for executing ML workloads; the compute cluster scales dynamically based on workload demands. Once the setup is complete, we fetch the datasets for ML training. Azure ML simplifies data management through Datastores: we register an Azure Blob container as a Datastore, facilitating easy access to datasets stored in Blob Storage.
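
A condensed sketch of these setup steps with the Azure ML Python SDK (v1) follows; the cluster name, GPU SKU, conda file, and storage credentials are placeholders rather than this project's actual values:

```python
# Setup with the Azure ML Python SDK (v1): workspace, environment, compute, datastore.
from azureml.core import Datastore, Environment, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()  # reads the workspace config.json

# Environment from a conda specification for consistent packages on any compute.
env = Environment.from_conda_specification(name="train-env", file_path="conda.yml")

# GPU-based compute cluster that autoscales with workload demand.
compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6s_v3",  # placeholder GPU SKU
    min_nodes=0,                 # scale down to zero when idle
    max_nodes=4,
)
compute_target = ComputeTarget.create(ws, "gpu-cluster", compute_config)
compute_target.wait_for_completion(show_output=True)

# Register an Azure Blob container as a Datastore for easy dataset access.
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="training_data",
    container_name="datasets",         # placeholder container
    account_name="<storage-account>",  # placeholder credentials
    account_key="<storage-key>",
)
```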

The core of this solution lies in four subsequent steps: data preprocessing, training, evaluation, and deployment as an inference endpoint. We adopted an object-oriented approach by encapsulating ML functionalities into reusable classes and methods, which gives us modularity, reusability, and extensibility. The methods defined in the base class can be redefined and overridden by a child class as requirements dictate. To demonstrate this capability, we have abstracted the reusable components of this pipeline inside mlops/src/ml_pipeline_abstractions. It contains a base pipeline and a child pipeline class; the methods defined in the BasePipeline class can be overridden by the child pipeline definitions, as the sketch below illustrates.
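
A simplified sketch of this pattern follows; the method names mirror the pipeline stages, but the actual classes in mlops/src/ml_pipeline_abstractions may differ in detail:

```python
class BasePipeline:
    """Generic pipeline steps shared by every ML use case."""

    def prepare_data(self):
        # default: load the registered dataset and strip headers/indexes
        ...

    def train(self):
        # default training routine guided by the configured primary metric
        ...

    def evaluate(self):
        # compare the primary metric and tag the best model
        ...

    def deploy(self):
        # publish the tagged model as a real-time endpoint
        ...

    def run(self):
        # the end-to-end flow is fixed; individual steps are overridable
        self.prepare_data()
        self.train()
        self.evaluate()
        self.deploy()


class ChildPipeline(BasePipeline):
    """Use-case-specific pipeline overriding only what it needs."""

    def prepare_data(self):
        # client-specific preprocessing replaces the default behavior
        ...
```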

Figure 2: MTL Interfaces
Data Preprocessing

The preprocessing step retrieves the Azure ML Datastore that is connected to Blob Storage. After retrieval, it loads the dataset as an Azure ML Tabular Dataset. Subsequently, we filter out the headers and indexes from the data before passing it to the training function. This step can be customized further by allowing clients to create their own version of the 'prepare_data' method in the child class definition.
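
A hedged sketch of this step with SDK v1, where the datastore name, file path, and index column are illustrative:

```python
# Retrieve the Blob-backed Datastore and load the data as a TabularDataset.
from azureml.core import Dataset, Datastore, Workspace

ws = Workspace.from_config()
datastore = Datastore.get(ws, "training_data")  # placeholder datastore name

dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/train.csv"))
df = dataset.to_pandas_dataframe()

# Drop header/index information so only raw feature values reach training;
# the "index" column name is an assumption for illustration.
features = df.drop(columns=["index"], errors="ignore").to_numpy()
```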

Model Training

In the model training step, we offer a combination of custom and predefined training functions encapsulated within modular classes. This gives flexibility to clients who prefer to use their proprietary algorithms: they can seamlessly integrate their own training functions by adding them to the utils package and toggling the option to employ a custom training function. At the same time, our system allows the selection of a primary metric to guide model performance evaluations. Clients can choose from a comprehensive suite of metrics, including accuracy, precision, recall, F1 score, ROC AUC, and the confusion matrix. The chosen primary metric is logged into the Azure Machine Learning (AML) Workspace, ensuring a transparent and comparative assessment of model performance.
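
The toggle might look like the following sketch; use_custom_train, custom_train, and the default model are hypothetical stand-ins for the utils package machinery:

```python
# Sketch of the custom-vs-default training toggle with primary-metric logging.
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

METRIC_FNS = {
    "accuracy": accuracy_score,
    "precision": precision_score,
    "recall": recall_score,
    "f1": f1_score,
}

def train(X_train, y_train, X_test, y_test, primary_metric="accuracy",
          use_custom_train=False, custom_train=None):
    if use_custom_train and custom_train is not None:
        model = custom_train(X_train, y_train)  # client function from the utils package
    else:
        model = LogisticRegression().fit(X_train, y_train)

    # Log the chosen primary metric to the active run in the AML workspace.
    score = METRIC_FNS[primary_metric](y_test, model.predict(X_test))
    mlflow.log_metric(primary_metric, score)
    return model, score
```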

Model Evaluation

In the subsequent evaluation phase, our focus shifts to comparing this primary metric across all trained models. Through this comparison, we identify the best-performing model in the model registry. This model is then given a 'production' tag, signifying its readiness for deployment.
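
One way to express this step is sketched below; storing the primary-metric value as a model tag and the 'stage: production' tag convention are assumptions, not necessarily this project's exact scheme:

```python
# Compare registered models on the primary metric and tag the winner.
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

best_model, best_score = None, float("-inf")
for model in Model.list(ws, name="demo-classifier"):  # placeholder model name
    score = float(model.tags.get("primary_metric", "-inf"))
    if score > best_score:
        best_model, best_score = model, score

if best_model is not None:
    best_model.add_tags({"stage": "production"})  # mark as ready for deployment
```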

Model Deployment

The final phase is deployment, where the best model is chosen by filtering on the 'production' tag. The model is deployed as a real-time endpoint, providing a scalable and accessible web service for machine learning inference.
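
A hedged sketch of this step with SDK v1, where the scoring script, environment name, endpoint name, and ACI sizing are placeholders:

```python
# Deploy the 'production'-tagged model as a real-time web service.
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# Pick the model carrying the 'production' tag set during evaluation.
model = Model.list(ws, tags=[["stage", "production"]])[0]

inference_config = InferenceConfig(
    entry_script="score.py",                      # scoring script placeholder
    environment=Environment.get(ws, "train-env"),
)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

service = Model.deploy(ws, "ml-inference-endpoint", [model],
                       inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)  # REST endpoint for real-time inference
```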

Figure 3: Workflow of ML Pipeline
AutoML

We have also provided integration with Azure AutoML capabilities, which enables our clients to train their datasets across a range of models. This is achieved by setting the AutoML flag to true. After training, the process culminates in a detailed comparison of the trained models, highlighting their performance metrics. We define the primary metric while configuring the AutoML run, and the pipeline selects the best-performing model accordingly. In the next steps, we register this model to the registry and deploy it as an inference endpoint.
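
Configuring such a run might look like the sketch below; the task type, primary metric, label column, timeout, and compute name are illustrative, not this project's fixed settings:

```python
# AutoML sketch: submit an experiment and register the best model.
from azureml.core import Dataset, Datastore, Experiment, Workspace
from azureml.core.compute import ComputeTarget
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
datastore = Datastore.get(ws, "training_data")
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/train.csv"))

automl_config = AutoMLConfig(
    task="classification",       # placeholder task type
    primary_metric="accuracy",   # drives best-model selection
    training_data=dataset,
    label_column_name="target",  # placeholder label column
    compute_target=ComputeTarget(ws, "gpu-cluster"),
    experiment_timeout_hours=1,
)

run = Experiment(ws, "automl-demo").submit(automl_config, show_output=True)
best_run, fitted_model = run.get_output()  # best candidate across all models

# Register the winner so the shared deployment step can pick it up.
run.register_model(model_name="automl-best-model")
```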

By incorporating AutoML, we empower our clients to harness the power of machine learning without the need for extensive expertise and help them leverage their datasets effectively.

Integration with Azure DevOps for CI/CD

The Azure DevOps CI/CD pipeline automates the MLOps process by orchestrating a sequence of tasks that ensure smooth deployment of ML models. A set of predefined variables configures the environment used by the subsequent tasks. First, authentication is established using a service connection that grants access to our Azure subscription and its resources. The next step runs Terraform to provision the necessary infrastructure as described above. Subsequently, the Python environment is set up and the MLOps pipeline script is executed, completing the end-to-end MLOps workflow. We have integrated Azure DevOps with GitHub, ensuring that our Azure pipeline is triggered every time there is a commit to the master branch of the GitHub repository.

Conclusion

The MLOps project described above stands as an accelerator, streamlining the transition from experimental machine learning to robust production systems. The modularity in selecting metrics, training functions, AutoML, and more makes for a flexible and agile solution that helps our clients accelerate the development of their AI-driven products and services, providing an overall competitive advantage in a market driven by high-performing, reliable, and scalable machine learning solutions.