Improve your machine learning training routine with cloud-based, containerized workflows



Introduction

Training machine learning models on a local machine in a notebook is a common task among data scientists. It is the easiest way to get started, experiment and build a first working model. But for most businesses this is not a satisfying option: Today, making the most of your training data usually means scaling to gigabytes or petabytes of data, which do not easily fit on your local machine. Data is your most valuable asset, and you would not want to use only a small fraction of it for technical reasons. Another problem that arises when training a model on your local machine is putting it to use. A model that can only be used on your laptop is pretty useless. It should be available to a large group of consumers, deployed as a REST API or making batch predictions in a large data pipeline. If your model “works on your machine”, how does it get to production?

Another pitfall lies in the consistency between the code used to train a model and the serialized model itself. By design, a notebook is well suited for exploratory analysis and development via trial and error. For example, it is possible to jump between individual cells; a strict top-down execution order is not enforced. This makes it difficult to check whether the code exactly matches the model fit. Fitting a model in a notebook, persisting it, changing the notebook afterwards and then doing a git commit is not a desirable workflow and can lead to non-traceable issues. In an optimal workflow, models and associated code are kept consistent with each other, in the sense that it should be possible to fit the model again in a new environment.

We demonstrate how many of these issues can be circumvented by moving your training routine to the cloud. We use the AWS service SageMaker to execute a containerized training routine that allows us to a) scale our training job to any size of training data, b) support our routine with one or multiple GPUs if needed, and c) easily deploy our model after the training has succeeded. The training and deployment routine is written in a Python script, which can easily be version controlled and allows us to link code and model. The script is passed via the SageMaker Python SDK to a container running a Docker image with all dependencies needed for the job. The main benefit of using SageMaker is that instead of having to manage the infrastructure running the container, all the infrastructure management is done for you by AWS. All that is required is a few calls with the SageMaker Python SDK, and everything else is taken care of.

Schematic illustration of a SageMaker Training Job

Prerequisites

To try this demo out yourself, you will need access to an AWS account. The easiest way is to use an AWS SageMaker notebook instance (you can find more information about how to configure a notebook instance here), but the code can be run in any compute environment, provided that authentication and permissions are set up.

Data

In this example we will be using the Palmer Penguins dataset, which provides a suitable alternative to the frequently used Iris dataset. It contains information about various penguins. You can read more about it here. The objective we will be solving with our machine learning algorithm is predicting the sex of a penguin (male/female), using various attributes of the penguin (e.g. flipper length, bill length, species, island) as our features.

You can install palmerpenguins via pip.

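A minimal sketch of loading the dataset after installation (the variable name penguins is our own choice):

    # install the package first, e.g. via: pip install palmerpenguins
    from palmerpenguins import load_penguins

    penguins = load_penguins()  # returns a pandas DataFrame with one row per penguin
    penguins.head()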

We use the library dotenv to load the following environment variables. They will be needed for executing the job further below and contain information on the role used for execution as well as the paths to the data and the model storage.

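A sketch of this, assuming a .env file in the working directory; the variable names SAGEMAKER_ROLE, S3_DATA_PATH and S3_MODEL_PATH are placeholders we chose, not fixed names:

    import os

    from dotenv import load_dotenv

    load_dotenv()  # reads key-value pairs from the local .env file into the environment

    role = os.environ["SAGEMAKER_ROLE"]          # IAM role ARN used to execute the training job
    s3_data_path = os.environ["S3_DATA_PATH"]    # S3 prefix for the processed training data
    s3_model_path = os.environ["S3_MODEL_PATH"]  # S3 prefix where the model artifacts will be stored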

As an alternative, you can define the desired input and output paths directly and retrieve the SageMaker execution role.

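For example, assuming the code runs on a SageMaker notebook instance (the S3 prefixes below are arbitrary example values):

    import sagemaker

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # the role attached to the notebook instance
    bucket = session.default_bucket()      # or any bucket you have write access to
    s3_data_path = f"s3://{bucket}/penguins/data"
    s3_model_path = f"s3://{bucket}/penguins/model"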

Minimal data processing is required: we drop entries with null values as well as duplicates. Then we perform a train-test split and save the data to our working directory as well as to AWS S3 storage.

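A sketch of this step; the split ratio, file names and S3 key prefix are illustrative choices:

    import os

    import sagemaker
    from sklearn.model_selection import train_test_split

    # minimal cleaning: drop rows with missing values and exact duplicates
    penguins = penguins.dropna().drop_duplicates()

    X = penguins.drop(columns=["sex"])
    y = penguins["sex"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # save the splits to the working directory ...
    os.makedirs("data", exist_ok=True)
    X_train.to_csv("data/X_train.csv", index=False)
    y_train.to_csv("data/y_train.csv", index=False)
    X_test.to_csv("data/X_test.csv", index=False)
    y_test.to_csv("data/y_test.csv", index=False)

    # ... and upload the training files to S3, where the training job will read them from
    s3_train_path = sagemaker.Session().upload_data(path="data", key_prefix="penguins/data")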

Model training

To execute the training and deployment routine, we need to write a Python script. The crucial part for the training lies in the main clause. It reads the data, instantiates a pipeline and trains the model. Here, minimal preprocessing consisting of one-hot encoding and standard scaling is chosen, and LogisticRegression acts as a baseline model. The model is then serialized and saved to the given model directory.

The script takes four arguments. First, we need to define the input path for the training data; the script assumes the existence of two files, X_train.csv and y_train.csv. The differentiation between categorical and numerical variables is given explicitly in this example. Finally, the output path for the serialized model is defined. Using these arguments makes it convenient to run the training routine both on our local machine and via SageMaker in the cloud.

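A condensed sketch of such a script (here called train.py); the comma-separated format for the feature lists and the file name model.joblib are conventions we chose, not something SageMaker prescribes:

    # train.py -- a condensed sketch of the entry-point script
    import argparse
    import os

    import joblib
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        # SageMaker exposes its standard locations via environment variables
        parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
        parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
        # feature lists passed as comma-separated strings (our own convention)
        parser.add_argument("--categorical-features", type=str, default="species,island")
        parser.add_argument(
            "--numerical-features",
            type=str,
            default="bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g",
        )
        args = parser.parse_args()

        categorical = args.categorical_features.split(",")
        numerical = args.numerical_features.split(",")

        # read the prepared training data from the input path
        X_train = pd.read_csv(os.path.join(args.train, "X_train.csv"))
        y_train = pd.read_csv(os.path.join(args.train, "y_train.csv")).squeeze()

        # minimal preprocessing plus a LogisticRegression baseline
        preprocessing = ColumnTransformer(
            [
                ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical),
                ("numerical", StandardScaler(), numerical),
            ]
        )
        model = Pipeline(
            [("preprocessing", preprocessing), ("classifier", LogisticRegression(max_iter=1000))]
        )
        model.fit(X_train[categorical + numerical], y_train)

        # serialize the fitted pipeline into the model directory
        joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))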

The script also contains several serving functions that SageMaker requires for model serving via the SageMaker model endpoint service. These functions comprise model_fn(), which ensures that the model gets loaded from file; input_fn(), which transforms the input so that it can be passed to the model's predict() function; predict_fn(), which calls predict on the model; and output_fn(), which converts the model output into a format that can be sent back to the caller.
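Sketched out, these functions (also part of train.py) could look roughly like this; handling JSON in input_fn() and output_fn() is one possible choice of exchange format, not the only one:

    import json
    import os

    import joblib
    import pandas as pd


    def model_fn(model_dir):
        """Load the serialized pipeline from the model directory."""
        return joblib.load(os.path.join(model_dir, "model.joblib"))


    def input_fn(request_body, request_content_type):
        """Turn the request payload into a DataFrame that predict() understands."""
        if request_content_type == "application/json":
            return pd.DataFrame(json.loads(request_body))
        raise ValueError(f"Unsupported content type: {request_content_type}")


    def predict_fn(input_data, model):
        """Call predict on the loaded pipeline."""
        return model.predict(input_data)


    def output_fn(prediction, accept):
        """Convert the predictions into a response the caller can consume."""
        return json.dumps(prediction.tolist())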

For testing purposes, the script is also callable on our local machine, or on the instance on which the notebook is running.

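For example, with the processed files in data/ and an existing model/ directory, a local test run could look like this (paths and feature lists are purely illustrative):

    python train.py --train data --model-dir model \
        --categorical-features species,island \
        --numerical-features bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g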

In order to run the training routine in the cloud, we use the SKLearn object from the Python SDK. It is the standard interface for defining and scheduling training and deployment of scikit-learn models. We specify the resources needed, the framework version, the entry point, the role, as well as the output_path, which determines where the serialized model artifact is stored. Further arguments, like the numerical and categorical feature lists, can be passed via the hyperparameters dictionary.

When calling fit(), SageMaker will automatically launch a container with a scikit-learn image and execute the training script. The dictionary that we pass to fit() with the single key “train” specifies the path to the processed data in S3. The training data is copied from there into the training container. The SKLearn object will move the model artifacts to the desired output path in S3, defined via the keyword “output_path” in its definition.

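A sketch of this, reusing the role and S3 paths defined earlier; instance type and framework version are example values:

    from sagemaker.sklearn.estimator import SKLearn

    sklearn_estimator = SKLearn(
        entry_point="train.py",        # the training script discussed above
        framework_version="1.2-1",     # scikit-learn version of the prebuilt image (example value)
        instance_type="ml.m5.large",   # example instance; swap for a larger or GPU instance to scale
        instance_count=1,
        role=role,
        output_path=s3_model_path,     # where the serialized model artifact ends up in S3
        hyperparameters={
            "categorical-features": "species,island",
            "numerical-features": "bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g",
        },
    )

    # launches a container with the scikit-learn image and runs train.py inside it;
    # the "train" channel is copied into the container and exposed via SM_CHANNEL_TRAIN
    sklearn_estimator.fit({"train": s3_train_path})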

Instantiating the SKLearn object and calling the fit() method are all that is needed to launch the training routine as a containerized workflow. We could easily scale the resources, for example by choosing a bigger instance or adding GPU support for our training. We could also make our calls automatically using a workflow orchestration tool such as Airflow or AWS Step Functions, or trigger our training each time we merge changes to the training routine via a CI/CD pipeline.

Model deployment

After evaluating our model, we can now go on and deploy it. To do so, we only have to call deploy() on the SKLearn object that we used for model training. A model endpoint is then booted in the background.

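A sketch of the call; the instance type is again an example value:

    # boots a real-time endpoint in the background (this can take a few minutes)
    predictor = sklearn_estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.t2.medium",
    )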

We can test the endpoint by passing the top 10 rows of our test dataset to it. We will receive labeled predictions as to whether each penguin is male or female based on its attributes.

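A sketch, reusing X_test from the preprocessing step and matching the JSON exchange format assumed in input_fn() and output_fn() above:

    from sagemaker.deserializers import JSONDeserializer
    from sagemaker.serializers import JSONSerializer

    # send and receive JSON, matching the serving functions sketched earlier
    predictor.serializer = JSONSerializer()
    predictor.deserializer = JSONDeserializer()

    sample = X_test.head(10)
    predictions = predictor.predict(sample.to_dict(orient="records"))
    print(predictions)  # e.g. ["male", "female", ...]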

At the end of our journey, the endpoint should be shut down.

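A single call on the predictor is enough:

    # tear down the endpoint so it stops incurring costs
    predictor.delete_endpoint()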

Outlook

We demonstrated how a machine learning training routine can be moved to cloud-based execution using a dedicated container in AWS. Compared to a local workflow, this provides numerous advantages, such as scalability of compute resources (one simply has to change parameters on the SKLearn object) and reproducibility of results: training scripts can easily be versioned and training jobs triggered automatically, allowing us to connect model training and version control. Most importantly, it reduces the barrier between development and production: deploying a trained model only requires a single method call. You can find the resulting notebook on GitHub.

Of course, the example provided here does not solve all technical hurdles of MLOps, and more features are usually needed to build a mature ML platform. Ideally, we would want a model registry, where we store and manage models and artifacts, an experiment tracking platform to log all our efforts to improve the model, and perhaps also a data versioning tool. But moving your local training routine into the cloud is already a big step forward!

There are also many possibilities to extend the training routine and adapt it to specific needs: If special Python libraries or dependencies are used for the algorithm, a custom Docker image can be pushed to AWS Elastic Container Registry (ECR) and then be used in the SageMaker training routine. If data processing becomes more complex, one can use a processing container to encapsulate this step. Also, if the endpoint is used in production, it is advisable to develop a REST API on top of the deployed SageMaker endpoint, allowing you to better handle security constraints as well as logical heuristics and preprocessing of your API calls. And of course a model evaluation step that calculates metrics on the validation dataset should be included, but we will dive deeper into model evaluation in one of our next blog posts.

Stay tuned!