ML Engineering



It has always been a big challenge to efficiently support continuous development and integration for ML in production. These days, Data Science and ML are becoming basic ingredients for solving complex real-world problems and delivering tangible value. In general, this is what we have:

  • Large datasets
  • Inexpensive on-demand compute resources
  • ML accelerators in the cloud
  • Rapid advances in different ML research fields (such as computer vision, natural language processing, and recommendation systems)

However, we are missing the ability to automate and monitor all steps of ML system construction. In short, the real challenge is not building an ML model, but creating an integrated ML system and continuously operating it in production. Ultimately, ML code is just one part of a real-world ML ecosystem, and there are innumerable complicated steps that surround and support it: configuration, automation, data collection, data verification, testing/debugging, resource management, model analysis, process/metadata management, serving infrastructure, and monitoring.

Before implementing any ML use case, it is useful to consider the following:

  • Is this really a problem that requires ML? Is there no way to tackle it with traditional tools and algorithms?
  • Design and implement evaluation tools so you can properly track whether you are moving in the right direction.
  • Try to use ML as a helping hand rather than as a complex necessity.

So, all in all, a well-defined ML workflow can be represented in three phases:

Phase 1: The first pipeline

  • Keep the model simple and think carefully about the right infrastructure. This means defining the correct method of moving data to the learning algorithm, as well as implementing well-managed model integration and versioning.
  • Have a test infrastructure that is independent of the model. This should include tests to verify that data is successfully fed into the algorithm, that a model is successfully output by the algorithm, and that the statistical properties of the data inside the pipeline match those of the data outside it (a minimal sketch of such checks follows this list).
  • Usually, the problems that machine learning is trying to solve are not completely new. There generally exists some system for ranking, classifying, or whatever problem you are trying to solve, which means there is already a set of rules and heuristics. A heuristic is a series of approximate steps that help you model the data, and these same heuristics can give you an edge when applying machine learning. Try to turn heuristics into useful data: the transition to a machine-learned system will be smoother, since heuristics may contain a lot of the intuition about the system that you don't want to throw away.
  • Now comes the monitoring part. Depending on the use case, it is possible that performance may decrease after a day, a week, or perhaps longer. It makes sense to have an alerting system continuously watching the model and triggering retraining when needed.
  • Use an appropriate evaluation metric for your model. For example, know when to use an ROC curve versus when to use accuracy.
  • Watch for salient failures, which provide exceptionally useful information to the ML algorithm.
  • Often, one may not have properly quantified the true objective, or the objective may change as the project advances. Further, different team members may have different understandings of the objectives; in fact, there is often no single "true" objective. So train on a simple ML objective, and add a "policy layer" on top that allows one to add additional logic and rank ML models as needed.
  • Using simple pipelines makes debugging easier.
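
The tests themselves can stay very small. Below is a minimal sketch (not a prescribed implementation) of the three checks from the test-infrastructure point above; the data arrays and the model object are assumed to come from your own loaders and training code.

```python
# Model-independent pipeline checks: data goes in, a model comes out,
# and the data statistics inside the pipeline match the data outside it.
import numpy as np

def check_data_is_fed(X, y):
    # Verify the pipeline actually delivers training examples to the algorithm.
    assert len(X) > 0 and len(X) == len(y), "no training examples reached the algorithm"

def check_model_is_produced(model):
    # Verify that training actually emitted a usable model object.
    assert model is not None and hasattr(model, "predict"), "training did not output a model"

def check_statistics_match(X_outside, X_inside, tol=0.05):
    # Compare per-feature means of data outside vs. inside the pipeline;
    # in practice variances, missing-value rates, and category frequencies
    # would be compared as well.
    drift = np.abs(np.mean(X_outside, axis=0) - np.mean(X_inside, axis=0))
    assert np.all(drift < tol), "data statistics changed inside the pipeline"
```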

In the first phase of the lifecycle of a machine learning system, the important thing is to push the training data into the learning algorithm, evaluate any metrics of interest, and create a serving infrastructure that can be built upon. After that, Phase 2 begins.
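
To make Phase 1 concrete, here is a minimal sketch of such a first pipeline, assuming a simple binary-classification task on synthetic, imbalanced data (all names and numbers are illustrative): a simple model, an explicit evaluation step, and a serving stub that later phases can build on. With a 95/5 class split, accuracy is misleading, which is why ROC AUC is also reported.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Move data to the learning algorithm (here: a synthetic, imbalanced dataset).
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# 2. Keep the model simple.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Evaluate the metrics of interest; with a 95/5 class split, accuracy
#    alone is misleading, so ROC AUC is the more informative choice.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("ROC AUC :", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# 4. A serving stub the rest of the system can be built on.
def serve(features: np.ndarray) -> float:
    """Return the model's predicted probability for one example."""
    return float(model.predict_proba(features.reshape(1, -1))[0, 1])
```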

Phase 2: Feature Engineering

In the second phase, there is a lot of low-hanging fruit. Feature combination and tweaking can generate improvements, and a rise in performance is generally easy to visualize.

  • Be sure to employ model versioning as the model is trained and upgraded.
  • As ML models train, they try to find the lowest value of the loss function, which in theory should minimize error. However, this function may be complex, and one can end up stuck in a different local minimum with each run. This can make it hard to determine whether a change to the system is meaningful or not. By creating a model without deep, complex features, you can get an excellent baseline performance. After the baseline, more esoteric approaches can be tried and tested, such as combining features to make more complex ones.
  • Explore features that generalize across different data contexts.
  • Specific feature use may result in better optimization. The reason is that, with a lot of data, it is simpler to learn many simple features than a few complex ones. Regularization can come in handy to eliminate features that apply to only a few examples (a minimal sketch follows this list).
  • Apply transformations to combine and modify existing features to create new features in human-understandable ways.
  • It is important to understand that the number of feature weights that can be learned in a linear model is roughly proportional to the amount of data available. The key is to scale the number of features and their respective complexities to the size of the data.
  • Features that are no longer required should be discarded.
  • One should apply human analysis to the system. This requires calculating the delta between models, and being aware of any changes when new data (or a new user) is introduced to a model in production.
  • New features can be created from patterns observed in measurable quantities (metrics). Hence, it is a good idea to have an interface to visualize training and performance.
  • Quantifying undesirable observed behaviour can help in analyzing the properties of the system that are not captured by the existing loss function.
  • It is not always true that short-term behaviour is an indication of long-term behaviour. Models sometimes need to be frequently tuned.
  • Study the training-serving skew. This is the difference between performance during training and performance during testing/serving. The reasons for this skew can be:
    • A discrepancy due to differences in data handling between training and testing/serving.
    • A change in the data between these steps.
    • The presence of feedback loops between the model and your training algorithm.
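
To illustrate the regularization point above, here is a small sketch on synthetic data (the dataset shape and hyperparameters are assumptions): an L1 penalty drives the weights of features that are active on only a handful of examples to zero, so they can be discarded.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
# A few simple, broadly useful features...
X_common = rng.normal(size=(n, 5))
# ...plus many "rare" features, each non-zero for only a few examples.
X_rare = np.zeros((n, 50))
for j in range(50):
    idx = rng.choice(n, size=3, replace=False)
    X_rare[idx, j] = 1.0
X = np.hstack([X_common, X_rare])
y = (X_common[:, 0] + 0.5 * X_common[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# The L1 penalty pushes the weights of rarely-active features to exactly zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_[0])
print("features with non-zero weight:", kept)  # typically only the common features survive
```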

One solution is to monitor training and testing/serving explicitly so that changes in the system or data do not introduce unnoticed skew.
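
One way to make that monitoring explicit is to compare, feature by feature, the distribution seen during training with the distribution seen at serving time. The sketch below assumes serving-time feature values are logged somewhere, and uses a two-sample Kolmogorov-Smirnov test as the comparison.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_skew(X_train: np.ndarray, X_serving: np.ndarray, alpha: float = 0.01):
    """Flag features whose training and serving distributions differ significantly."""
    skewed = []
    for j in range(X_train.shape[1]):
        stat, p_value = ks_2samp(X_train[:, j], X_serving[:, j])
        if p_value < alpha:
            skewed.append((j, stat))
    return skewed

# Hypothetical usage with logged serving features:
# skewed = detect_feature_skew(X_train, X_logged_at_serving)
# if skewed: raise an alert / trigger retraining
```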

Phase 3: Optimization refinement and complex models

There will be certain indicators that suggest the end of Phase 2. One may observe that monthly gains start to diminish, and there will be trade-offs between metrics: some rise while others fall across experiments. This is where one notices the need for model sophistication, as gains become harder to achieve.

  • Take a better look at the objective. If unaligned objectives are an issue, don't waste time on new features. As stated before, if product goals are not covered by the existing algorithmic objectives, one needs to change either the objectives or the product goals.
  • Keep ensembles simple: each model should either be an ensemble (taking only the outputs of other models as input) or a base model (taking many features), but never both (a minimal sketch follows this list).
  • Once performance plateaus, looking for qualitatively new sources of information can be more useful than refining existing signals.
  • When dealing with content, one may be interested in predicting popularity (e.g. the number of clicks a post on social media receives). In training a model, one may add features that allow the system to personalize (features representing how interested a user is), diversify (features quantifying whether the current social media post is similar to other posts liked by a user), and measure relevance (the appropriateness of a query result). However, one may find that these features are weighted less heavily by the ML system than expected. This doesn't mean that diversity, personalization, or relevance aren't valuable.
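
A minimal sketch of the "keep ensembles simple" rule, using scikit-learn's stacking as one possible illustration (the base models and dataset are assumptions): the base models see the raw features, while the ensemble layer sees only their predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models: each takes the full set of raw features.
base_models = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Ensemble layer: consumes only the base models' outputs, never the raw features.
ensemble = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    passthrough=False,  # keep raw features out of the ensemble layer
)
ensemble.fit(X_train, y_train)
print("held-out accuracy:", ensemble.score(X_test, y_test))
```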

With all these steps in mind, it is clear that one cannot simply go about implementing standalone ML code. One needs a sophisticated ML architecture to address the complications and improvisations that come with developing an ML environment.

[Diagram: ML engineering pipeline]

As can be seen in the diagram above, the pipeline includes the following stages:

  • Source control
  • Test and build services
  • Deployment services
  • Model registry
  • Feature store
  • ML metadata store
  • ML pipeline orchestrator

These stages can be analyzed in more detail in the following diagram:

[Diagram: detailed view of the ML pipeline stages]

Let's take an example task: churn prediction. The idea is to determine the number of people leaving a given workplace by using various parameters. The goal is to implement CI/CD integration when deploying, and Kubernetes is used as the environment to support the various processes involved in the integration.

[Diagram: churn prediction deployment pipeline on Kubernetes]
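
As an illustration only, the churn model at the centre of this example could be as simple as the sketch below; the dataset path, column names, and model choice are assumptions. The point is that the training step produces a versioned artifact that the CI/CD pipeline can test and deploy.

```python
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")         # hypothetical dataset
X = df.drop(columns=["churned"])      # hypothetical label column
y = df["churned"]
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
print("validation ROC AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Persist a versioned artifact for the deployment stage of the pipeline.
joblib.dump(model, "churn-model-v1.joblib")
```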

Once the model is deployed, the following things need to be kept in mind:

Evaluation: measuring the quality of predictions (offline evaluation, online evaluation, evaluation using business tools, and evaluation using statistical tools)

Monitoring: tracking quality over time

Management: improving the deployed model with feedback → redeployment
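
A minimal sketch of what that monitoring step might look like, with an assumed quality threshold: evaluate each batch of labelled predictions offline, track the metric over time, and alert (or trigger retraining and redeployment) when the recent average drops too low.

```python
from sklearn.metrics import roc_auc_score

ALERT_THRESHOLD = 0.75  # assumed minimum acceptable quality

def evaluate_batch(y_true, y_scores):
    """Offline evaluation of one batch of labelled predictions."""
    return roc_auc_score(y_true, y_scores)

def check_quality(metric_history, window=7):
    """Alert if the recent average quality falls below the threshold."""
    recent = metric_history[-window:]
    avg = sum(recent) / len(recent)
    if avg < ALERT_THRESHOLD:
        print(f"ALERT: average ROC AUC over last {len(recent)} batches is {avg:.3f}; "
              "consider retraining and redeploying the model.")
    return avg
```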

So, in conclusion, the need for automation and monitoring of all steps of ML system construction is clear. A well-engineered ML solution won't simply make the development process easier; it will also make it coherent and resilient.