Framework for a successful continuous training strategy

ML models are built on the assumption that the data seen in production will be similar to the data observed in the past, i.e., the data we trained our models on. While this may be true for some specific use cases, most models operate in dynamic data environments where data is constantly changing and where “concept drifts” are likely to happen and adversely impact the models’ accuracy and reliability.

To deal with this, ML models need to be retrained regularly. Or as stated in Google’s “MLOps: Continuous delivery and automation pipelines in machine learning“: “To address these challenges and to maintain your model’s accuracy in production, you need to do the following: Actively monitor the quality of your model in production […] and frequently retrain your production models.” This concept is called ‘Continuous Training’ (CT) and is part of the MLOps practice. Continuous training seeks to automatically and continuously retrain the model to adapt to changes that might occur in the data.

There are different approaches/methodologies to perform continuous retraining, each with its own pros, cons, and cost. Yet, like the proverbial shoemaker who walks barefoot, we data scientists tend to overdo retraining, sometimes manually, and often use it as a “default” solution without enough production-driven insights.

Each ML use case has its own dynamic data environment that can cause concept drifts: from real-time trading, to fraud detection where an adversary actively shifts the data distribution, to recommendation engines facing a wealth of new movies and new trends. Yet, regardless of the use case, three main questions need to be addressed when designing a continuous training strategy:

1 – When should the model be retrained? As the goal is to keep running models that are highly relevant and that perform optimally at any point in time, how often should the model be retrained?

2 – What data should be used? The common assumption when selecting an adequate dataset is that the relevance of the data is correlated with how recent it is, which triggers a set of questions: should we use only new data or add it to older datasets? What is a good balance between old and new data? How recent does data need to be to count as new, i.e., where is the cut between old and new data?

3 – What should be retrained? Can we replace data and retrain the same model with the same hyperparameters? Or should we take a more intrusive approach and run a full pipeline that simulates our research process?

Each of the three questions above can be answered separately and can help determine the optimal strategy for each case. Yet, while answering them, there is a list of considerations to take into account, which we investigate below. For each question, we describe different approaches, corresponding to different levels of automation and maturity of the ML process.

When to retrain?

The three most common strategies are periodic retraining, performance-based, or based on data changes.

Periodic retraining

A periodic retraining schedule is the most naive and straightforward approach. Usually, it is time-based: the model is retrained every 3 months, for example. But it can also be volume-based, e.g., retraining for every 100K new labels.
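As a minimal sketch, a periodic trigger can combine both conditions; the interval, label-volume threshold, and helper name below are illustrative assumptions, not a prescribed implementation:

```python
from datetime import datetime, timedelta

# Illustrative thresholds: retrain every 3 months or every 100K new labels.
RETRAIN_INTERVAL = timedelta(days=90)
RETRAIN_LABEL_VOLUME = 100_000

def should_retrain(last_trained_at: datetime, new_labels_since_training: int) -> bool:
    """Return True when either the time-based or the volume-based condition is met."""
    time_due = datetime.utcnow() - last_trained_at >= RETRAIN_INTERVAL
    volume_due = new_labels_since_training >= RETRAIN_LABEL_VOLUME
    return time_due or volume_due
```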

The advantages of periodic retraining are just as straightforward: this is a very intuitive and manageable approach as it is easy to know when the next retraining iteration will happen, and it is easy to implement.

This method, though, often reveals itself to be ill-fitted or inefficient. How often do you retrain the model? Every day? Every week? Every three months? Once a year? While one may want to retrain frequently to keep models up to date, retraining the model too often when there is no actual reason, i.e., no concept drift, is costly. Besides, even when retraining is automated, it requires significant resources, both computational and from your data science teams, who need to oversee the retraining process and the new model’s behavior in production after deployment. Conversely, retraining at long intervals may defeat the purpose of continuous retraining and fail to adapt to changes in the data, not to mention the risk of retraining on noisy data.

At the end of the day, a periodic retraining schedule makes sense if the frequency is aligned with the dynamism of your domain. Otherwise, the selection of an arbitrary time or milestone may expose you to risks and leave you with models that are less relevant than their previous version.

Performance-based trigger 

It’s almost a common-sense claim based on the good old engineering adage: “if it ain’t broke, don’t fix it.” The second most common approach to determining when to retrain is to leverage performance-based triggers and retrain the model once you detect performance degradation.

This approach, more empirical than the first one, assumes that you have a continuous view of the performance of your models in production.

The main limitation of relying on performance only is the time it takes to obtain your ground truth – if you obtain it at all. In user conversion prediction cases, it can take 30 or 60 days until you get the ground truth, or even 6 months or more in cases such as transaction fraud detection or LTV. If you need to wait that long for full feedback, you will end up retraining the model too late, after the business has already been impacted.

Another non-trivial question that needs to be answered is: what is considered ‘performance degradation’? You are dependent on the sensitivity of your thresholds and the accuracy of their calibration, which could lead you to retrain too frequently or not frequently enough.

Overall, performance-based triggers are a good fit for use cases with fast feedback and a high volume of data, like real-time bidding, where you can measure the model’s performance as close as possible to the time of prediction, at short time intervals and with high confidence (high volume).
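A minimal sketch of such a trigger, assuming you track a recent production metric (AUC here) against a calibrated baseline; the metric, tolerance, and function name are illustrative:

```python
def performance_trigger(recent_auc: float, baseline_auc: float, tolerance: float = 0.02) -> bool:
    """Trigger retraining when the recent AUC drops more than `tolerance` below the baseline."""
    return (baseline_auc - recent_auc) > tolerance

# Example with made-up numbers:
if performance_trigger(recent_auc=0.71, baseline_auc=0.76):
    print("Performance degradation detected - schedule retraining")
```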

Driven by data changes

This approach is the most dynamic and naturally triggers retraining from the dynamism of the domain. By measuring changes in the input data, i.e., changes in the distribution of the features used by the model, you can detect data drifts indicating that your model may be outdated and needs to be retrained on fresh data.

It is an alternative to the performance-based approach for cases where you don’t have fast feedback or cannot assess the performance of the model in production in real time. Besides, it is good practice to combine this approach with performance-based triggers even when feedback is fast, as data changes may indicate suboptimal performance before any degradation is measured. Understanding data changes to start a retraining process is very valuable, especially in dynamic environments.
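As a rough sketch, drift can be measured by comparing each feature’s distribution in recent production data against the training set, for example with a Kolmogorov-Smirnov test; the threshold and data frames below are assumptions for illustration:

```python
import pandas as pd
from scipy.stats import ks_2samp

def drift_scores(train_df: pd.DataFrame, prod_df: pd.DataFrame, features: list) -> dict:
    """KS statistic per numeric feature: training data vs. recent production data."""
    return {f: ks_2samp(train_df[f].dropna(), prod_df[f].dropna()).statistic for f in features}

def drift_trigger(train_df, prod_df, features, threshold: float = 0.2) -> bool:
    """Trigger retraining if any feature drifts beyond a use-case-specific threshold."""
    return any(s > threshold for s in drift_scores(train_df, prod_df, features).values())
```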

[Figure: Data drift metric and performance KPI over time, generated by the monitoring service]

The graph above, generated by the monitoring service, shows a data drift metric (upper graph) and a performance KPI (lower graph) on a marketing use case. The data drift graph is a timeline where the Y-axis shows the level of drift for each day (i.e., the data on this day) relative to the training set. In this marketing use-case example, new campaigns are introduced very frequently, and the business expands to new countries. Clearly, the data streaming in production is drifting and becoming less similar to the data on which the model was trained. By having a more thorough view of the model in production, a retraining iteration could have been triggered well before a performance degradation would have been observed by the business. 

So when to retrain? This depends on key factors such as the availability of your feedback, the volume of your data, and your visibility into the performance of your models. Overall, there is no one-size-fits-all answer to the question of selecting the right time to retrain. Rather, and depending on your resources, the goal should be to be as production-driven as possible.

What data should be used?

While we have seen that timing is everything (i.e., when do I retrain my models?), now let’s look at the nerve of all MLOps: the data. When retraining, how do I select the right data to be used? 

Fixed window size

The most naive approach is to use a fixed window size as a training set, for example, the data from the last X months. The advantage of this method is its simplicity: it is very straightforward to implement.
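In pandas terms, this amounts to a simple date filter; the column name and window length below are illustrative assumptions:

```python
import pandas as pd

def fixed_window(df: pd.DataFrame, date_col: str = "event_date", months: int = 3) -> pd.DataFrame:
    """Keep only the last `months` of data as the retraining set (assumes a datetime column)."""
    cutoff = df[date_col].max() - pd.DateOffset(months=months)
    return df[df[date_col] >= cutoff]
```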

The disadvantages derive from the challenge of selecting the right window: if it is too wide, it may miss new trends, and if it is too narrow, the training set will be noisy and prone to overfitting. The choice of window size should also take into account the frequency of retraining: it makes no sense to retrain every 3 months if you only use the last month of data as a training set.

Besides, the selected data can be very different from the data being streamed in production: the window may fall right after a change point (in case of a fast or sudden change), or it may contain irregular events (e.g., holidays or special days like election day), data issues, and more.

Overall, the fixed window approach, like any other “static” method, is a simple heuristic that may work well in some cases but will fail to capture the hyper-dynamism of your environment when the rate of change varies and irregular events, like holidays or technical issues, are common.

Dynamic window size

The dynamic window size approach tries to solve some of the limitations of the predefined window size by determining how much historical data should be included in a more data-driven way and by treating the window size as another hyperparameter that can be optimized as part of the retraining. 

This approach is most suitable for cases in which there is fast feedback (e.g., real-time bidding or food delivery ETA). The most recent data can be used as a test set, and the window size can be optimized on it, just like another hyperparameter of the model. In the case illustrated below, the highest performance is achieved by taking the last 3 weeks, which is what should be selected for this iteration. For future iterations, a different window may be selected according to the comparison with the test set.

[Figure: Model performance per candidate training window size]
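A minimal sketch of this idea, assuming fast feedback, a `label` column, a datetime column, and a user-supplied training routine (all illustrative assumptions): hold out the most recent week as a test set and pick the training window that performs best on it.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def pick_window(df, train_fn, date_col="event_date", candidate_weeks=(1, 2, 3, 4, 8)):
    """Treat the training window size (in weeks) as a hyperparameter and select it by AUC
    on the most recent week of data, which is held out as a test set."""
    test_start = df[date_col].max() - pd.Timedelta(weeks=1)
    test, history = df[df[date_col] >= test_start], df[df[date_col] < test_start]

    best_weeks, best_auc = None, -1.0
    for weeks in candidate_weeks:
        train = history[history[date_col] >= test_start - pd.Timedelta(weeks=weeks)]
        model = train_fn(train)  # user-supplied routine returning a fitted classifier
        scores = model.predict_proba(test.drop(columns=["label", date_col]))[:, 1]
        auc = roc_auc_score(test["label"], scores)
        if auc > best_auc:
            best_weeks, best_auc = weeks, auc
    return best_weeks, best_auc
```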

The advantage of the dynamic window size approach is that it is more data-driven, based on the model performance, which is really the bottom line, and hence it is more robust for highly temporal environments.

The disadvantage is that it requires a more complex training pipeline (see the next question, ‘What should be retrained?’) to test the different window sizes and select the optimal one, and it is much more compute-intensive. Additionally, like the fixed window size, it assumes that the most recent data is the most relevant, which is not always true.

Dynamic data selection

The third approach to selecting the data to train the models seeks to achieve the basic objective of any retraining strategy, which is also the basic assumption of any ML model: using training data that is similar to the production data. In other words, models should be retrained on data that is as similar as possible to the data used in real life to issue predictions. This approach is complex and requires high visibility into the production data.

To do so, you need to perform a thorough analysis of the evolution of the data in production to understand if and how it changes over time. One way to do this is to calculate the drift in input data between each pair of days, i.e., how much the data of one day differs from the data of another day, and to generate a heat map that reveals the change patterns of the data over time, like the one shown below. The heat map is a visualization of drift over time where the axes (columns and rows) are dates and each cell in the matrix is the level of drift between two days: the higher the drift, the redder the cell.

[Figure: Day-to-day drift heat map used for dynamic data selection]

Above, you can see the result of such an analysis generated automatically by Superwise to help spot data that is as similar as possible to the one that is now streaming. There are different types of insights that can transpire easily from this view: 

1 – The month of December is very different from the rest of the data (before and after). This insight enables you to avoid treating the whole period as one bulk and to exclude these days when retraining, as this month’s data doesn’t represent the normal data observed in production.

2 – The seasonality of the data can be visualized, which can trigger a discussion around the necessity of using different models for weekdays and weekends.

One more thing that is very powerful here: you can actually have a picture that proves that the most recent data isn’t necessarily the most relevant.
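As a sketch of how such a pairwise, day-to-day drift matrix could be computed for a single numeric feature (the KS statistic here is one possible drift measure, and the column names are assumptions):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def daily_drift_matrix(df: pd.DataFrame, date_col: str, feature: str) -> pd.DataFrame:
    """Pairwise KS statistic of one feature between every pair of days, i.e., the raw
    material for a day-to-day drift heat map."""
    days = sorted(df[date_col].dt.date.unique())
    groups = {d: df.loc[df[date_col].dt.date == d, feature].dropna() for d in days}
    matrix = np.zeros((len(days), len(days)))
    for i, d1 in enumerate(days):
        for j, d2 in enumerate(days):
            if i != j:
                matrix[i, j] = ks_2samp(groups[d1], groups[d2]).statistic
    return pd.DataFrame(matrix, index=days, columns=days)

# The resulting DataFrame can then be rendered as a heat map (e.g., with seaborn.heatmap).
```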

In short, selecting the right data to retrain your models requires a thorough view of the behavior of the data in production. While using fixed or dynamic window sizes may help you “tick the box”, it usually remains a guesstimate that might be more costly than efficient.

What should be retrained?

Now that we have analyzed when to retrain and what data should be used, let’s address the third question: what should be retrained? One could choose to retrain (refit) only the model instance on the new data, to include some or all of the data preparation steps, or to take a more intrusive approach and run a full pipeline that simulates the research process.

The basic assumption of retraining is that the model trained in the research phase will become outdated due to concept drift and hence needs to be retrained. However, in most cases, the model instance is just the last phase of a wider pipeline that was built in the research phase and includes data preparation steps. The question is: how severe is the drift and its impact? Or, in the context of retraining, which parts of the model pipeline should be challenged?

In the research phase, you experiment with and evaluate many elements within the model pipeline steps in an effort to optimize your model. These elements can be grouped into two high-level parts: data preparation and model training. The data preparation part is responsible for preparing the data (duh!) to be consumed by the model and includes methods like feature engineering, scaling, imputation, dimensionality reduction, etc., while the model training part is responsible for selecting the optimal model and its hyperparameters. At the end of the process, you get a pipeline of sequential steps that takes the data, prepares it to be used by the model, and predicts the target outcome using the model.

Retraining only the model, i.e., the last step in the pipeline, with new relevant data is the simplest and most naive approach and may suffice to avoid performance degradation. However, in some cases, this model-only approach may not be enough, and a more comprehensive approach should be taken regarding the scope of the retraining. Broadening the scope of retraining can be done along two dimensions:

1 – What to retrain? What parts of the pipeline should be retrained using the new data?

2 – Level of retraining: Are you just refitting the model (or other steps) with the new data? Do you perform hyperparameter optimization? Or do you challenge the selection of the model itself and test other model types?

Another way to look at this: in the retraining process, you are basically trying to automate the manual model optimization research a data scientist would otherwise perform in the absence of an automatic retraining process. Hence, you need to decide to what extent you want to mimic the manual research process, as illustrated in the chart below.

[Figure: Model research stages, as described in “MLOps: Continuous delivery and automation pipelines in machine learning”]

Note that automating the first steps of the flow is more complex. As more automation is built around these experimentations, the process becomes more robust and flexible in adapting to changes, but it also adds complexity, as more edge cases and checks need to be considered.

Example & conclusion

Let’s take, for example, a simple model pipeline that is the result of manual research done by our awesome data science team for a classification task:

[Figure: Example model pipeline produced by manual research for a classification task]

You can take the simple approach and retrain only the model, or you can include some (or all) data preparation steps (e.g., relearn the mean in the imputer or the min/max values in the scaler). But besides selecting the steps to be retrained, you need to decide at what level. For example, even if you choose to retrain just the model, it can be done on multiple levels. You can refit the model itself and let it learn a new tree, i.e., model.fit(new_X, new_y), or you can search and optimize the model’s hyperparameters (max depth, min leaf size, max leaf nodes, etc.), or even challenge the selection of the model itself and test other model types (e.g., logistic regression vs. random forest). The data preparation steps can (and should) also be retrained to test different scalers, different imputation methods, or even different feature selections.
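As a hedged illustration of these levels using scikit-learn (the pipeline, steps, and parameter grid below are assumptions matching the example above, not the team’s actual pipeline):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),   # learns feature means
    ("scaler", MinMaxScaler()),                    # learns min/max values
    ("model", DecisionTreeClassifier()),           # learns the tree
])

# Level 1 - refit only the final model step, keeping the previously fitted imputer/scaler
# (assumes `pipeline` was already fitted on the old training set):
# pipeline.named_steps["model"].fit(pipeline[:-1].transform(new_X), new_y)

# Level 2 - refit the whole pipeline, relearning imputer means and scaler min/max too:
# pipeline.fit(new_X, new_y)

# Level 3 - also re-optimize the model's hyperparameters as part of retraining:
search = GridSearchCV(
    pipeline,
    param_grid={"model__max_depth": [3, 5, 10], "model__min_samples_leaf": [1, 10, 50]},
    cv=5,
)
# search.fit(new_X, new_y); best_pipeline = search.best_estimator_
```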

When choosing to take the more comprehensive approach and perform hyperparameter optimization or model search, you can also use AutoML frameworks designed to automate the process of building and optimizing the model pipeline, including data preparation, model search, and hyperparameter optimization. And it doesn’t have to involve complex meta-learning. It can be simple: the user selects a range of models and parameters. Many tools are available, such as AutoKeras and auto-sklearn. Yet, as tempting as AutoML is, it doesn’t exempt the user from having a robust process to track, measure, and monitor the models throughout their lifecycle, especially before and after deploying them in production.
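Even without a dedicated AutoML framework, a simple model search over a user-selected range of models and parameters can be sketched as follows (again, an illustrative setup rather than a recommended configuration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

base = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# GridSearchCV can swap the final step between model types and tune each one.
model_search = GridSearchCV(
    base,
    param_grid=[
        {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1.0, 10.0]},
        {"model": [RandomForestClassifier()], "model__n_estimators": [100, 300]},
    ],
    cv=5,
    scoring="roc_auc",
)
# model_search.fit(new_X, new_y); best_pipeline = model_search.best_estimator_
```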

At the end of the day, “flipping the switch” and automating your processes must be accompanied by robust processes to assure that you remain in control of your data and your models.

To achieve efficient continuous training, you should be able to lead with production-driven insights. For any and every use case, the creation of a robust ML infrastructure relies heavily on the ability to achieve visibility and control over the health of your models and the health of your data in production. That’s what it means to be a data-driven data scientist 😉

Part I: Safely rolling out ML models to production

This piece is the first part of a series of articles on production pitfalls and how to rise to the challenge. 

CI/CD best practices to painlessly deploy ML models and versions

For any data scientist, the day you roll out your model’s new version to production is a day of mixed feelings. On the one hand, you are releasing a new version that is geared towards yielding better results and making a greater impact; on the other, this is a rather scary and tense time. Your new shiny version may contain flaws that you will only be able to detect after they have had a negative impact.

[Image: ML is as complex as orchestrating different instruments]

Replacing or publishing a new version to production touches upon the core decision-making logic of your business process. With AI adoption rising, the necessity to automatically publish and update models is becoming a common and frequent task, which makes it a top concern for data science teams.

In this two-part article, we will review what makes the rollout of a new version so sensitive, what precautions are required, and how to leverage monitoring to optimize your Continuous Integration (CI) pipeline (Part I) as well as your Continuous Deployment (CD) pipeline to safely achieve your goals.

The complexity of the ML orchestra

ML systems are composed of multiple moving and independent parts of software that need to work in tune with each other:

  • Training pipeline: Includes all the processing steps that turn your historical dataset into a working model: data pre-processing (such as embeddings, scaling, feature engineering, feature selection, or dimensionality reduction), hyperparameter tuning, and performance evaluation using cross-validation or hold-out sets.
  • Model registry: A deployed model can take various forms: specific object serialization, such as pickles, or cross-technology serialization formats, such as PMML. Usually, these files are kept in a registry based on shared file storage (S3, GCS,…) or even in a version repository (git). Another option to persist the model is by saving the new model parameters, i.e., saving a logistic regression coefficient in a database. 
  • Serving layer: This is the actual prediction service. Such layers can be embedded together with the business logic of the application that relies on the model predictions or can be separated to act as a prediction service decoupled from the business processes it supports. In both cases, the core functionality is to retrieve the relevant predictions according to new incoming requests using the latest model in the model registry. Such a service can work in a batch or stream manner. While inferring the predictions, the preprocessing steps that were taken in the training pipeline should also be aligned and used with the same logic. 
  • Label collection: A process that collects the ground truth for supervised learning cases. It can be done manually, automatically, or using a mixed approach, such as active learning.
  • Monitoring: An external service that monitors the entire process, from the quality of inputs that get into the serving and up to the collected labels, to detect drift, biases, or integrity issues.
[Figure: The ML orchestra]

Given this relatively high-level and complex orchestration, many things can get out of sync and lead us to deploy an underperforming model. Some of the most common culprits are:

Lack of automation 

Most organizations are still manually updating their models. Whether it is the training pipeline or parts of it, such as feature selection, or the delivery and promotion of the newly created model to the serving layer, doing these steps manually can lead to errors and unnecessary overhead. To be really efficient, all the processes, from training to monitoring, should be automated to leave less room for error and more space for efficiency.

A multiplicity of stakeholders

Because multiple stakeholders and experts are involved, there are more handovers and, thus, more room for misunderstandings and integration issues. While the data scientists design the flows, the ML engineers usually do the coding, and without full alignment, this may result in models that work well functionally but fail silently, for example by using the wrong scaling method or by implementing an incorrect feature engineering logic.

Hyper-dynamism of real-life data

Research mode (batch) is different from production. Developing and empirically researching the optimal training pipeline that yields the best model is done in offline lab environments, with historical datasets, and by simulating hold-out tests. Needless to say, these environments are very different from production, where only part of the data is actually available. This often results in data leakage or wrong assumptions that lead to poor performance, bias, or problematic code behavior once the model is in production and needs to serve live data streams. For instance, a newly deployed version may be unable to handle new values in a categorical feature while pre-processing it into an embedded vector during the inference phase.

Silent failures of ML models vs. traditional IT monitoring

Monitoring ML is more complex than monitoring traditional software – the fact that the entire system is working doesn’t mean it actually does what it should do. Because all the culprits listed above may result in functional errors, and these may be “silent failures”, it is only through robust monitoring that you can detect failures before it is too late and your business is already impacted.

The risks associated with these pitfalls are intrinsically related to the nature of ML operations. Yet, the best practices to overcome them may well come from traditional software engineering and its use of CI/CD (Continuous Integration/Continuous Deployment) methodologies. To analyze and recommend best practices for the safe rollout of models, we use the CI/CD grid to explain which steps should be taken.

Best practices for the CI phase

CI practices are about frequently testing the codebases of each of the software modules, or unit tests, and the integrity of the different modules working together by using integration/system tests. 

Now let’s translate this to the ML orchestra. A new model, or version, should be considered a software artifact that is part of the general system. As such, the creation of a new model requires a set of unit and integration tests to ensure that the new “candidate” model is valid before it is integrated into the model registry. But in the ML realm, CI is not only about testing and validating code and components but also about testing and validating data, data schemas, and model quality. While the focus of CI is to maintain valid codebases and module artifacts before building new artifacts for each module, the CD process handles the phase of actually deploying the artifacts into production.

Here are some of the main best practices for the CI phase that impact the safe rollout of model/new version implementations:

Data validation

Models are retrained/produced using historical data. For the model to be relevant in production, the training data set should adequately represent the data distribution that currently appears in production. This will avoid selection bias or simply irrelevance. To do so, before even starting the training pipeline, the distribution of the training set should be tested to ensure that it is fit for the task. At this stage, the monitoring solution can be leveraged to provide detailed reports on the distributions of the last production cases, and by using statistical tools such as deequ, this type of data verification constraint can be automatically added to the CI process.

[Figure: Extracting data distributions from production using the monitoring service and comparing them to your training dataset]
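Tools like deequ express such checks as declarative constraints on Spark data; a much lighter pandas/scipy sketch of the same kind of CI gate might look like the following (the threshold, feature list, and the way production reference data is obtained are all assumptions):

```python
import pandas as pd
from scipy.stats import ks_2samp

def validate_training_set(train_df: pd.DataFrame, prod_df: pd.DataFrame,
                          features: list, max_ks: float = 0.2) -> None:
    """CI gate: fail the build if the training set no longer resembles recent production
    data (prod_df would typically be exported from the monitoring service)."""
    for feature in features:
        stat = ks_2samp(train_df[feature].dropna(), prod_df[feature].dropna()).statistic
        assert stat <= max_ks, f"Training data drifted from production on '{feature}' (KS={stat:.2f})"
```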

Model quality validation 

When executing the training pipeline and before submitting the new model as a “candidate” model into the registry, ensure that the new training process undergoes a healthy fit verification.

Even if the training pipeline is automated, it should include a hold-out/cross-validation model evaluation step.

Given the selected validation method, one should check that the fitted model’s convergence doesn’t indicate overfitting, i.e., a decreasing loss on the training dataset while it increases on the validation set. The performance should also be above a certain minimal rate, whether based on a hardcoded threshold, a naive model used as a baseline, or calculated dynamically by leveraging the monitoring service and extracting the production model’s rates over the same validation period.
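A minimal sketch of such a quality gate for a binary classifier, assuming AUC as the metric and a naive baseline; the thresholds are illustrative and should be calibrated per use case (e.g., from the monitoring service):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

def quality_gate(candidate, X, y, min_gain: float = 0.02, max_fit_gap: float = 0.05) -> bool:
    """The candidate must beat a naive baseline on cross-validation and must not show a
    large train/validation AUC gap that would indicate overfitting."""
    cand_cv = cross_val_score(candidate, X, y, cv=5, scoring="roc_auc").mean()
    base_cv = cross_val_score(DummyClassifier(strategy="prior"), X, y, cv=5, scoring="roc_auc").mean()

    candidate.fit(X, y)
    train_auc = roc_auc_score(y, candidate.predict_proba(X)[:, 1])

    return (cand_cv - base_cv >= min_gain) and (train_auc - cand_cv <= max_fit_gap)
```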

Leveraging monitoring for the CI phase 

Test cases – Model robustness for production data assumptions

Once the model quality validation phase is completed, one should perform integration tests to see how the serving layer integrates with the new model and whether it successfully serves predictions for specific edge cases. For instance: handling null values in features that could be nullable, handling new levels in categorical features, working with different lengths of text for text inputs, or even working with different image sizes/resolutions. Here too, the examples can be synthesized manually or taken from the monitoring solution, whose capabilities include identifying and saving valid inputs with data integrity issues.
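Such edge cases can be captured as pytest-style integration tests against the serving layer’s prediction entry point; the `predict` fixture, feature names, and payloads below are hypothetical:

```python
import numpy as np
import pandas as pd

def test_handles_missing_values(predict):
    """The serving layer should not fail on a nullable feature arriving as null."""
    row = pd.DataFrame([{"age": np.nan, "country": "FR", "amount": 12.5}])
    assert predict(row) is not None

def test_handles_unseen_category(predict):
    """A categorical level never seen in training should still yield a prediction."""
    row = pd.DataFrame([{"age": 31, "country": "NEW_MARKET", "amount": 7.0}])
    assert predict(row) is not None
```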

Model stress test 

Changing the model, its pre-processing steps, or its packages could also impact the model’s operational performance. In many use cases, such as real-time bidding, an increase in serving latency might dramatically impact the business.

Therefore, as a final step in the model CI process, a stress test should be performed to measure the average latency to serve a prediction. Such a metric can be evaluated relative to a business constraint or relative to the current production operational model latency, calculated by the monitoring solution. 
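A simple latency measurement, using an assumed `predict` callable and a set of sample requests, could be sketched as follows; the 50 ms SLA in the final comment is purely illustrative:

```python
import time

def measure_latency(predict, sample_requests, n_warmup: int = 10):
    """Return average and p95 serving latency in milliseconds over the sample requests."""
    for req in sample_requests[:n_warmup]:
        predict(req)  # warm up caches / lazy model loading
    timings = []
    for req in sample_requests:
        start = time.perf_counter()
        predict(req)
        timings.append((time.perf_counter() - start) * 1000)
    timings.sort()
    avg_ms = sum(timings) / len(timings)
    p95_ms = timings[max(0, int(0.95 * len(timings)) - 1)]
    return avg_ms, p95_ms

# e.g., assert p95_ms <= 50 against a business SLA, or compare against the current
# production model's latency as reported by the monitoring solution.
```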

Conclusion

Whenever a new model is created, applying these practices to the CI pipeline before submitting it to the model “registry” as a potential “candidate” to replace the production model will help ensure that it works and integrates well with the serving layer.

Yet, while testing these functionalities is necessary, it remains insufficient to safely roll out models and address all the pitfalls listed in this article. For a thorough approach, one should also look at the deployment stage.

Next week, in the second part of this article, we will review what model CD strategies can be used to avoid the risks associated with the rollout of new models/versions, what these strategies require, in which cases they are better fitted, and how they can be achieved using a robust model monitoring service. So stay tuned!

If you have any questions, or feedback, or if you want to brainstorm on the content of this article, please reach out to me on LinkedIn