Thinking about building your own ML monitoring solution?

“We already have one!” That’s the first sentence most of our customers said when we met to discuss AI assurance solutions. Most AI-savvy organizations today have some form of monitoring. Yet, as they scale their activities, they find themselves at a crossroads: should they invest more in their homegrown solution or receive support from vendor solutions? And if they do choose to invest more, for how long will their DIY solution be “good enough”?

In this blog, we explore how far homegrown solutions can take you and what you need to think about when planning to scale your use of machine learning.

DIY tools are (only) a start when monitoring your AI

Data science teams spend months researching and training their best models. The production phase and the necessary MLOps/monitoring work sometimes only come as an afterthought. In this context, many data science and engineering teams develop initial AI monitoring tools in-house. And while DIY tools may be a decent approach for businesses with a contained use of AI, when the time comes to expand the use of modeling, homegrown tools fall short of supporting the diversity and complexity of the models and the data used. Here is a shortlist of lessons learned that we have witnessed with customers scaling their AI.

As they grow, the number of models and use cases grows

Guess what? Homegrown solutions don’t scale in sync with the models and require more and more maintenance, tweaks, and attention. This is especially true as organizations adopt AI for various use cases, from marketing to core activities embedded in their product.

Model monitoring is not a one-off task. As organizations adopt new models, they need to create a new monitoring paradigm that caters to different types of data – structured, text, image, video, etc. – all of which require different measures and techniques to analyze the incoming data. In other words, what works for a classification model probably won’t work for a regression or clustering one, and a new set of tools will need to be developed. And even within a specific structured use case, different features of the model require different KPIs to analyze the health of the process: numerical, categorical, time-based, and so on.
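To make this concrete, here is a minimal sketch of why a single check doesn’t fit all feature types: numerical features can be compared with a two-sample KS test, while categorical ones need a frequency-based test. The column handling and the 0.01 threshold are illustrative assumptions, not a prescription.

```python
# Minimal sketch: different feature types need different drift measures.
# Assumes two pandas DataFrames with the same columns; thresholds are illustrative.
import pandas as pd
from scipy import stats

def drift_report(train_df: pd.DataFrame, prod_df: pd.DataFrame) -> dict:
    report = {}
    for col in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            # Numerical feature: two-sample Kolmogorov-Smirnov test
            stat, p_value = stats.ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        else:
            # Categorical feature: chi-square test on category frequencies
            counts = pd.concat(
                [train_df[col].value_counts(), prod_df[col].value_counts()],
                axis=1,
            ).fillna(0)
            stat, p_value, _, _ = stats.chi2_contingency(counts.T)
        report[col] = {"statistic": stat, "p_value": p_value, "drifted": p_value < 0.01}
    return report
```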

Regardless of the sophistication of the models, monitoring is an ongoing task that requires 25%-40% of a data science team’s time. The inefficiency and frustration that come with this heavy investment in homegrown monitoring are among the first reasons that push organizations to turn to vendor solutions, along with the fact that they would much rather have their teams focus on creating models that have an impact on the business.

You don’t know what you don’t know

This is perhaps the most critical point. Organizations that have already engineered a solution to compute specific KPIs for their models often find themselves struggling to proactively understand when concept drift happens or when biases start to develop. More often than not, homegrown solutions look only at the things that are already known and the issues that were already anticipated, and thus realize too late when events occur beyond this scope. This is often the point where organizations recognize the limitations of their own solution, however sophisticated they engineered it to be, as it fails to bring value to the whole ML process.

In environments where data is extremely dynamic, assuring the health of models in production is about leveraging the expertise and best practices to be proactive: be alerted on issues that pertain to the health of the models, gain insights, and diagnose issues promptly.

Multiple stakeholders

As mentioned in a previous post, scaling AI poses the question of who owns it when it’s in production: data science teams? data engineering? business analysts? hybrid creatures? Ultimately, as AI use grows, the stakeholders involved also change, regardless of the number of models. Think about the fraud detection and cybersecurity space where analysts are the predominant users of the AI predictions and need to make sure the models are always tuned to a very dynamic data landscape.

For a monitoring solution to be useful, all the stakeholders involved need to derive insights and an understanding of the health of the predictions:

  • Data science teams need to understand if/when/how they should retrain the model, and the cases in which the model doesn’t perform well,
  • Business analysts want to know what drives decisions and get alerted as soon as there is high uncertainty regarding the model decision quality,
  • Data engineers need to know about the quality of the data streaming through the system, and whether it has outliers, missing values, or strange data distributions (a few illustrative checks are sketched below).
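For illustration only, here is a hedged sketch of the kind of basic data quality checks a data engineer might run on an incoming batch; the use of training-set statistics and the 3-standard-deviation rule are hypothetical choices, not a specific product feature.

```python
# Illustrative data quality checks on a batch of incoming records.
# Assumes train_stats = training_df.describe(); thresholds are hypothetical.
import pandas as pd

def data_quality_summary(batch: pd.DataFrame, train_stats: pd.DataFrame) -> pd.DataFrame:
    summary = pd.DataFrame(index=batch.columns)
    # Share of missing values per column
    summary["missing_ratio"] = batch.isna().mean()
    # Flag numeric values far outside the range seen at training time
    numeric = batch.select_dtypes("number")
    lower = train_stats.loc["min"] - 3 * train_stats.loc["std"]
    upper = train_stats.loc["max"] + 3 * train_stats.loc["std"]
    summary["outlier_ratio"] = (numeric.lt(lower) | numeric.gt(upper)).mean()
    return summary
```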

To do so, organizations need to create and maintain a view of the ML predictions that everyone involved can access and extract value from, without creating unnecessary noise. Beyond determining whether there are sufficient resources, there is also a matter of skill set, as the stakeholders often have different perspectives that need to be bridged under one enterprise-wide view. Ultimately, the complexity of these tasks is what drives AI practitioners who are scaling their activities to select a best-of-breed solution for assuring their models in production.

The amount of data grows exponentially!

In industries such as Adtech, where models process TBs of data each day, the velocity of the data is a challenge to obtaining a clear picture. Do you have the time and tools necessary to continuously extract, compare, and analyze statistical metrics for your ML process without impacting your core activities?

Scaling your AI? Here’s what you need to ask yourself

Here’s a quick list of considerations you may want to think over as you consider the best way to assure the health of your models in production. At the end of the day, it boils down to a question of resource management and efficiency: how much time should you invest in developing a set of tools to monitor your models in production today? And what will it cost you tomorrow as you add more and more models and use cases?

  • How much will a homegrown solution cost?
  • How efficient will this be in the long run?
  • Is it really what my team needs to focus on, or is it better to buy and use such a capability?
  • How can I foster an enterprise-wide understanding of the models’ health?
  • How can I make my monitoring solution a proactive one?

Don’t play around with your growth

At Superwise, we specialize in accompanying our customers as they transition from homegrown solutions – or even nothing at all! – to a rich model observability solution that helps them achieve business impact and grow their AI practice, enabling them to focus on what they do best: developing and deploying models that help their business grow.

Facing the challenges of day 2 with your models in production

AI is everywhere.

Businesses from all verticals are promptly adopting AI algorithms to automate some of their most important decisions: from approving transactions to diagnosing cancer to granting credit and so much more.

As the AI race booms, more organizations are stepping into “Day 2”, the day their models move out of the realm of research and training and into production. And this is when the picture starts to crack.

Once they move to production, maintaining the models is a whole new story: they become subject to drift, develop biases, or simply suffer from low-quality data. On “Day 2”, no one other than the data scientists who created the model really trusts it or understands how well it’s doing. And sometimes, even they feel they’ve lost control once it’s in production!

Operating ML models is essentially operating in the dark, without clear standards for what it takes to ensure models make the impact they were designed for: what metrics should you look at? At what granularity? And most importantly, with what resources, when your team needs to focus on creating future models and not troubleshooting the existing ones?

And this is the great paradox of AI in production: what it can do is great, but if the natural degradation of models over time cannot be controlled, we remain blind to biases, drift, and compliance risks, leaving us with no way to really achieve the full business value of machine learning. In other words, we’re headed for trouble.

So what’s the deal? How can we scale AI efforts while fostering trust and without losing sight?

Mind the Gaps of Day 2!

The way we see it, there are two main gaps today that prevent organizations from stepping into “Day 2” with confidence:

Lack of Real-World Assurance – There is a lack of best practices and capabilities to help assure the health of models in production. As we evolve into a more mature use of AI, practitioners are starting to look at monitoring more seriously, but the field and the literature are still in their infancy. Data scientists across all verticals reach out to us as they turn away from homegrown solutions that lack an all-encompassing view and often drain the resources of teams that are already spread thin. They need solutions that give them the right insights at the right time to help them become more efficient. They need to know if there is an issue before the business is impacted, when and whether to retrain the model, and how to decide what data should be used to do so. And all this should be accomplished without creating unnecessary noise.

Lack of Ownership – Models are created by data scientists, but their results/predictions are used by the operational teams.

These users are the ones most at risk of being impacted by wrong predictions. Take marketing analysts who use machine learning to predict users’ lifetime value: these teams are measured by the success of activities that depend on AI predictions, and when those activities don’t yield the expected results, they are the ones losing out – and so is the whole business.

Operational teams need to become independent and gain visibility into what makes their models tick. More than that, they should be able to put the models to work for them and get key insights into their business: are there biases? Are there missed sub-populations?

For our users, the ability to gain independence and access information about the health of the models that matter for their business is crucial. More than that, as they start understanding that the models should work for them, the models become their favorite resource!

AI Assurance as the necessary leap to success

At Superwise, we get it. With years of experience in building AI solutions and supporting organizations through their digitalization initiatives, we deeply understand the benefits and the blindspots of AI. We know that performant AI models can empower decision-makers, giving them the confidence to run free with their models, innovate and drive efficiency.

But as incredibly powerful as AI is, it requires a leap, one that is both technological and organizational: it needs Assurance.

AI Assurance gives you the visibility and control needed to create trust and confidence and enables you to scale the use of AI across the enterprise. With AI Assurance, you’ll be prepared for Day 2, when your models meet real life.

What every organization wants is to be in control of its models, even once they’ve been let out into the real world. AI Assurance not only delivers the practical tools to make this possible, but it also empowers you, the user, to use your AI models to their fullest extent with confidence. And this is what assurance is all about–providing the right metrics and the right insights to enable real-world success and independence with AI models.

To support this leap, we deliver an AI for AI solution. We learn from your models what their normal behavior can and should be and help you face the challenges of bias and concept drifts.

To illustrate, we recently helped one of our customers reduce their time to detect and fix concept drifts by 95%!

It’s not only the wealth of out-of-the-box metrics that makes this possible – it’s our ability to give you the grid from which you can understand your models and the tools to gain independence and control. Our solution grants you the right insights at the right time so you know how your models are doing, get alerted when they go south, and take the right corrective action before there’s any business impact.

Want to take the leap? Schedule a demo today

Framework for a successful continuous training strategy

ML models are built on the assumption that the data seen in production will be similar to the data observed in the past, i.e., the data we trained our models on. While this may hold for some specific use cases, most models work in dynamic data environments where data is constantly changing and where “concept drifts” are likely to happen and adversely impact the models’ accuracy and reliability.

To deal with this, ML models need to be retrained regularly. Or as stated in Google’s “MLOps: Continuous delivery and automation pipelines in machine learning”: “To address these challenges and to maintain your model’s accuracy in production, you need to do the following: Actively monitor the quality of your model in production […] and frequently retrain your production models.” This concept is called ‘Continuous Training’ (CT) and is part of the MLOps practice. Continuous training seeks to automatically and continuously retrain the model to adapt to changes that might occur in the data.

There are different approaches and methodologies for continuous retraining, each with its own pros, cons, and costs. Yet, like the shoemaker who walks barefoot, we data scientists tend to overdo retraining, often performing it manually and using it as a “default” solution without enough production-driven insights.

Each ML use case has its own dynamic data environment that can cause concept drift: from real-time trading, to fraud detection where an adversary changes the data distribution, to recommendation engines facing a wealth of new movies and new trends. Yet, regardless of the use case, three main questions need to be addressed when designing a continuous training strategy:

1 – When should the model be retrained? As the goal is to keep running models that are highly relevant and perform optimally at any point in time, how often should the model be retrained?

2 – What data should be used? The common assumption when selecting an adequate dataset is that the relevance of the data is correlated with how recent it is, which triggers a set of questions: should we use only new data or add it to older datasets? What is a good balance between old and new data? How recent does data have to be to count as new, i.e., where is the cut between old and new data?

3 – What should be retrained? Can we replace data and retrain the same model with the same hyperparameters? Or should we take a more intrusive approach and run a full pipeline that simulates our research process?

Each of the three questions above can be answered separately and can help determine the optimal strategy for each case. Yet, while answering them, there is a list of considerations to take into account, which we investigate here. For each question, we describe different approaches, corresponding to different levels of automation and maturity of the ML process.

When to retrain?

The three most common strategies are periodic retraining, performance-based triggers, and triggers driven by data changes.

Periodic retraining

A periodic retraining schedule is the most naive and straightforward approach. Usually it is time-based – e.g., the model is retrained every 3 months – but it can also be volume-based – e.g., retrain for every 100K new labels.
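A minimal sketch of what such a trigger might look like, assuming the 3-month interval and the 100K-label threshold from the examples above; the retraining pipeline itself is out of scope here.

```python
# Minimal sketch of a periodic retraining trigger. Interval and label-volume
# threshold are illustrative; wire the result up to your own retraining job.
from datetime import datetime, timedelta

RETRAIN_EVERY = timedelta(days=90)        # time-based: roughly every 3 months
RETRAIN_EVERY_N_LABELS = 100_000          # volume-based: every 100K new labels

def should_retrain(last_trained_at: datetime, labels_since_last_training: int) -> bool:
    time_due = datetime.utcnow() - last_trained_at >= RETRAIN_EVERY
    volume_due = labels_since_last_training >= RETRAIN_EVERY_N_LABELS
    return time_due or volume_due
```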

The advantages of periodic retraining are just as straightforward: this is a very intuitive and manageable approach as it is easy to know when the next retraining iteration will happen, and it is easy to implement.

This method, though, often reveals itself to be ill-fitted or inefficient. How often do you retrain the model? Every day? Every week? Every three months? Once a year? While one may want to retrain frequently to keep models up to date, retraining too often when there is no actual reason, i.e., no concept drift, is costly. Besides, even when retraining is automated, it requires significant resources – both computational and from your data science teams, who need to oversee the retraining process and the new model’s behavior in production after deployment. Yet, retraining at large intervals may miss the point of continuous retraining and fail to adapt to changes in the data, not to mention the risks of retraining on noisy data.

At the end of the day, a periodic retraining schedule makes sense if the frequency is aligned with the dynamism of your domain. Otherwise, the selection of an arbitrary time or milestone may expose you to risks and leave you with models that are less relevant than their previous version.

Performance-based trigger 

It’s almost a common-sense claim based on the good old engineering adage: “if it ain’t broken, don’t fix it.” The second most common approach to determine when to retrain is to leverage performance-based triggers and retrain the model once you detect performance degradation.

This approach, more empirical than the first one, assumes that you have a continuous view of the performance of your models in production.

The main limitation of relying on performance only is the time it takes to obtain your ground truth – if you obtain it at all. In user conversion prediction cases, it can take 30 or 60 days until you get the ground truth, or even 6 months or more in cases such as transaction fraud detection or LTV. If you need to wait that long for full feedback, you’ll retrain the model too late, after the business has already been impacted.

Another non-trivial question that needs to be answered is: what counts as ‘performance degradation’? You are dependent on the sensitivity of your thresholds and the accuracy of their calibration, which could lead you to retrain too frequently or not frequently enough.

Overall, performance-based triggers are a good fit for use cases with fast feedback and a high volume of data, like real-time bidding, where you can measure the model’s performance as close as possible to prediction time, in short time intervals and with high confidence (high volume).
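As an illustration, here is a hedged sketch of a performance-based trigger that compares a recent window of labeled predictions against a baseline AUC; the metric choice, baseline, and tolerance are assumptions you would tune per use case.

```python
# Sketch of a performance-based retraining trigger. Metric and tolerance are illustrative.
from sklearn.metrics import roc_auc_score

def performance_triggered(y_true_recent, y_score_recent,
                          baseline_auc: float, tolerance: float = 0.05) -> bool:
    """Return True when recent AUC drops more than `tolerance` below the baseline."""
    recent_auc = roc_auc_score(y_true_recent, y_score_recent)
    return recent_auc < baseline_auc - tolerance

# Example: retrain if AUC on yesterday's labeled traffic fell from 0.85 to below 0.80
# performance_triggered(y_true, y_score, baseline_auc=0.85, tolerance=0.05)
```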

Driven by data changes

This approach is the most dynamic and naturally derives the retraining trigger from the dynamism of the domain. By measuring changes in the input data, i.e., changes in the distribution of the features used by the model, you can detect data drifts that indicate your model may be outdated and needs to be retrained on fresh data.

It is an alternative to the performance-based approach for cases where you don’t have fast feedback or cannot assess the performance of the model in production in real time. It is also good practice to combine this approach with performance-based triggers even when feedback is fast, as data changes may indicate suboptimal performance even before any degradation is visible. Understanding data changes to trigger a retraining process is very valuable, especially in dynamic environments.
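One possible way to implement such a trigger, sketched below, is to compute the Population Stability Index (PSI) of each feature against its training distribution and retrain when any feature crosses a threshold; the bin count and the 0.2 rule of thumb are illustrative assumptions.

```python
# Sketch of a data-change trigger using per-feature PSI against the training distribution.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero / log of zero in sparsely populated bins
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def drift_triggered(train_features: dict, prod_features: dict, threshold: float = 0.2) -> bool:
    """Retrain when any feature's PSI exceeds the threshold (0.2 is a common rule of thumb)."""
    return any(psi(train_features[f], prod_features[f]) > threshold for f in train_features)
```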

[Figure: data drift metric (upper graph) and performance KPI (lower graph) over time, as shown in the monitoring service]

The graph above, generated by the monitoring service, shows a data drift metric (upper graph) and a performance KPI (lower graph) for a marketing use case. The data drift graph is a timeline where the Y-axis shows the level of drift for each day (i.e., the data on that day) relative to the training set. In this marketing example, new campaigns are introduced very frequently, and the business expands to new countries. Clearly, the data streaming in production is drifting and becoming less similar to the data on which the model was trained. With a more thorough view of the model in production, a retraining iteration could have been triggered well before the business observed any performance degradation.

So when should you retrain? It depends on key factors such as the availability of your feedback, the volume of your data, and your visibility into the performance of your models. Overall, there is no one-size-fits-all answer to the question of selecting the right time to retrain. Rather, and depending on your resources, the goal should be to be as production-driven as possible.

What data should be used?

While we have seen that timing is everything (i.e., when do I retrain my models?), let’s now look at the nerve center of all MLOps: the data. When retraining, how do I select the right data to use?

Fixed window size

The most naive approach is to use a fixed window size as the training set – for example, use data from the last X months. Clearly, the advantage of the method is its simplicity: it is very straightforward.

The disadvantages derive from the challenge of selecting the right window: if it is too wide, it may miss new trends; if it is too narrow, the training set will be noisy and the model will overfit. The choice of window size should also take into account the frequency of retraining: it makes no sense to retrain every 3 months if you only use the last month of data as a training set.
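For completeness, a fixed window is trivial to implement; the sketch below assumes a pandas DataFrame with a timestamp column and a 90-day window, both of which are illustrative.

```python
# Fixed window selection: keep only the last N days of data as the training set.
import pandas as pd

def fixed_window(df: pd.DataFrame, timestamp_col: str = "timestamp", days: int = 90) -> pd.DataFrame:
    # Keep the window wider than the retraining interval so iterations overlap sensibly
    cutoff = df[timestamp_col].max() - pd.Timedelta(days=days)
    return df[df[timestamp_col] >= cutoff]
```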

Besides, the data selected can be very different from the data being streamed in production: the window may fall straight after a change point (in case of a fast or sudden change), or it may contain irregular events (e.g., holidays or special days like election day), data issues, and more.

Overall, the fixed window approach, like any other “static” method, is a simple heuristic that may work well in some cases but will fail to capture the hyper-dynamism of your environment when the rate of change varies and irregular events, like holidays or technical issues, are common.

Dynamic window size

The dynamic window size approach tries to solve some of the limitations of the predefined window size by determining how much historical data should be included in a more data-driven way and by treating the window size as another hyperparameter that can be optimized as part of the retraining. 

This approach is most suitable for cases in which there is fast feedback (e.g., real-time bidding or food delivery ETA). The most relevant data can be used as a test set, and the window size can be optimized against it, just like any other hyperparameter of the model. In the case illustrated below, the highest performance is achieved by taking the last 3 weeks of data, which is what should be selected for this iteration. For future iterations, a different window may be selected based on the comparison with the test set.
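Below is a hedged sketch of this idea: hold out the most recent days as a test set, train one candidate model per window size, and keep the window that scores best. The column names, candidate windows, and the use of AUC on a binary classifier are assumptions for illustration.

```python
# Hedged sketch: treat the training window as a hyperparameter.
# Assumes a pandas DataFrame with a "timestamp" column and a binary classifier.
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def best_window_weeks(df, model, feature_cols, target_col,
                      candidate_weeks=(1, 2, 3, 4, 8), test_days=7):
    df = df.sort_values("timestamp")
    test_cutoff = df["timestamp"].max() - pd.Timedelta(days=test_days)
    test = df[df["timestamp"] > test_cutoff]           # most recent data as the test set
    history = df[df["timestamp"] <= test_cutoff]

    scores = {}
    for weeks in candidate_weeks:
        train = history[history["timestamp"] > test_cutoff - pd.Timedelta(weeks=weeks)]
        candidate = clone(model).fit(train[feature_cols], train[target_col])
        proba = candidate.predict_proba(test[feature_cols])[:, 1]
        scores[weeks] = roc_auc_score(test[target_col], proba)
    return max(scores, key=scores.get)                 # window with the best score
```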

[Figure: test-set performance for different training window sizes; the last 3 weeks yield the best score]

The advantage of the dynamic window size approach is that it is more data-driven, based on the model performance, which is really the bottom line, and hence it is more robust for highly temporal environments.

The disadvantage is that it requires a more complex training pipeline (see the next question, ‘What should be retrained?’) to test the different window sizes and select the optimal one, and it is much more compute-intensive. Additionally, like the fixed window size, it assumes that the most recent data is the most relevant, which is not always true.

Dynamic data selection

The third approach to selecting training data seeks to achieve the basic objective of any retraining strategy, which is also the basic assumption of any ML model: using training data that is similar to the production data. In other words, models should be retrained on data that is as similar as possible to the data on which they issue predictions in real life. This approach is more complex and requires high visibility into the production data.

To do so, you need to perform a thorough analysis of the evolution of the data in production to understand if and how it changes over time. One way to do this is to calculate the drift in input data between each pair of days, i.e., how much the data of one day differs from the data of another day, and generate a heat map that reveals the change patterns of the data over time, like the one shown below. The heat map is a visualization of drift over time where the axes (columns and rows) are dates and each dot (cell in the matrix) is the level of drift between two days: the higher the drift, the redder the cell.
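A simplified sketch of how such a day-to-day drift matrix could be computed is shown below, using the average two-sample KS statistic across numeric features as the drift score; the score, column names, and plotting call are illustrative choices, not the exact method used by the product.

```python
# Hedged sketch: a day-to-day drift matrix with the average per-feature KS statistic
# as the drift score. Column names are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

def daily_drift_matrix(df: pd.DataFrame, feature_cols, date_col: str = "date") -> pd.DataFrame:
    days = sorted(df[date_col].unique())
    by_day = {d: df[df[date_col] == d] for d in days}
    matrix = pd.DataFrame(index=days, columns=days, dtype=float)
    for d1 in days:
        for d2 in days:
            # average per-feature KS statistic as a simple day-vs-day drift score
            matrix.loc[d1, d2] = sum(
                ks_2samp(by_day[d1][f], by_day[d2][f]).statistic for f in feature_cols
            ) / len(feature_cols)
    return matrix

# Rendered as a heat map, redder cells mean larger drift between the two days, e.g.:
# import matplotlib.pyplot as plt
# plt.imshow(daily_drift_matrix(prod_df, numeric_feature_cols), cmap="Reds"); plt.colorbar()
```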

[Figure: day-to-day drift heat map generated by Superwise; redder cells indicate higher drift between the two days]

Above, you can see the result of such an analysis, generated automatically by Superwise to help spot data that is as similar as possible to the data now streaming in. Several types of insights emerge easily from this view:

1 – The month of December is very different from the rest of the data (before and after). This insight enables you to avoid treating the whole period as one bulk and to exclude these days when you retrain, as this month’s data doesn’t represent the normal data you observe in production.

2 – The seasonality of the data can be visualized, and this can trigger a discussion about whether to use different models for weekdays and weekends.

One more thing that is very powerful here: you can actually have a picture that proves that the most recent data isn’t necessarily the most relevant.

In short, selecting the right data to retrain your models requires a thorough view of the behavior of the data in production. While using fixed or dynamic window sizes may help you “tick the box”, it usually remains a guesstimate that might be more costly than efficient.

What should be retrained?

Now that we have analyzed when to retrain and what data should be used, let’s address the third question: what should be retrained? One could choose to retrain (refit) only the model instance on the new data, to include some or all of the data preparation steps, or to take a more intrusive approach and run a full pipeline that simulates the research process.

The basic assumption of retraining is that the model trained in the research phase will become outdated due to concept drift and hence needs to be retrained. However, in most cases, the model instance is just the last phase of a wider pipeline built in the research phase that includes data preparation steps. The question is: how severe is the drift and its impact? Or, in the context of retraining, which parts of the model pipeline should be challenged?

In the research phase, you experiment with and evaluate many elements within the model pipeline steps in an effort to optimize your model. These elements can be grouped into two high-level parts: data preparation and model training. The data preparation part is responsible for preparing the data (duh!) to be consumed by the model and includes methods like feature engineering, scaling, imputation, dimensionality reduction, etc., and the model training part is responsible for selecting the optimal model and its hyperparameters. At the end of the process, you get a pipeline of sequential steps that takes the data, prepares it to be used by the model, and predicts the target outcome using the model.

Retraining only the model, i.e., the last step in the pipeline, with new relevant data is the simplest and most naive approach and may suffice to avoid performance degradation. However, in some cases this model-only approach may not be enough, and a more comprehensive approach should be taken regarding the scope of the retraining. Broadening the scope of retraining can be done along two dimensions:

1 – What to retrain? What parts of the pipeline should be retrained using the new data?

2 – Level of retraining: Are you just refitting the model (or other steps) with the new data? Do you perform hyperparameter optimization? Or do you challenge the selection of the model itself and test other model types?

Another way to look at this is that in the retraining process, you are basically trying to automate, and thereby avoid, the manual model optimization research a data scientist would do if there were no automatic retraining process. Hence, you need to decide to what extent you want to mimic the manual research process, as illustrated in the chart below.

[Figure: model research stages, as described in “MLOps: Continuous delivery and automation pipelines in machine learning”]

Note that automating the first steps of the flow is more complex. As more automation is built around these experiments, the process grows more robust and flexible in adapting to changes, but it also adds complexity, as more edge cases and checks need to be considered.

Example & conclusion

Let’s take, for example, a simple model pipeline that is the result of manual research done by our awesome data science team for some classification task:

[Figure: example model pipeline from manual research – imputer, scaler, and decision tree classifier]

You can take the simple approach and retrain the model itself, letting it learn a new tree, or you can include some (or all) of the data preparation steps (e.g., relearn the mean in the imputer or the min/max values in the scaler). But besides selecting the steps to be retrained, you need to decide at what level. For example, even if you choose to retrain just the model, this can be performed at multiple levels. You can refit the model itself and let it learn a new tree, i.e., model.fit(new_X, new_y); you can search for and optimize the model hyperparameters (max depth, min leaf size, max leaf nodes, etc.); or you can even challenge the selection of the model itself and test other model types (e.g., logistic regression vs. random forest). The data preparation steps can (and should) also be retrained, to test different scalers, different imputation methods, or even different feature selections.
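To make the levels concrete, here is a hedged scikit-learn sketch of such a pipeline and of the difference between refitting only the model, refitting the whole pipeline, and re-optimizing hyperparameters; the steps and parameter grid are illustrative, and `new_X`/`new_y` stand in for the data selected for this retraining iteration.

```python
# Hedged sketch: a pipeline like the one described above, with retraining levels spelled out.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),   # learns per-feature means
    ("scaler", MinMaxScaler()),                    # learns per-feature min/max
    ("model", DecisionTreeClassifier(max_depth=5)),
])

# Level 1a - refit only the model, keeping the preprocessing learned at research time:
# X_prepared = pipeline[:-1].transform(new_X)          # frozen imputer + scaler
# pipeline.named_steps["model"].fit(X_prepared, new_y)

# Level 1b - refit the whole pipeline (imputer means, scaler min/max, and the tree):
# pipeline.fit(new_X, new_y)

# Level 2 - also re-optimize the model's hyperparameters on the new data:
search = GridSearchCV(
    pipeline,
    param_grid={"model__max_depth": [3, 5, 10],
                "model__min_samples_leaf": [1, 10, 50]},
    cv=3,
)
# search.fit(new_X, new_y); pipeline = search.best_estimator_
```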

When choosing to take the more comprehensive approach and perform hyperparameter optimization or model search, you can also use AutoML frameworks that are designed to automate the process of building and optimizing the model pipeline, including data preparation, model search, and hyperparameter optimization. And it doesn’t have to involve complex meta-learning. It can be simple: the user selects a range of models and parameters. Plenty of tools are available, such as AutoKeras and auto-sklearn. Yet, as tempting as AutoML is, it doesn’t exempt the user from having a robust process to track, measure, and monitor the models throughout their lifecycle, especially before and after deploying them to production.
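As a sketch of how simple that can be, the snippet below uses plain scikit-learn to let the user list candidate models and parameter ranges and search over them within the same pipeline; the candidates and grids are illustrative assumptions, not a recommended configuration.

```python
# Hedged sketch: a simple model + hyperparameter search without a full AutoML framework.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),   # placeholder, replaced by the grid below
])

# Each dict challenges the model step with a different estimator and its own grid.
param_grid = [
    {"model": [LogisticRegression(max_iter=1000)],
     "model__C": [0.1, 1.0, 10.0]},
    {"model": [RandomForestClassifier()],
     "model__n_estimators": [100, 300],
     "model__max_depth": [None, 10]},
]

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")
# search.fit(new_X, new_y)
# best_pipeline = search.best_estimator_   # winning model type + hyperparameters
```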

At the end of the day, “flipping the switch” and automating your processes must be accompanied by robust processes to assure that you remain in control of your data and your models.

To achieve efficient continuous training, you should be able to lead with production-driven insights. For any and every use case, the creation of a robust ML infrastructure relies heavily on the ability to achieve visibility and control over the health of your models and your data in production. That’s what it takes to be a data-driven data scientist 😉