I am Nufar. I am the head of operational AI, product, and strategy within the IT AI organization of Intel. I own AI enablement and usage across Intel through tools, training, and up-skilling for the various teams at Intel, through consultation, and through the professional services that deliver end-to-end AI capabilities across the company.
The organization that I lead is the newest addition to the overall IT AI organization, which was established over 11 years ago to disrupt how Intel’s critical work is done with AI. Our 200+ AI professionals across the world are working on a vast range of breakthrough AI technologies built to yield high business impact. Currently, we have about 500 algorithms in production, and over the years we have worked on many thousands of algorithms. Through that, we have gained vast experience in bringing AI capabilities to production. We have seen it all. We have succeeded and we have failed. From both, we learned a lot, and we bear the scars to show that we have been there.
I hold a master’s from Ben-Gurion University. I specialized in optimization and statistics, joined Intel right after I graduated with my master’s, and have held several roles since. Prior to my current role, I was a vertical manager of AI solutions for Intel’s R&D organizations, solving the question of how to bring AI to create better computer chips. And I have three kids, trying to solve the question of how to bring them to production, usually without AI.
So with 500 algorithms in production, you’ve definitely scaled your use of AI… Can you tell us a bit more about the biggest challenges you faced and the best practices that you learned from that?
In terms of productization, a lot of the focus that we see is on efficiency: how can we do more with a similar headcount?
We want to focus on new activities rather than spending unnecessary time and effort maintaining the algorithms that we already have.
So many of the challenges are around how to automate as much as possible. That includes monitoring and deployment that can automatically trigger retraining and parameter tuning. But it also includes, for example, how to empower product managers and account managers to be more self-sufficient in their ability to analyze the results. Moreover, we make sure that, if needed, our product managers can easily retrain the algorithm or adjust some of its parameters so that the business value stays high over time, without relying all the time on data scientists or machine learning engineers to do the work for them.
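As an editorial illustration of monitoring that automatically triggers retraining (a minimal sketch, not Intel's actual implementation), the snippet below compares a recent quality metric against a baseline and calls a retraining hook when the metric degrades. The names `should_retrain`, `monitor_and_retrain`, the `retrain` callback, and the `tolerance` value are all hypothetical.

```python
from statistics import mean


def should_retrain(recent_scores, baseline, tolerance=0.05):
    """Return True when the recent average score drops more than
    `tolerance` below the baseline (hypothetical threshold rule)."""
    if not recent_scores:
        # No fresh measurements: nothing to act on.
        return False
    return mean(recent_scores) < baseline - tolerance


def monitor_and_retrain(recent_scores, baseline, retrain, tolerance=0.05):
    """Call the supplied `retrain` hook only when quality has degraded.

    `retrain` is any zero-argument callable; in a real system it might
    launch a training job. Returns the hook's result, or None.
    """
    if should_retrain(recent_scores, baseline, tolerance):
        return retrain()
    return None
```

Because the threshold and the hook are plain parameters, a product manager could adjust the tolerance or trigger the same hook manually without touching the model code.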
Other than automating as much as possible, the other takeaway would be reusing the different components as much as possible. We try to have one platform on which you can very easily deploy new capabilities and new algorithms. It should scale very easily, whether you are deploying an existing solution or algorithm to a new customer (even as a completely new instance of the algorithm) or creating an altogether new feature or product. The platform underneath should be as generic and as easy to use as possible for new capabilities. For instance, a lot of the data post-processing and data manipulation will be shared by different algorithms and capabilities rather than developing each of these building blocks from scratch each time. It is based on a microservices architecture concept but takes it one step further: think of it as microservices for algorithms, where you can mix and match what you need for the new capabilities.
Another lesson is that you always need to have production in mind and be very much aware of the performance or latency constraints that the algorithm will need to work within. If the model cannot respond in the required time, you need to do something different, and it is better to understand early on that the model will not be operational going forward.
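To make the mix-and-match idea concrete, here is a minimal editorial sketch, assuming a small registry of shared data-manipulation steps that different algorithms compose by name. The step names, the `STEPS` registry, and `build_pipeline` are hypothetical and are not taken from Intel's platform.

```python
def drop_missing(values):
    """Remove missing entries."""
    return [v for v in values if v is not None]


def clip_negative(values):
    """Replace negative values with zero."""
    return [max(v, 0.0) for v in values]


def scale_to_unit(values):
    """Divide by the maximum so that values fall in [0, 1]."""
    top = max(values, default=0.0)
    return [v / top for v in values] if top > 0 else list(values)


# Shared registry: each building block is written once and reused.
STEPS = {
    "drop_missing": drop_missing,
    "clip_negative": clip_negative,
    "scale_to_unit": scale_to_unit,
}


def build_pipeline(step_names):
    """Compose registered steps, in order, into one callable."""
    steps = [STEPS[name] for name in step_names]

    def run(values):
        for step in steps:
            values = step(values)
        return values

    return run
```

A new capability would then pick only the blocks it needs, for example `build_pipeline(["drop_missing", "scale_to_unit"])`, instead of reimplementing them.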
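One way to keep production latency in mind early is to time a candidate model against its budget during development. The sketch below is an editorial illustration under that assumption; `meets_latency_budget`, the `predict` callable, and the budget value are hypothetical.

```python
import time


def meets_latency_budget(predict, sample_inputs, budget_seconds):
    """Time `predict` on each sample and compare the slowest call
    against the budget. Returns (within_budget, worst_seconds)."""
    worst = 0.0
    for item in sample_inputs:
        start = time.perf_counter()
        predict(item)
        elapsed = time.perf_counter() - start
        worst = max(worst, elapsed)
    return worst <= budget_seconds, worst
```

Running such a check on representative inputs before investing in a model gives an early signal that a different approach is needed.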
The first lesson that you mentioned was about “how to make sure that the product managers are independent” so that they don’t need to bring in the data scientists in order to understand what is happening with the predictions. Who really is in charge of assuring the health of the models in production?
So the short answer will be everyone, right?
If you want to maintain high quality in production over time, it can’t be under the accountability of one specific function. I think the entire team has to be accountable for the quality.
Another thing that you mentioned is that even from the R&D perspective, everybody should be aligned to have a production mindset. Now, a lot of people are talking about the notion of the full stack data scientist. How do you see this?
I always make sure that the data scientists on my team have and nurture the ability to write algorithm code that is production-level quality. I don’t let data scientists create messy code or pseudo-code and then have a machine learning engineer recreate it for production. First of all, I think it’s not a very efficient way to work. Also, I think that data scientists like to code, so it gives them a more holistic way to practice their skills. Moreover, it makes them better at their job, as they fully grasp the impact of any algorithmic choice they make on the end result. Besides, in many cases, because of the size of the data, we need to have the ability to work in languages that enable parallelized data science. So data scientists need to have quite extensive coding skills. All that said, machine learning engineering is by all means a profession in its own right, and it is crucial for the overall success. To us, their focus is on creating the best-performing software architecture and code that will take the algorithm developed by the data scientist and make it shine. They work together with the data scientists, not after they’re finished. In most cases, I would ask the data scientist to write the algorithm module all the way to production and to work alongside machine learning engineers on the entire flow of the software around this algorithm. So, I think that data scientists do need to understand the business and do need to write code all the way to production, but I don’t think they are a replacement for machine learning engineers, who are needed to create the best possible overall software: low maintenance, best performance, and best in class.
Once the models are in production, what is the interaction with the business owners? How do they understand what is happening with their predictions and how do you make sure that there’s enough trust for them to adopt the predictions?
We’ve learned over the years that, especially as we onboard new customers and new capabilities, we have to deliver a lot of visibility into what the algorithms are doing, and a lot of preliminary analysis on even just the raw data we see, in order to create trust. And, by the way, we’ve learned it the hard way. In cases when we haven’t provided that early on, it took us a very long time to create trust and for the customers to go into production. In most cases, within Intel, we do an initial proof of concept, and we’ve seen that the time from proof of concept to working smoothly in production is longer when this trust is not built. So whenever we bring something to production, especially with new customers, we try to give them a lot of insight into how the algorithms are working and what they are seeing. Very often, when there is enough transparency early on, it takes a shorter time to go live.
Do you see a need for xAI?
Definitely! I think, again, it varies between the different businesses. For example, when you work with the verification engineers within Intel, it’s their profession to be a little bit suspicious and skeptical. That’s what they are trained for: to look for bugs, right? So for those kinds of users, we have to provide a lot more visibility and explainability. There might be some other users who are less technical or less skeptical, and you can provide more of a black-box kind of capability, but overall, I think having explainable AI across the board is something that has high value.
What do you see as being the main tools or the main pillars of investment and growth at Intel?
A lot of MLOps, like everyone. I assume that the ability to have one MLOps across different verticals and different organizations, and to ease the access to MLOps for teams without high proficiency in machine learning, is key. Even if there is a citizen data scientist team that can create a production-grade solution at a low cost, I see their need for MLOps growing. Also, we do have AutoML on board, and we want to make it easier to productize what was created via AutoML, such as a no-code or low-code capability. Even for practitioners with high coding skills, we do invest in pipelines to ease both the R&D and the production flow. Even for data scientists who can code, the question is how they can reuse more and go through the whole data science pipeline development without having to rewrite code over and over. So we are investing in a lot of reuse and facilitation to go through the different stages.
By easing the entire pipeline and also the MLOps aspects, we’re also seeing our data analysts and product teams gaining more reusable building blocks and coding ability so they can do their work and even develop algorithms that could make it to production.
Are there any AI failures that are keeping you up at night?
Some of our algorithms are very intrusive into the most critical processes within Intel, such as the testing done within the production line or during the R&D stages, as well as algorithms residing within Intel products, making them smarter.
As a result, almost any line of code and any algorithm going into production is more closely monitored than an ICU patient.
And in many cases, we identify data drifts, or even issues with the data that are representative of an issue with the business process, before the business is aware of it. By the way, for everything that we create, the first capability that we create is the ability to turn it off, and a lot of our algorithms have their own fail-safe mechanism.
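As an editorial illustration of the two safeguards mentioned here (a sketch under stated assumptions, not Intel's implementation), the code below wraps a model with an off switch and a fallback, and flags drift with a simple mean-shift rule. `GuardedModel`, `mean_shift_drift`, and the threshold of three standard deviations are hypothetical choices; production drift detection typically uses richer statistics.

```python
from statistics import mean, pstdev


class GuardedModel:
    """Wrap a prediction function with an off switch and a fallback."""

    def __init__(self, predict, fallback):
        self._predict = predict
        self._fallback = fallback
        self.enabled = True

    def turn_off(self):
        """Route all future calls to the fallback."""
        self.enabled = False

    def __call__(self, item):
        if not self.enabled:
            return self._fallback(item)
        try:
            return self._predict(item)
        except Exception:
            # Fail-safe: any model error falls back to the safe default.
            return self._fallback(item)


def mean_shift_drift(reference, current, threshold=3.0):
    """Flag drift when the current mean moves more than `threshold`
    reference standard deviations away from the reference mean."""
    shift = abs(mean(current) - mean(reference))
    spread = pstdev(reference)
    if spread == 0:
        return shift > 0
    return shift > threshold * spread
```

Building the off switch first means the business process can always revert to its pre-model behavior, which is the fallback here.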
Do you remember the first model that you productized and is it still live?
I remember it! This actual model is not live, but several versions of it are. It was an optimization algorithm that chose which tests to run in order to fully validate the entire CPU within Intel. This specific algorithm is currently not in production, but there are several versions that are, and they actually bring a lot of value.
If you had all the resources in the world, and if you could build the best team that you could ever dream of, what would be the machine learning model that you would want to create and publish?
I would probably do matchmaking between private tutors and students for anything, right? Not just for kids, but also for adults wanting to learn something new or become better at something important to them. I think with Covid we’ve seen the potential of learning remotely, and we’ve only scratched the surface of what can be done.