Tell us a bit about yourself and what you do.
My name is Jiazhen Zhu. Currently, I’m a machine learning engineer at Walmart Global Tech. I have both a bachelor’s and a master’s degree in computer science. I focus on end-to-end ML, leading collaborative teams that combine data engineering, data science, and MLOps to build an improved, robust platform for our data-driven divisions and data-powered projects. For data engineers, this means building the foundations for their data pipelines and helping them with DataOps: extracting, transforming, and loading data. Then we go back to the ML part to train and validate models, and finally push them through our pipelines to productionize our machine learning projects.
In addition, I publish a lot of Medium blog posts within Walmart Global Tech on machine learning and data engineering, and my team plans and creates a lot of customized packages for machine learning.
If another organization is looking to adopt ML, what questions should they be asking?
First, you need to ask, “Do we need ML?” Not all problems are best solved with machine learning. You need to ask yourself, “Can our problems be solved with heuristic algorithms, detailed business intelligence, or even just dashboards that communicate value to stakeholders better?”
Second, assuming that you need machine learning: “What kind of machine learning do we need, and do we have the data? Does the use case need structured data, text, or images?” Even then we need to ask, “Do we have enough data, and clean enough data, to train machine learning models?”
Third, “Does everyone in our company trust ML results?” ML is very much a black box. If vendors or partners don’t trust machine learning results, maybe we need to invest in an explanation process for machine learning.
Last, “Is the company a data-driven company? Do we have all the end-to-end processes for the data itself, for example, quality data? Do we have a pipeline?” If not, we need to build the DataOps and data engineering part first, and then the data scientists and machine learning engineers can pick up the data, the real data, the clean data, to proceed to the modeling.
How does your organization’s workflow operate between data scientists and management?
How much information do higher-ups need or want to trust in ML results, and how do you deliver the information they want?
This is a good question. There are a couple of ways. We have different stakeholders and customers, and each has a different level of ML literacy. Some may not trust the machine learning algorithm itself very much. For them, to communicate insights, we don’t need to explain the machine learning algorithm, just the business intelligence.
Second, they may not trust deep learning models that are black boxes. Then we will plan for some basic models, like linear regression, other regression models, or a random forest. They’re very basic, but they can be explained by the model itself. We can print out and show why we classified new data into a category, or why a predicted number appears in a report. So it’s much easier for them to understand the model itself.
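The point about models that “can be explained by the model itself” can be made concrete in a few lines: a simple linear fit exposes its coefficients directly, so you can show a stakeholder exactly how each input moves the prediction. This is only an illustrative sketch; the feature (ad spend vs. sales) and the data are hypothetical, not from Walmart’s systems.

```python
# Minimal sketch of a self-explaining model: closed-form simple linear
# regression, whose "explanation" is its own slope and intercept.
def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (k$) vs. weekly sales (k$).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_linear(spend, sales)
# The model explains itself: each extra $1k of spend adds `slope` k$ of sales.
print(f"sales ≈ {slope:.1f} * spend + {intercept:.1f}")
```

A stakeholder can audit every term of a model like this by eye, which is exactly the trust property deep networks lack.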
Third, some customers, like fashion e-commerce, need more accuracy because they have a lot of high-velocity data. They may get better results and a more useful algorithm using a deep learning model. We need to earn their trust, even if the model is a black box. So what we can do is use more of an explanation process. For example, for NLP, we will add SHAP, LIME, MUSE, or some other algorithm to explain the result of the model itself.
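The core idea behind perturbation-based explainers such as LIME and SHAP can be sketched without either library: probe the black box by perturbing one input at a time and attribute the change in the prediction to that feature. The `black_box` model and feature names below are hypothetical stand-ins, and this toy “zero out one feature” probe is far simpler than what SHAP or LIME actually compute.

```python
# Toy sketch of the perturbation idea behind explainers like LIME/SHAP:
# zero out one feature at a time and measure how much the prediction moves.
def black_box(features):
    # Pretend black-box model: weighted sum clipped to [0, 1].
    score = (0.8 * features["recency"]
             + 0.1 * features["clicks"]
             + 0.1 * features["price"])
    return max(0.0, min(1.0, score))

def explain(model, features):
    base = model(features)
    attributions = {}
    for name in features:
        perturbed = dict(features, **{name: 0.0})  # knock out one feature
        attributions[name] = round(base - model(perturbed), 3)
    return attributions

x = {"recency": 0.9, "clicks": 0.5, "price": 0.2}
print(explain(black_box, x))  # recency dominates this prediction
```

Showing a customer a per-feature attribution like this, rather than a bare score, is what “opening the black box” means in practice.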
Also, we have all the ML operations, the ML pipeline, so we have to monitor the model itself for concept drift, data drift, and so forth. We need to monitor a lot of model metrics but also the assessment itself. And that information can be transferred to the customer. When they can see everything about the model itself, they will gain trust in the model and the machine learning pipeline and systems.
Do you have a specific set of best practices that you’ve learned help engender trust with the customer?
Is there a playbook that you follow, or does it depend on each customer and how much they want to know?
It depends on the requirements of the customer. We select the best model, or the best deep learning model, and if we select a deep learning model, we need to plan for more of an explanation process. My goal is to do the best we can to explain the model itself, to make sure the model is open to the customer, not just a black box, and to involve the business and the customer in the model. If they can understand the model and understand the process, they will trust the result.
If they just get the result directly, sometimes they cannot understand the process, so they don’t trust the result. But if I open up the process, open up the reasoning, and add additional explanations using the algorithm, good measurements, and the systems, they know the process and trust the product much more.
Once your models are out in the real world, what are your biggest concerns when it comes to making sure they’re performing correctly?
There are a couple of concerns: the data itself and the model itself, because machine learning combines the data and the model. For the data, maybe we’ll get streaming data, maybe we’ll get batch data. The data may be different from the training data that we trained the model on. There may be data drift or even concept drift occurring. So the first step is to monitor the data.
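One common heuristic for the data-drift monitoring described above is the Population Stability Index (PSI), which compares the binned distribution of live data against the training data. This is a minimal stdlib sketch, not Walmart’s monitoring stack; the bin count, the smoothing constant, and the 0.2 alert threshold are conventional but illustrative choices.

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between two samples of a [lo, hi] feature.
    Larger PSI means the live distribution has drifted further from training."""
    def bucket_fractions(values):
        counts = [0] * bins
        width = (hi - lo) / bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty buckets so the log term stays defined.
        return [(c + 1e-6) / len(values) for c in counts]

    e, a = bucket_fractions(expected), bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]                     # roughly uniform
same = [i / 100 for i in range(100)]                      # no drift
shifted = [min(0.99, 0.5 + i / 200) for i in range(100)]  # mass pushed right

print(psi(train, same))     # ~0: no drift
print(psi(train, shifted))  # large: alert (rule of thumb: PSI > 0.2)
```

A check like this runs per feature on each new batch, and crossing the threshold is what triggers the alerting and retraining process discussed below.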
Then the second one is model performance. We retrain a lot using new data and customer feedback. We make sure our model is the latest version, which can meet the requirements of the new data coming in and the new requirements from the customers giving us feedback. So we need to monitor the model performance, the metrics performance, and the speed performance.
The third one is system performance. When we move the model to production, research and the reality of production are totally different. So how do you monitor the system effectively to get an idea of how it is doing?
Are there specific metrics you look at to determine how well a model is performing?
There are a lot of different metrics here, but as a rule of thumb for model performance, depending on the type of model, we’re monitoring things like F1 and accuracy. For data, we’re looking at integrity and quality — is the distribution changing? Are we all of a sudden missing values? And on the system side, model training speed, function runtime.
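The model-performance metrics named above are simple to compute from raw prediction/label pairs. This is a plain-Python sketch of the binary case (in practice a library such as scikit-learn would be used); the example labels are made up.

```python
# Accuracy and F1 for binary classification, computed from scratch.
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels from a monitoring window.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 6 correct
print(f1(y_true, y_pred))
```

Tracked over time, a drop in these numbers on fresh labeled data is the model-performance signal that complements the data-side drift checks.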
There are many different aspects, each with its own metrics that we are monitoring.
When you identify an issue, what does your resolution process look like?
So, it depends. For the retraining process, we have to make sure the system can handle it. But if we have a data source issue, where the data is different, the data doesn’t fit the model, or doesn’t meet the requirements anymore, then even if we do retrain, we’ll still have model performance issues. In this case, the data scientists and machine learning engineers will update the model itself. Then we will maybe release a new version of the model or change the model. We have a lot of different versions of the model, so we can update it to meet the requirements.
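The “release a new version of the model” step usually runs through a promotion gate: the candidate version is deployed only if it beats the live one by some margin. This sketch is hypothetical; the registry layout, the F1 numbers, and the `min_gain` margin are illustrative, not a real system.

```python
# Toy model registry: each version carries its evaluation metric and
# a flag for whether it is the one currently serving traffic.
registry = {
    "v1": {"f1": 0.81, "deployed": True},
    "v2": {"f1": 0.79, "deployed": False},
    "v3": {"f1": 0.85, "deployed": False},
}

def promote_if_better(registry, candidate, min_gain=0.01):
    """Deploy `candidate` only if it beats the live model by `min_gain` F1."""
    live = next(v for v, m in registry.items() if m["deployed"])
    if registry[candidate]["f1"] >= registry[live]["f1"] + min_gain:
        registry[live]["deployed"] = False
        registry[candidate]["deployed"] = True
        return candidate
    return live  # keep the current version serving

print(promote_if_better(registry, "v2"))  # stays on v1: v2 scores worse
print(promote_if_better(registry, "v3"))  # v3 clears the bar and ships
```

Keeping every version in the registry is what makes the rollback or swap described above cheap when a release underperforms.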
What challenge of your job keeps you up at night?
Because a model is based on data, it’s the data changing most of the time. At the start, we trained on a certain set of data. Then, as data streams in, data drift occurs. Perhaps even the concept has drifted. If this happens, the model cannot handle it.
So the most significant things are: how can we get alerted immediately, as soon as possible, and then how can we involve the data scientists and machine learning engineers in the process?
What area of machine learning do you see the ecosystem investing in over the next couple of years?
From my personal view, there are a couple of things that are very interesting: MLOps and data-centric AI.
We move the model into production, but making sure the ML system is running well is another story: how to scale it, how to make it sustainable.
With data-centric AI, for example, we have more and more data than ever before. But we know that if we have enough good, solid data, we can get good model performance. That can help us reduce the cost of training, and reduce a lot of dimensions when the data doesn’t need to be that big. Especially for deep learning, if we have more data, the model can improve a lot. But the other side of the story is that if we have enough good data, we can reduce the data size itself.
So this is the reason I’m always thinking about how to combine data engineering and machine learning. Some data scientists will spend 70% of their time on data processing and just 10-20% on modeling. Then they will transfer the model to the machine learning engineer, who will build the model itself and the ML system. But if data scientists know more about machine learning, data engineering, and data quality, they can shift the balance toward more modeling and less data engineering.
Is there an aspect of machine learning, either a risk or opportunity, that people aren’t talking enough about that they should be looking out for?
So, I look at a lot of research papers and blogs. And currently, everyone, all the research scientists, are focusing on the model itself, especially for machine learning.
But we all know, for example, that if you want to improve from 80% to 81% efficiency, you need to spend a lot of time doing research and iterating on the algorithm, and then maybe you will achieve it. So even if you bring a new method or innovation to the model itself, it might not get you better performance from the model than before.
So the thing is, how can we improve performance? It’s a lot of the things I said before: we need data-centric AI, and we need to integrate MLOps to get new, really good data. That is another, effective way of improving the model itself.
The other thing is how to earn trust in your model from everyone, from customers to partners.
This interview has been edited for length and clarity.