Tell us a bit about yourself, your background, where you work, and what you do there.
I’m a Senior Machine Learning Engineer and the Machine Learning Platform Lead at Reverie Labs, a drug discovery company that’s focused on developing small molecule therapies for cancer using computation. I started working at Reverie about four years ago as a machine learning engineer. I was an early employee and have worked on many different kinds of projects. When I started, we were a team of five people and have since grown to more than five times that size. My primary focus has been on helping advance our machine learning capabilities, both by developing models that can accurately predict molecular properties and by creating machine learning tools that can help the company scale its efforts effectively.
In my role as ML Platform Lead, I aim to figure out how to make our internal machine learning tooling better, enabling our engineers and data scientists to do their work more efficiently. That includes ingesting new data, experimenting with new approaches, training new models, evaluating them, and eventually deploying to production.
What career advice would future you give to your past self?
I first started working on machine learning in 2016, which was probably close to the time that it was reaching its peak in the hype cycle, at least in the tech ecosystem. Machine learning was often touted as a sort of “magic” solution to things, and there was a notion that models should be designed in such a way that they would replace humans. I think that viewing models more as aids than replacements has proven to be a more realistic objective and one that can deliver high value in a shorter time frame.
How does ML impact your business? Why is it important and what does it help you achieve?
Machine learning is a core part of Reverie’s business. At the end of the day, our “product” is a molecule, or set of molecules, that can bind to a target of interest. We don’t sell any services or license our technology, but instead focus on developing molecules in-house. We’re trying to accelerate every stage of that drug discovery process using machine learning.
In the early stages of a therapeutic program, we will have already identified a target of interest, such as a specific kinase that is implicated in cancer, and we will need to find a novel molecule that “hits,” or has some degree of potency against, that target. In what is referred to as virtual screening, we take massive libraries of candidate molecules and distill that list down into a smaller set that we can have tested in the lab. That filtering step is essential because having molecules synthesized and tested is time consuming as well as expensive. So, the more efficiently we can filter that list, the quicker we can progress our programs. Machine learning models are part of a computational toolkit that we use to score and rank the initial list.
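As a rough illustration of the score-and-rank step described above, the sketch below shortlists the top-scoring candidates from a library. The function names, the SMILES strings, and the length-based `toy_score` are hypothetical placeholders, not Reverie's models or data.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def select_top_candidates(
    molecules: Iterable[str],
    score_fn: Callable[[str], float],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring molecules as (molecule, score) pairs."""
    scored = ((mol, score_fn(mol)) for mol in molecules)
    # heapq.nlargest avoids sorting the full library, which helps when the
    # library is far larger than the number of molecules sent to the lab.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def toy_score(smiles: str) -> float:
    """Hypothetical stand-in for a trained model: scores by string length."""
    return float(len(smiles))


library = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]
shortlist = select_top_candidates(library, toy_score, k=2)
```

In practice the scoring function would be a trained model (or an ensemble of computational methods), and `k` would be set by how many molecules can realistically be synthesized and tested.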
As a program progresses, and we’ve been able to find hits, we can use a similar approach iteratively: training a model using existing data (which we keep adding to as we test more molecules), scoring new candidates, and testing the most promising ones. While potency is very important, we also need to ensure that the molecules we eventually progress to the clinic fulfill a multitude of different criteria, all of which are needed in order for the molecule to be an effective therapy for patients. This is why drug discovery is often referred to as a multi-objective optimization problem.
Machine learning isn’t the only tool that we use — other computational methods and our team of chemists play crucial roles. But we aim to use machine learning wherever we can to accelerate our workflows. For example, we might try to use machine learning to approximate computational chemistry methods that are very time-intensive, or use generative models to help ideate and explore novel chemical space.
What questions should organizations adopting ML be asking themselves to replicate your success?
I think that identifying the right opportunities for using machine learning is essential. The first question is, “what is an important problem for which I have good training data available?” with an immediate follow-up of, “what are some creative ways to find more data?” Oftentimes models can be readily improved by training on auxiliary tasks, where more data is available or can be computed at scale.
The next step is identifying where that data lives and how you can efficiently get it into a format that your models can ingest. For example, in Reverie’s case, we have to translate molecular structures and experimental values that live in a specialized data store into arrays and tensors that our machine learning models can work with. Having systems in place that can mold data into a usable format quickly and efficiently can drastically increase the effectiveness of data scientists and machine learning engineers. Trying to scope out up front what that whole data flow pipeline should look like is a better idea than trying to figure it out as you go.
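To make the idea of turning stored records into model-ready arrays concrete, here is a deliberately simplified sketch. It assumes molecules arrive as SMILES strings paired with a measured value, and it uses a toy character-level encoding; real featurization would normally rely on a cheminformatics toolkit, and all names and values here are hypothetical.

```python
from typing import Dict, List, Tuple


def build_vocab(smiles_list: List[str]) -> Dict[str, int]:
    """Map each character seen in the inputs to an integer id (0 is padding)."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    return {ch: idx + 1 for idx, ch in enumerate(chars)}


def encode(smiles: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Encode one string as a fixed-length list of ids, truncating or padding."""
    # Note: splitting by character is naive (e.g. "Cl" becomes two tokens).
    ids = [vocab[ch] for ch in smiles[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Hypothetical records: (SMILES string, experimental value).
records: List[Tuple[str, float]] = [("CCO", 5.2), ("CC(=O)O", 6.1)]
vocab = build_vocab([smi for smi, _ in records])
features = [encode(smi, vocab, max_len=8) for smi, _ in records]
labels = [value for _, value in records]
```

The fixed-length lists in `features` can be handed directly to an array or tensor constructor in whichever framework is in use.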
Also, on the platform side, it makes sense to separate your use cases. One is catering to your practitioners (data scientists and machine learning engineers) as your consumers. Think about how you can offer them flexibility, the ability to easily try new approaches. The other use case is production — the consumers in that setting are the ones who are working directly with the outputs of models. The journey from a practitioner’s model to a production model should be as seamless as possible.
What teams do you have in place and what part of the ML process does each team own?
Reverie’s team structure is fairly fluid, and people generally have responsibilities across a few different categories.
We have multiple therapeutic programs that we’re working on at once, each corresponding to a specific target for a disease or indication. Under each of those programs, we have a separate team that consists of medicinal chemists, computational chemists, data scientists, and program managers.
Then, there is the core machine learning team, which I am part of. We take input from the program teams to identify areas of interest and try to move the needle. At a high level, this kind of work is applied research. We’re often in the literature, trying to find new methods and architectures that might address a specific problem. However, we also work closely with the software engineering team, which is actively involved in building much of the core infrastructure that enables us to work with data, train models, and get predictions on large batches of data.
What does your MLOps stack look like?
We’ve built an internal API that we can use flexibly, whether it’s a TensorFlow model, a PyTorch model, or a scikit-learn model. We’ve essentially wrapped all those frameworks under a common umbrella so that machine learning engineers and data scientists can fit and train on data flexibly. This was designed to encourage rapid development and experimentation, agnostic of framework.
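One common way to get this kind of framework-agnostic behavior is a shared interface with one adapter per framework. The sketch below is an assumption about what such a design could look like, not a description of Reverie's internal API; `Model`, `MeanBaseline`, and `train_and_predict` are hypothetical names, and the baseline stands in for an adapter around a real framework model.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Model(ABC):
    """Hypothetical common interface; each framework gets its own adapter."""

    @abstractmethod
    def fit(
        self, features: Sequence[Sequence[float]], targets: Sequence[float]
    ) -> "Model":
        ...

    @abstractmethod
    def predict(self, features: Sequence[Sequence[float]]) -> List[float]:
        ...


class MeanBaseline(Model):
    """Dependency-free stand-in for an adapter around a real framework model."""

    def __init__(self) -> None:
        self._mean = 0.0

    def fit(self, features, targets):
        self._mean = sum(targets) / len(targets)
        return self

    def predict(self, features):
        return [self._mean for _ in features]


def train_and_predict(model: Model, train_x, train_y, test_x) -> List[float]:
    """Calling code depends only on the interface, not on the framework."""
    return model.fit(train_x, train_y).predict(test_x)


preds = train_and_predict(
    MeanBaseline(), [[0.0], [1.0]], [2.0, 4.0], [[5.0], [6.0]]
)
```

Swapping in a different framework then means writing another subclass of `Model`; experiment and evaluation code that calls `train_and_predict` does not change.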
One challenge is that our raw molecular data can often take many shapes and forms. For example, we might have only text-based inputs (molecules can be represented in chemical notation as a string), spatial information (3D coordinates of all atoms in a protein-ligand complex), or niche file formats that come from computational chemistry workflows. We have had to think about how we want to ingest and transform that data into a usable format, whether that’s a PyTorch dataset or TFRecords, and have tried to make that process as streamlined as possible.
Our cloud infrastructure is on AWS, which includes both storage and compute. There are a variety of other tools that we use to orchestrate workflows, such as Kubernetes, Argo, and Ray. For experiment tracking, we use MLflow.
Who is responsible for the models once they are in production?
Once your model meets the real world, what is important to you to monitor and why?
We want to monitor how our models perform on new, experimental data, as it arrives from the lab. This helps us gauge where our models are performing well, and where they might need more improvement. Typically, we retrain our models every time we get a batch of new data. So, we try to have an evaluation pipeline in place that evaluates performance on each new batch before every retraining, which gives us a trajectory of model performance over the lifecycle of a program.
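The evaluate-before-retrain pattern described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: batches arrive as (features, measured values) pairs, mean absolute error is used only for simplicity, and `fit_mean` is a hypothetical stand-in for real model training.

```python
from typing import Callable, List, Sequence, Tuple

Batch = Tuple[List[List[float]], List[float]]  # (features, measured values)
Predictor = Callable[[List[List[float]]], List[float]]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_then_retrain(
    batches: Sequence[Batch],
    fit_fn: Callable[[List[List[float]], List[float]], Predictor],
) -> List[float]:
    """Score each new batch with the current model, then retrain on all data."""
    history: List[float] = []
    seen_x: List[List[float]] = []
    seen_y: List[float] = []
    predict = None
    for x, y in batches:
        if predict is not None:
            # Evaluate on data the current model has never seen.
            history.append(mean_absolute_error(y, predict(x)))
        seen_x.extend(x)
        seen_y.extend(y)
        predict = fit_fn(seen_x, seen_y)
    return history


def fit_mean(x: List[List[float]], y: List[float]) -> Predictor:
    """Hypothetical trainer: the 'model' always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda rows: [mean for _ in rows]


batches = [
    ([[0.0], [1.0]], [1.0, 3.0]),
    ([[2.0]], [5.0]),
    ([[3.0]], [6.0]),
]
trajectory = evaluate_then_retrain(batches, fit_mean)
```

Each entry in `trajectory` is a prospective error on a batch before the model was retrained on it, so a sudden jump flags degraded performance on newly tested molecules.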
Monitoring in this way is incredibly important since we’re continually using our models to recommend new molecules to design and test. If our models all of a sudden see degraded performance, we need to know so that we can take that into account when making decisions about our chemical synthesis queue.
What are the top metrics that you look at to gain visibility into model health?
The metrics that we care about vary by the type of task that we’re trying to model, but they are usually the common sorts of metrics used to assess regression and classification tasks, such as coefficient of determination and AUC score. When we’re assessing performance on new data, however, we have to be especially careful in regard to data distribution when interpreting those metrics. Chemical data is typically distributed in sparse clumps — one chemical series, based on a certain chemical scaffold, is often used to create multiple analogs. Predicting across different scaffolds is usually a more difficult task than predicting values within a specific series. In fact, during training, scaffold-based cross-validation is usually needed to ensure that models are not overfitting to any specific series.
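A scaffold-based split can be sketched as a grouped k-fold, where every molecule sharing a scaffold lands in the same fold. The example below assumes scaffold labels have already been computed elsewhere (for instance with a cheminformatics toolkit); the function name and the labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def scaffold_k_fold(
    scaffolds: Sequence[Hashable], n_folds: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) per fold; no scaffold spans both."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Greedy balancing: place the largest remaining series in the smallest fold.
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)

    splits = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(scaffolds)) if i not in held_out]
        splits.append((train, sorted(test)))
    return splits


# Hypothetical scaffold labels for six molecules in three chemical series.
scaffold_labels = ["A", "A", "A", "B", "B", "C"]
splits = scaffold_k_fold(scaffold_labels, n_folds=2)
```

Because whole series are held out together, the validation score reflects performance on unseen scaffolds rather than on close analogs of the training molecules.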
We also sometimes have very specialized metrics in drug discovery. For example, in virtual screening, we only care about accuracy in the top 0.001% of the predicted distribution, so we usually use a metric like enrichment to represent that. The important thing is to select a metric that best reflects how the model will be used for decision-making in practice.
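One widely used definition of enrichment is the hit rate among the top-scored fraction divided by the hit rate in the whole library. The sketch below implements that definition; the scores and activity labels are made up, and the 20% cutoff is chosen only so the toy example stays readable (a real virtual screen would use a far smaller fraction).

```python
import math
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float],
    is_active: Sequence[bool],
    top_fraction: float,
) -> float:
    """Hit rate among the top-scored fraction divided by the overall hit rate."""
    n_total = len(scores)
    n_top = max(1, math.ceil(top_fraction * n_total))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(active for _, active in ranked[:n_top])
    hits_total = sum(is_active)
    if hits_total == 0:
        raise ValueError("enrichment is undefined without any actives")
    return (hits_top / n_top) / (hits_total / n_total)


# Hypothetical predictions for ten molecules, two of which are true actives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
actives = [True, False, True, False, False, False, False, False, False, False]
ef = enrichment_factor(scores, actives, top_fraction=0.2)
```

A value above 1 means the top of the ranking contains proportionally more actives than a random selection of the same size would.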
In what ML areas are you investing the most heavily to improve in over the next two years?
Relative to machine learning more broadly, ML for drug discovery or “molecular” machine learning is still a fairly nascent field. Building a model that can effectively predict protein-ligand binding affinity in the general case is still a difficult problem and a huge opportunity. At Reverie, we have a number of ongoing efforts to try to build models that can generalize in this way across multiple targets. To this end, we’re investing heavily in trying to integrate new representation, featurization, and modeling methods.
What aspect of the industry, either an opportunity or a risk, isn’t getting enough attention?
Generally speaking, using machine learning models to challenge intuitions, especially in an industry like pharma, is essential, and something we’re trying to take advantage of. Using machine learning models in both a predictive and generative capacity can reflect directly what data is telling you and uncover insights that a human practitioner may not see immediately. I think that’s an interesting paradigm, trying to learn from your models and using them as a creative aid to inform the exact ways people think about novel ideas. That sort of interplay, in my view, is a huge opportunity.