Machine Learning in Mass Spectrometry Data Analysis

Hear expert insights on applications of machine learning within mass spectrometry and the challenges faced.

Mass spectrometry is a cornerstone technique in analytical science, and its instruments and related technologies continue to advance. As with many other techniques in the space, one consequence of these advancements is the production of vast volumes of complex data. Analyzing these data is a difficult challenge that requires novel solutions.

Enter machine learning, a subset of artificial intelligence that is revolutionizing how scientists handle and interpret mass spectrometry data. We heard from Will Fondrie, Manager of Computational Biology at Talus Bio, to discover more about this field.

What are some exciting ways in which machine learning can enhance mass spectrometry data analysis?

Mass spectrometry is a field uniquely suited for machine learning and AI to improve how we interpret our data. Unlike many other data types, mass spectra can hardly be called ‘human interpretable.’ It takes a trained expert to gain any insight into what an individual mass spectrum contains, and even then, some details in a given mass spectrum are of unknown significance or uninterpretable. That makes interpreting mass spectra a prime task for machine learning and AI methods, which excel at learning patterns from data. We’ve seen this in recent years in the mass spectrometry-based proteomics field, where AI tools such as Prosit have been trained to predict the mass spectrum that may result from a peptide and then used to detect those peptides in new experiments.
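
To make the idea of matching a predicted spectrum against an observed one concrete, here is a minimal sketch in Python. It is not how Prosit or any particular search tool scores spectra. The function names, the 0.02 m/z tolerance, and the peak lists are all hypothetical, and the score is a plain cosine similarity computed over the predicted peak positions.

```python
import math

def matched_intensities(predicted, observed, tol=0.02):
    """Pair each predicted peak with the most intense observed peak within tol (m/z units)."""
    pairs = []
    for mz_pred, int_pred in predicted:
        candidates = [inten for mz, inten in observed if abs(mz - mz_pred) <= tol]
        # An unmatched predicted peak is paired with an intensity of zero.
        pairs.append((int_pred, max(candidates, default=0.0)))
    return pairs

def cosine_score(predicted, observed, tol=0.02):
    """Cosine similarity between predicted and matched observed intensities."""
    pairs = matched_intensities(predicted, observed, tol)
    dot = sum(p * o for p, o in pairs)
    norm_pred = math.sqrt(sum(p * p for p, _ in pairs))
    norm_obs = math.sqrt(sum(o * o for _, o in pairs))
    if norm_pred == 0 or norm_obs == 0:
        return 0.0
    return dot / (norm_pred * norm_obs)

# Made-up (m/z, intensity) peak lists, for illustration only.
predicted = [(175.10, 0.4), (304.20, 1.0), (417.20, 0.7)]
observed_match = [(175.10, 35.0), (304.20, 98.0), (417.21, 60.0), (520.00, 5.0)]
observed_other = [(147.10, 80.0), (260.20, 40.0), (389.20, 90.0)]

score_match = cosine_score(predicted, observed_match)
score_other = cosine_score(predicted, observed_other)
```

A spectrum whose peaks line up with the prediction scores close to 1, while a spectrum that shares no peaks with it scores 0. Real tools use more careful peak matching and calibrated scores, but the principle of comparing prediction to observation is the same.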

Outside of mass spectrum interpretation, there is a lot of potential for machine learning methods to speed up data acquisition and to use the data for downstream applications. In terms of downstream applications, we see a lot of use for machine learning and AI methods in drug discovery and development—such as the work we perform at Talus—as well as domains such as biomarker discovery and inferring protein-protein interactions.

What are some common challenges and apprehensions faced by mass spectrometrists with respect to machine learning techniques?

One of the biggest problems we face is that mass spectra are difficult to represent computationally in a manner that is suitable for machine learning. This has traditionally made it difficult to use mass spectra as input for machine learning or AI models, particularly compared to other fields such as genomics. In addition, the difficulty in mass spectrum interpretation, reliance on proprietary data formats, and the sheer amount of jargon have hindered our recruitment of computer scientists and other experts into the field. Thus, the most common challenge for mass spectrometrists seeking to apply machine learning methods to their data is finding an expert with whom they can communicate effectively. This is actually one of the primary goals of our short course at ASMS (Machine Learning for Mass Spectrometry Data Analysis)—we want to equip mass spectrometrists with the vocabulary and understanding that they need to seek out the expertise that is most relevant to the problem they want to solve.
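
One reason spectra are awkward model inputs is that each spectrum is a variable-length list of peaks, while many models expect a fixed-length vector. A common workaround is to bin the m/z axis. The sketch below is one simple way to do that, not a recommended standard. The function name, the m/z range, the bin width, and the square-root intensity transform are all assumptions chosen for illustration.

```python
import math

def vectorize_spectrum(peaks, min_mz=100.0, max_mz=1100.0, bin_width=1.0):
    """Turn a list of (m/z, intensity) peaks into a fixed-length, unit-norm vector."""
    n_bins = int(math.ceil((max_mz - min_mz) / bin_width))
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if not (min_mz <= mz < max_mz):
            continue  # peaks outside the chosen range are dropped
        index = min(int((mz - min_mz) / bin_width), n_bins - 1)
        # Square-root scaling dampens the influence of very intense peaks.
        vector[index] += math.sqrt(intensity)
    norm = math.sqrt(sum(v * v for v in vector))
    if norm > 0:
        vector = [v / norm for v in vector]
    return vector

# Two made-up spectra with different numbers of peaks.
spectrum_a = [(175.1, 350.0), (304.2, 980.0), (1500.0, 10.0)]
spectrum_b = [(147.1, 80.0), (260.2, 40.0), (389.2, 90.0), (502.3, 15.0), (615.4, 60.0)]

vector_a = vectorize_spectrum(spectrum_a)
vector_b = vectorize_spectrum(spectrum_b)
```

Both spectra end up as vectors of the same length, which is what most off-the-shelf models need. The cost is lost mass accuracy and a mostly empty vector, which illustrates why representing spectra well remains an open problem.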

As far as apprehensions go, the biggest concerns typically stem from the interpretability of machine learning models (determining why a model provides a particular answer) or a lack of understanding of the underlying methods. Again, education is the key to addressing these concerns. We want to help mass spectrometrists critically evaluate proposed new methods and understand the shortcomings of the models we use.

Please provide a brief overview of supervised machine learning and unsupervised machine learning.

In a supervised learning task, we are given a dataset with features (the model input) and their associated labels (the model output). Our goal is to find the rules (our model) that transform our input features into the output label. We then use that model to predict the labels for new data. Examples of supervised learning include predicting the properties of an analyte from its structure or sequence and predicting the clinical outcome of patients given a set of biomarkers. I like to think about this with a cooking analogy. A standard computer program is like following a recipe: you know the ingredients and the rules and use them to create the dish. In contrast, supervised learning is like trying to create a recipe after tasting a dish. You know the ingredients and the final dish, so the goal is to determine the rules. However, once you’ve figured out the rules that make up your recipe (your model), you can follow the new recipe whenever you want to create the dish again.
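
The workflow described above (learn the rules from labeled examples, then apply them to new data) can be sketched in a few lines of Python. This is a toy illustration using ordinary least squares on a single feature. The function names and the numbers are made up, with the labels constructed to follow label = 2 × feature + 1 exactly so that the learned rule is easy to check.

```python
def fit_line(features, labels):
    """Learn slope and intercept from (feature, label) pairs by least squares."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, feature):
    """Apply the learned rule to a new, unlabeled example."""
    slope, intercept = model
    return slope * feature + intercept

# Synthetic training data: the "ingredients" (features) and the "dish" (labels).
train_features = [1.0, 2.0, 3.0, 4.0]
train_labels = [3.0, 5.0, 7.0, 9.0]

model = fit_line(train_features, train_labels)  # figure out the recipe
prediction = predict(model, 10.0)               # follow it for new data
```

Here the model recovers a slope of 2 and an intercept of 1 and predicts 21 for a feature value of 10. Real datasets are noisy and have many features, so the recovered rules are approximate and need to be evaluated on data the model has not seen.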

For an unsupervised learning task, we are given features but no labels. Instead, there is some relationship between the examples in the dataset that we want to learn. Some examples of unsupervised learning include dimensionality reduction methods such as principal component analysis (PCA) and clustering methods. To go with another food-based analogy, let’s say you’re tasked with organizing buffet tables, and you’re provided a list of ingredients for each dish that will be served. We would most likely group dishes with similar sets of ingredients together. For example, sweet ingredients likely indicate desserts. This is an example of an unsupervised learning task where we do not have examples of where each food should go, but we think that foods with similar ingredients should likely be in the same course. Here, we’re looking for patterns that connect the examples to one another rather than patterns that connect an example to a provided label.
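
The buffet analogy can also be written down as code. The sketch below groups dishes by how much their ingredient lists overlap, measured with Jaccard similarity. It is a deliberately simple greedy grouping, not a standard algorithm such as k-means or hierarchical clustering. The dishes, the function names, and the 0.3 threshold are all invented for illustration.

```python
def jaccard(a, b):
    """Fraction of ingredients shared between two sets (0 = none, 1 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def group_dishes(dishes, threshold=0.3):
    """Place each dish in the first group holding a similar dish, else start a new group."""
    groups = []
    for name, ingredients in dishes.items():
        for group in groups:
            if any(jaccard(ingredients, dishes[other]) >= threshold for other in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# No labels such as "dessert" or "salad" are provided, only ingredients.
dishes = {
    "brownie": {"sugar", "butter", "flour", "cocoa"},
    "cookie": {"sugar", "butter", "flour", "vanilla"},
    "garden salad": {"lettuce", "tomato", "olive oil"},
    "greek salad": {"lettuce", "tomato", "olive oil", "feta"},
    "sponge cake": {"sugar", "butter", "flour", "egg"},
}

groups = group_dishes(dishes)
```

The three baked goods end up in one group and the two salads in another, even though the code was never told what a dessert is. The result depends on the threshold and on the order of the dishes, a reminder that unsupervised results need human interpretation.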

What advancements can we expect to see in this field within the next year or so and in the more distant future?

I’m excited to see more end-to-end machine learning systems for processing mass spectrometry data. This is something we’re already starting to see in the proteomics field, and it holds tremendous potential. Previously, each step of the data processing pipeline may have incorporated some level of machine learning or AI to improve its effectiveness, such as using predicted spectral libraries to improve our ability to find peptides in mass spectra. However, these individual components have been disjointed. Instead, end-to-end models will be able to use the data more effectively, potentially leading to some exciting improvements in data quality. However, obscuring the processing of data behind an AI model often inhibits interpretability, so such systems need to be rigorously evaluated and must substantiate the claims of improvement that they make.

In the more distant future, we’ll start seeing more machine learning and AI approaches incorporated into mass spectrometer instrument control. I could see such methods being particularly useful in ‘top-down proteomics’ applications where we want to analyze whole proteins instead of small peptides—the inevitable future of proteomics. AI methods could potentially help the instrument automatically decide how to fragment each analyte to maximize the sequence information we can gain.

Published in:

June 2024

Innovation and Sustainability in Modern Analysis

Begin by delving into the topic of machine learning in mass spectrometry data analysis. We hear from ASMS short course instructor, Will Fondrie of Talus Bio, who shares his expertise on this exciting topic. We then turn our focus to ion mobility-mass spectrometry (IM-MS), and look at the latest developments in this technology. Daniel DeBord, PhD, of MOBILion Systems Inc. explains the impact of these advances, highlighting the increasing relevance of IM-MS and its diverse applications.

Meet the Author(s):

  • Aimee Cichocki is the Managing Editor at Separation Science and Chromatography Forum. Aimee brings a broad range of experience in creating, editing, and formatting scientific content. With a degree in medicinal chemistry, a 10-year background in formulation chemistry, an MBA, and a diverse background in publishing, Aimee guides editorial initiatives at Separation Science and Chromatography Forum. Aimee is dedicated to ensuring the delivery of informative, reliable, and practical content to our audience of analytical scientists.
