Machine Learning in Mass Spectrometry Data Analysis

Jun 3, 2024

Hear expert insights on applications of machine learning within mass spectrometry and the challenges faced.

Mass spectrometry is a cornerstone technique in analytical science, and its instruments and related technologies continue to advance. As with many other techniques in the space, one consequence of these advancements is the production of vast volumes of complex data. Analyzing this data is a difficult challenge that requires novel solutions.

Enter machine learning, a subset of artificial intelligence that is revolutionizing how scientists handle and interpret mass spectrometry data. We heard from Will Fondrie, Manager of Computational Biology at Talus Bio, to discover more about this field.

What are some exciting ways in which machine learning can enhance mass spectrometry data analysis?

Mass spectrometry is a field uniquely suited for machine learning and AI to improve how we interpret our data. Unlike many other data types, it is difficult to say that mass spectra are ‘human interpretable.’ That is, it takes a trained expert to gain any insight into what an individual mass spectrum contains, and even still, some details in a given mass spectrum are of unknown significance or uninterpretable. However, interpreting mass spectra is a prime task for machine learning and AI methods, which excel at learning patterns from data. We’ve seen this in recent years in the mass spectrometry-based proteomics field, where AI tools such as Prosit have been trained to predict the mass spectrum that may result from a peptide and then used to detect those peptides in new experiments.

Outside of mass spectrum interpretation, there is a lot of potential for machine learning methods to speed up data acquisition and to use the data for downstream applications. In terms of downstream applications, we see a lot of use for machine learning and AI methods in drug discovery and development—such as the work we perform at Talus—as well as domains such as biomarker discovery and inferring protein-protein interactions.

What are some common challenges and apprehensions faced by mass spectrometrists with respect to machine learning techniques?

One of the biggest problems we face is that mass spectra are difficult to represent computationally in a manner that is suitable for machine learning. This has traditionally made it difficult to use mass spectra as input for machine learning or AI models, particularly compared to other fields such as genomics. In addition, the difficulty in mass spectrum interpretation, reliance on proprietary data formats, and the sheer amount of jargon have hindered our recruitment of computer scientists and other experts into the field. Thus, the most common challenge for mass spectrometrists seeking to apply machine learning methods to their data is finding an expert with whom they can communicate effectively. This is actually one of the primary goals of our short course at ASMS (Machine Learning for Mass Spectrometry Data Analysis)—we want to equip mass spectrometrists with the vocabulary and understanding that they need to seek out the expertise that is most relevant to the problem they want to solve.
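
To make the representation problem above concrete, here is a minimal sketch of one common simplification: binning. Each spectrum is a list of (m/z, intensity) peaks of varying length, while most models expect a fixed-length input. The function name `bin_spectrum`, the m/z range, the bin width, and the example peaks are all assumptions chosen for illustration, not a method described in the interview.

```python
def bin_spectrum(peaks, mz_min=0.0, mz_max=2000.0, bin_width=1.0):
    """Convert a list of (m/z, intensity) peaks into a fixed-length vector.

    Intensities that fall into the same m/z bin are summed, and the
    vector is scaled so that its entries add up to 1.
    """
    n_bins = int((mz_max - mz_min) / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # Clamp the index to guard against floating-point edge cases.
            idx = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vector[idx] += intensity
    total = sum(vector)
    if total > 0:
        vector = [v / total for v in vector]
    return vector


# Two made-up spectra with different numbers of peaks.
spectrum_a = [(175.1, 300.0), (175.4, 100.0), (520.3, 600.0)]
spectrum_b = [(204.1, 50.0), (890.5, 150.0)]

vec_a = bin_spectrum(spectrum_a)
vec_b = bin_spectrum(spectrum_b)

# Both spectra now have the same length and can be fed to the same model.
print(len(vec_a), len(vec_b))
```

Binning discards information: the two peaks near m/z 175 in the first spectrum collapse into a single bin. That loss is one reason why representing spectra for machine learning is harder than it first appears.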

As far as apprehensions go, the biggest concerns typically stem from the interpretability of machine learning models (determining why a model provides a particular answer) or a lack of understanding of the underlying methods. Again, education is the key to addressing these concerns. We want to help mass spectrometrists critically evaluate proposed new methods and understand the shortcomings of the models we use.

Please provide a brief overview of supervised machine learning and unsupervised machine learning.

In a supervised learning task, we are given a dataset with features (the model input) and their associated labels (the model output). Our goal is to find the rules (our model) that transform our input features into the output label. We then use that model to predict the labels for new data. Some examples of supervised learning are predicting the properties of an analyte based on its structure or sequence or predicting the clinical outcome of patients given a set of biomarkers. I like to think about this with a cooking analogy. A standard computer program is like following a recipe—you know the ingredients and the rules and use them to create the dish. In contrast, supervised learning is like trying to create a recipe after tasting a dish. You know the ingredients and the final dish, so the goal is to determine the rules. However, once you’ve figured out the rules that comprise your recipe (your model), you can follow the new recipe whenever you want to create the dish again.
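
As a rough illustration of the supervised setting described above, the sketch below fits a straight line to a handful of feature and label pairs and then predicts the label for a new example. The feature (a hypothetical hydrophobicity score), the label (a retention time), and all numbers are invented for illustration. A simple least-squares line stands in for the "recipe"; real models are usually far more complex.

```python
def fit_line(features, labels):
    """Learn a slope and intercept from (feature, label) pairs by least squares."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(labels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, labels))
    sxx = sum((x - mean_x) ** 2 for x in features)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, feature):
    """Apply the learned rule to a new, unlabeled example."""
    slope, intercept = model
    return slope * feature + intercept


# Made-up training data: one feature per analyte and its known label.
hydrophobicity = [1.0, 2.0, 3.0, 4.0]   # hypothetical feature values
retention_time = [3.0, 5.0, 7.0, 9.0]   # hypothetical labels (minutes)

model = fit_line(hydrophobicity, retention_time)  # "figure out the recipe"
predicted = predict(model, 5.0)                   # "cook the dish again"
print(model, predicted)
```

The training step corresponds to working out the recipe from tasted dishes, and the prediction step corresponds to following that recipe for a new set of ingredients.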

For an unsupervised learning task, we are given features but no labels. Instead, there is some relationship between the examples in the dataset that we want to learn. Some examples of unsupervised learning include dimensionality reduction methods such as principal component analysis (PCA) and clustering methods. To go with another food-based analogy, let’s say you’re tasked with organizing buffet tables, and you’re provided a list of ingredients for each dish that will be served. We would most likely group dishes with similar sets of ingredients together. For example, sweet ingredients likely indicate desserts. This is an example of an unsupervised learning task where we do not have examples of where each food should go, but we think that foods with similar ingredients should likely be in the same course. Here, we’re looking for patterns that connect the examples to one another rather than patterns that connect an example to a provided label.
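
The buffet analogy above can be sketched with a small clustering example. The code below runs a bare-bones k-means on made-up dishes, each described only by which ingredients it contains; no dish comes with a course label. The dishes, the ingredient list, and the choice to start from the first two dishes as initial cluster centers are assumptions made to keep the example short and deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=10):
    """Group points into k clusters; returns one cluster index per point."""
    # Simplification: use the first k points as the starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Features: presence (1) or absence (0) of [sugar, butter, garlic, salt].
dishes = {
    "brownie":       [1, 1, 0, 0],
    "garlic bread":  [0, 1, 1, 1],
    "fruit tart":    [1, 0, 0, 0],
    "roast chicken": [0, 0, 1, 1],
    "cookie":        [1, 1, 0, 1],
    "soup":          [0, 0, 0, 1],
}

labels = k_means(list(dishes.values()), k=2)
tables = dict(zip(dishes, labels))
print(tables)
```

With this toy data the sweet dishes end up on one table and the savory dishes on the other, even though the algorithm was never told what a dessert is. It only compared ingredient lists.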

What advancements can we expect to see in this field within the next year or so and in the more distant future?

I’m excited to see more end-to-end machine learning systems for processing mass spectrometry data. This is something we’re already starting to see in the proteomics field, and it holds tremendous potential. Previously, each step of the data processing pipeline may have incorporated some level of machine learning or AI to improve its effectiveness, such as using predicted spectral libraries to improve our ability to find peptides in mass spectra. However, these individual components have been disjointed. Instead, end-to-end models will be able to use the data more effectively, potentially leading to some exciting improvements in data quality. However, obscuring the processing of data behind an AI model often inhibits interpretability, so such systems need to be rigorously evaluated and must substantiate the claims of improvement that they make.

In the more distant future, we’ll start seeing more machine learning and AI approaches incorporated into mass spectrometer instrument control. I could see such methods being particularly useful in ‘top-down proteomics’ applications where we want to analyze whole proteins instead of small peptides—the inevitable future of proteomics. AI methods could potentially help the instrument automatically decide how to fragment each analyte to maximize the sequence information we can gain.

This article is featured in our June 2024 publication, Innovation and Sustainability in Modern Analysis. Find out about the latest innovations and sustainable advances in mass spectrometry, chromatography, and related techniques.
