Using Large Language Models to Map the Body's Cell Signaling Network

From the Computer to the Clinic: 3-19-24

Mar 19, 2024

Introduction

Welcome back to ‘From the Computer to the Clinic’ - a newsletter about computational biology and its contributions to biomedical research.

In this newsletter, we explore how computational biology research can drive clinical progress. By sharing success stories from one disease area or research domain, we aim to inspire the use of these approaches in other diseases and research areas.

If you haven’t already, you can subscribe to this newsletter or share it with friends and colleagues.

Featured Research

Each cell in the human body participates in a complex signaling network. Chemical compounds touch down on the surface receptors of cells and initiate downstream signaling cascades that change a cell’s behavior, often triggering the production of cellular metabolites that activate other cells, propagating the signal further and coordinating body-wide cellular activity. Researchers in laboratories across the world are working, and have been working for decades, to pin down the players involved in this network: the metabolite signaling compounds and the proteins they influence.

Featured Study: Self- and cross-attention accurately predicts metabolite–protein interactions (Campana & Nikoloski, NAR Genomics and Bioinformatics, 2023)

This edition’s featured study is an exciting new contribution to this long-running effort. The authors have developed a computational approach to predict new protein-metabolite interactions, drawing on the large body of known interactions catalogued by researchers in the past. They lay out three tiers of questions that need answering – do a protein and metabolite interact, what is the strength of this interaction, and what part of the protein participates in the interaction.

We have likely studied just a small fraction of the possible protein-metabolite interactions that take place in the human body, let alone answered the second and third questions (interaction strength and interaction site) for the interactions we do know about. There are also additional questions to consider, such as which specific cell types an interaction takes place in (does it happen in the heart, the brain, the lungs?).

A quote from another recent paper (Kurbatov et al., 2023) offers a sense of scale: “Knowledge accumulated today is focused mainly on exogenous ligands with promising pharmacological properties. Beyond this area, the complexity of the endogenous protein–metabolite interactome is striking: even in a typical bacterial cell, by the most conservative estimate, functionally significant events may potentially occur between more than one million proteins and 100 million metabolites.”

Clearly, this is a wide-open frontier in biological knowledge. To help explore it, the authors of the featured study have constructed a large language model to ‘learn the rules’ of protein-metabolite interactions. The model is trained on two large repositories of known metabolite-protein interactions (BioSnap, STITCH). These interactions come from seven different species (including humans, mice, and yeast). Once the model is trained, you can give it a protein and a metabolite as input and receive as output a probability that the two will interact. It is possible, for example, to feed the model a metabolite molecule and a long list of proteins to see which proteins the metabolite is most likely to interact with.
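The screening use case described above (one metabolite, many candidate proteins) can be sketched in a few lines of Python. This is only an illustration of the workflow, not the authors' code: `rank_proteins`, `toy_scorer`, the protein and metabolite names, and the scores in `TOY_SCORES` are all hypothetical stand-ins for a trained model and real inputs.

```python
from typing import Callable, Dict, List, Tuple

# A scorer takes (protein, metabolite) and returns an interaction probability.
# In practice this would wrap a trained model; here it is an assumption.
Scorer = Callable[[str, str], float]


def rank_proteins(
    metabolite: str,
    proteins: List[str],
    scorer: Scorer,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate protein against one metabolite and
    return the top_k proteins, highest predicted probability first."""
    scored = [(protein, scorer(protein, metabolite)) for protein in proteins]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up probabilities standing in for model output (not real data).
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("protein_A", "metabolite_X"): 0.40,
    ("protein_B", "metabolite_X"): 0.05,
    ("protein_C", "metabolite_X"): 0.91,
    ("protein_D", "metabolite_X"): 0.22,
}


def toy_scorer(protein: str, metabolite: str) -> float:
    """Hypothetical placeholder for a trained model's prediction."""
    return TOY_SCORES.get((protein, metabolite), 0.0)


candidates = ["protein_A", "protein_B", "protein_C", "protein_D"]
ranking = rank_proteins("metabolite_X", candidates, toy_scorer, top_k=2)
for protein, probability in ranking:
    print(f"{protein}: {probability:.2f}")
```

Passing the scorer in as a function keeps the ranking logic independent of whichever model produces the probabilities.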

The authors tested the model and report good performance on labeled test data. The model achieved a sensitivity of 79% (it correctly predicts an interaction for 79% of the protein-metabolite pairs that are known to interact) and a specificity of 87% (it correctly predicts no interaction for 87% of the pairs that do not interact). It also outperformed the existing models it was compared against.
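For readers less familiar with these two metrics, here is a minimal Python sketch of how sensitivity and specificity are computed from labeled test data. The labels and predictions below are invented purely so that the arithmetic lands on 79% and 87%; they are not the study's data, and the function names are our own.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Count true positives, false negatives, true negatives and
    false positives for binary labels (1 = interact, 0 = do not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly interacting pairs that were predicted to interact."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-interacting pairs that were predicted not to interact."""
    return tn / (tn + fp)


# Illustrative data only: 100 interacting pairs and 100 non-interacting pairs.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 79 + [0] * 21 + [0] * 87 + [1] * 13

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Note that the two metrics are computed on disjoint subsets of the test data: sensitivity only looks at pairs that truly interact, and specificity only at pairs that do not.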

What kind of biological questions could you answer with this kind of predictive model? One could imagine using it to predict all the cell signaling pathways influenced by a particular metabolite. If you did this, you could identify metabolites that are particularly adept at stimulating cellular activity, and therefore highly relevant for future study.

You could also explore whether certain metabolites can stimulate cellular pathways known to be associated with disease. Such metabolites may play an unheralded role in various illnesses, if they are underproduced or overproduced in people with disease relative to the healthy population. They may also be useful as a form of therapeutic intervention. Perhaps you could fine-tune metabolite levels in the body (for example, by administering probiotic species to act as metabolite-producing factories) to shift the activity of multiple cellular pathways in a desired direction simultaneously. Of course, the fact that a metabolite may act on many different pathways at once is as potentially dangerous as it is promising. If you are not careful in modifying metabolite levels, un-targeted pathways could get caught in the crossfire, producing undesirable side effects.

Another challenge to consider is that even if you can accurately predict that a metabolite and protein interact, you don’t necessarily know whether this interaction will stimulate or repress a particular pathway. Some metabolites are known to stimulate cell signaling pathways, others to repress them. And as the authors mention, understanding the strength of a given interaction is also important. It may be that one strong interaction with a key protein in a cell signaling pathway could have a dramatic effect on cell function, whereas multiple weak interactions may have little influence on the behavior of a cell. These questions are fertile ground for future work.

Additional Studies

Screening cosmetic chemicals to ID dermatitis-causing compounds. Kwon et al. developed a machine learning model, trained on a set of chemicals (input: their sequences and physicochemical properties) that are known to cause dermatitis. With the trained model, a user can provide a novel compound and predict whether or not it will have skin-damaging effects (and furthermore, if it is predicted to cause skin damage, whether it will be a strong or weak ‘skin sensitizer’). The model uses the transformer architecture, which has become very popular in recent years as the ML framework that powers ChatGPT.

Will a genetic mutation contribute to disease? That is the question that Zhang et al. sought to answer in their recent study. To predict whether the change in a specific DNA nucleotide will be ‘neutral’ or disease-causing, the authors built a model that takes into account several details about the protein in question (including biochemical properties, solvent accessibility, and protein structure). The authors took inspiration from Google DeepMind’s popular AlphaMissense model, which uses protein structural information from the AlphaFold Database to predict the effects of a mutation. Zhang et al.’s model also makes use of the transformer architecture, and is trained on data from several large datasets (PredictSNP, MMP, PMD).

Exploring the ‘full potential of proteins’. Ingraham et al. developed a generative model to produce proteins with desired functional properties. Generative protein modeling has become popular in recent years, as novel proteins that are not produced naturally by biological cells could be useful for therapeutic or biomanufacturing purposes. The authors developed a diffusion model (the sort of ML approach used by text-guided image generation software like DALL-E 2) to generate novel protein sequences, and tested their properties after expressing them in E. coli cells. Users can input desired functional information to guide the model to generate proteins with properties of interest.