Your immune system harbors a lifetime's worth of information about threats it's encountered — a biological Rolodex of baddies. Often the perpetrators are viruses and bacteria you've conquered; others are undercover agents like vaccines given to trigger protective immune responses or even red herrings in the form of healthy tissue caught in immunological crossfire.
Now researchers at Stanford Medicine have devised a way to mine this rich internal database to diagnose conditions as diverse as diabetes, COVID-19 and responses to influenza vaccines. Although they envision the approach as a way to screen for multiple diseases simultaneously, the machine-learning-based technique can also be optimized to detect complex, difficult-to-diagnose autoimmune diseases such as lupus.
In a study of nearly 600 people (some healthy, others with infections including COVID-19 or autoimmune diseases including lupus and Type 1 diabetes), the algorithm the researchers developed, called Mal-ID for machine learning for immunological diagnosis, was remarkably successful in identifying who had what based only on their B and T cell receptor sequences and structures.
"The diagnostic toolkits that we use today don't make much use of the immune system's internal record of the diseases it has encountered," said postdoctoral scholar Maxim Zaslavsky, PhD. "But our immune system is constantly surveilling our bodies with B and T cells, which act like molecular threat sensors. Combining information from the two main arms of the immune system gives us a more complete picture of the immune system's response to disease and the pathways to autoimmunity and vaccine response."
Zaslavsky and Erin Craig are the lead authors of the study published Feb. 20 in Science. Professor of pathology Scott Boyd, MD, PhD, and associate professor of genetics and computer science Anshul Kundaje, PhD, are the senior authors of the research.
In addition to aiding diagnosis of tricky diseases, Mal-ID could track responses to cancer immunotherapies and subcategorize disease states in ways that could help guide clinical decision-making, the researchers believe.
"Several of the conditions we were looking at could be significantly different at a biological or molecular level, but we describe them with broad terms that don't necessarily account for the immune system's specialized response," said Boyd, who co-directs the Sean N. Parker Center for Allergy and Asthma Research. "Mal-ID could help us identify subcategories of particular conditions that could give us clues to what sort of treatment would be most helpful for someone's disease state."
Deciphering the language of proteins
In a follow-the-dots approach, the scientists used machine learning techniques based on large language models like those that underlie ChatGPT to home in on the threat-recognizing receptors on immune cells called T cells and the business ends of antibodies (also called receptors) made by another type of immune cell called B cells. These language models look for patterns in large datasets like texts from books and websites. With enough training, they can use these patterns to predict the next word in a sentence, among other tasks.
In the case of this study, the scientists applied a large language model trained on proteins, fed the model millions of sequences from B and T cell receptors, and used it to lump together receptors that share key characteristics — as determined by the model — that might suggest similar binding preferences. Doing so might give a glimpse into what triggers caused a person's immune system to mobilize — churning out an army of T cells, B cells and other immune cells equipped to attack real and perceived threats.
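The grouping idea described above can be illustrated with a small Python sketch. It is not the study's method: in place of a protein language model it uses a much cruder stand-in (counts of overlapping two-letter fragments), and the function names, the similarity threshold and the three example sequences are all hypothetical, invented for illustration.

```python
from collections import Counter
from math import sqrt

def kmer_vector(seq, k=2):
    """Toy stand-in for a language-model embedding: counts of overlapping k-mers."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group_receptors(seqs, threshold=0.6):
    """Greedy grouping: a sequence joins the first group whose founder is similar enough."""
    groups = []  # each entry is (founder_vector, [member sequences])
    for seq in seqs:
        vec = kmer_vector(seq)
        for founder, members in groups:
            if cosine(vec, founder) >= threshold:
                members.append(seq)
                break
        else:
            groups.append((vec, [seq]))
    return [members for _, members in groups]

# Hypothetical receptor fragments (amino acid letters), invented for illustration.
receptors = ["CASSLGQGNTEAFF", "CASSLGQGDTEAFF", "CAWSVRDRGYEQYF"]
groups = group_receptors(receptors)
print(groups)
```

In this toy run the first two sequences, which differ by a single letter, land in one group and the third sits alone. A learned embedding would aim to group receptors that look different on the surface but behave alike, which simple fragment counting cannot do.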
"The sequences of these immune receptors are highly variable," Zaslavsky said. "This variability helps the immune system detect virtually anything, but also makes it harder for us to interpret what these immune cells are targeting. In this study, we asked whether we could decode the immune system's record of these disease encounters by interpreting this highly variable information with some new machine learning techniques. This idea isn't new, but we've been missing a robust way to capture the patterns in these immune receptor sequences that indicate what the immune system is responding to."
B cells and T cells represent two separate arms of the immune system, but the way they make the proteins that recognize infectious agents or cells that need to be eliminated is similar. In short, specific segments of DNA in the cells' genomes are randomly mixed and matched — sometimes with an additional dash of extra mutations to spice things up — to create coding regions that, when the protein structures are assembled, can generate trillions of unique antibodies (in the case of B cells) or cell surface receptors (in the case of T cells).
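As a rough illustration of that mix-and-match step, the Python sketch below draws one segment from each of three small pools and occasionally swaps a letter. Everything in it is an assumption made for illustration: the segment pools are hypothetical and far smaller than real ones, and the mutation rate is arbitrary.

```python
import random

# Hypothetical, heavily shortened segment pools; real genomes hold many more segments.
V_SEGMENTS = ["TGTGCC", "TGTGCG", "TGTGCA"]
D_SEGMENTS = ["GGGACA", "GGACTA"]
J_SEGMENTS = ["TTCGGC", "TTTGGA"]

def recombine(rng, mutation_rate=0.1):
    """Join one random segment from each pool, then randomly redraw some bases."""
    seq = rng.choice(V_SEGMENTS) + rng.choice(D_SEGMENTS) + rng.choice(J_SEGMENTS)
    # Each base is redrawn with probability mutation_rate (the redraw may pick the same base).
    return "".join(
        rng.choice("ACGT") if rng.random() < mutation_rate else base
        for base in seq
    )

rng = random.Random(0)  # fixed seed so the run is repeatable
repertoire = {recombine(rng) for _ in range(1000)}

# Number of distinct sequences possible from segment choice alone.
segment_combinations = len(V_SEGMENTS) * len(D_SEGMENTS) * len(J_SEGMENTS)
print(segment_combinations, len(repertoire))
```

Segment choice alone yields only 12 combinations in this toy setup; the added mutations push the number of distinct sequences well beyond that, which is the same multiplying effect, at vastly smaller scale, that the paragraph above describes.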
The randomness of this process means that these antibodies or T cell receptors aren't tailored to recognize any specific molecules on the surface of invaders. But their dizzying diversity ensures that at least a few will bind to almost any foreign structure. (Autoimmunity, or an attack by the immune system on the body's own tissues, is typically, but not always, avoided by a conditioning process T and B cells go through early in development that eliminates problem cells.)
The act of binding stimulates the cell to make many more of itself to mount a full-scale attack; the subsequent increased prevalence of cells with receptors that match similar three-dimensional structures provides a biological fingerprint of what diseases or conditions the immune system has been targeting.
To test their theory, the researchers assembled a dataset of over 16 million B cell receptor sequences and over 25 million T cell receptor sequences from 593 people with one of six different immune states: healthy controls, people infected with SARS-CoV-2 (the virus that causes COVID-19) or with HIV, people who had recently received an influenza vaccine, and people with lupus or Type 1 diabetes (both autoimmune diseases). Zaslavsky and his colleagues then used their machine-learning approach to look for commonalities between people with the same condition.
"We compared the frequencies of segment usage, the amino acid sequences of the resulting proteins and the way the model represented the 'language' of the receptors, among other characteristics," Boyd said.
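One of the features Boyd mentions, the frequency of segment usage, is simple enough to sketch in Python. The example below is a minimal illustration, not the study's code: the function name is hypothetical and the per-person lists of segment labels are invented.

```python
from collections import Counter

def segment_usage(gene_calls):
    """Return the fraction of a person's receptor sequences that use each gene segment."""
    counts = Counter(gene_calls)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical segment labels for two people, invented for illustration.
person_a = ["IGHV1-69", "IGHV3-23", "IGHV3-23", "IGHV4-34"]
person_b = ["IGHV3-23", "IGHV3-23", "IGHV3-23", "IGHV1-69"]

usage_a = segment_usage(person_a)
usage_b = segment_usage(person_b)

# Difference in usage (person B minus person A) for every segment seen in either person.
difference = {
    gene: usage_b.get(gene, 0.0) - usage_a.get(gene, 0.0)
    for gene in sorted(usage_a.keys() | usage_b.keys())
}
print(difference)
```

Repeated across many people, differences like these become one set of inputs a classifier could use to tell conditions apart.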
T and B cells together
The researchers found that T cell receptor sequences provided the most relevant information about lupus and Type 1 diabetes while B cell receptor sequences were most informative in identifying HIV or SARS-CoV-2 infection or recent influenza vaccination. In every case, however, combining the T and B cell results increased the algorithm's ability to accurately categorize people by their disease state regardless of sex, age or race.
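Why combining the two readouts can help is easy to see in a toy Python example. The sketch below simply averages the probabilities from two separate models; this is an assumption chosen for simplicity, and the article does not say how Mal-ID merges its B and T cell results. The function name and the probability values are hypothetical.

```python
def combine(bcr_probs, tcr_probs):
    """Average two per-condition probability dicts; return (best_label, averaged_probs)."""
    labels = bcr_probs.keys() | tcr_probs.keys()
    averaged = {
        label: (bcr_probs.get(label, 0.0) + tcr_probs.get(label, 0.0)) / 2
        for label in labels
    }
    return max(averaged, key=averaged.get), averaged

# Hypothetical model outputs for one person, invented for illustration.
bcr = {"healthy": 0.40, "lupus": 0.35, "covid19": 0.25}  # B cell model leans "healthy"
tcr = {"healthy": 0.20, "lupus": 0.70, "covid19": 0.10}  # T cell model leans "lupus"

label, probs = combine(bcr, tcr)
print(label, probs)
```

Here the B cell model alone would narrowly call the person healthy, but the stronger T cell signal tips the combined call to lupus.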
"Traditional approaches sometimes struggle to find groups of receptors that look different but recognize the same targets," Zaslavsky said. "But this is where large language models excel. They can learn the grammar and context-specific clues of the immune system just like they have mastered English grammar and context. In this way, Mal-ID can generate an internal understanding of these sequences that gives us insights we haven't had before."
Although the researchers developed Mal-ID on just six immunological states, they envision the algorithm could quickly be adapted to identify immunological signatures specific to many other diseases and conditions. They are particularly interested in autoimmune diseases like lupus, which can be difficult to diagnose and treat effectively.
"Patients can struggle for years before they get a diagnosis, and even then, the names we give these diseases are like umbrella terms that overlook the biological diversity behind complex diseases," Zaslavsky said. "If we can use Mal-ID to unravel the heterogeneity behind lupus, or rheumatoid arthritis, that would be very clinically impactful."
Mal-ID may also help researchers identify new therapeutic targets for many conditions.
"The beauty of this approach is that it works even if we don't at first fully know what molecules or structures the immune system is targeting," Boyd said. "We can still get the information simply by seeing similar patterns in the way people respond. And, by delving deeper into these responses we may uncover new directions for research and therapies."
Researchers from the Swiss Tropical and Public Health Institute, the University of Basel, the Oklahoma Medical Research Foundation, the University of Pennsylvania, the University of Cincinnati, the Cincinnati Children's Hospital Medical Center, the Icahn School of Medicine at Mount Sinai, Duke University, the Swedish Medical Center, the University of Washington, the Institute for Systems Biology, the Harvard T.H. Chan School of Public Health, Beth Israel Deaconess Medical Center, New York University, and the Lupus Foundation of America contributed to the work.
The study was funded by the National Institutes of Health (grants R01AI130398, R01AI127877, U19AI057229, U54CA260518, U19AI167903, 5R01 EB001988-16, UM-1 AI100645, UM1 AI144371, AI 101093, AI-086037, AI-48693, R01AI153133, R01AI137272, 3U19AI057229–17W1 COVID SUPP2, AR07375, UM1AI144292, NIDDK P30DK116074, U54CA260518, U19AI167903, R01 AI175771-01, R01 CA264090-01, U19 AI057229 and 1U54CA26051), the National Science Foundation, the Burroughs Wellcome Fund, the Sunshine Foundation, the Henry Gustav Floren Trust, a philanthropic gift from Eva Grove and a philanthropic gift from an anonymous donor.