AI Models Revolutionize Protein Science, Healthcare

Technical University of Denmark

Researchers have developed new AI models that can substantially improve accuracy and discovery in protein science. The models could help the medical sciences overcome current challenges in areas such as personalised medicine, drug discovery, and diagnostics.

In the wake of broadly available AI tools, most technical and natural sciences fields are advancing rapidly. This is particularly true in biotechnology, where AI models power breakthroughs in drug discovery, precision medicine, gene editing, food security, and many other research areas.

One sub-field is proteomics – the study of proteins on a large scale – where vast amounts of protein data are gathered in databases against which a sample can be compared. These databases enable scientists to discern which proteins – and, thereby, microorganisms – are present in a sample. They allow a doctor to diagnose diseases, monitor the effectiveness of a treatment, or identify pathogens present in a patient's sample.

Although these tools are very useful and effective, there are limits to what they can do, says Timothy Patrick Jenkins, an Associate Professor at DTU Bioengineering and corresponding author of the study:

"First off, no database includes everything, so you need to know which databases are relevant to your particular needs. Then deep searches are very time-consuming and demand a lot of computer power. And, finally, it's nearly impossible to identify proteins that haven't been registered yet."

For this reason, some groups have worked on so-called 'de novo sequencing algorithms', which are intended to improve accuracy and lower computational costs as databases grow. Still, according to Jenkins and colleagues from DTU, Delft University in the Netherlands and the British AI company InstaDeep, the performance of these algorithms has remained "underwhelming."

Exceeding state-of-the-art

In a new paper in Nature Machine Intelligence, they propose two novel AI models to help researchers, medical practitioners, and commercial entities find exactly the information they need in vast amounts of data. The models are called InstaNovo and InstaNovo+ and are available to researchers through the InstaDeep website (see fact box).

"Seen together, our models exceed state-of-the-art and are significantly more precise than currently available tools. Furthermore, as we show in the paper, our models are not specific to a particular research area. Instead, these tools could propel significant advances in all fields involving proteomics," says Kevin Michael Eloff, a research engineer at InstaDeep and co-first author of the paper.

To assess the usefulness of their models, the researchers have trained and tested them on several specific tasks within major areas of interest.

One investigation was performed on wound fluid from venous leg ulcer patients. Since venous leg ulcers are notoriously difficult to treat and often become chronic, knowing which microorganisms, such as bacteria, are present is crucial to treatment. The models could map ten times as many sequences as a database search, among them E. coli and Pseudomonas aeruginosa – the latter being a multidrug-resistant bacterium.

Another test involved small pieces of protein, called peptides, displayed on the surface of cells. These help the immune system recognize infections and diseases such as cancer. The InstaNovo models identified thousands of new peptides that were not found using traditional methods. In personalised cancer treatments empowering the immune system – immunotherapy for short – these peptides are all potential attack points.

"In combination, our tests of the model on complex cases, where, for example, unknown proteins are present, or where we have no prior knowledge of the organisms involved, show that they are suitable to improve our understanding significantly. That this bodes well for biomedicine is a given, since it can directly improve identification of our microbiome, as well as improve our efforts within personalised medicine and cancer immunology," says Konstantinos Kalogeropoulos, co-first author and Assistant Professor at DTU Bioengineering.

The paper provides six additional cases that demonstrate how these models improve therapeutic sequencing, discover novel peptides, detect unreported organisms, and significantly enhance proteomics searches. The implications of their results extend far beyond the medical sciences, says Timothy Patrick Jenkins:

"Looking at it from a purely technical, scientific perspective, it is also true that with these tools, we can improve our understanding of the biological world as a whole, not only in terms of healthcare but also in industry and academia. Within every field using proteomics - be it plant science, veterinary science, industrial biotech, environmental monitoring, or archaeology - we can gain insights into protein landscapes that have been inaccessible until now."

FACTS

What Are InstaNovo and InstaNovo+?

InstaNovo is a transformer-based model designed for de novo peptide sequencing. Developed in collaboration between InstaDeep and the Department of Biotechnology and Biomedicine at the Technical University of Denmark (DTU), it translates fragment ion peaks from mass spectrometry data into peptide sequences with unprecedented precision.
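To make the link between fragment ion peaks and a peptide sequence concrete, the following minimal sketch computes the singly charged b- and y-ion values that a given peptide would be expected to produce. It is not part of InstaNovo; the function name fragment_ions is a hypothetical helper, and only a subset of the standard monoisotopic residue masses is included for brevity.

```python
# Monoisotopic residue masses in daltons (standard values; subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "E": 129.04259,
    "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056   # mass of H2O, added to y-ions
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    Hypothetical helper for illustration: b-ions are N-terminal prefixes,
    y-ions are C-terminal suffixes of the peptide.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_ions("GAK")
# The gap between consecutive b-ions equals one residue mass (here alanine).
gap = b[1] - b[0]
```

The gap between neighbouring peaks in the same ion series corresponds to the mass of one amino acid. De novo sequencing runs this calculation in reverse: it starts from measured peaks and infers the sequence that best explains them.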

Unlike traditional methods that rely on pre-existing databases, InstaNovo identifies peptides that have never been documented before—expanding the landscape of proteomic discovery.

A key innovation of the InstaNovo models is InstaNovo+, a diffusion-based iterative refinement model that enhances sequence accuracy by mimicking how researchers manually refine peptide predictions. InstaNovo+ begins with an initial sequence—either derived from InstaNovo or generated at random—and improves it, step by step.

When paired with InstaNovo, InstaNovo+ significantly reduces false discovery rates (FDR) and improves sequence accuracy, not just by refining predictions, but by exploring a broader range of potential peptide sequences.

Unlike autoregressive models such as InstaNovo and others, which predict peptide sequences one amino acid at a time, InstaNovo+ processes entire sequences holistically, enabling greater accuracy and higher detection rates.
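The difference between the two decoding styles can be illustrated with a toy sketch. This is not the InstaNovo architecture: there is no transformer and no diffusion process here. A simple peak-matching score and a greedy substitution loop stand in for the real models, the masses are rounded nominal values, and all function names are hypothetical.

```python
# Rounded nominal residue masses for a four-letter toy alphabet (illustration only).
MASS = {"A": 71, "G": 57, "S": 87, "V": 99}


def prefix_masses(seq):
    """Cumulative mass of every prefix of seq (a stand-in for fragment ion peaks)."""
    total, out = 0, []
    for aa in seq:
        total += MASS[aa]
        out.append(total)
    return out


def score(seq, peaks):
    """Count how many prefix masses of seq appear among the observed peaks."""
    return sum(m in peaks for m in prefix_masses(seq))


def autoregressive_decode(peaks, length):
    """Build the sequence left to right, committing to one residue at a time."""
    seq = ""
    for _ in range(length):
        seq += max(MASS, key=lambda aa: score(seq + aa, peaks))
    return seq


def iterative_refine(initial, peaks, max_rounds=10):
    """Start from a complete sequence and revise positions that raise the score.

    Greedy hill climbing, used here only as a simple stand-in for
    step-by-step refinement of a whole sequence.
    """
    seq = list(initial)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(seq)):
            for aa in MASS:
                candidate = seq[:i] + [aa] + seq[i + 1:]
                if score(candidate, peaks) > score(seq, peaks):
                    seq, improved = candidate, True
        if not improved:
            break
    return "".join(seq)


peaks = set(prefix_masses("GAVS"))              # simulated, noise-free spectrum
left_to_right = autoregressive_decode(peaks, 4)
refined = iterative_refine("AAAA", peaks)       # arbitrary poor starting guess
```

In this noise-free toy case both strategies recover the same sequence. The sketch only shows the structural difference: the first function never revisits an earlier choice, while the second can revise any position of a full candidate at each step.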

Together, InstaNovo and InstaNovo+ enhance de novo peptide sequencing, striking a balance between precision and exploration to accelerate biological discovery.

Source: InstaDeep.
