The amount of recorded genetic data already exceeds the processing capacity of the software used to analyse such data. Alexandru Tomescu, a researcher specialising in algorithms, is developing methods that process genetic data with increasing accuracy and as much as a thousand times faster.
More detailed information on the functioning of human genes is helping in the development of more effective drugs and better therapies for cancers and other diseases. These advances are based on enormous volumes of data accumulated through gene sequencing, or reading the genome, and these volumes continue to grow. For example, researchers globally are currently constructing a pangenome, or a comprehensive description of the genomes of all humans.
Researchers are looking to extract information from this genetic data. They do so with software that contains a range of algorithms, or detailed instructions for performing certain tasks. Algorithms can be used to identify recurring patterns or abnormalities that can indicate, for example, diseases.
Now, we have arrived at a juncture where the algorithms used so far no longer perform their tasks quickly enough. Researchers also wish to uncover increasingly complex elements in genomes, which means increasingly taxing tasks for algorithms and slower performance. In other words, algorithms must be designed to be faster and more efficient.
A thousand-fold increase in speed
A research group headed by Associate Professor of Algorithmic Bioinformatics Alexandru Tomescu at the University of Helsinki develops techniques that could speed up many algorithms used in bioinformatics. The group's goal is for algorithms to remain fast on larger datasets and to carry out increasingly complex tasks quickly.
For example, brain researchers may wish to investigate the expression of genes in different tissues or even in individual cells. Accurate modelling of gene expression is an extremely difficult problem for today's software, resulting primarily in rough estimates.
Tomescu's group wishes to change this.
"The idea is to identify features that correspond to the data content without having to go through it in its entirety several times. We are designing algorithms that can quickly extract the desired details from data. In a way, we are looking for shortcuts in the data structure," Tomescu says.
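As a rough illustration of what avoiding repeated full passes over the data can look like, the following minimal Python sketch reads a sequence once to build an index and then answers searches from that index. It is a generic textbook technique (a k-mer index), not the method developed by Tomescu's group; the function names, the toy sequence and the value of k are all assumptions made for this example.

```python
from collections import defaultdict


def build_kmer_index(text, k):
    """Read the text once and record where every length-k substring starts."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def find(index, text, pattern, k):
    """Locate a pattern (length >= k) by checking only the indexed candidates."""
    candidates = index.get(pattern[:k], [])
    return [i for i in candidates if text.startswith(pattern, i)]


# Hypothetical toy sequence; real genomic strings are vastly longer.
genome = "ACGTACGTGACG"
index = build_kmer_index(genome, k=3)

hits_acg = find(index, genome, "ACG", k=3)
hits_gtga = find(index, genome, "GTGA", k=3)
```

Once the index exists, each search inspects only a handful of candidate positions instead of rereading the whole sequence.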
For example, when a genome is sequenced, or when the genes in the genome are read, the sequencing software presents the results as a string of up to 250 million characters. Sequencing the genomes of 10,000 people therefore generates up to 2,500 billion characters in total.
While classic algorithms would take roughly two days to go through the data, those developed by Tomescu's group enable such enormous masses of data to be analysed in much smaller units, corresponding in size to the data of only about 10 people. The result would be ready in under three minutes, or a thousand times faster than at present.
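The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers from the article; the one added assumption is that running time scales in direct proportion to the amount of data processed, which is a simplification made for this example.

```python
# Figures quoted in the article.
CHARS_PER_GENOME = 250_000_000   # "up to 250 million characters"
PEOPLE = 10_000
GROUP_SIZE = 10                  # "roughly 10 people"

total_chars = CHARS_PER_GENOME * PEOPLE       # 2,500 billion characters
group_chars = CHARS_PER_GENOME * GROUP_SIZE   # 2.5 billion characters
size_ratio = total_chars // group_chars       # how much smaller the reduced data is

# Assumption for this sketch: running time is proportional to data size.
classic_minutes = 2 * 24 * 60                 # "roughly two days"
fast_minutes = classic_minutes / size_ratio   # estimated time on the reduced data

print(f"{total_chars:,} characters in total")
print(f"{size_ratio}x smaller, about {fast_minutes:.2f} minutes")
```

Under that assumption, two days shrinks to just under three minutes, which matches the thousandfold speed-up described.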
There is already concrete evidence of the benefits of higher-performance algorithms in biomedicine. New algorithms make it possible to see how genes are expressed in cells, enabling, among other things, the tracing of brain cell function and boosting research on ageing. The software developed for this purpose by Tomescu's group has already been used successfully in the field.
Accelerating algorithms in practice
How can algorithms be accelerated in practice?
"Simply by tailoring them to take advantage of the special features found in the strings of characters in the data sequenced from genomes," Tomescu says.
"For instance, if your task were to unload a truckload of apples into storage, you might, without any instructions to guide you, start working one apple at a time. Then you'd realise that the apples are in boxes, which are faster to move. Again, you would notice that the boxes are on pallets, and you could proceed to using a forklift," Tomescu explains.
In this example, the boxes and pallets are easy to detect, making it easy to increase efficiency. In genomic data too, larger entities, such as clusters of 10 individuals, can be observed. In practice, however, such helpful structures are difficult to identify.
"Even if we identify a special feature in the genetic data, utilising it to speed up the processing of data can be challenging. It would be like trying to lift an uncommonly shaped pallet with an ordinary forklift. New, rapid algorithms are like new kinds of boxes or forklifts that can solve a range of problems," Tomescu says.
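The apples, boxes and pallets analogy can be made concrete with a small Python sketch. It counts one character in a string either one character at a time or one run of identical characters at a time. Run-length encoding is used here purely as a stand-in for "boxes"; it is not the technique used by Tomescu's group, and the function names and toy data are assumptions made for this example.

```python
from itertools import groupby


def count_char_by_char(seq, base):
    """One apple at a time: inspect every single character."""
    count = 0
    for ch in seq:
        if ch == base:
            count += 1
    return count


def run_length_encode(seq):
    """Pack the apples into boxes: store each run as (character, length)."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]


def count_in_runs(runs, base):
    """Move whole boxes: one step per run instead of one per character."""
    return sum(length for ch, length in runs if ch == base)


# Hypothetical toy data with long repeated stretches.
seq = "A" * 1000 + "C" * 500 + "A" * 250 + "G" * 10
runs = run_length_encode(seq)

slow = count_char_by_char(seq, "A")   # 1,760 steps
fast = count_in_runs(runs, "A")       # 4 steps
```

Both functions return the same count, but the second does its work in four steps rather than 1,760. The hard part, as the quote above suggests, is that real genomic data rarely comes packed this neatly.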
Pursuing effective genomic search engines
Alexandru Tomescu recently received a prestigious Consolidator Grant from the European Research Council (ERC) for his algorithm research. The goal of his ERC-funded study is to gain a closer look at, for example, the mechanisms of brain cells and to create effective genomic search engines. The findings can accelerate breakthroughs in biomedical research and precision medicine.
Tomescu's group conducts basic research, laying the groundwork for future therapies.
"While our work is far from the hospital, we hope that faster algorithms will help bioinformatics researchers develop novel software to support diagnostics and patient care. Hopefully our efforts will provide increasingly accurate information on genes in the future," Tomescu says.
The time saved by algorithms can manifest, for example, in the fairly rapid completion of computing tasks in laboratories. According to Tomescu, rapid algorithms also consume less energy than their less agile peers. In addition, both money and effort will eventually be saved, as the need for sequencing can be reduced through precision computing.
Retained accuracy
Tomescu emphasises that the results produced by faster algorithms are demonstrably as accurate as those of classic, slower algorithms.
"Everything hinges on mathematical correctness. We can mathematically prove that our methods provide the same results as others, but faster," Tomescu says.
With old and new algorithms alike, the most difficult part is often pinpointing the exact problem in need of a solution.
"We still have a lot of work to do. Having said that, we believe that our techniques can be adapted to a range of problems considered important by biomedical researchers in the future."