A new study presents "Evo" – a machine learning model capable of decoding and designing DNA, RNA, and protein sequences, from molecular to genome scale, with unparalleled accuracy. Evo's ability to predict, generate, and engineer entire genomic sequences could change the way synthetic biology is done. "The ability to predict the effects of mutations across all layers of regulation in the cell and to design DNA sequences to manipulate cell function would have tremendous diagnostic and therapeutic implications for disease," writes Christina Theodoris in a related Perspective.
With a vocabulary of just four nucleotides, DNA encodes all the genetic information essential for life. Variations in the genomic sequence reflect adaptations selected for specific biological functions, and they drive evolution by enabling organisms to adapt to new or changing environments. Advances in DNA sequencing technologies have allowed these variations to be mapped at whole-genome scale. Such data, combined with novel machine learning algorithms, could enable the creation of a comprehensive model that captures the functions of DNA, RNA, and proteins and the interactions among them. However, although some researchers, inspired by the success of large language models (LLMs), have attempted to model DNA as a "language" by applying similar techniques, current generative models tend to focus narrowly on individual molecules or DNA segments. This narrow focus, together with computational limitations, has constrained the ability of these models to capture the broader genomic interactions needed to understand complex biological processes.
Here, Eric Nguyen and colleagues present Evo – a large-scale genomic foundation model with 7 billion parameters, designed to generate DNA sequences up to whole-genome scale. Built on the StripedHyena architecture, Evo was trained on a dataset of 2.7 million evolutionarily diverse microbial genomes. According to Nguyen et al., Evo excels in both predictive and generative biological tasks, achieving high accuracy in zero-shot evaluations that predict the impact of mutations on bacterial proteins and RNA, as well as in modeling gene regulation. Evo also captures the intricate coevolution between coding and noncoding sequences, which supports the design of complex biological systems such as CRISPR-Cas complexes and transposable elements. At the genomic scale, Evo can generate sequences more than 1 megabase in length, far beyond the capability of prior models. "Future models may learn from diverse human and other eukaryotic genomes, using larger context lengths to capture distant genomic interactions over larger genomic scales," writes Theodoris in the Perspective.