Guided by a multimodal generative language model called ESM3, Thomas Hayes and colleagues generated and synthesized a previously unknown bright fluorescent protein. Its sequence is so different from those of known fluorescent proteins that the researchers say its creation is equivalent to ESM3 simulating 500 million years of biological evolution. The model could provide a new way to "search" the space of possible proteins, both to better understand how naturally evolved proteins work and to develop novel proteins for use in medicine, environmental remediation, and a host of other applications.

ESM3 can reason over protein sequence, structure, and function by representing each of them with an alphabet of discrete tokens, which can then be combined in a single generative language model. This strategy differs from previous protein language models, which were scaled on sequences alone. The training data for ESM3 consists of 771 billion unique tokens created from 3.15 billion protein sequences, 236 million protein structures, and 539 million proteins with function annotations. ESM3 has been trained at scales of up to 98 billion parameters.

ESM3 is now available in public beta via an API, enabling scientists to engineer proteins programmatically or through interactive browser-based apps. Researchers can use the EvolutionaryScale Forge API through the free academic access tier, or use the code and weights of the open model.
AI Model Simulates 500M Years to Design Fluorescent Protein