Federal funds totaling $49.4 million will underwrite core operations supporting the digital archive of protein structures
Federal science agencies have renewed and increased funding for a world-renowned digital archive of protein structures housed by Rutgers University-New Brunswick, an open-access data resource that has enabled research in everything from agriculture to zoology and has laid the groundwork for Nobel Prize-winning discoveries.
Federal funds totaling $49.4 million will underwrite efforts at the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank through 2028, Rutgers officials said. Specifically, the data bank will be supported by grants from the U.S. National Science Foundation, the U.S. Department of Energy, the National Cancer Institute, the National Institute of Allergy and Infectious Diseases, and the National Institute of General Medical Sciences of the National Institutes of Health.
The RCSB Protein Data Bank is a multi-institutional collaboration between Rutgers, the University of California San Diego and the University of California San Francisco. Previous federal funding for RCSB Protein Data Bank core operations for 2019-2023 totaled $34.3 million.
"We are honored that so many scientific and medical achievements stemming from access to the data bank have been recognized for their profound impact on science and society," said University Professor and Henry Rutgers Chair Stephen K. Burley, Director of the RCSB Protein Data Bank. "We are excited at the prospect of enabling many more insights and innovations coming from global biological and biomedical research communities."
The digital archive serves as the U.S. data center of the Worldwide Protein Data Bank partnership, which was established in 2003 to support joint management of the Protein Data Bank archive as a global public good.
At the center of it all is a quest to understand the myriad shapes and purposes of nature's proteins, the building blocks of life, and the most structurally complex and functionally sophisticated molecules known. Proteins are essential components of every cell in the body and play crucial roles in growth, repair and bodily functions. The shape or structure of each protein, essentially a chain of amino acids folded up in a complex manner, determines its biological function.
Understanding the structures of proteins is central to molecular and cellular biology because it allows scientists to predict how the proteins will interact with other molecules, design drugs targeting specific proteins and gain insights into various biological processes and diseases.
To do all that, the structure itself first needs to be discerned at the atomic level. Proteins may be among the largest of the biomolecules, but they are still too small for the human eye to see, even with a conventional optical microscope. Before researchers can submit a protein structure to the database, they must employ painstaking methods drawn from physics and chemistry to discern what the protein looks like. Common methods used to study protein structures include X-ray crystallography, nuclear magnetic resonance spectroscopy, 3D electron microscopy and cryo-electron tomography. Once deposited in the Protein Data Bank, the structure is rigorously validated and biocurated by experts.
With an initial tranche of just seven protein structures and the dream of advancing science through the free sharing of vital scientific information, Helen M. Berman, then a protein crystallographer at the Institute for Cancer Research, Fox Chase Cancer Center in Philadelphia, played a key role in co-founding the Protein Data Bank with colleagues in 1971. It was the first open-access, digital data resource in biology, designed to serve as the global archive for atomic-level, 3D structures of proteins and other large biomolecules.
Berman, now a Board of Governors Distinguished Professor Emerita of Chemistry and Chemical Biology at Rutgers-New Brunswick, brought the archive to Rutgers in 1998 and headed the effort until 2014. The Protein Data Bank, continuously funded by the U.S. government for more than five decades, has grown to include more than 230,000 3D structures of proteins, nucleic acids (both DNA and RNA), viruses (the Zika virus, for instance), and macromolecular machines, such as the RNA polymerase II responsible for synthesizing messenger RNAs.
"When we first began to think about having a Protein Data Bank in the 1960s, there were only a handful of proteins whose structures had been determined, so, it was hard to imagine that more than 50 years later there would be more than 230,000," Berman said. "My hope for the future is that having seen the enormous benefits of curating and archiving data, more scientific communities will follow the example of the Protein Data Bank."
The database has more than fulfilled Berman's vision, she said, proving to be an engine fueling scientific discovery. More than 60,000 scientists worldwide have freely contributed their data to the archive, and more than a million research papers have been published with discoveries and insights based on structures stored therein. About 10 million data files are downloaded from the Protein Data Bank daily by many millions of users working and learning in nearly every country and territory recognized by the United Nations.
As a mission-critical repository of protein structures, the Protein Data Bank enabled the 2024 Chemistry Nobel Prize winners David Baker, Demis Hassabis and John Jumper to crack the code for protein structures. Baker used the archive as a knowledge base for his protein structure design algorithms, for which his half-share of the Nobel Prize was awarded. Information stored in the archive also provided the training set for AlphaFold2, a deep learning-based artificial intelligence software tool developed by Hassabis, Jumper and the Google DeepMind team, for which they earned their half-share of the prize.
"The Protein Data Bank helped David Baker to imagine new proteins and Demis Hassabis and John Jumper to predict the structures of proteins from their amino acid sequences alone," Burley said. "The Nobelists went on record after receiving news of the prize to the effect that none of this would have happened without the open-access Protein Data Bank."
Beginning in early 2020, the Protein Data Bank emerged as a vital tool in the global effort by scientists to decipher the structure and function of the virus that caused the coronavirus disease 2019 (COVID-19) pandemic. Within months of COVID-19 cases first appearing in late 2019, scientists based in Shanghai deposited the first 3D structure of a crucial viral protein into the database. Today, more than 4,600 protein structures of SARS-CoV-2, the virus that causes COVID-19, reside in the archive, where they are made freely available to researchers, educators, and clinicians throughout the world.
In another advance, open access to the structures archived in the Protein Data Bank facilitated the design of the highly successful messenger RNA vaccines against COVID-19 and the structure-guided discovery of nirmatrelvir, the active ingredient of Paxlovid.
Many millions of scientists continue to either download data or use specialized tools from RCSB.org for research in various fields.
For Roland Dunbrack, a computational structural biologist based at the Institute for Cancer Research at Fox Chase Cancer Center in Philadelphia, access to the database has been essential to his mission. In the hope of discovering foundational knowledge that will lead to the discovery and development of anticancer drugs, Dunbrack is seeking to understand the structure and function of human protein kinases. These enzymes are a key component in how the growth of human cells is controlled and how they interact with each other. Kinases have "active" structural forms that catalyze reactions and "inactive" versions that block reactions.
"We have used the Protein Data Bank to establish criteria for the active forms of about 150 human kinase genes," said Dunbrack, adding that humans possess 437 such genes. "We were able to use the 150 experimental structures to predict the structures of all 437 such genes in their active form using the program AlphaFold2."
Dunbrack and his team are focusing on the inactive forms of human kinases. They are hoping to better understand the mechanism by which cancer-causing mutations in kinase genes destabilize inactive kinases. In doing so, the mutations prod these kinases into being active, permanently turning on growth signals in cells that will lead to cancer.
"These structural models will provide data for the development of kinase inhibitors that might act as drug treatments in cancer and other diseases," Dunbrack said.
More broadly, Dunbrack and his group will also continue to use the entire Protein Data Bank to produce structural models of many human protein complexes under study in cancer research worldwide.