To conduct a groundbreaking study of genetic data from more than half a million U.S. veterans, scientists needed tools of the kind found only at the Department of Energy's Oak Ridge National Laboratory.
"This particular study is probably the crown jewel of the field so far," said Ravi Madduri, a senior computational scientist at Argonne National Laboratory and senior author of the study. "It's not only one of the largest genome-wide association studies ever done, but it's analyzing some of the most diverse data ever assembled. Its insights bring us a step closer to the long-sought goal of precision medicine, which would use personalized cures tailored to an individual's genetic makeup."
The study, published in Science in July, examined the genetic architecture of 2,068 traits based on statistics collected by the Department of Veterans Affairs' Million Veteran Program, or MVP, from 635,969 veterans of all ages, races and backgrounds.
To analyze those numbers, the research team turned to ORNL, home of the Oak Ridge Leadership Computing Facility's Summit supercomputer (one of the world's top computing systems at the time, now decommissioned).
"We are honored to collaborate with the VA to tackle some of the most pressing challenges in healthcare - work that not only advances the well-being of our veterans but also drives groundbreaking discoveries with far-reaching impact on human health," said ORNL's Anuj Kapadia, who oversees advanced computing for health sciences and helped coordinate ORNL's role in the study.
The research team's approach relied on Summit's speeds of 200 petaflops, or 200 quadrillion calculations per second.
"One of the primary roadblocks to do these moonshot analyses has traditionally been access to the necessary level of computing," Madduri said. "This is what's known as a population-level genome-wide association study, or GWAS. We've traditionally had a poor understanding of risks for minority populations in the U.S. due to a lack of data. The VA had that data. But the magnitude of the data was much larger than what the scientific community's used to for this kind of analysis.
"That's where Summit came in. The diversity and scale make this study stand out, and we couldn't have done this study with any other set of computers."
The VA data had initially been prepared for storage and analysis on CPU-based computers rather than for the GPUs that powered Summit. The research team converted that data to run on GPUs, a task that took years on its own.
"This kind of analysis depends on multiplying out associations for each data point, so the calculations become exponentially large in a hurry," Madduri said. "Using CPUs would have required approximating those equations and sacrificing accuracy. Summit's GPUs were able to divide those giant equations across nodes without sacrificing accuracy or detail."
The study calculated associations between all genotypes, or combinations of genes, and phenotypes, or detectable characteristics, from the participating veterans. The raw data amounted to more than 30 terabytes uncompressed - the equivalent of about 200 million pages of text.
The analysis ran for more than 500,000 node-hours and led to more than 350 billion examinations of associations between nearly 44 million genetic variants and more than 2,000 traits. Researchers pinpointed 26,049 variant-trait associations spanning 1,270 health traits.
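For readers curious about what an association scan involves, the sketch below tests every variant against every trait on a toy scale. It is an illustration only and not the study's pipeline: the Pearson correlation stands in for the statistical models a real GWAS would use, and the function names, variant and trait labels, and data values are all hypothetical. What it does show is why the workload balloons, since the number of tests equals the number of variants multiplied by the number of traits.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no association signal
    return cov / math.sqrt(var_x * var_y)


def association_scan(genotypes, phenotypes):
    """Score every (variant, trait) pair.

    genotypes:  dict of variant name -> allele counts (0, 1 or 2) per person
    phenotypes: dict of trait name   -> measurements per person
    Returns a dict mapping (variant, trait) -> correlation.
    """
    return {
        (variant, trait): pearson_r(counts, values)
        for variant, counts in genotypes.items()
        for trait, values in phenotypes.items()
    }


# Hypothetical toy data: 6 people, 2 variants, 2 traits.
genotypes = {
    "variant_A": [0, 1, 2, 0, 1, 2],
    "variant_B": [1, 1, 0, 2, 0, 2],
}
phenotypes = {
    "trait_X": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],  # rises in step with variant_A
    "trait_Y": [5.0, 3.0, 4.0, 4.0, 5.0, 3.0],
}

results = association_scan(genotypes, phenotypes)
n_tests = len(results)  # variants x traits = 2 x 2

for (variant, trait), r in sorted(results.items()):
    print(f"{variant} vs {trait}: r = {r:+.3f}")
```

With two variants and two traits the scan makes four comparisons. The same loop over tens of millions of variants and thousands of traits would make tens of billions, which is the kind of load that gets spread across many GPU nodes.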
Results will be made available for future research through the National Institutes of Health's National Library of Medicine.
"We now have a world-class database that's the first of its kind," Madduri said. "There are other banks of genetic data such as the UK Biobank, but none with this kind of diversity and scale. We hope others will build upon the associations we've identified in this particular study."
Researchers couldn't simply dive into such a huge amount of data unassisted. An ORNL team built a computing pipeline to enable large-scale visualizations of the data - including charts, graphs and plots - along with CIPHER, an online knowledge-sharing platform that makes the visualizations available to researchers worldwide.
"Our goal was to make it available in such a way that anyone can explore the data at a high level in an easily accessible way," said David Heise, an ORNL software engineer who helped lead the pipeline team.
The visualizations use only summary statistics and contain no individual health information.
"We're pleased this customized research tool is hosted at ORNL to assist the VA in supporting veterans and will continue to assist further studies for broader research," said Laura Davies, a project manager on the ORNL team.
Support for this research came from the VA and from the DOE Office of Science's Advanced Scientific Computing Research program. The OLCF is an Office of Science user facility at ORNL.
UT-Battelle manages ORNL for DOE's Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE's Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science. - Matt Lakin