Argonne team delivers a 100x speedup of genetic data analysis from the Million Veteran Program
It’s already the MVP of genetic diversity among biobanks
DOE/Argonne National Laboratory
Argonne computer scientists enabled a new research modality in the search for causality among genetic variants for a given disease.
For the past three years, Ravi Madduri, a senior computer scientist at the U.S. Department of Energy’s (DOE) Argonne National Laboratory, and his team have been working on ways to speed up analysis of genetic data to accelerate the search for associations among genetic variants — and thus causality — in determining the risks for different diseases and identifying at-risk populations.
The data used in this pursuit comes from an ambitious project started by the U.S. Department of Veterans Affairs (VA) called the Million Veteran Program (MVP). A defining aspect of this project and the data collected is the diversity of the veterans contributing DNA samples: nearly one-third (29%) are of non-European ancestry.
The MVP is the VA’s largest effort to research improving health care for veterans. It is also among the largest programs studying genes and health in the world. And, since its 2011 launch, it has reached its namesake milestone, as one million veterans had contributed their DNA as of late 2023.
MVP’s data: vast in extent and representation
“Why are we so excited? Why is this important? Several factors have come together in the MVP to take this work to the next level in applicability and thus in its explanatory power and significance,” says Madduri.
“First, even before the sequencing of the human genome, we’ve known that it’s had a role to play in disease.”
But there’s a big “however.” Madduri continues, “The fruits of sequencing haven’t entirely lived up to the hype: or rather, the hype was only realized for monogenic disorders, or diseases that are caused by mutations impacting a single gene, such as in the case of cystic fibrosis, sickle cell anemia, Tay-Sachs and Huntington’s disease.”
A quick refresher: the human genome is organized into 23 pairs of chromosomes that together carry roughly 20,000 protein-coding genes. Each gene is a sequence of the four bases adenine (A), cytosine (C), guanine (G), and thymine (T), where A pairs with T and C with G. The human genome is the complete sequence of these base pairs. In the context of this study, a genetic variant is a position where a single base is replaced by another.
As more studies were performed, researchers came to realize that genetic associations for most diseases are dispersed across numerous genetic variants at many positions in the genome. Thus, with cardiovascular diseases (such as strokes) or types of cancer, different genetic variants increase risk, and with the variants spread across different genes, the hunt is on to find those that are related.
The strength of an association helps researchers judge whether a variant may play a causal role in disease formation. Specifically, researchers compute a p-value for each variant’s association with the disease in question. If the p-value falls below a stringent significance threshold, the association is deemed statistically significant and the variant becomes a candidate causal factor for follow-up. These associations are subsequently reported in the literature.
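To make the thresholding step concrete, here is a minimal sketch in Python. It assumes each variant's association has already been summarized as a z-score, which is a hypothetical input, since the article does not say which test statistic the team used. The cutoff of 5e-8 is the conventional genome-wide significance threshold in the field, not a value stated in this article, and the variant IDs are invented.

```python
import math

# Conventional genome-wide significance threshold (an assumption here;
# the article does not state the threshold the study used).
GENOME_WIDE_THRESHOLD = 5e-8


def two_sided_p_value(z_score):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z_score) / math.sqrt(2.0))


def significant_variants(z_scores, threshold=GENOME_WIDE_THRESHOLD):
    """Return the IDs of variants whose p-value falls below the threshold."""
    return [variant_id for variant_id, z in z_scores.items()
            if two_sided_p_value(z) < threshold]


# Hypothetical z-scores for three made-up variant IDs.
example = {"variant_a": 1.2, "variant_b": 6.1, "variant_c": -5.0}
hits = significant_variants(example)
print(hits)
```

A threshold this strict reflects the number of tests involved: when millions of variants are examined, a looser cutoff would flag many associations by chance alone.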
“Another reason why this work is so exciting is that in clinical research studies, having diversity in participants helps ensure that research findings will apply more widely — to people of all ancestries. Among all the biobanks in the world, the MVP is the most diverse: studies based on using it have wider applicability as a result,” Madduri says.
“And, as we’ve discussed in previous work,” Madduri continues, “disease factors for previously underrepresented groups have been surfaced and identified through the MVP.”
Working with the MVP’s dataset, the case sample size in the 2023 study referenced above was increased over previous studies by the following percentages: 43% for people of European ancestry, 26% for those with Asian ancestry, 45% for Hispanic groups — and by a whopping 87% for those with African ancestry. The study additionally reported that the genetic risk score for men of African ancestry showed a greater risk of aggressive versus non-aggressive disease for prostate cancer among 451 risk variants.
As U.S. veterans continue to contribute their DNA to the MVP, the information it contains will only increase in representation and applicability.
With so much data… challenges: laying down tracks for future genetics research
Stepping into the gap to perform the cross comparisons and identify associations are genome-wide association studies. These studies test each genetic variant in the biobank for a statistical association with a disease or trait, tasks that are data intensive and benefit from high performance computing.
While such a comparison may seem straightforward, the magnitude of the data and the number of comparisons required led researchers at the VA-MVP to turn to DOE researchers to innovate on the computing approaches deployed. “Another leap in advancing techniques was needed,” says Madduri.
For a while, the team had been trying to run the computations on Oak Ridge National Laboratory’s Summit supercomputer, on which it had been allocated time. But the team ran into challenges while using Summit’s central processing units (CPUs) to complete the analysis. Because the data from each ancestral cohort being analyzed was larger than the CPUs could process, the analysis failed to complete, even on one of the world’s largest supercomputers.
At that point, says Madduri, “we decided to open the box and look at the code.”
The team’s realization was that the fundamental computational block in the analysis is a large matrix-matrix multiplication. The solution? Use graphics processing units (GPUs) to perform matrix manipulation instead of relying on CPUs.
At the same time, advances in artificial intelligence (AI) had driven the widespread adoption of GPUs, which perform exceptionally well for matrix multiplication. From there, the team revised the code to use GPUs and realized a 100x speedup in runtime, which made calculating 300 billion associations possible.
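To illustrate why a matrix-matrix multiplication can sit at the core of this kind of analysis, the sketch below shows one simplified formulation. It is an assumption for illustration and not the team's actual code: if genotypes and traits are each standardized to zero mean and unit variance, the correlation of every variant with every trait is a single product of two matrices. The example uses plain Python so that it runs anywhere; in practice the same product would be handed to a GPU array library. All function names and data are hypothetical.

```python
import math


def standardize(column):
    """Scale a list of numbers to zero mean and unit (population) variance.

    Assumes the values are not all identical, so the standard deviation
    is non-zero.
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]


def matmul(a, b):
    """Multiply matrix a (rows x k) by matrix b (k x cols), as lists of lists."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns]
            for row in a]


def association_scan(genotypes, traits):
    """Correlate every variant with every trait in one matrix product.

    genotypes: one list per variant, holding allele counts per individual.
    traits:    one list per trait, holding values per individual.
    Returns a variants x traits matrix of Pearson correlations.
    """
    n = len(genotypes[0])
    g = [standardize(v) for v in genotypes]        # variants x individuals
    y_t = [standardize(t) for t in traits]         # traits x individuals
    y = [list(row) for row in zip(*y_t)]           # individuals x traits
    product = matmul(g, y)                         # variants x traits
    return [[value / n for value in row] for row in product]


# Toy data: 2 variants, 2 traits, 4 individuals (all values invented).
genotypes = [[0, 1, 2, 1], [2, 2, 0, 0]]
traits = [[1.0, 2.0, 3.0, 2.0], [5.0, 1.0, 4.0, 2.0]]
correlations = association_scan(genotypes, traits)
for row in correlations:
    print([round(r, 3) for r in row])
```

The number of multiply-add operations in this product grows with variants times traits times individuals, which is the kind of dense, regular arithmetic that GPUs handle well.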
Results point to exciting research possibilities
“Creating this capability will be fundamental to all research going forward because of the increases in the cohorts of populations that are being genotyped,” says Madduri, adding, “This work will also help level the playing field for conditions that are often not studied because they are rare.”
The team used about half of Summit, or 2,000 nodes with six GPUs per node, for 14 days in late December 2023. Specifically, they distributed the analysis across the six GPUs per node to realize the unprecedented speed and accuracy.
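As a rough sense of scale, the figures above can be turned into a GPU count and an upper bound on GPU time. This short Python sketch uses only the numbers quoted in the article and assumes, for simplicity, that every GPU was in use for the full 14 days, which the article does not confirm.

```python
# Figures quoted in the article.
nodes = 2_000
gpus_per_node = 6
days = 14

total_gpus = nodes * gpus_per_node
# Upper bound: assumes all GPUs ran continuously for the whole period.
max_gpu_days = total_gpus * days

print(f"{total_gpus:,} GPUs, at most {max_gpu_days:,} GPU-days")
```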
Without these changes, completing the analysis would not have been practical: running a job of this size would have taken several additional months of computational time and required limiting the sample size to 50% of the MVP’s data.
Work published in Science in July 2024 made use of the team’s speedup capabilities on Summit. “Looking forward, we want to do this on a larger cohort (a million individuals) and then expand to a collection of entire human genomes,” adds Madduri. Given that an entire human genome has 2 billion biomarkers whereas the present study compared 45 million biomarkers, such an effort will be more than an order of magnitude larger in size.
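The "more than an order of magnitude" comparison can be checked with a few lines of Python. The sketch below uses only the two biomarker counts quoted above and assumes, as a simplification, that the size of the problem scales in proportion to the number of biomarkers.

```python
import math

# Figures quoted in the article.
current_biomarkers = 45_000_000
whole_genome_biomarkers = 2_000_000_000

# Simplifying assumption: problem size grows linearly with biomarker count.
scale_factor = whole_genome_biomarkers / current_biomarkers
orders_of_magnitude = math.log10(scale_factor)

print(f"{scale_factor:.1f}x larger, "
      f"about {orders_of_magnitude:.2f} orders of magnitude")
```

The ratio comes out to roughly 44, which is between one and two orders of magnitude.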
“The crown jewel of the collaboration with the genetics researchers (the study authors) is that we really accomplished what we set out to do in spite of early setbacks,” says Madduri.
A research bonus: dividends in motivation and inspiration
Madduri also observed a bonus on the research team: incredible commitment to enabling the computations and analysis of the MVP data. Noticing the determined work and long hours of co-researcher Alex Rodriguez, an Argonne computational scientist, Madduri sought to understand his motivation. Rodriguez, who is originally from Peru, responded, “I want to do this for my people.”
Widening the scope to encompass all veterans, Madduri returns to a point he never tires of expressing. “This database exists based on people who have made sacrifices and give everything for our country,” he says. “We owe them the best of what we have to offer — in the form of better care through targeted therapeutics, for example — that insights from these kinds of studies can provide.”
Argonne National Laboratory seeks solutions to pressing national problems in science and technology by conducting leading-edge basic and applied research in virtually every scientific discipline. Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.
The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.