A tool for more accurate disease risk prediction across diverse populations

By integrating genomic data from different ancestries, PRS-CSx reduces bias in risk-predicting polygenic scores.

By MGH Office of News and Public Affairs

May 5, 2022

Credit: Ricardo Job-Reese, Broad Communications

Polygenic risk scores (PRS) are promising tools for predicting disease risk based on genetics, but current versions have built-in bias that can affect their accuracy in some populations and result in health disparities. However, a team of researchers from Massachusetts General Hospital (MGH), the Broad Institute of MIT and Harvard, and Shanghai Jiao Tong University in Shanghai, China, has designed a new method for generating PRS, dubbed PRS-CSx, that more accurately predicts disease risk across populations, which they report in .

Alterations in the genome's DNA sequence can produce genetic variants that increase the risk for disease. Some genetic variants are closely linked to specific diseases, as BRCA1 mutations are to breast cancer.

"However, most common human diseases – such as type 2 diabetes, high blood pressure, and depression, for example – are influenced not by single genes, but by hundreds or thousands of genetic variants across the genome," said Tian Ge, an applied mathematician and biostatistician in the Psychiatric and Neurodevelopmental Genetics Unit of the MGH Center for Genomic Medicine, and a postdoctoral scholar in the Stanley Center for Psychiatric Research at the Broad Institute. "Each variant contributes a small effect."

PRS aggregate variants' effects across the genome and have shown promise for one day being used to predict individual patients' chances of developing diseases. That would allow clinicians to recommend preventive measures and monitor patients closely for early diagnosis and intervention.
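As a rough illustration of what "aggregating variants' effects" means, a polygenic score is commonly computed as a weighted sum: for each variant, the number of risk alleles a person carries (0, 1, or 2) is multiplied by that variant's estimated effect size, and the products are added up. The sketch below shows only this generic weighted sum, not the PRS-CSx method itself. The function name, effect sizes, and allele counts are hypothetical values chosen for illustration.

```python
def polygenic_score(dosages, effect_sizes):
    """Return the weighted sum of risk-allele counts across variants.

    dosages: number of risk alleles carried at each variant (0, 1, or 2).
    effect_sizes: estimated per-allele effect of each variant.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical example: three variants, each with a small effect.
effect_sizes = [0.5, -0.25, 0.125]  # made-up per-allele effects
dosages = [2, 1, 0]                 # made-up allele counts for one person

score = polygenic_score(dosages, effect_sizes)
print(score)
```

In practice a score of this kind would sum over hundreds or thousands of variants, which is why the accuracy of each estimated effect size matters so much.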

However, a PRS must be "trained" to predict disease risk, using data from studies in which genomic information is collected from large groups of individuals. While many disease-causing variants are shared, Ge explained, there are important differences in the genetic basis of a disease between individuals of different ancestries. For example, a common genetic variant that is associated with a specific disease in one population may be present at a lower frequency, or even be missing, in others. In addition, a shared variant's effect size (how much it increases risk) may also vary from one ancestral group to another.

As a result, PRS trained using data from one population often have attenuated, or reduced, performance when applied to other populations.

"A major problem with existing methods for PRS calculation is that, to date, most of the genomic studies used data collected from individuals of European ancestry," Ge said. That creates a Euro-centric bias in existing PRS, producing substantially less-accurate predictions and raising the possibility that they could over- or underestimate disease risk in non-European populations.

Improved accuracy from expanded diversity

Fortunately, investigators have increased efforts to collect genomic data from underrepresented populations. Leveraging these resources, Ge and his colleagues created PRS-CSx, which can integrate data from multiple populations and account for genetic similarities and differences between them. While there are still significantly more genomic data available on individuals of European ancestry, the investigators used computational methods that allowed them to maximize the value of non-European data and improve prediction accuracy in ancestrally diverse individuals.

In the study, the investigators used genomic data from individuals in several different populations to predict a wide range of physical measures (such as height, body mass index, and blood pressure), blood biomarkers (such as glucose and cholesterol), and the risk for schizophrenia. Then they compared the predicted trait or disease risk with actual measures or reported disease status to measure PRS-CSx's prediction accuracy. The study's results demonstrated that PRS-CSx is significantly more accurate than existing PRS tools in non-European populations.
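The article does not say which accuracy metric the study used. One common choice for quantitative traits such as height is the squared correlation (R²) between predicted and measured values, which is assumed here purely for illustration. The function name and the data below are hypothetical.

```python
def squared_correlation(predicted, observed):
    """Return the squared Pearson correlation between two equal-length lists."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return (cov * cov) / (var_p * var_o)


# Hypothetical predicted scores and measured trait values for four people.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.0, 3.0, 2.0, 4.0]

accuracy = squared_correlation(predicted, observed)
print(round(accuracy, 2))
```

A value near 1 would mean the score tracks the measured trait closely, and a value near 0 would mean it carries little predictive information. For disease status (a yes/no outcome), a different metric would typically be used.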

"The goal of our work was to narrow the gap between the prediction accuracy in underrepresented populations relative to European individuals, and narrow the gap in health disparities when implementing PRS in clinical settings," said Ge, who noted that the new tool will continue to be refined with the hope that clinicians may one day use it to inform treatment choices and make recommendations about patient care.

PRS-CSx could also have a role in basic research, according to the study's lead author, Yunfeng Ruan, a postdoctoral research fellow in the Cardiovascular Disease Initiative at the Broad Institute. It could be used, for example, to explore gene-environment interactions, such as how the effect of genetic risk would depend on the level of environmental risk factors in global populations.

Even with PRS-CSx, the gap in prediction accuracy between European and non-European populations remains considerable. Greater sample diversity is crucial to further improve the prediction accuracy of PRS in diverse populations.

"The expansion of non-European genomic resources, coupled with advanced analytic methods like PRS-CSx, will accelerate the equitable deployment of PRS in clinical settings," said Hailiang Huang, a statistical geneticist in the Stanley Center at the Broad Institute and the Analytic and Translational Genetics Unit at MGH, and co-senior author of the paper.

This work was supported by the National Institute on Aging, the National Human Genome Research Institute, the National Institute of Diabetes and Digestive and Kidney Diseases, the National Institute of Mental Health, the Brain & Behavior Research Foundation, the Zhengxu and Ying He Foundation, and the Stanley Center for Psychiatric Research.

Adapted from .