Integrating 730,947 exome sequences with clinical literature improves gene discovery.
| Authors | |
| Abstract | Accurate estimates of allele frequencies aid in genetic discovery, including rare disease diagnosis, common disease investigations, and population genetics. Here, we present the Genome Aggregation Database version 4 (gnomAD v4), including 730,947 exome sequences, a fivefold increase over previous releases. We demonstrate that statistical power to detect strong selective constraint continues to increase with sample size. We develop a new loss-of-function annotation pipeline, which learns genomic features predictive of nonsense-mediated decay and splicing effects from selection signals, achieving 90% precision for distinguishing likely true versus false positive loss-of-function variants. This improved pipeline, along with incorporation of highly deleterious missense variants into measures of loss-of-function intolerance, improves disease gene detection, particularly for short genes and those with gain-of-function mechanisms. To improve disease gene prediction, we systematically extract gene-disease associations from biomedical literature, map these to gene-level biological features, and integrate both with refined constraint metrics within a Bayesian framework, yielding state-of-the-art prediction of gene-disease relevance. We highlight genes under strong constraint but with limited clinical characterization, which are enriched in embryonic lethal and fertility phenotypes, thus prioritizing previously under-characterized disease genes. Together, these advances establish a unified framework for accelerating gene discovery and improving rare disease diagnosis. |
| Year of Publication | 2026 |
| Journal | medRxiv : the preprint server for health sciences |
| Date Published | 03/2026 |
| DOI | 10.64898/2026.03.23.26349081 |
| PubMed ID | 41929314 |