PMCID
PMC13042128

Integrating 730,947 exome sequences with clinical literature improves gene discovery.

medRxiv : the preprint server for health sciences
Abstract

Accurate estimates of allele frequencies aid in genetic discovery, including rare disease diagnosis, common disease investigations, and population genetics. Here, we present the Genome Aggregation Database version 4 (gnomAD v4), including 730,947 exome sequences, a fivefold increase over previous releases. We demonstrate that statistical power to detect strong selective constraint continues to increase with sample size. We develop a new loss-of-function annotation pipeline, which learns genomic features predictive of nonsense-mediated decay and splicing effects from selection signals, achieving 90% precision for distinguishing likely true versus false positive loss-of-function variants. This improved pipeline, along with incorporation of highly deleterious missense variants into measures of loss-of-function intolerance, improves disease gene detection, particularly for short genes and those with gain-of-function mechanisms. To improve disease gene prediction, we systematically extract gene-disease associations from biomedical literature, map these to gene-level biological features, and integrate both with refined constraint metrics within a Bayesian framework, yielding state-of-the-art prediction of gene-disease relevance. We highlight genes under strong constraint but with limited clinical characterization, which are enriched in embryonic lethal and fertility phenotypes, thus prioritizing previously under-characterized disease genes. Together, these advances establish a unified framework for accelerating gene discovery and improving rare disease diagnosis.

Year of Publication
2026
Journal
medRxiv : the preprint server for health sciences
Date Published
03/2026
DOI
10.64898/2026.03.23.26349081
PubMed ID
41929314