Simplifying causal gene identification in GWAS loci.
| Authors | |
| Abstract | Genome-wide association studies (GWAS) help to identify disease-linked genetic variants, but pinpointing the most likely causal genes in GWAS loci remains challenging. Existing GWAS gene prioritization tools are powerful but often use complex black-box models trained on datasets containing biases. Here, we used a data-driven approach to construct a truth set of causal genes in 200 GWAS loci. We found that a simple logistic regression model performed as well as a more complex XGBoost model, and that many commonly used gene prioritization features (e.g., expression quantitative trait locus colocalization and Mendelian randomization) could be removed without meaningfully affecting performance. We present CALDERA, a gene prioritization tool that uses a logistic regression model with just four input features. In independent benchmarking datasets of resolved GWAS loci, CALDERA achieved state-of-the-art performance in comparison with other methods (FLAMES, L2G, and cS2G). CALDERA outputs causal gene probabilities for all genes in a given GWAS locus, and we show that these probabilities are well calibrated. Applying CALDERA to 93 UK Biobank traits, we predicted 11,956 putative causal genes, potentially resolving up to 52% of loci. Overall, CALDERA provides a powerful solution for prioritizing potentially causal genes in GWAS loci that minimizes the data processing required to construct input features and generates an easily interpretable output score. |
| Year of Publication | 2026 |
| Journal | PLOS Genetics |
| Volume | 22 |
| Issue | 3 |
| Pages | e1012079 |
| Date Published | 03/2026 |
| ISSN | 1553-7404 |
| DOI | 10.1371/journal.pgen.1012079 |
| PubMed ID | 41843578 |