
HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes.

Bioinformatics (Oxford, England)

Authors	Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke O'Connor, Samuel Kaski, Pekka Marttinen, Pier Palamara, Christoph Lippert, Andrea Ganna
Abstract	MOTIVATION: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. RESULTS: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven methods to generate polygenic risk scoring across multiple ancestry groups and different genetic architectures. AVAILABILITY AND IMPLEMENTATION: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at . The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at .
Year of Publication	2023
Journal	Bioinformatics (Oxford, England)
Volume	39
Issue	9
Date Published	09/2023
ISSN	1367-4811
DOI	10.1093/bioinformatics/btad535
PubMed ID	37647640
