Defining and cataloging variants in pangenome graphs.
Abstract | Structural variation causes some human haplotypes to align poorly with the linear reference genome, leading to 'reference bias'. A pangenome reference graph could ameliorate this bias by relating a sample to multiple reference assemblies. However, this approach requires a new definition of a 'genetic variant.' We introduce a definition of pangenome variants and a method, , to identify them. Our approach involves a pangenome which includes all nodes (sequences) of the pangenome graph, but only a subset of its edges; non-reference edges are . Our variants are biallelic and have well-defined positions. Analyzing the Minigraph-Cactus draft human pangenome reference graph, we identified 29.6 million genetic variants. Most variants (99.2%) are small, and most small variants (73.9%) are SNPs. 3.5 million variants (11.7%) have a reference allele which is not on GRCh38; these variants are difficult to detect without a pangenome reference, or with existing pangenome-based approaches. They tend to be embedded within tangled, multiallelic regions. We analyze two medically relevant regions, around the HLA-A and RHD genes, identifying thousands of small variants embedded within several large insertions, deletions, and inversions. We release an open-source software tool together with a VCF variant catalogue. |
Year of Publication | 2025
Journal | bioRxiv : the preprint server for biology
Date Published | 08/2025
ISSN | 2692-8205
DOI | 10.1101/2025.08.04.668502
PubMed ID | 40799530