Defining and cataloging variants in pangenome graphs.

bioRxiv : the preprint server for biology
Authors
Abstract

Structural variation causes some human haplotypes to align poorly with the linear reference genome, leading to 'reference bias'. A pangenome reference graph could ameliorate this bias by relating a sample to multiple reference assemblies. However, this approach requires a new definition of a 'genetic variant.' We introduce a definition of pangenome variants and a method, , to identify them. Our approach involves a pangenome which includes all nodes (sequences) of the pangenome graph, but only a subset of its edges; non-reference edges are . Our variants are biallelic and have well-defined positions. Analyzing the Minigraph-Cactus draft human pangenome reference graph, we identified 29.6 million genetic variants. Most variants (99.2%) are small, and most small variants (73.9%) are SNPs. 3.5 million variants (11.7%) have a reference allele which is not on GRCh38; these variants are difficult to detect without a pangenome reference, or with existing pangenome-based approaches. They tend to be embedded within tangled, multiallelic regions. We analyze two medically relevant regions, around the HLA-A and RHD genes, identifying thousands of small variants embedded within several large insertions, deletions, and inversions. We release an open-source software tool together with a VCF variant catalogue.
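
The percentages in the abstract can be turned into approximate counts with a few lines of arithmetic. The sketch below is only an illustration built from the rounded figures quoted above; the constant and function names are hypothetical, and the results are estimates, not counts reported by the authors.

```python
# Rounded figures as quoted in the abstract.
TOTAL_VARIANTS = 29.6e6          # variants identified in the Minigraph-Cactus graph
SMALL_FRACTION = 0.992           # share of all variants that are small
SNP_FRACTION_OF_SMALL = 0.739    # share of small variants that are SNPs
OFF_GRCH38_VARIANTS = 3.5e6      # variants whose reference allele is not on GRCh38


def implied_counts():
    """Return approximate counts implied by the abstract's rounded figures."""
    small = TOTAL_VARIANTS * SMALL_FRACTION
    snps = small * SNP_FRACTION_OF_SMALL
    return {
        "small": small,
        "large": TOTAL_VARIANTS - small,
        "snps": snps,
        "small_non_snp": small - snps,
        # Fraction recomputed from rounded inputs, so it can differ
        # slightly from the published percentage.
        "off_grch38_fraction": OFF_GRCH38_VARIANTS / TOTAL_VARIANTS,
    }


if __name__ == "__main__":
    for name, value in implied_counts().items():
        print(f"{name}: {value:,.3f}")
```

Under these assumptions, roughly 29.4 million variants are small, about 21.7 million of those are SNPs, and around 0.24 million variants are not small. Because the inputs are rounded, the recomputed off-GRCh38 fraction need not match the published 11.7% exactly.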

Year of Publication
2025
Journal
bioRxiv : the preprint server for biology
Date Published
08/2025
ISSN
2692-8205
DOI
10.1101/2025.08.04.668502
PubMed ID
40799530