A computational model for quantifying instability of tandem repeats across the genome.
| Authors | |
| Abstract | Tandem repeats (TRs) exhibit high levels of somatic mosaicism, which is increasingly recognized as an important modifier of repeat expansion disorders. Long-read sequencing can capture full-length repeat alleles, yet robust frameworks for quantifying instability across TRs genome-wide are still needed. Here, we introduce a general-purpose model that quantifies TR instability in a given long-read sequencing dataset without explicitly distinguishing biological mosaicism from technical noise and that is broadly applicable to both simple and structurally complex loci. This model accurately characterizes allelic instability at each TR locus by representing the distribution of read-to-consensus deviations for each allele. Using HiFi sequencing data from 256 HPRC cell line samples, we fitted models for 617,007 TR loci, including known pathogenic repeats. We observe that instability levels are generally low but vary substantially across individual TRs, and are driven more strongly by repeat composition than by overall repeat length. Furthermore, we applied our method to targeted PureTarget long-read data from samples with known repeat expansions and identified significant mosaicism in the majority of expanded alleles. Our model offers a practical way to quantify instability of tandem repeats across the genome and to detect unusually unstable repeat alleles. |
| Year of Publication | 2026 |
| Journal | bioRxiv : the preprint server for biology |
| Date Published | 04/2026 |
| ISSN | 2692-8205 |
| DOI | 10.64898/2026.04.08.717199 |
| PubMed ID | 41993463 |