Analysis of the limited M. tuberculosis accessory genome reveals potential pitfalls of pan-genome analysis approaches.
Authors | |
Abstract | Pan-genome analysis is a fundamental tool in the study of bacterial genome evolution. Benchmarking the accuracy of pan-genome analysis methods is challenging, because results can be significantly influenced both by the methodology used to compare genomes and by differences in the accuracy and representativeness of the genomes analyzed. In this work, we curated a collection of 151 Mycobacterium tuberculosis (Mtb) isolates to evaluate sources of variability in pan-genome analysis. Mtb is characterized by its clonal evolution, absence of horizontal gene transfer, and limited accessory genome, making it an ideal test case for this study. Using a state-of-the-art graph-genome approach, we found that a majority of the structural variation observed in Mtb originates from rearrangement, deletion, and duplication of redundant nucleotide sequences. In contrast, we found that pan-genome analyses that focus on comparison of coding sequences (at the amino acid level) can yield surprisingly variable results, driven by differences in assembly quality and the software used. Upon closer inspection, we found that coding sequence annotation discrepancies were a major contributor to inflated accessory genome estimates. To address this, we developed panqc, a software tool that detects annotation discrepancies and collapses nucleotide redundancy in pan-genome estimates. We characterized the effect of the panqc adjustment on pan-genome analyses of both Mtb and E. coli genomes, and highlight how different levels of genomic diversity are prone to unique biases. Overall, this study illustrates the need for careful methodological selection and quality control to accurately map the evolutionary dynamics of a bacterial species. |
Year of Publication | 2024 |
Journal | bioRxiv : the preprint server for biology |
Date Published | 03/2024 |
DOI | 10.1101/2024.03.21.586149 |
PubMed ID | 38585972 |