Succinct tree sequences for megasample genomics

Big Data Institute, University of Oxford

The era of megasample genomics, where datasets routinely contain millions of sampled genomes, is upon us. Present-day computational methods are fundamentally organised around the variant matrix, where each row describes the observations for every sample at a given genomic location. At megasample-scale, such matrices are massively unwieldy and cannot be processed without complex parallel algorithms. We show that the recently-introduced succinct tree sequence data structure has the potential to hugely reduce storage and processing costs; that it directly encodes important biological signals; and that it has led to efficiency gains of several orders of magnitude in simulation and whole-genome ancestry inference. We examine the underlying algorithmic properties of tree sequences that enable such breakthrough performance gains, and also discuss a preliminary algorithm for exactly solving the Li and Stephens model in logarithmic time.

ӳ��ý