The Limits of Virus Discovery, and How to Overcome Them
Department of Molecular Genetics; Donnelly Centre for Cellular and Biomolecular Research; University of Toronto
Over the past 14 years, public databases have archived >30 petabases (3x10^16 bases) of sequencing data in >10 million samples, a modern ark of Earth’s genetic diversity. Millions of these samples contain viral sequences, often captured incidentally to the goals of the original study.
Recently, we developed Serratus, an ultra-high-throughput cloud-computing architecture for exploring the genetic diversity of RNA viruses. In 11 days, we processed 5.7 million sequencing datasets (10.2 petabases) and discovered >130,000 novel RNA viruses, a 9.8-fold increase relative to the 15,000 known RNA viruses.
We will review the methods used in virus discovery and how the limitations of each algorithm clash with the biology of RNA viruses. We will then look ahead to novel algorithms that promise faster and more sensitive homology search.
Together, advances in sequence homology algorithms and the exponential growth of public sequencing data will drive a new “Platinum Age” of virus discovery. We aim to uncover 100 million RNA viruses by 2030.
We are also organizing, together with other researchers, an "RdRp Summit" to establish a systematic and interoperable data standard for RNA viruses. If you are interested in computational standards for RNA virus classification, please email: rdrp.summit at gmail dot com