Inferring microbial phenotypes through latent representations of biological diversity

Indigo Agriculture

Creating orderly representations of the vast diversity of microbial life is an ancient problem. The spread of mass sequencing has revealed the inability of the Linnaean system both to identify organisms and to capture the variation between them. Marker sequences, such as 16S rRNA in Bacteria and ITS in Fungi, have been used to identify taxa, albeit crudely at times. Identifying an organism by a sequence allows explicit modeling of the distance between organisms and ordination of the resulting phylogenetic distance space. We study the use of a common topic model, latent Dirichlet allocation (LDA), to capture the differences between sequences. We show that distance in the latent space of topics reproduces alignment distance between closely related taxa. Additionally, we find that the dimensions of this space reflect the hierarchy of biological relationships. This transformation allows for fast comparison of taxa, Gaussian process modeling of the properties of unsequenced strains, and phenotypic interpolation based on their neighbors. These results represent a comprehensive and extensible methodology for the modeling of biological diversity.
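
As a rough illustration of the idea of comparing taxa in a topic space, the following minimal sketch shows two pieces that such a pipeline might contain. It rests on assumptions not stated in the abstract: that marker sequences are tokenized into overlapping k-mers so they can be treated as bags of "words", and that topic mixtures are compared with the Hellinger distance. The LDA fit itself is omitted, and the function names, strain names, and topic mixtures are hypothetical values chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=4):
    """Count overlapping k-mers, treating a sequence as a bag of 'words'.

    Assumption: k-mer tokenization is one plausible way to feed sequences
    to a topic model; the abstract does not say which tokenization is used.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 to 1)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def nearest(name, mixtures):
    """Return the other entry whose topic mixture is closest to `name`."""
    return min(
        (other for other in mixtures if other != name),
        key=lambda other: hellinger(mixtures[name], mixtures[other]),
    )


# Hypothetical per-strain topic mixtures, standing in for the output of a
# fitted LDA model. These are made-up numbers, not results.
theta = {
    "strain_a": [0.80, 0.15, 0.05],
    "strain_b": [0.75, 0.20, 0.05],
    "strain_c": [0.05, 0.10, 0.85],
}

if __name__ == "__main__":
    print(kmer_counts("ACGTACGT", k=4))
    print(nearest("strain_a", theta))
```

Once every strain is reduced to a short vector of topic proportions, comparing two strains costs a handful of arithmetic operations rather than a sequence alignment, which is the sense in which a latent representation can make comparison fast.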
