Building and evaluating generative models of biological sequences, from proteins to whole genomes

Marks Lab, Harvard Medical School

Across biology and biomedicine, scientists are interested in measuring sequences, predicting sequences, and testing their predictions experimentally by synthesizing or editing sequences. Generative probabilistic modeling offers a flexible and rigorous framework for learning from sequence data and forming predictions, but building, inferring, and critiquing probabilistic models of biological sequences remains challenging. In this talk we outline the major practical and theoretical limitations of existing techniques and propose alternatives. First, we describe a structured output distribution for protein data, the “MuE” distribution, that enables the creation of regression models, forecasting models, latent feature models, and more; models built with the MuE do not require alignments for training and meet key theoretical conditions. Second, we describe a new generative model that can be scaled to whole genomes, the “BEAR” model, and use it to construct a nonparametric density estimator, robust parameter estimators, a goodness-of-fit test, and a two-sample test, each with consistency guarantees. We illustrate these methods on a range of biological problems, including characterizing immune receptor repertoires, mapping disordered protein families, comparing metagenomic samples, exploring unaligned read data, and forecasting pathogen evolution.
