Learning the "parts" of cells using topic models

Dept. of Human Genetics, and the Research Computing Center, University of Chicago

Methods for learning reduced representations of data, such as PCA and t-SNE, have become essential to single-cell genomics studies. Beyond their use in producing evocative visualizations of cell population structure, some of these methods have the potential to recover interpretable "parts" of cellular transcriptomes or epigenomes. Less well appreciated is the fact that the topic model, originally developed to analyze collections of text documents, is also well suited to single-cell genomics data.

Here, we reconsider some basic principles behind the topic model, motivated by aims that differ from those of early topic modeling papers. Specifically, we would like to make accurate inferences from large data sets, and to extract biological insights from these inferences. We show that making connections between the topic model and other models increases the potential for the topic model to achieve these aims. In particular, we borrow ideas from non-negative matrix factorization, the Structure model used in population genetics, and differential expression analysis. These borrowed ideas lead to faster and more accurate algorithms for fitting topic models, produce effective visualizations of complex cell structure from topic model fits, and suggest a new way to interpret topics.
