Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq

Sabeti Lab, Dept. of Systems Biology, Harvard Medical School

While matrix factorizations such as PCA or ICA are commonly used for dimensionality reduction of single-cell RNA-Seq data, the dimensions they infer may not align with biologically meaningful gene expression programs and are frequently ignored in practice. Here, I will discuss analysis of real and simulated single-cell data showing that matrix factorization can yield components corresponding to cell types and to cellular activities such as life-cycle processes or responses to environmental stimuli. However, one limitation of many matrix factorizations is that their stochastic optimization algorithms can yield different solutions when run multiple times on the same dataset, which reduces the interpretability of the result. To address this limitation, we developed a meta-analysis approach that we call consensus matrix factorization, which averages over multiple replicates to increase the robustness of the solution. We show with simulated data that the consensus implementation of NMF (cNMF), in particular, outperforms several other factorizations at inferring cell-type and activity programs, including the relative contribution of each program in each cell. Applied to published brain organoid and visual cortex single-cell RNA-Seq datasets, cNMF refines the hierarchy of cell types and identifies both expected (e.g. cell-cycle and hypoxia) and intriguing novel activity programs. We make cNMF available to the community and illustrate how this approach can provide key insights into gene expression variation within and between cell types.
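
The consensus idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cNMF implementation: it assumes a basic multiplicative-update NMF, a simple k-means step to group components from different replicates, and a per-cluster median as the "averaging" step, none of which the abstract specifies. The function names (`nmf`, `kmeans`, `consensus_nmf`), the simulated data, and all parameter values are hypothetical.

```python
import numpy as np

def nmf(X, k, seed, n_iter=200, eps=1e-9):
    """Basic NMF (X ~ W @ H) via multiplicative updates from a random start."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def kmeans(points, k, seed=0, n_iter=50):
    """Plain Lloyd's k-means; returns a cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def consensus_nmf(X, k, n_replicates=5, eps=1e-9):
    """Assumed consensus scheme: pool replicate components, cluster, take medians."""
    pooled = []
    for seed in range(n_replicates):
        _, H = nmf(X, k, seed)
        # Unit-normalize each component so replicates are comparable.
        pooled.append(H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12))
    pooled = np.vstack(pooled)
    labels = kmeans(pooled, k)
    programs = np.vstack([np.median(pooled[labels == j], axis=0)
                          for j in np.unique(labels)])
    # Re-estimate per-cell usage with the consensus programs held fixed.
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], programs.shape[0])) + eps
    for _ in range(200):
        W *= (X @ programs.T) / (W @ programs @ programs.T + eps)
    return W, programs

# Simulated counts: 60 cells x 40 genes generated from 3 hypothetical programs.
rng = np.random.default_rng(42)
true_usage = rng.dirichlet(np.ones(3) * 0.3, size=60)
true_programs = rng.gamma(shape=0.5, scale=2.0, size=(3, 40))
X = rng.poisson(true_usage @ true_programs * 20).astype(float)

W, programs = consensus_nmf(X, k=3)
```

Here `programs` holds one consensus gene expression program per row and `W` holds the contribution of each program in each cell. The median is used rather than the mean only because it is less sensitive to an occasional outlying replicate; other ways of combining replicates would fit the same description.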
