Parameter representations outperform single-cell foundation models on downstream tasks
Boston University
Abstract:
Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without computationally intensive deep-learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near-SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, and we outperform foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single-cell gene expression data. As an illustration, I will show how we can use simple representations to find signatures of low-dimensional geometric landscapes in high-dimensional gene expression data on cell fate transitions.
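To make "careful normalization and linear methods" concrete, here is a minimal sketch of what such a pipeline could look like. It is not the pipeline used in this work: the library-size normalization, the target of 10,000 counts per cell, the PCA projection, the nearest-centroid classifier, and the synthetic two-cell-type data are all assumptions chosen for illustration.

```python
import numpy as np

def normalize_log(counts, target_sum=1e4):
    """Scale each cell to a common library size, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / lib * target_sum)

def fit_pca(X, n_components):
    """Return the gene-wise mean and the top principal axes (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(X, mean, components):
    """Linear projection of cells onto the principal axes."""
    return (X - mean) @ components.T

def fit_centroids(Z, labels):
    """Mean embedding per class."""
    classes = np.unique(labels)
    centroids = np.stack([Z[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(Z, classes, centroids):
    """Assign each cell to the class with the nearest centroid."""
    d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Synthetic data (an assumption, not real scRNA-seq): two "cell types"
# whose Poisson expression rates differ in the first 10 of 50 genes.
rng = np.random.default_rng(0)
n_genes, n_per_class = 50, 200
rates_a = rng.gamma(2.0, 1.0, n_genes)
rates_b = rates_a.copy()
rates_b[:10] *= 5.0
counts = np.vstack([
    rng.poisson(rates_a, size=(n_per_class, n_genes)),
    rng.poisson(rates_b, size=(n_per_class, n_genes)),
])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# Random train/test split.
order = rng.permutation(len(labels))
train, test = order[:300], order[300:]

X = normalize_log(counts)
mean, components = fit_pca(X[train], n_components=5)
Z_train = project(X[train], mean, components)
Z_test = project(X[test], mean, components)

classes, centroids = fit_centroids(Z_train, labels[train])
accuracy = float((predict(Z_test, classes, centroids) == labels[test]).mean())
print(f"held-out accuracy: {accuracy:.2f}")
```

Every step after the logarithm is linear, so each embedding dimension is an explicit weighted sum of normalized gene expression values, which is what makes a representation of this kind straightforward to interpret.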
Biography:
Pankaj Mehta is a Professor of Physics at Boston University, where he is also a member of the Faculty of Computing and Data Science. He received his B.S. in Mathematics from Caltech and his Ph.D. in Physics from Rutgers, followed by postdoctoral work at Princeton. He is interested in theoretical problems at the interface of physics and biology, specifically how large-scale collective behaviors emerge from individual components: molecules enabling cellular computation, cell fate in development, or universal statistical behaviors in microbial ecosystems. He also has an ongoing interest in problems at the intersection of machine learning and statistical physics, as well as occasional forays into quantum condensed matter. He is a firm believer that understanding biology requires developing a twenty-first-century statistical physics of life (see arXiv:2410.20506). He is a Simons Investigator in the Mathematical Modeling of Living Systems and an Alfred P. Sloan Fellow.