MIA: Lorin Crawford, Rethinking scale in single-cell foundation models; Primer: Davide D'Ascenzo

Lorin Crawford
Microsoft Research New England (MSR)

Meeting: When more isn’t better: rethinking scale in single-cell foundation models

The success of transformer-based foundation models on natural language and images has motivated their use in single-cell biology. Single-cell foundation models have been trained on increasingly large transcriptomic datasets, scaling from initial studies with 1 million cells to newer atlases with over 100 million cells. In this talk, we will investigate how pre-training dataset size and diversity affect the performance of single-cell foundation models on both zero-shot and fine-tuned tasks. In the first half, we use a large corpus of 22.2 million cells to pre-train a total of 400 models and evaluate them across over 6,400 experiments. We show that current methods tend to plateau in performance with pre-training datasets that are only a fraction of the full corpus size. This leads into the second half of the talk, where we evaluate the effect of training data composition on model performance. Focusing on a tractable biological system (human hematopoiesis), we train and analyze deep generative models on a variety of training datasets, including cells from adult and developing tissues, disease states, and perturbation atlases. Across these models, we observe that (1) deep generative models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy-cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell transcription factor differentiation atlas in the training data improves performance on out-of-distribution tasks. Our results highlight the distinct contributions of different training data types and point towards strategies for optimizing future single-cell foundation models.
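As a rough illustration of the kind of scaling analysis described above, the sketch below "pre-trains" on nested subsamples of a toy corpus and tracks a downstream score. The corpus, the placeholder model, and the metric are all hypothetical stand-ins, not the speaker's pipeline; a plateau would appear as the score flattening well before the largest subsample size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (illustrative only; not the actual pre-training pipeline).
# A real experiment would pre-train a transformer on each subsample and
# evaluate it on held-out zero-shot and fine-tuned benchmarks.
corpus = rng.poisson(1.0, size=(50_000, 200)).astype(np.float32)  # cells x genes

def pretrain(cells: np.ndarray) -> np.ndarray:
    return cells.mean(axis=0)  # placeholder "model": mean expression profile

def zero_shot_score(model: np.ndarray) -> float:
    # Placeholder metric: agreement with the full-corpus profile.
    return -float(np.abs(model - corpus.mean(axis=0)).mean())

for n in (1_000, 5_000, 20_000, 50_000):
    idx = rng.choice(corpus.shape[0], size=n, replace=False)
    print(f"{n:>6} cells: score = {zero_shot_score(pretrain(corpus[idx])):.4f}")
# If the score curve flattens long before the full corpus size, extra
# pre-training data is buying little on this task.
```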

Davide D'Ascenzo
Polytechnic University of Turin

Primer: Infrastructure and modeling challenges in single-cell omics

Deep learning is expected to play a major role in advancing the analysis of single-cell omics data, but a number of practical challenges have so far limited its impact. In this talk, we will focus on two such challenges, one at the level of data infrastructure and one at the level of model design. With the rise of very large-scale single-cell experiments, we are now able to use datasets of hundreds of millions of cells. Model training on these datasets is often bottlenecked by the data-loading process of moving data from disk to GPU. To address this, we developed scDataset, a scalable and efficient data-loading solution for quasi-random sampling of single-cell data from disk [1]. We present block sampling and batched fetching strategies that balance I/O efficiency with memory consumption and minibatch diversity, demonstrating how to achieve increased throughput while maintaining sampling quality comparable to true random shuffling.

Moving to biological tasks, we focus on cell type annotation. We find that both simple linear models and more complex transformer-based architectures struggle to generalize in out-of-distribution settings. To mitigate this, we introduce a hierarchical cross-entropy loss that incorporates the structure of cell type ontologies [2]. Across model classes, this leads to consistent improvements in performance, suggesting that structured biological priors may be more useful than scaling up model parameters.

[1] Davide D'Ascenzo and Sebastiano Cultrera di Montesano. scDataset: Scalable data loading for deep learning on large-scale single-cell omics. arXiv, 2025.
[2] Sebastiano Cultrera di Montesano, Davide D'Ascenzo, Srivatsan Raghavan, Ava P. Amini, Peter S. Winter, and Lorin Crawford. Hierarchical cross-entropy loss improves atlas-scale single-cell annotation models. bioRxiv, 2025.
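To make the data-loading idea concrete, here is a minimal sketch of block sampling with batched fetching over a memory-mapped on-disk array. It illustrates the general strategy described in [1], not the scDataset API itself; the file name and all sizes are hypothetical.

```python
import numpy as np

# Minimal sketch of block sampling + batched fetching for quasi-random
# minibatches from an on-disk array. Hypothetical file and sizes.
X = np.load("cells.npy", mmap_mode="r")   # cells x genes, too big for RAM

block_size, blocks_per_fetch, batch_size = 256, 16, 128
n_blocks = X.shape[0] // block_size
rng = np.random.default_rng(0)

block_order = rng.permutation(n_blocks)            # randomize block order
for i in range(0, n_blocks, blocks_per_fetch):
    picked = block_order[i : i + blocks_per_fetch]
    # One contiguous read per block: far fewer disk seeks than per-cell access.
    buffer = np.concatenate(
        [X[b * block_size : (b + 1) * block_size] for b in picked]
    )
    buffer = buffer[rng.permutation(len(buffer))]  # shuffle within the fetch
    for j in range(0, len(buffer), batch_size):
        minibatch = buffer[j : j + batch_size]     # hand off to the GPU here
```

Reading a handful of contiguous blocks per fetch keeps disk access nearly sequential, while shuffling across blocks and within the fetched buffer keeps minibatches close to random.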
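And as a sketch of the modeling idea: one common way to build a hierarchical cross-entropy is to add, on top of the usual leaf-level term, a penalty at each ancestor node of the ontology, where an ancestor's probability is the sum of its descendant leaves' probabilities. The toy ontology below is hypothetical, and this is a generic formulation rather than necessarily the exact loss of [2].

```python
import torch
import torch.nn.functional as F

# Toy ontology: four leaf cell types, two internal nodes. Hypothetical.
leaves = ["CD4 T", "CD8 T", "B", "monocyte"]
ancestors = {              # internal node -> indices of descendant leaves
    "T cell":     [0, 1],
    "lymphocyte": [0, 1, 2],
}

def hierarchical_ce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    loss = F.cross_entropy(logits, target)           # standard leaf-level term
    probs = logits.softmax(dim=-1)
    for members in ancestors.values():
        members_t = torch.tensor(members)
        node_prob = probs[:, members_t].sum(dim=-1)  # P(node) = sum of leaf probs
        under_node = torch.isin(target, members_t)   # cells whose label sits under this node
        if under_node.any():
            # Negative log-likelihood of predicting the correct ancestor.
            loss = loss + (-torch.log(node_prob[under_node] + 1e-12)).mean()
    return loss

logits = torch.randn(8, len(leaves))
labels = torch.randint(0, len(leaves), (8,))
print(hierarchical_ce(logits, labels))
```

Mistakes within a subtree (e.g., CD4 vs. CD8 T cells) still satisfy the ancestor terms, so coarse errors are penalized more heavily than fine-grained ones.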

For more information visit: /mia.