Evaluating the role of pretraining dataset size and diversity on single-cell foundation model performance.

Nature methods

Authors	Alan DenAdel, Madeline Hughes, Akshaya Thoutam, Anay Gupta, Andrew Navia, Nicolo Fusi, Srivatsan Raghavan, Peter Winter, Ava Amini, Lorin Crawford
Abstract	The success of transformer-based foundation models on natural language and images has motivated their use in single-cell biology. Single-cell foundation models have been trained on increasingly large transcriptomic datasets, scaling from initial studies with 1 million cells to newer atlases with over 100 million cells. Here we investigate the role of pretraining dataset size and diversity on the performance of single-cell foundation models on both zero-shot and fine-tuned tasks. Using a large corpus of 22.2 million cells, we pretrain a total of 400 models, which we evaluate by conducting 6,400 experiments. Our results show that current methods tend to plateau in performance with pretraining datasets that are only a fraction of the size of current training corpora. Unlike large language models, single-cell foundation models show no clear data scaling laws, indicating that developers should focus on balancing model capacity, dataset size and computational resources rather than indiscriminately increasing all three.
Year of Publication	2026
Journal	Nature methods
Date Published	06/2026
ISSN	1548-7105
DOI	10.1038/s41592-026-03120-y
PubMed ID	42265208