Evaluating the role of pretraining dataset size and diversity on single-cell foundation model performance.
| Authors | |
| Abstract | The success of transformer-based foundation models on natural language and images has motivated their use in single-cell biology. Single-cell foundation models have been trained on increasingly large transcriptomic datasets, scaling from initial studies with 1 million cells to newer atlases with over 100 million cells. Here we investigate the role of pretraining dataset size and diversity in the performance of single-cell foundation models on both zero-shot and fine-tuned tasks. Using a large corpus of 22.2 million cells, we pretrain a total of 400 models, which we evaluate by conducting 6,400 experiments. Our results show that current methods tend to plateau in performance with pretraining datasets that are only a fraction of the size of current training corpora. Unlike large language models, single-cell foundation models show no clear data scaling laws, indicating that developers should focus on balancing model capacity, dataset size and computational resources rather than indiscriminately increasing all three. |
| Year of Publication | 2026 |
| Journal | Nature methods |
| Date Published | 06/2026 |
| ISSN | 1548-7105 |
| DOI | 10.1038/s41592-026-03120-y |
| PubMed ID | 42265208 |
| Links |