When more isn’t better: rethinking scale in single-cell foundation models

Microsoft Research New England (MSR)

The success of transformer-based foundation models on natural language and images has motivated their use in single-cell biology. Single-cell foundation models have been trained on increasingly large transcriptomic datasets, scaling from initial studies with 1 million cells to newer atlases with over 100 million cells. In this talk, we will investigate the effect of pre-training dataset size and diversity on the performance of single-cell foundation models on both zero-shot and fine-tuned tasks. In the first half, we use a large corpus of 22.2 million cells to pre-train a total of 400 models and evaluate them in over 6,400 experiments. We show that current methods tend to plateau in performance with pre-training datasets that are only a fraction of the size of the full corpus. This leads to the second half of the talk, where we evaluate the effect of training data composition on model performance. Focusing on a tractable biological system (human hematopoiesis), we train and analyze deep generative models on a variety of training datasets, including cells from adult and developing tissues, disease states, and perturbation atlases. Comparing performance across these models, we observe that (1) deep generative models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy-cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell transcription-factor differentiation atlas in the training data improves performance on out-of-distribution tasks. Our results highlight the distinct contributions of different training data types and point towards strategies for optimizing future single-cell foundation models.
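
As a rough illustration of the kind of scaling experiment described above, the sketch below subsamples a training corpus at increasing sizes, fits a deep generative model to each subset, and scores the learned embeddings against annotated cell types. The choice of scvi-tools (scVI) as the model, the `training_corpus.h5ad` input file, the `cell_type` label column, and the silhouette-based metric are all assumptions made for illustration; this is not the pipeline used in the talk.

```python
# Hypothetical sketch: measure how embedding quality changes with pre-training
# corpus size. Assumes an AnnData file with raw counts and a "cell_type"
# column in .obs; scVI and a silhouette score stand in for the (unspecified)
# models and evaluation metrics used in the talk.
import numpy as np
import scanpy as sc
import scvi
from sklearn.metrics import silhouette_score

adata = sc.read_h5ad("training_corpus.h5ad")  # assumed input file

results = {}
for n_cells in [50_000, 200_000, 1_000_000]:  # increasing pre-training sizes
    # Randomly subsample the corpus to the target size.
    idx = np.random.default_rng(0).choice(
        adata.n_obs, size=min(n_cells, adata.n_obs), replace=False
    )
    subset = adata[idx].copy()

    # Train a deep generative model (scVI) on the subset.
    scvi.model.SCVI.setup_anndata(subset)
    model = scvi.model.SCVI(subset)
    model.train(max_epochs=50)

    # Score the latent space by how well it separates annotated cell types.
    latent = model.get_latent_representation()
    results[n_cells] = silhouette_score(latent, subset.obs["cell_type"].values)

print(results)  # a plateau here would mirror the saturation reported in the talk
```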
