Infrastructure and modeling challenges in single-cell omics

Polytechnic University of Turin

Deep learning is expected to play a major role in advancing the analysis of single-cell omics data, but a number of practical challenges have so far limited its impact. In this talk, we will focus on two such challenges, one at the level of data infrastructure and one at the level of model design. With the rise of very large-scale single-cell experiments, we are now able to use datasets of hundreds of millions of cells. Model training on these datasets is often bottlenecked by data loading, that is, by moving data from disk to GPU. To address this, we developed scDataset, a scalable and efficient data-loading solution for quasi-random sampling of single-cell data from disk [1]. We present block sampling and batched fetching strategies that balance I/O efficiency with memory consumption and minibatch diversity, demonstrating how to achieve increased throughput while maintaining sampling quality comparable to true random shuffling. Moving to biological tasks, we focus on cell type annotation. We find that both simple linear models and more complex transformer-based architectures struggle to generalize in out-of-distribution settings. To mitigate this, we introduce a hierarchical cross-entropy loss that incorporates the structure of cell type ontologies [2]. Across model classes, this leads to consistent improvements in performance, suggesting that structured biological priors may be more useful than scaling up model parameters.

[1] Davide D’Ascenzo and Sebastiano Cultrera di Montesano. scDataset: Scalable data loading for deep learning on large-scale single-cell omics. arXiv, 2025.
[2] Sebastiano Cultrera di Montesano, Davide D’Ascenzo, Srivatsan Raghavan, Ava P. Amini, Peter S. Winter, and Lorin Crawford. Hierarchical cross-entropy loss improves atlas-scale single-cell annotation models. bioRxiv, 2025.
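
The block sampling and batched fetching idea mentioned in the abstract can be illustrated with a minimal sketch in plain Python. This is not the scDataset API: the function name `block_sampled_batches` and the parameters `block_size` and `fetch_factor` are hypothetical and chosen only for illustration. The sketch assumes that contiguous blocks of rows are cheap to read from disk, that several minibatches' worth of indices are fetched at once, and that the fetched rows are then shuffled in memory to restore minibatch diversity. Sorting the fetched indices before the read is likewise an assumption of this sketch.

```python
import random
from typing import Iterator, List


def block_sampled_batches(
    n_cells: int,
    batch_size: int,
    block_size: int,
    fetch_factor: int,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield minibatches of row indices (hypothetical helper, not scDataset).

    Block sampling: indices are drawn as contiguous blocks in random order.
    Batched fetching: batch_size * fetch_factor indices are gathered per
    fetch, then shuffled in memory and split into minibatches.
    """
    rng = random.Random(seed)

    # Randomize the order of contiguous blocks rather than single rows.
    block_starts = list(range(0, n_cells, block_size))
    rng.shuffle(block_starts)

    fetch_size = batch_size * fetch_factor
    buffer: List[int] = []
    for start in block_starts:
        buffer.extend(range(start, min(start + block_size, n_cells)))
        while len(buffer) >= fetch_size:
            # Sorted indices stand in for one ordered read from disk.
            fetched = sorted(buffer[:fetch_size])
            buffer = buffer[fetch_size:]
            # In-memory shuffle mixes rows from different blocks.
            rng.shuffle(fetched)
            for i in range(0, fetch_size, batch_size):
                yield fetched[i:i + batch_size]

    # Emit whatever is left; the final minibatch may be smaller.
    if buffer:
        rng.shuffle(buffer)
        for i in range(0, len(buffer), batch_size):
            yield buffer[i:i + batch_size]


if __name__ == "__main__":
    for batch in block_sampled_batches(
        n_cells=1000, batch_size=32, block_size=8, fetch_factor=4
    ):
        pass  # here the rows in `batch` would be read and moved to the GPU
```

In this sketch, a larger `block_size` means fewer random disk seeks but less diverse minibatches, while a larger `fetch_factor` mixes more blocks per minibatch at the cost of holding more rows in memory. This is the trade-off between I/O efficiency, memory consumption, and minibatch diversity that the abstract refers to.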
