Modern Nonlinear Embedding Methods Unpacked: Empowering Biological Discoveries with Statistical Insights
Harvard T. H. Chan School of Public Health
Abstract:
Learning and representing low-dimensional structures from noisy, high-dimensional data is a cornerstone of modern data science. Stochastic neighbor embedding algorithms, a family of nonlinear dimensionality reduction and data visualization methods with t-SNE and UMAP as two leading examples, have become very popular in recent years. Yet despite their wide applications, these methods remain subject to debate over their limited theoretical understanding, ambiguous interpretations, and sensitivity to tuning parameters. In this talk, I will present our recent efforts to decipher and improve these nonlinear embedding approaches. Our key results include a rigorous theoretical framework that uncovers the intrinsic mechanisms, large-sample limits, and fundamental principles underlying these algorithms; a set of theory-informed practical guidelines for their principled use in trustworthy biological discovery; and a collection of new algorithms that address current limitations and improve performance in areas such as bias reduction and stability. Throughout the talk, I will highlight how these advances not only deepen our theoretical understanding but also open new avenues for scientific discovery.
Biography:
I'm an Assistant Professor in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health and an Associate Member of the Broad Institute of MIT and Harvard. I also hold affiliate positions in the Department of Data Science at the Dana-Farber Cancer Institute and the Harvard Data Science Initiative. I received my Ph.D. in biostatistics from the University of Pennsylvania in 2021, where I was jointly advised by Professors T. Tony Cai and Hongzhe Li. After that, I was a postdoctoral scholar in statistics at Stanford University, advised by Professor David L. Donoho. My research interests lie at the intersection of statistics and computational biology. Currently, my research focuses on (i) statistical inference for high-dimensional data and the theory of large random matrices, (ii) theoretical and computational underpinnings of manifold learning, geometric inference, and data integration algorithms, and (iii) developing interpretable machine learning methods for high-dimensional biomedical data, especially in the context of single-cell and spatial omics analyses.