Efficiently quantifying dependence in massive scientific datasets using InterDependence Scores.

Proceedings of the National Academy of Sciences of the United States of America
Abstract

Large-scale scientific datasets today contain tens of thousands of random variables across millions of samples (for example, the RNA expression levels of 20,000 protein-coding genes across 30 million single cells). Quantifying dependencies between these variables would help us discover novel relationships between variables of interest. Simple measures of dependence, such as Pearson correlation, are fast to compute but limited because they are designed to detect only linear relationships between variables. More complex measures are known that can detect any kind of dependence, but they do not readily scale to many modern datasets of interest. We introduce the InterDependence Score (IDS), a scalable measure of dependence that captures linear and various nonlinear dependencies between random variables. Our IDS algorithm is motivated by a dependence measure defined in infinite-dimensional Hilbert spaces, capable of capturing any type of dependence, and a fast (linear time) algorithm that neural networks natively implement to compute dependencies between random variables. We apply IDS to identify 1) relevant variables for predictive modeling tasks, 2) sets of words forming topics from millions of documents, and 3) sets of genes related to "gene-expression programs" in tens of millions of cells. We provide an efficient implementation that computes IDS between billions of pairs of variables across millions of samples in several hours on a single GPU. Given its speed and effectiveness in identifying nonlinear dependencies, we envision that IDS will be a valuable tool for uncovering insights from scientific data.
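
The abstract's point that Pearson correlation is designed for linear relationships can be illustrated with a small sketch. This is not the IDS algorithm, whose details are not given in this record; it only shows, under assumed toy data (a symmetric grid `xs` with `ys = x**2`), that the Pearson coefficient can be near zero even when one variable fully determines the other. The helper `pearson` and the hand-picked squared feature are illustrative assumptions, not part of the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Assumed toy data: a symmetric grid on [-1, 1], with y a deterministic,
# nonlinear function of x (y depends on x completely, but not linearly).
n = 1001
xs = [-1 + 2 * i / (n - 1) for i in range(n)]
ys = [x * x for x in xs]

# Linear correlation between x and y is essentially zero by symmetry.
r_linear = pearson(xs, ys)

# Correlating a nonlinear feature of x (here, hand-picked as x**2) with y
# recovers the dependence; the feature choice is an illustrative assumption.
r_feature = pearson([x * x for x in xs], ys)

print(f"pearson(x, y)    = {r_linear:.3e}")
print(f"pearson(x**2, y) = {r_feature:.6f}")
```

In this toy case the right nonlinear feature is known in advance. A general-purpose dependence measure has to work without that knowledge, which is the gap the abstract describes.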

Year of Publication
2025
Journal
Proceedings of the National Academy of Sciences of the United States of America
Volume
122
Issue
34
Pages
e2509860122
Date Published
08/2025
ISSN
1091-6490
DOI
10.1073/pnas.2509860122
PubMed ID
40833404