Characterizing homology-induced data leakage and memorization in genome-trained sequence models
University of British Columbia
Abstract:
Models that predict function from DNA sequence have become critical tools for deciphering the roles of genomic sequences and the genetic variation within them. However, traditional approaches for dividing genomic sequences into training data, used to create the model, and test data, used to assess the model's performance on unseen data, fail to account for the widespread homology that permeates the genome. Using controlled simulations, we demonstrate that homology-based data leakage can lead to overestimation of predictive model performance. Using models that predict human gene expression and epigenetic states from DNA sequence, we demonstrate that model performance on test sequences varies systematically with their similarity to training sequences. Across three distinct modeling regimes, fixed-context MPRA models (200 bp), short-context ChromBPNet (2 kb), and long-context Enformer (196 kb), we observe a consistent non-monotonic pattern: models perform better at the extremes of homology but worse at intermediate similarity levels. These patterns are consistent with models achieving high performance on highly homologous test sequences through memorization of near-identical training examples, while performance on distant sequences appears to reflect the application of learned, generalizable principles. At intermediate homology, misleading sequence similarity likely triggers memorized associations that fail due to functional divergence, reducing predictive accuracy. To dissect and mitigate these effects, we introduce hashFrag, a scalable solution for homology detection and data partitioning. Using hashFrag, we demonstrate how to move beyond a single performance metric to characterize this complex dependence on homology. hashFrag improves estimates of model performance and can also increase model performance and reliability by providing improved splits for model training.
Altogether, we demonstrate how homology shapes the performance of genome-trained models and must be accounted for to ensure reliable evaluation of sequence-to-function predictors.
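The core idea behind homology-aware partitioning can be sketched in miniature: group sequences that share detectable similarity, then assign whole groups to either train or test so no homologous pair straddles the split. The sketch below is a simplified illustration only, not hashFrag's actual algorithm or API; the k-mer Jaccard similarity proxy, the threshold, and all function names are assumptions made for this example.

```python
from itertools import combinations
import random

def kmer_set(seq, k=6):
    """Set of overlapping k-mers; a cheap proxy for sequence similarity."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    return len(a & b) / len(a | b)

def homology_aware_split(seqs, threshold=0.3, test_frac=0.2, seed=0):
    """Cluster sequences whose k-mer Jaccard similarity exceeds `threshold`
    (via union-find), then assign whole clusters to train or test so that
    no homologous pair is split across the two sets."""
    parent = list(range(len(seqs)))

    def find(i):  # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    kmers = [kmer_set(s) for s in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(kmers[i], kmers[j]) > threshold:
            parent[find(i)] = find(j)  # merge homologous sequences

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)

    # Fill the test set cluster-by-cluster; remaining clusters go to train.
    rng = random.Random(seed)
    groups = list(clusters.values())
    rng.shuffle(groups)
    train, test = [], []
    for g in groups:
        (test if len(test) < test_frac * len(seqs) else train).extend(g)
    return train, test
```

A naive random split, by contrast, scatters members of the same homology cluster across both sets, letting memorized near-duplicates inflate test metrics; real pipelines would replace the all-pairs Jaccard step with a scalable similarity search.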
Biography:
Abdul Muntakim Rafi is a Ph.D. candidate in Biomedical Engineering at the University of British Columbia, supervised by Carl de Boer. His research uses machine learning to decode cis-regulatory logic: building models that predict gene expression from DNA sequence, developing methods to interpret them, and critically evaluating their reliability. His current work centers on continually improving sequence-to-expression models by synthesizing informative DNA sequences at scale and through lab-in-the-loop experiments.