More robust data, models, and benchmarks to solve the genome regulation code
University of British Columbia
Abstract:
Genome-scale data has enabled the creation of deep learning and other models that predict gene regulation from DNA sequence. Such models are increasingly useful in understanding how changes to the genome impact molecular, cellular, and organismal phenotypes. However, our understanding of what models are good at and where they fail is limited by the ad hoc and inconsistent ways in which models are evaluated. Monitoring progress in the field has been extremely challenging because the benchmarks change with every publication. To address this, we created GAME, a modular framework that relies on Application Programming Interfaces (APIs) to enable consistent benchmarking across models and over time. GAME makes models and benchmarks inherently intercompatible to minimize the barrier to benchmarking, and the containerization of modules makes GAME stable across systems. One common task is predicting the effects of genetic variants on gene regulation, which can also be experimentally determined using Massively Parallel Reporter Assays (MPRAs). MPRAs are also an ideal way to generate training data for gene regulation models that can then be used to predict variant effects. In Part 2, I will discuss a complicating factor in using MPRA data and MPRA-trained models to interpret the genome: the activity of a DNA sequence can depend on its surrounding DNA sequence context. This can result from position- and orientation-specific interactions between factors binding DNA. Using experiments and model inference, we show how the real molecular biology captured in these systems complicates their use for understanding genetic variation. We focus on MPRA data and models because that is where we have enough data to study these effects; the same phenomena likely also occur in the genome, but there we currently lack sufficient data to study them. Finally, I will discuss the Genome Regulatory Code Consortium, whose goal is to solve the human genome regulatory code.
Biography:
Dr. de Boer is an Assistant Professor in the School of Biomedical Engineering at the University of British Columbia. He did his PhD in the lab of Tim Hughes at the University of Toronto, and was a postdoctoral fellow in Aviv Regev’s lab at the Broad Institute until the end of 2019, after which he moved to his current position at UBC. His research group aims to develop genomic and computational tools that will enable us to understand how the genome is regulated so that we can understand and treat disease.