Learning the rules of gene regulation with millions of synthetic promoters

Carl de Boer
Regev Lab

Abstract: Gene regulatory programs are encoded in DNA sequence. However, how the cell uses transcription factors (TFs) to interpret regulatory sequence remains incompletely understood. Synthetic regulatory sequences can provide insight into this logic by supplying additional examples of sequences and their regulatory output in a controlled setting. Here, we have measured the expression output of tens of millions of unique promoter sequences, whose expression levels span a 1000-fold range, in a controlled reporter construct. This vast dataset of sequence-expression pairs represents a unique machine learning opportunity, and we use it to build quantitative models of transcriptional regulation based on biochemical principles. Even with a naive “billboard” model of gene regulation (with no positional effects or complex TF-TF interactions), we can explain upwards of 92% of the variation in expression. We gain numerous insights into gene regulation, including a quantitative description of activation, repression, and chromatin modification for each TF that is consistent with known TF activities and condition-specific regulators, and we use our data to refine the DNA-binding specificities of TFs. Although a “billboard” model explains the majority of the variation in expression in our system, certain TFs show position-, orientation-, and even DNA helical-face-dependent activities. The number of promoter examples is large enough that we can test for spacing- and orientation-dependent interactions between most TF pairs at base-pair resolution, and we find certain interactions consistent with biochemical cooperativity. Altogether, the principles learned here help us to better understand when and where TFs bind DNA, what they do once bound, and how regulatory sequences evolve.