Using knockoffs to find important variables with statistical guarantees
Dept. of Statistics, Harvard University
Despite significant recent progress in high-dimensional variable selection (reviewed in the primer), it remains unclear how to powerfully select important variables while controlling the fraction of false discoveries, even in simple models like logistic regression, let alone general high-dimensional nonlinear models. To address this practical problem, we propose a new framework of model-X knockoffs, which acts as a wrapper around any (arbitrarily complex, e.g., drawn from machine learning) measure of variable importance and identifies important variables while exactly controlling the false discovery rate (FDR). Our method relies only on a model for the explanatory variables X, and makes no assumptions at all about the conditional distribution of the response given the covariates. To our knowledge, no other procedure solves the FDR-controlled variable selection problem in such generality; in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. We also demonstrate model-X knockoffs on GWAS data from a case-control study of Crohn's disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
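
To make the "wrapper" idea concrete, the following is a minimal sketch of one instantiation of the knockoff filter, assuming Gaussian covariates with known covariance: it samples "equicorrelated" Gaussian knockoff copies of X, fits an l1-penalized logistic regression to the augmented design [X, X_tilde], scores each variable by the difference in coefficient magnitude between it and its knockoff, and applies the knockoff+ threshold to control the FDR at level q. The construction, the tuning choices (e.g., the penalty level C), and the simulated data are illustrative assumptions, not the paper's specific recommendations.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def gaussian_knockoffs(X, Sigma, rng):
        # Equicorrelated model-X knockoffs for rows of X drawn i.i.d. from N(0, Sigma).
        p = Sigma.shape[0]
        s = np.full(p, 0.99 * min(2.0 * np.linalg.eigvalsh(Sigma).min(), 1.0))
        Sigma_inv = np.linalg.inv(Sigma)
        D = np.diag(s)
        cond_mean = X - X @ Sigma_inv @ D          # E[X_tilde | X]
        cond_cov = 2.0 * D - D @ Sigma_inv @ D     # Cov[X_tilde | X]
        L = np.linalg.cholesky(cond_cov)
        return cond_mean + rng.standard_normal(X.shape) @ L.T

    def knockoff_threshold(W, q):
        # Smallest t with estimated FDP (1 + #{W_j <= -t}) / #{W_j >= t} <= q ("knockoff+").
        for t in np.sort(np.abs(W[W != 0])):
            if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
                return t
        return np.inf

    # Toy example: logistic-regression response, 20 true signals among p = 200 variables.
    rng = np.random.default_rng(0)
    n, p, k, q = 500, 200, 20, 0.10
    Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) correlations
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[rng.choice(p, size=k, replace=False)] = 0.75 * rng.choice([-1, 1], size=k)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

    X_tilde = gaussian_knockoffs(X, Sigma, rng)
    fit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(
        np.hstack([X, X_tilde]), y)
    coef = np.abs(fit.coef_.ravel())
    W = coef[:p] - coef[p:]     # importance of each variable minus that of its knockoff
    selected = np.where(W >= knockoff_threshold(W, q))[0]
    print(f"selected {selected.size} variables at nominal FDR q = {q}")

Because the FDR guarantee rests only on the knockoff construction for X, the same wrapper applies unchanged if the penalized logistic regression above is swapped for any other measure of variable importance.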