Efficient Lipschitz-constrained neural networks
Grosse Group, University of Toronto; Vector Institute
Training neural networks under a strict Lipschitz constraint is useful for provable adversarial robustness, generalization bounds, interpretable gradients, and Wasserstein distance estimation. By the composition property of Lipschitz functions, it suffices to ensure that each individual affine transformation or nonlinear activation is 1-Lipschitz. The challenge is to do this while maintaining expressive power. We identify a necessary property for such an architecture: each of the layers must preserve the gradient norm during backpropagation. We propose two architectural components that satisfy strict Lipschitz constraints with norm preservation. The first is the GroupSort activation function, which sorts units within a group. The second is the use of orthogonal linear layers; this is straightforward for fully connected layers, but more involved for convolutional layers. We present a flexible and efficient representation of orthogonal convolutions. Our provably Lipschitz-constrained architectures perform competitively at Wasserstein distance estimation and provable adversarial robustness.
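As a rough illustration of the GroupSort idea described above, the sketch below reshapes a feature vector into groups and sorts within each group; since sorting only permutes values, the activation is 1-Lipschitz and preserves the gradient norm. The function name, signature, and default group size are illustrative assumptions, not the paper's API.

```python
import numpy as np

def group_sort(x: np.ndarray, group_size: int = 2) -> np.ndarray:
    """Illustrative GroupSort activation: sort units within each group.

    Assumes the last dimension of `x` is divisible by `group_size`.
    This is a hypothetical sketch, not the authors' implementation.
    """
    *batch, n = x.shape
    grouped = x.reshape(*batch, n // group_size, group_size)
    # Sorting within each group permutes entries, so norms are preserved.
    return np.sort(grouped, axis=-1).reshape(*batch, n)

# Example: with group_size=2, each pair is replaced by (min, max).
x = np.array([[3.0, -1.0, 0.5, 2.0]])
print(group_sort(x))  # [[-1.   3.   0.5  2. ]]
```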