
parametrization achieves higher model expressiveness and robustness (Section 6.2), and provides
more meaningful representation vectors (Section 6.3).
In addition, several works have shown that semi-supervised learning helps improve the empirical robustness of models under adversarial training and other settings [4]. In this work, we make the first attempt to bridge semi-supervised training with certified robustness based on our 1-Lipschitz DNNs. Theoretically, we show that semi-supervised learning can improve the error bound of Lipschitz-bounded models. We also lower bound the certified radius as a function of the model performance and its Lipschitz property. Empirically, we indeed observe that including unlabeled data improves the certified robustness of 1-Lipschitz models, especially at larger radii (e.g., from 36.04% to 42.39% at ρ = 108/255 on CIFAR-10).
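As background, certificates for Lipschitz-bounded models typically build on the standard Lipschitz-margin argument: if the network is $L$-Lipschitz in $\ell_2$, its top-1 prediction cannot change within radius $\mathrm{margin}/(\sqrt{2}\,L)$. The sketch below illustrates this generic computation in NumPy; the helper name is ours, and our own bound, which accounts for semi-supervised learning, is derived later.

```python
import numpy as np

def certified_radius(logits, label, lipschitz_const=1.0):
    """Generic Lipschitz-margin certificate (illustrative helper, not the
    paper's exact bound): if the network is L-Lipschitz in the l2 norm,
    the top-1 prediction cannot change within radius margin / (sqrt(2) * L),
    where margin is the gap between the true-class logit and the runner-up."""
    margin = logits[label] - np.max(np.delete(logits, label))
    return max(margin, 0.0) / (np.sqrt(2.0) * lipschitz_const)

# For a 1-Lipschitz model, a logit gap of 2.7 certifies a radius of about 1.91.
print(certified_radius(np.array([3.1, 0.4, -1.2]), label=0))
```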
We conduct comprehensive experiments to evaluate our approach and show that LOT significantly outperforms the state of the art in terms of deterministic $\ell_2$ certified robustness. We also conduct different ablation studies to show that (1) LOT produces more meaningful features for visualization, and (2) residual connections help smooth the training of LOT models.
Technical contributions. In this work, we aim to train a certifiably robust 1-Lipschitz model and also analyze the certified radius of Lipschitz-bounded models under semi-supervised learning.
• We propose LOT, a layer-wise orthogonal training method for convolution layers that trains 1-Lipschitz models based on Newton's iteration, which allows us to compute deterministic certified robustness for the model. We prove the convergence of the Newton iteration used in our algorithm (a minimal sketch of this type of iteration follows this list).
• We derive the certified robustness of Lipschitz-constrained models under the semi-supervised setting, and formally show how semi-supervised learning affects the certified radius.
• We evaluate our LOT method under different settings (i.e., supervised and semi-supervised) on different models and datasets. With supervised learning, we show that it significantly outperforms state-of-the-art baselines, and on some deep architectures the performance gain is over 4%. With semi-supervised learning, we further improve the certified robust accuracy by over 6% at a large radius.
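As referenced in the first contribution above, here is a minimal dense-matrix sketch of the kind of Newton (Newton-Schulz) iteration that orthogonalizes a weight matrix via the polar factor $W = V(V^\top V)^{-1/2}$. It only illustrates the general technique; the actual LOT construction for convolution layers is presented later in the paper, and the helper name `orthogonalize` is ours.

```python
import numpy as np

def orthogonalize(V, iters=25):
    """Map V to its polar factor W = V (V^T V)^{-1/2}, i.e. the nearest matrix
    with orthonormal columns, using a Newton-Schulz iteration for the inverse
    square root.  Assumes V has full column rank so that V^T V is positive
    definite; this dense version is an illustration, not the LOT conv layer."""
    A = V.T @ V
    n = A.shape[0]
    s = np.linalg.norm(A, 2)              # spectral norm: rescale eigenvalues into (0, 1]
    A = A / s
    Y, Z = A.copy(), np.eye(n)            # Y -> A^{1/2}, Z -> A^{-1/2}
    for _ in range(iters):
        T = 0.5 * (3.0 * np.eye(n) - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return V @ Z / np.sqrt(s)             # undo rescaling: (A/s)^{-1/2} = sqrt(s) A^{-1/2}

rng = np.random.default_rng(0)
V = rng.standard_normal((6, 4))
W = orthogonalize(V)
print(np.allclose(W.T @ W, np.eye(4), atol=1e-5))          # True: orthonormal columns
print(np.allclose(orthogonalize(0.01 * V), W, atol=1e-4))  # True: invariant to rescaling V
```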
2 Related Work
Certified Robustness for Lipschitz-Constrained Models
Several studies have explored Lipschitz-constrained models for certified robustness. [26] first certifies model robustness based on its Lipschitz constant and proposes training algorithms to regularize the model's Lipschitz constant. Multiple works [5, 19, 20, 8] achieve 1-Lipschitz linear networks during training by regularizing or normalizing the spectral norm of the weight matrix. However, when these approaches are applied to convolution layers, the spectral norm is bounded by unrolling the convolution into linear operations, which leads to a loose Lipschitz bound [27]. Recently, [2] shows that the 1-Lipschitz requirement alone is not enough for a good robust model; rather, the gradient-norm-preserving property is important. Besides these training-time techniques, different approaches have been proposed to calculate a tight Lipschitz bound during evaluation: [6] upper bounds the Lipschitz constant with semi-definite programming, while [11] upper bounds it with polynomial optimization. In this work, we aim to effectively train 1-Lipschitz convolution models.
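For concreteness, the spectral-norm normalization used by these linear-layer approaches can be sketched with a few lines of power iteration. This is a generic dense-matrix illustration (the helper name is ours); as noted above, transferring the same bound to a convolution by unrolling its kernel is what makes the resulting Lipschitz bound loose.

```python
import numpy as np

def spectral_normalize(W, iters=50):
    """Estimate the largest singular value of W by power iteration and rescale W
    so that its spectral norm, i.e. its Lipschitz constant as a linear map, is
    approximately 1.  Dense-matrix sketch only."""
    u = np.random.default_rng(0).standard_normal(W.shape[0])
    for _ in range(iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                     # estimate of the top singular value
    return W / sigma

W = np.random.default_rng(1).standard_normal((64, 128))
print(np.linalg.svd(spectral_normalize(W), compute_uv=False)[0])  # approximately 1.0
```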
Orthogonal Convolutional Neural Networks
[16] first proposes to directly construct orthogonal convolution operations. Such operations are not only 1-Lipschitz but also gradient norm preserving, which provides higher model capacity and a smoother training process [2]. BCOP [16] trains orthogonal convolutions by iteratively generating 2×2 orthogonal kernels from orthogonal matrices. [25] proposes to parametrize an orthogonal convolution with the Cayley transform $W = (I - V + V^\top)(I + V - V^\top)^{-1}$, where the convolution inverse is computed in the Fourier frequency domain.
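As a quick sanity check of why this parametrization yields an orthogonal operation (shown here on dense matrices; [25] applies it per frequency via the FFT): for the skew-symmetric matrix $S = V - V^\top$, the Cayley transform $(I - S)(I + S)^{-1}$ satisfies $W^\top W = I$.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 4))
S = V - V.T                                            # skew-symmetric, so I + S is invertible
W = (np.eye(4) - S) @ np.linalg.inv(np.eye(4) + S)     # Cayley transform
print(np.allclose(W.T @ W, np.eye(4)))                 # True: W is orthogonal
```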
ECO [36] proposes to explicitly normalize all singular values [22] of the convolution operation to be 1. So far, the best-performing orthogonal convolution approach is SOC [23], which parametrizes the orthogonal convolution as $W = \exp(V - V^\top)$, where the exponential and transpose are defined in terms of the convolution operation.
However, one major weakness of SOC is that it rescales $V$ to be small so that the approximation of $\exp$ converges quickly, which imposes a bias on the resulting output space: for example, when $V$ is very small, $W$ will be close to $I$. Such a norm-dependent property is undesirable, and we therefore propose a parametrization that is invariant to rescaling.
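To make the norm-dependence concrete, here is a small numerical sketch in which SciPy's dense matrix exponential stands in for SOC's convolution-based exponential: as $V$ is rescaled toward zero, $\exp(V - V^\top)$ collapses to the identity, whereas a rescaling-invariant parametrization such as the polar factor sketched earlier is unchanged.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 4))

for scale in (1.0, 0.1, 0.01):
    W = expm(scale * (V - V.T))                  # matrix exponential of a skew-symmetric matrix
    print(scale, np.linalg.norm(W - np.eye(4)))  # distance to the identity shrinks with the scale
```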
Finally, [24] proposes several techniques for orthogonal CNNs, including a generalized Householder (HH) activation function, a certificate regularizer (CReg) loss, and last layer normalization (LLN). These techniques can be integrated with our training approach to further improve model robustness.