LOT: Layer-wise Orthogonal Training on Improving $\ell_2$ Certified Robustness
Xiaojun Xu, Linyi Li, Bo Li
University of Illinois Urbana-Champaign
{xiaojun3, linyi2, lbo}@illinois.edu
Abstract
Recent studies show that training deep neural networks (DNNs) with Lipschitz constraints can enhance adversarial robustness and other model properties such as stability. In this paper, we propose a layer-wise orthogonal training method (LOT) to effectively train 1-Lipschitz convolution layers by parametrizing an orthogonal matrix with an unconstrained matrix. We then efficiently compute the inverse square root of a convolution kernel by transforming the input domain to the Fourier frequency domain. On the other hand, as existing works show that semi-supervised training helps improve empirical robustness, we aim to bridge the gap and prove that semi-supervised learning also improves the certified robustness of Lipschitz-bounded models. We conduct comprehensive evaluations for LOT under different settings. We show that LOT significantly outperforms baselines regarding deterministic $\ell_2$ certified robustness, and scales to deeper neural networks. Under the supervised scenario, we improve the state-of-the-art certified robustness for all architectures (e.g. from 59.04% to 63.50% on CIFAR-10 and from 32.57% to 34.59% on CIFAR-100 at radius $\rho = 36/255$ for 40-layer networks). With semi-supervised learning over unlabelled data, we are able to improve the state-of-the-art certified robustness on CIFAR-10 at $\rho = 108/255$ from 36.04% to 42.39%. In addition, LOT consistently outperforms baselines on different model architectures with only 1/3 of the evaluation time.
1 Introduction
Given the wide applications of deep neural networks (DNNs), ensuring their robustness against potential adversarial attacks [7, 21, 31, 32] is of great importance. There has been a line of research providing defense approaches to improve the empirical robustness of DNNs [18, 34, 30, 29], and certification methods to certify DNN robustness [14, 35, 15, 13, 33]. Existing certification techniques can be categorized as deterministic and probabilistic certifications [14], and in this work we will focus on improving the deterministic $\ell_2$ certified robustness by training 1-Lipschitz DNNs.
Although different approaches have been proposed to empirically enforce the Lipschitz constant of the trained model [26], it is still challenging to strictly ensure the 1-Lipschitz property that leads to tight robustness certification. One recent work, SOC [23], proposes to parametrize the orthogonal weight matrices with the exponential of skew-symmetric matrices (i.e. $W = \exp(V - V^\top)$). However, such a parametrization is biased when the matrix norm is constrained, and the expressiveness is limited, especially when $V$ is rescaled to be small to help with convergence. In this work, we propose a layer-wise orthogonal training approach (LOT) that parametrizes the orthogonal weight matrix with an unconstrained matrix via $W = (VV^\top)^{-1/2}V$. In order to calculate the inverse square root for convolution kernels, we perform a Fourier transformation and calculate the inverse square root of matrices in the frequency domain using Newton's iteration. In our parametrization, the output is agnostic to the input norm (i.e. rescaling $V$ to $\alpha V$ does not change the value of $W$). We show that such a
parametrization achieves higher model expressiveness and robustness (Section 6.2), and provides
more meaningful representation vectors (Section 6.3).
In addition, several works have shown that semi-supervised learning helps improve the empirical robustness of models under adversarial training and other settings [4]. In this work, we make the first attempt to bridge semi-supervised training with certified robustness based on our 1-Lipschitz DNNs. Theoretically, we show that semi-supervised learning can help improve the error bound of Lipschitz-bounded models. We also lower bound the certified radius as a function of the model performance and Lipschitz property. Empirically, we indeed observe that including unlabelled data helps with the certified robustness of 1-Lipschitz models, especially at a larger radius (e.g. from 36.04% to 42.39% at $\rho = 108/255$ on CIFAR-10).
We conduct comprehensive experiments to evaluate our approach, and we show that LOT significantly outperforms the state of the art in terms of deterministic $\ell_2$ certified robustness. We also conduct different ablation studies to show that (1) LOT produces more meaningful features for visualization, and (2) residual connections help smooth the training of the LOT model.
Technical contributions. In this work, we aim to train a certifiably robust 1-Lipschitz model and also analyze the certified radius of the Lipschitz-bounded model under semi-supervised learning.
• We propose a layer-wise orthogonal training method, LOT, for convolution layers to train 1-Lipschitz models based on Newton's iteration, and thus compute the deterministic certified robustness of the model. We prove the convergence of Newton's iteration used in our algorithm.
• We derive the certified robustness of Lipschitz-constrained models under the semi-supervised setting, and formally show how semi-supervised learning affects the certified radius.
• We evaluate our LOT method under different settings (i.e. supervised and semi-supervised) on different models and datasets. With supervised learning, we show that it significantly outperforms state-of-the-art baselines, and on some deep architectures the performance gain is over 4%. With semi-supervised learning, we further improve certified robust accuracy by over 6% at a large radius.
2 Related Work
Certified Robustness for Lipschitz Constrained Models
Several studies have been conducted to explore Lipschitz-constrained models for certified robustness. [26] first certifies model robustness based on its Lipschitz constant and proposes training algorithms to regularize the model Lipschitz constant. Multiple works [5, 19, 20, 8] have been proposed to achieve 1-Lipschitz linear networks during training by regularizing or normalizing the spectral norm of the weight matrix. However, when applying these approaches to convolution layers, the spectral norm is bounded by unrolling the convolution into linear operations, which leads to a loose Lipschitz bound [27]. Recently, [2] shows that the 1-Lipschitz requirement alone is not enough for a good robust model; rather, the gradient-norm-preserving property is important. Besides these training-time techniques, different approaches have been proposed to calculate a tight Lipschitz bound during evaluation. [6] upper bounds the Lipschitz constant with semi-definite programming, while [11] upper bounds it with polynomial optimization. In this work we aim to effectively train 1-Lipschitz convolution models.
Orthogonal Convolution Neural Networks
[16] first proposes to directly construct orthogonal convolution operations. Such operations are not only 1-Lipschitz, but also gradient-norm-preserving, which provides a higher model capability and a smoother training process [2]. BCOP [16] trains orthogonal convolutions by iteratively generating $2\times 2$ orthogonal kernels from orthogonal matrices. [25] proposes to parametrize an orthogonal convolution with the Cayley transformation $W = (I - V + V^\top)(I + V - V^\top)^{-1}$, where the convolution inverse is calculated in the Fourier frequency domain. ECO [36] proposes to explicitly normalize all the singular values [22] of convolution operations to be 1. So far, the best-performing orthogonal convolution approach is SOC [23], which parametrizes the orthogonal convolution with $W = \exp(V - V^\top)$, where the exponential and transpose are defined with respect to the convolution operation. However, one major weakness of SOC is that it rescales $V$ to be small so that the approximation of $\exp$ converges quickly, which imposes a bias on the resulting output space. For example, when $V$ is very small, $W$ will be close to $I$. Such a norm-dependent property is not desired, and thus we propose a parametrization that is invariant to rescaling. Finally, [24] proposes several techniques for orthogonal CNNs, including a generalized Householder (HH) activation function, a certificate regularizer (CReg) loss, and a last layer normalization (LLN). These can be integrated with our training approach to improve model robustness.
3 Problem Setup
3.1 Lipschitz Constant of Neural Networks and Certified Robustness
Let $f: \mathbb{R}^d \to \mathbb{R}^C$ denote a neural network for classification, where $d$ is the input dimension and $C$ is the number of output classes. The model prediction is given by $\arg\max_c f(x)_c$, where $f(x)_c$ represents the prediction probability for class $c$. The Lipschitz constant of the model under the $p$-norm is defined as $\mathrm{Lip}_p(f) = \sup_{x_1 \neq x_2 \in \mathbb{R}^d} \|f(x_1) - f(x_2)\|_p / \|x_1 - x_2\|_p$. Unless otherwise specified, we will focus on the $\ell_2$-norm and use $\mathrm{Lip}(f)$ to denote $\mathrm{Lip}_2(f)$ in this work. We can observe that the definition of the model Lipschitz constant aligns with its robustness property: both require the model not to change much w.r.t. input changes. Formally speaking, define $M_f(x) = \max_i f(x)_i - \max_{j \neq \arg\max_i f(x)_i} f(x)_j$ to be the prediction gap of $f$ on the input $x$; then we can guarantee that $f(x)$ will not change its prediction within $\|x' - x\|_2 < r$, where
$$r = M_f(x) / (\sqrt{2}\,\mathrm{Lip}(f)).$$
Therefore, people have proposed to utilize the model Lipschitz constant to provide certified robustness and investigated training algorithms to train a model with a small Lipschitz constant.
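To make this certification rule concrete, here is a minimal sketch (assuming a classifier with a known global $\ell_2$ Lipschitz bound `lip`, e.g. 1 for the 1-Lipschitz networks considered in this paper) that converts a batch of logits into certified radii following the formula above; the function name is ours.

```python
import torch

def certified_radius(logits: torch.Tensor, lip: float) -> torch.Tensor:
    """Certified l2 radius from the prediction gap M_f(x) and a Lipschitz bound.

    logits: (batch, C) model outputs f(x); lip: an upper bound on Lip(f).
    """
    top2 = logits.topk(2, dim=1).values        # largest and second-largest logits
    margin = top2[:, 0] - top2[:, 1]           # prediction gap M_f(x)
    return margin / (2 ** 0.5 * lip)           # r = M_f(x) / (sqrt(2) * Lip(f))

# A point is certified at radius rho if certified_radius(logits, 1.0) > rho.
```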
Note that the Lipschitz constant of a composed function $f = f_1 \circ f_2$ satisfies $\mathrm{Lip}(f) \le \mathrm{Lip}(f_1) \times \mathrm{Lip}(f_2)$. Since a neural network is usually composed of several layers, we can investigate the Lipschitz constant of each layer to calculate the final upper bound. If we can restrict each layer to be 1-Lipschitz, then the overall model will be 1-Lipschitz with an arbitrary number of layers.
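As a toy numerical illustration of this composition bound (our own example, not code from the paper), the sketch below bounds the Lipschitz constant of a small two-layer ReLU network by the product of per-layer spectral norms and compares it with an empirical estimate from random input pairs:

```python
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
f = lambda x: W2 @ torch.relu(W1 @ x)          # ReLU is 1-Lipschitz

# Layer-wise bound: Lip(f) <= ||W2||_2 * 1 * ||W1||_2
bound = torch.linalg.matrix_norm(W2, 2) * torch.linalg.matrix_norm(W1, 2)

x1, x2 = torch.randn(1000, 32), torch.randn(1000, 32)
ratios = torch.stack([(f(a) - f(b)).norm() / (a - b).norm() for a, b in zip(x1, x2)])
print(float(ratios.max()), "<=", float(bound))  # the sampled estimate never exceeds the bound
```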
3.2 Orthogonal Linear and Convolution Operations
Consider a linear operation with equal input and output dimensions, $y = Wx$, where $x, y \in \mathbb{R}^n$ and $W \in \mathbb{R}^{n \times n}$. We say $W$ is an orthogonal matrix if $WW^\top = W^\top W = I$ and call $y = Wx$ an orthogonal linear operation. The orthogonal operation is not only 1-Lipschitz, but also norm-preserving, i.e., $\|Wx\|_2 = \|x\|_2$ for all $x \in \mathbb{R}^n$. If the input and output dimensions do not match, i.e., $W \in \mathbb{R}^{m \times n}$ where $n \neq m$, we say $W$ is semi-orthogonal if either $WW^\top = I$ or $W^\top W = I$. The semi-orthogonal operation is 1-Lipschitz and non-expansive, i.e., $\|Wx\|_2 \le \|x\|_2$.
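For intuition, the following sketch (our own illustration) builds an orthogonal matrix via a QR decomposition and numerically checks the norm-preserving and non-expansive properties stated above:

```python
import torch

torch.manual_seed(0)
n, m = 16, 8
W, _ = torch.linalg.qr(torch.randn(n, n))             # orthogonal: W W^T = W^T W = I
x = torch.randn(n)
print(torch.allclose(W @ W.T, torch.eye(n), atol=1e-5))       # orthogonality
print(torch.allclose((W @ x).norm(), x.norm(), atol=1e-5))    # norm-preserving

W_semi = W[:m]                                        # semi-orthogonal: W_semi W_semi^T = I_m
print(torch.allclose(W_semi @ W_semi.T, torch.eye(m), atol=1e-5))
print(bool((W_semi @ x).norm() <= x.norm() + 1e-6))   # non-expansive
```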
The orthogonal convolution operation is defined in a similar way. Let $Y = W \star X$ be a convolution operation, where $X, Y \in \mathbb{R}^{c \times w \times w}$ and $W \in \mathbb{R}^{c \times c \times k \times k}$. We say $W$ is an orthogonal convolution kernel if $W \star W^\top = W^\top \star W = I_{\mathrm{conv}}$, where the transpose here refers to the convolution transpose and $I_{\mathrm{conv}}$ denotes the identity convolution kernel. Such an orthogonal convolution is 1-Lipschitz and norm-preserving. When the input and output channel numbers are different, the semi-orthogonal convolution kernel $W$ satisfies either $W \star W^\top = I_{\mathrm{conv}}$ or $W^\top \star W = I_{\mathrm{conv}}$, and it is 1-Lipschitz and non-expansive.
4 LOT: Layer-wise Orthogonal Training
In this section, we propose our LOT framework to achieve certified robustness.¹ We will first introduce how the LOT layer works to achieve the 1-Lipschitz property. The key idea of our method is to parametrize an orthogonal convolution $W$ with an unconstrained convolution kernel via $W = (VV^\top)^{-1/2}V$. Next, we propose several techniques to improve the training and evaluation processes of our model. Finally, we discuss how semi-supervised learning can help with the certified robustness of our model.
4.1 1-Lipschitz Neural Networks via LOT
Our key observation is that we can parametrize an orthogonal matrix
WRn×n
with an uncon-
strained matrix
VRn×n
by
W= (V V |)1
2V
[
9
]. In addition, this equation also holds true
in the case of convolution kernel - given any convolution kernel
VRc×c×k×k
, where
c
denotes
the channel number and
k
denotes the kernel size, we can get an orthogonal convolution kernel by
W= (VV|)1
2V
, where transpose and inverse square root are with respect to the convolution
operations. The orthogonality of
W
can be verified by
WW|=W|W=Iconv
, where
I
is
1The code is available at https://github.com/AI-secure/Layerwise-Orthogonal-Training.
3
the identity convolution kernel. This way, we can parametrize an orthogonal convolution layer by
training over the un-constrained parameter V.
Formally speaking, we will parametrize an orthogonal convolution layer by $Y = (V \star V^\top)^{-1/2} \star V \star X$, where $X, Y \in \mathbb{R}^{c \times w \times w}$ and $w \ge k$ is the input shape. The key obstacle here is how to calculate the inverse square root of a convolution kernel. Inspired by [25], we can leverage the convolution theorem, which states that convolution in the input domain equals element-wise multiplication in the Fourier frequency domain. In the case of multi-channel convolution, the convolution corresponds to a matrix multiplication at each pixel location. That is, let $\mathrm{FFT}: \mathbb{R}^{w\times w} \to \mathbb{C}^{w\times w}$ be the 2D Discrete Fourier Transformation and $\mathrm{FFT}^{-1}$ be the inverse Fourier Transformation. We will zero-pad the input to $w \times w$ if the original shape is smaller than $w$. Let $\tilde{X}_i = \mathrm{FFT}(X_i)$ and $\tilde{V}_{j,i} = \mathrm{FFT}(V_{j,i})$ denote the input and kernel in the frequency domain; then we have:
$$\mathrm{FFT}(Y)_{:,a,b} = (\tilde{V}_{:,:,a,b} \tilde{V}_{:,:,a,b}^{*})^{-1/2}\, \tilde{V}_{:,:,a,b}\, \tilde{X}_{:,a,b},$$
in which the multiplication, transpose and inverse square root operations are matrix-wise. Therefore, we can first calculate $\mathrm{FFT}(Y)$ in the frequency domain and then perform the inverse Fourier transformation to get the final output.
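This per-frequency computation can be sketched in PyTorch as follows (an illustrative implementation with our own function name and shapes, not the released code). The 2D FFT is applied per channel, and at each frequency $(a, b)$ a small $c \times c$ matrix problem is solved; for clarity, the inverse square root below uses an eigendecomposition, whereas the paper uses the differentiable Newton iteration described next (index conventions such as kernel centering are glossed over here).

```python
import torch

def lot_freq_forward(V: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """V: (c, c, k, k) unconstrained kernel; X: (c, w, w) single input."""
    w = X.shape[-1]
    Vf = torch.fft.fft2(V, s=(w, w)).permute(2, 3, 0, 1)   # (w, w, c, c): one matrix per frequency
    Xf = torch.fft.fft2(X).permute(1, 2, 0).unsqueeze(-1)  # (w, w, c, 1)
    A = Vf @ Vf.conj().transpose(-1, -2)                   # Hermitian PSD, per frequency
    # inverse square root via eigendecomposition (illustrative stand-in for Newton's iteration)
    eigval, eigvec = torch.linalg.eigh(A)
    inv_sqrt = (eigvec @ torch.diag_embed(eigval.clamp_min(1e-8) ** -0.5).to(eigvec.dtype)
                @ eigvec.conj().transpose(-1, -2))
    Yf = (inv_sqrt @ Vf) @ Xf                               # FFT(Y) at every frequency
    return torch.fft.ifft2(Yf.squeeze(-1).permute(2, 0, 1)).real
```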
We will use Newton's iteration to calculate the inverse square root of the positive semi-definite matrix $A = \tilde{V}_{:,:,a,b}\tilde{V}_{:,:,a,b}^{*}$ in a differentiable way [17]. If we initialize $Y_0 = A$ and $Z_0 = I$ and perform the following update rule:
$$Y_{k+1} = \tfrac{1}{2} Y_k (3I - Z_k Y_k), \qquad Z_{k+1} = \tfrac{1}{2} (3I - Z_k Y_k) Z_k, \qquad (1)$$
then $Z_k$ will converge to $A^{-1/2}$ when $\|I - A\|_2 < 1$. The condition can be satisfied by rescaling the parameter $V$ to $\alpha V$, noticing that the scaling factor does not change the resulting $W$. In practice, we execute a finite number of Newton's iteration steps, and we provide a rigorous error bound for this finite iteration scheme in Appendix D to show that the error decays exponentially. In addition, we show that although the operation is over the complex domain after the FFT, the resulting parameters will still be in the real domain, as shown below.
Theorem 4.1. Say $J \in \mathbb{C}^{m\times m}$ is unitary so that $JJ^{*} = I$, and $V = J\tilde{V}J^{*}$ for $V \in \mathbb{R}^{m\times m}$ and $\tilde{V} \in \mathbb{C}^{m\times m}$. Let $F(V) = (VV^{*})^{-1/2}V$ be our transformation. Then $F(V) = JF(\tilde{V})J^{*}$.

Proof. First, notice that $V^{\top} = V^{*} = J\tilde{V}^{*}J^{*}$. Second, we have
$$(\tilde{V}\tilde{V}^{*})^{-1} = J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1}J = \big(J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1/2}J\big)^{2},$$
so that $(\tilde{V}\tilde{V}^{*})^{-1/2} = J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1/2}J$. Thus, we have
$$J^{*}F(V)J = J^{*}(VV^{\top})^{-1/2}VJ = J^{*}(J\tilde{V}J^{*}J\tilde{V}^{*}J^{*})^{-1/2}J\tilde{V}J^{*}J = J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1/2}J\tilde{V} = (\tilde{V}\tilde{V}^{*})^{-1/2}\tilde{V} = F(\tilde{V}),$$
where the second equality uses $V = J\tilde{V}J^{*}$.

Remark. From this theorem it is clear that $JF(\tilde{V})J^{*}$ equals the original matrix $F(V) \in \mathbb{R}^{m\times m}$, i.e., the value $F(\tilde{V})$ computed in the frequency domain is exactly the transformed version of $F(V)$, and thus the returned result is guaranteed to be in the real domain.
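Below is a minimal, self-contained sketch of the coupled Newton iteration in Eq. (1) (our own naming; the rescaling here simply divides $V$ by its spectral norm so that $\|I - A\|_2 < 1$ holds for full-rank $V$, and the exact rescaling in the released implementation may differ):

```python
import torch

def inv_sqrt_newton(A: torch.Tensor, num_iters: int = 25) -> torch.Tensor:
    """Approximate A^{-1/2} for a (batched) PSD matrix A via Eq. (1); needs ||I - A||_2 < 1."""
    I = torch.eye(A.shape[-1], dtype=A.dtype, device=A.device).expand_as(A)
    Y, Z = A, I.clone()
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z                    # Y_k -> A^{1/2}, Z_k -> A^{-1/2}
    return Z

torch.manual_seed(0)
V = torch.randn(8, 8)
V = V / torch.linalg.matrix_norm(V, 2)         # rescale; the scaling cancels in (V V^T)^{-1/2} V
W = inv_sqrt_newton(V @ V.T) @ V
print((W @ W.T - torch.eye(8)).abs().max())    # close to 0: W is (approximately) orthogonal
```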
Circular Padding vs. Zero Padding
When we apply the convolution theorem to calculate the convolution $Y = W \star X$ with the Fourier transformation, the result implicitly uses circular padding. However, in neural networks, zero padding is usually a better choice. Therefore, we will first perform zero padding on both sides of the input, $X_{\mathrm{pad}} = \mathrm{zero\_pad}(X)$, and calculate the resulting $Y_{\mathrm{pad}} = W \star X_{\mathrm{pad}}$ with the Fourier transformation. Thus, the implicit circular padding in this process will actually pad the zeros which we padded beforehand. Finally, we truncate the padded part and get our desired output $Y = \mathrm{truncate}(Y_{\mathrm{pad}})$. We empirically observe that this technique helps improve the model robustness, as shown in Appendix E.3.
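A sketch of this padding trick (illustrative; `freq_conv` stands for any routine that applies the orthogonal kernel through the Fourier domain, e.g. the `lot_freq_forward` sketch above, and an odd kernel size $k$ is assumed):

```python
import torch
import torch.nn.functional as F

def conv_with_zero_padding(freq_conv, X: torch.Tensor, k: int) -> torch.Tensor:
    """freq_conv: callable applying the FFT-based (hence circularly padded) convolution.
    X: (c, w, w) input; k: kernel size."""
    p = k // 2
    X_pad = F.pad(X, (p, p, p, p))             # explicit zero padding on both sides
    Y_pad = freq_conv(X_pad)                   # the implicit circular padding now wraps over zeros
    wp = Y_pad.shape[-1]
    return Y_pad[..., p:wp - p, p:wp - p]      # truncate back to the original w x w
```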
When Input and Output Dimensions Differ
In the previous discussion, we assumed that the convolution kernel $V$ has equal input and output channel numbers. In the case $W \in \mathbb{R}^{c_{out}\times c_{in}\times k\times k}$ where $c_{out} \neq c_{in}$, we aim to get a semi-orthogonal convolution kernel $W$ (i.e. $W \star W^\top = I_{\mathrm{conv}}$ if $c_{out} < c_{in}$, or $W^\top \star W = I_{\mathrm{conv}}$ if $c_{out} > c_{in}$). As pointed out in [9], calculating $W$ with Newton's iteration naturally leads to a semi-orthogonal convolution kernel when the input and output dimensions differ.
Emulating a Larger Stride
To emulate the case when the convolution stride is 2, we follow previous works [23] and use the invertible downsampling layer [10], in which the input channel number $c_{in}$ is increased by a factor of 4. Strides larger than 2 can be emulated with similar techniques if needed.
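In PyTorch, the invertible downsampling can be written with `nn.PixelUnshuffle`, which rearranges each $2\times 2$ spatial block into channels; a stride-2 convolution then becomes a stride-1 LOT convolution over $4c_{in}$ input channels (a usage sketch, not the paper's exact code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)       # (batch, c_in, w, w)
downsample = nn.PixelUnshuffle(2)    # invertible, norm-preserving rearrangement
x_down = downsample(x)               # (1, 64, 16, 16): 4 * c_in channels at half the resolution
# x_down is then fed to a stride-1 LOT convolution with 4 * c_in input channels.
```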
Overall Algorithm
Combining the techniques discussed above, we obtain our final LOT convolution layer. The detailed algorithm is shown in Appendix B. First, we pad the input to counteract the implicit circular padding mechanism, and pad the kernel so that the two have the same spatial shape. Next, we perform the Fourier transformation and calculate the output in the frequency domain with Newton's iteration. Finally, we perform the inverse Fourier transformation and return the desired output.
Comparison with SOC
Several works have been proposed on orthogonal convolution layers with re-parametrization [25, 23], among which the SOC approach via $W = \exp(V - V^\top)$ has achieved the best performance. Compared with SOC, LOT has the following advantages. First, the parametrization in LOT is norm-independent: rescaling $V$ to $\alpha V$ does not change the resulting $W$. By comparison, a $V$ with a smaller norm in SOC leads to a $W$ closer to the identity transformation. Considering that the norm of $V$ is regularized during training (e.g. SOC rescales $V$ to have a small norm; people usually initialize weights to be small and impose $\ell_2$-regularization during training), the orthogonal weight space in SOC may be biased. Second, LOT is able to model any orthogonal kernel $W$, by noticing that $(WW^\top)^{-1/2}W$ is $W$ itself; by comparison, SOC cannot parametrize all orthogonal operations. For example, in the case of orthogonal matrices, the exponential of a skew-symmetric matrix only models the special orthogonal group (i.e. the matrices with +1 determinant). Third, we directly handle the case when $c_{in} \neq c_{out}$, while SOC needs extra padding so that the channel numbers match. Finally, LOT is more efficient during evaluation, where the only extra overhead is the Fourier and inverse Fourier transformations, while SOC needs multiple convolution operations to calculate the exponential. We will further show quantitative comparisons with SOC in Section 6.2.
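The norm-independence claim is easy to verify numerically. The sketch below (our own illustration, matrix case rather than convolution) confirms that $(VV^\top)^{-1/2}V$ is unchanged when $V$ is rescaled, whereas the SOC-style $\exp(V - V^\top)$ collapses toward the identity as the scale shrinks:

```python
import torch

def lot_param(V):                        # LOT parametrization (matrix case)
    eigval, eigvec = torch.linalg.eigh(V @ V.T)
    return eigvec @ torch.diag(eigval.clamp_min(1e-12) ** -0.5) @ eigvec.T @ V

def soc_param(V):                        # SOC-style parametrization
    return torch.linalg.matrix_exp(V - V.T)

torch.manual_seed(0)
V = torch.randn(6, 6)
print(torch.allclose(lot_param(V), lot_param(0.01 * V), atol=1e-3))   # True: scale-invariant
print(torch.allclose(soc_param(0.01 * V), torch.eye(6), atol=1e-1))   # True: close to identity
```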
The major limitation of LOT is its large overhead during training, since we need to calculate Newton’s
iteration in each training step which takes more time and memory. In addition, we sometimes observe
that Newton’s iteration is not stable when we perform many steps with 32-bit precision. To overcome
this problem, we will pre-calculate Newton’s iteration with 64-bit precision during evaluation, as we
will introduce in Section 4.2.
4.2 Training and Evaluation of LOT
Smoothing the Training Stage
In practice, we observe that the LOT layers are highly non-smooth with respect to the parameter $V$ (see Section 6.3). Therefore, the model is difficult to converge during the training process, especially when the model is deep. To smooth the training, we propose two techniques. First, we initialize all except the bottleneck layers (where $c_{in} \neq c_{out}$) with the identity parameter $V = I$. The bottleneck layers where $c_{in} \neq c_{out}$ will still be randomly initialized. Second, as pointed out in [12], residual connections help with model smoothness. Therefore, for the intermediate layers, we add the 1-Lipschitz residual connection $y = \lambda x + (1 - \lambda) f(x)$. Some work suggests that $\lambda$ here can be trainable [25], while we observe that setting $\lambda = 0.5$ is enough.
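A sketch of the 1-Lipschitz residual wrapper (illustrative; `OrthogonalConv` in the usage comment is a hypothetical placeholder for any 1-Lipschitz LOT layer):

```python
import torch.nn as nn

class LipschitzResidual(nn.Module):
    """y = lam * x + (1 - lam) * f(x); 1-Lipschitz whenever f is and 0 <= lam <= 1."""
    def __init__(self, block: nn.Module, lam: float = 0.5):
        super().__init__()
        self.block, self.lam = block, lam

    def forward(self, x):
        return self.lam * x + (1.0 - self.lam) * self.block(x)

# Usage (hypothetical): layer = LipschitzResidual(OrthogonalConv(c, c, k), lam=0.5)
```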
Speeding up the Evaluation Stage
Notice that after the model is trained, the orthogonal kernel $\tilde{W}$ will no longer change. Therefore, we can pre-calculate Newton's iteration and reuse the result for every evaluation step. Thus, the only runtime overhead compared with a standard convolution layer during evaluation is the Fourier transformation part. In addition, we observe that using 64-bit precision instead of the commonly used 32-bit precision in Newton's iteration helps with numerical stability. Therefore, when we pre-calculate $\tilde{W}$, we first transform $\tilde{V}$ into float64. After the Newton iterations, we transform the resulting $\tilde{W}$ back to float32 for efficiency.
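A sketch of this evaluation-time precomputation (illustrative; it reuses the `inv_sqrt_newton` routine sketched earlier, and the exact per-frequency rescaling and precision handling in the released code may differ):

```python
import torch

@torch.no_grad()
def precompute_orthogonal_kernel(V: torch.Tensor, w: int, num_iters: int = 25) -> torch.Tensor:
    """Precompute the frequency-domain orthogonal kernel once after training.
    V: (c, c, k, k) unconstrained kernel; w: spatial size of the layer input."""
    Vf = torch.fft.fft2(V.double(), s=(w, w)).permute(2, 3, 0, 1)    # run in float64 for stability
    # rescale each frequency so Newton's iteration converges; W is invariant to this scaling
    scale = torch.linalg.matrix_norm(Vf, 2, keepdim=True).clamp_min(1e-12)
    Vf = Vf / scale
    Wf = inv_sqrt_newton(Vf @ Vf.conj().transpose(-1, -2), num_iters) @ Vf
    return Wf.to(torch.complex64)          # cast back to 32-bit precision for fast evaluation
```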