LOT: Layer-wise Orthogonal Training on Improving $\ell_2$ Certified Robustness
Xiaojun Xu, Linyi Li, Bo Li
University of Illinois Urbana-Champaign
{xiaojun3, linyi2, lbo}@illinois.edu
Abstract
Recent studies show that training deep neural networks (DNNs) with Lipschitz constraints can enhance adversarial robustness and other model properties such as stability. In this paper, we propose a layer-wise orthogonal training method (LOT) to effectively train 1-Lipschitz convolution layers by parametrizing an orthogonal matrix with an unconstrained matrix. We then efficiently compute the inverse square root of a convolution kernel by transforming the input domain to the Fourier frequency domain. On the other hand, as existing works show that semi-supervised training helps improve empirical robustness, we aim to bridge the gap and prove that semi-supervised learning also improves the certified robustness of Lipschitz-bounded models. We conduct comprehensive evaluations for LOT under different settings. We show that LOT significantly outperforms baselines regarding deterministic $\ell_2$ certified robustness, and scales to deeper neural networks. Under the supervised scenario, we improve the state-of-the-art certified robustness for all architectures (e.g. from 59.04% to 63.50% on CIFAR-10 and from 32.57% to 34.59% on CIFAR-100 at radius $\rho = 36/255$ for 40-layer networks). With semi-supervised learning over unlabelled data, we are able to improve the state-of-the-art certified robustness on CIFAR-10 at $\rho = 108/255$ from 36.04% to 42.39%. In addition, LOT consistently outperforms baselines on different model architectures with only 1/3 of the evaluation time.
1 Introduction
Given the wide applications of deep neural networks (DNNs), ensuring their robustness against potential adversarial attacks [7, 21, 31, 32] is of great importance. There has been a line of research providing defense approaches to improve the empirical robustness of DNNs [18, 34, 30, 29], and certification methods to certify DNN robustness [14, 35, 15, 13, 33]. Existing certification techniques can be categorized as deterministic and probabilistic certifications [14], and in this work we will focus on improving the deterministic $\ell_2$ certified robustness by training 1-Lipschitz DNNs.
Although different approaches have been proposed to empirically enforce the Lipschitz constant of the trained model [26], it is still challenging to strictly ensure the 1-Lipschitz property that leads to tight robustness certification. One recent work, SOC [23], proposes to parametrize the orthogonal weight matrices with the exponential of skew-symmetric matrices (i.e. $W = \exp(V - V^\top)$). However, such a parametrization is biased when the matrix norm is constrained, and the expressiveness is limited, especially when $V$ is rescaled to be small to help with convergence. In this work, we propose a layer-wise orthogonal training approach (LOT) that parametrizes the orthogonal weight matrix with an unconstrained matrix via $W = (VV^\top)^{-1/2}V$. In order to calculate the inverse square root for convolution kernels, we perform a Fourier transformation and calculate the inverse square root of matrices in the frequency domain using Newton's iteration. In our parametrization, the output is agnostic to the input norm (i.e. rescaling $V$ to $\alpha V$ does not change the value of $W$). We show that such a
parametrization achieves higher model expressiveness and robustness (Section 6.2), and provides
more meaningful representation vectors (Section 6.3).
In addition, several works have shown that semi-supervised learning helps improve the empirical robustness of models under adversarial training and other settings [4]. In this work, we make the first attempt to bridge semi-supervised training with certified robustness based on our 1-Lipschitz DNNs. Theoretically, we show that semi-supervised learning can help improve the error bound of Lipschitz-bounded models. We also lower bound the certified radius as a function of the model performance and Lipschitz property. Empirically, we indeed observe that including unlabelled data helps with the certified robustness of 1-Lipschitz models, especially at a larger radius (e.g. from 36.04% to 42.39% at $\rho = 108/255$ on CIFAR-10).
We conduct comprehensive experiments to evaluate our approach, and we show that LOT significantly outperforms the state of the art in terms of deterministic $\ell_2$ certified robustness. We also conduct different ablation studies to show that (1) LOT produces more meaningful features for visualization, and (2) residual connections help smooth the training of the LOT model.
Technical contributions. In this work, we aim to train a certifiably robust 1-Lipschitz model and also analyze the certified radius of the Lipschitz-bounded model under semi-supervised learning.
• We propose a layer-wise orthogonal training method, LOT, for convolution layers to train 1-Lipschitz models based on Newton's iteration, and thus compute the deterministic certified robustness of the model. We prove the convergence of Newton's iteration used in our algorithm.
• We derive the certified robustness of Lipschitz-constrained models under the semi-supervised setting, and formally show how semi-supervised learning affects the certified radius.
• We evaluate our LOT method under different settings (i.e. supervised and semi-supervised) on different models and datasets. With supervised learning, we show that it significantly outperforms state-of-the-art baselines, and on some deep architectures the performance gain is over 4%. With semi-supervised learning, we further improve certified robust accuracy by over 6% at a large radius.
2 Related Work
Certified Robustness for Lipschitz Constrained Models
Several studies have been conducted to explore Lipschitz-constrained models for certified robustness. [26] first certifies model robustness based on its Lipschitz constant and proposes training algorithms to regularize the model Lipschitz constant. Multiple works [5, 19, 20, 8] have been proposed to achieve 1-Lipschitz linear networks during training by regularizing or normalizing the spectral norm of the weight matrix. However, when applying these approaches to convolution layers, the spectral norm is bounded by unrolling the convolution into linear operations, which leads to a loose Lipschitz bound [27]. Recently, [2] shows that the 1-Lipschitz requirement alone is not enough for a good robust model; rather, the gradient-norm-preserving property is important. Besides these training-time techniques, different approaches have been proposed to calculate a tight Lipschitz bound during evaluation. [6] upper bounds the Lipschitz constant with semi-definite programming, while [11] upper bounds it with polynomial optimization. In this work we aim to effectively train 1-Lipschitz convolution models.
Orthogonal Convolution Neural Networks
[16] first proposes to directly construct orthogonal convolution operations. Such operations are not only 1-Lipschitz, but also gradient-norm-preserving, which provides a higher model capability and a smoother training process [2]. BCOP [16] trains orthogonal convolutions by iteratively generating $2\times 2$ orthogonal kernels from orthogonal matrices. [25] proposes to parametrize an orthogonal convolution with the Cayley transformation $W = (I - V + V^\top)(I + V - V^\top)^{-1}$, where the convolution inverse is calculated in the Fourier frequency domain. ECO [36] proposes to explicitly normalize all the singular values [22] of convolution operations to be 1. So far, the best-performing orthogonal convolution approach is SOC [23], which parametrizes the orthogonal convolution with $W = \exp(V - V^\top)$, where the exponential and transpose are defined with respect to the convolution operation. However, one major weakness of SOC is that it rescales $V$ to be small so that the approximation of $\exp$ converges quickly, which imposes a bias on the resulting output space. For example, when $V$ is very small, $W$ will be close to $I$. Such a norm-dependent property is not desired, and thus we propose a parametrization that is invariant to rescaling. Finally, [24] proposes several techniques for orthogonal CNNs, including a generalized Householder (HH) activation function, a certificate regularizer (CReg) loss, and a last layer normalization (LLN). These can be integrated with our training approach to improve model robustness.
3 Problem Setup
3.1 Lipschitz Constant of Neural Networks and Certified Robustness
Let $f: \mathbb{R}^d \to \mathbb{R}^C$ denote a neural network for classification, where $d$ is the input dimension and $C$ is the number of output classes. The model prediction is given by $\arg\max_c f(x)_c$, where $f(x)_c$ represents the prediction probability for class $c$. The Lipschitz constant of the model under the $p$-norm is defined as $\mathrm{Lip}_p(f) = \sup_{x_1 \neq x_2 \in \mathbb{R}^d} \|f(x_1) - f(x_2)\|_p / \|x_1 - x_2\|_p$. Unless otherwise specified, we will focus on the $\ell_2$-norm and use $\mathrm{Lip}(f)$ to denote $\mathrm{Lip}_2(f)$ in this work. We can observe that the definition of the model Lipschitz constant aligns with its robustness property: both require the model not to change much w.r.t. input changes. Formally speaking, define $M_f(x) = \max_i f(x)_i - \max_{j \neq \arg\max_i f(x)_i} f(x)_j$ to be the prediction gap of $f$ on the input $x$; then we can guarantee that $f(x)$ will not change its prediction within $\|x' - x\|_2 < r$, where
$$r = M_f(x) / (\sqrt{2}\,\mathrm{Lip}(f)).$$
Therefore, people have proposed to utilize the model Lipschitz constant to provide certified robustness and investigated training algorithms to train a model with a small Lipschitz constant.
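To make this certification rule concrete, here is a minimal sketch (assuming a classifier with a known global $\ell_2$ Lipschitz bound `lip`, e.g. 1 for the 1-Lipschitz networks considered in this paper) that converts a batch of logits into certified radii following the formula above; the function name is ours.

```python
import torch

def certified_radius(logits: torch.Tensor, lip: float) -> torch.Tensor:
    """Certified l2 radius from the prediction gap M_f(x) and a Lipschitz bound.

    logits: (batch, C) model outputs f(x); lip: an upper bound on Lip(f).
    """
    top2 = logits.topk(2, dim=1).values        # largest and second-largest logits
    margin = top2[:, 0] - top2[:, 1]           # prediction gap M_f(x)
    return margin / (2 ** 0.5 * lip)           # r = M_f(x) / (sqrt(2) * Lip(f))

# A point is certified at radius rho if certified_radius(logits, 1.0) > rho.
```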
Note that the Lipschitz constant of a composed function $f = f_1 \circ f_2$ satisfies $\mathrm{Lip}(f) \le \mathrm{Lip}(f_1) \times \mathrm{Lip}(f_2)$. Since a neural network is usually composed of several layers, we can investigate the Lipschitz constant of each layer to calculate the final upper bound. If we can restrict each layer to be 1-Lipschitz, then the overall model will be 1-Lipschitz with an arbitrary number of layers.
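As a toy numerical illustration of this composition bound (our own example, not code from the paper), the sketch below bounds the Lipschitz constant of a small two-layer ReLU network by the product of per-layer spectral norms and compares it with an empirical estimate from random input pairs:

```python
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
f = lambda x: W2 @ torch.relu(W1 @ x)          # ReLU is 1-Lipschitz

# Layer-wise bound: Lip(f) <= ||W2||_2 * 1 * ||W1||_2
bound = torch.linalg.matrix_norm(W2, 2) * torch.linalg.matrix_norm(W1, 2)

x1, x2 = torch.randn(1000, 32), torch.randn(1000, 32)
ratios = torch.stack([(f(a) - f(b)).norm() / (a - b).norm() for a, b in zip(x1, x2)])
print(float(ratios.max()), "<=", float(bound))  # the sampled estimate never exceeds the bound
```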
3.2 Orthogonal Linear and Convolution Operations
Consider a linear operation with equal input and output dimensions, $y = Wx$, where $x, y \in \mathbb{R}^n$ and $W \in \mathbb{R}^{n \times n}$. We say $W$ is an orthogonal matrix if $WW^\top = W^\top W = I$ and call $y = Wx$ an orthogonal linear operation. The orthogonal operation is not only 1-Lipschitz, but also norm-preserving, i.e., $\|Wx\|_2 = \|x\|_2$ for all $x \in \mathbb{R}^n$. If the input and output dimensions do not match, i.e., $W \in \mathbb{R}^{m \times n}$ where $n \neq m$, we say $W$ is semi-orthogonal if either $WW^\top = I$ or $W^\top W = I$. The semi-orthogonal operation is 1-Lipschitz and non-expansive, i.e., $\|Wx\|_2 \le \|x\|_2$.
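For intuition, the following sketch (our own illustration) builds an orthogonal matrix via a QR decomposition and numerically checks the norm-preserving and non-expansive properties stated above:

```python
import torch

torch.manual_seed(0)
n, m = 16, 8
W, _ = torch.linalg.qr(torch.randn(n, n))             # orthogonal: W W^T = W^T W = I
x = torch.randn(n)
print(torch.allclose(W @ W.T, torch.eye(n), atol=1e-5))       # orthogonality
print(torch.allclose((W @ x).norm(), x.norm(), atol=1e-5))    # norm-preserving

W_semi = W[:m]                                        # semi-orthogonal: W_semi W_semi^T = I_m
print(torch.allclose(W_semi @ W_semi.T, torch.eye(m), atol=1e-5))
print(bool((W_semi @ x).norm() <= x.norm() + 1e-6))   # non-expansive
```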
The orthogonal convolution operation is defined in a similar way. Let $Y = W \star X$ be a convolution operation, where $X, Y \in \mathbb{R}^{c \times w \times w}$ and $W \in \mathbb{R}^{c \times c \times k \times k}$. We say $W$ is an orthogonal convolution kernel if $W \star W^\top = W^\top \star W = I_{\mathrm{conv}}$, where the transpose here refers to the convolution transpose and $I_{\mathrm{conv}}$ denotes the identity convolution kernel. Such an orthogonal convolution is 1-Lipschitz and norm-preserving. When the input and output channel numbers are different, the semi-orthogonal convolution kernel $W$ satisfies either $W \star W^\top = I_{\mathrm{conv}}$ or $W^\top \star W = I_{\mathrm{conv}}$, and it is 1-Lipschitz and non-expansive.
4 LOT: Layer-wise Orthogonal Training
In this section, we propose our LOT framework to achieve certified robustness.¹ We will first introduce how the LOT layer works to achieve the 1-Lipschitz property. The key idea of our method is to parametrize an orthogonal convolution $W$ with an unconstrained convolution kernel via $W = (VV^\top)^{-1/2}V$. Next, we propose several techniques to improve the training and evaluation processes of our model. Finally, we discuss how semi-supervised learning can help with the certified robustness of our model.
4.1 1-Lipschitz Neural Networks via LOT
Our key observation is that we can parametrize an orthogonal matrix
WRn×n
with an uncon-
strained matrix
VRn×n
by
W= (V V |)1
2V
[
9
]. In addition, this equation also holds true
in the case of convolution kernel - given any convolution kernel
VRc×c×k×k
, where
c
denotes
the channel number and
k
denotes the kernel size, we can get an orthogonal convolution kernel by
W= (VV|)1
2V
, where transpose and inverse square root are with respect to the convolution
operations. The orthogonality of
W
can be verified by
WW|=W|W=Iconv
, where
I
is
1The code is available at https://github.com/AI-secure/Layerwise-Orthogonal-Training.
3
the identity convolution kernel. This way, we can parametrize an orthogonal convolution layer by
training over the un-constrained parameter V.
Formally speaking, we will parametrize an orthogonal convolution layer by $Y = (V \star V^\top)^{-1/2} \star V \star X$, where $X, Y \in \mathbb{R}^{c \times w \times w}$ and $w \ge k$ is the input shape. The key obstacle here is how to calculate the inverse square root of a convolution kernel. Inspired by [25], we can leverage the convolution theorem, which states that convolution in the input domain equals element-wise multiplication in the Fourier frequency domain. In the case of multi-channel convolution, the convolution corresponds to a matrix multiplication at each pixel location. That is, let $\mathrm{FFT}: \mathbb{R}^{w\times w} \to \mathbb{C}^{w\times w}$ be the 2D Discrete Fourier Transformation and $\mathrm{FFT}^{-1}$ be the inverse Fourier Transformation. We will zero-pad the input to $w \times w$ if the original shape is smaller than $w$. Let $\tilde{X}_i = \mathrm{FFT}(X_i)$ and $\tilde{V}_{j,i} = \mathrm{FFT}(V_{j,i})$ denote the input and kernel in the frequency domain; then we have:
$$\mathrm{FFT}(Y)_{:,a,b} = (\tilde{V}_{:,:,a,b} \tilde{V}_{:,:,a,b}^{*})^{-1/2}\, \tilde{V}_{:,:,a,b}\, \tilde{X}_{:,a,b},$$
in which the multiplication, transpose and inverse square root operations are matrix-wise. Therefore, we can first calculate $\mathrm{FFT}(Y)$ in the frequency domain and then perform the inverse Fourier transformation to get the final output.
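This per-frequency computation can be sketched in PyTorch as follows (an illustrative implementation with our own function name and shapes, not the released code). The 2D FFT is applied per channel, and at each frequency $(a, b)$ a small $c \times c$ matrix problem is solved; for clarity, the inverse square root below uses an eigendecomposition, whereas the paper uses the differentiable Newton iteration described next (index conventions such as kernel centering are glossed over here).

```python
import torch

def lot_freq_forward(V: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """V: (c, c, k, k) unconstrained kernel; X: (c, w, w) single input."""
    w = X.shape[-1]
    Vf = torch.fft.fft2(V, s=(w, w)).permute(2, 3, 0, 1)   # (w, w, c, c): one matrix per frequency
    Xf = torch.fft.fft2(X).permute(1, 2, 0).unsqueeze(-1)  # (w, w, c, 1)
    A = Vf @ Vf.conj().transpose(-1, -2)                   # Hermitian PSD, per frequency
    # inverse square root via eigendecomposition (illustrative stand-in for Newton's iteration)
    eigval, eigvec = torch.linalg.eigh(A)
    inv_sqrt = (eigvec @ torch.diag_embed(eigval.clamp_min(1e-8) ** -0.5).to(eigvec.dtype)
                @ eigvec.conj().transpose(-1, -2))
    Yf = (inv_sqrt @ Vf) @ Xf                               # FFT(Y) at every frequency
    return torch.fft.ifft2(Yf.squeeze(-1).permute(2, 0, 1)).real
```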
We will use Newton's iteration to calculate the inverse square root of the positive semi-definite matrix $A = \tilde{V}_{:,:,a,b}\tilde{V}_{:,:,a,b}^{*}$ in a differentiable way [17]. If we initialize $Y_0 = A$ and $Z_0 = I$ and perform the following update rule:
$$Y_{k+1} = \tfrac{1}{2} Y_k (3I - Z_k Y_k), \qquad Z_{k+1} = \tfrac{1}{2} (3I - Z_k Y_k) Z_k, \qquad (1)$$
then $Z_k$ will converge to $A^{-1/2}$ when $\|I - A\|_2 < 1$. The condition can be satisfied by rescaling the parameter $V$ to $\alpha V$, noticing that the scaling factor does not change the resulting $W$. In practice, we execute a finite number of Newton's iteration steps, and we provide a rigorous error bound for this finite iteration scheme in Appendix D to show that the error decays exponentially. In addition, we show that although the operation is over the complex domain after the FFT, the resulting parameters will still be in the real domain, as shown below.
Theorem 4.1. Say $J \in \mathbb{C}^{m\times m}$ is unitary so that $JJ^{*} = I$, and $V = J\tilde{V}J^{*}$ for $V \in \mathbb{R}^{m\times m}$ and $\tilde{V} \in \mathbb{C}^{m\times m}$. Let $F(V) = (VV^{*})^{-1/2}V$ be our transformation. Then $F(V) = JF(\tilde{V})J^{*}$.

Proof. First, notice that $V^{\top} = V^{*} = J\tilde{V}^{*}J^{*}$. Second, we have
$$(\tilde{V}\tilde{V}^{*})^{-1} = J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1}J = \big(J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1/2}J\big)^{2},$$
so that $(\tilde{V}\tilde{V}^{*})^{-1/2} = J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1/2}J$. Thus, we have
$$J^{*}F(V)J = J^{*}(VV^{\top})^{-1/2}VJ = J^{*}(J\tilde{V}J^{*}J\tilde{V}^{*}J^{*})^{-1/2}J\tilde{V}J^{*}J = J^{*}(J\tilde{V}\tilde{V}^{*}J^{*})^{-1/2}J\tilde{V} = (\tilde{V}\tilde{V}^{*})^{-1/2}\tilde{V} = F(\tilde{V}),$$
where the second equality uses $V = J\tilde{V}J^{*}$.

Remark. From this theorem it is clear that $JF(\tilde{V})J^{*}$ equals the original matrix $F(V) \in \mathbb{R}^{m\times m}$, i.e., the value $F(\tilde{V})$ computed in the frequency domain is exactly the transformed version of $F(V)$, and thus the returned result is guaranteed to be in the real domain.
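Below is a minimal, self-contained sketch of the coupled Newton iteration in Eq. (1) (our own naming; the rescaling here simply divides $V$ by its spectral norm so that $\|I - A\|_2 < 1$ holds for full-rank $V$, and the exact rescaling in the released implementation may differ):

```python
import torch

def inv_sqrt_newton(A: torch.Tensor, num_iters: int = 25) -> torch.Tensor:
    """Approximate A^{-1/2} for a (batched) PSD matrix A via Eq. (1); needs ||I - A||_2 < 1."""
    I = torch.eye(A.shape[-1], dtype=A.dtype, device=A.device).expand_as(A)
    Y, Z = A, I.clone()
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z                    # Y_k -> A^{1/2}, Z_k -> A^{-1/2}
    return Z

torch.manual_seed(0)
V = torch.randn(8, 8)
V = V / torch.linalg.matrix_norm(V, 2)         # rescale; the scaling cancels in (V V^T)^{-1/2} V
W = inv_sqrt_newton(V @ V.T) @ V
print((W @ W.T - torch.eye(8)).abs().max())    # close to 0: W is (approximately) orthogonal
```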
Circular Padding vs. Zero Padding
When we apply the convolution theorem to calculate the convolution $Y = W \star X$ with the Fourier transformation, the result implicitly uses circular padding. However, in neural networks, zero padding is usually a better choice. Therefore, we will first perform zero padding on both sides of the input, $X_{\mathrm{pad}} = \mathrm{zero\_pad}(X)$, and calculate the resulting $Y_{\mathrm{pad}} = W \star X_{\mathrm{pad}}$ with the Fourier transformation. Thus, the implicit circular padding in this process will actually pad the zeros which we padded beforehand. Finally, we truncate the padded part and get our desired output $Y = \mathrm{truncate}(Y_{\mathrm{pad}})$. We empirically observe that this technique helps improve the model robustness, as shown in Appendix E.3.
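A sketch of this padding trick (illustrative; `freq_conv` stands for any routine that applies the orthogonal kernel through the Fourier domain, e.g. the `lot_freq_forward` sketch above, and an odd kernel size $k$ is assumed):

```python
import torch
import torch.nn.functional as F

def conv_with_zero_padding(freq_conv, X: torch.Tensor, k: int) -> torch.Tensor:
    """freq_conv: callable applying the FFT-based (hence circularly padded) convolution.
    X: (c, w, w) input; k: kernel size."""
    p = k // 2
    X_pad = F.pad(X, (p, p, p, p))             # explicit zero padding on both sides
    Y_pad = freq_conv(X_pad)                   # the implicit circular padding now wraps over zeros
    wp = Y_pad.shape[-1]
    return Y_pad[..., p:wp - p, p:wp - p]      # truncate back to the original w x w
```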
When Input and Output Dimensions Differ
In the previous discussion, we assumed that the convolution kernel $V$ has equal input and output channel numbers. In the case $W \in \mathbb{R}^{c_{out}\times c_{in}\times k\times k}$ where $c_{out} \neq c_{in}$, we aim to get a semi-orthogonal convolution kernel $W$ (i.e. $W \star W^\top = I_{\mathrm{conv}}$ if $c_{out} < c_{in}$, or $W^\top \star W = I_{\mathrm{conv}}$ if $c_{out} > c_{in}$). As pointed out in [9], calculating $W$ with Newton's iteration naturally leads to a semi-orthogonal convolution kernel when the input and output dimensions differ.
Emulating a Larger Stride
To emulate the case when the convolution stride is 2, we follow previous works [23] and use the invertible downsampling layer [10], in which the input channel number $c_{in}$ is increased by a factor of 4. Strides larger than 2 can be emulated with similar techniques if needed.
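In PyTorch, the invertible downsampling can be written with `nn.PixelUnshuffle`, which rearranges each $2\times 2$ spatial block into channels; a stride-2 convolution then becomes a stride-1 LOT convolution over $4c_{in}$ input channels (a usage sketch, not the paper's exact code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)       # (batch, c_in, w, w)
downsample = nn.PixelUnshuffle(2)    # invertible, norm-preserving rearrangement
x_down = downsample(x)               # (1, 64, 16, 16): 4 * c_in channels at half the resolution
# x_down is then fed to a stride-1 LOT convolution with 4 * c_in input channels.
```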
Overall Algorithm
Combining the techniques discussed above, we obtain our final LOT convolution layer. The detailed algorithm is shown in Appendix B. First, we pad the input to counteract the implicit circular padding mechanism, and pad the kernel so that the two have the same spatial shape. Next, we perform the Fourier transformation and calculate the output in the frequency domain with Newton's iteration. Finally, we perform the inverse Fourier transformation and return the desired output.
Comparison with SOC
Several works have been proposed on orthogonal convolution layers with re-parametrization [25, 23], among which the SOC approach via $W = \exp(V - V^\top)$ has achieved the best performance. Compared with SOC, LOT has the following advantages. First, the parametrization in LOT is norm-independent: rescaling $V$ to $\alpha V$ does not change the resulting $W$. By comparison, a $V$ with a smaller norm in SOC leads to a $W$ closer to the identity transformation. Considering that the norm of $V$ is regularized during training (e.g. SOC rescales $V$ to have a small norm; people usually initialize weights to be small and impose $\ell_2$-regularization during training), the orthogonal weight space in SOC may be biased. Second, LOT is able to model any orthogonal kernel $W$, by noticing that $(WW^\top)^{-1/2}W$ is $W$ itself; by comparison, SOC cannot parametrize all orthogonal operations. For example, in the case of orthogonal matrices, the exponential of a skew-symmetric matrix only models the special orthogonal group (i.e. the matrices with +1 determinant). Third, we directly handle the case when $c_{in} \neq c_{out}$, while SOC needs extra padding so that the channel numbers match. Finally, LOT is more efficient during evaluation, where the only extra overhead is the Fourier and inverse Fourier transformations, while SOC needs multiple convolution operations to calculate the exponential. We will further show quantitative comparisons with SOC in Section 6.2.
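The norm-independence claim is easy to verify numerically. The sketch below (our own illustration, matrix case rather than convolution) confirms that $(VV^\top)^{-1/2}V$ is unchanged when $V$ is rescaled, whereas the SOC-style $\exp(V - V^\top)$ collapses toward the identity as the scale shrinks:

```python
import torch

def lot_param(V):                        # LOT parametrization (matrix case)
    eigval, eigvec = torch.linalg.eigh(V @ V.T)
    return eigvec @ torch.diag(eigval.clamp_min(1e-12) ** -0.5) @ eigvec.T @ V

def soc_param(V):                        # SOC-style parametrization
    return torch.linalg.matrix_exp(V - V.T)

torch.manual_seed(0)
V = torch.randn(6, 6)
print(torch.allclose(lot_param(V), lot_param(0.01 * V), atol=1e-3))   # True: scale-invariant
print(torch.allclose(soc_param(0.01 * V), torch.eye(6), atol=1e-1))   # True: close to identity
```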
The major limitation of LOT is its large overhead during training, since we need to calculate Newton’s
iteration in each training step which takes more time and memory. In addition, we sometimes observe
that Newton’s iteration is not stable when we perform many steps with 32-bit precision. To overcome
this problem, we will pre-calculate Newton’s iteration with 64-bit precision during evaluation, as we
will introduce in Section 4.2.
4.2 Training and Evaluation of LOT
Smoothing the Training Stage
In practice, we observe that the LOT layers are highly non-smooth with respect to the parameter $V$ (see Section 6.3). Therefore, the model is difficult to converge during the training process, especially when the model is deep. To smooth the training, we propose two techniques. First, we initialize all except the bottleneck layers (where $c_{in} \neq c_{out}$) with the identity parameter $V = I$. The bottleneck layers where $c_{in} \neq c_{out}$ will still be randomly initialized. Second, as pointed out in [12], residual connections help with model smoothness. Therefore, for the intermediate layers, we add the 1-Lipschitz residual connection $y = \lambda x + (1 - \lambda) f(x)$. Some work suggests that $\lambda$ here can be trainable [25], while we observe that setting $\lambda = 0.5$ is enough.
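A sketch of the 1-Lipschitz residual wrapper (illustrative; `OrthogonalConv` in the usage comment is a hypothetical placeholder for any 1-Lipschitz LOT layer):

```python
import torch.nn as nn

class LipschitzResidual(nn.Module):
    """y = lam * x + (1 - lam) * f(x); 1-Lipschitz whenever f is and 0 <= lam <= 1."""
    def __init__(self, block: nn.Module, lam: float = 0.5):
        super().__init__()
        self.block, self.lam = block, lam

    def forward(self, x):
        return self.lam * x + (1.0 - self.lam) * self.block(x)

# Usage (hypothetical): layer = LipschitzResidual(OrthogonalConv(c, c, k), lam=0.5)
```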
Speeding up the Evaluation Stage
Notice that after the model is trained, the orthogonal kernel $\tilde{W}$ will no longer change. Therefore, we can pre-calculate Newton's iteration and reuse the result for every evaluation step. Thus, the only runtime overhead compared with a standard convolution layer during evaluation is the Fourier transformation part. In addition, we observe that using 64-bit precision instead of the commonly used 32-bit precision in Newton's iteration helps with numerical stability. Therefore, when we pre-calculate $\tilde{W}$, we first transform $\tilde{V}$ into float64. After the Newton iterations, we transform the resulting $\tilde{W}$ back to float32 for efficiency.
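A sketch of this evaluation-time precomputation (illustrative; it reuses the `inv_sqrt_newton` routine sketched earlier, and the exact per-frequency rescaling and precision handling in the released code may differ):

```python
import torch

@torch.no_grad()
def precompute_orthogonal_kernel(V: torch.Tensor, w: int, num_iters: int = 25) -> torch.Tensor:
    """Precompute the frequency-domain orthogonal kernel once after training.
    V: (c, c, k, k) unconstrained kernel; w: spatial size of the layer input."""
    Vf = torch.fft.fft2(V.double(), s=(w, w)).permute(2, 3, 0, 1)    # run in float64 for stability
    # rescale each frequency so Newton's iteration converges; W is invariant to this scaling
    scale = torch.linalg.matrix_norm(Vf, 2, keepdim=True).clamp_min(1e-12)
    Vf = Vf / scale
    Wf = inv_sqrt_newton(Vf @ Vf.conj().transpose(-1, -2), num_iters) @ Vf
    return Wf.to(torch.complex64)          # cast back to 32-bit precision for fast evaluation
```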