Rethinking Lipschitz Neural Networks and Certified
Robustness: A Boolean Function Perspective
Bohang Zhang1, Du Jiang1, Di He1, Liwei Wang1,2†
1Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University
2Center for Data Science, Peking University
†Corresponding author
{zhangbohang,dujiang,dihe}@pku.edu.cn,wanglw@cis.pku.edu.cn
Abstract
Designing neural networks with bounded Lipschitz constant is a promising way to obtain certifiably robust classifiers against adversarial examples. However, the relevant progress for the important $\ell_\infty$ perturbation setting is rather limited, and a principled understanding of how to design expressive $\ell_\infty$ Lipschitz networks is still lacking. In this paper, we bridge the gap by studying certified $\ell_\infty$ robustness from a novel perspective of representing Boolean functions. We derive two fundamental impossibility results that hold for any standard Lipschitz network: one for robust classification on finite datasets, and the other for Lipschitz function approximation. These results identify that networks built upon norm-bounded affine layers and Lipschitz activations intrinsically lose expressive power even in the two-dimensional case, and shed light on how recently proposed Lipschitz networks (e.g., GroupSort and $\ell_\infty$-distance nets) bypass these impossibilities by leveraging order statistic functions. Finally, based on these insights, we develop a unified Lipschitz network that generalizes prior works, and design a practical version that can be efficiently trained (making certified robust training free). Extensive experiments show that our approach is scalable, efficient, and consistently yields better certified robustness across multiple datasets and perturbation radii than prior Lipschitz networks. Our code is available at https://github.com/zbh2047/SortNet.
1 Introduction
Modern neural networks, despite their great success in various applications [22, 13], typically suffer from a severe drawback of lacking adversarial robustness. In classification tasks, given an image $\mathbf{x}$ correctly classified by a neural network, there often exists a small adversarial perturbation $\boldsymbol{\delta}$, making the perturbed image $\mathbf{x}+\boldsymbol{\delta}$ look indistinguishable to humans but fool the network to predict a wrong class with high confidence [57, 4].
It is well-known that the adversarial robustness of a neural network is closely related to its Lipschitz continuity [8, 61] (see Section 3.1). Accordingly, training neural networks with bounded Lipschitz constant has been considered a promising way to address the problem. A variety of works studied Lipschitz architectures for the ordinary Euclidean norm [61, 60, 36, 54], and recent works even established state-of-the-art (deterministic) certified $\ell_2$ robustness [55, 40]. However, when it comes to the more critical (realistic) $\ell_\infty$ perturbation setting, the progress seems to be rather limited. In fact, standard Lipschitz ReLU networks have been shown to perform poorly in terms of $\ell_\infty$ robustness [61, 24, 2]. While other more advanced Lipschitz networks have been proposed [2, 10], the achieved results are still not satisfactory even on simple datasets like MNIST. Until recently, Zhang et al. [73, 74] designed a quite unusual Lipschitz network based on a heuristic choice of the $\ell_\infty$-distance function, which surprisingly established state-of-the-art certified $\ell_\infty$ robustness on multiple datasets over prior works. Yet, it remains unclear why previous Lipschitz networks typically failed, and what is the essential reason behind the success of this particular $\ell_\infty$-distance network structure.
Theoretical contributions. In this work, we systematically investigate how to design expressive Lipschitz neural networks (w.r.t. the $\ell_\infty$-norm) through the novel lens of representing discrete Boolean functions, which provides a deep understanding of the aforementioned problems. Specifically, we first identify a fundamental limitation of standard Lipschitz networks in representing a class of logical operations called symmetric Boolean functions (SBFs), which include the basic logical AND/OR as special cases. We prove that for any non-constant SBF of $d$ variables, there exists a finite dataset of size $O(d)$ such that the certified $\ell_\infty$ robust radius must vanish as $O(1/d)$ for any classifier induced by a standard Lipschitz network. Remarkably, since logical AND/OR operations correspond to perhaps the most basic classifiers, our result indicates an intrinsic difficulty of such networks in fitting high-dimensional real-world datasets with guaranteed certified $\ell_\infty$ robustness.
Our analysis can be readily extended to the Lipschitz function approximation setting. We point out the relationship between monotonic SBFs and the order statistics (which are 1-Lipschitz functions), and then prove that any $d$-dimensional order statistic (including the max/min function) on a compact domain cannot be approximated by standard Lipschitz networks within error $\Omega(1-1/d)$, regardless of the network size. This impossibility result is significant in that: (i) it applies to all Lipschitz activations (thus extending prior works [2, 24]); (ii) it resolves an open problem raised recently in [43]; and (iii) a quantitative lower bound on the approximation error is established.
Equipped with the above impossibility results, we proceed to examine two advanced Lipschitz architectures: the GroupSort network [2] and the recently proposed $\ell_\infty$-distance net [73, 74]. We find that besides the linear operation, both networks incorporate other Lipschitz aggregation operations into the neuron design, especially the order statistic functions, thus shedding light on how they work. However, for the MaxMin network [2], a computationally efficient version of the GroupSort network used in practice, representing Boolean functions and order statistics is possible only when the network is very deep. In particular, we prove that representing certain $d$-dimensional Boolean functions requires a depth of $\Omega(d)$, implying that shallow MaxMin networks are not Lipschitz-universal function approximators. In contrast, we show that a two-layer $\ell_\infty$-distance net suffices to represent any order statistic function on a compact domain, or even all Boolean functions. This strongly justifies the empirical success of the $\ell_\infty$-distance net over GroupSort (MaxMin) networks.
Practical contributions. Our theoretical insights can also guide the design of better Lipschitz network architectures. Inspired by the importance of order statistics, we propose a general form of Lipschitz network, called SortNet, that extends both GroupSort and $\ell_\infty$-distance networks and incorporates them into a unified framework. Yet, the full-sort operation is computationally expensive and leads to optimization difficulties (as with the GroupSort network). We further propose a specialized SortNet that can be efficiently trained, by assigning each weight vector $\mathbf{w}$ using a geometric series, i.e. $w_i$ proportional to $\rho^i$ for some $0 \le \rho < 1$. This gives a restricted version of SortNet that still covers the $\ell_\infty$-distance net as a special case. For this particular SortNet, we skillfully derive a stochastic estimation that gives an unbiased approximation of the neuron output without performing sorting operations explicitly. This eventually yields an efficient training strategy with a cost similar to training standard networks, thus making certified robust training free. Extensive experiments demonstrate that the proposed SortNet is scalable, efficient, and consistently achieves better certified robustness than prior Lipschitz networks across multiple datasets and perturbation radii. In particular, our approach even scales to a variant of ImageNet, and surpasses the best-known result [69] with a 22-fold decrease in training time thanks to our “free” certified training approach.
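To convey the flavor of this stochastic estimation, here is a hedged sketch (the exact SortNet parameterization and estimator are specified in Section 4; the neuron form below, $\sum_k w_k x_{(k)}$ with geometric weights $w_k = (1-\rho)\rho^{k-1}$, is an assumption made purely for illustration). The key identity is that such a geometrically weighted sum of order statistics equals the expected maximum over a random subset in which each coordinate is kept independently with probability $1-\rho$ (an empty subset contributing 0), so the quantity can be estimated without any sorting:

```python
import numpy as np

rng = np.random.default_rng(0)

def sorted_weighted_sum(x, rho):
    """Exact value: sum_k (1 - rho) * rho**(k-1) * x_(k), where x_(k) is the k-th largest entry."""
    x_sorted = np.sort(x)[::-1]                     # descending: x_(1) >= x_(2) >= ...
    w = (1 - rho) * rho ** np.arange(len(x))
    return float(w @ x_sorted)

def stochastic_estimate(x, rho, n_samples=200_000):
    """Unbiased Monte Carlo estimate that never sorts: keep each coordinate with
    probability (1 - rho), then take the max over the kept ones (0 if none survive)."""
    keep = rng.random((n_samples, len(x))) < (1 - rho)
    masked = np.where(keep, x, -np.inf)
    vals = masked.max(axis=1)
    return float(np.where(np.isfinite(vals), vals, 0.0).mean())

x = rng.random(8)
print(sorted_weighted_sum(x, rho=0.7), stochastic_estimate(x, rho=0.7))  # nearly equal
```

With $\rho = 0$ all the weight falls on the largest element, so the neuron reduces to a plain max, matching the intuition that the $\ell_\infty$-distance net arises as a special case; this correspondence is meant only as intuition, since the precise relationship is established in Section 4.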
The contributions and organization of this paper can be summarized as follows:
• We develop a systematic study of the expressive power of Lipschitz neural networks using tools from Boolean function theory. We prove impossibility results for standard Lipschitz networks in two settings: a) certified $\ell_\infty$ robustness on discrete datasets (Section 3.2); b) Lipschitz function approximation (Section 3.3).
• We provide insights into how recently proposed networks bypass these impossibility results. In particular, we show that a two-layer $\ell_\infty$-distance net can precisely represent any Boolean function, while shallow GroupSort networks cannot (Section 3.4).
• We propose SortNet, a Lipschitz network that generalizes GroupSort and the $\ell_\infty$-distance net. For a special type of SortNet, we derive a stochastic training approach that bypasses the difficulty of calculating sorting operations explicitly and makes certified training free (Section 4).
• Extensive experiments demonstrate that SortNet exhibits better certified robustness than baseline methods on several benchmark datasets, with high training efficiency (Section 5).
2 Related Work
Extensive studies have been devoted to developing neural networks with certified robustness guarantees. Existing approaches can be mainly divided into the following three categories.
Certified defenses for standard networks. A variety of works focus on establishing certified robustness for standard neural networks. However, exactly calculating the certified radius of a standard ReLU network is known to be NP-hard [28]. Researchers thus developed a class of relaxation-based approaches that provide a tight lower bound estimate of the certified robustness efficiently. These approaches typically use convex relaxation to calculate a bound on the neuron outputs under input perturbations layer by layer [65, 66, 64, 53, 42, 19, 62, 76]. See also [3, 15, 16, 47, 48, 68, 11, 35, 63, 23, 51] for more advanced approaches. However, most of these works suffer from high computational costs and are hard to scale up to large datasets. Currently, the only scalable convex relaxation approach is based on interval bound propagation (IBP) [42, 21, 75, 69, 52], but the produced bound is known to be loose [50], and a recent study showed that IBP cannot achieve enough certified robustness on simple datasets for any standard ReLU network [41].
Certified defenses using Lipschitz networks. On the other hand, Lipschitz networks inherently imply certified robustness, resulting in a much simpler certification process based on the output margin (see Proposition 3.1). Yet, most prior works can only handle the $\ell_2$-norm Lipschitz situation by leveraging specific mathematical properties such as the spectral norm [8, 71, 20, 61, 17, 46, 2, 36, 40] or orthogonality of weight matrices [39, 60, 54, 55]. For the $\ell_\infty$-norm, standard Lipschitz networks were shown to give only a vanishingly small certified radius [61]. Huster et al. [24] found that standard Lipschitz ReLU networks cannot represent certain simple functions such as the absolute value, which inspired the first expressive Lipschitz architecture called the GroupSort network [2]. Since then, GroupSort has been extensively investigated [10, 58], but its performance is still much worse than the above relaxation-based approaches even on MNIST. Recently, Zhang et al. [73, 74] first proposed a practical 1-Lipschitz architecture w.r.t. the $\ell_\infty$-norm based on a special neuron called the $\ell_\infty$-distance neuron, which can scale to TinyImageNet with state-of-the-art certified robustness over relaxation-based approaches. However, despite its practical success, it is rather puzzling how such a simple architecture can work while prior approaches all failed. Answering this question may require an in-depth re-examination of Lipschitz networks (w.r.t. the $\ell_\infty$-norm), which is the focus of this paper.
Certified defenses via randomized smoothing. As a rather different and parallel research line, randomized smoothing typically provides probabilistic certified $\ell_2$ robustness guarantees. Due to its wide applicability, randomized smoothing has been scaled up to ImageNet and achieves state-of-the-art certified accuracy for $\ell_2$ perturbations [34, 38, 9, 49, 72, 27]. However, certifying robustness with high probability requires sampling a large number of noisy inputs (e.g., $10^5$) for a single image, leading to a high computational cost at inference. Moreover, theoretical results pointed out that it cannot achieve non-trivial certified $\ell_\infty$ robustness if the perturbation radius is larger than $\Omega(d^{-1/2})$, where $d$ is the input dimension [70, 5, 30, 67].
3 The Expressive Power of Lipschitz Neural Networks
3.1 Preliminaries
Notations. We use boldface letters to denote vectors (e.g., $\mathbf{x}$) or vector functions (e.g., $\mathbf{f}$), and use $x_i$ (or $f_i$) to denote the $i$-th element. For a unary function $\sigma$, $\sigma(\mathbf{x})$ applies $\sigma(\cdot)$ element-wise on the vector $\mathbf{x}$. The $\ell_p$-norm ($p \ge 1$) and $\ell_\infty$-norm of a vector $\mathbf{x}$ are defined as $\|\mathbf{x}\|_p = (\sum_i |x_i|^p)^{1/p}$ and $\|\mathbf{x}\|_\infty = \max_i |x_i|$, respectively. The matrix $\infty$-norm is defined as $\|W\|_\infty = \max_i \|W_{i,:}\|_1$, where $W_{i,:}$ is the $i$-th row of the matrix $W$. The $k$-th largest element of a vector $\mathbf{x}$ is denoted as $x_{(k)}$. We use $[n]$ to denote the set $\{1, \cdots, n\}$, and use $\mathbf{e}_i$ to denote the unit vector with the $i$-th element being one. We adopt the big-O notation, using $O(\cdot)$, $\Omega(\cdot)$, and $\Theta(\cdot)$ to hide universal constants.
Lipschitzness. A mapping $f: \mathbb{R}^n \to \mathbb{R}^m$ is said to be $L$-Lipschitz continuous w.r.t. a norm $\|\cdot\|$ if for any pair of inputs $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^n$,
$$\|f(\mathbf{x}_1) - f(\mathbf{x}_2)\| \le L\|\mathbf{x}_1 - \mathbf{x}_2\|. \qquad (1)$$
If the mapping $f$ represented by a neural network has a small Lipschitz constant $L$, then (1) implies that the change of the network output can be strictly controlled under input perturbations, resulting in certified robustness guarantees as shown in the following proposition.
Proposition 3.1. (Certified robustness of Lipschitz networks) For a neural network $f$ with Lipschitz constant $L$ under the $\ell_p$-norm $\|\cdot\|_p$, define the resulting classifier $g$ as $g(\mathbf{x}) := \arg\max_k f_k(\mathbf{x})$ for an input $\mathbf{x}$. Then $g$ is provably robust under perturbations $\|\boldsymbol{\delta}\|_p < \frac{c}{L}\,\mathrm{margin}(f(\mathbf{x}))$, i.e.
$$g(\mathbf{x}+\boldsymbol{\delta}) = g(\mathbf{x}) \quad \text{for all } \boldsymbol{\delta} \text{ with } \|\boldsymbol{\delta}\|_p < c/L \cdot \mathrm{margin}(f(\mathbf{x})). \qquad (2)$$
Here $c = \sqrt[p]{2}/2$ is a constant depending only on the norm $\|\cdot\|_p$, which is $1/2$ for the $\ell_\infty$-norm, and $\mathrm{margin}(f(\mathbf{x}))$ is the margin between the largest and second-largest output logits.
The proof of Proposition 3.1 is simple and can be found in Appendix B.1 or [39, Appendix P]. It can be seen that the robust radius is inversely proportional to the Lipschitz constant $L$.
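In code, the certification step of Proposition 3.1 is a one-line margin computation (a minimal sketch with illustrative names, using the $\ell_\infty$ constant $c = 1/2$):

```python
import numpy as np

def certified_linf_radius(logits, lipschitz_constant):
    """Certified l_inf radius from Proposition 3.1: (1/2) * margin(f(x)) / L.
    The predicted class provably cannot change under any perturbation whose
    l_inf norm is smaller than the returned value."""
    sorted_logits = np.sort(logits)
    margin = sorted_logits[-1] - sorted_logits[-2]
    return 0.5 * margin / lipschitz_constant

logits = np.array([3.0, 0.5, 2.0])   # output f(x) of an assumed 1-Lipschitz network
print(certified_linf_radius(logits, lipschitz_constant=1.0))  # 0.5 * (3.0 - 2.0) = 0.5
```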
Standard Lipschitz networks. Throughout this paper, we refer to standard neural networks as neural networks formed by affine layers (e.g., fully-connected or convolutional layers) and element-wise activation functions. Based on the Lipschitz property of composite functions, most prior works enforce the 1-Lipschitzness of a multi-layer neural network by constraining each layer to be a 1-Lipschitz mapping. For the $\ell_\infty$-norm, it is further equivalent to constraining the weight matrices to have bounded $\infty$-norm, plus using Lipschitz activation functions [2], which can be formalized as
$$\mathbf{x}^{(l)} = \sigma^{(l)}\big(W^{(l)}\mathbf{x}^{(l-1)} + \mathbf{b}^{(l)}\big) \quad \text{s.t.} \quad \|W^{(l)}\|_\infty \le 1 \text{ and } \sigma^{(l)} \text{ being 1-Lipschitz}, \quad l \in [M]. \qquad (3)$$
Here $M$ is the number of layers and usually $\sigma^{(M)}(x) = x$ is the identity function. The network takes $\mathbf{x}^{(0)} := \mathbf{x}$ as the input and outputs $\mathbf{x}^{(M)}$. For $K$-class classification problems, $\mathbf{x}^{(M)} \in \mathbb{R}^K$ and the network predicts the class $g(\mathbf{x}) := \arg\max_{k \in [K]} x^{(M)}_k$. We refer to the resulting network as a standard Lipschitz network.
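A layer satisfying the constraint in (3) can be obtained by rescaling each weight row so that its $\ell_1$-norm never exceeds 1. The sketch below is one simple way to do this (a hedged illustration, not the projection scheme of any particular cited method):

```python
import torch
import torch.nn as nn

class LinfLipschitzLinear(nn.Module):
    """Affine layer with ||W||_inf <= 1, i.e. every row has l_1-norm at most 1,
    so the layer is 1-Lipschitz w.r.t. the l_inf-norm."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        row_l1 = self.weight.abs().sum(dim=1, keepdim=True)
        w = self.weight / torch.clamp(row_l1, min=1.0)   # rescale only rows with l_1-norm > 1
        return x @ w.t() + self.bias

# Stacking such layers with any 1-Lipschitz activation (e.g., ReLU) gives a
# standard Lipschitz network in the sense of Eq. (3).
net = nn.Sequential(LinfLipschitzLinear(784, 128), nn.ReLU(), LinfLipschitzLinear(128, 10))
```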
3.2 Certified robustness on discrete Boolean datasets
In this section, we construct a class of counterexamples for which certified $\ell_\infty$ robustness can be arbitrarily poor using standard Lipschitz networks. We focus on Boolean datasets: discrete datasets where both inputs and labels are Boolean-valued and the relationship between inputs and their labels $(\mathbf{x}^{(i)}, y^{(i)}) \in \{0,1\}^d \times \{0,1\}$ can be described by a Boolean function $g_B: \{0,1\}^d \to \{0,1\}$. The key motivation lies in the observation that Boolean vectors correspond to the vertices of a $d$-dimensional hypercube and are thus geometrically related to the $\ell_\infty$-distance metric. In particular, the $\ell_\infty$-distance between any two different data points in a Boolean dataset is always 1, which means that the dataset is well-separated. This yields the following proposition, stating that it is always possible to achieve optimal certified $\ell_\infty$ robustness on Boolean datasets by using Lipschitz classifiers.
Proposition 3.2. For any Boolean dataset $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$, there exists a classifier $\hat{g}: \mathbb{R}^d \to \{0,1\}$ induced by a 1-Lipschitz mapping $\hat{f}: \mathbb{R}^d \to \mathbb{R}^2$, such that $\hat{g}$ can fit the whole dataset with $\mathrm{margin}(\hat{f}(\mathbf{x}^{(i)})) = 1$ for all $i \in [n]$, thus achieving a certified $\ell_\infty$ radius of $1/2$ by Proposition 3.1.
The key observation in the proof (Appendix B.2) is that one can construct a so-called nearest-neighbor classifier that achieves a large margin on the whole dataset and is 1-Lipschitz (a minimal sketch is given below). Based on Proposition 3.2, it is natural to ask whether standard Lipschitz networks of the form (3) can perform well on Boolean datasets. Unfortunately, we show that this is not the case, even on a class of simple datasets constructed using symmetric Boolean functions.
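The following sketch is a hypothetical, minimal version of such a nearest-neighbor classifier (the formal construction is in Appendix B.2): each logit is the negated $\ell_\infty$-distance to the closest training point of that class, which is 1-Lipschitz as a minimum of 1-Lipschitz functions and attains margin 1 on any Boolean dataset.

```python
import numpy as np

def nearest_neighbor_logits(x, data_x, data_y):
    """1-Lipschitz (w.r.t. l_inf) logits: f_c(x) = -min_i ||x - x^(i)||_inf over
    training points of class c. Distinct Boolean points are at l_inf distance 1,
    so every training point gets margin exactly 1."""
    logits = []
    for c in (0, 1):
        dists = np.max(np.abs(data_x[data_y == c] - x), axis=1)
        logits.append(-dists.min())
    return np.array(logits)

# Toy Boolean dataset: label = AND of two bits (d = 2).
data_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
data_y = np.array([0, 0, 0, 1])
for xi, yi in zip(data_x, data_y):
    f = nearest_neighbor_logits(xi, data_x, data_y)
    assert f.argmax() == yi and f.max() - f.min() == 1.0   # margin 1 => certified radius 1/2
```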
Definition 3.3. A Boolean function $g_B: \{0,1\}^d \to \{0,1\}$ is symmetric if it is invariant under input permutations, i.e. $g_B(x_1, \cdots, x_d) = g_B(x_{\pi(1)}, \cdots, x_{\pi(d)})$ for any $\mathbf{x} \in \{0,1\}^d$ and $\pi \in S_d$.
Example 3.4. Two of the most basic operations in Boolean algebra are the logical AND/OR, both of which belong to the class of symmetric Boolean functions. Other important examples include the exclusive-or (XOR, also called the parity function), NAND, NOR, and the majority function (or, more generally, the threshold functions). See Appendix A.2 for a detailed description of these examples.
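Since a symmetric Boolean function depends on its input only through the number of ones, each of these examples can be written as a function of $\sum_i x_i$. A small illustrative snippet (the formal definitions are in Appendix A.2):

```python
d = 5
AND      = lambda x: int(sum(x) == d)        # logical AND:   threshold with k = d
OR       = lambda x: int(sum(x) >= 1)        # logical OR:    threshold with k = 1
MAJORITY = lambda x: int(2 * sum(x) > d)     # majority vote: threshold with k = ceil((d+1)/2)
XOR      = lambda x: sum(x) % 2              # parity: symmetric but not monotone, hence not a threshold function
THRESH_K = lambda x, k: int(sum(x) >= k)     # general k-threshold function (cf. Section 3.3)

x = (1, 0, 1, 1, 0)                          # sum(x) = 3
print(AND(x), OR(x), MAJORITY(x), XOR(x), THRESH_K(x, 2))   # 0 1 1 1 1
```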
Theorem 3.5. For any non-constant symmetric Boolean function $g_B: \{0,1\}^d \to \{0,1\}$, there exists a Boolean dataset with labels $y^{(i)} = g_B(\mathbf{x}^{(i)})$, such that no standard Lipschitz network can achieve a certified $\ell_\infty$ robust radius larger than $\frac{1}{2d}$ on the dataset.
Implications. Theorem 3.5 shows that the certified radius of standard Lipschitz networks must vanish as the dimension $d$ grows, which is in stark contrast to the constant radius given by Proposition 3.2. Also, note that the logical AND/OR functions are perhaps the most basic classifiers (which simply make predictions based on the existence of binary input features). It is thus not surprising to see that standard Lipschitz networks perform poorly on real-world datasets (e.g., even the simple MNIST dataset, where input pixels are almost Boolean-valued, i.e. black/white [61]).
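For a sense of scale: with MNIST-sized inputs ($d = 784$), Theorem 3.5 caps the certified $\ell_\infty$ radius on the corresponding Boolean dataset at $\frac{1}{2d} \approx 0.00064$, far below the radius $\epsilon = 0.3$ at which certified robustness on MNIST is typically reported.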
We give an elegant proof of Theorem 3.5 in Appendix B.3, where we also prove that the bound of $\frac{1}{2d}$ is tight (Proposition B.8). Moreover, we discuss the sample complexity by proving that a dataset of size $O(d)$ already suffices to force an $O(1/d)$ certified radius (Corollary B.10). To the best of our knowledge, Theorem 3.5 is the first impossibility result that targets the certified $\ell_\infty$ robustness of standard Lipschitz networks with a quantitative upper bound on the certified radius. In the next section, we extend our analysis to the function approximation setting and discuss the relation to prior results [24, 2].
3.3 Lipschitz function approximation
Classic approximation theory has shown that standard neural networks are universal function approximators [12, 37], in that they can approximate any continuous function on a compact domain arbitrarily well. For 1-Lipschitz neural networks, an analogous question is whether they can approximate all 1-Lipschitz functions accordingly. Unfortunately, the result in Section 3.2 already implies a negative answer. Indeed, by combining Proposition 3.2 and Theorem 3.5, $\hat{f}$ is clearly a 1-Lipschitz function that cannot be approximated by any standard Lipschitz network.
To gain further insight into the structure of unrepresentable 1-Lipschitz functions, let us consider continuous extensions of discrete Boolean functions. For the symmetric case, one needs to find a class of 1-Lipschitz continuous functions that are also invariant under permutations. A simple class of candidates is the order statistics, i.e. the $k$-th largest element of a vector. One can check that the $k$-th order statistic $x_{(k)}$ is indeed 1-Lipschitz and is precisely the continuous extension of the $k$-threshold Boolean function defined as $g_{B,k}(\mathbf{x}) := \mathbb{I}(\sum_i x_i \ge k)$. In particular, $x_{(1)} = \max_i x_i$ and $x_{(d)} = \min_i x_i$ correspond to the logical OR and AND functions, respectively. Importantly, note that any Boolean function that is both symmetric and monotonic is a $k$-threshold function, and vice versa (Appendix A.2). Therefore, $k$-threshold functions can be regarded as the most elementary Boolean functions, suggesting that the ability to express order statistics is necessary for neural networks.
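This correspondence is easy to verify directly: restricted to Boolean inputs, the $k$-th order statistic equals 1 exactly when at least $k$ coordinates equal 1. A small sanity check (illustrative only, not part of the formal argument):

```python
import itertools

def order_statistic(x, k):
    """k-th largest element of x (1-indexed), i.e. x_(k)."""
    return sorted(x, reverse=True)[k - 1]

def threshold(x, k):
    """k-threshold Boolean function g_{B,k}(x) = 1[sum_i x_i >= k]."""
    return int(sum(x) >= k)

d = 4
for x in itertools.product([0, 1], repeat=d):
    for k in range(1, d + 1):
        assert order_statistic(x, k) == threshold(x, k)
# k = 1 recovers OR (the max), k = d recovers AND (the min).
```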
Unfortunately, using an analysis similar to that of the previous section, we obtain the following theorem:
Theorem 3.6. Any standard Lipschitz network $f: \mathbb{R}^d \to \mathbb{R}$ cannot approximate the simple 1-Lipschitz function $\mathbf{x} \mapsto x_{(k)}$ for any $k \in [d]$ on the bounded domain $\mathcal{K} = [0,1]^d$ when $d \ge 2$. Moreover, there exists a point $\hat{\mathbf{x}} \in \mathcal{K}$, such that
$$|f(\hat{\mathbf{x}}) - \hat{x}_{(k)}| \ge \frac{1}{2} - \frac{1}{2d}. \qquad (4)$$
We give a proof in Appendix B.4. The above theorem indicates that order statistics cannot be uniformly approximated by any standard Lipschitz network, regardless of the network size. Moreover, note that the trivial constant function $\tilde{f}(\mathbf{x}) = 1/2$ already achieves an approximation error of $1/2$ uniformly on $\mathcal{K}$, implying that standard Lipschitz networks can hardly improve upon trivial solutions.
Remark 3.7. Theorem 3.6 can be easily generalized to weaker forms of non-uniform approximation, e.g., using the $\ell_p$-norm as the distance metric [45], by proving that there exists a hypercube $\mathcal{B}(\hat{\mathbf{x}})$ centered at $\hat{\mathbf{x}}$ with side length $\Theta(1)$, such that $|f(\tilde{\mathbf{x}}) - \tilde{x}_{(k)}| \ge \Theta(1)$ holds for all $\tilde{\mathbf{x}} \in \mathcal{B}(\hat{\mathbf{x}})$ when $d \ge 2$. See Corollary B.13 for details.
Discussion with prior works [2, 24, 43]. The work of Anil et al. also gave negative results on the expressive power of standard Lipschitz networks1 [2, Theorem 1]. They proved a different (weaker) version of Theorem 3.6, showing that if the activation function $\sigma$ is monotonic, the network cannot precisely represent non-linear 1-Lipschitz functions whose gradient norm is 1 almost everywhere (e.g., the absolute value function, shown earlier by [24]). They did not give a quantitative approximation error. The intuition is that any monotonic non-linear 1-Lipschitz activation (e.g., ReLU) must have regions with slope less than 1, leading to gradient attenuation during backpropagation. The authors thus attributed the problem to the activation function, which is not gradient-norm preserving (GNP). However, such an explanation is still not fully satisfactory, as GNP can be achieved simply by using a non-monotonic activation (e.g., $\sigma(x) = |x|$). Consequently, one may expect that a standard Lipschitz network built on a suitable (non-monotonic) activation function could have sufficient expressive power. This idea was recently explored in [43], where the authors proved that using a general 1-Lipschitz piecewise-linear activation with 3 linear regions, the corresponding network achieves the maximum expressive power compared with other Lipschitz activations and can approximate any one-dimensional 1-Lipschitz function. They pose the high-dimensional setting as an open problem.
1 The result is described w.r.t. the $\ell_2$-norm, but with some effort, it can be extended to the $\ell_\infty$-norm case.