Rethinking Lipschitz Neural Networks and Certified
Robustness: A Boolean Function Perspective
Bohang Zhang1, Du Jiang1, Di He1, Liwei Wang1,2†
1Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University
2Center for Data Science, Peking University
†Corresponding author
{zhangbohang,dujiang,dihe}@pku.edu.cn,wanglw@cis.pku.edu.cn
Abstract
Designing neural networks with bounded Lipschitz constant is a promising way to obtain certifiably robust classifiers against adversarial examples. However, the relevant progress for the important $\ell_\infty$ perturbation setting is rather limited, and a principled understanding of how to design expressive $\ell_\infty$ Lipschitz networks is still lacking. In this paper, we bridge the gap by studying certified $\ell_\infty$ robustness from a novel perspective of representing Boolean functions. We derive two fundamental impossibility results that hold for any standard Lipschitz network: one for robust classification on finite datasets, and the other for Lipschitz function approximation. These results identify that networks built upon norm-bounded affine layers and Lipschitz activations intrinsically lose expressive power even in the two-dimensional case, and shed light on how recently proposed Lipschitz networks (e.g., GroupSort and $\ell_\infty$-distance nets) bypass these impossibilities by leveraging order statistic functions. Finally, based on these insights, we develop a unified Lipschitz network that generalizes prior works, and design a practical version that can be efficiently trained (making certified robust training free). Extensive experiments show that our approach is scalable, efficient, and consistently yields better certified robustness across multiple datasets and perturbation radii than prior Lipschitz networks. Our code is available at https://github.com/zbh2047/SortNet.
1 Introduction
Modern neural networks, despite their great success in various applications [22, 13], typically suffer from a severe drawback of lacking adversarial robustness. In classification tasks, given an image $\mathbf{x}$ correctly classified by a neural network, there often exists a small adversarial perturbation $\boldsymbol{\delta}$, making the perturbed image $\mathbf{x}+\boldsymbol{\delta}$ look indistinguishable to humans but fool the network to predict a wrong class with high confidence [57, 4].
It is well-known that the adversarial robustness of a neural network is closely related to its Lipschitz continuity [8, 61] (see Section 3.1). Accordingly, training neural networks with bounded Lipschitz constant has been considered a promising way to address the problem. A variety of works studied Lipschitz architectures for the ordinary Euclidean norm [61, 60, 36, 54], and recent works even established state-of-the-art (deterministic) certified $\ell_2$ robustness [55, 40]. However, when it comes to the more critical (realistic) $\ell_\infty$ perturbation setting, the progress seems to be rather limited. In fact, standard Lipschitz ReLU networks have been shown to perform poorly in terms of $\ell_\infty$ robustness [61, 24, 2]. While other more advanced Lipschitz networks have been proposed [2, 10], the achieved results are still not satisfactory even on simple datasets like MNIST. Until recently, Zhang et al. [73, 74] designed a quite unusual Lipschitz network based on a heuristic choice of the $\ell_\infty$-distance function, which surprisingly established state-of-the-art certified $\ell_\infty$ robustness on multiple datasets over prior works. Yet, it remains unclear why previous Lipschitz networks typically failed, and what is the essential reason behind the success of this particular $\ell_\infty$-distance network structure.
Theoretical contributions. In this work, we systematically investigate how to design expressive Lipschitz neural networks (w.r.t. the $\ell_\infty$-norm) through the novel lens of representing discrete Boolean functions, which provides a deep understanding of the aforementioned problems. Specifically, we first identify a fundamental limitation of standard Lipschitz networks in representing a class of logical operations called symmetric Boolean functions (SBFs), which include the basic logical AND/OR as special cases. We prove that for any non-constant SBF of $d$ variables, there exists a finite dataset of size $O(d)$ such that the certified $\ell_\infty$ robust radius must vanish as $O(1/d)$ for any classifier induced by a standard Lipschitz network. Remarkably, since logical AND/OR operations correspond to perhaps the most basic classifiers, our result indicates an intrinsic difficulty of such networks in fitting high-dimensional real-world datasets with guaranteed certified $\ell_\infty$ robustness.
Our analysis can be readily extended to the Lipschitz function approximation setting. We point out the relationship between monotonic SBFs and the order statistics (which are 1-Lipschitz functions), and then prove that any $d$-dimensional order statistic (including the max/min function) on a compact domain cannot be approximated by standard Lipschitz networks within error $\Omega(1-1/d)$, regardless of the network size. This impossibility result is significant in that: (i) it applies to all Lipschitz activations (thus extending prior works [2, 24]); (ii) it resolves an open problem raised recently in [43]; and (iii) a quantitative lower bound on the approximation error is established.
Equipped with the above impossibility results, we proceed to examine two advanced Lipschitz architectures: the GroupSort network [2] and the recently proposed $\ell_\infty$-distance net [73, 74]. We find that besides the linear operation, both networks incorporate other Lipschitz aggregation operations into the neuron design, especially the order statistic functions, thus shedding light on how they work. However, for the MaxMin network [2], a computationally efficient version of the GroupSort network used in practice, representing Boolean functions and order statistics is possible only when the network is very deep. In particular, we prove that representing certain $d$-dimensional Boolean functions requires a depth of $\Omega(d)$, implying that shallow MaxMin networks are not Lipschitz-universal function approximators. In contrast, we show that a two-layer $\ell_\infty$-distance net suffices to represent any order statistic function on a compact domain, or even all Boolean functions. This strongly justifies the empirical success of the $\ell_\infty$-distance net over GroupSort (MaxMin) networks.
Practical contributions. Our theoretical insights can also guide the design of better Lipschitz network architectures. Inspired by the importance of order statistics, we propose a general form of Lipschitz network, called SortNet, that extends both GroupSort and $\ell_\infty$-distance networks and incorporates them into a unified framework. Yet, the full-sort operation is computationally expensive and leads to optimization difficulties (as with the GroupSort network). We further propose a specialized SortNet that can be efficiently trained, by assigning each weight vector $\mathbf{w}$ using a geometric series, i.e. $w_i$ proportional to $\rho^i$ for some $0 \le \rho < 1$. This gives a restricted version of SortNet that still covers the $\ell_\infty$-distance net as a special case. For this particular SortNet, we skillfully derive a stochastic estimation that gives an unbiased approximation of the neuron output without performing sorting operations explicitly. This eventually yields an efficient training strategy with a cost similar to training standard networks, thus making certified robust training free. Extensive experiments demonstrate that the proposed SortNet is scalable, efficient, and consistently achieves better certified robustness than prior Lipschitz networks across multiple datasets and perturbation radii. In particular, our approach even scales to a variant of ImageNet, and surpasses the best-known result [69] with a 22-fold decrease in training time thanks to our “free” certified training approach.
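To convey the flavor of this stochastic estimation, here is a hedged sketch (the exact SortNet parameterization and estimator are specified in Section 4; the neuron form below, $\sum_k w_k x_{(k)}$ with geometric weights $w_k = (1-\rho)\rho^{k-1}$, is an assumption made purely for illustration). The key identity is that such a geometrically weighted sum of order statistics equals the expected maximum over a random subset in which each coordinate is kept independently with probability $1-\rho$ (an empty subset contributing 0), so the quantity can be estimated without any sorting:

```python
import numpy as np

rng = np.random.default_rng(0)

def sorted_weighted_sum(x, rho):
    """Exact value: sum_k (1 - rho) * rho**(k-1) * x_(k), where x_(k) is the k-th largest entry."""
    x_sorted = np.sort(x)[::-1]                     # descending: x_(1) >= x_(2) >= ...
    w = (1 - rho) * rho ** np.arange(len(x))
    return float(w @ x_sorted)

def stochastic_estimate(x, rho, n_samples=200_000):
    """Unbiased Monte Carlo estimate that never sorts: keep each coordinate with
    probability (1 - rho), then take the max over the kept ones (0 if none survive)."""
    keep = rng.random((n_samples, len(x))) < (1 - rho)
    masked = np.where(keep, x, -np.inf)
    vals = masked.max(axis=1)
    return float(np.where(np.isfinite(vals), vals, 0.0).mean())

x = rng.random(8)
print(sorted_weighted_sum(x, rho=0.7), stochastic_estimate(x, rho=0.7))  # nearly equal
```

With $\rho = 0$ all the weight falls on the largest element, so the neuron reduces to a plain max, matching the intuition that the $\ell_\infty$-distance net arises as a special case; this correspondence is meant only as intuition, since the precise relationship is established in Section 4.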
The contributions and organization of this paper can be summarized as follows:
• We develop a systematic study of the expressive power of Lipschitz neural networks using tools from Boolean function theory. We prove impossibility results for standard Lipschitz networks in two settings: a) certified $\ell_\infty$ robustness on discrete datasets (Section 3.2); b) Lipschitz function approximation (Section 3.3).
• We provide insights into how recently proposed networks bypass these impossibility results. In particular, we show that a two-layer $\ell_\infty$-distance net can precisely represent any Boolean function, while shallow GroupSort networks cannot (Section 3.4).
• We propose SortNet, a Lipschitz network that generalizes GroupSort and the $\ell_\infty$-distance net. For a special type of SortNet, we derive a stochastic training approach that bypasses the difficulty of calculating sorting operations explicitly and makes certified training free (Section 4).
• Extensive experiments demonstrate that SortNet exhibits better certified robustness than baseline methods on several benchmark datasets, with high training efficiency (Section 5).
2 Related Work
Extensive studies have been devoted to developing neural networks with certified robustness guarantees. Existing approaches can be mainly divided into the following three categories.
Certified defenses for standard networks. A variety of works focus on establishing certified robustness for standard neural networks. However, exactly calculating the certified radius of a standard ReLU network is known to be NP-hard [28]. Researchers thus developed a class of relaxation-based approaches that provide a tight lower bound estimate of the certified robustness efficiently. These approaches typically use convex relaxation to calculate a bound on the neuron outputs under input perturbations layer by layer [65, 66, 64, 53, 42, 19, 62, 76]. See also [3, 15, 16, 47, 48, 68, 11, 35, 63, 23, 51] for more advanced approaches. However, most of these works suffer from high computational costs and are hard to scale up to large datasets. Currently, the only scalable convex relaxation approach is based on interval bound propagation (IBP) [42, 21, 75, 69, 52], but the produced bound is known to be loose [50], and a recent study showed that IBP cannot achieve enough certified robustness on simple datasets for any standard ReLU network [41].
Certified defenses using Lipschitz networks. On the other hand, Lipschitz networks inherently imply certified robustness, resulting in a much simpler certification process based on the output margin (see Proposition 3.1). Yet, most prior works can only handle the $\ell_2$-norm Lipschitz situation by leveraging specific mathematical properties such as the spectral norm [8, 71, 20, 61, 17, 46, 2, 36, 40] or orthogonality of weight matrices [39, 60, 54, 55]. For the $\ell_\infty$-norm, standard Lipschitz networks were shown to give only a vanishingly small certified radius [61]. Huster et al. [24] found that standard Lipschitz ReLU networks cannot represent certain simple functions such as the absolute value, which inspired the first expressive Lipschitz architecture called the GroupSort network [2]. Since then, GroupSort has been extensively investigated [10, 58], but its performance is still much worse than the above relaxation-based approaches even on MNIST. Recently, Zhang et al. [73, 74] first proposed a practical 1-Lipschitz architecture w.r.t. the $\ell_\infty$-norm based on a special neuron called the $\ell_\infty$-distance neuron, which can scale to TinyImageNet with state-of-the-art certified robustness over relaxation-based approaches. However, despite its practical success, it is rather puzzling how such a simple architecture can work while prior approaches all failed. Answering this question may require an in-depth re-examination of Lipschitz networks (w.r.t. the $\ell_\infty$-norm), which is the focus of this paper.
Certified defenses via randomized smoothing. As a rather different and parallel research line, randomized smoothing typically provides probabilistic certified $\ell_2$ robustness guarantees. Due to its wide applicability, randomized smoothing has been scaled up to ImageNet and achieves state-of-the-art certified accuracy for $\ell_2$ perturbations [34, 38, 9, 49, 72, 27]. However, certifying robustness with high probability requires sampling a large number of noisy inputs (e.g., $10^5$) for a single image, leading to a high computational cost at inference. Moreover, theoretical results pointed out that it cannot achieve non-trivial certified $\ell_\infty$ robustness if the perturbation radius is larger than $\Omega(d^{-1/2})$, where $d$ is the input dimension [70, 5, 30, 67].
3 The Expressive Power of Lipschitz Neural Networks
3.1 Preliminaries
Notations. We use boldface letters to denote vectors (e.g., $\mathbf{x}$) or vector functions (e.g., $\mathbf{f}$), and use $x_i$ (or $f_i$) to denote the $i$-th element. For a unary function $\sigma$, $\sigma(\mathbf{x})$ applies $\sigma(\cdot)$ element-wise on the vector $\mathbf{x}$. The $\ell_p$-norm ($p \ge 1$) and $\ell_\infty$-norm of a vector $\mathbf{x}$ are defined as $\|\mathbf{x}\|_p = (\sum_i |x_i|^p)^{1/p}$ and $\|\mathbf{x}\|_\infty = \max_i |x_i|$, respectively. The matrix $\infty$-norm is defined as $\|W\|_\infty = \max_i \|W_{i,:}\|_1$, where $W_{i,:}$ is the $i$-th row of the matrix $W$. The $k$-th largest element of a vector $\mathbf{x}$ is denoted as $x_{(k)}$. We use $[n]$ to denote the set $\{1, \cdots, n\}$, and use $\mathbf{e}_i$ to denote the unit vector with the $i$-th element being one. We adopt the big-O notation, using $O(\cdot)$, $\Omega(\cdot)$, and $\Theta(\cdot)$ to hide universal constants.
Lipschitzness. A mapping $f: \mathbb{R}^n \to \mathbb{R}^m$ is said to be $L$-Lipschitz continuous w.r.t. a norm $\|\cdot\|$ if for any pair of inputs $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^n$,
$$\|f(\mathbf{x}_1) - f(\mathbf{x}_2)\| \le L\|\mathbf{x}_1 - \mathbf{x}_2\|. \qquad (1)$$
If the mapping $f$ represented by a neural network has a small Lipschitz constant $L$, then (1) implies that the change of the network output can be strictly controlled under input perturbations, resulting in certified robustness guarantees as shown in the following proposition.
Proposition 3.1. (Certified robustness of Lipschitz networks) For a neural network $f$ with Lipschitz constant $L$ under the $\ell_p$-norm $\|\cdot\|_p$, define the resulting classifier $g$ as $g(\mathbf{x}) := \arg\max_k f_k(\mathbf{x})$ for an input $\mathbf{x}$. Then $g$ is provably robust under perturbations $\|\boldsymbol{\delta}\|_p < \frac{c}{L}\,\mathrm{margin}(f(\mathbf{x}))$, i.e.
$$g(\mathbf{x}+\boldsymbol{\delta}) = g(\mathbf{x}) \quad \text{for all } \boldsymbol{\delta} \text{ with } \|\boldsymbol{\delta}\|_p < c/L \cdot \mathrm{margin}(f(\mathbf{x})). \qquad (2)$$
Here $c = \sqrt[p]{2}/2$ is a constant depending only on the norm $\|\cdot\|_p$, which is $1/2$ for the $\ell_\infty$-norm, and $\mathrm{margin}(f(\mathbf{x}))$ is the margin between the largest and second-largest output logits.
The proof of Proposition 3.1 is simple and can be found in Appendix B.1 or [39, Appendix P]. It can be seen that the robust radius is inversely proportional to the Lipschitz constant $L$.
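In code, the certification step of Proposition 3.1 is a one-line margin computation (a minimal sketch with illustrative names, using the $\ell_\infty$ constant $c = 1/2$):

```python
import numpy as np

def certified_linf_radius(logits, lipschitz_constant):
    """Certified l_inf radius from Proposition 3.1: (1/2) * margin(f(x)) / L.
    The predicted class provably cannot change under any perturbation whose
    l_inf norm is smaller than the returned value."""
    sorted_logits = np.sort(logits)
    margin = sorted_logits[-1] - sorted_logits[-2]
    return 0.5 * margin / lipschitz_constant

logits = np.array([3.0, 0.5, 2.0])   # output f(x) of an assumed 1-Lipschitz network
print(certified_linf_radius(logits, lipschitz_constant=1.0))  # 0.5 * (3.0 - 2.0) = 0.5
```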
Standard Lipschitz networks. Throughout this paper, we refer to standard neural networks as neural networks formed by affine layers (e.g., fully-connected or convolutional layers) and element-wise activation functions. Based on the Lipschitz property of composite functions, most prior works enforce the 1-Lipschitzness of a multi-layer neural network by constraining each layer to be a 1-Lipschitz mapping. For the $\ell_\infty$-norm, it is further equivalent to constraining the weight matrices to have bounded $\infty$-norm, plus using Lipschitz activation functions [2], which can be formalized as
$$\mathbf{x}^{(l)} = \sigma^{(l)}\big(W^{(l)}\mathbf{x}^{(l-1)} + \mathbf{b}^{(l)}\big) \quad \text{s.t.} \quad \|W^{(l)}\|_\infty \le 1 \text{ and } \sigma^{(l)} \text{ being 1-Lipschitz}, \quad l \in [M]. \qquad (3)$$
Here $M$ is the number of layers and usually $\sigma^{(M)}(x) = x$ is the identity function. The network takes $\mathbf{x}^{(0)} := \mathbf{x}$ as the input and outputs $\mathbf{x}^{(M)}$. For $K$-class classification problems, $\mathbf{x}^{(M)} \in \mathbb{R}^K$ and the network predicts the class $g(\mathbf{x}) := \arg\max_{k \in [K]} x^{(M)}_k$. We refer to the resulting network as a standard Lipschitz network.
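A layer satisfying the constraint in (3) can be obtained by rescaling each weight row so that its $\ell_1$-norm never exceeds 1. The sketch below is one simple way to do this (a hedged illustration, not the projection scheme of any particular cited method):

```python
import torch
import torch.nn as nn

class LinfLipschitzLinear(nn.Module):
    """Affine layer with ||W||_inf <= 1, i.e. every row has l_1-norm at most 1,
    so the layer is 1-Lipschitz w.r.t. the l_inf-norm."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        row_l1 = self.weight.abs().sum(dim=1, keepdim=True)
        w = self.weight / torch.clamp(row_l1, min=1.0)   # rescale only rows with l_1-norm > 1
        return x @ w.t() + self.bias

# Stacking such layers with any 1-Lipschitz activation (e.g., ReLU) gives a
# standard Lipschitz network in the sense of Eq. (3).
net = nn.Sequential(LinfLipschitzLinear(784, 128), nn.ReLU(), LinfLipschitzLinear(128, 10))
```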
3.2 Certified robustness on discrete Boolean datasets
In this section, we construct a class of counterexamples for which certified $\ell_\infty$ robustness can be arbitrarily poor using standard Lipschitz networks. We focus on Boolean datasets: discrete datasets where both inputs and labels are Boolean-valued and the relationship between inputs and their labels $(\mathbf{x}^{(i)}, y^{(i)}) \in \{0,1\}^d \times \{0,1\}$ can be described by a Boolean function $g_B: \{0,1\}^d \to \{0,1\}$. The key motivation lies in the observation that Boolean vectors correspond to the vertices of a $d$-dimensional hypercube and are thus geometrically related to the $\ell_\infty$-distance metric. In particular, the $\ell_\infty$-distance between any two different data points in a Boolean dataset is always 1, which means that the dataset is well-separated. This yields the following proposition, stating that it is always possible to achieve optimal certified $\ell_\infty$ robustness on Boolean datasets by using Lipschitz classifiers.
Proposition 3.2. For any Boolean dataset $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$, there exists a classifier $\hat{g}: \mathbb{R}^d \to \{0,1\}$ induced by a 1-Lipschitz mapping $\hat{f}: \mathbb{R}^d \to \mathbb{R}^2$, such that $\hat{g}$ can fit the whole dataset with $\mathrm{margin}(\hat{f}(\mathbf{x}^{(i)})) = 1$ for all $i \in [n]$, thus achieving a certified $\ell_\infty$ radius of $1/2$ by Proposition 3.1.
The key observation in the proof (Appendix B.2) is that one can construct a so-called nearest-neighbor classifier that achieves a large margin on the whole dataset and is 1-Lipschitz (a minimal sketch is given below). Based on Proposition 3.2, it is natural to ask whether standard Lipschitz networks of the form (3) can perform well on Boolean datasets. Unfortunately, we show that this is not the case, even on a class of simple datasets constructed using symmetric Boolean functions.
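The following sketch is a hypothetical, minimal version of such a nearest-neighbor classifier (the formal construction is in Appendix B.2): each logit is the negated $\ell_\infty$-distance to the closest training point of that class, which is 1-Lipschitz as a minimum of 1-Lipschitz functions and attains margin 1 on any Boolean dataset.

```python
import numpy as np

def nearest_neighbor_logits(x, data_x, data_y):
    """1-Lipschitz (w.r.t. l_inf) logits: f_c(x) = -min_i ||x - x^(i)||_inf over
    training points of class c. Distinct Boolean points are at l_inf distance 1,
    so every training point gets margin exactly 1."""
    logits = []
    for c in (0, 1):
        dists = np.max(np.abs(data_x[data_y == c] - x), axis=1)
        logits.append(-dists.min())
    return np.array(logits)

# Toy Boolean dataset: label = AND of two bits (d = 2).
data_x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
data_y = np.array([0, 0, 0, 1])
for xi, yi in zip(data_x, data_y):
    f = nearest_neighbor_logits(xi, data_x, data_y)
    assert f.argmax() == yi and f.max() - f.min() == 1.0   # margin 1 => certified radius 1/2
```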
Definition 3.3. A Boolean function $g_B: \{0,1\}^d \to \{0,1\}$ is symmetric if it is invariant under input permutations, i.e. $g_B(x_1, \cdots, x_d) = g_B(x_{\pi(1)}, \cdots, x_{\pi(d)})$ for any $\mathbf{x} \in \{0,1\}^d$ and $\pi \in S_d$.
Example 3.4. Two of the most basic operations in Boolean algebra are the logical AND/OR, both of which belong to the class of symmetric Boolean functions. Other important examples include the exclusive-or (XOR, also called the parity function), NAND, NOR, and the majority function (or, more generally, the threshold functions). See Appendix A.2 for a detailed description of these examples.
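Since a symmetric Boolean function depends on its input only through the number of ones, each of these examples can be written as a function of $\sum_i x_i$. A small illustrative snippet (the formal definitions are in Appendix A.2):

```python
d = 5
AND      = lambda x: int(sum(x) == d)        # logical AND:   threshold with k = d
OR       = lambda x: int(sum(x) >= 1)        # logical OR:    threshold with k = 1
MAJORITY = lambda x: int(2 * sum(x) > d)     # majority vote: threshold with k = ceil((d+1)/2)
XOR      = lambda x: sum(x) % 2              # parity: symmetric but not monotone, hence not a threshold function
THRESH_K = lambda x, k: int(sum(x) >= k)     # general k-threshold function (cf. Section 3.3)

x = (1, 0, 1, 1, 0)                          # sum(x) = 3
print(AND(x), OR(x), MAJORITY(x), XOR(x), THRESH_K(x, 2))   # 0 1 1 1 1
```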
Theorem 3.5. For any non-constant symmetric Boolean function $g_B: \{0,1\}^d \to \{0,1\}$, there exists a Boolean dataset with labels $y^{(i)} = g_B(\mathbf{x}^{(i)})$, such that no standard Lipschitz network can achieve a certified $\ell_\infty$ robust radius larger than $\frac{1}{2d}$ on the dataset.
Implications. Theorem 3.5 shows that the certified radius of standard Lipschitz networks must vanish as the dimension $d$ grows, which is in stark contrast to the constant radius given by Proposition 3.2. Also, note that the logical AND/OR functions are perhaps the most basic classifiers (which simply make predictions based on the existence of binary input features). It is thus not surprising to see that standard Lipschitz networks perform poorly on real-world datasets (e.g., even the simple MNIST dataset, where input pixels are almost Boolean-valued, i.e. black/white [61]).
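For a sense of scale: with MNIST-sized inputs ($d = 784$), Theorem 3.5 caps the certified $\ell_\infty$ radius on the corresponding Boolean dataset at $\frac{1}{2d} \approx 0.00064$, far below the radius $\epsilon = 0.3$ at which certified robustness on MNIST is typically reported.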
We give an elegant proof of Theorem 3.5 in Appendix B.3, where we also prove that the bound of $\frac{1}{2d}$ is tight (Proposition B.8). Moreover, we discuss the sample complexity by proving that a dataset of size $O(d)$ already suffices to force an $O(1/d)$ certified radius (Corollary B.10). To the best of our knowledge, Theorem 3.5 is the first impossibility result that targets the certified $\ell_\infty$ robustness of standard Lipschitz networks with a quantitative upper bound on the certified radius. In the next section, we extend our analysis to the function approximation setting and discuss the relation to prior results [24, 2].
3.3 Lipschitz function approximation
Classic approximation theory has shown that standard neural networks are universal function approximators [12, 37], in that they can approximate any continuous function on a compact domain arbitrarily well. For 1-Lipschitz neural networks, an analogous question is whether they can approximate all 1-Lipschitz functions accordingly. Unfortunately, the result in Section 3.2 already implies a negative answer. Indeed, by combining Proposition 3.2 and Theorem 3.5, $\hat{f}$ is clearly a 1-Lipschitz function that cannot be approximated by any standard Lipschitz network.
To gain further insight into the structure of unrepresentable 1-Lipschitz functions, let us consider continuous extensions of discrete Boolean functions. For the symmetric case, one needs to find a class of 1-Lipschitz continuous functions that are also invariant under permutations. A simple class of candidates is the order statistics, i.e. the $k$-th largest element of a vector. One can check that the $k$-th order statistic $x_{(k)}$ is indeed 1-Lipschitz and is precisely the continuous extension of the $k$-threshold Boolean function defined as $g_{B,k}(\mathbf{x}) := \mathbb{I}(\sum_i x_i \ge k)$. In particular, $x_{(1)} = \max_i x_i$ and $x_{(d)} = \min_i x_i$ correspond to the logical OR and AND functions, respectively. Importantly, note that any Boolean function that is both symmetric and monotonic is a $k$-threshold function, and vice versa (Appendix A.2). Therefore, $k$-threshold functions can be regarded as the most elementary Boolean functions, suggesting that the ability to express order statistics is necessary for neural networks.
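This correspondence is easy to verify directly: restricted to Boolean inputs, the $k$-th order statistic equals 1 exactly when at least $k$ coordinates equal 1. A small sanity check (illustrative only, not part of the formal argument):

```python
import itertools

def order_statistic(x, k):
    """k-th largest element of x (1-indexed), i.e. x_(k)."""
    return sorted(x, reverse=True)[k - 1]

def threshold(x, k):
    """k-threshold Boolean function g_{B,k}(x) = 1[sum_i x_i >= k]."""
    return int(sum(x) >= k)

d = 4
for x in itertools.product([0, 1], repeat=d):
    for k in range(1, d + 1):
        assert order_statistic(x, k) == threshold(x, k)
# k = 1 recovers OR (the max), k = d recovers AND (the min).
```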
Unfortunately, using an analysis similar to that of the previous section, we obtain the following theorem:
Theorem 3.6. Any standard Lipschitz network $f: \mathbb{R}^d \to \mathbb{R}$ cannot approximate the simple 1-Lipschitz function $\mathbf{x} \mapsto x_{(k)}$ for any $k \in [d]$ on the bounded domain $\mathcal{K} = [0,1]^d$ when $d \ge 2$. Moreover, there exists a point $\hat{\mathbf{x}} \in \mathcal{K}$, such that
$$|f(\hat{\mathbf{x}}) - \hat{x}_{(k)}| \ge \frac{1}{2} - \frac{1}{2d}. \qquad (4)$$
We give a proof in Appendix B.4. The above theorem indicates that order statistics cannot be uniformly approximated by any standard Lipschitz network, regardless of the network size. Moreover, note that the trivial constant function $\tilde{f}(\mathbf{x}) = 1/2$ already achieves an approximation error of $1/2$ uniformly on $\mathcal{K}$, implying that standard Lipschitz networks can hardly improve upon trivial solutions.
Remark 3.7. Theorem 3.6 can be easily generalized to weaker forms of non-uniform approximation, e.g., using the $\ell_p$-norm as the distance metric [45], by proving that there exists a hypercube $\mathcal{B}(\hat{\mathbf{x}})$ centered at $\hat{\mathbf{x}}$ with side length $\Theta(1)$, such that $|f(\tilde{\mathbf{x}}) - \tilde{x}_{(k)}| \ge \Theta(1)$ holds for all $\tilde{\mathbf{x}} \in \mathcal{B}(\hat{\mathbf{x}})$ when $d \ge 2$. See Corollary B.13 for details.
Discussion with prior works [2, 24, 43]. The work of Anil et al. also gave negative results on the expressive power of standard Lipschitz networks1 [2, Theorem 1]. They proved a different (weaker) version of Theorem 3.6, showing that if the activation function $\sigma$ is monotonic, the network cannot precisely represent non-linear 1-Lipschitz functions whose gradient norm is 1 almost everywhere (e.g., the absolute value function, shown earlier by [24]). They did not give a quantitative approximation error. The intuition is that any monotonic non-linear 1-Lipschitz activation (e.g., ReLU) must have regions with slope less than 1, leading to gradient attenuation during backpropagation. The authors thus attributed the problem to the activation function, which is not gradient-norm preserving (GNP). However, such an explanation is still not fully satisfactory, as GNP can be achieved simply by using a non-monotonic activation (e.g., $\sigma(x) = |x|$). Consequently, one may expect that a standard Lipschitz network built on a suitable (non-monotonic) activation function could have sufficient expressive power. This idea was recently explored in [43], where the authors proved that using a general 1-Lipschitz piecewise-linear activation with 3 linear regions, the corresponding network achieves the maximum expressive power compared with other Lipschitz activations and can approximate any one-dimensional 1-Lipschitz function. They pose the high-dimensional setting as an open problem.
1 The result is described w.r.t. the $\ell_2$-norm, but with some effort, it can be extended to the $\ell_\infty$-norm case.