A PAC-Bayesian Generalization Bound for Equivariant Networks

Arash Behboodi
Qualcomm AI Research
Gabriele Cesa
Qualcomm AI Research
Taco Cohen
Qualcomm AI Research
Abstract
Equivariant networks capture the inductive bias about the symmetry of the learning task by building those symmetries into the model. In this paper, we study how equivariance relates to generalization error using PAC-Bayesian analysis for equivariant networks, where the transformation laws of feature spaces are determined by group representations. By using perturbation analysis of equivariant networks in the Fourier domain for each layer, we derive norm-based PAC-Bayesian generalization bounds. The bound characterizes the impact of group size, and of the multiplicity and degree of irreducible representations, on the generalization error, and thereby provides a guideline for selecting them. In general, the bound indicates that using a larger group size in the model improves the generalization error, a finding substantiated by extensive numerical experiments.
1 Introduction
Equivariant networks are widely believed to enjoy better generalization properties than their fully-connected counterparts by including an inductive bias about the invariance of the task. A canonical example is a convolutional layer, for which a translation of the input image (over $\mathbb{R}^2$) leads to a translation of the convolutional features. This construction can be generalized to actions of other groups, including rotations and permutations. Intuitively, the inductive bias about invariance and equivariance helps the model focus on features that matter for the task, and thereby helps generalization. Recently, many works have provided theoretical support for this claim, mainly for general equivariant and invariant models. In this work, however, we focus on a class of equivariant neural networks built using representation theoretic tools. As in [STW96, CW16b, WGW+18], this approach can build general equivariant networks using irreducible representations of a group. An open question is how to design feature spaces and combine representations. Besides, the question of the impact of these choices on generalization remains. On the other hand, while it is rather well understood how to build a G-equivariant model when G is finite, the choice of the group G to consider is not always obvious. Indeed, using a subgroup H < G smaller than G is usually sufficient to achieve significant improvements over a non-equivariant baseline. Moreover, the symmetries G of the data are often continuous, which means G-equivariance cannot be built into the model using a group convolution design as in [CW16a]. In such cases, the most successful approaches approximate it with a discrete subgroup H < G, e.g. see [WC19] for the G = O(2) case. In addition, the choice of the equivariance group H can affect either the complexity or the expressiveness of the model and, therefore, its generalization capability, depending on how the architecture is adapted. For instance, if one cannot freely increase the model size, using a larger group H only increases the parameter sharing but reduces the expressiveness of the model. For this reason, using the whole group G might not be beneficial even when feasible. Along this line, we consider how the choice of group size affects generalization.
Equal contribution.
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Contributions
In this paper, we utilize the PAC-Bayes framework to derive generalization bounds on equivariant networks. Our focus is on equivariant networks built following [WGW+18, CW16b]. We combine the representation theoretic framework of [WC19] with the PAC-Bayes framework of [NBS18] to obtain generalization bounds on equivariant networks as a function of irreducible representations (irreps) and their multiplicities. Different from previous PAC-Bayes analyses, we derive the perturbation analysis in the Fourier domain. Without this approach, as we will show, the naive application of [NBS18] leads to degenerate bounds. As part of the proof, we derive a tail bound on the spectral norm of random matrices characterized as a direct sum of irreps. Note that for building equivariant networks, we need to work with real representations and not complex ones, which are conventionally studied in representation theory. The obtained generalization bound provides new insights about the effect of design choices on the generalization error, verified via numerical results. To the best of our knowledge, this is the first generalization bound for equivariant networks with an explicit equivariant structure. We conduct extensive experiments to verify our theoretical insights, as well as some of the previous observations about neural networks.
2 Related Works
Equivariant networks.
In machine learning, there have been many works on incorporating symmetries and invariances in neural network design. This class of models has been extensively studied from different perspectives [ST93, ST89, KT18, CGW18, Mal12, DDFK16, CW16a, CW16b, WGTB17, WHS18, WGW+18, TSK+18, BLV+18, WC19, Bek20, DMGP19, FSIW20, WFVW21, CLW22], to mention only some. Indeed, the most recent developments in the field of equivariant networks suggest a design which enforces equivariance not only at a global level [LSBP16] but also at each layer of the model. An interesting question is how a particular choice for designing equivariant networks affects its performance, and in particular its generalization. A related question is whether including this inductive bias helps generalization. The authors in [SGSR17a] consider general invariant classifiers and obtain robustness-based generalization bounds, as in [SGSR17b], assuming that the data transformations change the inputs drastically. For finite groups, they reported a scaling of the generalization error with $1/\sqrt{|H|}$, where $|H|$ is the cardinality of the underlying group. In [Ele22], the gain of invariant or equivariant hypotheses was studied in the PAC learning framework, where the gain in generalization is attributed to shrinking the hypothesis space to the space of orbit representatives. A similar argument can be found in [SIK21], where it is shown that invariant and equivariant models effectively operate on a shrunk space called the Quotient Feature Space (QFS), and the generalization error is proportional to its volume. They generalize the result of [SGSR17a] and relax the robustness assumption, although their bound shows a suboptimal exponent for the sample size. Similar to [SGSR17a], they focus on the scaling improvement of the generalization error with invariance and do not consider computable architecture-dependent bounds. The authors in [LvdWK+20] use the PAC-Bayesian framework to study the effect of invariance on generalization, although they do not obtain any explicit bound. [BM19] studies the stability of models equivariant to compact groups from an RKHS point of view, relating this stability to a Rademacher complexity of the model and a generalization bound. The authors in [EZ21] provided the generalization gain of equivariant models more concretely and reported a VC-dimension analysis. Subsequent works also considered the connection of invariance and generalization [ZAH21]. In contrast with these models, we consider equivariant networks with a representation theoretic construction. Beyond generalization, we are interested in extracting design insights from the bound.
Generalization error for neural networks.
A more general study of generalization error in neural networks, however, has been ongoing for some time, with a huge body of work on the topic. We refer to some highlights here. A challenge for this problem was brought up in [ZBH+17], where it was shown, using the ImageNet dataset with random labels, that the generalization error can be arbitrarily large for neural networks, as they achieve small training error but naturally cannot generalize because of the random labels of images. Therefore, any complexity measure for neural networks should be consistent with the above observation. As a result, uniform complexity measures like the VC-dimension do not satisfy this requirement, as they provide a uniform measure over the whole hypothesis space.
In this light, recent works related the generalization error to quantities like the margin and different norms of the weights, for example in, among others, [WM19, SGSR17b, NBS18, AGNZ18, BFT17, GRS18, DR18a, LS19, VSS22, LMLK21, VPL20]. Some of these bounds are still vacuous or dimension-dependent (see [JNK+20] for a detailed experimental investigation and [NK19, KZSS21, NDR21] for follow-up discussions on uniform complexity measures). The class of convolutional neural networks is particularly relevant, as they can be considered a special case of equivariant models. These models are discussed in [PLDV19, LS19, VSS22, LMLK21]. We choose the PAC-Bayesian framework for deriving generalization bounds. There are many works on PAC-Bayesian bounds for neural networks [NBS18, BG22, DR17, DR18b, DR18a, DDN+20], ranging from randomized and deterministic bounds to choosing priors for non-vacuous bounds. Our method combines the machinery of [NBS18] with representation theoretic tools.
3 Background
3.1 Preliminaries and notation
We first fix some notation. We consider a classification task with input space X and output space Y. The input is assumed to be bounded in $\ell_2$-norm by B. The data distribution, over X × Y, is denoted by D. The hypothesis space, H, consists of all functions realized by an L-layer general neural network with 1-Lipschitz homogeneous activation functions (a function σ(·) is homogeneous if and only if σ(λx) = λσ(x) for all x and λ ∈ R). The network function, $f_W(\cdot)$ with fixed architecture, is specified by its parameters $W := (W_1, \ldots, W_L)$. The operations on W are extended from the underlying vector spaces of weights. For any function f(·), the k'th component of the function is denoted by f(x)[k] throughout the text. For a loss function $L : H \times X \times Y \to \mathbb{R}$, the empirical loss is defined by
$$\hat{L}(f) := \frac{1}{m}\sum_{i=1}^{m} L(f, (x_i, y_i)),$$
and the true loss is defined by $L(f) = \mathbb{E}_{(x,y)\sim D}[L(f, (x, y))]$. For classification, the margin loss is given by
$$L_\gamma(f_W) = \mathbb{P}_{(x,y)\sim D}\left[ f_W(x)[y] \le \gamma + \max_{j\neq y} f_W(x)[j] \right]. \tag{1}$$
The true loss of a given classifier is then given by $L(f_W) = L_0(f_W)$. The empirical margin loss is similarly defined and denoted by $\hat{L}_\gamma$. In this work, we consider invariance with respect to transformations modeled as the action of a compact group G. See Supplementary 7 for a brief introduction to group theory and the concepts that we will use throughout this paper.
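To make the margin loss concrete, here is a minimal NumPy sketch (not from the paper) of the empirical margin loss $\hat{L}_\gamma$; the function name and array layout are our own assumptions.

```python
import numpy as np

def empirical_margin_loss(logits, labels, gamma):
    """Fraction of samples whose true-class score f_W(x)[y] does not exceed
    the best competing score max_{j != y} f_W(x)[j] by more than gamma."""
    m = logits.shape[0]
    correct = logits[np.arange(m), labels]          # f_W(x)[y]
    others = logits.astype(float)                   # copy so we can mask
    others[np.arange(m), labels] = -np.inf          # exclude the true class
    runner_up = others.max(axis=1)                  # max_{j != y} f_W(x)[j]
    return float(np.mean(correct <= gamma + runner_up))
```

Setting gamma = 0 recovers the 0-1 loss, matching $L(f_W) = L_0(f_W)$.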
3.2 Equivariant Neural Networks
We provide a concise introduction to equivariant networks here, with more details given in Supplementary 8. We adopt a representation theoretic framework for building equivariant models.
Example.
Consider a square integrable complex valued function f with bandwidth B defined on the 2D rotation group SO(2) and represented using its Fourier series as $f(\theta) = \sum_{n=0}^{B} a_n e^{in\theta}$. Consider an equivariant linear functional of f mapping it to another function on SO(2). In Fourier space, the linear transformation is given by $Wa$, with a the Fourier coefficient vector $a = (a_0, \ldots, a_B)$. The action of the rotation group on the input and output space is simply given by $f(\theta) \mapsto f(\theta - \theta_0)$, with $\theta_0$ the rotation angle, or equivalently in Fourier space as a linear transformation $a \mapsto \rho(\theta_0)\, a$ with $\rho(\theta_0) = \mathrm{diag}(1, e^{i\theta_0}, \ldots, e^{iB\theta_0})$.
Figure 1: An equivariant layer maps the coefficients corresponding to each frequency (irrep) ψ to a new set of coefficients based on the matrix $\hat{W}_l(\psi)$, which includes the blocks $\hat{W}_l(\psi, i, j)$ with $i \in [m_{l-1,\psi}]$ and $j \in [m_{l,\psi}]$.
Since this holds for all a, the equivariance condition means $W\rho(\theta_0) = \rho(\theta_0)\,W$ for all $\theta_0$, which implies that W should be diagonal. A general feature space can be built by stacking multiple linear transformations W, and more flexibly, by letting the transformations act only on some of the frequencies (for example, a transformation that acts only on $a_B$ for frequency B). This machinery can be generalized to represent equivariant networks with respect to any compact group, by using irreducible representations (irreps) of the group in place of the complex exponentials $e^{in\theta}$ of this example. The key is to notice that the $e^{in\theta}$ are irreps of SO(2).
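As a quick numerical illustration of this example (a sketch under our own naming, not the paper's code), a diagonal map on the Fourier coefficients commutes with $\rho(\theta_0)$, while a generic dense map does not.

```python
import numpy as np

B, theta0 = 4, 0.7                                    # bandwidth and an arbitrary rotation angle
rng = np.random.default_rng(0)

# Action of the rotation theta0 on the Fourier coefficient vector a = (a_0, ..., a_B).
rho = np.diag(np.exp(1j * np.arange(B + 1) * theta0))

W_diag = np.diag(rng.standard_normal(B + 1))          # equivariant: diagonal in Fourier space
W_dense = rng.standard_normal((B + 1, B + 1))         # generic linear map

print(np.allclose(W_diag @ rho, rho @ W_diag))        # True
print(np.allclose(W_dense @ rho, rho @ W_dense))      # False
```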
General Model.
Consider a Multi-Layer Perceptron (MLP) with L layers, with layer $l \in [L]$ given by the matrix $W_l \in \mathbb{R}^{c_l \times c_{l-1}}$. The action of the group H on layer l is determined by a linear transformation $\rho_l$, which is called the group representation on $\mathbb{R}^{c_l}$ (see Section 8 for more details). A group element $h \in H$ linearly transforms the layer l by the matrix $\rho_l(h)$. The matrix $W_l$ is equivariant w.r.t. the representations $\rho_{l-1}$ and $\rho_l$ acting on its input and output if and only if for all $h \in H$, we have:
$$W_l\, \rho_{l-1}(h) = \rho_l(h)\, W_l.$$
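For a concrete finite-group instance of this condition (our own sketch, not from the paper), take H to be the cyclic group $C_N$ acting on $\mathbb{R}^N$ by its regular representation, i.e. cyclic shifts; the equivariant matrices are then exactly the circulant matrices.

```python
import numpy as np

N = 6
rng = np.random.default_rng(0)

def rho(h):
    """Regular representation of C_N: the permutation matrix shifting coordinates by h."""
    P = np.zeros((N, N))
    P[np.arange(N), (np.arange(N) - h) % N] = 1.0
    return P

w = rng.standard_normal(N)
W = np.array([[w[(i - j) % N] for j in range(N)] for i in range(N)])   # circulant matrix

# W rho(h) = rho(h) W holds for every group element h.
print(all(np.allclose(W @ rho(h), rho(h) @ W) for h in range(N)))      # True
```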
Representation theory provides a way of characterizing equivariant layers using Maschke's theorem and Schur's lemma. The key step is to start from characterizing the irreducible representations (irreps) $\{\psi\}$ of the group H. As we will see, irreps are closely related to a generalization of Fourier analysis for functions defined on H. Maschke's theorem implies that the representation $\rho_l$ decomposes into a direct sum of irreps as
$$\rho_l = Q_l \left( \bigoplus_{\psi} \bigoplus_{i=1}^{m_{l,\psi}} \psi \right) Q_l^{-1},$$
where $m_{l,\psi}$ is the number of times the irrep ψ is present in the representation $\rho_l$, that is, its multiplicity. Each ψ is a $\dim\psi \times \dim\psi$ matrix, and the direct sum is effectively a block diagonal matrix with the matrix irreps ψ on the diagonal. Through this decomposition, we can parameterize equivariant networks in terms of irreps, that is, in Fourier space. Defining $\hat{W}_l = Q_l^{-1} W_l Q_{l-1}$, the equivariance condition writes as
$$\hat{W}_l \left( \bigoplus_{\psi} \bigoplus_{i=1}^{m_{l-1,\psi}} \psi \right) = \left( \bigoplus_{\psi} \bigoplus_{i=1}^{m_{l,\psi}} \psi \right) \hat{W}_l.$$
The block diagonal structure of $\bigoplus_{\psi} \bigoplus_{i=1}^{m_{l,\psi}} \psi$ induces a similar structure on $\hat{W}_l$ (see Figure 6). Let $\hat{W}_l(\psi_2, j, \psi_1, i)$ denote the block in $\hat{W}_l$ that relates the i'th instance of $\psi_1$ to the j'th instance of $\psi_2$, with $i \in [m_{l-1,\psi_1}]$ and $j \in [m_{l,\psi_2}]$. Block-wise, the equivariance condition reads:
$$\forall h, \quad \hat{W}_l(\psi_2, j, \psi_1, i)\, \psi_1(h) = \psi_2(h)\, \hat{W}_l(\psi_2, j, \psi_1, i).$$
Schur's lemma helps us characterize the equivariant kernels. Note that for a neural network implementation, we need to work with a version of Schur's lemma for real-valued representations, given in Supplementary 8. Schur's lemma implies that if $\psi_1 \neq \psi_2$, the block needs to be zero to guarantee equivariance. Otherwise, it needs to have one of the three forms in Supplementary 8, depending on the type of $\psi_1$. To simplify the notation, we write $\hat{W}_l(\psi, j, \psi, i)$ as $\hat{W}_l(\psi, j, i)$. As shown in Supplementary 8, the matrix $\hat{W}_l(\psi, j, i)$ can be written as the linear sum
$$\hat{W}_l(\psi, j, i) = \sum_{k=1}^{c_\psi} \hat{w}_{l,i,j}(\psi)_k\, B_{l,\psi,i,j,k}, \tag{2}$$
where $\{\hat{w}_{l,i,j}(\psi)_k\}_{k=1}^{c_\psi}$ are learnable parameters and $B_{l,\psi,i,j,k}$ are fixed matrices. The value of $c_\psi$ is either 1, 2, or 4. The structure of $B_{l,\psi,i,j,k}$ changes according to the type $c_\psi$.
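As an illustration of Eq. (2) (a minimal sketch under our own assumptions, using the 2D rotation group as in the earlier example): the trivial irrep of SO(2) has $c_\psi = 1$, while every nonzero frequency gives a 2-dimensional real irrep of complex type with $c_\psi = 2$, whose commutant is spanned by the identity and the 90-degree rotation matrix. The basis matrices below are one valid choice, not necessarily the paper's.

```python
import numpy as np

def irrep(freq, theta):
    """Real irrep psi(theta) of SO(2) at the given frequency."""
    if freq == 0:
        return np.array([[1.0]])
    c, s = np.cos(freq * theta), np.sin(freq * theta)
    return np.array([[c, -s], [s, c]])

def equivariant_block(freq, w):
    """Block W_hat_l(psi, j, i) = sum_k w_k B_k for a real irrep of SO(2)."""
    if freq == 0:                                  # trivial irrep: c_psi = 1
        return np.array([[w[0]]])
    I = np.eye(2)                                  # c_psi = 2: span of {I, J},
    J = np.array([[0.0, -1.0], [1.0, 0.0]])        # i.e. matrices of the form [[a, -b], [b, a]]
    return w[0] * I + w[1] * J

# Block-level equivariance check: W_hat psi(theta) = psi(theta) W_hat.
freq, theta = 2, 1.3
Wb = equivariant_block(freq, [0.5, -1.7])
print(np.allclose(Wb @ irrep(freq, theta), irrep(freq, theta) @ Wb))   # True
```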
Each layer is parameterized by the matrices $\hat{W}_l(\psi)$, which include the parameter vectors $\hat{w}_{l,i,j}(\psi) \in \mathbb{R}^{c_\psi}$, each with $c_\psi$ entries, with $i \in [m_{l-1,\psi}]$ and $j \in [m_{l,\psi}]$. Therefore, for each layer, there will be $\sum_{\psi} m_{l,\psi}\, m_{l-1,\psi}\, c_\psi$ parameters, with the width of layer l given as $c_l = \sum_{\psi} m_{l,\psi} \dim\psi$. Note that, if $x_l = W_l x_{l-1}$, then $\hat{x}_l = \hat{W}_l \hat{x}_{l-1}$, where $\hat{x}_l = Q_l^{-1} x_l$. The non-linearities should satisfy the equivariance condition to have full end-to-end equivariance. An important question is which representation $\rho_l$ of H should be used for each layer. For the input, the representation is fixed by the input space and its structure. However, the intermediate layers can choose $\rho_l$ by selecting irreps and their multiplicities. We explain how this choice impacts generalization bounds. We consider a general equivariant network denoted by $f_W$ with the parameters of the l'th layer given by $\hat{w}_{l,i,j}(\psi)$ for all irreps ψ, $i \in [m_{l-1,\psi}]$ and $j \in [m_{l,\psi}]$.
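The per-layer bookkeeping above can be summarized in a short sketch (illustrative names and inputs, not the paper's API), counting the parameters $\sum_\psi m_{l,\psi}\, m_{l-1,\psi}\, c_\psi$ and the width $c_l = \sum_\psi m_{l,\psi} \dim\psi$.

```python
def layer_stats(mult_in, mult_out, dim_psi, c_psi):
    """mult_in / mult_out: irrep -> multiplicity in layers l-1 and l;
    dim_psi: irrep -> dimension; c_psi: irrep -> free parameters per block (1, 2 or 4)."""
    n_params = sum(mult_out[p] * mult_in.get(p, 0) * c_psi[p] for p in mult_out)
    width = sum(mult_out[p] * dim_psi[p] for p in mult_out)
    return n_params, width

# Example: SO(2) features with frequencies 0, 1, 2, each with multiplicity 4.
dim_psi = {0: 1, 1: 2, 2: 2}      # trivial irrep is 1-dimensional, the rest 2-dimensional
c_psi   = {0: 1, 1: 2, 2: 2}      # real / complex type
mult    = {0: 4, 1: 4, 2: 4}
print(layer_stats(mult, mult, dim_psi, c_psi))   # (80, 20): 80 parameters, width 20
```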
4 PAC-Bayesian bound
We derive PAC-Bayes bounds for the equivariant networks $f_W$ presented in the previous section. Although the proof follows a similar strategy as in [NBS18], it differs in two important aspects. First, the weights of equivariant MLPs have a specific structure, which requires reworking the proof steps. Second, we carry out the analysis in the Fourier domain. In Supplementary 11, we also show that naively applying the norm-bounds of [NBS18] cannot explain the generalization behaviour of equivariant networks. A simple version of our result is given below.

Theorem 4.1 (Homogeneous Bounds for Equivariant Networks). For any equivariant network, with high probability we have:
$$L(f_W) \le \hat{L}_\gamma(f_W) + \tilde{O}\left( \sqrt{ \frac{\prod_l \|W_l\|_2^2}{\gamma^2} \left( \sum_{l=1}^{L} \sqrt{M(l,\eta)} \right)^{2} \sum_{l} \frac{\sum_{\psi,i,j} \|\hat{W}_l(\psi, i, j)\|_F^2 / \dim\psi}{\|W_l\|_2^2} } \right),$$
where $\eta \in (0,1)$ and
$$M(l, \eta) := \log\left( \frac{\sum_{l=1}^{L} \sum_{\psi} m_{l,\psi}}{1-\eta} \right) \max_{\psi}\left( 5\, m_{l-1,\psi}\, m_{l,\psi}\, c_\psi \right). \tag{3}$$
The complete version of the theorem is given in Theorem 9.1.
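For illustration only (a hypothetical helper, not the paper's code), the norm-dependent factor under the square root can be evaluated from the layer spectral norms and the Fourier-domain blocks; the $\sqrt{M(l,\eta)}$ factor is omitted here.

```python
import numpy as np

def capacity_term(spectral_norms, block_frob_sq, block_dims, gamma):
    """spectral_norms[l] = ||W_l||_2; block_frob_sq[l][k] and block_dims[l][k] are
    ||W_hat_l(psi, i, j)||_F^2 and dim(psi) for the k-th block of layer l."""
    prod = np.prod([s ** 2 for s in spectral_norms]) / gamma ** 2
    ratio = sum(
        sum(f / d for f, d in zip(block_frob_sq[l], block_dims[l])) / spectral_norms[l] ** 2
        for l in range(len(spectral_norms))
    )
    return prod * ratio

# Toy usage: two layers, each with two Fourier-domain blocks.
print(capacity_term([2.0, 1.5], [[1.0, 0.5], [0.8, 0.2]], [[1, 2], [1, 2]], gamma=1.0))
```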
We comment briefly on the proof steps. PAC-Bayesian generalization bounds start with defining a prior P and a posterior Q over the network parameters. The network is drawn randomly using Q, which is also used to evaluate the average loss and generalization error. The PAC-Bayesian bound provides a bound on the average generalization error of networks randomly drawn from Q (Section 4.1). The next step is to de-randomize the bound and to obtain the