A PAC-Bayesian Generalization Bound for Equivariant Networks

Arash Behboodi
Qualcomm AI Research
Gabriele Cesa
Qualcomm AI Research
Taco Cohen
Qualcomm AI Research
Abstract
Equivariant networks capture the inductive bias about the symmetry of the learning task by building those symmetries into the model. In this paper, we study how equivariance relates to generalization error using PAC-Bayesian analysis for equivariant networks, where the transformation laws of feature spaces are determined by group representations. By using perturbation analysis of equivariant networks in the Fourier domain for each layer, we derive norm-based PAC-Bayesian generalization bounds. The bound characterizes the impact of group size, and of the multiplicity and degree of irreducible representations, on the generalization error, and thereby provides a guideline for selecting them. In general, the bound indicates that using a larger group size in the model improves the generalization error, a finding substantiated by extensive numerical experiments.
1 Introduction
Equivariant networks are widely believed to enjoy better generalization properties than their fully-connected counterparts by including an inductive bias about the invariance of the task. A canonical example is a convolutional layer, for which a translation of the input image (over $\mathbb{R}^2$) leads to a translation of the convolutional features. This construction can be generalized to actions of other groups, including rotations and permutations. Intuitively, the inductive bias about invariance and equivariance helps the model focus on features that matter for the task, and thereby helps generalization. Recently, many works have provided theoretical support for this claim, mainly for general equivariant and invariant models. In this work, however, we focus on a class of equivariant neural networks built using representation theoretic tools. As in [STW96, CW16b, WGW+18], this approach can build general equivariant networks using irreducible representations of a group. An open question is how to design feature spaces and combine representations. Besides, the question of the impact of these choices on generalization remains. On the other hand, while it is rather well understood how to build a G-equivariant model when G is finite, the choice of the group G to consider is not always obvious. Indeed, using a subgroup H < G smaller than G is usually sufficient to achieve significant improvements over a non-equivariant baseline. Moreover, the symmetries G of the data are often continuous, which means G-equivariance cannot be built into the model using a group convolution design as in [CW16a]. In such cases, the most successful approaches approximate it with a discrete subgroup H < G, e.g. see [WC19] for the G = O(2) case. In addition, the choice of the equivariance group H can affect either the complexity or the expressiveness of the model and, therefore, its generalization capability, depending on how the architecture is adapted. For instance, if one cannot freely increase the model size, using a larger group H only increases the parameter sharing but reduces the expressiveness of the model. For this reason, using the whole group G might not be beneficial even when feasible. Along this line, we consider how the choice of group size affects generalization.
Equal contribution.
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Contributions
In this paper, we utilize the PAC-Bayes framework to derive generalization bounds on equivariant networks. Our focus is on equivariant networks built following [WGW+18, CW16b]. We combine the representation theoretic framework of [WC19] with the PAC-Bayes framework of [NBS18] to obtain generalization bounds on equivariant networks as a function of irreducible representations (irreps) and their multiplicities. Different from previous PAC-Bayes analyses, we derive the perturbation analysis in the Fourier domain. Without this approach, as we will show, the naive application of [NBS18] leads to degenerate bounds. As part of the proof, we derive a tail bound on the spectral norm of random matrices characterized as a direct sum of irreps. Note that for building equivariant networks, we need to work with real representations and not complex ones, which are conventionally studied in representation theory. The obtained generalization bound provides new insights about the effect of design choices on the generalization error, verified via numerical results. To the best of our knowledge, this is the first generalization bound for equivariant networks with an explicit equivariant structure. We conduct extensive experiments to verify our theoretical insights, as well as some of the previous observations about neural networks.
2 Related Works
Equivariant networks.
In machine learning, there have been many works on incorporating symmetries and invariances in neural network design. This class of models has been extensively studied from different perspectives [ST93, ST89, KT18, CGW18, Mal12, DDFK16, CW16a, CW16b, WGTB17, WHS18, WGW+18, TSK+18, BLV+18, WC19, Bek20, DMGP19, FSIW20, WFVW21, CLW22], to mention only some. Indeed, the most recent developments in the field of equivariant networks suggest a design which enforces equivariance not only at a global level [LSBP16] but also at each layer of the model. An interesting question is how a particular choice for designing equivariant networks affects its performance, and in particular its generalization. A related question is whether including this inductive bias helps generalization. The authors in [SGSR17a] consider general invariant classifiers and obtain robustness-based generalization bounds, as in [SGSR17b], assuming that the data transformations change the inputs drastically. For finite groups, they reported a scaling of the generalization error with $1/\sqrt{|H|}$, where $|H|$ is the cardinality of the underlying group. In [Ele22], the gain of invariant or equivariant hypotheses was studied in the PAC learning framework, where the gain in generalization is attributed to shrinking the hypothesis space to the space of orbit representatives. A similar argument can be found in [SIK21], where it is shown that invariant and equivariant models effectively operate on a shrunk space called the Quotient Feature Space (QFS), and the generalization error is proportional to its volume. They generalize the result of [SGSR17a] and relax the robustness assumption, although their bound shows a suboptimal exponent for the sample size. Similar to [SGSR17a], they focus on the scaling improvement of the generalization error with invariance and do not consider computable architecture-dependent bounds. The authors in [LvdWK+20] use the PAC-Bayesian framework to study the effect of invariance on generalization, although they do not obtain any explicit bound. [BM19] studies the stability of models equivariant to compact groups from an RKHS point of view, relating this stability to a Rademacher complexity of the model and a generalization bound. The authors in [EZ21] provided the generalization gain of equivariant models more concretely and reported a VC-dimension analysis. Subsequent works also considered the connection of invariance and generalization [ZAH21]. In contrast with these models, we consider equivariant networks with a representation theoretic construction. Beyond generalization, we are interested in extracting design insights from the bound.
Generalization error for neural networks.
A more general study of generalization error in neural networks, however, has been ongoing for some time, with a huge body of work on the topic. We refer to some highlights here. A challenge for this problem was brought up in [ZBH+17], where it was shown, using the ImageNet dataset with random labels, that the generalization error can be arbitrarily large for neural networks, as they achieve small training error but naturally cannot generalize because of the random labels of images. Therefore, any complexity measure for neural networks should be consistent with the above observation. As a result, uniform complexity measures like the VC-dimension do not satisfy this requirement, as they provide a uniform measure over the whole hypothesis space.
In this light, recent works related the generalization error to quantities like the margin and different norms of the weights, for example in, among others, [WM19, SGSR17b, NBS18, AGNZ18, BFT17, GRS18, DR18a, LS19, VSS22, LMLK21, VPL20]. Some of these bounds are still vacuous or dimension-dependent (see [JNK+20] for a detailed experimental investigation and [NK19, KZSS21, NDR21] for follow-up discussions on uniform complexity measures). The class of convolutional neural networks is particularly relevant, as they can be considered a special case of equivariant models. These models are discussed in [PLDV19, LS19, VSS22, LMLK21]. We choose the PAC-Bayesian framework for deriving generalization bounds. There are many works on PAC-Bayesian bounds for neural networks [NBS18, BG22, DR17, DR18b, DR18a, DDN+20], ranging from randomized and deterministic bounds to choosing priors for non-vacuous bounds. Our method combines the machinery of [NBS18] with representation theoretic tools.
3 Background
3.1 Preliminaries and notation
We first fix some notation. We consider a classification task with input space X and output space Y. The input is assumed to be bounded in $\ell_2$-norm by B. The data distribution, over X × Y, is denoted by D. The hypothesis space, H, consists of all functions realized by an L-layer general neural network with 1-Lipschitz homogeneous activation functions (a function σ(·) is homogeneous if and only if σ(λx) = λσ(x) for all x and λ ∈ R). The network function, $f_W(\cdot)$ with fixed architecture, is specified by its parameters $W := (W_1, \ldots, W_L)$. The operations on W are extended from the underlying vector spaces of weights. For any function f(·), the k'th component of the function is denoted by f(x)[k] throughout the text. For a loss function $L : H \times X \times Y \to \mathbb{R}$, the empirical loss is defined by
$$\hat{L}(f) := \frac{1}{m}\sum_{i=1}^{m} L(f, (x_i, y_i)),$$
and the true loss is defined by $L(f) = \mathbb{E}_{(x,y)\sim D}[L(f, (x, y))]$. For classification, the margin loss is given by
$$L_\gamma(f_W) = \mathbb{P}_{(x,y)\sim D}\left[ f_W(x)[y] \le \gamma + \max_{j\neq y} f_W(x)[j] \right]. \tag{1}$$
The true loss of a given classifier is then given by $L(f_W) = L_0(f_W)$. The empirical margin loss is similarly defined and denoted by $\hat{L}_\gamma$. In this work, we consider invariance with respect to transformations modeled as the action of a compact group G. See Supplementary 7 for a brief introduction to group theory and the concepts that we will use throughout this paper.
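To make the margin loss concrete, here is a minimal NumPy sketch (not from the paper) of the empirical margin loss $\hat{L}_\gamma$; the function name and array layout are our own assumptions.

```python
import numpy as np

def empirical_margin_loss(logits, labels, gamma):
    """Fraction of samples whose true-class score f_W(x)[y] does not exceed
    the best competing score max_{j != y} f_W(x)[j] by more than gamma."""
    m = logits.shape[0]
    correct = logits[np.arange(m), labels]          # f_W(x)[y]
    others = logits.astype(float)                   # copy so we can mask
    others[np.arange(m), labels] = -np.inf          # exclude the true class
    runner_up = others.max(axis=1)                  # max_{j != y} f_W(x)[j]
    return float(np.mean(correct <= gamma + runner_up))
```

Setting gamma = 0 recovers the 0-1 loss, matching $L(f_W) = L_0(f_W)$.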
3.2 Equivariant Neural Networks
We provide a concise introduction to equivariant networks here, with more details given in Supplementary 8. We adopt a representation theoretic framework for building equivariant models.
Example.
Consider a square integrable complex valued function f with bandwidth B defined on the 2D rotation group SO(2) and represented using its Fourier series as $f(\theta) = \sum_{n=0}^{B} a_n e^{in\theta}$. Consider an equivariant linear functional of f mapping it to another function on SO(2). In Fourier space, the linear transformation is given by $Wa$, with a the Fourier coefficient vector $a = (a_0, \ldots, a_B)$. The action of the rotation group on the input and output space is simply given by $f(\theta) \mapsto f(\theta - \theta_0)$, with $\theta_0$ the rotation angle, or equivalently in Fourier space as a linear transformation $a \mapsto \rho(\theta_0)\, a$ with $\rho(\theta_0) = \mathrm{diag}(1, e^{i\theta_0}, \ldots, e^{iB\theta_0})$.
Figure 1: An equivariant layer maps the coefficients corresponding to each frequency (irrep) ψ to a new set of coefficients based on the matrix $\hat{W}_l(\psi)$, which includes the blocks $\hat{W}_l(\psi, i, j)$ with $i \in [m_{l-1,\psi}]$ and $j \in [m_{l,\psi}]$.
Since this holds for all a, the equivariance condition means $W\rho(\theta_0) = \rho(\theta_0)\,W$ for all $\theta_0$, which implies that W should be diagonal. A general feature space can be built by stacking multiple linear transformations W, and more flexibly, by letting the transformations act only on some of the frequencies (for example, a transformation that acts only on $a_B$ for frequency B). This machinery can be generalized to represent equivariant networks with respect to any compact group, by using irreducible representations (irreps) of the group in place of the complex exponentials $e^{in\theta}$ of this example. The key is to notice that the $e^{in\theta}$ are irreps of SO(2).
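As a quick numerical illustration of this example (a sketch under our own naming, not the paper's code), a diagonal map on the Fourier coefficients commutes with $\rho(\theta_0)$, while a generic dense map does not.

```python
import numpy as np

B, theta0 = 4, 0.7                                    # bandwidth and an arbitrary rotation angle
rng = np.random.default_rng(0)

# Action of the rotation theta0 on the Fourier coefficient vector a = (a_0, ..., a_B).
rho = np.diag(np.exp(1j * np.arange(B + 1) * theta0))

W_diag = np.diag(rng.standard_normal(B + 1))          # equivariant: diagonal in Fourier space
W_dense = rng.standard_normal((B + 1, B + 1))         # generic linear map

print(np.allclose(W_diag @ rho, rho @ W_diag))        # True
print(np.allclose(W_dense @ rho, rho @ W_dense))      # False
```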
General Model.
Consider a Multi-Layer Perceptron (MLP) with L layers, with layer $l \in [L]$ given by the matrix $W_l \in \mathbb{R}^{c_l \times c_{l-1}}$. The action of the group H on layer l is determined by a linear transformation $\rho_l$, which is called the group representation on $\mathbb{R}^{c_l}$ (see Section 8 for more details). A group element $h \in H$ linearly transforms the layer l by the matrix $\rho_l(h)$. The matrix $W_l$ is equivariant w.r.t. the representations $\rho_{l-1}$ and $\rho_l$ acting on its input and output if and only if for all $h \in H$, we have:
$$W_l\, \rho_{l-1}(h) = \rho_l(h)\, W_l.$$
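For a concrete finite-group instance of this condition (our own sketch, not from the paper), take H to be the cyclic group $C_N$ acting on $\mathbb{R}^N$ by its regular representation, i.e. cyclic shifts; the equivariant matrices are then exactly the circulant matrices.

```python
import numpy as np

N = 6
rng = np.random.default_rng(0)

def rho(h):
    """Regular representation of C_N: the permutation matrix shifting coordinates by h."""
    P = np.zeros((N, N))
    P[np.arange(N), (np.arange(N) - h) % N] = 1.0
    return P

w = rng.standard_normal(N)
W = np.array([[w[(i - j) % N] for j in range(N)] for i in range(N)])   # circulant matrix

# W rho(h) = rho(h) W holds for every group element h.
print(all(np.allclose(W @ rho(h), rho(h) @ W) for h in range(N)))      # True
```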
Representation theory provides a way of characterizing equivariant layers using Maschke's theorem and Schur's lemma. The key step is to start from characterizing the irreducible representations (irreps) $\{\psi\}$ of the group H. As we will see, irreps are closely related to a generalization of Fourier analysis for functions defined on H. Maschke's theorem implies that the representation $\rho_l$ decomposes into a direct sum of irreps as
$$\rho_l = Q_l \left( \bigoplus_{\psi} \bigoplus_{i=1}^{m_{l,\psi}} \psi \right) Q_l^{-1},$$
where $m_{l,\psi}$ is the number of times the irrep ψ is present in the representation $\rho_l$, that is, its multiplicity. Each ψ is a $\dim\psi \times \dim\psi$ matrix, and the direct sum is effectively a block diagonal matrix with the matrix irreps ψ on the diagonal. Through this decomposition, we can parameterize equivariant networks in terms of irreps, that is, in Fourier space. Defining $\hat{W}_l = Q_l^{-1} W_l Q_{l-1}$, the equivariance condition writes as
$$\hat{W}_l \left( \bigoplus_{\psi} \bigoplus_{i=1}^{m_{l-1,\psi}} \psi \right) = \left( \bigoplus_{\psi} \bigoplus_{i=1}^{m_{l,\psi}} \psi \right) \hat{W}_l.$$
The block diagonal structure of $\bigoplus_{\psi} \bigoplus_{i=1}^{m_{l,\psi}} \psi$ induces a similar structure on $\hat{W}_l$ (see Figure 6). Let $\hat{W}_l(\psi_2, j, \psi_1, i)$ denote the block in $\hat{W}_l$ that relates the i'th instance of $\psi_1$ to the j'th instance of $\psi_2$, with $i \in [m_{l-1,\psi_1}]$ and $j \in [m_{l,\psi_2}]$. Block-wise, the equivariance condition reads:
$$\forall h, \quad \hat{W}_l(\psi_2, j, \psi_1, i)\, \psi_1(h) = \psi_2(h)\, \hat{W}_l(\psi_2, j, \psi_1, i).$$
Schur's lemma helps us characterize the equivariant kernels. Note that for a neural network implementation, we need to work with a version of Schur's lemma for real-valued representations, given in Supplementary 8. Schur's lemma implies that if $\psi_1 \neq \psi_2$, the block needs to be zero to guarantee equivariance. Otherwise, it needs to have one of the three forms in Supplementary 8, depending on the type of $\psi_1$. To simplify the notation, we write $\hat{W}_l(\psi, j, \psi, i)$ as $\hat{W}_l(\psi, j, i)$. As shown in Supplementary 8, the matrix $\hat{W}_l(\psi, j, i)$ can be written as the linear sum
$$\hat{W}_l(\psi, j, i) = \sum_{k=1}^{c_\psi} \hat{w}_{l,i,j}(\psi)_k\, B_{l,\psi,i,j,k}, \tag{2}$$
where $\{\hat{w}_{l,i,j}(\psi)_k\}_{k=1}^{c_\psi}$ are learnable parameters and $B_{l,\psi,i,j,k}$ are fixed matrices. The value of $c_\psi$ is either 1, 2, or 4. The structure of $B_{l,\psi,i,j,k}$ changes according to the type $c_\psi$.
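As an illustration of Eq. (2) (a minimal sketch under our own assumptions, using the 2D rotation group as in the earlier example): the trivial irrep of SO(2) has $c_\psi = 1$, while every nonzero frequency gives a 2-dimensional real irrep of complex type with $c_\psi = 2$, whose commutant is spanned by the identity and the 90-degree rotation matrix. The basis matrices below are one valid choice, not necessarily the paper's.

```python
import numpy as np

def irrep(freq, theta):
    """Real irrep psi(theta) of SO(2) at the given frequency."""
    if freq == 0:
        return np.array([[1.0]])
    c, s = np.cos(freq * theta), np.sin(freq * theta)
    return np.array([[c, -s], [s, c]])

def equivariant_block(freq, w):
    """Block W_hat_l(psi, j, i) = sum_k w_k B_k for a real irrep of SO(2)."""
    if freq == 0:                                  # trivial irrep: c_psi = 1
        return np.array([[w[0]]])
    I = np.eye(2)                                  # c_psi = 2: span of {I, J},
    J = np.array([[0.0, -1.0], [1.0, 0.0]])        # i.e. matrices of the form [[a, -b], [b, a]]
    return w[0] * I + w[1] * J

# Block-level equivariance check: W_hat psi(theta) = psi(theta) W_hat.
freq, theta = 2, 1.3
Wb = equivariant_block(freq, [0.5, -1.7])
print(np.allclose(Wb @ irrep(freq, theta), irrep(freq, theta) @ Wb))   # True
```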
Each layer is parameterized by the matrices $\hat{W}_l(\psi)$, which include the parameter vectors $\hat{w}_{l,i,j}(\psi) \in \mathbb{R}^{c_\psi}$, each with $c_\psi$ entries, with $i \in [m_{l-1,\psi}]$ and $j \in [m_{l,\psi}]$. Therefore, for each layer, there will be $\sum_{\psi} m_{l,\psi}\, m_{l-1,\psi}\, c_\psi$ parameters, with the width of layer l given as $c_l = \sum_{\psi} m_{l,\psi} \dim\psi$. Note that, if $x_l = W_l x_{l-1}$, then $\hat{x}_l = \hat{W}_l \hat{x}_{l-1}$, where $\hat{x}_l = Q_l^{-1} x_l$. The non-linearities should satisfy the equivariance condition to have full end-to-end equivariance. An important question is which representation $\rho_l$ of H should be used for each layer. For the input, the representation is fixed by the input space and its structure. However, the intermediate layers can choose $\rho_l$ by selecting irreps and their multiplicities. We explain how this choice impacts generalization bounds. We consider a general equivariant network denoted by $f_W$ with the parameters of the l'th layer given by $\hat{w}_{l,i,j}(\psi)$ for all irreps ψ, $i \in [m_{l-1,\psi}]$ and $j \in [m_{l,\psi}]$.
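The per-layer bookkeeping above can be summarized in a short sketch (illustrative names and inputs, not the paper's API), counting the parameters $\sum_\psi m_{l,\psi}\, m_{l-1,\psi}\, c_\psi$ and the width $c_l = \sum_\psi m_{l,\psi} \dim\psi$.

```python
def layer_stats(mult_in, mult_out, dim_psi, c_psi):
    """mult_in / mult_out: irrep -> multiplicity in layers l-1 and l;
    dim_psi: irrep -> dimension; c_psi: irrep -> free parameters per block (1, 2 or 4)."""
    n_params = sum(mult_out[p] * mult_in.get(p, 0) * c_psi[p] for p in mult_out)
    width = sum(mult_out[p] * dim_psi[p] for p in mult_out)
    return n_params, width

# Example: SO(2) features with frequencies 0, 1, 2, each with multiplicity 4.
dim_psi = {0: 1, 1: 2, 2: 2}      # trivial irrep is 1-dimensional, the rest 2-dimensional
c_psi   = {0: 1, 1: 2, 2: 2}      # real / complex type
mult    = {0: 4, 1: 4, 2: 4}
print(layer_stats(mult, mult, dim_psi, c_psi))   # (80, 20): 80 parameters, width 20
```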
4 PAC-Bayesian bound
We derive PAC-Bayes bounds for the equivariant networks $f_W$ presented in the previous section. Although the proof follows a similar strategy as in [NBS18], it differs in two important aspects. First, the weights of equivariant MLPs have a specific structure, which requires reworking the proof steps. Second, we carry out the analysis in the Fourier domain. In Supplementary 11, we also show that naively applying the norm-bounds of [NBS18] cannot explain the generalization behaviour of equivariant networks. A simple version of our result is given below.

Theorem 4.1 (Homogeneous Bounds for Equivariant Networks). For any equivariant network, with high probability we have:
$$L(f_W) \le \hat{L}_\gamma(f_W) + \tilde{O}\left( \sqrt{ \frac{\prod_l \|W_l\|_2^2}{\gamma^2} \left( \sum_{l=1}^{L} \sqrt{M(l,\eta)} \right)^{2} \sum_{l} \frac{\sum_{\psi,i,j} \|\hat{W}_l(\psi, i, j)\|_F^2 / \dim\psi}{\|W_l\|_2^2} } \right),$$
where $\eta \in (0,1)$ and
$$M(l, \eta) := \log\left( \frac{\sum_{l=1}^{L} \sum_{\psi} m_{l,\psi}}{1-\eta} \right) \max_{\psi}\left( 5\, m_{l-1,\psi}\, m_{l,\psi}\, c_\psi \right). \tag{3}$$
The complete version of the theorem is given in Theorem 9.1.
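For illustration only (a hypothetical helper, not the paper's code), the norm-dependent factor under the square root can be evaluated from the layer spectral norms and the Fourier-domain blocks; the $\sqrt{M(l,\eta)}$ factor is omitted here.

```python
import numpy as np

def capacity_term(spectral_norms, block_frob_sq, block_dims, gamma):
    """spectral_norms[l] = ||W_l||_2; block_frob_sq[l][k] and block_dims[l][k] are
    ||W_hat_l(psi, i, j)||_F^2 and dim(psi) for the k-th block of layer l."""
    prod = np.prod([s ** 2 for s in spectral_norms]) / gamma ** 2
    ratio = sum(
        sum(f / d for f, d in zip(block_frob_sq[l], block_dims[l])) / spectral_norms[l] ** 2
        for l in range(len(spectral_norms))
    )
    return prod * ratio

# Toy usage: two layers, each with two Fourier-domain blocks.
print(capacity_term([2.0, 1.5], [[1.0, 0.5], [0.8, 0.2]], [[1, 2], [1, 2]], gamma=1.0))
```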
We comment briefly on the proof steps. PAC-Bayesian generalization bounds start with defining a prior P and a posterior Q over the network parameters. The network is drawn randomly using Q, which is also used to evaluate the average loss and generalization error. The PAC-Bayesian bound provides a bound on the average generalization error of networks randomly drawn from Q (Section 4.1). The next step is to de-randomize the bound and to obtain the