In What Ways Are Deep Neural Networks Invariant
and How Should We Measure This?
Henry Kvinge1,2, Tegan H. Emerson1,3,4, Grayson Jorgenson1,
Scott Vasquez1, Timothy Doster1, Jesse D. Lew5
1Pacific Northwest National Laboratory
2Department of Mathematics, University of Washington
3Department of Mathematics, Colorado State University
4Department of Mathematical Sciences at the University of Texas, El Paso
5National Geospatial-Intelligence Agency
henry.kvinge@pnnl.gov
Abstract
It is often said that a deep learning model is “invariant” to some specific type of
transformation. However, what is meant by this statement strongly depends on
the context in which it is made. In this paper we explore the nature of invariance
and equivariance of deep learning models with the goal of better understanding
the ways in which they actually capture these concepts on a formal level. We
introduce a family of invariance and equivariance metrics that allows us to quantify
these properties in a way that disentangles them from other metrics such as loss or
accuracy. We use our metrics to better understand the two most popular methods
used to build invariance into networks: data augmentation and equivariant layers.
We draw a range of conclusions about invariance and equivariance in deep learning
models, ranging from whether initializing a model with pretrained weights has an
effect on a trained model’s invariance, to the extent to which invariance learned via
training can generalize to out-of-distribution data.
1 Introduction
The notions of invariance and equivariance have been guiding concepts across a diverse range of
scientific domains, from physics to psychology. In machine learning (ML) the concept of invariance
is frequently invoked to describe models whose output does not change when their input is transformed
in ways that are irrelevant to the task the model was designed for. For a dog or cat image classifier,
for example, it is desirable for the model to be reflection invariant so that the same prediction is made
whether or not the input image is reflected across its vertical axis. This type of invariance is useful in
this model because although an image generally changes when reflected, whether it contains a dog
or cat does not. Equivariance is used to describe models whose output changes in a manner that is
aligned with the way that input is transformed. For example, as an image is rotated, the position of
bounding boxes predicted by an object detector should also rotate (thus, the object detector should be
rotation equivariant).
While mathematicians have developed rigorous theory that can describe invariance and equivariance,
the extent to which this theory is actually used in ML varies dramatically across works. Within
computer vision, research on relatively simple image transformations (e.g., rotations or translations)
is often presented within a solid group-theoretic framework [6; 52; 9]. Other work dealing with more
complex types of transformations, such as changes in image background, has (by necessity) been more
informal [51]. Furthermore, while model invariance and equivariance are frequently a central
component in a broad range of ML research, limited effort has been put
into trying to measure them directly with the purpose of understanding the general invariance and
equivariance properties of deep learning models. Rather, most works measure them indirectly through
other ML metrics that align with the ultimate purpose of the model (e.g., loss or accuracy). While
such a strategy makes sense when optimizing model performance is the primary objective, we miss
an opportunity to better understand how and why deep neural networks work (or fail). The purpose
of this paper is to propose a group-theoretic family of metrics, associated to an arbitrary group
of symmetries G, which we call G-empirical equivariance deviation (G-EED), that quantify the
G-equivariance (and G-invariance) of a model.
Informally, the G-EED of a model measures the extent to which it fails to be G-equivariant on a
specific data distribution and with respect to a specific notion of distance in output space. This
aligns with needs in ML, where the user of a model may care only that their model is equivariant on
data that the model will actually encounter in practice, and not on any possible input. Further, since
invariance is a special case of equivariance, G-EED also measures the extent to which a model fails
to be invariant to the action of G. To show the breadth of the G-EED concept, we give a number of
different ways it can be applied to measure different aspects of equivariance in a model. For example,
we show how G-EED can be applied to a model's latent space representations to measure the extent
to which a model extracts G-invariant features.
Finally, we use G-EED to answer a range of questions about invariance and equivariance in neural
networks, with a focus on the two most popular ways of inducing invariance in these models: data
augmentation and equivariant architectures. Some of the conclusions we draw from our experiments
include the following. (1) Training with augmentation does not tend to induce invariance through
learned equivariant layers; rather, invariance arises through some other mechanism. (2) Invariance
learned through augmentation does generalize mildly to out-of-distribution data (e.g., common
image corruptions), but apparent invariance seen for far out-of-distribution data (e.g., images from a
completely different domain) could be the result of model insensitivity. (3) Invariance should not be
assumed to correlate with model performance: models with random weights can be more invariant
(in the usual mathematical sense) than models with learned or hard-coded invariance. (4) Models
initialized with pretrained weights tend to have different invariance or equivariance properties than
models initialized with random weights, though whether these models are judged to be more invariant
depends on the specific notion of distance one chooses to use. (5) Self-supervised models do not
seem to be more invariant than supervised models, except when they are trained with contrastive loss
and augmentations from the relevant symmetry group G.
In summary, the contributions of this paper include:
- The introduction of G-empirical equivariance deviation, the first family of metrics that can
  rigorously and directly measure a range of notions of invariance and equivariance in deep
  learning models.
- A demonstration of the flexibility of G-EED, showing that it can be applied easily to a range
  of different components of a deep learning model to measure different forms of invariance
  and equivariance.
- Use of G-EED to answer a range of questions, shedding light on the extent to which neural
  networks are or are not invariant and equivariant.
2 Related Work
The literature on invariance and equivariance in machine learning can roughly be partitioned into
two groups: those works that focus on how to build invariance and equivariance into a model or its
components [27; 9; 52; 45; 10; 46; 19; 12; 37; 41; 2] and those works that focus on the theory behind
invariance and equivariance [3; 6; 5; 34; 40].
Three common approaches are used to build invariance into deep learning models: data augmentation,
feature averaging, and equivariant architectures. In this paper we focus on the first and third of
these. Outside of a few ubiquitous layer types (such as standard translation equivariant convolutional
layers), data augmentation is by far the most commonly used method, being a standard component of
many training routines, particularly in computer vision. Since it has become a common procedure
when training deep learning models, data augmentation research has expanded in many directions
[12; 37; 41; 2].
The idea that G-invariance can be hardcoded into a model by combining multiple G-equivariant
layers with data reduction layers (e.g., pooling) has a long history in deep learning. The most famous
example of this idea is the conventional convolutional neural network (CNN) [32]. Since then a
multitude of other group equivariant layers have been designed, including two-dimensional rotation
equivariant layers [47; 50; 36; 38; 39], three-dimensional rotation equivariant layers [46; 11; 16],
layers equivariant to the Euclidean group and its subgroups [45] (which we test against in this paper),
and layers that are equivariant with respect to the symmetric group [33].
Our work is not the first to analyze various aspects of invariance in neural network models. Lyle
et al. [34] analyzed invariance with respect to the benefits and limitations of data augmentation and
feature averaging, presenting both theoretical and empirical arguments for using feature averaging
over data augmentation. More recently, Chen et al. [6] presented a useful group-theoretic framework
with which to understand data augmentation. Relevant to the present work, Chen et al. [5] introduced
a notion of approximate invariance. Unlike that work, however, which focused on theoretical results
related to data augmentation, this paper aims to introduce metrics that can be applied to modern deep
learning architectures and answers questions about invariance from an empirical perspective. There
are a number of existing works that have proposed metrics aimed at measuring the extent to which a
model is not equivariant (e.g., [8; 22; 18; 43; 49]). Our work differs from these in two ways: (1) we
build general metrics based on basic group theory that are designed to work across different groups
and datatypes, and (2) unlike other works that use their metric to evaluate the equivariance of a specific
model, we use our metrics to explore how models learn (or do not learn) to be equivariant generally.
Finally, a range of recent works have shown that even beyond the standard evaluation statistics (e.g.,
accuracy), invariance is an important concept to consider when studying deep learning models. For
example, Kaur et al. [25] showed that lack of invariance can be used to identify out-of-distribution
inputs. A further series of works investigated whether excessive invariance can reduce adversarial
robustness [23; 24; 40]. All of this work reinforces one of the primary messages of this paper, that it
is important to be able to measure invariance and equivariance directly in a model.
3 Quantifying Invariance and Equivariance
We begin this section by recalling the mathematical definitions of equivariance and invariance. We
present these definitions in terms of the mathematical concept of a group, which formally captures
the notion of symmetry [15].
Assume that G is a group. We say that G acts on sets X and Y if there are maps φ_X : G × X → X
and φ_Y : G × Y → Y that respect the composition operation of G. That is, for g₁, g₂ ∈ G and
x ∈ X,

$$\varphi_X(g_2, \varphi_X(g_1, x)) = \varphi_X(g_2 g_1, x),$$

with an analogous condition for φ_Y. Whenever the meaning is clear, we simplify notation by
writing φ_X(g, x) = gx (with an analogous convention for φ_Y). A map f : X → Y is said to be
G-equivariant if for all x ∈ X and g ∈ G,

$$f(gx) = g f(x). \tag{1}$$
In the case where the map φ_Y is trivial, so that gy = y for all g ∈ G and y ∈ Y, we say that f is
G-invariant. Thus, invariance is a special case of equivariance.
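For a finite group whose actions on X and Y are known, equation (1) can be checked numerically on sample inputs. The sketch below is our illustration (not from the paper): it takes G to be the group of cyclic horizontal translations of an image, under which a convolution with circular padding is exactly equivariant.

```python
# A minimal sketch (our illustration, not from the paper) that numerically
# checks equation (1) on a sample input. Here G is the group of cyclic
# horizontal translations of an 8x8 image; a convolution with circular
# padding commutes with cyclic shifts, so equality holds up to
# floating-point error.
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)
shift = 3  # the group element g: translate 3 pixels to the right (cyclically)

lhs = f(torch.roll(x, shifts=shift, dims=-1))  # f(gx)
rhs = torch.roll(f(x), shifts=shift, dims=-1)  # g f(x)
print(torch.allclose(lhs, rhs, atol=1e-6))     # True: equation (1) holds for this g
```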
Assume that f : X → Y is a neural network where X is the ambient space of input data and Y is
the target space. In many cases there is a natural way to factorize f into a composition f = f₂ ∘ f₁,
where f₁ : X → Z is known as the feature extractor, Z is the latent space of f, and f₂ : Z → Y
is the classifier. For example, if f is a ResNet50 CNN [20], then f₁ may consist of all residual
blocks while f₂ would consist of the final affine classification and softmax layers. We say that a
machine learning model f extracts G-equivariant features if f₁ is a G-equivariant function. This is an
especially meaningful distinction in the context of transfer learning, where invariance or equivariance
can be transferred to a new task via the invariance or equivariance of f₁. Note that the definition of
G-equivariant feature extraction requires a well-defined action of G on Z, which may not be obvious
in many cases. Because the trivial action is defined for any G and Z, we can always ask whether f
extracts G-invariant features.
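To make the factorization concrete, the sketch below (assuming torchvision's standard ResNet50 interface; the exact split point is our choice, not prescribed by the paper) separates a ResNet50 into a feature extractor f₁ and a classifier head f₂.

```python
# A sketch of the factorization f = f2 ∘ f1 for a ResNet50, assuming
# torchvision's standard interface. Here f1 is everything up to and including
# global average pooling, and f2 is the final affine classification layer;
# other split points are equally valid.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None).eval()  # random weights; pretrained also works
f1 = nn.Sequential(*list(model.children())[:-1], nn.Flatten())  # f1 : X -> Z
f2 = model.fc                                                   # f2 : Z -> Y

x = torch.randn(2, 3, 224, 224)
z = f1(x)                                      # latent features, shape (2, 2048)
y = f2(z)                                      # class logits, shape (2, 1000)
assert torch.allclose(y, model(x), atol=1e-5)  # f2(f1(x)) recovers f(x)
```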
The following proposition provides some insight into how the invariance (or lack of invariance) of
f₁ relates to the invariance (or lack of invariance) of f.
Proposition 3.1. Let f : X → Y be a function that decomposes into f = f₂ ∘ f₁, where f₁ : X → Z
and f₂ : Z → Y. Suppose that G acts on X, Z, and Y.

1. f can be G-invariant even if f₁ and f₂ are not.
2. If f₁ is G-invariant, then f is G-invariant.
A proof of Proposition 3.1 can be found in Section A.5. Note that Proposition 3.1.2 implies that if
earlier layers of a network achieve invariance with respect to some group, then this invariance will
persist into later layers. This may be seen as part of the justification of the common practice within
the equivariant architectures community of building invariance through successive combinations of
G-equivariant layers and pooling layers. Of course, the statement holds only when exact invariance is
achieved. We see below that this is not generally the case.
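A toy instance of Proposition 3.1.1 (our example, not from the paper): let G = {±1} act on ℝ by negation, and take f₁(x) = x + 1 and f₂(z) = (z − 1)². Neither map is G-invariant, yet f(x) = f₂(f₁(x)) = x² is. The snippet below checks this numerically.

```python
# A toy check of Proposition 3.1.1 (our example): G = {±1} acts on R by
# negation; f1 and f2 are not G-invariant, but their composition f(x) = x^2 is.
f1 = lambda x: x + 1.0
f2 = lambda z: (z - 1.0) ** 2
f = lambda x: f2(f1(x))

x = 3.0
print(f1(-x) == f1(x))  # False: f1 is not invariant
print(f2(-x) == f2(x))  # False: f2 is not invariant
print(f(-x) == f(x))    # True:  f is invariant
```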
3.1 Measuring equivariance
In this section we assume that both X and Y are vector spaces and the action of G on both X and Y
is linear. In all of our experiments we assume that the action of G on X and Y is known. In the case
that f is also a linear map, the equivariance of f can be checked directly by checking equivariance on
a basis for X. By extension, the equivariance of many common types of neural network layers can be
checked when these can be framed as linear maps (e.g., convolutional layers). However, there is no
systematic procedure for checking the equivariance of a nonlinear function f. If f is the composition
of a sequence of functions f₁, f₂, . . . , fₙ and we can check that each is equivariant, then we know
that f is equivariant, but we cannot prove that a function is not equivariant just by proving that each
layer is not equivariant (this follows from Proposition 3.1.1). We are thus motivated to introduce a
family of metrics that can be used to empirically measure the extent to which a function deviates
from being equivariant on a data distribution D on X.
Since we assume in this section that the action of G on Y is linear, we can define the kernel of this
action, ker(φ_Y), which is a subgroup of G. ker(φ_Y) consists of all those g ∈ G such that g acts as
the identity on Y. For x ∈ X, we define f̂(x) to be the expected value of f(gx) over ker(φ_Y),

$$\hat{f}(x) := \mathbb{E}_{g \in \ker(\varphi_Y)}\big[f(gx)\big] = \int_{g \in \ker(\varphi_Y)} f(gx)\, d\mu \tag{2}$$

where μ is the usual normalized Haar measure on the subgroup ker(φ_Y). Note that when f is
G-equivariant, then for each g ∈ ker(φ_Y), f(gx) = g f(x) = f(x), and hence f̂ = f.
Let (D, ν) be a probability distribution on X and let m : Y × Y → ℝ_{≥0} be a distance function on Y.
We can use m to measure the extent to which f deviates from being G-equivariant by computing

$$\int_{D} \int_{g \in G} m\big(f(gx),\, g\hat{f}(x)\big)\, d\mu(g)\, d\nu(x) \tag{3}$$

where this time μ is the normalized Haar measure on G. Note that the argument m(f(gx), g f̂(x))
measures the extent to which (1) fails to hold across distribution D, except that g f(x) is replaced by
g f̂(x). We use f̂(x) because it averages over all values in the orbit of x (under G) that should map to
f(x). If f were genuinely G-equivariant, all these values f(gx) would yield f(x). This choice is
supported by the fact that it naturally interpolates between the two extreme cases: G acts trivially on
Y (invariance) and G acts faithfully on Y. In the former case f̂(x) is the average value of f(gx) over
all of g ∈ G, and in the latter case f̂(x) = f(x).
The proposition below proves that when the action of G is faithful on X and Y, (3) being 0 is
equivalent to f satisfying (1) on a set of measure 1.
Proposition 3.2. Let f : X → Y be a continuous function, G a group that acts linearly and faithfully
on both X and Y, and m : Y × Y → ℝ_{≥0} a metric. Let D be a distribution on X. Then (3) is zero if
and only if f is G-equivariant almost surely, i.e., on a set of measure 1.
We provide a proof of Proposition 3.2 in Appendix A.6.
Figure 1: A diagram illustrating the different types of empirical equivariance deviation (EED) that
we investigate in this paper for the rotation action of the cyclic group C4 on MNIST images.
To approximate (3) for real models and data, where we always work with finite groups and finite
samples of D, we define the G-EED of f with respect to m to be

$$E(f, G) := \frac{1}{|D|\,|G|} \sum_{x \in D} \sum_{g \in G} m\big(f(gx),\, g\hat{f}(x)\big). \tag{4}$$

Note that since G is a finite group (and hence discrete), the Haar measure turns into the usual counting
measure.
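As a minimal sketch of how (4) might be computed in practice (our code; the function names and the choice of m are ours, not the paper's), consider the invariance case where G = C4 acts on input images by 90° rotations and trivially on outputs. Then ker(φ_Y) = G and f̂(x) is the average of f(gx) over the whole group.

```python
# A minimal sketch (our code; names and the choice of m are ours) of the
# G-EED in (4) for the invariance case: G = C4 acts on images by 90-degree
# rotations and trivially on outputs, so ker(φ_Y) = G and f̂(x) averages
# f(gx) over the whole group. m is squared Euclidean distance on logits.
import torch

def c4_orbit(x):
    """The four rotations of a batch of images x with shape (N, C, H, W)."""
    return [torch.rot90(x, k, dims=(-2, -1)) for k in range(4)]

def g_eed_invariance(f, data):
    total, count = 0.0, 0
    with torch.no_grad():
        for x in data:                            # batches sampled from D
            outs = [f(gx) for gx in c4_orbit(x)]  # f(gx) for each g in G
            f_hat = torch.stack(outs).mean(0)     # equation (2), counting measure
            for fg in outs:                       # inner sum of equation (4)
                total += ((fg - f_hat) ** 2).sum(-1).mean().item()
                count += 1
    return total / count

# Usage with a toy model; the value is 0 only if the model is exactly C4-invariant.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
data = [torch.randn(8, 3, 32, 32) for _ in range(4)]
print(g_eed_invariance(model, data))
```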
In the remainder of this section we describe some specific types of G-EED that may be relevant to
computer vision tasks. By convention, when using a distance function m for which larger values of
m(x₁, x₂) indicate that x₁ and x₂ are “closer” (the opposite of a proper metric), we attach a negative
sign to m. For example, rather than using cosine similarity, we use negative cosine similarity (e.g.,
(5)). This way, larger values of G-EED consistently indicate less invariance, regardless of the m used.
The channelwise G-equivariance of convolutional layer activations: Throughout most layers of a
CNN an individual image is represented as a 3-tensor. Let f_ℓ : X → ℝ^{C_ℓ × H_ℓ × W_ℓ} be the
composition of the first ℓ layers of a CNN, such that for input image x ∈ X, f_ℓ(x) is a C_ℓ × H_ℓ × W_ℓ
tensor where the first dimension corresponds to the channels of the representation and the second and
third dimensions correspond to the two spatial dimensions.

Suppose that G is a finite group that acts on images and other 2-tensors (e.g., rotations, translations,
and reflections). To understand the extent to which the first ℓ layers of the network are G-equivariant,
we can measure the G-EED of f_ℓ. Although there are numerous choices of m for (3) that could be
used to measure the channelwise difference between f_ℓ(gx) and g f_ℓ(x), we choose the following: let
S : ℝ^{H_ℓ × W_ℓ} × ℝ^{H_ℓ × W_ℓ} → [0, 1] be the cosine similarity on individual channels treated
as vectors in ℝ^{H_ℓ W_ℓ}. Write [f_ℓ(x)]_i for the ith channel of f_ℓ(x). Then we set

$$m\big(f_\ell(gx),\, g\hat{f}_\ell(x)\big) = -\frac{1}{C_\ell} \sum_{i=1}^{C_\ell} S\big([f_\ell(gx)]_i,\, [g\hat{f}_\ell(x)]_i\big).$$

That is, let m be the negative of the average cosine similarity between individual channels of f_ℓ(gx)
and g f̂_ℓ(x). This gives:

$$E_{\mathrm{channel}}(f, G, \ell) := \frac{1}{|D|\,|G|} \sum_{x \in D} \sum_{g \in G} m\big(f_\ell(gx),\, g\hat{f}_\ell(x)\big) \tag{5}$$
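A sketch of how (5) might be computed for G = C4 (our code; the names are ours). Since C4 acts faithfully on each channel, f̂_ℓ = f_ℓ here (see the note that follows), so we compare f_ℓ(gx) directly against g f_ℓ(x).

```python
# A sketch (our code; names are ours) of the channelwise G-EED in (5) for
# G = C4 acting by 90-degree rotation. C4 acts faithfully on each channel,
# so f̂_ℓ = f_ℓ and we compare f_ℓ(gx) with g f_ℓ(x) channel by channel
# using negative cosine similarity.
import torch
import torch.nn.functional as F

def channelwise_c4_eed(f_layer, data):
    total, count = 0.0, 0
    with torch.no_grad():
        for x in data:                                           # x: (N, C_in, H, W)
            fx = f_layer(x)                                      # f_l(x): (N, C_l, H_l, W_l)
            for k in range(4):                                   # g ranges over C4
                lhs = f_layer(torch.rot90(x, k, dims=(-2, -1)))  # f_l(gx)
                rhs = torch.rot90(fx, k, dims=(-2, -1))          # g f_l(x)
                # negative mean cosine similarity over matching channels
                sim = F.cosine_similarity(lhs.flatten(2), rhs.flatten(2), dim=-1)
                total += -sim.mean().item()
                count += 1
    return total / count

# Usage: f_l is the first layer of a small CNN; a value of -1.0 would indicate
# perfect C4-equivariance of these layers on this data.
f_l = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU())
data = [torch.randn(8, 3, 32, 32) for _ in range(2)]
print(channelwise_c4_eed(f_l, data))
```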
Note that since rotation, translation, and reflection groups all act faithfully on ℝ^{H_ℓ × W_ℓ}, for these
specific G, f̂ = f. We call this version of G-EED channelwise G-EED. This metric assumes that G
acts on each channel in a 3-tensor independently. It does not account for the more complicated setting
where the action of G either permutes or mixes channels of ℝ^{C_ℓ × H_ℓ × W_ℓ} in a non-trivial way.
In this work we always assume that the action of G on channels of ℝ^{C_ℓ × H_ℓ × W_ℓ} is identical
to the action of G on X (up to differences in spatial scale and number of channels). We consider the
case where