In What Ways Are Deep Neural Networks Invariant
and How Should We Measure This?
Henry Kvinge1,2, Tegan H. Emerson1,3,4, Grayson Jorgenson1,
Scott Vasquez1, Timothy Doster1, Jesse D. Lew5
1Pacific Northwest National Laboratory
2Department of Mathematics, University of Washington
3Department of Mathematics, Colorado State University
4Department of Mathematical Sciences at the University of Texas, El Paso
5National Geospatial-Intelligence Agency
henry.kvinge@pnnl.gov
Abstract
It is often said that a deep learning model is “invariant” to some specific type of
transformation. However, what is meant by this statement strongly depends on
the context in which it is made. In this paper we explore the nature of invariance
and equivariance of deep learning models with the goal of better understanding
the ways in which they actually capture these concepts on a formal level. We
introduce a family of invariance and equivariance metrics that allows us to quantify
these properties in a way that disentangles them from other metrics such as loss or
accuracy. We use our metrics to better understand the two most popular methods
used to build invariance into networks: data augmentation and equivariant layers.
We draw a range of conclusions about invariance and equivariance in deep learning
models, ranging from whether initializing a model with pretrained weights has an
effect on a trained model’s invariance, to the extent to which invariance learned via
training can generalize to out-of-distribution data.
1 Introduction
The notions of invariance and equivariance have been guiding concepts across a diverse range of
scientific domains, from physics to psychology. In machine learning (ML) the concept of invariance
is frequently invoked to describe models whose output does not change when their input is transformed
in ways that are irrelevant to the task the model was designed for. For a dog or cat image classifier,
for example, it is desirable for the model to be reflection invariant so that the same prediction is made
whether or not the input image is reflected across its vertical axis. This type of invariance is useful in
this model because although an image generally changes when reflected, whether it contains a dog
or cat does not. Equivariance is used to describe models whose output changes in a manner that is
aligned with the way that input is transformed. For example, as an image is rotated, the position of
bounding boxes predicted by an object detector should also rotate (thus, the object detector should be
rotation equivariant).
While mathematicians have developed rigorous theory that can describe invariance and equivariance,
the extent to which this theory is actually used in ML varies dramatically across works. Within
computer vision, research on relatively simple image transformations (e.g., rotations or translations)
is often presented within a solid group-theoretic framework [6; 52; 9]. Other work dealing with more
complex types of transformations, such as changes in image background, has (by necessity) been more
informal [51]. Furthermore, while model invariance and equivariance are frequently a central
component in a broad range of ML research, limited effort has been put
into trying to measure them directly with the purpose of understanding the general invariance and
equivariance properties of deep learning models. Rather, most works measure them indirectly through
other ML metrics that align with the ultimate purpose of the model (e.g., loss or accuracy). While
such a strategy makes sense when optimizing model performance is the primary objective, we miss
an opportunity to better understand how and why deep neural networks work (or fail). The purpose
of this paper is to propose a group-theoretic family of metrics, associated to an arbitrary group
of symmetries G, which we call G-empirical equivariance deviation (G-EED), that quantify the
G-equivariance (and G-invariance) of a model.
Informally, the G-EED of a model measures the extent to which it fails to be G-equivariant on a
specific data distribution and with respect to a specific notion of distance in output space. This
aligns with needs in ML, where the user of a model may care only that their model is equivariant on
data that the model will actually encounter in practice, and not on any possible input. Further, since
invariance is a special case of equivariance, G-EED also measures the extent to which a model fails
to be invariant to the action of G. To show the breadth of the G-EED concept, we give a number of
different ways it can be applied to measure different aspects of equivariance in a model. For example,
we show how G-EED can be applied to a model's latent space representations to measure the extent
to which a model extracts G-invariant features.
Finally, we use G-EED to answer a range of questions about invariance and equivariance in neural
networks, with a focus on the two most popular ways of inducing invariance in these models: data
augmentation and equivariant architectures. Some of the conclusions we draw from our experiments
include the following. (1) Training with augmentation does not tend to induce invariance through
learned equivariant layers; rather, invariance arises through some other mechanism. (2) Invariance
learned through augmentation does generalize mildly to out-of-distribution data (e.g., common
image corruptions), but apparent invariance seen for far out-of-distribution data (e.g., images from a
completely different domain) could be the result of model insensitivity. (3) Invariance should not be
assumed to correlate with model performance: models with random weights can be more invariant
(in the usual mathematical sense) than models with learned or hard-coded invariance. (4) Models
initialized with pretrained weights tend to have different invariance or equivariance properties than
models initialized with random weights, though whether these models are judged to be more invariant
depends on the specific notion of distance one chooses to use. (5) Self-supervised models do not
seem to be more invariant than supervised models, except when they are trained with contrastive loss
and augmentations from the relevant symmetry group G.
In summary, the contributions of this paper include:
- The introduction of G-empirical equivariance deviation, the first family of metrics that can
  rigorously and directly measure a range of notions of invariance and equivariance in deep
  learning models.
- A demonstration of the flexibility of G-EED, showing that it can be applied easily to a range
  of different components of a deep learning model to measure different forms of invariance
  and equivariance.
- Use of G-EED to answer a range of questions, shedding light on the extent to which neural
  networks are or are not invariant and equivariant.
2 Related Work
The literature on invariance and equivariance in machine learning can roughly be partitioned into
two groups: those works that focus on how to build invariance and equivariance into a model or its
components [27; 9; 52; 45; 10; 46; 19; 12; 37; 41; 2] and those works that focus on the theory behind
invariance and equivariance [3; 6; 5; 34; 40].
Three common approaches are used to build invariance into deep learning models: data augmentation,
feature averaging, and equivariant architectures. In this paper we focus on the first and third of
these. Outside of a few ubiquitous layer types (such as standard translation equivariant convolutional
layers), data augmentation is by far the most commonly used method, being a standard component of
many training routines, particularly in computer vision. Since it has become a common procedure
when training deep learning models, data augmentation research has expanded in many directions
[12; 37; 41; 2].
The idea that G-invariance can be hardcoded into a model by combining multiple G-equivariant
layers with data reduction layers (e.g., pooling) has a long history in deep learning. The most famous
example of this idea is the conventional convolutional neural network (CNN) [32]. Since then a
multitude of other group equivariant layers have been designed, including two-dimensional rotation
equivariant layers [47; 50; 36; 38; 39], three-dimensional rotation equivariant layers [46; 11; 16],
layers equivariant to the Euclidean group and its subgroups [45] (which we test against in this paper),
and layers that are equivariant with respect to the symmetric group [33].
Our work is not the first to analyze various aspects of invariance in neural network models. Lyle
et al. [34] analyzed invariance with respect to the benefits and limitations of data augmentation and
feature averaging, presenting both theoretical and empirical arguments for using feature averaging
over data augmentation. More recently, Chen et al. [6] presented a useful group-theoretic framework
with which to understand data augmentation. Relevant to the present work, Chen et al. [5] introduced
a notion of approximate invariance. Unlike that work, however, which focused on theoretical results
related to data augmentation, this paper aims to introduce metrics that can be applied to modern deep
learning architectures and answers questions about invariance from an empirical perspective. There
are a number of existing works that have proposed metrics aimed at measuring the extent to which a
model is not equivariant (e.g., [8; 22; 18; 43; 49]). Our work differs from these in two ways: (1) we
build general metrics based on basic group theory that are designed to work across different groups
and datatypes, and (2) unlike other works that use their metric to evaluate the equivariance of a specific
model, we use our metrics to explore how models learn (or do not learn) to be equivariant generally.
Finally, a range of recent works have shown that even beyond the standard evaluation statistics (e.g.,
accuracy), invariance is an important concept to consider when studying deep learning models. For
example, Kaur et al. [25] showed that lack of invariance can be used to identify out-of-distribution
inputs. A further series of works investigated whether excessive invariance can reduce adversarial
robustness [23; 24; 40]. All of this work reinforces one of the primary messages of this paper, that it
is important to be able to measure invariance and equivariance directly in a model.
3 Quantifying Invariance and Equivariance
We begin this section by recalling the mathematical definitions of equivariance and invariance. We
present these definitions in terms of the mathematical concept of a group, which formally captures
the notion of symmetry [15].
Assume that G is a group. We say that G acts on sets X and Y if there are maps φ_X : G × X → X
and φ_Y : G × Y → Y that respect the composition operation of G. That is, for g₁, g₂ ∈ G and
x ∈ X,

$$\varphi_X(g_2, \varphi_X(g_1, x)) = \varphi_X(g_2 g_1, x),$$

with an analogous condition for φ_Y. Whenever the meaning is clear, we simplify notation by
writing φ_X(g, x) = gx (with an analogous convention for φ_Y). A map f : X → Y is said to be
G-equivariant if for all x ∈ X and g ∈ G,

$$f(gx) = g f(x). \tag{1}$$
In the case where the map φ_Y is trivial, so that gy = y for all g ∈ G and y ∈ Y, we say that f is
G-invariant. Thus, invariance is a special case of equivariance.
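For a finite group whose actions on X and Y are known, equation (1) can be checked numerically on sample inputs. The sketch below is our illustration (not from the paper): it takes G to be the group of cyclic horizontal translations of an image, under which a convolution with circular padding is exactly equivariant.

```python
# A minimal sketch (our illustration, not from the paper) that numerically
# checks equation (1) on a sample input. Here G is the group of cyclic
# horizontal translations of an 8x8 image; a convolution with circular
# padding commutes with cyclic shifts, so equality holds up to
# floating-point error.
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 8, 8)
shift = 3  # the group element g: translate 3 pixels to the right (cyclically)

lhs = f(torch.roll(x, shifts=shift, dims=-1))  # f(gx)
rhs = torch.roll(f(x), shifts=shift, dims=-1)  # g f(x)
print(torch.allclose(lhs, rhs, atol=1e-6))     # True: equation (1) holds for this g
```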
Assume that f : X → Y is a neural network where X is the ambient space of input data and Y is
the target space. In many cases there is a natural way to factorize f into a composition f = f₂ ∘ f₁,
where f₁ : X → Z is known as the feature extractor, Z is the latent space of f, and f₂ : Z → Y
is the classifier. For example, if f is a ResNet50 CNN [20], then f₁ may consist of all residual
blocks while f₂ would consist of the final affine classification and softmax layers. We say that a
machine learning model f extracts G-equivariant features if f₁ is a G-equivariant function. This is an
especially meaningful distinction in the context of transfer learning, where invariance or equivariance
can be transferred to a new task via the invariance or equivariance of f₁. Note that the definition of
G-equivariant feature extraction requires a well-defined action of G on Z, which may not be obvious
in many cases. Because the trivial action is defined for any G and Z, we can always ask whether f
extracts G-invariant features.
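To make the factorization concrete, the sketch below (assuming torchvision's standard ResNet50 interface; the exact split point is our choice, not prescribed by the paper) separates a ResNet50 into a feature extractor f₁ and a classifier head f₂.

```python
# A sketch of the factorization f = f2 ∘ f1 for a ResNet50, assuming
# torchvision's standard interface. Here f1 is everything up to and including
# global average pooling, and f2 is the final affine classification layer;
# other split points are equally valid.
import torch
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights=None).eval()  # random weights; pretrained also works
f1 = nn.Sequential(*list(model.children())[:-1], nn.Flatten())  # f1 : X -> Z
f2 = model.fc                                                   # f2 : Z -> Y

x = torch.randn(2, 3, 224, 224)
z = f1(x)                                      # latent features, shape (2, 2048)
y = f2(z)                                      # class logits, shape (2, 1000)
assert torch.allclose(y, model(x), atol=1e-5)  # f2(f1(x)) recovers f(x)
```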
The following proposition provides some insight into how the invariance (or lack of invariance) of
f₁ relates to the invariance (or lack of invariance) of f.
Proposition 3.1. Let f : X → Y be a function that decomposes into f = f₂ ∘ f₁, where f₁ : X → Z
and f₂ : Z → Y. Suppose that G acts on X, Z, and Y.

1. f can be G-invariant even if f₁ and f₂ are not.
2. If f₁ is G-invariant, then f is G-invariant.
A proof of Proposition 3.1 can be found in Section A.5. Note that Proposition 3.1.2 implies that if
earlier layers of a network achieve invariance with respect to some group, then this invariance will
persist into later layers. This may be seen as part of the justification of the common practice within
the equivariant architectures community of building invariance through successive combinations of
G-equivariant layers and pooling layers. Of course, the statement holds only when exact invariance is
achieved. We see below that this is not generally the case.
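A toy instance of Proposition 3.1.1 (our example, not from the paper): let G = {±1} act on ℝ by negation, and take f₁(x) = x + 1 and f₂(z) = (z − 1)². Neither map is G-invariant, yet f(x) = f₂(f₁(x)) = x² is. The snippet below checks this numerically.

```python
# A toy check of Proposition 3.1.1 (our example): G = {±1} acts on R by
# negation; f1 and f2 are not G-invariant, but their composition f(x) = x^2 is.
f1 = lambda x: x + 1.0
f2 = lambda z: (z - 1.0) ** 2
f = lambda x: f2(f1(x))

x = 3.0
print(f1(-x) == f1(x))  # False: f1 is not invariant
print(f2(-x) == f2(x))  # False: f2 is not invariant
print(f(-x) == f(x))    # True:  f is invariant
```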
3.1 Measuring equivariance
In this section we assume that both X and Y are vector spaces and the action of G on both X and Y
is linear. In all of our experiments we assume that the action of G on X and Y is known. In the case
that f is also a linear map, the equivariance of f can be checked directly by checking equivariance on
a basis for X. By extension, the equivariance of many common types of neural network layers can be
checked when these can be framed as linear maps (e.g., convolutional layers). However, there is no
systematic procedure for checking the equivariance of a nonlinear function f. If f is the composition
of a sequence of functions f₁, f₂, . . . , fₙ and we can check that each is equivariant, then we know
that f is equivariant, but we cannot prove that a function is not equivariant just by proving that each
layer is not equivariant (this follows from Proposition 3.1.1). We are thus motivated to introduce a
family of metrics that can be used to empirically measure the extent to which a function deviates
from being equivariant on a data distribution D on X.
Since we assume in this section that the action of G on Y is linear, we can define the kernel of this
action, ker(φ_Y), which is a subgroup of G. ker(φ_Y) consists of all those g ∈ G such that g acts as
the identity on Y. For x ∈ X, we define f̂(x) to be the expected value of f(gx) over ker(φ_Y),

$$\hat{f}(x) := \mathbb{E}_{g \in \ker(\varphi_Y)}\big[f(gx)\big] = \int_{g \in \ker(\varphi_Y)} f(gx)\, d\mu \tag{2}$$

where μ is the usual normalized Haar measure on the subgroup ker(φ_Y). Note that when f is
G-equivariant, then for each g ∈ ker(φ_Y), f(gx) = g f(x) = f(x), and hence f̂ = f.
Let (D, ν) be a probability distribution on X and let m : Y × Y → ℝ_{≥0} be a distance function on Y.
We can use m to measure the extent to which f deviates from being G-equivariant by computing

$$\int_{D} \int_{g \in G} m\big(f(gx),\, g\hat{f}(x)\big)\, d\mu(g)\, d\nu(x) \tag{3}$$

where this time μ is the normalized Haar measure on G. Note that the argument m(f(gx), g f̂(x))
measures the extent to which (1) fails to hold across distribution D, except that g f(x) is replaced by
g f̂(x). We use f̂(x) because it averages over all values in the orbit of x (under G) that should map to
f(x). If f were genuinely G-equivariant, all these values f(gx) would yield f(x). This choice is
supported by the fact that it naturally interpolates between the two extreme cases: G acts trivially on
Y (invariance) and G acts faithfully on Y. In the former case f̂(x) is the average value of f(gx) over
all of g ∈ G, and in the latter case f̂(x) = f(x).
The proposition below proves that when the action of G is faithful on X and Y, (3) being 0 is
equivalent to f satisfying (1) on a set of measure 1.
Proposition 3.2. Let f : X → Y be a continuous function, G a group that acts linearly and faithfully
on both X and Y, and m : Y × Y → ℝ_{≥0} a metric. Let D be a distribution on X. Then (3) is zero if
and only if f is G-equivariant almost surely, i.e., on a set of measure 1.
We provide a proof of Proposition 3.2 in Appendix A.6.
Figure 1: A diagram illustrating the different types of empirical equivariance deviation (EED) that
we investigate in this paper for the rotation action of the cyclic group C4 on MNIST images.
To approximate (3) for real models and data, where we always work with finite groups and finite
samples of D, we define the G-EED of f with respect to m to be

$$E(f, G) := \frac{1}{|D|\,|G|} \sum_{x \in D} \sum_{g \in G} m\big(f(gx),\, g\hat{f}(x)\big). \tag{4}$$

Note that since G is a finite group (and hence discrete), the Haar measure turns into the usual counting
measure.
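As a minimal sketch of how (4) might be computed in practice (our code; the function names and the choice of m are ours, not the paper's), consider the invariance case where G = C4 acts on input images by 90° rotations and trivially on outputs. Then ker(φ_Y) = G and f̂(x) is the average of f(gx) over the whole group.

```python
# A minimal sketch (our code; names and the choice of m are ours) of the
# G-EED in (4) for the invariance case: G = C4 acts on images by 90-degree
# rotations and trivially on outputs, so ker(φ_Y) = G and f̂(x) averages
# f(gx) over the whole group. m is squared Euclidean distance on logits.
import torch

def c4_orbit(x):
    """The four rotations of a batch of images x with shape (N, C, H, W)."""
    return [torch.rot90(x, k, dims=(-2, -1)) for k in range(4)]

def g_eed_invariance(f, data):
    total, count = 0.0, 0
    with torch.no_grad():
        for x in data:                            # batches sampled from D
            outs = [f(gx) for gx in c4_orbit(x)]  # f(gx) for each g in G
            f_hat = torch.stack(outs).mean(0)     # equation (2), counting measure
            for fg in outs:                       # inner sum of equation (4)
                total += ((fg - f_hat) ** 2).sum(-1).mean().item()
                count += 1
    return total / count

# Usage with a toy model; the value is 0 only if the model is exactly C4-invariant.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
data = [torch.randn(8, 3, 32, 32) for _ in range(4)]
print(g_eed_invariance(model, data))
```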
In the remainder of this section we describe some specific types of G-EED that may be relevant to
computer vision tasks. By convention, when using a distance function m for which larger values of
m(x₁, x₂) indicate that x₁ and x₂ are “closer” (the opposite of a proper metric), we attach a negative
sign to m. For example, rather than using cosine similarity, we use negative cosine similarity (e.g.,
(5)). This way, larger values of G-EED consistently indicate less invariance, regardless of the m used.
The channelwise G-equivariance of convolutional layer activations: Throughout most layers of a
CNN an individual image is represented as a 3-tensor. Let f_ℓ : X → ℝ^{C_ℓ × H_ℓ × W_ℓ} be the
composition of the first ℓ layers of a CNN, such that for input image x ∈ X, f_ℓ(x) is a C_ℓ × H_ℓ × W_ℓ
tensor where the first dimension corresponds to the channels of the representation and the second and
third dimensions correspond to the two spatial dimensions.

Suppose that G is a finite group that acts on images and other 2-tensors (e.g., rotations, translations,
and reflections). To understand the extent to which the first ℓ layers of the network are G-equivariant,
we can measure the G-EED of f_ℓ. Although there are numerous choices of m for (3) that could be
used to measure the channelwise difference between f_ℓ(gx) and g f_ℓ(x), we choose the following: let
S : ℝ^{H_ℓ × W_ℓ} × ℝ^{H_ℓ × W_ℓ} → [0, 1] be the cosine similarity on individual channels treated
as vectors in ℝ^{H_ℓ W_ℓ}. Write [f_ℓ(x)]_i for the ith channel of f_ℓ(x). Then we set

$$m\big(f_\ell(gx),\, g\hat{f}_\ell(x)\big) = -\frac{1}{C_\ell} \sum_{i=1}^{C_\ell} S\big([f_\ell(gx)]_i,\, [g\hat{f}_\ell(x)]_i\big).$$

That is, let m be the negative of the average cosine similarity between individual channels of f_ℓ(gx)
and g f̂_ℓ(x). This gives:

$$E_{\mathrm{channel}}(f, G, \ell) := \frac{1}{|D|\,|G|} \sum_{x \in D} \sum_{g \in G} m\big(f_\ell(gx),\, g\hat{f}_\ell(x)\big) \tag{5}$$
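A sketch of how (5) might be computed for G = C4 (our code; the names are ours). Since C4 acts faithfully on each channel, f̂_ℓ = f_ℓ here (see the note that follows), so we compare f_ℓ(gx) directly against g f_ℓ(x).

```python
# A sketch (our code; names are ours) of the channelwise G-EED in (5) for
# G = C4 acting by 90-degree rotation. C4 acts faithfully on each channel,
# so f̂_ℓ = f_ℓ and we compare f_ℓ(gx) with g f_ℓ(x) channel by channel
# using negative cosine similarity.
import torch
import torch.nn.functional as F

def channelwise_c4_eed(f_layer, data):
    total, count = 0.0, 0
    with torch.no_grad():
        for x in data:                                           # x: (N, C_in, H, W)
            fx = f_layer(x)                                      # f_l(x): (N, C_l, H_l, W_l)
            for k in range(4):                                   # g ranges over C4
                lhs = f_layer(torch.rot90(x, k, dims=(-2, -1)))  # f_l(gx)
                rhs = torch.rot90(fx, k, dims=(-2, -1))          # g f_l(x)
                # negative mean cosine similarity over matching channels
                sim = F.cosine_similarity(lhs.flatten(2), rhs.flatten(2), dim=-1)
                total += -sim.mean().item()
                count += 1
    return total / count

# Usage: f_l is the first layer of a small CNN; a value of -1.0 would indicate
# perfect C4-equivariance of these layers on this data.
f_l = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU())
data = [torch.randn(8, 3, 32, 32) for _ in range(2)]
print(channelwise_c4_eed(f_l, data))
```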
Note that since rotation, translation, and reflection groups all act faithfully on ℝ^{H_ℓ × W_ℓ}, for these
specific G, f̂ = f. We call this version of G-EED channelwise G-EED. This metric assumes that G
acts on each channel in a 3-tensor independently. It does not account for the more complicated setting
where the action of G either permutes or mixes channels of ℝ^{C_ℓ × H_ℓ × W_ℓ} in a non-trivial way.
In this work we always assume that the action of G on channels of ℝ^{C_ℓ × H_ℓ × W_ℓ} is identical
to the action of G on X (up to differences in spatial scale and number of channels). We consider the
case where