
relation between $f_k$ and $f_{k-1}$ is the following,
$$[f_k(x)]_{c;i,j} = \phi\left( b^k_c + \sum_{c'=1}^{H_{k-1}} w^k_{c,c'} \cdot p_{i,j}\!\left([f_{k-1}(x)]_{c'}\right) \right) \quad \forall\, c = 1, \dots, H_k, \qquad (2)$$
where: $H_k$ denotes the number of channels at the $k$-th layer; $b^k_c$ and $w^k_{c,c'}$ the biases and filters of the $k$-th layer; each filter $w^k_{c,c'}$ is an $F \times F$ matrix, with $F$ the filter size; $p_{i,j}([f_{k-1}(x)]_{c'})$ denotes an $F \times F$-dimensional patch of $[f_{k-1}(x)]_{c'}$ centered at $(i, j)$; $\phi$ the activation function.
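For concreteness, here is a minimal NumPy sketch of Eq. (2), assuming unit stride, no padding, and top-left patch indexing; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def conv_layer(f_prev, w, b, phi=lambda z: np.maximum(z, 0.0)):
    """Eq. (2): f_prev has shape (H_{k-1}, S, S), w has shape
    (H_k, H_{k-1}, F, F), b has shape (H_k,)."""
    H_k, _, F, _ = w.shape
    S_out = f_prev.shape[1] - F + 1
    f_k = np.empty((H_k, S_out, S_out))
    for c in range(H_k):
        for i in range(S_out):
            for j in range(S_out):
                patch = f_prev[:, i:i + F, j:j + F]   # p_{i,j} for all c'
                # sum over c' and over the F x F filter entries, as in Eq. (2)
                f_k[c, i, j] = phi(b[c] + np.sum(w[c] * patch))
    return f_k
```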
The second premise is that a general diffeomorphism can be represented as a displacement field over the image, which indicates how each pixel moves under the transformation. Locally, this displacement field can be decomposed into a constant term and a linear part: the former corresponds to local translations, the latter to stretchings, rotations and shears.¹
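The linear part of this decomposition is made explicit in footnote 1; the short sketch below (names illustrative) splits a 2×2 Jacobian of the displacement field into its trace, antisymmetric and symmetric traceless components:

```python
import numpy as np

def decompose_jacobian(J):
    """Split a 2x2 Jacobian J of the displacement field (footnote 1) into
    stretching (trace), rotation (antisymmetric) and shear (symmetric
    traceless) parts, which sum back to J."""
    stretch = 0.5 * np.trace(J) * np.eye(2)
    rotation = 0.5 * (J - J.T)
    shear = 0.5 * (J + J.T) - stretch
    assert np.allclose(stretch + rotation + shear, J)
    return stretch, rotation, shear
```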
Invariance to translations via spatial pooling.
Due to weight sharing, i.e. the fact that the same filter $w^k_{c,c'}$ is applied to all the local patches $(i, j)$ of the representation, the output of a convolutional layer is equivariant to translations by construction: a shift of the input is equivalent to a shift of the output. To achieve an invariant representation it suffices to sum up the spatial entries of $f_k$, an operation called pooling in CNNs; we refer to it as spatial pooling to stress that the sum runs over the spatial indices of the representation. Even if there are no pooling layers at initialization, they can be realized by homogeneous filters, i.e. filters $w^{k+1}_{c,c'}$ whose $F \times F$ entries are all the same. Therefore, the closer the filters are to the homogeneous filter, the more they decrease the sensitivity of the representation to local translations.
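As an illustration of both properties, the following sketch checks equivariance of the convolution output and, after spatial pooling, invariance. Periodic boundary conditions are assumed here so that the identities hold exactly; this is an assumption made for the sake of the example:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
w = rng.standard_normal((3, 3))

# Equivariance: shifting the input shifts the output
# (exact here thanks to the periodic boundaries).
y = correlate2d(x, w, mode="same", boundary="wrap")
y_shifted = correlate2d(np.roll(x, (2, 3), axis=(0, 1)), w,
                        mode="same", boundary="wrap")
assert np.allclose(np.roll(y, (2, 3), axis=(0, 1)), y_shifted)

# Invariance: summing the spatial entries (spatial pooling) removes the shift.
assert np.isclose(y.sum(), y_shifted.sum())
```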
Invariance to other transformations via channel pooling.
The example of translations shows that invariance can be built by constructing an equivariant representation and then pooling it. Invariance can also be built by pooling across channels. A two-channel example is shown in Fig. 2, panel (b), where the filter of the second channel is built so as to produce the same output as the first channel when applied to a rotated input. The same idea applies more generally, e.g. to the other components of diffeomorphisms, such as local stretchings and shears. Below, we refer generically to any operation that builds invariance to diffeomorphisms by assembling distinct channels as channel pooling.
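The construction in Fig. 2(b) pairs two channels; a self-contained variant of the same idea, sketched below, uses all four 90° rotations of one filter as channels, so that summing over channels (channel pooling) followed by spatial pooling is exactly blind to a global 90° rotation of the input. This is an illustrative construction, not the paper's own example:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
w = rng.standard_normal((3, 3))

def pooled_feature(img):
    # One channel per 90-degree rotation of the same filter; summing the
    # channels' spatially-pooled outputs implements channel pooling over
    # the rotation orbit.
    return sum(correlate2d(img, np.rot90(w, r), mode="valid").sum()
               for r in range(4))

# The pooled feature is invariant to rotating the input by 90 degrees.
assert np.isclose(pooled_feature(x), pooled_feature(np.rot90(x)))
```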
Disentangling spatial and channel pooling.
The relative sensitivity to diffeomorphisms $R_k$ of the $k$-th layer representation $f_k$ decreases after each layer, as shown in Fig. 3. This implies that spatial or channel pooling is carried out along the whole network. To disentangle their contributions we perform the following experiment: shuffle at random the connections between channels of successive convolutional layers, while keeping the weights unaltered. Channel shuffling amounts to randomly permuting the values of $c, c'$ in Eq. (2); therefore it breaks any channel pooling while not affecting single filters. The values of $R_k$ for deep networks after channel shuffling are reported in Fig. 3 as dashed lines and compared with the original values of $R_k$ in full lines. If only spatial pooling were present in the network, the two curves would overlap. Conversely, if the decrease in $R_k$ were due entirely to interactions between channels, the shuffled curves would be constant. Since neither scenario arises, we conclude that both kinds of pooling are being performed.
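A minimal PyTorch sketch of this shuffling experiment, under the assumption of plain (ungrouped) convolutions: each layer's input-channel index $c'$ is permuted independently, which scrambles the wiring between successive layers while leaving every single $F \times F$ filter intact. Names are illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def shuffle_channels(model: nn.Module) -> None:
    """Independently permute the input-channel index c' of every conv
    layer after the first (cf. Eq. 2): the wiring between successive
    layers is scrambled, but no single F x F filter is modified."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    for conv in convs[1:]:  # the first layer reads raw image channels
        perm = torch.randperm(conv.in_channels)
        conv.weight.copy_(conv.weight[:, perm])
```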
Emergence of spatial pooling after training.
To bolster the evidence for the presence of spatial pooling, we analyze the filters of trained networks. Since spatial pooling can be built with homogeneous filters, we test for its presence by looking at the frequency content of the learned filters $w^k_{c,c'}$. In particular, we consider the average squared projection of the filters onto "Fourier modes" $\{\Psi_l\}_{l=1,\dots,F^2}$, taken as the eigenvectors of the discrete Laplace operator on the $F \times F$ filter grid. The squared projections averaged over channels read
$$\gamma_{k,l} = \frac{1}{H_{k-1} H_k} \sum_{c=1}^{H_k} \sum_{c'=1}^{H_{k-1}} \left( \Psi_l \cdot w^k_{c,c'} \right)^2, \qquad (3)$$
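A sketch of how Eq. (3) can be evaluated, assuming the 5-point-stencil Laplacian with free boundaries on the $F \times F$ grid (the precise stencil is an assumption here); the lowest mode is then the constant vector, i.e. the homogeneous filter discussed above:

```python
import numpy as np

def fourier_modes(F):
    """Eigenvectors of the discrete Laplace operator on the F x F grid
    (5-point stencil, free boundaries), built as a Kronecker sum of 1D
    path-graph Laplacians."""
    L1 = 2 * np.eye(F) - np.eye(F, k=1) - np.eye(F, k=-1)
    L1[0, 0] = L1[-1, -1] = 1          # free boundaries: degree 1 at the ends
    L2 = np.kron(np.eye(F), L1) + np.kron(L1, np.eye(F))
    _, psi = np.linalg.eigh(L2)        # columns ordered by frequency
    return psi                          # shape (F*F, F*F)

def gamma(w):
    """Eq. (3): w has shape (H_k, H_{k-1}, F, F); returns gamma_{k,l}."""
    H_k, H_prev, F, _ = w.shape
    psi = fourier_modes(F)
    proj = w.reshape(H_k * H_prev, F * F) @ psi     # (channel pairs, modes)
    return (proj ** 2).mean(axis=0)                  # average over c, c'
```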
¹The displacement field around a pixel $(u_0, v_0)$ is approximated as $\tau(u, v) \simeq \tau(u_0, v_0) + J(u_0, v_0)\,[u - u_0,\ v - v_0]^T$, where $\tau(u_0, v_0)$ corresponds to translations and $J$ is the Jacobian matrix of $\tau$, whose trace, antisymmetric and symmetric traceless parts correspond to stretchings, rotations and shears, respectively.