HOW DEEP CONVOLUTIONAL NEURAL NETWORKS
LOSE SPATIAL INFORMATION WITH TRAINING
Umberto M. Tomasini*, Leonardo Petrini*, Francesco Cagnetta, Matthieu Wyart
Institute of Physics
École Polytechnique Fédérale de Lausanne
name.surname@epfl.ch
ABSTRACT
A central question of machine learning is how deep nets manage to learn tasks
in high dimensions. An appealing hypothesis is that they achieve this feat by
building a representation of the data where information irrelevant to the task is
lost. For image datasets, this view is supported by the observation that after (and
not before) training, the neural representation becomes less and less sensitive
to diffeomorphisms acting on images as the signal propagates through the net.
This loss of sensitivity correlates with performance, and surprisingly correlates
with a gain of sensitivity to white noise acquired during training. These facts are
unexplained, and as we demonstrate still hold when white noise is added to the
images of the training set. Here, we (i) show empirically for various architectures
that stability to image diffeomorphisms is achieved by both spatial and channel
pooling, (ii) introduce a model scale-detection task which reproduces our empirical
observations on spatial pooling, and (iii) compute analytically how the sensitivity to
diffeomorphisms and noise scales with depth due to spatial pooling. The scalings
are found to depend on the presence of strides in the net architecture. We find that
the increased sensitivity to noise is due to the perturbing noise piling up during
pooling, after being rectified by ReLU units.
1 INTRODUCTION
Deep learning algorithms can be successfully trained to solve a large variety of tasks (Amodei et al.,
2016; Huval et al., 2015; Mnih et al., 2013; Shi et al., 2016; Silver et al., 2017), often revolving
around classifying data in high-dimensional spaces. If there were little structure in the data, the
learning procedure would be cursed by the dimension of these spaces: achieving good performance
would require an astronomical number of training data (Luxburg & Bousquet, 2004). Consequently,
real datasets must have a specific internal structure that can be learned with fewer examples. It has
thus been hypothesized that the effectiveness of deep learning lies in its ability to build ‘good’
representations of this internal structure, which are insensitive to aspects of the data not related to
the task (Ansuini et al., 2019; Shwartz-Ziv & Tishby, 2017; Recanatesi et al., 2019), thus effectively
reducing the dimensionality of the problem.
In the context of image classification, Bruna & Mallat (2013); Mallat (2016) proposed that neural
networks lose irrelevant information by learning representations that are insensitive to small defor-
mations of the input, also called diffeomorphisms. This idea was tested in modern deep networks
by Petrini et al. (2021), who introduced the following measures
D_f = E_{x,τ} ‖f(τ(x)) − f(x)‖² / E_{x1,x2} ‖f(x1) − f(x2)‖² ,
G_f = E_{x,η} ‖f(x + η) − f(x)‖² / E_{x1,x2} ‖f(x1) − f(x2)‖² ,
R_f = D_f / G_f ,   (1)

to probe the sensitivity of a function f—either the output or an internal representation of a trained
network—to random diffeomorphisms τ of x (see example in Fig. 1, left), to large white noise
perturbations η of magnitude ‖τ(x) − x‖, and in relative terms, respectively. Here the input images
x, x1 and x2 are sampled uniformly from the test set. In particular, the test error of trained networks
*Equal contribution.
arXiv:2210.01506v2 [cs.LG] 23 Nov 2022
Figure 1: Left: example of a random diffeomorphism τ applied to an image. Center: test error vs
relative sensitivity to diffeomorphisms of the predictor for a set of networks trained on CIFAR10,
adapted from Petrini et al. (2021). Right: correlation coefficient between test error ε and D_f, G_f
and R_f when training different architectures on noisy CIFAR10, ρ(ε, X) = Cov(log ε, log X) /
√(Var(log ε) Var(log X)). Increasing noise magnitudes are shown on the x-axis, and
η = E_{τ,x} ‖τ(x) − x‖² is the one used for the computation of G_f. Samples of a noisy CIFAR10
datum are shown on top. Notice that D_f and particularly R_f are positively correlated with ε, whilst
G_f is negatively correlated with ε. The corresponding scatter plots are in Fig. 10 (appendix).
Figure 2: Spatial vs. channel pooling. (a) Spatial average pooling (size 2x2, stride 1) computed on
a representation of size 3x3. One can notice that nearby pixel variations are smaller after pooling.
(b) If the filters of different channels are identical up to e.g. a rotation of angle θ, then averaging
the output of the application of such filters makes the result invariant to input rotations of θ. This
averaging is an example of channel pooling.
is correlated with D_f when f is the network output. Less intuitively, the test error is anti-correlated
with the sensitivity to white noise G_f. Overall, it is the relative sensitivity R_f which correlates
best with the error (Fig. 1, middle). This correlation is learned over training—as it is not seen at
initialization—and built up layer by layer (Petrini et al., 2021). These phenomena are not simply due
to benchmark data being noiseless, as they persist when input images are corrupted by some small
noise (Fig. 1, right).
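To make the measures in Eq. 1 concrete, here is a minimal NumPy sketch of how such sensitivities are estimated. The "network" is a hypothetical one-layer random feature map, and the diffeomorphism τ is crudely replaced by a one-pixel translation; the actual experiments use the max-entropy diffeomorphisms of Petrini et al. (2021) applied to trained deep networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network": a fixed random linear layer followed by ReLU (for illustration only).
d = 8 * 8
W = rng.standard_normal((32, d)) / np.sqrt(d)

def f(x):
    return np.maximum(W @ x.ravel(), 0.0)

def rel_sensitivity(f, xs, perturb, n_pairs=200):
    """Numerator: mean squared change of f under `perturb`;
    denominator: mean squared distance between f of random image pairs (as in Eq. 1)."""
    i = rng.integers(len(xs), size=n_pairs)
    j = rng.integers(len(xs), size=n_pairs)
    denom = np.mean([np.sum((f(xs[a]) - f(xs[b])) ** 2) for a, b in zip(i, j)])
    num = np.mean([np.sum((f(perturb(x)) - f(x)) ** 2) for x in xs])
    return num / denom

xs = rng.standard_normal((100, 8, 8))
D = rel_sensitivity(f, xs, lambda x: np.roll(x, 1, axis=1))                    # crude stand-in for tau
G = rel_sensitivity(f, xs, lambda x: x + 0.3 * rng.standard_normal(x.shape))  # white-noise eta
R = D / G
```

In practice the noise magnitude is matched to the diffeomorphism magnitude, which the sketch above does not attempt.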
Operations that grant insensitivity to diffeomorphisms in a deep network have been identified
previously (e.g. Goodfellow et al. (2016), section 9.3, sketched in Fig. 2). The first, spatial pooling,
integrates local patches within the image, thus losing the exact location of its features. The second,
channel pooling, requires the interaction of different channels, which allows the network to become
invariant to any local transformation by properly learning filters that are transformed versions of one
another. However, it is not clear whether these operations are actually learned by deep networks
and how they conspire in building good representations. Here we tackle this question by unveiling
empirically the emergence of spatial and channel pooling, and disentangling their role. Below is a
detailed list of our contributions.
1.1 OUR CONTRIBUTIONS
- We disentangle the role of spatial and channel pooling within deep networks trained on
CIFAR10 (Section 2). More specifically, our experiments reveal the significant contribution
of spatial pooling in decreasing the sensitivity to diffeomorphisms.
- In order to isolate the contribution of spatial pooling and quantify its relation with the
sensitivities to diffeomorphisms and noise, we introduce idealized scale-detection tasks
(Section 3). In these tasks, data are made of two active pixels and classified according to
their distance. We find the same correlations between test error and sensitivities of trained
networks as found in Petrini et al. (2021). In addition, the neural networks which perform
best on real data tend to be the best on these tasks.
- We theoretically analyze how simple CNNs, made by stacking convolutional layers with
filter size F and stride s, learn these tasks (Section 4). We find that the trained networks
perform spatial pooling in most of their layers. We show and verify empirically that the
sensitivities D_k and G_k of the k-th hidden layer follow G_k ∼ A_k and D_k ∼ A_k^{α_s}, where
A_k is the effective receptive field size and α_s = 2 if there is no stride, α_s = 1 otherwise.

The code and details for reproducing experiments are available online at
github.com/leonardopetrini/relativestability/experiments_ICLR23.md.
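For reference, the effective receptive field A_k entering these scalings can be computed with standard receptive-field arithmetic for a stack of k convolutional layers of filter size F and stride s. This is a generic helper, not code from the paper:

```python
def receptive_field(k, F, s):
    """Effective receptive field A_k (in input pixels) after k conv layers,
    each with filter size F and stride s (standard receptive-field arithmetic)."""
    jump, rf = 1, 1
    for _ in range(k):
        rf += (F - 1) * jump   # each layer extends the field by (F-1) input-space steps
        jump *= s              # the stride multiplies the input-space step between outputs
    return rf
```

Without strides (s = 1) the field grows linearly in depth, A_k = 1 + k(F − 1); with stride s > 1 it grows geometrically, A_k = 1 + (F − 1)(s^k − 1)/(s − 1).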
1.2 RELATED WORK
In the neuroscience literature, the understanding of the relevance of pooling in building invariant
representations dates back to the pioneering work of Hubel & Wiesel (1962). By studying the cat
visual cortex, they identified two different kinds of neurons: simple cells responding to e.g. edges at
specific angles and complex cells that pool the response of simple cells and detect edges regardless of
their position or orientation in the receptive field. More recent accounts of the importance of learning
invariant representations in the visual cortex can be found in Niyogi et al. (1998); Anselmi et al.
(2016); Poggio & Anselmi (2016).
In the context of artificial neural networks, layers jointly performing spatial pooling and strides
have been introduced with the early CNNs of Lecun et al. (1998), following the intuition that local
averaging and subsampling would reduce the sensitivity to small input shifts. Ruderman et al. (2018)
investigated the role of spatial pooling and showed empirically that networks with and without pooling
layers converge to similar deformation stability, suggesting that spatial pooling can be learned in
deep networks. In our work, we further expand in this direction by jointly studying diffeomorphisms
and noise stability and proposing a theory of spatial pooling for a simple task.
The depth-wise loss of irrelevant information in deep networks has been investigated by means of the
information bottleneck framework (Shwartz-Ziv & Tishby, 2017; Saxe et al., 2019) and the intrinsic
dimension of the networks' internal representations (Ansuini et al., 2019; Recanatesi et al., 2019).
However, these works specify neither what irrelevant information is to be disregarded, nor the
mechanisms involved in such a process.
The stability of trained networks to noise is extensively studied in the context of adversarial robust-
ness (Fawzi & Frossard, 2015; Kanbak et al., 2018; Alcorn et al., 2019; Alaifari et al., 2018; Athalye
et al., 2018; Xiao et al., 2018a; Engstrom et al., 2019). Notice that our work differs from this literature
in that we consider typical perturbations instead of worst-case ones.
2 EMPIRICAL OBSERVATIONS ON REAL DATA
In this section we analyze the parameters of deep CNNs trained on CIFAR10 and ImageNet, so as to
understand how they build representations insensitive to diffeomorphisms (details of the experiments
in App. B). The analysis builds on two premises, the first being the assumption that insensitivity is
built layer by layer in the network, as shown in Fig. 3. Hence, we focus on how each of the layers
in a deep network contribute towards creating an insensitive representation. More specifically, let
us denote with
fk(x)
the internal representation of an input
x
at the
k
-th layer of the network. The
entries of
fk
have three indices, one for the channel
c
and two for the spatial location
(i, j)
. The
The relation between f_k and f_{k−1} is the following:

[f_k(x)]_{c;i,j} = φ( b^k_c + Σ_{c′=1}^{H_{k−1}} w^k_{c,c′} · p_{i,j}([f_{k−1}(x)]_{c′}) ),   c = 1, . . . , H_k,   (2)
where: H_k denotes the number of channels at the k-th layer; b^k_c and w^k_{c,c′} the biases and filters of
the k-th layer; each filter w^k_{c,c′} is an F×F matrix with F the filter size; p_{i,j}([f_{k−1}(x)]_{c′}) denotes
an F×F-dimensional patch of [f_{k−1}(x)]_{c′} centered at (i, j); φ the activation function. The second
premise is that a general diffeomorphism can be represented as a displacement field over the image,
which indicates how each pixel moves in the transformation. Locally, this displacement field can be
decomposed into a constant term and a linear part: the former corresponds to local translations, the
latter to stretchings, rotations and shears.1
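Eq. 2 can be transcribed directly into code. The following is a naive, unoptimized sketch; the shapes and the zero-padding convention (chosen so that the spatial size is preserved) are our own illustrative choices:

```python
import numpy as np

def conv_layer(f_prev, w, b, F, phi=lambda z: np.maximum(z, 0.0)):
    """Literal transcription of Eq. 2. f_prev: (H_{k-1}, N, N);
    w: (H_k, H_{k-1}, F, F); b: (H_k,); phi: activation (ReLU by default).
    Zero padding keeps the output spatial size equal to N."""
    H_prev, N, _ = f_prev.shape
    H_k = w.shape[0]
    pad = F // 2
    fp = np.pad(f_prev, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((H_k, N, N))
    for c in range(H_k):
        for i in range(N):
            for j in range(N):
                patch = fp[:, i:i + F, j:j + F]           # p_{i,j}([f_{k-1}]_{c'}) for all c'
                out[c, i, j] = phi(b[c] + np.sum(w[c] * patch))
    return out

rng = np.random.default_rng(0)
y = conv_layer(rng.standard_normal((3, 8, 8)),
               rng.standard_normal((4, 3, 3, 3)),
               np.zeros(4), F=3)                          # -> shape (4, 8, 8)
```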
Invariance to translations via spatial pooling. Due to weight sharing, i.e. the fact that the same
filter w^k_{c,c′} is applied to all the local patches (i, j) of the representation, the output of a convolutional
layer is equivariant to translations by construction: a shift of the input is equivalent to a shift of the
output. To achieve an invariant representation it suffices to sum up the spatial entries of f_k—an
operation called pooling in CNNs; we refer to it as spatial pooling to stress that the sum runs over
the spatial indices of the representation. Even if there are no pooling layers at initialization, they
can be realized by having homogeneous filters, i.e. all the F×F entries of w^{k+1}_{c,c′} are the same.
Therefore, the closer the filters are to the homogeneous filter, the more they decrease the sensitivity
of the representation to local translations.
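This claim can be checked in a toy 1D setting (our own construction): a homogeneous, pooling-like filter responds far less to a one-pixel shift of the input than a high-frequency filter of the same norm.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d_valid(x, w):
    """Valid 1D correlation of signal x with filter w."""
    F = len(w)
    return np.array([np.dot(w, x[i:i + F]) for i in range(len(x) - F + 1)])

x = rng.standard_normal(64)
shifted = np.roll(x, 1)                                 # input translated by one pixel

F = 5
w_avg = np.ones(F) / np.sqrt(F)                         # homogeneous (pooling-like) filter, unit norm
w_alt = np.array([1., -1., 1., -1., 1.]) / np.sqrt(F)   # high-frequency filter, unit norm

def shift_sensitivity(w):
    """Squared change of the filter's output under the one-pixel shift."""
    return np.sum((conv1d_valid(shifted, w) - conv1d_valid(x, w)) ** 2)
```

The homogeneous filter turns a shift into a small boundary effect on each output, while the oscillating filter amplifies the pixel-to-pixel differences the shift creates.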
Invariance to other transformations via channel pooling. The example of translations shows
that building invariance can be performed by constructing an equivariant representation and then
pooling it. Invariance can also be built by pooling across channels. A two-channel example is shown
in Fig. 2, panel (b), where the filter of the second channel is built so as to produce the same output as
the first channel when applied to a rotated input. The same idea can be applied more generally, e.g.
to the other components of diffeomorphisms—such as local stretchings and shears. Below, we refer
generically to any operation that builds invariance to diffeomorphisms by assembling distinct channels
as channel pooling.
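The caricature of Fig. 2(b) can be pushed to exact invariance in a minimal construction of our own: with four channels whose filters are 90°-rotated copies of one another, summing the channel outputs on a single patch is exactly invariant to 90° rotations of the input.

```python
import numpy as np

rng = np.random.default_rng(2)

w = rng.standard_normal((3, 3))
filters = [np.rot90(w, r) for r in range(4)]   # 4 channels: rotated copies of one filter

def channel_pooled(x):
    """Sum of ReLU channel responses on a single patch: a caricature of channel pooling."""
    return sum(np.maximum(np.sum(f_ * x), 0.0) for f_ in filters)

x = rng.standard_normal((3, 3))
# Rotating x by 90 degrees only permutes which channel produces which response,
# so the pooled sum is unchanged.
```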
Disentangling spatial and channel pooling. The relative sensitivity to diffeomorphisms R_k of the
k-th layer representation f_k decreases after each layer, as shown in Fig. 3. This implies that spatial
or channel pooling is carried out along the whole network. To disentangle their contributions we
perform the following experiment: shuffle at random the connections between channels of successive
convolutional layers, while keeping the weights unaltered. Channel shuffling amounts to randomly
permuting the values of c, c′ in Eq. 2; therefore it breaks any channel pooling while not affecting
single filters. The values of R_k for deep networks after channel shuffling are reported in Fig. 3 as
dashed lines and compared with the original values of R_k in full lines. If only spatial pooling were
present in the network, then the two curves would overlap. Conversely, if the decrease in R_k were
all due to the interactions between channels, then the shuffled curves would be constant. Given that
neither of these scenarios arises, we conclude that both kinds of pooling are being performed.
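The shuffling operation itself can be sketched as follows. This is a schematic on a raw weight tensor; in the actual experiments the permutation is applied to the channel wiring inside trained architectures:

```python
import numpy as np

rng = np.random.default_rng(3)

def shuffle_channels(w):
    """Randomly permute the input-channel index c' of a conv weight tensor
    of shape (H_k, H_{k-1}, F, F). Each individual filter w[c, c'] survives
    intact; only the wiring between channels of successive layers is scrambled."""
    perm = rng.permutation(w.shape[1])
    return w[:, perm]

w = rng.standard_normal((4, 8, 3, 3))
w_shuf = shuffle_channels(w)
```

Because the permutation only relabels channels, any purely spatial property of the filters (e.g. their frequency content) is untouched, which is what lets the experiment isolate channel pooling.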
Emergence of spatial pooling after training. To bolster the evidence for the presence of spatial
pooling, we analyze the filters of trained networks. Since spatial pooling can be built by having
homogeneous filters, we test for its presence by looking at the frequency content of the learned filters
w^k_{c,c′}. In particular, we consider the average squared projection of filters onto “Fourier modes”
{Ψ_l}_{l=1,...,F²}, taken as the eigenvectors of the discrete Laplace operator on the F×F filter grid.
The squared projections averaged over channels read

γ_{k,l} = (1 / (H_{k−1} H_k)) Σ_{c=1}^{H_k} Σ_{c′=1}^{H_{k−1}} ( Ψ_l · w^k_{c,c′} )²,   (3)
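This diagnostic can be sketched as follows (our own implementation; we take the free-boundary graph Laplacian on the F×F grid, whose lowest eigenvector is the homogeneous mode):

```python
import numpy as np

def laplacian_modes(F):
    """Eigenvectors Psi_l of the discrete Laplacian on the F x F grid, ordered
    by eigenvalue (i.e. by spatial frequency), built as a Kronecker sum of the
    1D path-graph Laplacian. Column 0 is the constant (homogeneous) mode."""
    A = np.eye(F, k=1) + np.eye(F, k=-1)       # 1D adjacency (path graph)
    L1 = np.diag(A.sum(axis=1)) - A            # 1D graph Laplacian
    I = np.eye(F)
    L2 = np.kron(L1, I) + np.kron(I, L1)       # 2D grid Laplacian
    _, vecs = np.linalg.eigh(L2)               # eigh orders eigenvalues ascending
    return vecs                                # column l is Psi_l, flattened

def gamma(w, modes):
    """Eq. 3: squared projections of filters onto the modes, averaged over
    channel pairs. w has shape (H_k, H_{k-1}, F, F)."""
    Hk, Hk1, F, _ = w.shape
    flat = w.reshape(Hk * Hk1, F * F)
    return np.mean((flat @ modes) ** 2, axis=0)

# Perfectly homogeneous filters put all their weight on the lowest (constant) mode.
g = gamma(np.ones((2, 3, 5, 5)), laplacian_modes(5))
```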
1 The displacement field around a pixel (u0, v0) is approximated as τ(u, v) ≈ τ(u0, v0) +
J(u0, v0) [u − u0, v − v0]^T, where τ(u0, v0) corresponds to translations and J is the Jacobian matrix of τ, whose trace,
antisymmetric and symmetric traceless parts correspond to stretchings, rotations and shears, respectively.
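The decomposition in the footnote can be checked numerically for a hypothetical local Jacobian:

```python
import numpy as np

# A hypothetical local Jacobian of the displacement field tau.
J = np.array([[0.3, 0.5],
              [-0.1, 0.2]])

trace_part = 0.5 * np.trace(J) * np.eye(2)        # trace part -> local stretching
antisym = 0.5 * (J - J.T)                         # antisymmetric part -> local rotation
sym_traceless = 0.5 * (J + J.T) - trace_part      # symmetric traceless part -> shear

# The three parts recombine exactly into J.
```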