
relation between $f_k$ and $f_{k-1}$ is the following,
$$[f_k(x)]_{c;i,j} = \phi\left( b^k_c + \sum_{c'=1}^{H_{k-1}} w^k_{c,c'} \cdot p_{i,j}\!\left([f_{k-1}(x)]_{c'}\right) \right) \quad \forall\, c = 1, \dots, H_k, \qquad (2)$$
where: $H_k$ denotes the number of channels at the $k$-th layer; $b^k_c$ and $w^k_{c,c'}$ the biases and filters of the $k$-th layer; each filter $w^k_{c,c'}$ is an $F \times F$ matrix, with $F$ the filter size; $p_{i,j}([f_{k-1}(x)]_{c'})$ denotes an $F \times F$-dimensional patch of $[f_{k-1}(x)]_{c'}$ centered at $(i, j)$; $\phi$ the activation function.
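For concreteness, here is a minimal NumPy sketch of Eq. (2), assuming unit stride, no padding, and top-left patch indexing; the function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def conv_layer(f_prev, w, b, phi=lambda z: np.maximum(z, 0.0)):
    """Eq. (2): f_prev has shape (H_{k-1}, S, S), w has shape
    (H_k, H_{k-1}, F, F), b has shape (H_k,)."""
    H_k, _, F, _ = w.shape
    S_out = f_prev.shape[1] - F + 1
    f_k = np.empty((H_k, S_out, S_out))
    for c in range(H_k):
        for i in range(S_out):
            for j in range(S_out):
                patch = f_prev[:, i:i + F, j:j + F]   # p_{i,j} for all c'
                # sum over c' and over the F x F filter entries, as in Eq. (2)
                f_k[c, i, j] = phi(b[c] + np.sum(w[c] * patch))
    return f_k
```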
The second premise is that a general diffeomorphism can be represented as a displacement field over the image, which indicates how each pixel moves under the transformation. Locally, this displacement field can be decomposed into a constant term and a linear part: the former corresponds to local translations, the latter to stretchings, rotations and shears.¹
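The linear part of this decomposition is made explicit in footnote 1; the short sketch below (names illustrative) splits a 2×2 Jacobian of the displacement field into its trace, antisymmetric and symmetric traceless components:

```python
import numpy as np

def decompose_jacobian(J):
    """Split a 2x2 Jacobian J of the displacement field (footnote 1) into
    stretching (trace), rotation (antisymmetric) and shear (symmetric
    traceless) parts, which sum back to J."""
    stretch = 0.5 * np.trace(J) * np.eye(2)
    rotation = 0.5 * (J - J.T)
    shear = 0.5 * (J + J.T) - stretch
    assert np.allclose(stretch + rotation + shear, J)
    return stretch, rotation, shear
```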
Invariance to translations via spatial pooling.
Due to weight sharing, i.e. the fact that the same filter $w^k_{c,c'}$ is applied to all the local patches $(i, j)$ of the representation, the output of a convolutional layer is equivariant to translations by construction: a shift of the input is equivalent to a shift of the output. To achieve an invariant representation it suffices to sum up the spatial entries of $f_k$, an operation called pooling in CNNs; we refer to it as spatial pooling to stress that the sum runs over the spatial indices of the representation. Even if there are no pooling layers at initialization, they can be realized by homogeneous filters, i.e. filters $w^{k+1}_{c,c'}$ whose $F \times F$ entries are all the same. Therefore, the closer the filters are to the homogeneous filter, the more they decrease the sensitivity of the representation to local translations.
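As an illustration of both properties, the following sketch checks equivariance of the convolution output and, after spatial pooling, invariance. Periodic boundary conditions are assumed here so that the identities hold exactly; this is an assumption made for the sake of the example:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
w = rng.standard_normal((3, 3))

# Equivariance: shifting the input shifts the output
# (exact here thanks to the periodic boundaries).
y = correlate2d(x, w, mode="same", boundary="wrap")
y_shifted = correlate2d(np.roll(x, (2, 3), axis=(0, 1)), w,
                        mode="same", boundary="wrap")
assert np.allclose(np.roll(y, (2, 3), axis=(0, 1)), y_shifted)

# Invariance: summing the spatial entries (spatial pooling) removes the shift.
assert np.isclose(y.sum(), y_shifted.sum())
```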
Invariance to other transformations via channel pooling.
The example of translations shows that invariance can be built by constructing an equivariant representation and then pooling it. Invariance can also be built by pooling across channels. A two-channel example is shown in Fig. 2, panel (b), where the filter of the second channel is built so as to produce the same output as the first channel when applied to a rotated input. The same idea applies more generally, e.g. to the other components of diffeomorphisms, such as local stretchings and shears. Below, we refer generically to any operation that builds invariance to diffeomorphisms by assembling distinct channels as channel pooling.
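The construction in Fig. 2(b) pairs two channels; a self-contained variant of the same idea, sketched below, uses all four 90° rotations of one filter as channels, so that summing over channels (channel pooling) followed by spatial pooling is exactly blind to a global 90° rotation of the input. This is an illustrative construction, not the paper's own example:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
w = rng.standard_normal((3, 3))

def pooled_feature(img):
    # One channel per 90-degree rotation of the same filter; summing the
    # channels' spatially-pooled outputs implements channel pooling over
    # the rotation orbit.
    return sum(correlate2d(img, np.rot90(w, r), mode="valid").sum()
               for r in range(4))

# The pooled feature is invariant to rotating the input by 90 degrees.
assert np.isclose(pooled_feature(x), pooled_feature(np.rot90(x)))
```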
Disentangling spatial and channel pooling.
The relative sensitivity to diffeomorphisms $R_k$ of the $k$-th layer representation $f_k$ decreases after each layer, as shown in Fig. 3. This implies that spatial or channel pooling is carried out along the whole network. To disentangle their contributions we perform the following experiment: shuffle at random the connections between channels of successive convolutional layers, while keeping the weights unaltered. Channel shuffling amounts to randomly permuting the values of $c, c'$ in Eq. (2); therefore it breaks any channel pooling while not affecting single filters. The values of $R_k$ for deep networks after channel shuffling are reported in Fig. 3 as dashed lines and compared with the original values of $R_k$ in full lines. If only spatial pooling were present in the network, the two curves would overlap. Conversely, if the decrease in $R_k$ were due entirely to interactions between channels, the shuffled curves would be constant. Since neither scenario arises, we conclude that both kinds of pooling are being performed.
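A minimal PyTorch sketch of this shuffling experiment, under the assumption of plain (ungrouped) convolutions: each layer's input-channel index $c'$ is permuted independently, which scrambles the wiring between successive layers while leaving every single $F \times F$ filter intact. Names are illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def shuffle_channels(model: nn.Module) -> None:
    """Independently permute the input-channel index c' of every conv
    layer after the first (cf. Eq. 2): the wiring between successive
    layers is scrambled, but no single F x F filter is modified."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    for conv in convs[1:]:  # the first layer reads raw image channels
        perm = torch.randperm(conv.in_channels)
        conv.weight.copy_(conv.weight[:, perm])
```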
Emergence of spatial pooling after training.
To bolster the evidence for the presence of spatial pooling, we analyze the filters of trained networks. Since spatial pooling can be built with homogeneous filters, we test for its presence by looking at the frequency content of the learned filters $w^k_{c,c'}$. In particular, we consider the average squared projection of the filters onto "Fourier modes" $\{\Psi_l\}_{l=1,\dots,F^2}$, taken as the eigenvectors of the discrete Laplace operator on the $F \times F$ filter grid. The squared projections averaged over channels read
$$\gamma_{k,l} = \frac{1}{H_{k-1} H_k} \sum_{c=1}^{H_k} \sum_{c'=1}^{H_{k-1}} \left( \Psi_l \cdot w^k_{c,c'} \right)^2, \qquad (3)$$
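A sketch of how Eq. (3) can be evaluated, assuming the 5-point-stencil Laplacian with free boundaries on the $F \times F$ grid (the precise stencil is an assumption here); the lowest mode is then the constant vector, i.e. the homogeneous filter discussed above:

```python
import numpy as np

def fourier_modes(F):
    """Eigenvectors of the discrete Laplace operator on the F x F grid
    (5-point stencil, free boundaries), built as a Kronecker sum of 1D
    path-graph Laplacians."""
    L1 = 2 * np.eye(F) - np.eye(F, k=1) - np.eye(F, k=-1)
    L1[0, 0] = L1[-1, -1] = 1          # free boundaries: degree 1 at the ends
    L2 = np.kron(np.eye(F), L1) + np.kron(L1, np.eye(F))
    _, psi = np.linalg.eigh(L2)        # columns ordered by frequency
    return psi                          # shape (F*F, F*F)

def gamma(w):
    """Eq. (3): w has shape (H_k, H_{k-1}, F, F); returns gamma_{k,l}."""
    H_k, H_prev, F, _ = w.shape
    psi = fourier_modes(F)
    proj = w.reshape(H_k * H_prev, F * F) @ psi     # (channel pairs, modes)
    return (proj ** 2).mean(axis=0)                  # average over c, c'
```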
¹The displacement field around a pixel $(u_0, v_0)$ is approximated as $\tau(u, v) \simeq \tau(u_0, v_0) + J(u_0, v_0)\,[u - u_0,\ v - v_0]^T$, where $\tau(u_0, v_0)$ corresponds to translations and $J$ is the Jacobian matrix of $\tau$, whose trace, antisymmetric and symmetric traceless parts correspond to stretchings, rotations and shears, respectively.