
set of images $\mathcal{D}$, an image $x$ is sampled uniformly from $\mathcal{D}$, and two augmented views $x_1$ and $x_2$ are obtained from $x$ via a distribution of data augmentations $\mathcal{T}$. They are then fed to an encoder $f$, consisting of a backbone and a projector network, which produces $\ell_2$-normalized embeddings $z_1$ and $z_2$. For a batch of $m$ images, we have $Z_1 = [z_1^1, z_1^2, \ldots, z_1^m]$ and similarly for $Z_2$, which are two observations of the same $Z$. MEC aims to minimize the following loss:
\[
\mathcal{L}_{\mathrm{MEC}} = -\mu \log\det\left(I_m + \lambda Z_1^{\top} Z_2\right) \approx -\operatorname{Tr}\left(\mu \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k} \left(\lambda Z_1^{\top} Z_2\right)^{k}\right), \tag{3}
\]
where the same notations as in Section 2.1 apply, and $n$ is the order of the Taylor expansion. Compared with Equation (1), the formulation in (3) considers not only maximizing entropy but also the view-consistency prior mined from the data itself, thereby learning meaningful representations.
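The following minimal PyTorch-style sketch illustrates how the truncated objective in Equation (3) could be computed. It is illustrative rather than the authors' reference implementation (a PyTorch-like pseudocode is given in Appendix A): the helper name mec_loss, the argument names, and the default expansion order are our own, and $\mu$ is passed in because its value is fixed in Section 2.1, which is not reproduced here.

```python
import torch

def mec_loss(Z1: torch.Tensor, Z2: torch.Tensor, mu: float, eps: float, order: int = 4) -> torch.Tensor:
    """Taylor approximation of -mu * logdet(I_m + lambda * Z1^T Z2), as in Equation (3).

    Z1, Z2: (d, m) matrices whose columns are the l2-normalized embeddings of the
    two augmented views; mu and the distortion eps follow the notation of Section 2.1.
    """
    d, m = Z1.shape
    lam = d / (m * eps ** 2)            # lambda = d / (m * eps^2)
    C = lam * (Z1.T @ Z2)               # (m, m) cross-view correlation matrix
    power, series = C, torch.zeros_like(C)
    for k in range(1, order + 1):       # sum_{k=1}^{n} (-1)^{k+1} / k * C^k
        series = series + ((-1) ** (k + 1) / k) * power
        power = power @ C
    return -mu * torch.trace(series)
```

With sufficiently many terms and $\|C\|_2 < 1$, the truncated sum approaches the exact log-determinant objective.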
As noted in Section 2.1, the convergence condition of the Taylor expansion requires $\|C\|_2 < 1$, where $C = \lambda Z_1^{\top} Z_2$ and $\lambda = d/(m\epsilon^2) = 1/(m \cdot \epsilon^2/d)$. We show that this condition can be strictly satisfied by setting $\epsilon^2/d > 1$, owing to the inequality $\|C\|_2 \le \sqrt{\|C\|_1 \|C\|_\infty} < 1$. In practice, we empirically find that the Taylor expansion converges over a wide range of $\epsilon^2/d$ (Figure 4(c)) with a linear warm-up.
From the preliminary experiments on CIFAR-10 [41] (detailed in Appendix B), we also find that the distributions of representations show progressively finer granularity as $\epsilon$ decreases (Figure 4(b)). This can be interpreted via the practical meaning of the distortion $\epsilon$ (Figure 4(a)): a smaller $\epsilon$ encourages the representation space to be encoded at a finer granularity (and hence more uniformly). By contrast, too small an $\epsilon$ might break the semantic structures of similar images (i.e., tolerance). Therefore, a good choice of $\epsilon$ is needed to balance the uniformity and tolerance properties [75, 74] of representations, and it plays the same role as the temperature term [80] in contrastive learning.
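The spectral-norm bound above is easy to check numerically. The following sketch uses hypothetical values of $d$, $m$, and $\epsilon$ with random unit-norm embeddings, and verifies that with $\epsilon^2/d > 1$ the matrix $C = \lambda Z_1^{\top} Z_2$ indeed satisfies $\|C\|_2 \le \sqrt{\|C\|_1 \|C\|_\infty} < 1$.

```python
import torch

d, m = 128, 256
Z1 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
Z2 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)

eps_sq = 2.0 * d                          # choose eps^2 / d = 2 > 1
lam = d / (m * eps_sq)                    # lambda = d / (m * eps^2)
C = lam * (Z1.T @ Z2)

spectral = torch.linalg.matrix_norm(C, ord=2)                      # ||C||_2
bound = torch.sqrt(torch.linalg.matrix_norm(C, ord=1)              # ||C||_1: max column sum
                   * torch.linalg.matrix_norm(C, ord=float('inf')))  # ||C||_inf: max row sum
print(f"||C||_2 = {spectral.item():.4f} <= bound = {bound.item():.4f} < 1")
```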
An overview of MEC is illustrated in Figure 2, and a PyTorch-like pseudocode is provided in Appendix A. The algorithm describes the minimalist variant of MEC, which can be further improved by integrating a momentum encoder and asymmetric networks (detailed in the experiments).
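The actual pseudocode is in Appendix A; purely for orientation, a hypothetical training step for the minimalist variant might look like the sketch below, reusing the mec_loss helper sketched after Equation (3). The encoder, optimizer, and augmented batches are assumed to be supplied by the surrounding training loop.

```python
import torch.nn.functional as F

def train_step(encoder, optimizer, x1, x2, mu, eps, order=4):
    """One illustrative MEC step: encode two augmented views and minimize Equation (3)."""
    z1 = F.normalize(encoder(x1), dim=1)          # (m, d) l2-normalized embeddings of view 1
    z2 = F.normalize(encoder(x2), dim=1)          # (m, d) l2-normalized embeddings of view 2
    loss = mec_loss(z1.T, z2.T, mu, eps, order)   # transpose to the (d, m) layout used above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```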
2.3 A Unified View of Batch-wise and Feature-wise SSL Objectives
Current SSL methods based on Siamese networks can be roughly divided into two categories: batch-wise methods [12, 31, 13, 15, 14, 10] and feature-wise methods [86, 5, 24, 36]. The former aims to minimize the distance between augmented views of the same sample while maximizing the distance between different samples, which can be viewed as decorrelating the different samples in a batch. The latter, in contrast, tries to decorrelate the different vector components of the representation. The relationship between the two has not been fully understood. Our work bridges these two types of methods through the following derivation:
\[
\mathcal{L}_{\mathrm{MEC}} = \underbrace{-\mu \log\det\left(I_m + \lambda Z_1^{\top} Z_2\right)}_{\text{batch-wise}} = \underbrace{-\mu \log\det\left(I_d + \lambda Z_1 Z_2^{\top}\right)}_{\text{feature-wise}}, \tag{4}
\]
which can be proved since $Z_1^{\top} Z_2 \in \mathbb{R}^{m \times m}$ and $Z_1 Z_2^{\top} \in \mathbb{R}^{d \times d}$ have the same nonzero eigenvalues (the remaining eigenvalues are zero and contribute only factors of one to each determinant).
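A quick numerical check of this identity, with arbitrary dimensions and a small illustrative $\lambda$, confirms that the batch-wise and feature-wise log-determinants in Equation (4) coincide:

```python
import torch

d, m = 64, 32
Z1 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
Z2 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
lam = 0.01

batch_wise = torch.logdet(torch.eye(m, dtype=torch.float64) + lam * Z1.T @ Z2)    # (m, m) form
feature_wise = torch.logdet(torch.eye(d, dtype=torch.float64) + lam * Z1 @ Z2.T)  # (d, d) form
print(torch.allclose(batch_wise, feature_wise))   # expected: True
```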
In Figure 2, under the framework of MEC, we show the equivalence between batch-wise and feature-wise methods using two examples, SimSiam [14] and Barlow Twins [86]. Taking the Taylor expansion (Equation (2)) of the left-hand side of Equation (4), before the trace operation the diagonal elements of the leading term (i.e., $\mu\lambda Z_1^{\top} Z_2$) measure the similarity between the views of the same images in a batch, and the objective of SimSiam [14] is equivalent to maximizing the trace of this term. Similarly, the leading term of the right-hand side expansion models the correlation between feature dimensions, and the objective of Barlow Twins [86] is equivalent to the second-order expansion of $\mathcal{L}_{\mathrm{MEC}}$. With the above derivation, our method naturally subsumes these two kinds of objectives as its low-order expansions, and we show in the experiments that better downstream task performance can be achieved with higher-order approximations. We further show in Appendix E that MEC can also bridge other self-supervised objectives, and we hope that tying a family of objectives directly to a well-grounded mathematical concept can inspire new methods.
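To make the connection to SimSiam concrete, the following illustrative check (arbitrary $\lambda$, random unit-norm columns; $\mu$ set to an illustrative value) confirms that the trace of the leading term $\mu\lambda Z_1^{\top} Z_2$ is exactly a scaled sum of per-image cosine similarities between the two views, i.e., the quantity a SimSiam-style objective maximizes:

```python
import torch

d, m = 64, 32
Z1 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
Z2 = torch.nn.functional.normalize(torch.randn(d, m, dtype=torch.float64), dim=0)
mu, lam = (m + d) / 2, 0.01   # illustrative values

first_order = mu * lam * torch.trace(Z1.T @ Z2)   # trace of the leading Taylor term
cosine_sum = mu * lam * (Z1 * Z2).sum()           # sum_i cos(z_1^i, z_2^i) for unit-norm columns
print(torch.allclose(first_order, cosine_sum))    # expected: True
```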
3 Experiments
We perform self-supervised pre-training using the proposed MEC on the training set of the ImageNet ILSVRC-2012 dataset [17]. After pre-training, we conduct extensive experiments to examine