collapse, whitening loss is proposed by only minimizing the distance between embeddings of positive pairs under the condition that embeddings from different views are whitened [13, 24]. A typical approach is to apply batch whitening (BW) and impose the loss on the whitened output [13, 24], which obtains promising results.
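To make the batch-whitening formulation concrete, below is a minimal PyTorch-style sketch of a hard whitening loss, assuming ZCA whitening over the batch and a plain mean-squared error between the whitened views; the actual methods [13, 24] differ in the choice of whitening transformation and other implementation details.

```python
import torch

def batch_whiten(z, eps=1e-5):
    """ZCA-whiten embeddings z of shape (N, d) so their covariance is approximately the identity."""
    z = z - z.mean(dim=0, keepdim=True)                # center over the batch
    cov = z.T @ z / (z.shape[0] - 1)                   # (d, d) sample covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)          # symmetric eigendecomposition
    w = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    return z @ w                                       # whitened output

def whitening_loss(z1, z2):
    """MSE between whitened embeddings of two augmented views (positive pairs)."""
    return (batch_whiten(z1) - batch_whiten(z2)).pow(2).sum(dim=1).mean()

# toy usage: two views of a batch of 256 examples with 64-d embeddings
z1, z2 = torch.randn(256, 64), torch.randn(256, 64)
loss = whitening_loss(z1, z2)
```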
Although whitening loss has a theoretical guarantee of avoiding collapse, we experimentally observe that this guarantee depends on which kind of whitening transformation [29] is used in practice (see Section 3.2 for details). This interesting observation challenges the motivations of whitening loss for SSL. Besides, the motivation of whitening loss is that the whitening operation can remove the correlation among axes [24] and a whitened representation ensures that the examples are scattered in a spherical distribution [13]. Based on this argument, one could use the whitened output as the representation for downstream tasks, but this is not done in practice. To this end, this paper investigates whitening loss and tries to demystify these interesting observations. Our contributions are as follows:
• We decompose the symmetric formulation of whitening loss into two asymmetric losses, where each asymmetric loss requires an online network to match a whitened target. This mechanism provides a pivoting point connecting to other methods, and a way to understand why certain whitening transformations fail to avoid dimensional collapse.
• Our analysis shows that BW-based methods do not impose whitening constraints on the embedding; they only require the embedding to be full-rank. This full-rank constraint is also sufficient to avoid dimensional collapse.
• We propose channel whitening with random group partition (CW-RGP), which exploits the advantage of BW-based methods in preventing collapse while avoiding their disadvantage of requiring large batch sizes (a rough sketch follows this list). Experimental results on ImageNet classification and COCO object detection show that CW-RGP has promising potential for learning good representations.
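The following is a rough, simplified sketch of the CW-RGP idea: the channel axis is randomly partitioned into groups and, within each group, the batch of samples (rather than the feature dimensions) is whitened before matching the two views. The function names, centering, exact whitening transformation, and loss weighting here are illustrative; the full method is specified later in the paper.

```python
import torch

def channel_whiten(z, eps=1e-5):
    """Whiten over the instance axis: z has shape (m, c) for m samples and c channels;
    the output samples become mutually orthogonal with equal norm (their Gram matrix
    is proportional to the identity), which requires c >= m."""
    z = z - z.mean(dim=1, keepdim=True)               # center each sample over its channels
    gram = z @ z.T / z.shape[1]                       # (m, m) covariance across instances
    eigvals, eigvecs = torch.linalg.eigh(gram)
    w = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T
    return w @ z

def cw_rgp_loss(z1, z2, num_groups=2):
    """Channel whitening with a random group partition of the channel axis."""
    m, d = z1.shape
    perm = torch.randperm(d)                          # random partition of the d channels
    loss = 0.0
    for g in perm.chunk(num_groups):
        w1, w2 = channel_whiten(z1[:, g]), channel_whiten(z2[:, g])
        loss = loss + (w1 - w2).pow(2).sum(dim=1).mean()
    return loss / num_groups

# toy usage: small batch (m=32) with high-dimensional embeddings (d=256), 2 groups
z1, z2 = torch.randn(32, 256), torch.randn(32, 256)
loss = cw_rgp_loss(z1, z2)
```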
2 Related Work
A desirable objective in self-supervised learning is to avoid feature collapse.
Contrastive learning prevents collapse by attracting positive samples closer and spreading negative samples apart [47, 48]. In these methods, negative samples play an important role and need to be well designed [37, 1, 23]. One typical mechanism, proposed in MoCos [21], is building a memory bank with a momentum encoder to provide consistent negative samples, yielding promising results [21, 7, 9, 33]. Another line of work, SimCLR [6], shows that using more negative samples in a batch together with strong data augmentations performs better. Contrastive methods require large batch sizes or memory banks, which tends to be costly, prompting the question of whether negative pairs are necessary.
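As a concrete reference point, below is a minimal sketch of a simplified InfoNCE-style contrastive loss, where z1[i] and z2[i] form a positive pair and the remaining samples in the batch act as negatives; SimCLR's NT-Xent objective is symmetrized over both views and differs in such details.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.5):
    """Simplified one-directional InfoNCE: z1[i] and z2[i] are positives,
    while z2[j], j != i, serve as negatives for z1[i]."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)   # cosine similarities
    logits = z1 @ z2.T / temperature                          # (N, N) similarity matrix
    labels = torch.arange(z1.shape[0])                        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# toy usage
z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
loss = info_nce(z1, z2)
```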
Non-contrastive methods aim to accomplish SSL without explicitly introducing negative pairs [3, 4, 34, 18, 8]. One typical way to avoid representational collapse is the introduction of an asymmetric network architecture. BYOL [18] appends a predictor after the online network and introduces momentum into the target network. SimSiam [8] further simplifies BYOL by removing the momentum mechanism, and shows that applying stop-gradient to the target network serves as an alternative approximation to the momentum encoder. Other progress includes an asymmetric pipeline with a self-distillation loss for Vision Transformers [5]. It remains unclear how the asymmetric network avoids collapse without negative pairs, leaving debates on batch normalization (BN) [15, 45, 39] and stop-gradient [8, 50], even though preliminary works have attempted to analyze the training dynamics theoretically under certain assumptions [44] and to build a connection between asymmetric networks and contrastive learning methods [42]. Our work provides a pivoting point connecting asymmetric networks to whitening loss in avoiding collapse.
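For illustration, below is a minimal sketch of the SimSiam-style asymmetric objective: the negative cosine similarity between the predictor output of one view and the stop-gradient (detached) projection of the other view, symmetrized over the two views. The encoder and predictor outputs are stand-ins here; architectural details follow the original papers.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized negative cosine similarity with stop-gradient on the target branch.
    p1, p2: predictor outputs of the two views; z1, z2: projector outputs of the two views."""
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # detach = stop-gradient
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)

# toy usage with random tensors standing in for network outputs
z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
p1, p2 = torch.randn(128, 64), torch.randn(128, 64)
loss = simsiam_loss(p1, p2, z1, z2)
```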
Whitening loss has a theoretical guarantee of avoiding collapse by minimizing the distance of positive pairs under the condition that the embeddings from different views are whitened [49, 13, 24, 2]. One way to obtain a whitened output is to impose a whitening penalty as a regularization on the embedding, the so-called soft whitening, which is proposed in Barlow Twins [49], VICReg [2] and CCA-SSG [51]. Another way is to use batch whitening (BW) [25], the so-called hard whitening, which is used in W-MSE [13] and Shuffled-DBN [24]. We propose a different hard whitening method, channel whitening (CW), which serves the same function of ensuring that all the singular values of the transformed output are one, thereby avoiding collapse. Compared to BW, CW is more numerically stable and works better when the batch size is small. Furthermore, our CW with random group partition (CW-RGP) can effectively control the extent of the constraint on the embedding and obtains better performance in practice. We note that a recent work, ICL [52], proposes to decorrelate instances, like CW but having several significant