An Investigation into Whitening Loss for
Self-supervised Learning
Xi Weng1, Lei Huang1,2, Lei Zhao1,
Rao Muhammad Anwer2, Salman Khan2, Fahad Shahbaz Khan2
1SKLSDE, Institute of Artificial Intelligence, Beihang University, Beijing, China
2Mohamed bin Zayed University of Artificial Intelligence, UAE
Abstract
A desirable objective in self-supervised learning (SSL) is to avoid feature collapse.
Whitening loss guarantees collapse avoidance by minimizing the distance between
embeddings of positive pairs under the condition that the embeddings from
different views are whitened. In this paper, we propose a framework with an
informative indicator to analyze whitening loss, which provides a clue to demystify
several interesting phenomena as well as a pivoting point connecting to other SSL
methods. We reveal that batch whitening (BW) based methods do not impose
whitening constraints on the embedding; they only require the embedding
to be full-rank. This full-rank constraint is also sufficient to avoid dimensional
collapse. Based on our analysis, we propose channel whitening with random
group partition (CW-RGP), which exploits the advantages of BW-based methods
in preventing collapse and avoids their disadvantage of requiring a large batch size.
Experimental results on ImageNet classification and COCO object detection reveal
that the proposed CW-RGP possesses promising potential for learning good
representations. The code is available at https://github.com/winci-ai/CW-RGP.
1 Introduction
Self-supervised learning (SSL) has made significant progress over the last several years [1, 21, 6, 18, 8], almost reaching the performance of supervised baselines on many downstream tasks [36, 27, 38].
Several recent approaches rely on a joint embedding architecture in which a dual pair of networks are
trained to produce similar embeddings for different views of the same image [8]. Such methods aim
to learn representations that are invariant to transformations of the same input.
with the joint embedding architectures is how to prevent a collapse of representation, in which the
two branches ignore the inputs and produce identical and constant output representations [6, 8].
One line of work uses contrastive learning methods that attract different views from the same image
(positive pairs) while pulling apart different images (negative pairs), which prevents constant outputs
from the solution space [47]. While the concept is simple, these methods need a large batch size to
obtain good performance [21, 6, 40]. Another line of work tries to directly match the positive targets
without introducing negative pairs. A seminal approach, BYOL [18], shows that an extra predictor
and a momentum encoder are essential for representation learning. SimSiam [8] further generalizes [18] by
empirically showing that stop-gradient is essential for preventing trivial solutions. Recent works
generalize the collapse problem into dimensional collapse [24, 28] (also referred to as informational
collapse in [2]), where the embedding vectors only span a lower-dimensional subspace and are highly
correlated; the embedding dimensions therefore vary together and contain redundant information. To
prevent dimensional collapse, whitening loss is proposed: it minimizes the distance between embeddings
of positive pairs under the condition that embeddings from different views are whitened [13, 24]. A
typical approach uses batch whitening (BW) and imposes the loss on the whitened output [13, 24],
which obtains promising results.

Equal contribution. Corresponding author (huangleiAI@buaa.edu.cn). This work was partially done while
Lei Huang was a visiting scholar at Mohamed bin Zayed University of Artificial Intelligence, UAE.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.03586v1 [cs.CV] 7 Oct 2022
Although whitening loss has a theoretical guarantee of avoiding collapse, we experimentally observe
that this guarantee depends on which kind of whitening transformation [29] is used in practice
(see Section 3.2 for details). This interesting observation challenges the motivations of whitening
loss for SSL. Besides, the motivation of whitening loss is that the whitening operation can remove
the correlation among axes [24] and a whitened representation ensures the examples are scattered in
a spherical distribution [13]. Based on this argument, one could use the whitened output as the
representation for downstream tasks, but this is not done in practice. To this end, this paper investigates
whitening loss and tries to demystify these interesting observations. Our contributions are as follows:
• We decompose the symmetric formulation of whitening loss into two asymmetric losses,
where each asymmetric loss requires an online network to match a whitened target. This
mechanism provides a pivoting point connecting to other methods, and a way to understand
why certain whitening transformations fail to avoid dimensional collapse.
• Our analysis shows that BW-based methods do not impose whitening constraints on the
embedding; they only require the embedding to be full-rank. This full-rank constraint is
also sufficient to avoid dimensional collapse.
• We propose channel whitening with random group partition (CW-RGP), which exploits
the advantages of BW-based methods in preventing collapse and avoids their disadvantage
of requiring a large batch size. Experimental results on ImageNet classification and COCO
object detection show that CW-RGP has promising potential for learning good representations.
2 Related Work
A desirable objective in self-supervised learning is to avoid feature collapse.
Contrastive learning prevents collapse by attracting positive samples closer and spreading negative
samples apart [47, 48]. In these methods, negative samples play an important role and need to
be well designed [37, 1, 23]. One typical mechanism is building a memory bank with a momentum
encoder to provide consistent negative samples, proposed in MoCos [21], yielding promising results
[21, 7, 9, 33]. Other works include SimCLR [6], which shows that more negative samples in a batch
combined with strong data augmentation perform better. Contrastive methods require large batch sizes
or memory banks, which tends to be costly, prompting the question of whether negative pairs are necessary.
Non-contrastive methods aim to accomplish SSL without introducing negative pairs explicitly
[3, 4, 34, 18, 8]. One typical way to avoid representational collapse is the introduction of an asymmetric
network architecture. BYOL [18] appends a predictor after the online network and introduces
momentum into the target network. SimSiam [8] further simplifies BYOL by removing the momentum
mechanism, and shows that stop-gradient applied to the target network serves as an alternative approximation
to the momentum encoder. Other progress includes an asymmetric pipeline with a self-distillation loss
for Vision Transformers [5]. It remains unclear how the asymmetric network avoids collapse without
negative pairs, leaving ongoing debates on batch normalization (BN) [15, 45, 39] and stop-gradient [8, 50],
even though preliminary works have attempted to analyze the training dynamics theoretically under
certain assumptions [44] and to build a connection between asymmetric networks and contrastive
learning methods [42]. Our work provides a pivoting point connecting asymmetric networks to
whitening loss in avoiding collapse.
Whitening loss has a theoretical guarantee of avoiding collapse by minimizing the distance of positive
pairs under the condition that the embeddings from different views are whitened [49, 13, 24, 2].
One way to obtain a whitened output is to impose a whitening penalty as a regularization on the embedding
(the so-called soft whitening), as proposed in Barlow Twins [49], VICReg [2] and CCA-SSG [51].
Another way is to use batch whitening (BW) [25] (the so-called hard whitening), which is used in
W-MSE [13] and Shuffled-DBN [24]. We propose a different hard whitening method, channel
whitening (CW), that serves the same purpose of ensuring that all singular values of the transformed output
are one, thereby avoiding collapse. However, CW is more numerically stable and works better than BW when the
batch size is small. Furthermore, our CW with random group partition (CW-RGP) can effectively
control the extent of the constraint on the embedding and obtains better performance in practice. We note
that a recent work, ICL [52], proposes to decorrelate instances, like CW, but with several significant
differences in technical details. ICL uses stop-gradient for the whitening matrix, while CW requires
back-propagation through the whitening transformation. Besides, ICL uses extra pre-conditioning on
the covariance and whitening matrices, which is essential for its numerical stability, while CW does
not use extra pre-conditioning and can work well since it encourages the embedding to be full-rank.
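To make this distinction concrete, the following sketch (a simplified illustration only, not code from either method's release; `compute_whitening_matrix` is an assumed helper) contrasts detaching the whitening matrix from the computational graph with back-propagating through it:

```python
import torch

def compute_whitening_matrix(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # Illustrative ZCA-style whitening matrix of a centered (d x m) mini-batch embedding.
    d, m = z.shape
    cov = z @ z.t() / m + eps * torch.eye(d, device=z.device)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    return eigvecs @ torch.diag(eigvals.rsqrt()) @ eigvecs.t()

z = torch.randn(64, 256, requires_grad=True)   # embedding Z of shape (d_z, m)
zc = z - z.mean(dim=1, keepdim=True)           # center over the batch
w = compute_whitening_matrix(zc)
z_hat_stopgrad = w.detach() @ zc               # ICL-style: no gradient flows through the whitening matrix
z_hat_backprop = w @ zc                        # CW-style: back-propagate through the whitening transform
```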
3 Exploring Whitening Loss for SSL
3.1 Preliminaries
Figure 1: The basic notations for SSL used in this paper ($H$: encoding, $Z$: embedding, $\widehat{Z}$: whitened output). Two augmented views $X_1, X_2$ of an input $X$ (with $T \sim \mathcal{T}$) pass through a Siamese network (encoder and projector), followed by whitening, and the distance is computed between the whitened outputs.
Let $x$ denote the input sampled uniformly from a set of images $D$, and $\mathcal{T}$ denote the set of data
transformations available for augmentation. We consider the Siamese network $f_\theta(\cdot)$ parameterized by
$\theta$. It takes as input two randomly augmented views, $x_1 = T_1(x)$ and $x_2 = T_2(x)$, where $T_1, T_2 \in \mathcal{T}$.
The network $f_\theta(\cdot)$ is trained with an objective function that minimizes the distance between
embeddings obtained from different views of the same image:
$$\mathcal{L}(x, \theta) = \mathbb{E}_{x \sim D,\, T_{1,2} \sim \mathcal{T}}\; \ell\big(f_\theta(T_1(x)),\, f_\theta(T_2(x))\big), \qquad (1)$$
where $\ell(\cdot,\cdot)$ is a loss function. In particular, the Siamese network usually consists of an encoder
$E_{\theta_e}(\cdot)$ and a projector $G_{\theta_g}(\cdot)$. Their outputs $h = E_{\theta_e}(T(x))$ and $z = G_{\theta_g}(h)$ are referred to as the
encoding and the embedding, respectively. We summarize the notations in Figure 1 and use the corresponding capital
letters to denote mini-batch data. Under this notation, we have $f_\theta(\cdot) = G_{\theta_g}(E_{\theta_e}(\cdot))$ with
learnable parameters $\theta = \{\theta_e, \theta_g\}$. The encoding $h$ is usually used as the representation for evaluation,
either by training a linear classifier [21] or by transferring to downstream tasks. This is because $h$ is
shown to obtain significantly better performance than the embedding $z$ [6, 8].
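As a concrete illustration of these notations, the sketch below (a toy example for illustration only; the encoder, projector and augmentation are placeholders rather than the architectures used in this paper) computes the encodings and embeddings of two views and evaluates the objective of Eqn. 1 on a mini-batch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())  # E_{theta_e} (placeholder)
projector = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 64))    # G_{theta_g} (placeholder)

def f(x):
    h = encoder(x)      # encoding h = E(x), later used as the representation for evaluation
    z = projector(h)    # embedding z = G(h), on which the SSL loss is imposed
    return h, z

def augment(x):
    return x + 0.1 * torch.randn_like(x)   # toy stand-in for a transformation T drawn from the set T

x = torch.randn(256, 3, 32, 32)            # a mini-batch X of images
h1, z1 = f(augment(x))                     # view 1: f(T1(x))
h2, z2 = f(augment(x))                     # view 2: f(T2(x))
loss = F.mse_loss(z1, z2)                  # placeholder distance l(.,.); Eqn. 2 below gives the usual choice
```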
The mean square error (MSE) of $L_2$-normalized vectors is usually used as the loss function [8]:
$$\ell(z_1, z_2) = \Big\| \frac{z_1}{\|z_1\|_2} - \frac{z_2}{\|z_2\|_2} \Big\|_2^2, \qquad (2)$$
where $\|\cdot\|_2$ denotes the $L_2$ norm. This loss is also equivalent to the negative cosine similarity, up to
a scale of $\frac{1}{2}$ and an optimization-irrelevant constant.
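A minimal sketch of this loss and of the stated equivalence (an illustration written for this exposition, independent of the released code):

```python
import torch
import torch.nn.functional as F

def normalized_mse(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    # Eqn. 2: squared distance between L2-normalized embeddings, averaged over the mini-batch.
    z1n, z2n = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    return ((z1n - z2n) ** 2).sum(dim=1).mean()

z1, z2 = torch.randn(256, 64), torch.randn(256, 64)
mse = normalized_mse(z1, z2)
neg_cos = -F.cosine_similarity(z1, z2, dim=1).mean()
# ||a - b||^2 = 2 - 2 a.b for unit vectors, so Eqn. 2 equals 2 + 2 * (negative cosine similarity).
assert torch.allclose(mse, 2 + 2 * neg_cos, atol=1e-5)
```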
Collapse and Whitening Loss. While minimizing Eqn. 1, a trivial solution known as collapse
could occur such that $f_\theta(x) \equiv \mathbf{c},\ \forall x \in D$. The state of collapse provides no gradients
for learning and offers no information for discrimination. Moreover, a weaker collapse condition
called dimensional collapse can easily arise, in which the projected features collapse into a
low-dimensional manifold. As illustrated in [24], dimensional collapse is associated with strong
correlations between axes, which motivates the use of whitening methods to avoid dimensional
collapse. The general idea of whitening loss [13] is to minimize Eqn. 1 under the condition that
embeddings from different views are whitened, which can be formulated as³:
$$\min_\theta\; \mathcal{L}(x;\theta) = \mathbb{E}_{x \sim D,\, T_{1,2} \sim \mathcal{T}}\; \ell(z_1, z_2), \quad \text{s.t. } \mathrm{cov}(z_i, z_i) = I,\ i \in \{1, 2\}. \qquad (3)$$
Whitening loss provides a theoretical guarantee of avoiding (dimensional) collapse, since the embedding
is whitened with all axes decorrelated [13, 24]. While it is difficult to directly solve the problem
of Eqn. 3, Ermolov et al. [13] propose to whiten the mini-batch embedding $Z \in \mathbb{R}^{d_z \times m}$ using
batch whitening (BW) [25, 41] and impose the loss on the whitened output $\widehat{Z} \in \mathbb{R}^{d_z \times m}$, given the
mini-batch input $X$ of size $m$, as follows:
$$\min_\theta\; \mathcal{L}(X;\theta) = \mathbb{E}_{X \sim D,\, T_{1,2} \sim \mathcal{T}}\; \|\widehat{Z}_1 - \widehat{Z}_2\|_F^2, \quad \text{with } \widehat{Z}_i = \Phi(Z_i),\ i \in \{1, 2\}, \qquad (4)$$
where $\Phi(\cdot)$ denotes the whitening transformation over mini-batch data.

³ The dual-view formulation can be extended to $s$ different views, as shown in [13].
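The structure of this objective is sketched below (a simplified illustration under our own assumptions; `whiten` stands in for $\Phi$ and can be set to one of the transformations described in the next paragraph):

```python
import torch

def whitening_loss(z1: torch.Tensor, z2: torch.Tensor, whiten) -> torch.Tensor:
    # z1, z2: (d_z x m) embeddings of the two views; whiten: a callable implementing Phi.
    z1_hat, z2_hat = whiten(z1), whiten(z2)              # whitened outputs, as in Eqn. 4
    return (z1_hat - z2_hat).pow(2).sum() / z1.shape[1]  # ||Z1_hat - Z2_hat||_F^2 / m, as in Eqn. 5
```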
Figure 2: Effects of different whitening transformations for SSL. We use ResNet-18 as the encoder
(dimension of the representation is 512) and a two-layer MLP with ReLU and BN appended as the projector
(dimension of the embedding is 64). The model is trained on CIFAR-10 for 200 epochs with a batch size
of 256 and standard data augmentation, using the Adam optimizer [30] (see Appendix A.2 for more details of the
experimental setup). We show (a) the linear evaluation accuracy; (b) the training loss; (c)
the rank of the embedding; (d) the rank of the encoding.
Whitening Transformations. There are an infinite number of possible whitening matrices, as
shown in [29, 25], since any whitened data remains whitened after a rotation. For simplicity of notation,
we assume $Z$ is centered by $Z := Z(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T)$. Ermolov et al. [13] propose W-MSE, which uses
Cholesky decomposition (CD) whitening in Eqn. 4: $\Phi_{CD}(Z) = L^{-1}Z$, where $L$ is a lower triangular
matrix from the Cholesky decomposition, with $LL^T = \Sigma$. Here $\Sigma = \frac{1}{m}ZZ^T$ is the covariance matrix
of the embedding. Hua et al. [24] use zero-phase component analysis (ZCA) whitening [25] in Eqn. 4:
$\Phi_{ZCA}(Z) = U\Lambda^{-\frac{1}{2}}U^T Z$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_{d_z})$ and $U = [u_1, \dots, u_{d_z}]$ are the eigenvalues and
associated eigenvectors of $\Sigma$, i.e., $U\Lambda U^T = \Sigma$. Another famous whitening is principal components
analysis (PCA) whitening: $\Phi_{PCA}(Z) = \Lambda^{-\frac{1}{2}}U^T Z$ [29, 25].
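For concreteness, the sketch below implements the three transformations with plain PyTorch linear algebra (illustrative versions written for this exposition; practical implementations add further numerical safeguards, and the small `eps` used here is an assumption):

```python
import torch

def center(z: torch.Tensor) -> torch.Tensor:
    # Z := Z (I - (1/m) 1 1^T), i.e. subtract the per-dimension mean over the batch.
    return z - z.mean(dim=1, keepdim=True)

def whiten(z: torch.Tensor, kind: str = "zca", eps: float = 1e-5) -> torch.Tensor:
    # z: (d_z x m) embedding; Sigma = (1/m) Z Z^T is its covariance matrix.
    z = center(z)
    d, m = z.shape
    sigma = z @ z.t() / m + eps * torch.eye(d, device=z.device)
    if kind == "cd":                          # CD whitening: Phi_CD(Z) = L^{-1} Z, with L L^T = Sigma
        L = torch.linalg.cholesky(sigma)
        return torch.linalg.solve_triangular(L, z, upper=False)
    lam, U = torch.linalg.eigh(sigma)         # Sigma = U Lambda U^T
    if kind == "pca":                         # PCA whitening: Phi_PCA(Z) = Lambda^{-1/2} U^T Z
        return torch.diag(lam.rsqrt()) @ U.t() @ z
    return U @ torch.diag(lam.rsqrt()) @ U.t() @ z   # ZCA whitening: Phi_ZCA(Z) = U Lambda^{-1/2} U^T Z

z = torch.randn(64, 256)
for kind in ["zca", "cd", "pca"]:
    z_hat = whiten(z, kind)
    cov = z_hat @ z_hat.t() / z.shape[1]
    assert torch.allclose(cov, torch.eye(64), atol=1e-2)   # each output is (approximately) whitened
```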
3.2 Empirical Investigation on Whitening Loss
In this section, we conduct experiments to investigate the effects of different whitening transformations
$\Phi(\cdot)$ used in Eqn. 4 for SSL. Besides, we investigate the performance of different features (including the
encoding $H$, the embedding $Z$ and the whitened output $\widehat{Z}$) when used as the representation for evaluation. For
illustration, we first define the rank and stable-rank [46] of a matrix as follows:
Definition 1. Given a matrix $A \in \mathbb{R}^{d \times m}$, $d \le m$, we denote by $\{\lambda_1, \dots, \lambda_d\}$ the singular values of $A$ in
descending order, with the convention $\lambda_1 > 0$. The rank of $A$ is the number of its non-zero singular values,
denoted as $\mathrm{Rank}(A) = \sum_{i=1}^{d} I(\lambda_i > 0)$, where $I(\cdot)$ is the indicator function. The stable-rank of $A$
is denoted as $r(A) = \sum_{i=1}^{d} \frac{\lambda_i}{\lambda_1}$.

By definition, $\mathrm{Rank}(A)$ is a good indicator for evaluating the extent of dimensional collapse of $A$,
and $r(A)$ is an indicator for evaluating the extent of whitening of $A$. It can be demonstrated that
$r(A) \le \mathrm{Rank}(A) \le d$ [46]. Note that if $A$ is fully whitened with covariance matrix $\frac{1}{m}AA^T = I$ (i.e., $AA^T = mI$),
we have $r(A) = \mathrm{Rank}(A) = d$. We also define the normalized rank $\widehat{\mathrm{Rank}}(A) = \frac{\mathrm{Rank}(A)}{d}$ and the
normalized stable-rank $\widehat{r}(A) = \frac{r(A)}{d}$, for comparing the extent of dimensional collapse and of
whitening across matrices with different dimensions, respectively.
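Both indicators can be computed directly from the singular values; below is a small sketch (a helper written for illustration, using an explicit numerical tolerance in place of exact zero):

```python
import torch

def rank_and_stable_rank(a: torch.Tensor, tol: float = 1e-6):
    # a: (d x m) matrix with d <= m; singular values are returned in descending order.
    s = torch.linalg.svdvals(a)
    rank = int((s > tol * s[0]).sum())        # Rank(A): number of (numerically) non-zero singular values
    stable_rank = float((s / s[0]).sum())     # r(A) = sum_i lambda_i / lambda_1
    d = a.shape[0]
    return rank / d, stable_rank / d          # normalized rank and normalized stable-rank

# A fully whitened matrix (A A^T = m I) has normalized rank and stable-rank both equal to 1.
m, d = 256, 64
a = torch.linalg.qr(torch.randn(m, d)).Q.t() * (m ** 0.5)
print(rank_and_stable_rank(a))                # ~ (1.0, 1.0)
```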
PCA Whitening Fails to Avoid Dimensional Collapse. We compare the effects of the ZCA, CD and
PCA transformations for whitening loss, evaluated on CIFAR-10 using the standard setup for SSL
(see Section 4.1 for details). Besides, we also provide the result of batch normalization (BN), which
only performs standardization without decorrelating the axes, and of the 'Plain' method, which imposes
the loss directly on the embedding. From Figure 2, we observe that naively training a Siamese network
('Plain') results in collapse of both the embedding (Figure 2(c)) and the encoding (Figure 2(d)), which
significantly hampers performance (Figure 2(a)), even though its training loss approaches zero
(Figure 2(b)). We also observe that an extra BN imposed on the embedding prevents collapse to a
point. However, it suffers from dimensional collapse, where the ranks of the embedding and encoding
are significantly low, which also hampers performance. ZCA and CD whitening both maintain a
high rank of the embedding and encoding by decorrelating the axes, ensuring high linear evaluation
accuracy. However, we note that PCA whitening shows significantly different behavior: it cannot
decrease the loss and cannot even avoid dimensional collapse, which leads to significantly degraded
performance. This interesting observation challenges the motivations of whitening loss for SSL. We
defer the analyses and illustration to Section 3.3.
Whitened Output is not a Good Representation. As introduced before, the motivation of whitening
loss for SSL is that the whitening operation can remove the correlation among axes [24] and a
whitened representation ensures that the examples are scattered in a spherical distribution [13], which
is sufficient to avoid collapse. Based on this argument, one should use the whitened output $\widehat{Z}$ as
the representation for downstream tasks, rather than the encoding $H$ that is commonly used. This
raises the questions of whether $H$ is well whitened and whether the whitened output is a good feature.
We conduct experiments to compare the performance of whitening loss when using $H$, $Z$ and $\widehat{Z}$
as the representation for evaluation, respectively. The results are shown in Figure 3. We observe
that using the whitened output $\widehat{Z}$ as the representation yields significantly worse performance than using $H$.
Furthermore, we find that the normalized stable-rank of $H$ is significantly smaller than 100%, which
suggests that $H$ is not well whitened. These results show that the whitened output is not a
good representation.

Figure 3: Comparisons of features when using the encoding $H$, the embedding $Z$ and the whitened output $\widehat{Z}$,
respectively. We follow the same experimental setup as Figure 2. We show (a) the linear evaluation
accuracy; (b) the kNN accuracy; (c) the normalized stable-rank for comparing the extent of whitening
(note that the normalized stable-rank of $\widehat{Z}$ is always 100% during training and we omit it for clarity).
The results are averaged over five random seeds, with the standard deviation shown as a shaded region.
3.3 Analysing Decomposition of Whitening Loss
For clarity, we use a mini-batch input of size $m$. Given one mini-batch input $X$ with two
augmented views, Eqn. 4 can be formulated as:
$$\mathcal{L}(X) = \frac{1}{m}\|\widehat{Z}_1 - \widehat{Z}_2\|_F^2. \qquad (5)$$
Let us consider a proxy loss described as:
$$\mathcal{L}'(X) = \underbrace{\frac{1}{m}\|\widehat{Z}_1 - (\widehat{Z}_2)_{st}\|_F^2}_{\mathcal{L}'_1} + \underbrace{\frac{1}{m}\|(\widehat{Z}_1)_{st} - \widehat{Z}_2\|_F^2}_{\mathcal{L}'_2}, \qquad (6)$$
where $(\cdot)_{st}$ indicates the stop-gradient operation. It is easy to demonstrate that $\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}'}{\partial \theta}$ (see
Appendix B.1 for the proof). That is, the optimization dynamics of $\mathcal{L}$ are equivalent to those of $\mathcal{L}'$.
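A numerical sketch of this equivalence (our own check, using ZCA whitening for $\Phi$ and `.detach()` for the stop-gradient operation): the gradients of $\mathcal{L}$ and $\mathcal{L}'$ with respect to the embeddings coincide.

```python
import torch

def zca_whiten(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Phi_ZCA(Z) = U Lambda^{-1/2} U^T Z applied to the centered (d_z x m) mini-batch embedding.
    z = z - z.mean(dim=1, keepdim=True)
    sigma = z @ z.t() / z.shape[1] + eps * torch.eye(z.shape[0])
    lam, U = torch.linalg.eigh(sigma)
    return U @ torch.diag(lam.rsqrt()) @ U.t() @ z

def loss_symmetric(z1h, z2h, m):
    # Eqn. 5: (1/m) ||Z1_hat - Z2_hat||_F^2
    return (z1h - z2h).pow(2).sum() / m

def loss_proxy(z1h, z2h, m):
    # Eqn. 6: asymmetric decomposition with stop-gradient on the respective targets
    return ((z1h - z2h.detach()).pow(2).sum() + (z1h.detach() - z2h).pow(2).sum()) / m

def grads(loss_fn):
    torch.manual_seed(0)                       # identical data for both loss functions
    z1 = torch.randn(64, 256, requires_grad=True)
    z2 = torch.randn(64, 256, requires_grad=True)
    loss_fn(zca_whiten(z1), zca_whiten(z2), z1.shape[1]).backward()
    return z1.grad, z2.grad

(g1, g2), (g1p, g2p) = grads(loss_symmetric), grads(loss_proxy)
assert torch.allclose(g1, g1p, atol=1e-5) and torch.allclose(g2, g2p, atol=1e-5)
```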
By looking into the first term of Eqn. 6, we have:
$$\mathcal{L}'_1 = \frac{1}{m}\|\phi(Z_1)Z_1 - (\widehat{Z}_2)_{st}\|_F^2. \qquad (7)$$
Here, we can view $\phi(Z_1)$ (the whitening matrix computed from $Z_1$) as a predictor that depends on $Z_1$
during forward propagation, and $\widehat{Z}_2$ as a whitened target with $r(\widehat{Z}_2) = \mathrm{Rank}(\widehat{Z}_2) = d_z$. In this way,
we find that minimizing $\mathcal{L}'_1$ only requires the embedding $Z_1$ to be full-rank, i.e., $\mathrm{Rank}(Z_1) = d_z$, as
stated by the following proposition.
Proposition 1. Let $\mathbb{A} = \arg\min_{Z_1} \mathcal{L}'_1(Z_1)$. We have that $\mathbb{A}$ is not an empty set, and $\forall Z_1 \in \mathbb{A}$, $Z_1$ is
full-rank. Furthermore, for any $\{\sigma_i\}_{i=1}^{d_z}$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{d_z} > 0$, we construct
$\widetilde{\mathbb{A}} = \{Z_1 \,|\, Z_1 = U_2\, \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_{d_z})\, V_2^T\}$, where $U_2 \in \mathbb{R}^{d_z \times d_z}$ and $V_2 \in \mathbb{R}^{m \times d_z}$ are from the singular value
decomposition of $\widehat{Z}_2$, i.e., $U_2(\sqrt{m}I)V_2^T = \widehat{Z}_2$. When we use ZCA whitening, we have $\widetilde{\mathbb{A}} \subseteq \mathbb{A}$.
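As a numerical sanity check of this construction (our own sketch, assuming ZCA whitening for $\Phi$): any $Z_1$ built from the singular vectors of $\widehat{Z}_2$ with arbitrary positive singular values is mapped by ZCA whitening back onto $\widehat{Z}_2$, so $\mathcal{L}'_1(Z_1) \approx 0$ even though $Z_1$ itself is merely full-rank, not whitened.

```python
import torch

def zca_whiten(z: torch.Tensor) -> torch.Tensor:
    # Phi_ZCA(Z) = U Lambda^{-1/2} U^T Z on the centered mini-batch (full-rank here, so no eps is needed).
    z = z - z.mean(dim=1, keepdim=True)
    lam, U = torch.linalg.eigh(z @ z.t() / z.shape[1])
    return U @ torch.diag(lam.rsqrt()) @ U.t() @ z

d_z, m = 64, 256
z2_hat = zca_whiten(torch.randn(d_z, m))                       # whitened target Z2_hat of the second view
U2, S2, V2h = torch.linalg.svd(z2_hat, full_matrices=False)    # Z2_hat = U2 diag(S2) V2^T, S2 ~ sqrt(m)

sigma = torch.rand(d_z).sort(descending=True).values + 0.1     # arbitrary positive "singular values"
z1 = U2 @ torch.diag(sigma) @ V2h                              # a member of the constructed set A-tilde

loss_1 = (zca_whiten(z1) - z2_hat).pow(2).sum() / m            # L'_1 with the whitened target held fixed
print(float(loss_1))                                           # close to zero
```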
The proof is shown in Appendix B.2. Proposition 1 states that there are infinitely many full-rank
matrices that are optima when minimizing $\mathcal{L}'_1$ w.r.t. $Z_1$. Therefore, minimizing $\mathcal{L}'_1$ only requires the
embedding $Z_1$ to be full-rank with $\mathrm{Rank}(Z_1) = d_z$, and does not necessarily impose the constraint
that $Z_1$ be whitened with $r(Z_1) = d_z$. A similar analysis also applies to $\mathcal{L}'_2$, and minimizing $\mathcal{L}'_2$