An Investigation into Whitening Loss for
Self-supervised Learning
Xi Weng1, Lei Huang1,2, Lei Zhao1,
Rao Muhammad Anwer2, Salman Khan2, Fahad Shahbaz Khan2
1SKLSDE, Institute of Artificial Intelligence, Beihang University, Beijing, China
2Mohamed bin Zayed University of Artificial Intelligence, UAE
Abstract
A desirable objective in self-supervised learning (SSL) is to avoid feature collapse.
Whitening loss guarantees collapse avoidance by minimizing the distance between
embeddings of positive pairs under the condition that the embeddings from
different views are whitened. In this paper, we propose a framework with an
informative indicator to analyze whitening loss, which provides a clue to demystify
several interesting phenomena as well as a pivoting point connecting to other SSL
methods. We reveal that batch whitening (BW) based methods do not impose
whitening constraints on the embedding; they only require the embedding
to be full-rank. This full-rank constraint is also sufficient to avoid dimensional
collapse. Based on our analysis, we propose channel whitening with random
group partition (CW-RGP), which exploits the advantages of BW-based methods
in preventing collapse and avoids their disadvantage of requiring a large batch size.
Experimental results on ImageNet classification and COCO object detection reveal
that the proposed CW-RGP possesses promising potential for learning good
representations. The code is available at https://github.com/winci-ai/CW-RGP.
1 Introduction
Self-supervised learning (SSL) has made significant progress over the last several years [1, 21, 6, 18, 8], almost reaching the performance of supervised baselines on many downstream tasks [36, 27, 38].
Several recent approaches rely on a joint embedding architecture in which a dual pair of networks are
trained to produce similar embeddings for different views of the same image [8]. Such methods aim
to learn representations that are invariant to transformations of the same input.
with the joint embedding architectures is how to prevent a collapse of representation, in which the
two branches ignore the inputs and produce identical and constant output representations [6, 8].
One line of work uses contrastive learning methods that attract different views from the same image
(positive pairs) while pulling apart different images (negative pairs), which prevents constant outputs
from the solution space [47]. While the concept is simple, these methods need a large batch size to
obtain good performance [21, 6, 40]. Another line of work tries to directly match the positive targets
without introducing negative pairs. A seminal approach, BYOL [18], shows that an extra predictor
and a momentum encoder are essential for representation learning. SimSiam [8] further generalizes [18] by
empirically showing that stop-gradient is essential for preventing trivial solutions. Recent works
generalize the collapse problem into dimensional collapse [24, 28] (also referred to as informational
collapse in [2]), where the embedding vectors only span a lower-dimensional subspace and are highly
correlated; the embedding dimensions therefore vary together and contain redundant information. To
prevent dimensional collapse, whitening loss is proposed: it minimizes the distance between embeddings
of positive pairs under the condition that embeddings from different views are whitened [13, 24]. A
typical approach uses batch whitening (BW) and imposes the loss on the whitened output [13, 24],
which obtains promising results.

Equal contribution. Corresponding author (huangleiAI@buaa.edu.cn). This work was partially done while
Lei Huang was a visiting scholar at Mohamed bin Zayed University of Artificial Intelligence, UAE.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.03586v1 [cs.CV] 7 Oct 2022
Although whitening loss has a theoretical guarantee of avoiding collapse, we experimentally observe
that this guarantee depends on which kind of whitening transformation [29] is used in practice
(see Section 3.2 for details). This interesting observation challenges the motivations of whitening
loss for SSL. Besides, the motivation of whitening loss is that the whitening operation can remove
the correlation among axes [24] and a whitened representation ensures the examples are scattered in
a spherical distribution [13]. Based on this argument, one could use the whitened output as the
representation for downstream tasks, but this is not done in practice. To this end, this paper investigates
whitening loss and tries to demystify these interesting observations. Our contributions are as follows:
• We decompose the symmetric formulation of whitening loss into two asymmetric losses,
where each asymmetric loss requires an online network to match a whitened target. This
mechanism provides a pivoting point connecting to other methods, and a way to understand
why certain whitening transformations fail to avoid dimensional collapse.
• Our analysis shows that BW-based methods do not impose whitening constraints on the
embedding; they only require the embedding to be full-rank. This full-rank constraint is
also sufficient to avoid dimensional collapse.
• We propose channel whitening with random group partition (CW-RGP), which exploits
the advantages of BW-based methods in preventing collapse and avoids their disadvantage
of requiring a large batch size. Experimental results on ImageNet classification and COCO
object detection show that CW-RGP has promising potential for learning good representations.
2 Related Work
A desirable objective in self-supervised learning is to avoid feature collapse.
Contrastive learning prevents collapse by attracting positive samples closer and spreading negative
samples apart [47, 48]. In these methods, negative samples play an important role and need to
be well designed [37, 1, 23]. One typical mechanism is building a memory bank with a momentum
encoder to provide consistent negative samples, proposed in MoCos [21], yielding promising results
[21, 7, 9, 33]. Other works include SimCLR [6], which shows that more negative samples in a batch
combined with strong data augmentation perform better. Contrastive methods require large batch sizes
or memory banks, which tends to be costly, prompting the question of whether negative pairs are necessary.
Non-contrastive methods aim to accomplish SSL without introducing negative pairs explicitly
[3, 4, 34, 18, 8]. One typical way to avoid representational collapse is the introduction of an asymmetric
network architecture. BYOL [18] appends a predictor after the online network and introduces
momentum into the target network. SimSiam [8] further simplifies BYOL by removing the momentum
mechanism, and shows that stop-gradient applied to the target network serves as an alternative approximation
to the momentum encoder. Other progress includes an asymmetric pipeline with a self-distillation loss
for Vision Transformers [5]. It remains unclear how the asymmetric network avoids collapse without
negative pairs, leaving ongoing debates on batch normalization (BN) [15, 45, 39] and stop-gradient [8, 50],
even though preliminary works have attempted to analyze the training dynamics theoretically under
certain assumptions [44] and to build a connection between asymmetric networks and contrastive
learning methods [42]. Our work provides a pivoting point connecting asymmetric networks to
whitening loss in avoiding collapse.
Whitening loss has a theoretical guarantee of avoiding collapse by minimizing the distance of positive
pairs under the condition that the embeddings from different views are whitened [49, 13, 24, 2].
One way to obtain a whitened output is to impose a whitening penalty as a regularization on the embedding
(the so-called soft whitening), as proposed in Barlow Twins [49], VICReg [2] and CCA-SSG [51].
Another way is to use batch whitening (BW) [25] (the so-called hard whitening), which is used in
W-MSE [13] and Shuffled-DBN [24]. We propose a different hard whitening method, channel
whitening (CW), that serves the same purpose of ensuring that all singular values of the transformed output
are one, thereby avoiding collapse. However, CW is more numerically stable and works better than BW when the
batch size is small. Furthermore, our CW with random group partition (CW-RGP) can effectively
control the extent of the constraint on the embedding and obtains better performance in practice. We note
that a recent work, ICL [52], proposes to decorrelate instances, like CW, but with several significant
differences in technical details. ICL uses stop-gradient for the whitening matrix, while CW requires
back-propagation through the whitening transformation. Besides, ICL uses extra pre-conditioning on
the covariance and whitening matrices, which is essential for its numerical stability, while CW does
not use extra pre-conditioning and can work well since it encourages the embedding to be full-rank.
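To make this distinction concrete, the following sketch (a simplified illustration only, not code from either method's release; `compute_whitening_matrix` is an assumed helper) contrasts detaching the whitening matrix from the computational graph with back-propagating through it:

```python
import torch

def compute_whitening_matrix(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # Illustrative ZCA-style whitening matrix of a centered (d x m) mini-batch embedding.
    d, m = z.shape
    cov = z @ z.t() / m + eps * torch.eye(d, device=z.device)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    return eigvecs @ torch.diag(eigvals.rsqrt()) @ eigvecs.t()

z = torch.randn(64, 256, requires_grad=True)   # embedding Z of shape (d_z, m)
zc = z - z.mean(dim=1, keepdim=True)           # center over the batch
w = compute_whitening_matrix(zc)
z_hat_stopgrad = w.detach() @ zc               # ICL-style: no gradient flows through the whitening matrix
z_hat_backprop = w @ zc                        # CW-style: back-propagate through the whitening transform
```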
3 Exploring Whitening Loss for SSL
3.1 Preliminaries
Figure 1: The basic notations for SSL used in this paper ($H$: encoding, $Z$: embedding, $\widehat{Z}$: whitened output). Two augmented views $X_1, X_2$ of an input $X$ (with $T \sim \mathcal{T}$) pass through a Siamese network (encoder and projector), followed by whitening, and the distance is computed between the whitened outputs.
Let $x$ denote the input sampled uniformly from a set of images $D$, and $\mathcal{T}$ denote the set of data
transformations available for augmentation. We consider the Siamese network $f_\theta(\cdot)$ parameterized by
$\theta$. It takes as input two randomly augmented views, $x_1 = T_1(x)$ and $x_2 = T_2(x)$, where $T_1, T_2 \in \mathcal{T}$.
The network $f_\theta(\cdot)$ is trained with an objective function that minimizes the distance between
embeddings obtained from different views of the same image:
$$\mathcal{L}(x, \theta) = \mathbb{E}_{x \sim D,\, T_{1,2} \sim \mathcal{T}}\; \ell\big(f_\theta(T_1(x)),\, f_\theta(T_2(x))\big), \qquad (1)$$
where $\ell(\cdot,\cdot)$ is a loss function. In particular, the Siamese network usually consists of an encoder
$E_{\theta_e}(\cdot)$ and a projector $G_{\theta_g}(\cdot)$. Their outputs $h = E_{\theta_e}(T(x))$ and $z = G_{\theta_g}(h)$ are referred to as the
encoding and the embedding, respectively. We summarize the notations in Figure 1 and use the corresponding capital
letters to denote mini-batch data. Under this notation, we have $f_\theta(\cdot) = G_{\theta_g}(E_{\theta_e}(\cdot))$ with
learnable parameters $\theta = \{\theta_e, \theta_g\}$. The encoding $h$ is usually used as the representation for evaluation,
either by training a linear classifier [21] or by transferring to downstream tasks. This is because $h$ is
shown to obtain significantly better performance than the embedding $z$ [6, 8].
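As a concrete illustration of these notations, the sketch below (a toy example for illustration only; the encoder, projector and augmentation are placeholders rather than the architectures used in this paper) computes the encodings and embeddings of two views and evaluates the objective of Eqn. 1 on a mini-batch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU())  # E_{theta_e} (placeholder)
projector = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 64))    # G_{theta_g} (placeholder)

def f(x):
    h = encoder(x)      # encoding h = E(x), later used as the representation for evaluation
    z = projector(h)    # embedding z = G(h), on which the SSL loss is imposed
    return h, z

def augment(x):
    return x + 0.1 * torch.randn_like(x)   # toy stand-in for a transformation T drawn from the set T

x = torch.randn(256, 3, 32, 32)            # a mini-batch X of images
h1, z1 = f(augment(x))                     # view 1: f(T1(x))
h2, z2 = f(augment(x))                     # view 2: f(T2(x))
loss = F.mse_loss(z1, z2)                  # placeholder distance l(.,.); Eqn. 2 below gives the usual choice
```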
The mean square error (MSE) of $L_2$-normalized vectors is usually used as the loss function [8]:
$$\ell(z_1, z_2) = \Big\| \frac{z_1}{\|z_1\|_2} - \frac{z_2}{\|z_2\|_2} \Big\|_2^2, \qquad (2)$$
where $\|\cdot\|_2$ denotes the $L_2$ norm. This loss is also equivalent to the negative cosine similarity, up to
a scale of $\frac{1}{2}$ and an optimization-irrelevant constant.
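A minimal sketch of this loss and of the stated equivalence (an illustration written for this exposition, independent of the released code):

```python
import torch
import torch.nn.functional as F

def normalized_mse(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    # Eqn. 2: squared distance between L2-normalized embeddings, averaged over the mini-batch.
    z1n, z2n = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    return ((z1n - z2n) ** 2).sum(dim=1).mean()

z1, z2 = torch.randn(256, 64), torch.randn(256, 64)
mse = normalized_mse(z1, z2)
neg_cos = -F.cosine_similarity(z1, z2, dim=1).mean()
# ||a - b||^2 = 2 - 2 a.b for unit vectors, so Eqn. 2 equals 2 + 2 * (negative cosine similarity).
assert torch.allclose(mse, 2 + 2 * neg_cos, atol=1e-5)
```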
Collapse and Whitening Loss. While minimizing Eqn. 1, a trivial solution known as collapse
could occur such that $f_\theta(x) \equiv \mathbf{c},\ \forall x \in D$. The state of collapse provides no gradients
for learning and offers no information for discrimination. Moreover, a weaker collapse condition
called dimensional collapse can easily arise, in which the projected features collapse into a
low-dimensional manifold. As illustrated in [24], dimensional collapse is associated with strong
correlations between axes, which motivates the use of whitening methods to avoid dimensional
collapse. The general idea of whitening loss [13] is to minimize Eqn. 1 under the condition that
embeddings from different views are whitened, which can be formulated as³:
$$\min_\theta\; \mathcal{L}(x;\theta) = \mathbb{E}_{x \sim D,\, T_{1,2} \sim \mathcal{T}}\; \ell(z_1, z_2), \quad \text{s.t. } \mathrm{cov}(z_i, z_i) = I,\ i \in \{1, 2\}. \qquad (3)$$
Whitening loss provides a theoretical guarantee of avoiding (dimensional) collapse, since the embedding
is whitened with all axes decorrelated [13, 24]. While it is difficult to directly solve the problem
of Eqn. 3, Ermolov et al. [13] propose to whiten the mini-batch embedding $Z \in \mathbb{R}^{d_z \times m}$ using
batch whitening (BW) [25, 41] and impose the loss on the whitened output $\widehat{Z} \in \mathbb{R}^{d_z \times m}$, given the
mini-batch input $X$ of size $m$, as follows:
$$\min_\theta\; \mathcal{L}(X;\theta) = \mathbb{E}_{X \sim D,\, T_{1,2} \sim \mathcal{T}}\; \|\widehat{Z}_1 - \widehat{Z}_2\|_F^2, \quad \text{with } \widehat{Z}_i = \Phi(Z_i),\ i \in \{1, 2\}, \qquad (4)$$
where $\Phi(\cdot)$ denotes the whitening transformation over mini-batch data.

³ The dual-view formulation can be extended to $s$ different views, as shown in [13].
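The structure of this objective is sketched below (a simplified illustration under our own assumptions; `whiten` stands in for $\Phi$ and can be set to one of the transformations described in the next paragraph):

```python
import torch

def whitening_loss(z1: torch.Tensor, z2: torch.Tensor, whiten) -> torch.Tensor:
    # z1, z2: (d_z x m) embeddings of the two views; whiten: a callable implementing Phi.
    z1_hat, z2_hat = whiten(z1), whiten(z2)              # whitened outputs, as in Eqn. 4
    return (z1_hat - z2_hat).pow(2).sum() / z1.shape[1]  # ||Z1_hat - Z2_hat||_F^2 / m, as in Eqn. 5
```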
Figure 2: Effects of different whitening transformations for SSL. We use ResNet-18 as the encoder
(dimension of the representation is 512) and a two-layer MLP with ReLU and BN appended as the projector
(dimension of the embedding is 64). The model is trained on CIFAR-10 for 200 epochs with a batch size
of 256 and standard data augmentation, using the Adam optimizer [30] (see Appendix A.2 for more details of the
experimental setup). We show (a) the linear evaluation accuracy; (b) the training loss; (c)
the rank of the embedding; (d) the rank of the encoding.
Whitening Transformations. There are an infinite number of possible whitening matrices, as
shown in [29, 25], since any whitened data remains whitened after a rotation. For simplicity of notation,
we assume $Z$ is centered by $Z := Z(I - \frac{1}{m}\mathbf{1}\mathbf{1}^T)$. Ermolov et al. [13] propose W-MSE, which uses
Cholesky decomposition (CD) whitening in Eqn. 4: $\Phi_{CD}(Z) = L^{-1}Z$, where $L$ is a lower triangular
matrix from the Cholesky decomposition, with $LL^T = \Sigma$. Here $\Sigma = \frac{1}{m}ZZ^T$ is the covariance matrix
of the embedding. Hua et al. [24] use zero-phase component analysis (ZCA) whitening [25] in Eqn. 4:
$\Phi_{ZCA}(Z) = U\Lambda^{-\frac{1}{2}}U^T Z$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_{d_z})$ and $U = [u_1, \dots, u_{d_z}]$ are the eigenvalues and
associated eigenvectors of $\Sigma$, i.e., $U\Lambda U^T = \Sigma$. Another famous whitening is principal components
analysis (PCA) whitening: $\Phi_{PCA}(Z) = \Lambda^{-\frac{1}{2}}U^T Z$ [29, 25].
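For concreteness, the sketch below implements the three transformations with plain PyTorch linear algebra (illustrative versions written for this exposition; practical implementations add further numerical safeguards, and the small `eps` used here is an assumption):

```python
import torch

def center(z: torch.Tensor) -> torch.Tensor:
    # Z := Z (I - (1/m) 1 1^T), i.e. subtract the per-dimension mean over the batch.
    return z - z.mean(dim=1, keepdim=True)

def whiten(z: torch.Tensor, kind: str = "zca", eps: float = 1e-5) -> torch.Tensor:
    # z: (d_z x m) embedding; Sigma = (1/m) Z Z^T is its covariance matrix.
    z = center(z)
    d, m = z.shape
    sigma = z @ z.t() / m + eps * torch.eye(d, device=z.device)
    if kind == "cd":                          # CD whitening: Phi_CD(Z) = L^{-1} Z, with L L^T = Sigma
        L = torch.linalg.cholesky(sigma)
        return torch.linalg.solve_triangular(L, z, upper=False)
    lam, U = torch.linalg.eigh(sigma)         # Sigma = U Lambda U^T
    if kind == "pca":                         # PCA whitening: Phi_PCA(Z) = Lambda^{-1/2} U^T Z
        return torch.diag(lam.rsqrt()) @ U.t() @ z
    return U @ torch.diag(lam.rsqrt()) @ U.t() @ z   # ZCA whitening: Phi_ZCA(Z) = U Lambda^{-1/2} U^T Z

z = torch.randn(64, 256)
for kind in ["zca", "cd", "pca"]:
    z_hat = whiten(z, kind)
    cov = z_hat @ z_hat.t() / z.shape[1]
    assert torch.allclose(cov, torch.eye(64), atol=1e-2)   # each output is (approximately) whitened
```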
3.2 Empirical Investigation on Whitening Loss
In this section, we conduct experiments to investigate the effects of different whitening transformations
$\Phi(\cdot)$ used in Eqn. 4 for SSL. Besides, we investigate the performance of different features (including the
encoding $H$, the embedding $Z$ and the whitened output $\widehat{Z}$) when used as the representation for evaluation. For
illustration, we first define the rank and stable-rank [46] of a matrix as follows:
Definition 1. Given a matrix $A \in \mathbb{R}^{d \times m}$, $d \le m$, we denote by $\{\lambda_1, \dots, \lambda_d\}$ the singular values of $A$ in
descending order, with the convention $\lambda_1 > 0$. The rank of $A$ is the number of its non-zero singular values,
denoted as $\mathrm{Rank}(A) = \sum_{i=1}^{d} I(\lambda_i > 0)$, where $I(\cdot)$ is the indicator function. The stable-rank of $A$
is denoted as $r(A) = \sum_{i=1}^{d} \frac{\lambda_i}{\lambda_1}$.

By definition, $\mathrm{Rank}(A)$ is a good indicator for evaluating the extent of dimensional collapse of $A$,
and $r(A)$ is an indicator for evaluating the extent of whitening of $A$. It can be demonstrated that
$r(A) \le \mathrm{Rank}(A) \le d$ [46]. Note that if $A$ is fully whitened with covariance matrix $\frac{1}{m}AA^T = I$ (i.e., $AA^T = mI$),
we have $r(A) = \mathrm{Rank}(A) = d$. We also define the normalized rank $\widehat{\mathrm{Rank}}(A) = \frac{\mathrm{Rank}(A)}{d}$ and the
normalized stable-rank $\widehat{r}(A) = \frac{r(A)}{d}$, for comparing the extent of dimensional collapse and of
whitening across matrices with different dimensions, respectively.
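Both indicators can be computed directly from the singular values; below is a small sketch (a helper written for illustration, using an explicit numerical tolerance in place of exact zero):

```python
import torch

def rank_and_stable_rank(a: torch.Tensor, tol: float = 1e-6):
    # a: (d x m) matrix with d <= m; singular values are returned in descending order.
    s = torch.linalg.svdvals(a)
    rank = int((s > tol * s[0]).sum())        # Rank(A): number of (numerically) non-zero singular values
    stable_rank = float((s / s[0]).sum())     # r(A) = sum_i lambda_i / lambda_1
    d = a.shape[0]
    return rank / d, stable_rank / d          # normalized rank and normalized stable-rank

# A fully whitened matrix (A A^T = m I) has normalized rank and stable-rank both equal to 1.
m, d = 256, 64
a = torch.linalg.qr(torch.randn(m, d)).Q.t() * (m ** 0.5)
print(rank_and_stable_rank(a))                # ~ (1.0, 1.0)
```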
PCA Whitening Fails to Avoid Dimensional Collapse. We compare the effects of the ZCA, CD and
PCA transformations for whitening loss, evaluated on CIFAR-10 using the standard setup for SSL
(see Section 4.1 for details). Besides, we also provide the result of batch normalization (BN), which
only performs standardization without decorrelating the axes, and of the 'Plain' method, which imposes
the loss directly on the embedding. From Figure 2, we observe that naively training a Siamese network
('Plain') results in collapse of both the embedding (Figure 2(c)) and the encoding (Figure 2(d)), which
significantly hampers performance (Figure 2(a)), even though its training loss approaches zero
(Figure 2(b)). We also observe that an extra BN imposed on the embedding prevents collapse to a
point. However, it suffers from dimensional collapse, where the ranks of the embedding and encoding
are significantly low, which also hampers performance. ZCA and CD whitening both maintain a
high rank of the embedding and encoding by decorrelating the axes, ensuring high linear evaluation
accuracy. However, we note that PCA whitening shows significantly different behavior: it cannot
decrease the loss and cannot even avoid dimensional collapse, which leads to significantly degraded
performance. This interesting observation challenges the motivations of whitening loss for SSL. We
defer the analyses and illustration to Section 3.3.
Whitened Output is not a Good Representation. As introduced before, the motivation of whitening
loss for SSL is that the whitening operation can remove the correlation among axes [24] and a
whitened representation ensures that the examples are scattered in a spherical distribution [13], which
is sufficient to avoid collapse. Based on this argument, one should use the whitened output $\widehat{Z}$ as
the representation for downstream tasks, rather than the encoding $H$ that is commonly used. This
raises the questions of whether $H$ is well whitened and whether the whitened output is a good feature.
We conduct experiments to compare the performance of whitening loss when using $H$, $Z$ and $\widehat{Z}$
as the representation for evaluation, respectively. The results are shown in Figure 3. We observe
that using the whitened output $\widehat{Z}$ as the representation yields significantly worse performance than using $H$.
Furthermore, we find that the normalized stable-rank of $H$ is significantly smaller than 100%, which
suggests that $H$ is not well whitened. These results show that the whitened output is not a
good representation.

Figure 3: Comparisons of features when using the encoding $H$, the embedding $Z$ and the whitened output $\widehat{Z}$,
respectively. We follow the same experimental setup as Figure 2. We show (a) the linear evaluation
accuracy; (b) the kNN accuracy; (c) the normalized stable-rank for comparing the extent of whitening
(note that the normalized stable-rank of $\widehat{Z}$ is always 100% during training and we omit it for clarity).
The results are averaged over five random seeds, with the standard deviation shown as a shaded region.
3.3 Analysing Decomposition of Whitening Loss
For clarity, we use a mini-batch input of size $m$. Given one mini-batch input $X$ with two
augmented views, Eqn. 4 can be formulated as:
$$\mathcal{L}(X) = \frac{1}{m}\|\widehat{Z}_1 - \widehat{Z}_2\|_F^2. \qquad (5)$$
Let us consider a proxy loss described as:
$$\mathcal{L}'(X) = \underbrace{\frac{1}{m}\|\widehat{Z}_1 - (\widehat{Z}_2)_{st}\|_F^2}_{\mathcal{L}'_1} + \underbrace{\frac{1}{m}\|(\widehat{Z}_1)_{st} - \widehat{Z}_2\|_F^2}_{\mathcal{L}'_2}, \qquad (6)$$
where $(\cdot)_{st}$ indicates the stop-gradient operation. It is easy to demonstrate that $\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}'}{\partial \theta}$ (see
Appendix B.1 for the proof). That is, the optimization dynamics of $\mathcal{L}$ are equivalent to those of $\mathcal{L}'$.
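A numerical sketch of this equivalence (our own check, using ZCA whitening for $\Phi$ and `.detach()` for the stop-gradient operation): the gradients of $\mathcal{L}$ and $\mathcal{L}'$ with respect to the embeddings coincide.

```python
import torch

def zca_whiten(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Phi_ZCA(Z) = U Lambda^{-1/2} U^T Z applied to the centered (d_z x m) mini-batch embedding.
    z = z - z.mean(dim=1, keepdim=True)
    sigma = z @ z.t() / z.shape[1] + eps * torch.eye(z.shape[0])
    lam, U = torch.linalg.eigh(sigma)
    return U @ torch.diag(lam.rsqrt()) @ U.t() @ z

def loss_symmetric(z1h, z2h, m):
    # Eqn. 5: (1/m) ||Z1_hat - Z2_hat||_F^2
    return (z1h - z2h).pow(2).sum() / m

def loss_proxy(z1h, z2h, m):
    # Eqn. 6: asymmetric decomposition with stop-gradient on the respective targets
    return ((z1h - z2h.detach()).pow(2).sum() + (z1h.detach() - z2h).pow(2).sum()) / m

def grads(loss_fn):
    torch.manual_seed(0)                       # identical data for both loss functions
    z1 = torch.randn(64, 256, requires_grad=True)
    z2 = torch.randn(64, 256, requires_grad=True)
    loss_fn(zca_whiten(z1), zca_whiten(z2), z1.shape[1]).backward()
    return z1.grad, z2.grad

(g1, g2), (g1p, g2p) = grads(loss_symmetric), grads(loss_proxy)
assert torch.allclose(g1, g1p, atol=1e-5) and torch.allclose(g2, g2p, atol=1e-5)
```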
By looking into the first term of Eqn. 6, we have:
$$\mathcal{L}'_1 = \frac{1}{m}\|\phi(Z_1)Z_1 - (\widehat{Z}_2)_{st}\|_F^2. \qquad (7)$$
Here, we can view $\phi(Z_1)$ (the whitening matrix computed from $Z_1$) as a predictor that depends on $Z_1$
during forward propagation, and $\widehat{Z}_2$ as a whitened target with $r(\widehat{Z}_2) = \mathrm{Rank}(\widehat{Z}_2) = d_z$. In this way,
we find that minimizing $\mathcal{L}'_1$ only requires the embedding $Z_1$ to be full-rank, i.e., $\mathrm{Rank}(Z_1) = d_z$, as
stated by the following proposition.
Proposition 1. Let $\mathbb{A} = \arg\min_{Z_1} \mathcal{L}'_1(Z_1)$. We have that $\mathbb{A}$ is not an empty set, and $\forall Z_1 \in \mathbb{A}$, $Z_1$ is
full-rank. Furthermore, for any $\{\sigma_i\}_{i=1}^{d_z}$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{d_z} > 0$, we construct
$\widetilde{\mathbb{A}} = \{Z_1 \,|\, Z_1 = U_2\, \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_{d_z})\, V_2^T\}$, where $U_2 \in \mathbb{R}^{d_z \times d_z}$ and $V_2 \in \mathbb{R}^{m \times d_z}$ are from the singular value
decomposition of $\widehat{Z}_2$, i.e., $U_2(\sqrt{m}I)V_2^T = \widehat{Z}_2$. When we use ZCA whitening, we have $\widetilde{\mathbb{A}} \subseteq \mathbb{A}$.
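As a numerical sanity check of this construction (our own sketch, assuming ZCA whitening for $\Phi$): any $Z_1$ built from the singular vectors of $\widehat{Z}_2$ with arbitrary positive singular values is mapped by ZCA whitening back onto $\widehat{Z}_2$, so $\mathcal{L}'_1(Z_1) \approx 0$ even though $Z_1$ itself is merely full-rank, not whitened.

```python
import torch

def zca_whiten(z: torch.Tensor) -> torch.Tensor:
    # Phi_ZCA(Z) = U Lambda^{-1/2} U^T Z on the centered mini-batch (full-rank here, so no eps is needed).
    z = z - z.mean(dim=1, keepdim=True)
    lam, U = torch.linalg.eigh(z @ z.t() / z.shape[1])
    return U @ torch.diag(lam.rsqrt()) @ U.t() @ z

d_z, m = 64, 256
z2_hat = zca_whiten(torch.randn(d_z, m))                       # whitened target Z2_hat of the second view
U2, S2, V2h = torch.linalg.svd(z2_hat, full_matrices=False)    # Z2_hat = U2 diag(S2) V2^T, S2 ~ sqrt(m)

sigma = torch.rand(d_z).sort(descending=True).values + 0.1     # arbitrary positive "singular values"
z1 = U2 @ torch.diag(sigma) @ V2h                              # a member of the constructed set A-tilde

loss_1 = (zca_whiten(z1) - z2_hat).pow(2).sum() / m            # L'_1 with the whitened target held fixed
print(float(loss_1))                                           # close to zero
```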
The proof is shown in Appendix B.2. Proposition 1 states that there are infinitely many full-rank
matrices that are optima when minimizing $\mathcal{L}'_1$ w.r.t. $Z_1$. Therefore, minimizing $\mathcal{L}'_1$ only requires the
embedding $Z_1$ to be full-rank with $\mathrm{Rank}(Z_1) = d_z$, and does not necessarily impose the constraint
that $Z_1$ be whitened with $r(Z_1) = d_z$. A similar analysis also applies to $\mathcal{L}'_2$, and minimizing $\mathcal{L}'_2$