RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank
Quentin Garrido 1 2, Randall Balestriero 1, Laurent Najman 2, Yann LeCun 1 3 4
Abstract
Joint-Embedding Self-Supervised Learning (JE-SSL) has seen rapid development, with the emergence of many method variations but only a few principled guidelines that would help practitioners successfully deploy them. The main reason for that pitfall comes from JE-SSL's core principle of not employing any input reconstruction, therefore lacking visual cues of unsuccessful training. Adding non-informative loss values to that, it becomes difficult to deploy SSL on a new dataset for which no labels can help to judge the quality of the learned representation. In this study, we develop a simple unsupervised criterion that is indicative of the quality of the learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method, coined RankMe, allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels. A further benefit of RankMe is that it does not have any training or hyper-parameters to tune. Through thorough empirical experiments involving hundreds of training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no reduction in final performance compared to the current selection method that involves a dataset's labels. We hope that RankMe will facilitate the deployment of JE-SSL towards domains that do not have the opportunity to rely on labels for representations' quality assessment.
1 Meta AI - FAIR  2 Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France  3 Courant Institute, New York University  4 Center for Data Science, New York University. Correspondence to: Quentin Garrido <garridoq@meta.com>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
1. Introduction
Self-supervised learning (SSL) has shown great progress in learning informative data representations in recent years (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Lee et al., 2021; Caron et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Tomasev et al., 2022; Caron et al., 2021; Chen et al., 2021; Li et al., 2022b; Zhou et al., 2022a;b; HaoChen et al., 2021; He et al., 2022), catching up to supervised baselines and even surpassing them in few-shot learning, i.e., when evaluating the SSL model from only a few labeled examples. Although various SSL families of losses have emerged, most are variants of the joint-embedding (JE) framework with a siamese network architecture (Bromley et al., 1994), denoted as JE-SSL for short. The only technicality we ought to introduce to make our study precise is the fact that JE-SSL has introduced different notations to denote an input's representation. In short, JE-SSL often composes a backbone or encoder network, e.g., a ResNet-50, and a projector network, e.g., a multilayer perceptron. This projector is only employed during training, and we refer to its outputs as embeddings, while the actual inputs' representations employed for downstream tasks are obtained at the encoder's output.
Although the downstream-task performance of JE-SSL representations might seem impressive, one pondering fact should be noted: all existing methods, hyperparameters, and models (and thus performances) are obtained by manual search involving the labels of the considered datasets. In words, JE-SSL is tuned by monitoring the supervised performance of the model at hand. Therefore, successfully deploying an SSL model on a new dataset relies on the strong assumption of having labels on that dataset to tune the SSL method, e.g., through a linear classifier feeding on the JE-SSL representations (Misra & Maaten, 2020). This quality assessment strategy was also extended to the use of nonlinear classifiers, e.g., a k-NN classifier (Wu et al., 2018; Zhuang et al., 2019). Hence, although labels are not directly employed to compute the weight updates, they are used as a proxy. This limitation prevents the deployment of JE-SSL in challenging domains where the number of available labelled examples is limited.
Adding to the challenge, one milestone of JE-SSL is to move away from reconstruction-based learning; hence, without labels and without visual cues, tuning JE-SSL methods on unlabeled datasets remains challenging. This led to feature inversion methods, e.g., Deep Image Prior (Ulyanov et al., 2018) or conditional diffusion models (Bordes et al., 2021), being deployed onto learned JE-SSL representations to try to visualize the learned features. Those alternative visualization solutions however suffer from their own limitations, e.g., biases of the method used, or computational cost. More importantly, those feature inversion strategies have been designed for natural images, i.e., it is not clear how such methods would perform on different data modalities.
In this study we propose RankMe to assess a model's performance without having access to any labels; a simple method that does not require any training or tuning. RankMe accurately predicts a model's performance both In-Distribution (ID), i.e., on the same data distribution as used during the JE-SSL training, and Out-Of-Distribution (OOD), i.e., on a different data distribution onto which the learned model is deployed. We highlight this crucial property at the top of Figure 1. The strength of RankMe lies in the fact that it is solely based on the distribution of singular values of the learned embeddings, which is not only simple to obtain but also easy to interpret. In fact, RankMe's motivation hinges on Cover's theorem (Cover, 1965), which states how increasing the rank of a linear classifier's input increases its training performance, and on three simple hypotheses that we thoroughly validate empirically at the end of our study. Since RankMe provides a step towards (unlabeled) JE-SSL by allowing practitioners to cross-validate hyperparameters and select models without resorting to labels or feature inversion methods, we hope that it will allow JE-SSL to move away from using labels as part of its design search strategy. We summarize our contributions below:
1. We introduce RankMe (Equation (1)) and motivate its construction from first principles (Section 5), e.g., Cover's theorem.
2. We demonstrate that RankMe's ability to inform about JE-SSL downstream performance is consistent across methods, e.g., VICReg, SimCLR, DINO, and their variants, and across architectures, e.g., using a projector network and/or a nonlinear evaluation method (see Figure 2 and Section 3.3).
3. We demonstrate that RankMe enables hyperparameter cross-validation for JE-SSL methods; RankMe is able to retrieve, and sometimes surpass, most of the performance previously found by manual, label-guided search while not employing any labels, on both in-domain and out-of-domain datasets (Figure 1 and Tables 1 and 2).
We provide a hyperparameter-free, numerically stable implementation of RankMe in Section 3.1 and pseudo-code for cross-validation in Figure 4. Through extensive experiments involving 11 datasets and 110 models over 5 methods, we demonstrate that in the linear and nonlinear probing regimes, RankMe is able to tell apart successful and sub-optimal JE-SSL training, even on different downstream tasks, without having access to labels or downstream task data samples.
2. Background
Joint-embedding self-supervised learning (JE-SSL). In JE-SSL, two main families of methods can be distinguished: contrastive and non-contrastive. Contrastive methods (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; 2021; Yeh et al., 2021) mostly rely on the InfoNCE criterion (Oord et al., 2018), except for (HaoChen et al., 2021) which uses squared similarities between the embeddings. A clustering variant of contrastive learning has also emerged (Caron et al., 2018; 2020; 2021) and can be thought of as contrastive methods, but between cluster centroids instead of samples. Non-contrastive methods (Grill et al., 2020; Chen & He, 2020; Caron et al., 2021; Bardes et al., 2021; Zbontar et al., 2021; Ermolov et al., 2021; Li et al., 2022c) aim at bringing together embeddings of positive samples, similarly to contrastive learning. However, a key difference with contrastive learning lies in how those methods prevent representational collapse. In the former, the criterion explicitly pushes negative samples, i.e., all samples that are not positive, away from each other. In the latter, the criterion does not prevent collapse by distinguishing positive and negative samples, but instead considers the embeddings as a whole and encourages information content maximization, e.g., by regularizing the empirical covariance matrix of the embeddings. Such a categorization is not needed for our development, and we thus refer to any of the above methods as JE-SSL.
Known Observations About Representations' Spectrum in JE-SSL. The phenomenon of learning rank-deficient, or dimensionally collapsed, embeddings in JE-SSL has recently been studied from both a theoretical and an empirical point of view. The empirical emergence of dimensional collapse was studied in (Hua et al., 2021), where the use of a whitening batch normalization layer was proposed to help alleviate it. In (Jing et al., 2022), a focus on contrastive approaches in a linear setting enabled a better understanding of dimensional collapse and the role of augmentations in its emergence. Performance in a low-label regime of a partially collapsed encoder can also be improved by forcing the whitening of its output, as shown in (He & Ozay, 2022). Furthermore, it was shown in (Balestriero & LeCun, 2022) how dimensional collapse is a phenomenon that should not necessarily happen in theory and how its emergence is mostly due to practical concerns. Interestingly, we will see through the lens of RankMe that dimensional collapse is tightly linked with the quality of the representation. In supervised learning, the collapse of the embeddings was also studied and found to be detrimental to performance (Ganea et al., 2019).
Dataset   Method           Labels   VICReg                       SimCLR                DINO
                                    cov.   inv.   LR     WD      temp.  LR     WD      t-temp.  s-temp.
ImageNet  ImageNet Oracle  ✓        68.2   68.2   68.6   68.0    68.5   68.5   68.3    72.3     72.4
          α-ReQ            ✗        67.9   67.5   59.5   67.8    63.5   68.1   32.3    71.7     66.2
          RankMe           ✗        67.8   67.9   68.2   67.8    67.1   68.0   68.3    72.2     72.4
OOD       ImageNet Oracle  ✓        68.7   68.7   68.9   68.8    68.7   68.7   68.8    71.9     72.5
          α-ReQ            ✗        68.1   67.8   63.8   68.4    65.1   68.2   68.6    71.8     68.5
          RankMe           ✗        67.7   68.3   68.7   68.4    67.6   68.4   68.8    71.8     72.5
Figure 1. Top: Performance of JE-SSL representations (encoder output), on the y-axis, against the RankMe values of the embeddings (projector output), on the x-axis, on ImageNet-1k. Except for some degenerate solutions at full rank, RankMe values correlate well with in-distribution (left column) and out-of-distribution (right columns) classification performance. Bottom: Hyperparameter selection using the common supervised linear probe strategy, α-ReQ, and the proposed unsupervised RankMe strategy. Values in bold represent the best performance between RankMe and α-ReQ. OOD indicates the average performance over all the considered datasets other than ImageNet. Without any labels, optimization, or parameters, RankMe is able to recover most of the performance obtained by using the ImageNet validation set, highlighting its strength as a hyperparameter selection tool. RankMe also outperforms α-ReQ on average and does not suffer from as big performance drops in worst cases.
As such, existing studies have started to informally prescribe the choice of representations that exhibit less collapse; yet no formal study of the ability of this recipe to actually identify successfully trained models, nor of how to quantify the amount of collapse to improve representations, has been proposed; this is the goal of our study.
3. RankMe Consistently Predicts Downstream Performance From Representations
The goal of this section is to introduce and motivate RankMe while providing a numerically stable implementation. We defer a theoretical justification to Section 5. To ease notation, we refer to the (train) dataset used to obtain the JE-SSL model as the source dataset, and to the test set on the same dataset or a different OOD dataset as the target dataset.
3.1. RankMe: A Simple Method and Its Implementation
The most crucial step of RankMe is the estimation of the embeddings' rank. A trivial solution would be to count the number of nonzero singular values. Denoting by $\sigma_k$ the $k$-th singular value of the $(N \times K)$ embedding matrix $Z$, this would lead to $\mathrm{rank}(Z) = \sum_{k=1}^{\min(N,K)} \mathbb{1}\{\sigma_k > 0\}$. However, such a definition is too rigid for practical scenarios. For example, round-off error alone could have a dramatic impact on the rank estimate. Instead, alternative and robust rank definitions have emerged (Press et al., 2007), such as $\mathrm{rank}(Z) = \sum_{k=1}^{\min(N,K)} \mathbb{1}\{\sigma_k > \max_i \sigma_i \times \max(N, K) \times \epsilon\}$, where $\epsilon$ is a small constant dependent on the data type, typically $10^{-7}$ for float32. An alternative measure of rank comes from a probabilistic viewpoint where the singular values are normalized to sum to 1 and the Shannon entropy (Shannon, 1948) is used, which corresponds to our definition of RankMe in Equation (1). We thus introduce RankMe formally as the following smooth rank measure, originally introduced in (Roy & Vetterli, 2007),

$$\mathrm{RankMe}(Z) = \exp\left(-\sum_{k=1}^{\min(N,K)} p_k \log p_k\right), \qquad (1)$$

$$\text{with } p_k = \frac{\sigma_k(Z)}{\|\sigma(Z)\|_1} + \epsilon, \qquad (2)$$
where $Z$ is the source dataset's embeddings. As opposed to the classical rank, Equation (1) does not rely on specifying the exact threshold at which a singular value is treated as nonzero. Throughout our study, we employ Equation (1), and provide the matching analysis with the classical rank in the appendix. Another benefit of RankMe's Equation (1) is that it quantifies the whitening of the embeddings in addition to their rank, which is known to simplify the optimization of (non)linear probes put on top of them (Santurkar et al., 2018). Lastly, although Equation (1) is defined with the full embedding matrix $Z$, we observe that not all of the samples need to be used to obtain an accurate estimate of RankMe. In practice, we use 25600 samples, as ablation studies provided in Appendix G and Figure S11 indicate that this provides a highly accurate estimate. RankMe should however only be used to compare different runs of a given method, since the embeddings' rank is not the only factor that affects performance.
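To make the computation concrete, below is a minimal sketch of Equations (1) and (2), together with the classical thresholded rank discussed above, written in PyTorch. The function names, the choice of PyTorch, and the assumption that Z holds the embeddings of roughly 25600 source-dataset samples are ours for illustration, not the authors' released implementation.

```python
import torch

def rankme(Z: torch.Tensor, eps: float = 1e-7) -> float:
    """Smooth rank measure of an (N x K) embedding matrix Z, Equations (1)-(2)."""
    # Singular values of the embedding matrix, length min(N, K).
    s = torch.linalg.svdvals(Z.float())
    # Equation (2): normalize by the L1 norm of the spectrum, then add eps.
    p = s / s.abs().sum() + eps
    # Equation (1): exponential of the Shannon entropy of p.
    return torch.exp(-(p * torch.log(p)).sum()).item()

def thresholded_rank(Z: torch.Tensor, eps: float = 1e-7) -> int:
    """Classical robust rank with a data-type-dependent tolerance, for comparison."""
    s = torch.linalg.svdvals(Z.float())
    tol = s.max() * max(Z.shape) * eps
    return int((s > tol).sum().item())
```

In this sketch the two measures share the same singular value decomposition and differ only in how the spectrum is post-processed, which is what makes RankMe threshold-free.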
Relation of RankMe To Existing Solutions. Performance evaluation without labels can also be done using a pretext task, such as rotation prediction. This technique helped in selecting data augmentation policies in (Reed et al., 2021). One limitation lies in the need to select and train the classifier of the pretext task, and in the strong assumption that rotations were not part of the transformations one aimed to be invariant to. Since (supervised) linear evaluation is the most widely used evaluation method, we will focus on showing how RankMe compares with it. In (Li et al., 2022a), it is shown that the eigenspectrum of representations can be used to assess performance when used in conjunction with the loss value. This requires training an additional classifier to predict the performance and as such is not usable as is in a completely unsupervised fashion. Most related to us is (Ghosh et al., 2022), where representations are evaluated by their eigenspectrum decay, giving a baseline for unsupervised hyperparameter selection. α-ReQ relies on strong assumptions, and if they hold, then RankMe and α-ReQ can match, but we show that we outperform it on average. In fact, the assumptions made by α-ReQ are known not to hold in light of collapse (He & Ozay, 2022). We investigate α-ReQ's behavior in detail in Appendix E.
3.2. RankMe Predicts Linear Probing Performance Even on Unseen Datasets
In order to empirically validate RankMe, we compare it to linear evaluation, which is the default evaluation method of JE-SSL methods. Finetuning has gained in popularity with Masked Image Modeling methods (He et al., 2021), but it can have a significant impact on the properties of the embeddings and alters what was learned during pretraining. As such, we do not focus on this evaluation.
Experimental Methods and Datasets Considered. In order to provide a meaningful assessment of the impact of the embeddings' rank on performance, we focus on 5 JE-SSL methods. We use SimCLR as a representative contrastive method, VICReg as a representative covariance-based method, and VICReg-exp and VICReg-ctr, which were introduced in (Garrido et al., 2022). We also include DINO (Caron et al., 2021) as a clustering approach. Applying RankMe to DINO is not as straightforward due to the clustering layer in the projector, so embeddings have to be taken right before the last projector layer; see Appendix C for more details. To make our work self-contained, we present the methods in Appendix A. We chose to use VICReg-exp and VICReg-ctr as they provide small modifications to VICReg and SimCLR while producing embeddings with different rank properties. For each method we vary parameters that directly influence the rank of the embeddings, whether it is the temperature used in softmax-based methods, which directly impacts the hardness of the softmax, or the loss weights, which give more or less importance to the regularizing aspect of the loss functions. We also vary optimization parameters such as the learning rate and weight decay to provide a more complete analysis. We provide the hyperparameters used for all experiments in Appendix K. All approaches were trained in the same experimental setting with a ResNet-50 (He et al., 2016) backbone and an MLP projector with intermediate layers of size 8192, 8192, 2048, which avoids any architectural rank constraints. The models were trained for 100 epochs on ImageNet with the LARS (You et al., 2017; Goyal et al., 2017) optimizer. DINO was also trained using multi-crop.
In order to evaluate the methods, we use ImageNet (our source dataset), as well as iNaturalist18 (Horn et al., 2018), Places205 (Zhou et al., 2014), EuroSat (Helber et al., 2019), SUN397 (Xiao et al., 2010), and StanfordCars (Krause et al., 2013) to evaluate the trained models on unseen datasets. While we focus on these datasets for our visualizations, we also include CIFAR10, CIFAR100 (Krizhevsky et al., 2009), Food101 (Bossard et al., 2014), VOC07 (Everingham et al.) and CLEVR-count (Johnson et al., 2017) for our hyperparameter selection results, and provide matching visualizations in Appendix D. These commonly used datasets provide a wide range of scenarios that differ from ImageNet and provide meaningful ways to test the robustness of RankMe. For example, iNaturalist18 consists of 8142 classes focused on fauna and flora, which requires more granularity than similar classes in ImageNet; SUN397 focuses on scene understanding, deviating from the single-object and object-centric images of ImageNet; and EuroSat consists of satellite images, which again differ from ImageNet. Datasets such as iNaturalist can also allow theoretical limitations to manifest themselves more clearly, due to the number of classes being significantly higher than the rank of learned representations.
Figure 2. Validation of RankMe when evaluating performance on representations. We see that having a high rank is a necessary condition for good downstream performance.
In order to evaluate on those datasets, we rely on the VISSL library (Goyal et al., 2021). We provide complete details on the pretraining and evaluation setup in Appendix I.
RankMe as a predictor of linear classification accuracy. As we can see in Figures 1 and 2, for a given method the performance on the representations improves with a higher embedding rank, whether we look at ImageNet, on which the models were pretrained, or at downstream datasets. This is best seen when looking at DINO, where we notice a clear trend across all datasets. On EuroSat, the relationship is not as clear since the performances are so close between all models. When looking at VICReg on StanfordCars, we can clearly see that a high rank is only a necessary condition. Here the best performance is not achieved with the highest rank, even if full-rank embeddings still achieve good performance. We discuss the link between rank, number of classes, and performance in Section 5 to give some insights into RankMe's behavior in settings with few classes such as StanfordCars. It is also very tempting to draw conclusions when comparing different approaches, especially when looking at the ImageNet performance; however, since dimensional collapse is not the only factor deciding performance, one should refrain from doing so.
Figure 3. Impact of rank on performance with other architectures and evaluation protocols. (Left) Using a 3-layer MLP as the classification head does not alter the performance before or after the projector, showing that RankMe can go beyond linear evaluation. (Right) The same conclusion holds for k-NN evaluation on ImageNet, where RankMe remains a good indicator of performance.
3.3. RankMe Also Holds for Non-linear Probing
While we have been focusing on linear evaluation, one can wonder if the behaviors change when using a more complex task-related head. We thus give some evidence that the previously observed behaviors are similar with a non-linear classification head. We use a simple 3-layer MLP with intermediate dimensions 2048, where each layer is followed by a ReLU activation. This choice of dimensions ensures that there are no architectural rank constraints on the embeddings. We focus on SUN397 for its conceptual difference to ImageNet. The low rank of embeddings produced by SimCLR would suggest that a non-linear classifier might help improve performance, since it is not as theoretically limited by the embeddings' rank as in the linear setting. However, we can see in Figure 3 that the behaviors for all methods are the same as in the linear regime. This suggests that RankMe is also a suitable metric to evaluate downstream performance in a non-linear setting. We perform the same analysis using a k-NN classifier, following the protocol of (Zhuang et al., 2019; Caron et al., 2020), where we use 36 combinations of k and temperature and report the best performance. We see in Figure 3 that RankMe remains a good predictor of downstream performance, with curves similar to what was observed with a linear classifier. Since a k-NN classifier evaluates the preservation of the Euclidean distance instead of linear separability, the results suggest that RankMe extends to more evaluation protocols.
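For concreteness, the following is a minimal sketch of the non-linear probe described above: a 3-layer MLP with intermediate width 2048 and ReLU activations, placed on top of frozen representations. The 2048-dimensional input (ResNet-50 representations), the SUN397 class count, and the absence of an activation after the output layer are our assumptions for illustration.

```python
import torch.nn as nn

def mlp_probe(in_dim: int = 2048, num_classes: int = 397) -> nn.Sequential:
    # Hidden layers of width 2048 with ReLU, as described in Section 3.3;
    # the final layer outputs class logits (no activation, an assumption here).
    return nn.Sequential(
        nn.Linear(in_dim, 2048), nn.ReLU(),
        nn.Linear(2048, 2048), nn.ReLU(),
        nn.Linear(2048, num_classes),
    )
```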
4. RankMe for Label-Free Cross-Validation
We previously focused on validating RankMe by comparing it to linear evaluation in terms of overall performance. In this section we focus on the evolution of rank and performance when varying one hyperparameter at a time, in order to demonstrate how RankMe can be used for hyperparameter selection. We focus on loss-specific hyperparameters, such as the loss weights or temperature, as well as hyperparameters related to optimization, such as the learning rate and weight decay.
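As a rough illustration of how RankMe slots into such a search (the paper's own pseudo-code is in Figure 4, not reproduced in this excerpt), the loop below keeps the configuration whose embeddings have the highest RankMe value. Here train_jessl and embed are hypothetical helpers standing in for a full JE-SSL pretraining run and embedding extraction, and rankme refers to the sketch from Section 3.1.

```python
def select_hyperparameters(candidate_configs, source_loader):
    # Label-free selection: keep the config whose embeddings have the highest RankMe.
    best_cfg, best_score = None, float("-inf")
    for cfg in candidate_configs:
        encoder, projector = train_jessl(cfg)          # hypothetical: pretrain with this config
        Z = embed(encoder, projector, source_loader)   # hypothetical: ~25600 source embeddings
        score = rankme(Z)                              # Equation (1); no labels involved
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```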