RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank
Quentin Garrido 1 2, Randall Balestriero 1, Laurent Najman 2, Yann LeCun 1 3 4
Abstract
Joint-Embedding Self-Supervised Learning (JE-SSL) has seen rapid development, with the emergence of many method variations but only a few principled guidelines that would help practitioners successfully deploy them. The main reason for that pitfall comes from JE-SSL's core principle of not employing any input reconstruction, therefore lacking visual cues of unsuccessful training. Adding non-informative loss values to that, it becomes difficult to deploy SSL on a new dataset for which no labels can help to judge the quality of the learned representation. In this study, we develop a simple unsupervised criterion that is indicative of the quality of the learned JE-SSL representations: their effective rank. Albeit simple and computationally friendly, this method, coined RankMe, allows one to assess the performance of JE-SSL representations, even on different downstream datasets, without requiring any labels. A further benefit of RankMe is that it does not have any training or hyper-parameters to tune. Through thorough empirical experiments involving hundreds of training episodes, we demonstrate how RankMe can be used for hyperparameter selection with nearly no reduction in final performance compared to the current selection method that involves a dataset's labels. We hope that RankMe will facilitate the deployment of JE-SSL towards domains that do not have the opportunity to rely on labels for representations' quality assessment.
1 Meta AI - FAIR  2 Univ Gustave Eiffel, CNRS, LIGM, F-77454 Marne-la-Vallée, France  3 Courant Institute, New York University  4 Center for Data Science, New York University. Correspondence to: Quentin Garrido <garridoq@meta.com>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
1. Introduction
Self-supervised learning (SSL) has shown great progress in learning informative data representations in recent years (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; Grill et al., 2020; Lee et al., 2021; Caron et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Tomasev et al., 2022; Caron et al., 2021; Chen et al., 2021; Li et al., 2022b; Zhou et al., 2022a;b; HaoChen et al., 2021; He et al., 2022), catching up to supervised baselines and even surpassing them in few-shot learning, i.e., when evaluating the SSL model from only a few labeled examples. Although various SSL families of losses have emerged, most are variants of the joint-embedding (JE) framework with a siamese network architecture (Bromley et al., 1994), denoted as JE-SSL for short. The only technicality we ought to introduce to make our study precise is the fact that JE-SSL has introduced different notations to denote an input's representation. In short, JE-SSL often composes a backbone or encoder network, e.g., a ResNet-50, and a projector network, e.g., a multilayer perceptron. This projector is only employed during training, and we refer to its outputs as embeddings, while the actual inputs' representations employed for downstream tasks are obtained at the encoder's output.
Although the downstream-task performance of JE-SSL representations might seem impressive, one pondering fact should be noted: all existing methods, hyperparameters, and models (and thus performances) are obtained by manual search involving the labels of the considered datasets. In words, JE-SSL is tuned by monitoring the supervised performance of the model at hand. Therefore, successfully deploying an SSL model on a new dataset relies on the strong assumption of having labels on that dataset to tune the SSL method, e.g., through a linear classifier feeding on the JE-SSL representations (Misra & Maaten, 2020). This quality assessment strategy was also extended to the use of nonlinear classifiers, e.g., a k-NN classifier (Wu et al., 2018; Zhuang et al., 2019). Hence, although labels are not directly employed to compute the weight updates, they are used as a proxy. This limitation prevents the deployment of JE-SSL in challenging domains where the number of available labelled examples is limited.
Adding to the challenge, one milestone of JE-SSL is to move away from reconstruction-based learning; hence, without labels and without visual cues, tuning JE-SSL methods on unlabeled datasets remains challenging. This led to feature inversion methods, e.g., Deep Image Prior (Ulyanov et al., 2018) or conditional diffusion models (Bordes et al., 2021), being deployed onto learned JE-SSL representations to try to visualize the learned features. Those alternative visualization solutions however suffer from their own limitations, e.g., biases of the method used, or computational cost. More importantly, those feature inversion strategies have been designed for natural images, i.e., it is not clear how such methods would perform on different data modalities.
In this study we propose RankMe to assess a model's performance without having access to any labels; a simple method that does not require any training or tuning. RankMe accurately predicts a model's performance both In-Distribution (ID), i.e., on the same data distribution as used during the JE-SSL training, and Out-Of-Distribution (OOD), i.e., on a different data distribution onto which the learned model is deployed. We highlight this crucial property at the top of Figure 1. The strength of RankMe lies in the fact that it is solely based on the distribution of singular values of the learned embeddings, which is not only simple to obtain but also easy to interpret. In fact, RankMe's motivation hinges on Cover's theorem (Cover, 1965), which states how increasing the rank of a linear classifier's input increases its training performance, and on three simple hypotheses that we thoroughly validate empirically at the end of our study. Since RankMe provides a step towards (unlabeled) JE-SSL by allowing practitioners to cross-validate hyperparameters and select models without resorting to labels or feature inversion methods, we hope that it will allow JE-SSL to move away from using labels as part of its design search strategy. We summarize our contributions below:
1. We introduce RankMe (Equation (1)) and motivate its construction from first principles (Section 5), e.g., Cover's theorem.
2. We demonstrate that RankMe's ability to inform about JE-SSL downstream performance is consistent across methods, e.g., VICReg, SimCLR, DINO, and their variants, and across architectures, e.g., using a projector network and/or a nonlinear evaluation method (see Figure 2 and Section 3.3).
3. We demonstrate that RankMe enables hyperparameter cross-validation for JE-SSL methods; RankMe is able to retrieve, and sometimes surpass, most of the performance previously found by manual, label-guided search while not employing any labels, on both in-domain and out-of-domain datasets (Figure 1 and Tables 1 and 2).
We provide a hyperparameter-free, numerically stable implementation of RankMe in Section 3.1 and pseudo-code for cross-validation in Figure 4. Through extensive experiments involving 11 datasets and 110 models over 5 methods, we demonstrate that in the linear and nonlinear probing regimes, RankMe is able to tell apart successful and sub-optimal JE-SSL training, even on different downstream tasks, without having access to labels or downstream task data samples.
2. Background
Joint-embedding self-supervised learning (JE-SSL). In JE-SSL, two main families of methods can be distinguished: contrastive and non-contrastive. Contrastive methods (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b; 2021; Yeh et al., 2021) mostly rely on the InfoNCE criterion (Oord et al., 2018), except for (HaoChen et al., 2021) which uses squared similarities between the embeddings. A clustering variant of contrastive learning has also emerged (Caron et al., 2018; 2020; 2021) and can be thought of as contrastive methods, but between cluster centroids instead of samples. Non-contrastive methods (Grill et al., 2020; Chen & He, 2020; Caron et al., 2021; Bardes et al., 2021; Zbontar et al., 2021; Ermolov et al., 2021; Li et al., 2022c) aim at bringing together embeddings of positive samples, similarly to contrastive learning. However, a key difference with contrastive learning lies in how those methods prevent representational collapse. In the former, the criterion explicitly pushes negative samples, i.e., all samples that are not positive, away from each other. In the latter, the criterion does not prevent collapse by distinguishing positive and negative samples, but instead considers the embeddings as a whole and encourages information content maximization, e.g., by regularizing the empirical covariance matrix of the embeddings. Such a categorization is not needed for our development, and we thus refer to any of the above methods as JE-SSL.
Known Observations About Representations' Spectrum in JE-SSL. The phenomenon of learning rank-deficient, or dimensionally collapsed, embeddings in JE-SSL has recently been studied from both a theoretical and an empirical point of view. The empirical emergence of dimensional collapse was studied in (Hua et al., 2021), where the use of a whitening batch normalization layer was proposed to help alleviate it. In (Jing et al., 2022), a focus on contrastive approaches in a linear setting enabled a better understanding of dimensional collapse and the role of augmentations in its emergence. Performance in a low-label regime of a partially collapsed encoder can also be improved by forcing the whitening of its output, as shown in (He & Ozay, 2022). Furthermore, it was shown in (Balestriero & LeCun, 2022) how dimensional collapse is a phenomenon that should not necessarily happen in theory and how its emergence is mostly due to practical concerns. Interestingly, we will see through the lens of RankMe that dimensional collapse is tightly linked with the quality of the representation. In supervised learning, the collapse of the embeddings was also studied and found to be detrimental to performance (Ganea et al., 2019).
Dataset   Method           Labels   VICReg                       SimCLR                DINO
                                    cov.   inv.   LR     WD      temp.  LR     WD      t-temp.  s-temp.
ImageNet  ImageNet Oracle  ✓        68.2   68.2   68.6   68.0    68.5   68.5   68.3    72.3     72.4
          α-ReQ            ✗        67.9   67.5   59.5   67.8    63.5   68.1   32.3    71.7     66.2
          RankMe           ✗        67.8   67.9   68.2   67.8    67.1   68.0   68.3    72.2     72.4
OOD       ImageNet Oracle  ✓        68.7   68.7   68.9   68.8    68.7   68.7   68.8    71.9     72.5
          α-ReQ            ✗        68.1   67.8   63.8   68.4    65.1   68.2   68.6    71.8     68.5
          RankMe           ✗        67.7   68.3   68.7   68.4    67.6   68.4   68.8    71.8     72.5
Figure 1. Top: Performance of JE-SSL representations (encoder output), on the y-axis, against the RankMe values of the embeddings (projector output), on the x-axis, on ImageNet-1k. Except for some degenerate solutions at full rank, RankMe values correlate well with in-distribution (left column) and out-of-distribution (right columns) classification performance. Bottom: Hyperparameter selection using the common supervised linear probe strategy, α-ReQ, and the proposed unsupervised RankMe strategy. Values in bold represent the best performance between RankMe and α-ReQ. OOD indicates the average performance over all the considered datasets other than ImageNet. Without any labels, optimization, or parameters, RankMe is able to recover most of the performance obtained by using the ImageNet validation set, highlighting its strength as a hyperparameter selection tool. RankMe also outperforms α-ReQ on average and does not suffer from as big performance drops in worst cases.
As such, existing studies have started to informally prescribe the choice of representations that exhibit less collapse; yet no formal study of the ability of this recipe to actually identify successfully trained models, nor of how to quantify the amount of collapse to improve representations, has been proposed; this is the goal of our study.
3. RankMe Consistently Predicts Downstream Performance From Representations
The goal of this section is to introduce and motivate RankMe while providing a numerically stable implementation. We defer a theoretical justification to Section 5. To ease notation, we refer to the (train) dataset used to obtain the JE-SSL model as the source dataset, and to the test set on the same dataset or a different OOD dataset as the target dataset.
3.1. RankMe: A Simple Method and Its Implementation
The most crucial step of RankMe is the estimation of the embeddings' rank. A trivial solution would be to count the number of nonzero singular values. Denoting by $\sigma_k$ the $k$-th singular value of the $(N \times K)$ embedding matrix $Z$, this would lead to $\mathrm{rank}(Z) = \sum_{k=1}^{\min(N,K)} \mathbb{1}\{\sigma_k > 0\}$. However, such a definition is too rigid for practical scenarios. For example, round-off error alone could have a dramatic impact on the rank estimate. Instead, alternative and robust rank definitions have emerged (Press et al., 2007), such as $\mathrm{rank}(Z) = \sum_{k=1}^{\min(N,K)} \mathbb{1}\{\sigma_k > \max_i \sigma_i \times \max(N, K) \times \epsilon\}$, where $\epsilon$ is a small constant dependent on the data type, typically $10^{-7}$ for float32. An alternative measure of rank comes from a probabilistic viewpoint where the singular values are normalized to sum to 1 and the Shannon entropy (Shannon, 1948) is used, which corresponds to our definition of RankMe in Equation (1). We thus introduce RankMe formally as the following smooth rank measure, originally introduced in (Roy & Vetterli, 2007),

$$\mathrm{RankMe}(Z) = \exp\left(-\sum_{k=1}^{\min(N,K)} p_k \log p_k\right), \qquad (1)$$

$$\text{with } p_k = \frac{\sigma_k(Z)}{\|\sigma(Z)\|_1} + \epsilon, \qquad (2)$$
where $Z$ is the source dataset's embeddings. As opposed to the classical rank, Equation (1) does not rely on specifying the exact threshold at which a singular value is treated as nonzero. Throughout our study, we employ Equation (1), and provide the matching analysis with the classical rank in the appendix. Another benefit of RankMe's Equation (1) is that it quantifies the whitening of the embeddings in addition to their rank, which is known to simplify the optimization of (non)linear probes put on top of them (Santurkar et al., 2018). Lastly, although Equation (1) is defined with the full embedding matrix $Z$, we observe that not all of the samples need to be used to obtain an accurate estimate of RankMe. In practice, we use 25600 samples, as ablation studies provided in Appendix G and Figure S11 indicate that this provides a highly accurate estimate. RankMe should however only be used to compare different runs of a given method, since the embeddings' rank is not the only factor that affects performance.
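To make the computation concrete, below is a minimal sketch of Equations (1) and (2), together with the classical thresholded rank discussed above, written in PyTorch. The function names, the choice of PyTorch, and the assumption that Z holds the embeddings of roughly 25600 source-dataset samples are ours for illustration, not the authors' released implementation.

```python
import torch

def rankme(Z: torch.Tensor, eps: float = 1e-7) -> float:
    """Smooth rank measure of an (N x K) embedding matrix Z, Equations (1)-(2)."""
    # Singular values of the embedding matrix, length min(N, K).
    s = torch.linalg.svdvals(Z.float())
    # Equation (2): normalize by the L1 norm of the spectrum, then add eps.
    p = s / s.abs().sum() + eps
    # Equation (1): exponential of the Shannon entropy of p.
    return torch.exp(-(p * torch.log(p)).sum()).item()

def thresholded_rank(Z: torch.Tensor, eps: float = 1e-7) -> int:
    """Classical robust rank with a data-type-dependent tolerance, for comparison."""
    s = torch.linalg.svdvals(Z.float())
    tol = s.max() * max(Z.shape) * eps
    return int((s > tol).sum().item())
```

In this sketch the two measures share the same singular value decomposition and differ only in how the spectrum is post-processed, which is what makes RankMe threshold-free.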
Relation of RankMe To Existing Solutions. Performance evaluation without labels can also be done using a pretext task, such as rotation prediction. This technique helped in selecting data augmentation policies in (Reed et al., 2021). One limitation lies in the need to select and train the classifier of the pretext task, and in the strong assumption that rotations were not part of the transformations one aimed to be invariant to. Since (supervised) linear evaluation is the most widely used evaluation method, we will focus on showing how RankMe compares with it. In (Li et al., 2022a), it is shown that the eigenspectrum of representations can be used to assess performance when used in conjunction with the loss value. This requires training an additional classifier to predict the performance and as such is not usable as is in a completely unsupervised fashion. Most related to us is (Ghosh et al., 2022), where representations are evaluated by their eigenspectrum decay, giving a baseline for unsupervised hyperparameter selection. α-ReQ relies on strong assumptions, and if they hold, then RankMe and α-ReQ can match, but we show that we outperform it on average. In fact, the assumptions made by α-ReQ are known not to hold in light of collapse (He & Ozay, 2022). We investigate α-ReQ's behavior in detail in Appendix E.
3.2. RankMe Predicts Linear Probing Performance Even on Unseen Datasets
In order to empirically validate RankMe, we compare it to linear evaluation, which is the default evaluation method of JE-SSL methods. Finetuning has gained in popularity with Masked Image Modeling methods (He et al., 2021), but it can have a significant impact on the properties of the embeddings and alters what was learned during pretraining. As such, we do not focus on this evaluation.
Experimental Methods and Datasets Considered. In order to provide a meaningful assessment of the impact of the embeddings' rank on performance, we focus on 5 JE-SSL methods. We use SimCLR as a representative contrastive method, VICReg as a representative covariance-based method, and VICReg-exp and VICReg-ctr, which were introduced in (Garrido et al., 2022). We also include DINO (Caron et al., 2021) as a clustering approach. Applying RankMe to DINO is not as straightforward due to the clustering layer in the projector, so embeddings have to be taken right before the last projector layer; see Appendix C for more details. To make our work self-contained, we present the methods in Appendix A. We chose to use VICReg-exp and VICReg-ctr as they provide small modifications to VICReg and SimCLR while producing embeddings with different rank properties. For each method we vary parameters that directly influence the rank of the embeddings, whether it is the temperature used in softmax-based methods, which directly impacts the hardness of the softmax, or the loss weights, which give more or less importance to the regularizing aspect of the loss functions. We also vary optimization parameters such as the learning rate and weight decay to provide a more complete analysis. We provide the hyperparameters used for all experiments in Appendix K. All approaches were trained in the same experimental setting with a ResNet-50 (He et al., 2016) backbone and an MLP projector with intermediate layers of size 8192, 8192, 2048, which avoids any architectural rank constraints. The models were trained for 100 epochs on ImageNet with the LARS (You et al., 2017; Goyal et al., 2017) optimizer. DINO was also trained using multi-crop.
In order to evaluate the methods, we use ImageNet (our source dataset), as well as iNaturalist18 (Horn et al., 2018), Places205 (Zhou et al., 2014), EuroSat (Helber et al., 2019), SUN397 (Xiao et al., 2010), and StanfordCars (Krause et al., 2013) to evaluate the trained models on unseen datasets. While we focus on these datasets for our visualizations, we also include CIFAR10, CIFAR100 (Krizhevsky et al., 2009), Food101 (Bossard et al., 2014), VOC07 (Everingham et al.) and CLEVR-count (Johnson et al., 2017) for our hyperparameter selection results, and provide matching visualizations in Appendix D. These commonly used datasets provide a wide range of scenarios that differ from ImageNet and provide meaningful ways to test the robustness of RankMe. For example, iNaturalist18 consists of 8142 classes focused on fauna and flora, which requires more granularity than similar classes in ImageNet; SUN397 focuses on scene understanding, deviating from the single-object and object-centric images of ImageNet; and EuroSat consists of satellite images, which again differ from ImageNet. Datasets such as iNaturalist can also allow theoretical limitations to manifest themselves more clearly, due to the number of classes being significantly higher than the rank of learned representations.
Figure 2. Validation of RankMe when evaluating performance on representations. We see that having a high rank is a necessary condition for good downstream performance.
In order to evaluate on those datasets, we rely on the VISSL library (Goyal et al., 2021). We provide complete details on the pretraining and evaluation setup in Appendix I.
RankMe as a predictor of linear classification accuracy. As we can see in Figures 1 and 2, for a given method the performance on the representations improves with a higher embedding rank, whether we look at ImageNet, on which the models were pretrained, or at downstream datasets. This is best seen when looking at DINO, where we notice a clear trend across all datasets. On EuroSat, the relationship is not as clear since the performances are so close between all models. When looking at VICReg on StanfordCars, we can clearly see that a high rank is only a necessary condition. Here the best performance is not achieved with the highest rank, even if full-rank embeddings still achieve good performance. We discuss the link between rank, number of classes, and performance in Section 5 to give some insights into RankMe's behavior in settings with few classes such as StanfordCars. It is also very tempting to draw conclusions when comparing different approaches, especially when looking at the ImageNet performance; however, since dimensional collapse is not the only factor deciding performance, one should refrain from doing so.
Figure 3. Impact of rank on performance with other architectures and evaluation protocols. (Left) Using a 3-layer MLP as the classification head does not alter the performance before or after the projector, showing that RankMe can go beyond linear evaluation. (Right) The same conclusion holds for k-NN evaluation on ImageNet, where RankMe remains a good indicator of performance.
3.3. RankMe Also Holds for Non-linear Probing
While we have been focusing on linear evaluation, one can wonder if the behaviors change when using a more complex task-related head. We thus give some evidence that the previously observed behaviors are similar with a non-linear classification head. We use a simple 3-layer MLP with intermediate dimensions 2048, where each layer is followed by a ReLU activation. This choice of dimensions ensures that there are no architectural rank constraints on the embeddings. We focus on SUN397 for its conceptual difference to ImageNet. The low rank of embeddings produced by SimCLR would suggest that a non-linear classifier might help improve performance, since it is not as theoretically limited by the embeddings' rank as in the linear setting. However, we can see in Figure 3 that the behaviors for all methods are the same as in the linear regime. This suggests that RankMe is also a suitable metric to evaluate downstream performance in a non-linear setting. We perform the same analysis using a k-NN classifier, following the protocol of (Zhuang et al., 2019; Caron et al., 2020), where we use 36 combinations of k and temperature and report the best performance. We see in Figure 3 that RankMe remains a good predictor of downstream performance, with curves similar to what was observed with a linear classifier. Since a k-NN classifier evaluates the preservation of the Euclidean distance instead of linear separability, the results suggest that RankMe extends to more evaluation protocols.
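For concreteness, the following is a minimal sketch of the non-linear probe described above: a 3-layer MLP with intermediate width 2048 and ReLU activations, placed on top of frozen representations. The 2048-dimensional input (ResNet-50 representations), the SUN397 class count, and the absence of an activation after the output layer are our assumptions for illustration.

```python
import torch.nn as nn

def mlp_probe(in_dim: int = 2048, num_classes: int = 397) -> nn.Sequential:
    # Hidden layers of width 2048 with ReLU, as described in Section 3.3;
    # the final layer outputs class logits (no activation, an assumption here).
    return nn.Sequential(
        nn.Linear(in_dim, 2048), nn.ReLU(),
        nn.Linear(2048, 2048), nn.ReLU(),
        nn.Linear(2048, num_classes),
    )
```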
4. RankMe for Label-Free Cross-Validation
We previously focused on validating RankMe by comparing it to linear evaluation in terms of overall performance. In this section we focus on the evolution of rank and performance when varying one hyperparameter at a time, in order to demonstrate how RankMe can be used for hyperparameter selection. We focus on loss-specific hyperparameters, such as the loss weights or temperature, as well as hyperparameters related to optimization, such as the learning rate and weight decay.
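As a rough illustration of how RankMe slots into such a search (the paper's own pseudo-code is in Figure 4, not reproduced in this excerpt), the loop below keeps the configuration whose embeddings have the highest RankMe value. Here train_jessl and embed are hypothetical helpers standing in for a full JE-SSL pretraining run and embedding extraction, and rankme refers to the sketch from Section 3.1.

```python
def select_hyperparameters(candidate_configs, source_loader):
    # Label-free selection: keep the config whose embeddings have the highest RankMe.
    best_cfg, best_score = None, float("-inf")
    for cfg in candidate_configs:
        encoder, projector = train_jessl(cfg)          # hypothetical: pretrain with this config
        Z = embed(encoder, projector, source_loader)   # hypothetical: ~25600 source embeddings
        score = rankme(Z)                              # Equation (1); no labels involved
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```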