
\[
\text{RankMe}(\mathbf{Z}) = \exp\Big(-\sum_{k=1}^{\min(N,K)} p_k \log p_k\Big),
\quad \text{with } p_k = \frac{\sigma_k(\mathbf{Z})}{\|\sigma(\mathbf{Z})\|_1} + \epsilon,
\tag{1}
\]
where Z is the source dataset's embeddings, σ_k(Z) its k-th largest
singular value, N the number of samples, K the embedding dimension,
and ε a small constant. As opposed to the classical rank, Equation (1)
does not rely on specifying the exact threshold at which a singular
value is treated as nonzero. Throughout our study, we employ
Equation (1), and provide the matching analysis with the
classical rank in the appendix. Another benefit of RankMe's
Equation (1) is that, in addition to the rank, it quantifies the
whitening of the embeddings, which is known to simplify the
optimization of (non)linear probes trained on top of
them (Santurkar et al., 2018). Lastly, although Equation (1)
is defined with the full embedding matrix Z, we observe that
not all of the samples need to be used to obtain an accurate
estimate of RankMe. In practice, we use 25600 samples, as the
ablation studies provided in Appendix G and Figure S11
indicate that this yields a highly accurate estimate. RankMe
should, however, only be used to compare different runs of
a given method, since the embeddings' rank is not the only
factor that affects performance.
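To make the estimation procedure concrete, below is a minimal sketch of how Equation (1) can be computed from an embedding matrix. The function name and the use of NumPy are our own choices, the ε value is illustrative, and the 25600-sample subsampling mirrors the setting described above.

```python
import numpy as np

def rankme(Z: np.ndarray, eps: float = 1e-7) -> float:
    """Smooth rank measure of an (N, K) embedding matrix, as in Equation (1)."""
    # Singular values of the embedding matrix.
    sigma = np.linalg.svd(Z, compute_uv=False)
    # Normalized singular value distribution, smoothed by eps.
    p = sigma / np.sum(np.abs(sigma)) + eps
    # Exponential of the entropy of the distribution.
    return float(np.exp(-np.sum(p * np.log(p))))

# In practice, a random subset of the source dataset suffices,
# e.g. 25600 embeddings:
# idx = np.random.choice(len(Z), 25600, replace=False)
# score = rankme(Z[idx])
```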
Relation of RankMe to Existing Solutions. Performance
evaluation without labels can also be done using a pretext
task, such as rotation prediction. This technique helped in
selecting data augmentation policies in (Reed et al., 2021).
One limitation lies in the need to select and train the
classifier of the pretext task, and in the strong assumption
that rotations are not among the transformations one aims to
be invariant to. Since (supervised) linear evaluation is the most
widely used evaluation method, we will focus on showing
how RankMe compares with it. In (Li et al., 2022a), it is
shown that the eigenspectrum of representations can be used
to assess performance when used in conjunction with the
loss value. This requires training an additional classifier
to predict the performance and as such is not usable as is
in a completely unsupervised fashion. Most related to our
work is α-ReQ (Ghosh et al., 2022), where representations
are evaluated by their eigenspectrum decay, giving a baseline
for unsupervised hyperparameter selection. α-ReQ relies on
strong assumptions; when they hold, RankMe and α-ReQ can
perform comparably, but we show that RankMe outperforms
it on average. In fact, the assumptions made by α-ReQ are
known not to hold in the presence of collapse (He & Ozay,
2022). We investigate α-ReQ's behavior in detail in
Appendix E.
3.2. RankMe Predicts Linear Probing Performance
Even on Unseen Datasets
In order to empirically validate RankMe, we compare it to
linear evaluation, which is the default evaluation method
of JE-SSL methods. Finetuning has gained popularity
with Masked Image Modeling methods (He et al., 2021),
but it can significantly alter the properties of the
embeddings and change what was learned during
pretraining. As such, we do not focus on this evaluation.
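As a point of reference, linear evaluation trains a linear classifier on frozen features. The sketch below uses scikit-learn's logistic regression as a simplified stand-in for the full protocol (which typically relies on SGD with data augmentations), so the exact solver and hyperparameters here are illustrative.

```python
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen features and return top-1 accuracy.

    Simplified stand-in for the standard linear-evaluation protocol.
    """
    clf = LogisticRegression(max_iter=1000)  # multinomial logistic regression
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```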
Experimental Methods and Datasets Considered. In
order to provide a meaningful assessment of the impact of
the embeddings' rank on performance, we focus on 5 JE-SSL
methods. We use SimCLR as a representative contrastive
method, VICReg as a representative covariance-based
method, and VICReg-exp and VICReg-ctr, which
were introduced in (Garrido et al., 2022). We also include
DINO (Caron et al., 2021) as a clustering approach. Ap-
plying RankMe to DINO is not as straightforward due to
the clustering layer in the projector, so embeddings have to
be taken right before the last projector layer. See Appendix C
for more details. To make our work self-contained,
we present the methods in Appendix A. We chose to use
VICReg-exp and VICReg-ctr as they provide small mod-
ifications to VICReg and SimCLR while producing em-
beddings with different rank properties. For each method,
we vary parameters that directly influence the rank of the
embeddings: either the temperature used in softmax-based
methods, which controls the hardness of the softmax, or the
loss weights, which give more or less importance to the
regularizing terms of the loss. We also
vary optimization parameters such as the learning rate and
weight decay to provide a more complete analysis. We
provide the hyperparameters used for all experiments in
Appendix K. All approaches were trained in the same ex-
perimental setting with a ResNet-50 (He et al., 2016)
backbone and an MLP projector with intermediate layers of
sizes 8192, 8192, and 2048, which avoids any architectural
rank constraints. The models were trained for 100 epochs on
ImageNet with the LARS (You et al., 2017; Goyal et al.,
2017) optimizer. DINO was also trained using multi-crop.
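For concreteness, a PyTorch-style sketch of the encoder architecture described above is given below. The BatchNorm/ReLU placement inside the projector is a common JE-SSL choice and is assumed here rather than taken from the exact implementation.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_encoder(projector_dims=(8192, 8192, 2048)):
    """ResNet-50 backbone followed by an MLP projector of widths 8192-8192-2048."""
    backbone = resnet50()
    in_dim = backbone.fc.in_features   # 2048-dimensional representation
    backbone.fc = nn.Identity()        # drop the classification head

    layers, prev = [], in_dim
    for i, dim in enumerate(projector_dims):
        layers.append(nn.Linear(prev, dim))
        if i < len(projector_dims) - 1:
            # BatchNorm + ReLU between layers is a common JE-SSL choice
            # (an assumption here, not the exact implementation).
            layers += [nn.BatchNorm1d(dim), nn.ReLU(inplace=True)]
        prev = dim
    projector = nn.Sequential(*layers)
    return backbone, projector
```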
In order to evaluate the methods, we use ImageNet (our
source dataset), as well as iNaturalist18 (Horn et al., 2018),
Places205 (Zhou et al., 2014), EuroSat (Helber et al., 2019),
SUN397 (Xiao et al., 2010), and StanfordCars (Krause et al.,
2013) to evaluate the trained models on unseen datasets.
While we focus on these datasets for our visualizations, we
also include CIFAR10, CIFAR100 (Krizhevsky et al., 2009),
Food101 (Bossard et al., 2014), VOC07 (Everingham et al.)
and CLEVR-count (Johnson et al., 2017) for our hyperparam-
eter selection results, and provide matching visualizations
in Appendix D. These commonly used datasets provide a
wide range of scenarios that differ from ImageNet and pro-
vide meaningful ways to test the robustness of RankMe.
For example, iNaturalist18 consists of 8142 classes focused
on fauna and flora, requiring finer granularity than the
comparable classes in ImageNet; SUN397 focuses on scene
understanding, deviating from the object-centric images of
ImageNet; and EuroSat consists of satellite images, which
again differ from ImageNet. Datasets such as iNaturalist can
also let theoretical limitations manifest themselves more
clearly, since their number of classes is significantly higher
than the rank of the learned representations.