
Figure 2. Comparisons of different learning strategies. Fully supervised learning (a) and self-supervised learning (b) constrain images at
the class level and instance level, respectively. They conflict with each other for different images from the same class. OPERA imposes
hierarchical supervisions on hierarchical spaces and uses a transformation to resolve the supervision conflicts.
We find that a simple combination of self and full supervisions results in contradictory training signals. To address this, we propose Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA), as illustrated in Figure 2.
We unify full and self supervisions in a similarity learn-
ing framework where they differ only by the definition
of positive and negative pairs. Instead of directly impos-
ing supervisions on the representations, we extract a hier-
archy of proxy representations to receive the correspond-
ing supervision signals. Extensive experiments are con-
ducted with both convolutional neural networks [24] and
vision transformers [17] as the backbone model. We pre-
train the models using OPERA on ImageNet-1K [48] and
then transfer them to various downstream tasks to evalu-
ate the transferability. We report image classification ac-
curacy with both linear probe and end-to-end finetuning on
ImageNet-1K. We also conduct experiments when transfer-
ring the pretrained model to other classification tasks, se-
mantic segmentation, and object detection. Experimental
results demonstrate consistent improvements over FSL and
SSL on all the downstream tasks, as shown in Figure 1.
Additionally, we show that OPERA outperforms its counterparts even with fewer pretraining epochs (e.g., fewer than 150), demonstrating good training efficiency.
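To make the hierarchical-supervision idea above concrete, the following PyTorch-style sketch shows one possible way to attach a hierarchy of proxy heads to a backbone and to apply the instance-level (self) and class-level (full) similarity losses on different spaces. The module names, head depths, and equal loss weighting are illustrative assumptions on our part, not the exact OPERA implementation.

    # Illustrative sketch only: hierarchical proxy heads receiving instance-level
    # (self) and class-level (full) supervision; names and weighting are assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierarchicalHeads(nn.Module):
        def __init__(self, backbone, feat_dim=2048, proj_dim=256):
            super().__init__()
            self.backbone = backbone                            # e.g., a ResNet or ViT trunk
            self.instance_head = nn.Linear(feat_dim, proj_dim)  # proxy space for self supervision
            self.class_head = nn.Linear(proj_dim, proj_dim)     # deeper proxy space for full supervision

        def forward(self, x):
            h = self.backbone(x)                                # base representation
            z_inst = F.normalize(self.instance_head(h), dim=-1)
            z_cls = F.normalize(self.class_head(z_inst), dim=-1)
            return z_inst, z_cls

    def similarity_loss(za, zb, pos_mask, temperature=0.1):
        # Generic similarity learning: pull positive pairs together, push the rest apart.
        logits = za @ zb.t() / temperature
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        return -(log_prob * pos_mask).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()

    # With two augmented views per image (z1_*, z2_* from the model), the two
    # supervision signals differ only in the positive mask:
    #   instance level: positives are the two views of the same image -> torch.eye(B)
    #   class level:    positives share a ground-truth label
    #                   -> (labels[:, None] == labels[None, :]).float()
    # total_loss = similarity_loss(z1_inst, z2_inst, eye_mask) \
    #            + similarity_loss(z1_cls, z2_cls, label_mask)

The point of the hierarchy is that the class-level constraint acts on a transformed proxy space rather than directly on the representation used for instance discrimination, which is how the supervision conflict illustrated in Figure 2 is intended to be avoided.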
2. Related Work
Fully Supervised Representation Learning. Fully su-
pervised representation learning (FSL) utilizes the ground-
truth labels of data to learn a discriminative representation
space. The general objective is to maximize the discrepan-
cies of representations from different categories and mini-
mize those from the same class. The softmax loss is widely used for FSL [16,24,35,60], and various alternative loss functions have been developed in deep metric learning [26,30,38,51,63].
As fully supervised objectives entail strong constraints,
the learned representations are usually more suitable for the
specialized classification task and thus lag behind in transferability [18,27,79]. To alleviate this, many works devise
various data augmentation methods to expand the training
distribution [7,29,54,77]. Recent works also explore adding
more layers after the representation to avoid direct supervi-
sion [57,65]. In contrast, we focus on effectively combining self and full supervisions to improve transferability.
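For reference, the fully supervised objective discussed above reduces to a softmax cross-entropy over class logits computed from the representation. The minimal PyTorch sketch below illustrates this; the toy encoder, input size, and class count are arbitrary choices for the example.

    # Minimal sketch of the softmax (cross-entropy) objective used in FSL:
    # representations are pushed toward their own class and away from the others.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())  # toy encoder
    classifier = nn.Linear(128, 10)                   # one logit per class

    images = torch.randn(8, 3, 32, 32)                # dummy batch
    labels = torch.randint(0, 10, (8,))               # ground-truth class labels

    logits = classifier(encoder(images))              # class scores from the representation
    loss = F.cross_entropy(logits, labels)            # softmax loss
    loss.backward()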
Self-supervised Representation Learning. Self-supervised representation learning (SSL) has attracted increasing attention in recent years due to its ability to learn meaningful representations without human-annotated labels. The
main idea is to train the model to perform a carefully de-
signed label-free pretext task. Early self-supervised learn-
ing methods devised various pretext tasks including image
restoration [45,56,78], prediction of image rotation [19],
and solving jigsaw puzzles [41]. They achieve fair performance but still cannot match fully supervised learning until the rise of self-supervised contrastive learning [10,20,22].
The pretext task of contrastive learning is instance discrim-
ination, i.e., to identify different views (augmentations) of
the same image from those of other images. Contrastive
learning methods [8,12,25,32,34,58,68,69] demonstrate
even better transferability than fully supervised learning, re-
sulting from their focus on lower-level and thus more gen-
eral features [18,27,79]. Very recently, masked image modeling (MIM) [21,70,82], which trains the model to predict the masked parts of the input image, has emerged as a strong competitor to contrastive learning. In this paper,
we mainly focus on contrastive learning in self-supervised
learning. Our framework can be extended to other pretext
tasks by inserting a new task space in the hierarchy.
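For concreteness, the instance-discrimination pretext task described above is typically implemented with an InfoNCE-style loss over two augmented views of each image. The sketch below is a generic, SimCLR-like illustration under our own simplifying assumptions (in-batch negatives only, no momentum encoder), not any specific published method.

    # Sketch of instance discrimination with an InfoNCE-style loss: the positive
    # for each view is the other view of the same image; all other images in the
    # batch serve as negatives.
    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.1):
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
        targets = torch.arange(z1.size(0))            # positives lie on the diagonal
        return F.cross_entropy(logits, targets)

    # z1 = encoder(augment(images)); z2 = encoder(augment(images))
    loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))  # dummy embeddings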
Omni-supervised Representation Learning. It is
worth mentioning that some existing studies have attempted