OPERA: Omni-Supervised Representation Learning with Hierarchical
Supervisions
Chengkun Wang1,2,*  Wenzhao Zheng1,2,*  Zheng Zhu3  Jie Zhou1,2  Jiwen Lu1,2,†
1Beijing National Research Center for Information Science and Technology, China
2Department of Automation, Tsinghua University, China 3PhiGent Robotics
{wck20,zhengwz18}@mails.tsinghua.edu.cn; zhengzhu@ieee.org;
{jzhou,lujiwen}@tsinghua.edu.cn
Abstract
The pretrain-finetune paradigm in modern computer vision facilitates the success of self-supervised learning, which achieves better transferability than supervised learning. However, with the availability of massive labeled data, a natural question emerges: how to train a better model with both self and full supervision signals? In this paper, we propose Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA) as a solution. We provide a unified perspective of supervisions from labeled and unlabeled data and propose a unified framework of fully supervised and self-supervised learning. We extract a set of hierarchical proxy representations for each image and impose self and full supervisions on the corresponding proxy representations. Extensive experiments on both convolutional neural networks and vision transformers demonstrate the superiority of OPERA in image classification, segmentation, and object detection. Code is available at: https://github.com/wangck20/OPERA.
1. Introduction
Learning good representations is a significant yet challenging task in deep learning [12, 22, 80]. Researchers have developed various ways to adapt to different supervisions, such as fully supervised [30, 42, 55, 59], self-supervised [10, 20, 62, 73], and semi-supervised learning [61, 71, 76]. They serve as fundamental procedures in various tasks including image classification [16, 75, 77], semantic segmentation [20, 50], and object detection [5, 23, 72].

Fully supervised learning (FSL) has always been the default choice for representation learning, which learns by discriminating samples with different ground-truth labels. However, this dominance begins to fade with the rise of the pretrain-finetune paradigm in modern computer vision.
*Equal contribution.
†Corresponding author.
[Figure 1: bar chart comparing pretraining methods on ImageNet linear classification (ResNet-50, DeiT-S), ImageNet end-to-end finetuning (DeiT-S, DeiT-B), CIFAR-100 transfer learning (ResNet-50, DeiT-B), ADE20K semantic segmentation (ResNet-50, DeiT-S, DeiT-B), and COCO object detection (ResNet-50).]
Figure 1. The proposed OPERA outperforms both fully supervised
and self-supervised counterparts on various downstream tasks.
Under such a paradigm, researchers usually first pretrain a network on a large dataset and then transfer it to downstream tasks [12, 14, 21, 22]. This advocates transferability more than discriminativeness of the learned representations. This preference nurtures the recent success of self-supervised learning (SSL) methods with a contrastive objective [10, 20, 22, 64, 68]. They require two views (augmentations) of the same image to be consistent and distinct from other images in the representation space. This instance-level supervision is said to obtain more general and thus transferable representations [18, 27]. The ability to learn without human-annotated labels also greatly popularizes self-supervised contrastive learning. Despite its advantages, we want to explore whether combining self-supervised signals¹ with fully supervised signals further improves the transferability, given the availability of massive annotated labels [1, 4, 33, 48].
¹We mainly focus on self-supervised contrastive learning. In the rest of the paper, for simplicity, we use self-supervised learning to refer to self-supervised contrastive learning unless otherwise specified.
[Figure 2: schematic of push/pull relations among samples under (a) Fully Supervised Learning, (b) Self-supervised Learning, and (c) OPERA, with the conflict between class-level and instance-level supervisions highlighted.]
Figure 2. Comparisons of different learning strategies. Fully supervised learning (a) and self-supervised learning (b) constrain images at
the class level and instance level, respectively. They conflict with each other for different images from the same class. OPERA imposes
hierarchical supervisions on hierarchical spaces and uses a transformation to resolve the supervision conflicts.
We find that a simple combination of the self and full supervisions results in contradictory training signals. To address this, in this paper, we provide Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA) as a solution, as demonstrated in Figure 2. We unify full and self supervisions in a similarity learning framework where they differ only in the definition of positive and negative pairs. Instead of directly imposing supervisions on the representations, we extract a hierarchy of proxy representations to receive the corresponding supervision signals.

Extensive experiments are conducted with both convolutional neural networks [24] and vision transformers [17] as the backbone model. We pretrain the models using OPERA on ImageNet-1K [48] and then transfer them to various downstream tasks to evaluate transferability. We report image classification accuracy with both linear probing and end-to-end finetuning on ImageNet-1K. We also conduct experiments when transferring the pretrained model to other classification tasks, semantic segmentation, and object detection. Experimental results demonstrate consistent improvements over FSL and SSL on all the downstream tasks, as shown in Figure 1. Additionally, we show that OPERA outperforms the counterpart methods even with fewer pretraining epochs (e.g., fewer than 150 epochs), demonstrating good data efficiency.
2. Related Work
Fully Supervised Representation Learning. Fully supervised representation learning (FSL) utilizes the ground-truth labels of data to learn a discriminative representation space. The general objective is to maximize the discrepancies between representations from different categories and minimize those from the same class. The softmax loss is widely used for FSL [16, 24, 35, 60], and various loss functions are further developed in deep metric learning [26, 30, 38, 51, 63].

As fully supervised objectives entail strong constraints, the learned representations are usually more suitable for the specialized classification task and thus lag behind in transferability [18, 27, 79]. To alleviate this, many works devise various data augmentation methods to expand the training distribution [7, 29, 54, 77]. Recent works also explore adding more layers after the representation to avoid direct supervision [57, 65]. Differently, we focus on effectively combining self and full supervisions to improve transferability.
Self-supervised Representation Learning. Self-supervised representation learning (SSL) has attracted increasing attention in recent years due to its ability to learn meaningful representations without human-annotated labels. The main idea is to train the model to perform a carefully designed label-free pretext task. Early self-supervised learning methods devised various pretext tasks including image restoration [45, 56, 78], prediction of image rotation [19], and solving jigsaw puzzles [41]. They achieve fair performance but still cannot equal fully supervised learning until the rise of self-supervised contrastive learning [10, 20, 22]. The pretext task of contrastive learning is instance discrimination, i.e., identifying different views (augmentations) of the same image from those of other images. Contrastive learning methods [8, 12, 25, 32, 34, 58, 68, 69] demonstrate even better transferability than fully supervised learning, resulting from their focus on lower-level and thus more general features [18, 27, 79]. Very recently, masked image modeling (MIM) [21, 70, 82] has emerged as a strong competitor to contrastive learning, which trains the model to correctly predict the masked parts of the input image. In this paper, we mainly focus on contrastive learning in self-supervised learning. Our framework can be extended to other pretext tasks by inserting a new task space into the hierarchy.
Omni-supervised Representation Learning. It is worth mentioning that some existing studies have attempted to combine FSL and SSL [39, 46, 66]. Radosavovic et al. [46] first trained an FSL model and then performed knowledge distillation on unlabeled data. Wei et al. [66] adopted an SSL-pretrained model to generate instance labels and compute an overall similarity to train a new model. Nayman et al. [39] proposed to finetune an SSL-pretrained model using ground-truth labels in a controlled manner to enhance its transferability. Nevertheless, they do not consider the hierarchical relations between the self and full supervisions. Also, they perform SSL and FSL sequentially in separate stages. Differently, OPERA unifies them in a universal perspective and imposes the supervisions on different levels of the representations. Our framework can be trained in an end-to-end manner efficiently with fewer epochs.
3. Proposed Approach
In this section, we first present a unified perspective of self-supervised learning (SSL) and fully supervised learning (FSL) under a similarity learning framework. We then propose OPERA to impose hierarchical supervisions on the corresponding hierarchical representations for better transferability. Lastly, we elaborate on the instantiation of the proposed OPERA framework.
3.1. Unified Framework of Similarity Learning
Given an image space $\mathcal{X} \subset \mathbb{R}^{H \times W \times C}$, deep representation learning trains a deep neural network to map images to a representation space $\mathcal{Y} \subset \mathbb{R}^{D \times 1}$. Fully supervised learning and self-supervised learning are two mainstream representation learning approaches in modern deep learning. FSL utilizes human-annotated labels as explicit supervision to train a discriminative classifier. Differently, SSL trains models without ground-truth labels. The widely used contrastive learning (e.g., MoCo-v3 [13]) obtains meaningful representations by maximizing the similarity between random augmentations of the same image.

Generally, FSL and SSL differ in both the supervision form and the optimization objective. To integrate them, we first provide a unified similarity learning framework that includes both training objectives:
$$J(\mathcal{Y}, \mathcal{P}, \mathcal{L}) = \sum_{\mathbf{y} \in \mathcal{Y},\, \mathbf{p} \in \mathcal{P},\, l \in \mathcal{L}} \big[ -w_p \cdot I(l_{\mathbf{y}}, l_{\mathbf{p}}) \cdot s(\mathbf{y}, \mathbf{p}) + w_n \cdot \big(1 - I(l_{\mathbf{y}}, l_{\mathbf{p}})\big) \cdot s(\mathbf{y}, \mathbf{p}) \big], \quad (1)$$
where $w_p \geq 0$ and $w_n \geq 0$ denote the coefficients of positive and negative pairs, $l_{\mathbf{y}}$ and $l_{\mathbf{p}}$ are the labels of the samples, and $s(\mathbf{y}, \mathbf{p})$ defines the pairwise similarity between $\mathbf{y}$ and $\mathbf{p}$. $I(a, b)$ is an indicator function that outputs 1 if $a = b$ and 0 otherwise. $\mathcal{L}$ is the label space, and $\mathcal{P}$ can be the same as $\mathcal{Y}$, a transformation of $\mathcal{Y}$, or a learnable class prototype space. For example, to obtain the softmax objective widely employed in FSL [24, 52], we can set:
$$w_p = 1, \qquad w_n = \frac{\exp(s(\mathbf{y}, \mathbf{p}))}{\sum_{l_{\mathbf{p}'} \neq l_{\mathbf{y}}} \exp(s(\mathbf{y}, \mathbf{p}'))}, \quad (2)$$
where $s(\mathbf{y}, \mathbf{p}) = \mathbf{y}^{T} \cdot \mathbf{p}$, and $\mathbf{p}$ is a row vector of the classifier matrix $W$. For the InfoNCE loss used in contrastive learning [22, 28, 53], we set:
$$w_p = \frac{1}{\tau} \cdot \frac{\sum_{l_{\mathbf{p}'} \neq l_{\mathbf{y}}} \exp(s(\mathbf{y}, \mathbf{p}'))}{\exp(s(\mathbf{y}, \mathbf{p})) + \sum_{l_{\mathbf{p}'} \neq l_{\mathbf{y}}} \exp(s(\mathbf{y}, \mathbf{p}'))}, \qquad w_n = \frac{1}{\tau} \cdot \frac{\exp(s(\mathbf{y}, \mathbf{p}))}{\exp(s(\mathbf{y}, \mathbf{p})) + \sum_{l_{\mathbf{p}'} \neq l_{\mathbf{y}}} \exp(s(\mathbf{y}, \mathbf{p}'))}, \quad (3)$$
where $\tau$ is the temperature hyper-parameter. See Appendix A.1 for more details.
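To make this unification concrete, the sketch below (a minimal illustration under our own naming and shape assumptions, not the authors' released code) computes the pair coefficients of Eqs. (2) and (3) from a matrix of pairwise similarities and plugs them into the form of Eq. (1); the coefficients are treated as fixed for a given pair, i.e., this mirrors the gradient view of the two losses rather than their exact values.

```python
import torch

def pair_weights_softmax(sim, pos_mask):
    """Eq. (2): w_p = 1; w_n = exp(s(y,p)) / sum over negative pairs of exp(s(y,p'))."""
    exp_sim = sim.exp()                                        # (N, M) pairwise similarities s(y, p)
    neg_sum = (exp_sim * (~pos_mask)).sum(dim=1, keepdim=True) # sum over negatives per anchor
    w_p = torch.ones_like(sim)
    w_n = exp_sim / neg_sum.clamp_min(1e-12)
    return w_p, w_n

def pair_weights_infonce(sim, pos_mask, tau=0.2):
    """Eq. (3): both coefficients share the denominator exp(s_pos) + sum_neg exp(s')."""
    exp_sim = sim.exp()
    pos_term = (exp_sim * pos_mask).sum(dim=1, keepdim=True)   # exp(s(y, p)) for the positive pair
    neg_sum = (exp_sim * (~pos_mask)).sum(dim=1, keepdim=True)
    denom = (pos_term + neg_sum).clamp_min(1e-12)
    w_p = neg_sum / denom / tau
    w_n = exp_sim / denom / tau
    return w_p, w_n

def unified_objective(sim, pos_mask, w_p, w_n):
    """Eq. (1): -w_p * s(y,p) on positive pairs, +w_n * s(y,p) on negative pairs."""
    pos_part = -(w_p * sim * pos_mask).sum()
    neg_part = (w_n * sim * (~pos_mask)).sum()
    return pos_part + neg_part
```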
Under the unified training objective (1), the main difference between FSL and SSL lies in the definition of the label spaces $\mathcal{L}^{full}$ and $\mathcal{L}^{self}$. For labels $l^{full} \in \mathcal{L}^{full}$ in FSL, $l^{full}_i = l^{full}_j$ only if the samples are from the same ground-truth category. For labels $l^{self} \in \mathcal{L}^{self}$ in SSL, $l^{self}_i = l^{self}_j$ only if they are augmented views of the same image.
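The snippet below (variable names are our own assumptions) spells out how the two label definitions translate into positive-pair masks for a batch of augmented views, which is the only place where FSL and SSL differ in the unified objective.

```python
import torch

def build_pos_masks(class_labels, instance_ids):
    """class_labels, instance_ids: (N,) integer tensors for one batch of views."""
    full_pos = class_labels.unsqueeze(0) == class_labels.unsqueeze(1)  # same ground-truth class
    self_pos = instance_ids.unsqueeze(0) == instance_ids.unsqueeze(1)  # views of the same image
    return full_pos, self_pos

# Example: four views taken from two different images of the same class.
class_labels = torch.tensor([3, 3, 3, 3])   # all labelled "class 3"
instance_ids = torch.tensor([0, 0, 1, 1])   # two augmented views per image
full_pos, self_pos = build_pos_masks(class_labels, instance_ids)
# full_pos marks every pair as positive, while self_pos only marks view pairs of the
# same image, so the two supervisions disagree on same-class, cross-image pairs.
```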
3.2. Hierarchical Supervisions on Hierarchical Representations
With the same formulation of the training objective, a naive way to combine the two training signals is to simply add them:
$$J_{naive}(\mathcal{Y}, \mathcal{P}, \mathcal{L}) = \sum_{\mathbf{y} \in \mathcal{Y},\, \mathbf{p} \in \mathcal{P},\, l \in \mathcal{L}} \big[ -w^{self}_p \cdot I(l^{self}_{\mathbf{y}}, l^{self}_{\mathbf{p}}) \cdot s(\mathbf{y}, \mathbf{p}) + w^{self}_n \cdot \big(1 - I(l^{self}_{\mathbf{y}}, l^{self}_{\mathbf{p}})\big) \cdot s(\mathbf{y}, \mathbf{p}) - w^{full}_p \cdot I(l^{full}_{\mathbf{y}}, l^{full}_{\mathbf{p}}) \cdot s(\mathbf{y}, \mathbf{p}) + w^{full}_n \cdot \big(1 - I(l^{full}_{\mathbf{y}}, l^{full}_{\mathbf{p}})\big) \cdot s(\mathbf{y}, \mathbf{p}) \big]. \quad (4)$$
For $\mathbf{y}$ and $\mathbf{p}$ from the same class but different images, i.e., $I(l^{self}_{\mathbf{y}}, l^{self}_{\mathbf{p}}) = 0$ and $I(l^{full}_{\mathbf{y}}, l^{full}_{\mathbf{p}}) = 1$, the training loss is:
$$J_{naive}(\mathbf{y}, \mathbf{p}, l) = (w^{self}_n - w^{full}_p) \cdot s(\mathbf{y}, \mathbf{p}). \quad (5)$$
This indicates that the two training signals are contradictory and may neutralize each other. This is particularly harmful if we adopt similar loss functions for fully supervised and self-supervised learning, i.e., $w^{self}_n \approx w^{full}_p$, and thus $J_{naive}(\mathbf{y}, \mathbf{p}, l) \approx 0$.
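As a toy numerical illustration of Eq. (5) (the numbers below are assumptions, not values from the paper), the naive combination scales the similarity of a same-class, cross-image pair by a coefficient that can nearly vanish:

```python
# Assumed toy numbers, not values from the paper.
w_n_self = 0.48   # self-supervised negative-pair coefficient for the pair (y, p)
w_p_full = 0.50   # fully supervised positive-pair coefficient for the same pair
s_yp = 0.7        # pairwise similarity s(y, p)

naive_term = (w_n_self - w_p_full) * s_yp
print(naive_term)  # ≈ -0.014: the two signals nearly cancel, so s(y, p) barely changes
```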
Existing methods [39, 65, 66] address this by imposing the two training signals sequentially: they first obtain a self-supervised pretrained model and then use the full supervision to tune it. Differently, we propose a more efficient way to adaptively balance the two weights so that we can employ them simultaneously:
$$J_{adap}(\mathbf{y}, \mathbf{p}, l) = (w^{self}_n \cdot \alpha - w^{full}_p \cdot \beta) \cdot s(\mathbf{y}, \mathbf{p}), \quad (6)$$
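Beyond this reweighting, the core of OPERA described above is to extract a hierarchy of proxy representations and let each supervision act on its own level. The sketch below is a minimal illustration under our own assumptions about head structure and dimensions, not the released implementation:

```python
import torch
import torch.nn as nn

class HierarchicalHeads(nn.Module):
    """Maps a backbone feature to two proxy representations arranged hierarchically."""
    def __init__(self, feat_dim=2048, self_dim=256, full_dim=128):
        super().__init__()
        # instance-level proxy space, which receives the self-supervised signal
        self.self_head = nn.Sequential(nn.Linear(feat_dim, self_dim), nn.ReLU(),
                                       nn.Linear(self_dim, self_dim))
        # class-level proxy space, built as a transformation of the instance-level
        # one, which receives the fully supervised signal
        self.full_head = nn.Sequential(nn.Linear(self_dim, full_dim), nn.ReLU(),
                                       nn.Linear(full_dim, full_dim))

    def forward(self, feat):
        z_self = self.self_head(feat)    # instance-level proxy representation
        z_full = self.full_head(z_self)  # class-level proxy representation
        return z_self, z_full

heads = HierarchicalHeads()
feat = torch.randn(8, 2048)              # backbone features for a batch of 8 images
z_self, z_full = heads(feat)
# total_loss = ssl_loss(z_self, ...) + fsl_loss(z_full, ...)  # hypothetical losses:
# the two signals no longer collide on the same vector, matching Figure 2 (c).
```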