
Figure 2. Comparisons of different learning strategies. Fully supervised learning (a) and self-supervised learning (b) constrain images at
the class level and instance level, respectively. They conflict with each other for different images from the same class. OPERA imposes
hierarchical supervisions on hierarchical spaces and uses a transformation to resolve the supervision conflicts.
We find that a simple combination of self and full supervisions results in contradictory training signals. To address this, we propose Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA), as illustrated in Figure 2.
We unify full and self supervisions in a similarity learn-
ing framework where they differ only by the definition
of positive and negative pairs. Instead of directly impos-
ing supervisions on the representations, we extract a hier-
archy of proxy representations to receive the correspond-
ing supervision signals. Extensive experiments are con-
ducted with both convolutional neural networks [24] and
vision transformers [17] as the backbone model. We pre-
train the models using OPERA on ImageNet-1K [48] and
then transfer them to various downstream tasks to evalu-
ate the transferability. We report image classification ac-
curacy with both linear probe and end-to-end finetuning on
ImageNet-1K. We also conduct experiments when transfer-
ring the pretrained model to other classification tasks, se-
mantic segmentation, and object detection. Experimental
results demonstrate consistent improvements over FSL and
SSL on all the downstream tasks, as shown in Figure 1.
Additionally, we show that OPERA outperforms its counterparts even with fewer pretraining epochs (e.g., fewer than 150), demonstrating good training efficiency.
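To make the hierarchical-supervision idea above concrete, the following PyTorch-style sketch shows one possible way to attach a hierarchy of proxy heads to a backbone and to apply the instance-level (self) and class-level (full) similarity losses on different spaces. The module names, head depths, and equal loss weighting are illustrative assumptions on our part, not the exact OPERA implementation.

    # Illustrative sketch only: hierarchical proxy heads receiving instance-level
    # (self) and class-level (full) supervision; names and weighting are assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierarchicalHeads(nn.Module):
        def __init__(self, backbone, feat_dim=2048, proj_dim=256):
            super().__init__()
            self.backbone = backbone                            # e.g., a ResNet or ViT trunk
            self.instance_head = nn.Linear(feat_dim, proj_dim)  # proxy space for self supervision
            self.class_head = nn.Linear(proj_dim, proj_dim)     # deeper proxy space for full supervision

        def forward(self, x):
            h = self.backbone(x)                                # base representation
            z_inst = F.normalize(self.instance_head(h), dim=-1)
            z_cls = F.normalize(self.class_head(z_inst), dim=-1)
            return z_inst, z_cls

    def similarity_loss(za, zb, pos_mask, temperature=0.1):
        # Generic similarity learning: pull positive pairs together, push the rest apart.
        logits = za @ zb.t() / temperature
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        return -(log_prob * pos_mask).sum(1).div(pos_mask.sum(1).clamp(min=1)).mean()

    # With two augmented views per image (z1_*, z2_* from the model), the two
    # supervision signals differ only in the positive mask:
    #   instance level: positives are the two views of the same image -> torch.eye(B)
    #   class level:    positives share a ground-truth label
    #                   -> (labels[:, None] == labels[None, :]).float()
    # total_loss = similarity_loss(z1_inst, z2_inst, eye_mask) \
    #            + similarity_loss(z1_cls, z2_cls, label_mask)

The point of the hierarchy is that the class-level constraint acts on a transformed proxy space rather than directly on the representation used for instance discrimination, which is how the supervision conflict illustrated in Figure 2 is intended to be avoided.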
2. Related Work
Fully Supervised Representation Learning. Fully su-
pervised representation learning (FSL) utilizes the ground-
truth labels of data to learn a discriminative representation
space. The general objective is to maximize the discrepan-
cies of representations from different categories and mini-
mize those from the same class. The softmax loss is widely used for FSL [16,24,35,60], and various alternative loss functions have been developed in deep metric learning [26,30,38,51,63].
As fully supervised objectives entail strong constraints,
the learned representations are usually more suitable for the
specialized classification task and thus lag behind in transferability [18,27,79]. To alleviate this, many works devise
various data augmentation methods to expand the training
distribution [7,29,54,77]. Recent works also explore adding
more layers after the representation to avoid direct supervi-
sion [57,65]. In contrast, we focus on effectively combining self and full supervisions to improve transferability.
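For reference, the fully supervised objective discussed above reduces to a softmax cross-entropy over class logits computed from the representation. The minimal PyTorch sketch below illustrates this; the toy encoder, input size, and class count are arbitrary choices for the example.

    # Minimal sketch of the softmax (cross-entropy) objective used in FSL:
    # representations are pushed toward their own class and away from the others.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())  # toy encoder
    classifier = nn.Linear(128, 10)                   # one logit per class

    images = torch.randn(8, 3, 32, 32)                # dummy batch
    labels = torch.randint(0, 10, (8,))               # ground-truth class labels

    logits = classifier(encoder(images))              # class scores from the representation
    loss = F.cross_entropy(logits, labels)            # softmax loss
    loss.backward()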
Self-supervised Representation Learning. Self-supervised representation learning (SSL) has attracted increasing attention in recent years due to its ability to learn meaningful representations without human-annotated labels. The
main idea is to train the model to perform a carefully de-
signed label-free pretext task. Early self-supervised learn-
ing methods devised various pretext tasks including image
restoration [45,56,78], prediction of image rotation [19],
and solving jigsaw puzzles [41]. They achieve fair performance but still cannot match fully supervised learning until the rise of self-supervised contrastive learning [10,20,22].
The pretext task of contrastive learning is instance discrim-
ination, i.e., to identify different views (augmentations) of
the same image from those of other images. Contrastive
learning methods [8,12,25,32,34,58,68,69] demonstrate
even better transferability than fully supervised learning, re-
sulting from their focus on lower-level and thus more gen-
eral features [18,27,79]. Very recently, masked image modeling (MIM) [21,70,82], which trains the model to predict the masked parts of the input image, has emerged as a strong competitor to contrastive learning. In this paper,
we mainly focus on contrastive learning in self-supervised
learning. Our framework can be extended to other pretext
tasks by inserting a new task space in the hierarchy.
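For concreteness, the instance-discrimination pretext task described above is typically implemented with an InfoNCE-style loss over two augmented views of each image. The sketch below is a generic, SimCLR-like illustration under our own simplifying assumptions (in-batch negatives only, no momentum encoder), not any specific published method.

    # Sketch of instance discrimination with an InfoNCE-style loss: the positive
    # for each view is the other view of the same image; all other images in the
    # batch serve as negatives.
    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.1):
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
        targets = torch.arange(z1.size(0))            # positives lie on the diagonal
        return F.cross_entropy(logits, targets)

    # z1 = encoder(augment(images)); z2 = encoder(augment(images))
    loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))  # dummy embeddings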
Omni-supervised Representation Learning. It is
worth mentioning that some existing studies have attempted