Improving Dense Contrastive Learning with
Dense Negative Pairs
Berk Iskender
UIUC
Zhenlin Xu
Amazon
Simon Kornblith
Google Research
En-Hung Chu
Google
Maryam Khademi
Google
Abstract
Many contrastive representation learning methods learn a single global representation of an entire image. However, dense contrastive representation learning methods
such as DenseCL (Wang et al., 2021) can learn better representations for tasks
requiring stronger spatial localization of features, such as multi-label classification,
detection, and segmentation. In this work, we study how to improve the quality
of the representations learned by DenseCL by modifying the training scheme and
objective function, and propose DenseCL++. We also conduct several ablation
studies to better understand the effects of: (i) various techniques to form dense
negative pairs among augmentations of different images, (ii) cross-view dense
negative and positive pairs, and (iii) an auxiliary reconstruction task. Our results
show 3.5% and 4% mAP improvement over SimCLR (Chen et al., 2020a) and
DenseCL in COCO multi-label classification. In COCO and VOC segmentation
tasks, we achieve 1.8% and 0.7% mIoU improvements over SimCLR, respectively.
1 Introduction
Self-supervised learning aims to learn representations from unlabeled data via pretext task training.
Contrastive learning, as a self-supervised learning technique, performs the pretext task of instance
discrimination (Dosovitskiy et al., 2014; Wu et al., 2018; Chen et al., 2020a). Instance discrimination
in contrastive learning usually trains a single global representation, and these representations are
principally evaluated in terms of downstream performance on a single-label classification task. However,
methods that perform well in this setting may perform suboptimally on multi-label classification tasks,
where each label is associated with a distinct object in an image and different image regions contain
different semantic content. Motivated by the important application of multi-label classification in
industry, we target representation learning for this task. We demonstrate that our approach also
improves accuracy on dense downstream tasks such as segmentation.
Our work is inspired by DenseCL (Wang et al., 2021), which proposes to use dense features rather
than global ones in contrastive learning to improve the performance in dense prediction tasks. We
focus on further boosting the performance of dense contrastive learning by modifying the training
scheme and the objective function. Unlike DenseCL, our proposed approach formulates negative
pairs between the dense features of augmented views of different images and uses their similarities
in the proposed dense contrastive loss scheme. We show that the proposed method outperforms
DenseCL in various settings. We also conduct several ablation studies to better understand the effects
of: (i) various methods to form dense negative pairs among augmentations of different images, (ii)
cross-view dense negative and positive pairs, and (iii) an auxiliary reconstruction task.
Correspondence to berki2@illinois.edu and maryamkhademi@google.com
Work done during an internship at Google
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.05063v2 [cs.CV] 10 Jan 2023
[Figure 1 diagram: a Transformer encoder followed by a global projection head and a dense projection head.]
Figure 1: DenseCL++ training scheme. Global and dense positive/negative correspondences are used in the
global (top row) and dense (bottom row) loss functions, respectively.
2 Related Work
SimCLR (Chen et al., 2020a) proposes a simple contrastive learning framework in which the projected
representations of randomly augmented views of the same image are attracted to each other using a
contrastive loss. DenseCL (Wang et al., 2021) proposes an extension of this framework better suited
to dense prediction tasks. In DenseCL, the contrastive loss is applied in a dense pairwise manner,
which improves performance compared to global representation learning counterparts (Chen et al., 2020b).
On the other hand, following the widespread success of the transformer architecture (Vaswani et al.,
2017) in NLP tasks, the Vision Transformer (ViT) (Dosovitskiy et al., 2020) adapts the architecture to
visual tasks and achieves impressive results when pretrained on a sufficient amount of data. Inspired
by Devlin et al. (2018), Li et al. (2021) explore the idea of introducing a reconstruction task into
the contrastive learning framework using ViT as an encoder. Wang et al. (2022) further study the
use of reconstruction as a pretext task, incorporating a decoder module in various self-supervised
contrastive settings. Both methods use shallow convolutional networks as decoders so that additional
useful local features are preferably learned in the latent space. However, the authors suggest that
using sophisticated reconstruction models may be harmful to transfer tasks, as they could lead to
excessively local representations.
3 Method
3.1 Dense Contrastive Learning
Contrastive learning learns latent representations of signals for which the positive correspondences
are attracted to each other and negative ones are repelled from one another. Dense contrastive learning
(Wang et al., 2021) further adapts this framework for dense prediction tasks by replacing the global
representations with their dense counterparts. For each image, instead of the global representation
$v \in \mathbb{R}^D$, $S \times S$ many dense feature vectors $z \in \mathbb{R}^L$ are extracted.
Then, dense positive pairs are formed between dense features of the anchor view $x_i$ and its
corresponding augmented view $x_{j^+}$ by finding the most similar correspondence for each dense
vector of $x_i$ in $x_{j^+}$ as $k^+ = \arg\max_l \, \mathrm{sim}(z^{(i)}_k, z^{(j^+)}_l)$, where
$z^{(i)}_k$ is the $k$-th dense feature of the anchor view $x_i$, $z^{(j^+)}_l$ is the $l$-th dense
feature of $x_{j^+}$, and $\mathrm{sim}(a, b)$ calculates the cosine similarity between two feature
vectors. Dense negative pairs are formed between the dense feature vectors of the anchor view and
the global representations of views from other images. The dense contrastive loss is computed as
$$\mathcal{L}_{i,d} = -\sum_k \log \frac{\exp\left(z^{(i)}_k \cdot z^{(j^+)}_{k^+} / \tau\right)}{\exp\left(z^{(i)}_k \cdot z^{(j^+)}_{k^+} / \tau\right) + \sum_j \exp\left(z^{(i)}_k \cdot v_j / \tau\right)} \qquad (1)$$
where $z^{(j^+)}_{k^+}$ is the positive dense correspondence for the dense feature vector $z^{(i)}_k$
in the view $x_{j^+}$, $v_j$ is the global feature for the image $x_j$, and $\tau$ is the temperature
parameter.
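To make the pairing concrete, below is a minimal PyTorch sketch of Eq. (1). The function name, tensor shapes, the default temperature, and the use of matrix products over L2-normalized features are our assumptions for illustration, not code from DenseCL or this paper.

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(z_i, z_jp, v_neg, tau=0.2):
    """Sketch of the dense InfoNCE loss (Eq. 1) for one anchor view.

    z_i:   (K, L) dense features z^(i) of the anchor view x_i, K = S*S
    z_jp:  (K, L) dense features z^(j+) of the augmented view x_{j+}
    v_neg: (N, L) global features v_j of views from other images
    tau:   temperature (the default value here is illustrative)
    """
    # L2-normalize so that dot products equal cosine similarities.
    z_i = F.normalize(z_i, dim=1)
    z_jp = F.normalize(z_jp, dim=1)
    v_neg = F.normalize(v_neg, dim=1)

    # k+ = argmax_l sim(z_k^(i), z_l^(j+)): the most similar dense vector
    # in the augmented view becomes the positive for each anchor vector.
    sim = z_i @ z_jp.t()                         # (K, K) cosine similarities
    k_plus = sim.argmax(dim=1)                   # (K,) index of each positive

    pos = (z_i * z_jp[k_plus]).sum(dim=1) / tau  # (K,)   positive logits
    neg = (z_i @ v_neg.t()) / tau                # (K, N) negative logits

    # -log softmax with the positive in column 0; cross_entropy averages
    # over the K dense features, whereas Eq. (1) writes a sum over k.
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    target = z_i.new_zeros(z_i.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)
```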
The overall loss is a linear combination of the global InfoNCE loss term $\mathcal{L}_{i,g}$ (Oord
et al., 2018) and the dense loss, $\mathcal{L}_i = (1 - \lambda)\mathcal{L}_{i,g} + \lambda \mathcal{L}_{i,d}$,
where $\lambda \in [0, 1]$ is a weight constant.
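Combining the two terms is then a single convex mixture; a small sketch continuing the one above, where the default weight is illustrative and not necessarily the value used in the experiments:

```python
def overall_loss(loss_global, loss_dense, lam=0.5):
    # L_i = (1 - lambda) * L_{i,g} + lambda * L_{i,d}, with lambda in [0, 1].
    # lam = 0.5 is an illustrative default, not the paper's stated setting.
    return (1.0 - lam) * loss_global + lam * loss_dense
```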