Improving Dense Contrastive Learning with
Dense Negative Pairs
Berk Iskender
UIUC
Zhenlin Xu
Amazon
Simon Kornblith
Google Research
En-Hung Chu
Google
Maryam Khademi
Google
Abstract
Many contrastive representation learning methods learn a single global representation of an entire image. However, dense contrastive representation learning methods
such as DenseCL (Wang et al., 2021) can learn better representations for tasks
requiring stronger spatial localization of features, such as multi-label classification,
detection, and segmentation. In this work, we study how to improve the quality
of the representations learned by DenseCL by modifying the training scheme and
objective function, and propose DenseCL++. We also conduct several ablation
studies to better understand the effects of: (i) various techniques to form dense
negative pairs among augmentations of different images, (ii) cross-view dense
negative and positive pairs, and (iii) an auxiliary reconstruction task. Our results
show 3.5% and 4% mAP improvement over SimCLR (Chen et al., 2020a) and
DenseCL in COCO multi-label classification. In COCO and VOC segmentation
tasks, we achieve 1.8% and 0.7% mIoU improvements over SimCLR, respectively.
1 Introduction
Self-supervised learning aims to learn representations from unlabeled data via pretext task training.
Contrastive learning, as a self-supervised learning technique, performs the pretext task of instance
discrimination (Dosovitskiy et al., 2014; Wu et al., 2018; Chen et al., 2020a). Instance discrimination
in contrastive learning usually trains a single global representation, and these representations are
principally evaluated in terms of downstream performance on a single-label classification task. However,
methods that perform well in this setting may perform suboptimally on multi-label classification tasks,
where each label is associated with a distinct object in an image and different image regions contain
different semantic content. Motivated by the important application of multi-label classification in
industry, we target representation learning for this task. We demonstrate that our approach also
improves accuracy on dense downstream tasks such as segmentation.
Our work is inspired by DenseCL (Wang et al., 2021), which proposes to use dense features rather
than global ones in contrastive learning to improve the performance in dense prediction tasks. We
focus on further boosting the performance of dense contrastive learning by modifying the training
scheme and the objective function. Unlike DenseCL, our proposed approach formulates negative
pairs between the dense features of augmented views of different images and uses their similarities
in the proposed dense contrastive loss scheme. We show that the proposed method outperforms
DenseCL in various settings. We also conduct several ablation studies to better understand the effects
of: (i) various methods to form dense negative pairs among augmentations of different images, (ii)
cross-view dense negative and positive pairs, and (iii) an auxiliary reconstruction task.
Correspondence to berki2@illinois.edu and maryamkhademi@google.com
Work done during an internship at Google
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.05063v2 [cs.CV] 10 Jan 2023
[Figure 1 diagram: a Transformer encoder followed by a global projection head and a dense projection head.]
Figure 1: DenseCL++ training scheme. Global and dense positive/negative correspondences are used in the
global (top row) and dense (bottom row) loss functions, respectively.
2 Related Work
SimCLR (Chen et al., 2020a) proposes a simple contrastive learning framework in which the projected
representations of randomly augmented views of the same image are attracted to each other using a
contrastive loss. DenseCL (Wang et al., 2021) proposes an extension of this framework better suited
to dense prediction tasks. In DenseCL, the contrastive loss is applied in a dense pairwise manner,
which improves performance compared to global representation learning counterparts (Chen et al., 2020b).
On the other hand, following the widespread success of the transformer architecture (Vaswani et al.,
2017) in NLP tasks, the Vision Transformer (ViT) (Dosovitskiy et al., 2020) adapts the architecture to
visual tasks and achieves impressive results when pretrained on a sufficient amount of data. Inspired
by Devlin et al. (2018), Li et al. (2021) explore the idea of introducing a reconstruction task into
the contrastive learning framework using ViT as an encoder. Wang et al. (2022) further study the
use of reconstruction as a pretext task, incorporating a decoder module in various self-supervised
contrastive settings. Both methods use shallow convolutional networks as decoders so that additional
useful local features are preferably learned in the latent space. However, the authors suggest that
using sophisticated reconstruction models may be harmful to transfer tasks, as they could lead to
excessively local representations.
3 Method
3.1 Dense Contrastive Learning
Contrastive learning learns latent representations of signals for which the positive correspondences
are attracted to each other and negative ones are repelled from one another. Dense contrastive learning
(Wang et al., 2021) further adapts this framework for dense prediction tasks by replacing the global
representations with their dense counterparts. For each image, instead of the global representation
$v \in \mathbb{R}^D$, $S \times S$ many dense feature vectors $z \in \mathbb{R}^L$ are extracted.
Then, dense positive pairs are formed between dense features of the anchor view $x_i$ and its
corresponding augmented view $x_{j^+}$ by finding the most similar correspondence for each dense
vector of $x_i$ in $x_{j^+}$ as $k^+ = \arg\max_l \, \mathrm{sim}(z^{(i)}_k, z^{(j^+)}_l)$, where
$z^{(i)}_k$ is the $k$-th dense feature of the anchor view $x_i$, $z^{(j^+)}_l$ is the $l$-th dense
feature of $x_{j^+}$, and $\mathrm{sim}(a, b)$ calculates the cosine similarity between two feature
vectors. Dense negative pairs are formed between the dense feature vectors of the anchor view and
the global representations of views from other images. The dense contrastive loss is computed as
$$\mathcal{L}_{i,d} = -\sum_k \log \frac{\exp\left(z^{(i)}_k \cdot z^{(j^+)}_{k^+} / \tau\right)}{\exp\left(z^{(i)}_k \cdot z^{(j^+)}_{k^+} / \tau\right) + \sum_j \exp\left(z^{(i)}_k \cdot v_j / \tau\right)} \qquad (1)$$
where $z^{(j^+)}_{k^+}$ is the positive dense correspondence for the dense feature vector $z^{(i)}_k$
in the view $x_{j^+}$, $v_j$ is the global feature for the image $x_j$, and $\tau$ is the temperature
parameter.
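To make the pairing concrete, below is a minimal PyTorch sketch of Eq. (1). The function name, tensor shapes, the default temperature, and the use of matrix products over L2-normalized features are our assumptions for illustration, not code from DenseCL or this paper.

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(z_i, z_jp, v_neg, tau=0.2):
    """Sketch of the dense InfoNCE loss (Eq. 1) for one anchor view.

    z_i:   (K, L) dense features z^(i) of the anchor view x_i, K = S*S
    z_jp:  (K, L) dense features z^(j+) of the augmented view x_{j+}
    v_neg: (N, L) global features v_j of views from other images
    tau:   temperature (the default value here is illustrative)
    """
    # L2-normalize so that dot products equal cosine similarities.
    z_i = F.normalize(z_i, dim=1)
    z_jp = F.normalize(z_jp, dim=1)
    v_neg = F.normalize(v_neg, dim=1)

    # k+ = argmax_l sim(z_k^(i), z_l^(j+)): the most similar dense vector
    # in the augmented view becomes the positive for each anchor vector.
    sim = z_i @ z_jp.t()                         # (K, K) cosine similarities
    k_plus = sim.argmax(dim=1)                   # (K,) index of each positive

    pos = (z_i * z_jp[k_plus]).sum(dim=1) / tau  # (K,)   positive logits
    neg = (z_i @ v_neg.t()) / tau                # (K, N) negative logits

    # -log softmax with the positive in column 0; cross_entropy averages
    # over the K dense features, whereas Eq. (1) writes a sum over k.
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)
    target = z_i.new_zeros(z_i.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)
```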
The overall loss is a linear combination of the global InfoNCE loss term $\mathcal{L}_{i,g}$ (Oord
et al., 2018) and the dense loss, $\mathcal{L}_i = (1 - \lambda)\mathcal{L}_{i,g} + \lambda \mathcal{L}_{i,d}$,
where $\lambda \in [0, 1]$ is a weight constant.
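Combining the two terms is then a single convex mixture; a small sketch continuing the one above, where the default weight is illustrative and not necessarily the value used in the experiments:

```python
def overall_loss(loss_global, loss_dense, lam=0.5):
    # L_i = (1 - lambda) * L_{i,g} + lambda * L_{i,d}, with lambda in [0, 1].
    # lam = 0.5 is an illustrative default, not the paper's stated setting.
    return (1.0 - lam) * loss_global + lam * loss_dense
```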