

Brief Introduction to Contrastive Learning Pretext Tasks for Visual Representation

Zhenyuan Lu

Northeastern University, Jan. 2021

Abstract

Improving visual feature representations learned from photos or videos for practical applications generally requires large-scale human-annotated data for training deep neural networks. However, gathering and annotating such data is expensive. Given the large amount of unlabeled data available in the real world, self-defined pseudo-labels can be introduced as supervision to avoid this cost. Self-supervised learning, specifically contrastive learning, is a subset of unsupervised learning methods that has grown popular in computer vision, natural language processing, and other domains. The purpose of contrastive learning is to embed augmented views of the same sample close to each other while pushing apart views of different samples. In the following sections, we introduce the general formulation of the different learning paradigms. Furthermore, we present some recently published contrastive learning strategies that focus on pretext tasks for visual representation.

1 Introduction

Collecting and annotating large-scale datasets is time-consuming and costly. To avoid this annotation cost, a number of self-supervised learning methods have recently been developed to learn visual representations from massive collections of unlabeled photos or videos, without any human annotation. One common way of learning such visual representations is to propose a pretext task for the neural network to solve. Here, we focus on pretext tasks built on contrastive learning.

Consider Robert Epstein's experiment, in which participants are asked to draw a one-dollar bill in as much detail as possible (Figure 1). The drawing on the left was sketched from memory. The drawing on the right was made while a dollar bill was present, and it is far more precise. As a result, the drawing produced from memory differs significantly from the drawing produced with the target present (Epstein 2016). Regardless of how dissimilar these two drawings are, they share common representations such as Mr. Washington's figure, the one-dollar inscription, and others. Humans can comprehend that the two drawings depict the same target, a one-dollar bill. But what if we let a machine guess whether they come from the same image? This may require a representation learned from positive sample pairs (a drawing and a dollar bill) and negative sample pairs (a random other drawing and a dollar bill). This is the concept of contrastive learning, which has lately been extended into various algorithms.
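The positive/negative pair idea can be made concrete with a small sketch. The text does not name a specific loss at this point, so the InfoNCE-style form below, the cosine similarity measure, the `temperature` value, and the three-dimensional embeddings are all assumptions chosen purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor (an assumed form, not from the text).

    The loss is small when the anchor is more similar to the positive
    than to the negatives, and large otherwise.
    """
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(
        math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives
    )
    return -math.log(pos / (pos + neg))


# Hypothetical embeddings: the bill, a sketch of it, and an unrelated drawing.
dollar_bill = [1.0, 0.2, 0.0]
sketch_of_bill = [0.9, 0.3, 0.1]
other_drawing = [-0.1, 0.1, 1.0]

# Correct pairing: the sketch is the positive, the other drawing the negative.
good = contrastive_loss(dollar_bill, sketch_of_bill, [other_drawing])
# Swapped pairing: the unrelated drawing is (wrongly) treated as the positive.
bad = contrastive_loss(dollar_bill, other_drawing, [sketch_of_bill])
print(good < bad)
```

With these made-up vectors, the correct pairing yields the smaller loss. In practice the embeddings would come from a trained encoder, and training would adjust that encoder to reduce the loss.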

2 Formulations Among Different Learning Paradigms

The distinction between the different learning paradigms is primarily determined by the training labels. There are four types of visual feature learning methods: (1) supervised learning, (2) semi-supervised learning, (3) weakly supervised learning, and (4) unsupervised learning (e.g., contrastive learning).

arXiv:2210.03163v1 [cs.CV] 6 Oct 2022

Figure 1: Left: Drawing of a dollar bill from memory. Right: Drawing subsequently made with a dollar bill present. Image source: Epstein, 2016.

2.1 Supervised Learning

For supervised learning, the model is given a dataset $X = (X_1, X_2, \ldots, X_N)$. This dataset is associated with manually annotated labels $Y_i$. The training loss function is defined as follows:

$$\mathrm{loss}(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(X_i, Y_i)$$

where $D = \{X_i\}_{i=1}^{N}$ is the labeled training data and $\theta$ denotes the model parameters. The advantage of training models with human-annotated labels is that they achieve strong results in a variety of computer vision applications (A. Krizhevsky 2012, R. Girshick 2014, D. Tran 2015, J. Long 2015). However, label annotation is frequently extremely expensive, demanding advanced professional skills and domain expertise. As a result, the other three learning paradigms are increasingly used in place of supervised learning to lower labeling costs.
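As a rough illustration, the supervised objective can be read as averaging a per-sample loss over the N labeled pairs and then minimizing over the model parameters. The sketch below makes several assumptions that are not in the text: a squared-error per-sample loss, a one-parameter model `theta * x`, and a grid search over a few candidate values standing in for gradient-based training.

```python
def squared_error(prediction, label):
    """Per-sample loss; squared error is an assumed choice."""
    return (prediction - label) ** 2


def supervised_loss(model, inputs, labels, per_sample_loss=squared_error):
    """Average per-sample loss over the N labeled pairs (X_i, Y_i)."""
    n = len(inputs)
    total = sum(per_sample_loss(model(x), y) for x, y in zip(inputs, labels))
    return total / n


def make_model(theta):
    """Hypothetical one-parameter model: prediction = theta * x."""
    return lambda x: theta * x


# Toy labeled data in which every label is exactly twice its input.
inputs = [1.0, 2.0, 3.0]
labels = [2.0, 4.0, 6.0]

# The minimization over parameters is approximated by trying a few values.
candidates = [0.0, 1.0, 2.0, 3.0]
best_theta = min(
    candidates,
    key=lambda t: supervised_loss(make_model(t), inputs, labels),
)
print(best_theta)
```

Every term in the sum needs a label, which is why the annotation cost grows with the size of the dataset.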

2.2 Semi-supervised Learning

For semi-supervised learning, the model is given a small labeled dataset $X$ and a large unlabeled dataset $Z$. The labeled dataset $X$ is associated with manually annotated labels $Y_i$. The training loss function is defined as follows:

$$\mathrm{loss}(D_1, D_2) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(X_i, Y_i) + \frac{1}{M} \sum_{i=1}^{M} \mathrm{loss}(Z_i, R(Z_i, X))$$

where $D_1 = \{X_i\}_{i=1}^{N}$ is the labeled training dataset, $D_2 = \{Z_i\}_{i=1}^{M}$ is the unlabeled training dataset, and $\theta$ denotes the model parameters. $R(Z_i, X)$ is a function that represents the relationship between the unlabeled and labeled training datasets.
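The two-term structure of this loss can be sketched as follows. The text leaves the relationship function R unspecified, so the nearest-neighbor pseudo-labeling used here (`nearest_label`) is a hypothetical stand-in, as are the squared-error loss, the scalar inputs, and the threshold model.

```python
def squared_error(prediction, target):
    """Per-sample loss; squared error is an assumed choice."""
    return (prediction - target) ** 2


def nearest_label(z, labeled_inputs, labels):
    """Hypothetical R(Z_i, X): copy the label of the closest labeled input."""
    idx = min(
        range(len(labeled_inputs)),
        key=lambda i: abs(labeled_inputs[i] - z),
    )
    return labels[idx]


def semi_supervised_loss(model, labeled_inputs, labels, unlabeled_inputs):
    """Mean loss on labeled data plus mean loss on pseudo-labeled data."""
    n, m = len(labeled_inputs), len(unlabeled_inputs)
    labeled_term = sum(
        squared_error(model(x), y) for x, y in zip(labeled_inputs, labels)
    ) / n
    unlabeled_term = sum(
        squared_error(model(z), nearest_label(z, labeled_inputs, labels))
        for z in unlabeled_inputs
    ) / m
    return labeled_term + unlabeled_term


# Two labeled points and four unlabeled points on a line (toy data).
labeled_inputs = [0.0, 10.0]
labels = [0.0, 1.0]
unlabeled_inputs = [1.0, 2.0, 8.0, 9.0]


def threshold_model(x):
    """Hypothetical model that splits the line at x = 5."""
    return 1.0 if x > 5.0 else 0.0


loss_value = semi_supervised_loss(
    threshold_model, labeled_inputs, labels, unlabeled_inputs
)
print(loss_value)
```

The second term lets the unlabeled points influence training even though no annotator ever labeled them. How useful that is depends entirely on how well the chosen R reflects the true structure of the data.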

2.3 Weakly Supervised Learning

For weakly supervised learning, a dataset $X = (X_1, X_2, \ldots, X_N)$ is associated with a collection of coarse-grained labels $C_i$. The training loss function is defined as follows:

$$\mathrm{loss}(D) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(X_i, C_i)$$

where $D = \{X_i\}_{i=1}^{N}$ denotes the training dataset and $\theta$ denotes the model parameters. Fine-grained labels are substantially more expensive to obtain than coarse-grained labels. The advantage of weak supervision is therefore that large-scale datasets are relatively easy to gather. For example, image features obtained from web images, with hashtags used as coarse-grained labels, were recently introduced (W. Li 2017, D. Mahajan and Y. Li 2018).
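To illustrate the hashtag case, the sketch below treats each image's hashtags as a coarse multi-label target and averages a binary cross-entropy over a small tag vocabulary. The loss choice, the two-tag vocabulary, and the predicted probabilities are assumptions for illustration and are not taken from the cited works.

```python
import math


def binary_cross_entropy(probability, target):
    """Cross-entropy for one tag; the probability is clamped for stability."""
    eps = 1e-12
    p = min(max(probability, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def weakly_supervised_loss(predicted_probs, hashtag_sets, vocabulary):
    """Average loss(X_i, C_i), where C_i is the hashtag set of image i."""
    total = 0.0
    for probs, tags in zip(predicted_probs, hashtag_sets):
        per_tag = [
            binary_cross_entropy(probs[tag], 1.0 if tag in tags else 0.0)
            for tag in vocabulary
        ]
        total += sum(per_tag) / len(vocabulary)
    return total / len(predicted_probs)


# Hypothetical vocabulary and hashtags scraped alongside two images.
vocabulary = ["dog", "beach"]
hashtag_sets = [{"dog"}, {"beach", "dog"}]

# Made-up model outputs: one confident and correct, one uninformed.
confident = [{"dog": 0.9, "beach": 0.1}, {"dog": 0.9, "beach": 0.9}]
uninformed = [{"dog": 0.5, "beach": 0.5}, {"dog": 0.5, "beach": 0.5}]

confident_loss = weakly_supervised_loss(confident, hashtag_sets, vocabulary)
uninformed_loss = weakly_supervised_loss(uninformed, hashtag_sets, vocabulary)
print(confident_loss < uninformed_loss)
```

Hashtags cost nothing to collect but can be noisy or incomplete, which is the trade-off that weak supervision accepts in exchange for scale.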
