
Brief Introduction to Contrastive Learning Pretext
Tasks for Visual Representation
Zhenyuan Lu
Northeastern University, Jan. 2021
Abstract
To learn strong visual feature representations from photos or videos for
practical applications, deep neural networks generally require large-scale
human-annotated data for training. However, gathering and annotating such
data is expensive. Given that a lot of unlabeled data is available in the
real world, it is possible to use self-defined pseudo labels as supervision
to avoid this cost. Self-supervised learning, specifically
contrastive learning, is a subset of unsupervised learning methods that has grown
popular in computer vision, natural language processing, and other domains. The
purpose of contrastive learning is to embed augmented views of the same
sample close to each other while pushing apart views of different samples.
In the following sections, we introduce the general formulation of the
different learning paradigms. Furthermore, we present some recently
published contrastive learning strategies that focus on pretext tasks for
visual representation.
1 Introduction
Collecting and annotating large-scale datasets is time-consuming and costly. To avoid this
annotation burden, a number of self-supervised learning methods have recently been developed
to learn visual representations from massive collections of unlabeled photos or videos, without
any human annotation. One common way of learning such visual representations is to propose a
pretext task for the neural network to solve. Here, we focus on pretext tasks based on
contrastive learning.
Consider Robert Epstein’s experiment, in which participants are asked to draw a detailed
representation of a one-dollar bill (Figure 1). The drawing on the left was sketched from
memory, while the drawing on the right was made with the dollar bill in view and is precisely
drawn. As a result, the drawing produced from memory differs significantly from the one
produced with the target present (Epstein 2016). Yet however dissimilar the two pictures are,
they share common features such as Mr. Washington’s figure, the one-dollar inscription, and
others. Humans readily understand that both drawings depict the same target, a one-dollar bill.
But what if we ask a machine to decide whether they come from the same image? Doing so
requires a representation learned from a positive sample pair (a drawing and a dollar bill)
and a negative sample pair (a random other drawing and a dollar bill). This is the core idea
of contrastive learning, which has lately been extended into various algorithms.
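To make the positive and negative pairs concrete, the following is a minimal sketch in Python with NumPy. It assumes an InfoNCE-style loss with cosine similarity and a temperature parameter. This is one common choice and is not necessarily the formulation adopted by the methods surveyed here. The three-dimensional embeddings, the variable names, and the functions `cosine_similarity` and `contrastive_loss` are hypothetical and chosen only for illustration; in practice the embeddings would come from a trained encoder network.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor (illustrative assumption).

    The loss is low when the anchor is more similar to the positive
    than to every negative, and high otherwise.
    """
    pos = cosine_similarity(anchor, positive) / temperature
    negs = [cosine_similarity(anchor, n) / temperature for n in negatives]
    logits = np.array([pos] + negs)
    # Numerically stable log-softmax; the positive sits at index 0.
    logits = logits - logits.max()
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_pos)


# Hypothetical toy embeddings standing in for encoder outputs.
dollar_bill = np.array([1.0, 0.0, 0.0])
memory_drawing = np.array([0.9, 0.1, 0.0])   # same target, different view
other_drawing = np.array([0.0, 1.0, 0.0])    # unrelated drawing

# Correct pairing: the memory drawing is the positive.
good = contrastive_loss(dollar_bill, memory_drawing, [other_drawing])
# Wrong pairing: the unrelated drawing is treated as the positive.
bad = contrastive_loss(dollar_bill, other_drawing, [memory_drawing])

print(f"loss with correct positive: {good:.4f}")
print(f"loss with wrong positive:   {bad:.4f}")
```

Under these assumptions, the loss is small when the embedding of the memory drawing lies close to that of the dollar bill and far from the unrelated drawing, and large when the pairing is reversed. Minimizing such a loss over many samples is what pulls positive pairs together and pushes negative pairs apart.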
2 Formulations Among Different Learning Paradigms
The different learning paradigms are distinguished primarily by their training labels. There are four
types of visual feature learning methods: (1) supervised learning, (2) semi-supervised learning, (3)
weakly supervised learning, and (4) unsupervised learning (e.g., contrastive learning).
arXiv:2210.03163v1 [cs.CV] 6 Oct 2022