text pairs. The experimental results demonstrate that our
method can better learn continuous semantic relations.
II. RELATED WORK
A. Instance-based Image-Text Retrieval
The image-text retrieval task, either image-to-text or text-to-image, is formulated as retrieving relevant samples across the image and text modalities [1]–[12]. Depending on how the relevance between a query and a candidate is defined, image-text retrieval methods can be divided into two main categories: instance-based and semantic-based. Most image-text retrieval studies [1]–[6] focus on instance-based retrieval, and a variety of methods have been devoted to learning modality-invariant
features. For example, Wang et al. [6] propose a position-focused attention network to investigate the relation between
the visual and the textual views for image-text retrieval. In
recent years, multi-modal pre-training models [22]–[29] have
been intensively explored to bridge image and text. The
paradigm of vision-language pre-training is to design pre-training tasks on large-scale vision-language data and then fine-tune the model on specific downstream tasks. The above methods learn advanced encoding networks
to generate richer semantic representations for different modal-
ities. The framework with BCLS proposed in this paper is independent of the image and text feature representations and the similarity calculation, so it can be applied to existing instance-based retrieval models in a plug-and-play manner.
In addition to the work on the feature representation and
similarity calculation of images and text, a variety of deep
metric learning methods have been proposed in instance-based
image-text retrieval [1], [30]–[32]. A hinge-based triplet loss
is widely employed as an objective to enforce aligned pairs
to have a higher similarity score than misaligned pairs by a
margin [33]. Faghri et al. [1] incorporate hard negatives in
the ranking loss function, which yields significant gains in
retrieval performance. Several studies [30], [31], [34] propose weighting-based metric learning frameworks for image-text retrieval, which yield further performance gains.
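For reference, writing s(i, c) for the similarity between image i and caption c, [x]_+ = max(x, 0) for the hinge, and α for a fixed margin, the hinge-based triplet loss [33] and the hardest-negative variant of [1] are commonly written as
\[
\mathcal{L}_{\mathrm{SH}}(i,c)=\sum_{\hat{c}}\big[\alpha-s(i,c)+s(i,\hat{c})\big]_{+}+\sum_{\hat{i}}\big[\alpha-s(i,c)+s(\hat{i},c)\big]_{+},
\]
\[
\mathcal{L}_{\mathrm{MH}}(i,c)=\max_{\hat{c}}\big[\alpha-s(i,c)+s(i,\hat{c})\big]_{+}+\max_{\hat{i}}\big[\alpha-s(i,c)+s(\hat{i},c)\big]_{+},
\]
where \hat{c} and \hat{i} range over the negative captions and images in a mini-batch.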
These loss functions for instance-based image-text retrieval adopt binary labels that indicate only whether an image and a text match, which is insufficient to represent the degree of relevance between them. Training a model with such a binary-label-based loss function destroys the coherence of the visual-semantic embedding space, making it difficult for the model to learn continuous semantic relations.
B. Semantic-based Image-Text Retrieval
While most work focuses on instance-based retrieval, a few
studies have explored semantic-based retrieval. Some studies
propose that the semantic similarity between captions can
be used to approximate the relevance degree between image
and text [16], [19]. Wray et al. [19] propose several proxies
to estimate relevance degrees. Biten et al. [18] use image
captioning evaluation metrics, i.e., Consensus-based Image
Description Evaluation (CIDEr) [35] and Semantic Proposi-
tional Image Caption Evaluation (SPICE) [36], to approximate
the relevance degree, and design a semantic adaptive margin
(SAM) loss for semantic-based retrieval. The SAM loss is a variant of the triplet loss in which candidates are pushed away from the query by semantic adaptive margins in the embedding space.
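Schematically (this is an illustrative form, not necessarily the exact formulation of [18]), such a loss replaces the fixed margin of the triplet loss with one that grows as the pseudo relevance of a candidate decreases:
\[
\mathcal{L}_{\mathrm{SAM}}(q,c,\hat{c})=\big[\mu(c,\hat{c})-s(q,c)+s(q,\hat{c})\big]_{+},\qquad
\mu(c,\hat{c})=\alpha+\lambda\big(1-r(c,\hat{c})\big),
\]
where r(c, \hat{c}) ∈ [0, 1] is the pseudo relevance of candidate \hat{c} to the ground-truth caption c estimated by CIDEr or SPICE, and λ controls the adjustment range of the margin.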
The appropriate range of the margin in the SAM loss depends on the dataset, the retrieval model, and the pseudo-label calculation method; it must be carefully tuned by hand and therefore cannot be flexibly applied to different data and models.
Zhou et al. [16], [17] propose to measure relevance degrees with BERT [20] and design a ladder loss to learn a coherent embedding space. In the ladder loss, the relevance degrees are artificially divided into several levels, which introduces a large number of hyper-parameters. Moreover, pseudo labels
approximated by text similarity are not completely accurate,
and existing methods ignore the negative effects of inaccurate
pseudo labels. The Kendall ranking loss proposed in this paper addresses these problems in current semantic-based retrieval methods.
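To make the hyper-parameter burden of the ladder loss concrete, it can be sketched as follows (a schematic form, not the exact formulation of [16]): with the candidates of a query q partitioned into levels \mathcal{C}_1, \dots, \mathcal{C}_L of decreasing pseudo relevance,
\[
\mathcal{L}_{\mathrm{ladder}}(q)=\sum_{l=1}^{L-1}\beta_{l}\sum_{c\in\mathcal{C}_{\le l}}\sum_{c'\in\mathcal{C}_{>l}}\big[\alpha_{l}-s(q,c)+s(q,c')\big]_{+},
\]
so the number of levels L, the per-level margins α_l, and the per-level weights β_l must all be chosen by hand.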
In addition to these methodological problems, the performance evaluation of semantic-based retrieval methods also has shortcomings. Existing evaluation metrics for semantic-based retrieval only reflect how well the retrieval model fits inaccurate pseudo labels, and therefore cannot objectively reflect retrieval performance. This paper remedies these shortcomings in the performance evaluation of semantic-based retrieval.
C. Deep Metric Learning
The main work of this paper belongs to the field of deep
metric learning. Deep metric learning aims to construct an
embedding space to reflect the semantic distances among
instances. It has many other applications such as face recog-
nition [37] and image retrieval [38]. Contrastive loss [39] and
triplet loss [40] are two representative pairwise approaches
in deep metric learning. Unlike the contrastive loss, which pushes misaligned pairs apart by a fixed margin and pulls aligned pairs as close together as possible, the triplet loss only forces the similarity of a positive pair to be higher than that of a negative pair by a margin, and thus enjoys more flexibility. To address the potentially slow convergence and unstable performance of these losses, recent work has proposed several variants; for example, the N-pair loss [41] employs multiple negatives for each positive sample. However, the above methods are all applied
to unimodal image retrieval, where relevance degrees of the
instances can be clearly defined as a binary variable. These
loss functions can only guide the model to map images of
the same class to relatively close locations and images of
different classes to distant locations, and cannot be used to
learn continuous semantic relationships between images and
text.
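For completeness, and up to constants and sign conventions, the contrastive loss [39] for a pair with embedding distance D and binary label y (y = 1 for an aligned pair), and the N-pair loss [41] for an anchor embedding f with positive f^{+} and negatives \{f_j\}, can be written as
\[
\mathcal{L}_{\mathrm{cont}}=y\,D^{2}+(1-y)\big[m-D\big]_{+}^{2},\qquad
\mathcal{L}_{\text{N-pair}}=\log\Big(1+\sum_{j}\exp\big(f^{\top}f_{j}-f^{\top}f^{+}\big)\Big),
\]
where m is a fixed margin. Both objectives rely only on binary match/non-match supervision, which is precisely why they cannot encode graded relevance.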
Recently, some methods [42], [43] for directly optimizing
evaluation metrics such as average precision (AP) have been
proposed. Cakir et al. [42] propose FastAP to optimize AP
using a soft histogram binning technique. Brown et al. [43],
on the other hand, optimize a smoothed approximation of AP,
called Smooth-AP. Direct optimization of evaluation metrics takes more samples from the retrieval set into account and has been shown to improve training efficiency and performance [43].
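As an illustration, Smooth-AP [43] replaces the non-differentiable indicator in the rank computation with a temperature-scaled sigmoid. Roughly, with \mathcal{P} the set of positives for a query, Ω the whole retrieval set, and D_{ij} = s_j − s_i the difference of similarity scores to the query,
\[
\mathrm{AP}\approx\frac{1}{|\mathcal{P}|}\sum_{i\in\mathcal{P}}
\frac{1+\sum_{j\in\mathcal{P},\,j\neq i}\sigma(D_{ij}/\tau)}
{1+\sum_{j\in\Omega,\,j\neq i}\sigma(D_{ij}/\tau)},
\]
where σ is the sigmoid and the temperature τ controls the tightness of the approximation; see [43] for the exact formulation.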