Image-Text Retrieval with Binary and Continuous
Label Supervision
Zheng Li, Caili Guo, Senior Member, IEEE, Zerun Feng, Jenq-Neng Hwang, Fellow, IEEE, Ying Jin, Yufeng Zhang
Abstract—Most image-text retrieval work adopts binary labels
indicating whether a pair of image and text matches or not. Such
a binary indicator covers only a limited subset of image-text
semantic relations, which is insufficient to represent relevance
degrees between images and texts described by continuous labels
such as image captions. The visual-semantic embedding space
obtained by learning binary labels is incoherent and cannot
fully characterize the relevance degrees. In addition to the use
of binary labels, this paper further incorporates continuous
pseudo labels (generally approximated by text similarity between
captions) to indicate the relevance degrees. To learn a coherent
embedding space, we propose an image-text retrieval framework
with Binary and Continuous Label Supervision (BCLS), where
binary labels are used to guide the retrieval model to learn limited
binary correlations, and continuous labels are complementary to
the learning of image-text semantic relations. For the learning
of binary labels, we improve the common Triplet ranking loss
with Soft Negative mining (Triplet-SN) to improve convergence.
For the learning of continuous labels, we design Kendall ranking
loss inspired by Kendall rank correlation coefficient (Kendall τ),
which improves the correlation between the similarity scores
predicted by the retrieval model and the continuous labels. To
mitigate the noise introduced by the continuous pseudo labels, we
further design Sliding Window sampling and Hard Sample mining
strategy (SW-HS) to alleviate the impact of noise and reduce the
complexity of our framework to the same order of magnitude as
the triplet ranking loss. Extensive experiments on two image-text
retrieval benchmarks demonstrate that our method can improve
the performance of state-of-the-art image-text retrieval models.
We conduct an objective and fair comparison of existing retrieval
methods with continuous label supervision based on the ECCV
Caption dataset, which provides semantic associations for more
image-text pairs. The experimental results further demonstrate
that our method can better learn continuous semantic relations.
Index Terms—Image-text retrieval, deep metric learning, binary
label, continuous pseudo label, Kendall rank correlation
coefficient.
I. INTRODUCTION
Image-text retrieval is formulated as retrieving relevant
samples across the different image and text modalities [1]–[12].
In the case of image-to-text retrieval, given a query image, the
goal is to find the most relevant caption from the text gallery.
On the other hand, text-to-image retrieval starts with a query
text, and the goal is to find the most relevant image from the
image gallery. Compared with unimodal image retrieval,
image-text retrieval is more challenging due to the
heterogeneous gap between image and text. A dominant approach
to deal with the above challenge is to learn a shared
visual-semantic embedding space, where the distance between
the embedding vectors of related image and text is minimized.
Caili Guo, Zheng Li and Zerun Feng are with School of
Information and Communication Engineering, Beijing University
of Posts and Telecommunications, Beijing 100876, China (e-mail:
guocaili@bupt.edu.cn; lizhengzachary@bupt.edu.cn;
fengzerun@bupt.edu.cn).
Jenq-Neng Hwang and Ying Jin are with the Department of
Electrical Engineering, University of Washington, Seattle, WA
98105 USA (e-mail: hwang@uw.edu; jinying@uw.edu).
Yufeng Zhang is with China Telecom Dict Application Capability
Center, Beijing, China (e-mail: zyf68@chinatelecom.cn).
[Figure 1: query "People standing outside of a building." with
five candidates. Binary labels (instance-based): True, False,
False, False, False. Continuous pseudo labels (semantic-based):
1, 0.87, 0.59, 0.15, -0.01.]
Fig. 1. Instance-based retrieval adopts a binary label indicating whether a
pair of image and text match or not. However, the relevance degrees between
the samples marked as False and the query are different, and the binary label
cannot reflect this relevance. In semantic-based retrieval, continuous labels
are used to model the relevance degrees between queries and candidates. The
continuous pseudo labels in the figure are approximated by calculating text
similarity using Sentence-BERT [13]. This allows multiple candidates to be
considered relevant to a query and provides a way of ranking candidates from
most to least similar.
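As a minimal sketch of retrieval in a shared embedding space, the snippet below ranks a small caption gallery by cosine similarity to an image query. The function names and the three-dimensional vectors are hypothetical placeholders; in a real system the vectors would come from trained image and text encoders, which are not shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

# Hypothetical embeddings (assumption): one image query, three captions.
image_query = [0.9, 0.1, 0.0]
caption_gallery = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [1.0, 0.0, 0.0],  # matching caption
    [0.7, 0.7, 0.0],  # partially related caption
]
ranking = rank_candidates(image_query, caption_gallery)
print(ranking)
```

Text-to-image retrieval uses the same ranking with the roles of query and gallery swapped.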
According to the assumption of the relevance between query
and candidate, image-text retrieval can be mainly divided into
two categories, instance-based and semantic-based, as shown
in TABLE I. Most image-text retrieval work [1]–[6] focuses
on instance-based retrieval. As shown in Fig. 1, instance-based
retrieval adopts a binary label indicating whether a pair of
image and text match or not. Widely used image-text retrieval
datasets such as Flickr30K [14] and MS-COCO [15] provide
manually annotated binary labels. Such a binary indicator
covers only a limited subset of image-text semantic relations,
which is insufficient to represent relevance degrees between
images and texts described by continuous labels such as image
captions. With binary label supervision, a query sentence is
relevant to only one image. In fact, however, there may be
multiple candidates related to the query that are arbitrarily
classified as irrelevant. This is inconsistent with the user
experience of a retrieval system. In practice, we hope that the
retrieval system can return multiple results related to the
query and rank them by relevance degree, since human judgment
of relevance is not simply a binary choice between relevant and
irrelevant. Loss functions designed based on the above
assumptions, such as the most widely used Triplet ranking loss
with Hard Negative mining (Triplet-HN) [1], cannot guide the
model to learn a coherent visual-semantic embedding space.
Moreover, Triplet-HN only optimizes the hardest negative
samples, which makes model training slow to converge and
makes optimization difficult.
arXiv:2210.11319v1 [cs.CV] 20 Oct 2022
TABLE I
IMAGE-TEXT RETRIEVAL CLASSIFICATION.
                        Instance-based              Semantic-based
Label                   Binary label                Continuous pseudo label
Label source            Human annotation            Text similarity approximation
Advantages              Binary labels are accurate  Continuous pseudo labels can
                        and do not introduce        represent continuous
                        training noise              correlations between images
                                                    and texts
Disadvantages           Binary labels cannot        Continuous pseudo labels are
                        adequately represent the    approximated, not completely
                        correlation between images  accurate, and will introduce
                        and texts                   training noise
Optimization objective  Triplet-HN [1]              Ladder loss [16], [17],
                                                    SAM loss [18]
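To make the behaviour of Triplet-HN concrete, the following sketch computes a hinge triplet loss that keeps only the hardest in-batch negative in each retrieval direction, in the spirit of [1]. The similarity matrix, the margin value of 0.2, and the function name are illustrative assumptions rather than settings taken from this paper.

```python
def triplet_hn_loss(sim, margin=0.2):
    """Hinge triplet loss over the hardest in-batch negatives.

    sim[i][j] is the similarity of image i and caption j, and sim[i][i] is
    the matched (positive) pair. Requires at least two pairs in the batch.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest negative caption for image i (image-to-text direction).
        hardest_caption = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image for caption i (text-to-image direction).
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hardest_caption)
        total += max(0.0, margin - pos + hardest_image)
    return total

# Hypothetical batch of three image-caption pairs (assumption).
sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.1, 0.4, 0.6],
]
loss = triplet_hn_loss(sim, margin=0.2)
print(loss)
```

Only one negative per query and direction contributes to the loss, which is one way to see why all other negatives receive no training signal in a given step.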
In semantic-based retrieval, continuous labels are used to
indicate the relevance degrees between queries and candidates,
as Fig. 1 shows. It allows multiple candidates to be considered
relevant to a query and provides a way of ranking candidates
from most to least similar [19]. TABLE I compares the
advantages and disadvantages of the two types of retrieval
methods. Compared with instance-based retrieval, semantic-
based retrieval is more in line with the actual user experience.
Although semantic-based retrieval is preferable, continuous
labels are difficult to obtain. The ideal ground truth for
the continuous label is human annotation, but it is infeasible
to annotate pairwise relevance degrees for all image-text pairs
in a dataset. Recently, some studies [16], [18], [19] propose
that the semantic similarities between texts can be used to
approximate the relevance degrees between images and texts as
the continuous pseudo labels. Wray et al. [19] propose several
proxies to
estimate relevance degrees. Zhou et al. [16], [17] propose to
measure the relevance degrees by BERT [20] and design a
ladder loss to learn a coherent embedding space. Biten et al.
[18] use image captioning evaluation metrics to approximate
the relevance degree, and design a Semantic Adaptive Margin
(SAM) loss for semantic-based retrieval. Additionally, these
studies also propose to use normalized Discounted Cumulative
Gain (nDCG) [19], Coherent Score [16], [17] and Normalized
Cumulative Semantic Score [18] to evaluate the performance
of semantic-based retrieval methods. Compared with binary
labels, continuous pseudo labels can represent the continuous
correlation between images and texts. But since pseudo labels
are calculated approximately, they are not completely accurate,
which will introduce noise into training.
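As an illustration of how a graded metric such as nDCG rewards rankings that follow continuous relevance, here is a small sketch. It uses a linear gain and a log2 discount, which is one common variant and an assumption here; the cited works may define the gain differently. The relevance values reuse the pseudo labels from Fig. 1, with the slightly negative value clipped to zero for this example.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(scores, relevances):
    """nDCG of the ranking induced by `scores` against graded `relevances`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Pseudo labels from Fig. 1, clipped to be non-negative (assumption).
labels = [1.0, 0.87, 0.59, 0.15, 0.0]
good_scores = [5, 4, 3, 2, 1]  # ranks candidates in label order
bad_scores = [1, 2, 3, 4, 5]   # ranks candidates in reverse order
print(ndcg(good_scores, labels), ndcg(bad_scores, labels))
```

Because the metric is computed against the pseudo labels themselves, it measures agreement with those labels, including any noise they contain.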
Existing semantic-based retrieval work has made great
progress, but there are still the following problems:
• Existing semantic-based retrieval methods learn continuous
  pseudo labels by hierarchically embedding samples with
  different relevance degrees [16], [17], or by adjusting the
  margin of the triplet loss according to pseudo labels [18].
  These methods require manual selection of appropriate
  hyper-parameters and cannot be flexibly applied to different
  data and retrieval models.
• Continuous pseudo labels approximated by text similarity
  are not completely accurate, which will introduce noise
  into model training. Existing methods ignore the negative
  effects of inaccurate pseudo labels.
• Existing evaluation metrics of semantic-based retrieval
  can only reflect the fit of the retrieval model to
  inaccurate pseudo labels, and thus cannot objectively
  reflect the retrieval performance.
Using binary or continuous labels alone has its own
shortcomings. This paper proposes an image-text retrieval
framework with Binary and Continuous Label Supervision (BCLS)
to learn a coherent visual-semantic embedding space. For the
learning of binary labels, we improve the common Triplet
ranking loss with Soft Negative mining (Triplet-SN) to improve
convergence. For the learning of continuous labels, we design
a Kendall ranking loss inspired by the Kendall rank correlation
coefficient (Kendall τ). In statistics, Kendall τ is a statistic
used to measure the ordinal association between two measured
quantities. This loss function improves the correlation between
the similarity scores predicted by the retrieval model and
the continuous pseudo labels by optimizing the discordant
pairs in the ranking results, and it can be flexibly applied to
various data and retrieval models without complex parameter
settings. To address the training noise introduced by pseudo
labels, we further design a Sliding Window sampling and Hard
Sample mining strategy (SW-HS) to alleviate the impact of
noise and reduce the complexity of our framework to the
same order of magnitude as the common triplet ranking
loss. To address the evaluation problem of semantic-based
retrieval, we conduct an objective and fair comparison of
existing semantic-based retrieval methods with the help of the
Extended COCO Validation (ECCV) Caption dataset [21]. This
dataset leverages machine and human annotations to provide
semantic associations for more image-text pairs. The major
contributions of this paper are summarized as follows:
• A novel image-text retrieval framework with Binary and
  Continuous Label Supervision (BCLS) is proposed to
  guide retrieval models to learn a coherent visual-semantic
  embedding space. The framework combines the advantages
  of binary and continuous labels and makes targeted
  improvements for the problems existing in the two types
  of label learning.
• A Sliding Window sampling and Hard Sample mining
  strategy (SW-HS) is designed to mitigate the negative
  effects of continuous pseudo label inaccuracy and reduce
  the complexity of framework with BCLS to the same
  order of magnitude as the common triplet ranking loss.
• To address the shortcomings of performance evaluation
  for semantic-based retrieval, we conduct an objective
  and fair comparison of existing semantic-based retrieval
  methods with the help of the ECCV Caption dataset,
  which provides semantic associations for more image-text
  pairs. The experimental results demonstrate that our
  method can better learn continuous semantic relations.
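The Kendall τ statistic mentioned in the introduction can be computed by counting concordant and discordant pairs. The sketch below implements the simple tau-a form, in which tied pairs count as neither; the example scores are hypothetical, and the pseudo labels are the ones shown in Fig. 1.

```python
def kendall_tau(pred, target):
    """Kendall tau-a between two equal-length score lists (needs n >= 2)."""
    n = len(pred)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (pred[i] - pred[j]) * (target[i] - target[j])
            if sign > 0:
                concordant += 1   # both lists order the pair the same way
            elif sign < 0:
                discordant += 1   # the two lists disagree on the order
    return (concordant - discordant) / (n * (n - 1) / 2)

pseudo_labels = [1.0, 0.87, 0.59, 0.15, -0.01]  # values from Fig. 1
model_scores = [0.8, 0.5, 0.6, 0.2, 0.1]        # hypothetical predictions
tau = kendall_tau(model_scores, pseudo_labels)
print(tau)
```

In this example only the second and third candidates are ordered differently by the two lists, so nine of the ten pairs are concordant and one is discordant.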
II. RELATED WORK
A. Instance-based Image-Text Retrieval
The image-text retrieval task, either image-to-text or
text-to-image, is formulated as retrieving relevant samples
across the different image and text modalities [1]–[12].
According to
the assumption of the relevance between query and candidate,
image-text retrieval can be mainly divided into two categories,
instance-based and semantic-based. Most image-text retrieval
studies [1]–[6] focus on instance-based retrieval. A variety
of methods have been devoted to learning modality invariant
features. More specifically, Wang et al. [6] propose a position
focused attention network to investigate the relation between
the visual and the textual views for image-text retrieval. In
recent years, multi-modal pre-training models [22]–[29] have
been intensively explored to bridge image and text. The
paradigm of vision-language pre-training is to design pre-
training tasks on large-scale vision-language data for pre-
training and then finetune the model on specific downstream
tasks. The above methods learn advanced encoding networks
to generate richer semantic representations for different
modalities. The framework with BCLS proposed in this paper is
independent of image and text feature representation and
similarity calculation. It can be applied in a plug-and-play
manner to existing instance-based retrieval models.
In addition to the work on the feature representation and
similarity calculation of images and text, a variety of deep
metric learning methods have been proposed in instance-based
image-text retrieval [1], [30]–[32]. A hinge-based triplet loss
is widely employed as an objective to enforce aligned pairs
to have a higher similarity score than misaligned pairs by a
margin [33]. Faghri et al. [1] incorporate hard negatives in
the ranking loss function, which yields significant gains in
retrieval performance. Several studies [30], [31], [34] propose
weighting metric learning frameworks for image-text retrieval,
which can further improve retrieval performance.
These loss functions for instance-based image-text retrieval
adopt binary labels to indicate whether a pair of image and text
match or not, which is not sufficient to represent the relevance
degree between image and text. Using a binary label based
loss function to train a model will destroy the coherence of
the visual-semantic embedding space, making it difficult for
the model to learn continuous semantic relations.
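For reference, the hinge-based triplet objective described above can be written as follows for a matched image-caption pair $(i, c)$. The symbols are notational assumptions introduced here: $s(\cdot,\cdot)$ is the similarity score, $\alpha$ the margin, $\hat{c}$ and $\hat{i}$ range over negative captions and images, and $[x]_+ = \max(x, 0)$.

```latex
\ell_{\mathrm{sum}}(i, c) =
  \sum_{\hat{c}} \bigl[\alpha - s(i, c) + s(i, \hat{c})\bigr]_+
  + \sum_{\hat{i}} \bigl[\alpha - s(i, c) + s(\hat{i}, c)\bigr]_+

\ell_{\mathrm{HN}}(i, c) =
  \max_{\hat{c}} \bigl[\alpha - s(i, c) + s(i, \hat{c})\bigr]_+
  + \max_{\hat{i}} \bigl[\alpha - s(i, c) + s(\hat{i}, c)\bigr]_+
```

The hard-negative variant of [1] corresponds to the second form, where each sum is replaced by a maximum. In both forms every negative is treated identically regardless of how related it is to the query, which is the binary-label limitation discussed above.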
B. Semantic-based Image-Text Retrieval
While most work focuses on instance-based retrieval, a few
studies have explored semantic-based retrieval. Some studies
propose that the semantic similarity between captions can
be used to approximate the relevance degree between image
and text [16], [19]. Wray et al. [19] propose several proxies
to estimate relevance degrees. Biten et al. [18] use image
captioning evaluation metrics, i.e., Consensus-based Image
Description Evaluation (CIDEr) [35] and Semantic Propositional
Image Caption Evaluation (SPICE) [36], to approximate
the relevance degree, and design a semantic adaptive margin
(SAM) loss for semantic-based retrieval. SAM loss is a variant
of triplet loss, where candidates are pushed away from the
query by semantic adaptive margins in the embedding space.
The adjustment range of the margin in SAM loss will be
affected by the dataset, retrieval model, and pseudo label
calculation method. It needs to be carefully adjusted manually
and cannot be flexibly applied to different data and models.
Zhou et al. [16], [17] propose to measure the relevance degrees
by BERT [20] and design a ladder loss to learn a coherent
embedding space. In the ladder loss, the relevance degrees
are artificially divided into several levels, and a large number
of hyper-parameters are introduced. Moreover, pseudo labels
approximated by text similarity are not completely accurate,
and existing methods ignore the negative effects of inaccurate
pseudo labels. The Kendall ranking loss proposed in this paper
will solve the problems existing in the current semantic-based
retrieval methods.
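This section does not give the exact form of the Kendall ranking loss, so the following is only a generic sketch of how a count of discordant pairs can be relaxed into a smooth quantity. It replaces the hard "wrong order" indicator with a sigmoid; the function names, the temperature value, and the averaging are assumptions for illustration and should not be read as the loss proposed in this paper. Scores are assumed to lie in a bounded range such as [-1, 1].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_discordance(pred, target, temperature=0.1):
    """Mean sigmoid-relaxed discordance over pairs with distinct targets."""
    total, pairs = 0.0, 0
    n = len(pred)
    for i in range(n):
        for j in range(n):
            if target[i] > target[j]:
                # The pair is concordant when pred[i] > pred[j];
                # the sigmoid approaches 1 when the order is reversed.
                total += sigmoid((pred[j] - pred[i]) / temperature)
                pairs += 1
    return total / pairs if pairs else 0.0

target = [1.0, 0.5, 0.0]  # hypothetical pseudo labels
well_ordered = soft_discordance([0.9, 0.5, 0.1], target)
reversed_order = soft_discordance([0.1, 0.5, 0.9], target)
print(well_ordered, reversed_order)
```

A relaxation of this kind depends only on the relative order of the labels, so it needs no per-level margins or thresholds; the temperature is the single smoothing parameter in this sketch.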
In addition to methodological problems, performance
evaluation of semantic-based retrieval methods also has
shortcomings. Existing evaluation metrics of semantic-based
retrieval can only reflect the fit of the retrieval model to
inaccurate pseudo labels, which cannot objectively reflect the
retrieval performance. This paper will make up for the existing
shortcomings in the performance evaluation of semantic-based
retrieval.
C. Deep Metric Learning
The main work of this paper belongs to the field of deep
metric learning. Deep metric learning aims to construct an
embedding space to reflect the semantic distances among
instances. It has many other applications such as face
recognition [37] and image retrieval [38]. Contrastive loss [39]
and triplet loss [40] are two representative pairwise approaches
in deep metric learning. The contrastive loss aims to push
misaligned pairs apart by a fixed margin and to pull aligned
pairs as close as possible. In contrast, the triplet loss only
aims to force the similarity of a positive pair to be higher
than that of a negative one by a margin, and therefore enjoys
more flexibility. To address potentially slow convergence and
unstable performance, recent work has proposed several
variants. N-pair loss [41] employs multiple negatives for each
positive sample. However, the above methods are all applied
to unimodal image retrieval, where relevance degrees of the
instances can be clearly defined as a binary variable. These
loss functions can only guide the model to map images of
the same class to relatively close locations and images of
different classes to distant locations, and cannot be used to
learn continuous semantic relationships between images and
text.
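The difference between the two pairwise objectives can be seen in a small sketch. It is written in terms of distances rather than similarities, drops constant factors, and uses illustrative margin values; all of these are assumptions made for the example.

```python
def contrastive_loss(distance, is_match, margin=1.0):
    """Pull matched pairs together; push mismatched pairs beyond `margin`."""
    if is_match:
        return distance ** 2
    return max(0.0, margin - distance) ** 2

def triplet_loss(d_pos, d_neg, margin=0.2):
    """Require the negative to be at least `margin` farther than the positive."""
    return max(0.0, d_pos - d_neg + margin)

# A matched pair at distance 0.5 is still penalized by the contrastive loss,
# while the triplet loss is already satisfied once the negative is far enough.
matched_penalty = contrastive_loss(0.5, is_match=True)
satisfied_triplet = triplet_loss(d_pos=0.5, d_neg=1.0)
violated_triplet = triplet_loss(d_pos=0.5, d_neg=0.6)
print(matched_penalty, satisfied_triplet, violated_triplet)
```

The triplet form constrains only the relative order of the positive and the negative, which is the added flexibility noted above.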
Recently, some methods [42], [43] for directly optimizing
evaluation metrics such as average precision (AP) have been
proposed. Cakir et al. [42] propose FastAP to optimize AP
using a soft histogram binning technique. Brown et al. [43],
on the other hand, optimize a smoothed approximation of AP,
called Smooth-AP. Direct optimization of evaluation metrics
looks at more samples from the retrieval set and has been
proven to improve training efficiency and performance [43].
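To illustrate the idea of smoothing a ranking metric, the sketch below relaxes average precision for a single query by replacing the hard rank indicator with a sigmoid, following the general idea behind Smooth-AP [43]. The function name, the temperature of 0.01, and the toy scores are assumptions for this example, and the code is not the reference implementation of [43].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smooth_ap(scores, is_positive, temperature=0.01):
    """Sigmoid-relaxed average precision for one query."""
    n = len(scores)
    positives = [i for i in range(n) if is_positive[i]]
    if not positives:
        return 0.0
    total = 0.0
    for i in positives:
        # Soft rank of item i among the positives and among all items.
        rank_pos = 1.0 + sum(sigmoid((scores[j] - scores[i]) / temperature)
                             for j in positives if j != i)
        rank_all = 1.0 + sum(sigmoid((scores[j] - scores[i]) / temperature)
                             for j in range(n) if j != i)
        total += rank_pos / rank_all
    return total / len(positives)

scores = [0.9, 0.8, 0.3, 0.1]  # hypothetical similarity scores
ap_top = smooth_ap(scores, [True, True, False, False])
ap_bottom = smooth_ap(scores, [False, False, True, True])
print(ap_top, ap_bottom)
```

With a small temperature the relaxed value stays close to the exact AP: about 1 when the positives are ranked first and about 5/12 when they are ranked last in this toy case.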