text pairs. The experimental results demonstrate that our
method can better learn continuous semantic relations.
II. RELATED WORK
A. Instance-based Image-Text Retrieval
The image-text retrieval task, either image-to-text or text-to-image, is formulated as retrieving relevant samples across the image and text modalities [1]–[12]. Depending on how the relevance between a query and a candidate is defined, image-text retrieval methods can be divided into two main categories: instance-based and semantic-based. Most image-text retrieval studies [1]–[6] focus on instance-based retrieval, and a variety of methods have been devoted to learning modality-invariant
features. For example, Wang et al. [6] propose a position-focused attention network to investigate the relation between
the visual and the textual views for image-text retrieval. In
recent years, multi-modal pre-training models [22]–[29] have
been intensively explored to bridge image and text. The
paradigm of vision-language pre-training is to design pre-training tasks on large-scale vision-language data and then fine-tune the model on specific downstream tasks. The above methods learn advanced encoding networks
to generate richer semantic representations for different modal-
ities. The framework with BCLS proposed in this paper is independent of the image and text feature representations and the similarity calculation, so it can be applied to existing instance-based retrieval models in a plug-and-play manner.
In addition to the work on the feature representation and
similarity calculation of images and text, a variety of deep
metric learning methods have been proposed in instance-based
image-text retrieval [1], [30]–[32]. A hinge-based triplet loss
is widely employed as an objective to enforce aligned pairs
to have a higher similarity score than misaligned pairs by a
margin [33]. Faghri et al. [1] incorporate hard negatives in
the ranking loss function, which yields significant gains in
retrieval performance. Several studies [30], [31], [34] propose weighting-based metric learning frameworks for image-text retrieval, which yield further performance gains.
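For reference, writing s(i, c) for the similarity between image i and caption c, [x]_+ = max(x, 0) for the hinge, and α for a fixed margin, the hinge-based triplet loss [33] and the hardest-negative variant of [1] are commonly written as
\[
\mathcal{L}_{\mathrm{SH}}(i,c)=\sum_{\hat{c}}\big[\alpha-s(i,c)+s(i,\hat{c})\big]_{+}+\sum_{\hat{i}}\big[\alpha-s(i,c)+s(\hat{i},c)\big]_{+},
\]
\[
\mathcal{L}_{\mathrm{MH}}(i,c)=\max_{\hat{c}}\big[\alpha-s(i,c)+s(i,\hat{c})\big]_{+}+\max_{\hat{i}}\big[\alpha-s(i,c)+s(\hat{i},c)\big]_{+},
\]
where \hat{c} and \hat{i} range over the negative captions and images in a mini-batch.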
These loss functions for instance-based image-text retrieval adopt binary labels that indicate only whether an image and a text match, which is insufficient to represent the degree of relevance between them. Training a model with such a binary-label-based loss function destroys the coherence of the visual-semantic embedding space, making it difficult for the model to learn continuous semantic relations.
B. Semantic-based Image-Text Retrieval
While most work focuses on instance-based retrieval, a few
studies have explored semantic-based retrieval. Some studies
propose that the semantic similarity between captions can
be used to approximate the relevance degree between image
and text [16], [19]. Wray et al. [19] propose several proxies
to estimate relevance degrees. Biten et al. [18] use image
captioning evaluation metrics, i.e., Consensus-based Image
Description Evaluation (CIDEr) [35] and Semantic Proposi-
tional Image Caption Evaluation (SPICE) [36], to approximate
the relevance degree, and design a semantic adaptive margin
(SAM) loss for semantic-based retrieval. The SAM loss is a variant of the triplet loss in which candidates are pushed away from the query by semantic adaptive margins in the embedding space.
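Schematically (this is an illustrative form, not necessarily the exact formulation of [18]), such a loss replaces the fixed margin of the triplet loss with one that grows as the pseudo relevance of a candidate decreases:
\[
\mathcal{L}_{\mathrm{SAM}}(q,c,\hat{c})=\big[\mu(c,\hat{c})-s(q,c)+s(q,\hat{c})\big]_{+},\qquad
\mu(c,\hat{c})=\alpha+\lambda\big(1-r(c,\hat{c})\big),
\]
where r(c, \hat{c}) ∈ [0, 1] is the pseudo relevance of candidate \hat{c} to the ground-truth caption c estimated by CIDEr or SPICE, and λ controls the adjustment range of the margin.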
The appropriate range of the margin in the SAM loss depends on the dataset, the retrieval model, and the pseudo-label calculation method; it must be carefully tuned by hand and therefore cannot be flexibly applied to different data and models.
Zhou et al. [16], [17] propose to measure relevance degrees with BERT [20] and design a ladder loss to learn a coherent embedding space. In the ladder loss, the relevance degrees are artificially divided into several levels, which introduces a large number of hyper-parameters. Moreover, pseudo labels
approximated by text similarity are not completely accurate,
and existing methods ignore the negative effects of inaccurate
pseudo labels. The Kendall ranking loss proposed in this paper addresses these problems in current semantic-based retrieval methods.
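To make the hyper-parameter burden of the ladder loss concrete, it can be sketched as follows (a schematic form, not the exact formulation of [16]): with the candidates of a query q partitioned into levels \mathcal{C}_1, \dots, \mathcal{C}_L of decreasing pseudo relevance,
\[
\mathcal{L}_{\mathrm{ladder}}(q)=\sum_{l=1}^{L-1}\beta_{l}\sum_{c\in\mathcal{C}_{\le l}}\sum_{c'\in\mathcal{C}_{>l}}\big[\alpha_{l}-s(q,c)+s(q,c')\big]_{+},
\]
so the number of levels L, the per-level margins α_l, and the per-level weights β_l must all be chosen by hand.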
In addition to these methodological problems, the performance evaluation of semantic-based retrieval methods also has shortcomings. Existing evaluation metrics for semantic-based retrieval only reflect how well the retrieval model fits inaccurate pseudo labels, and therefore cannot objectively reflect retrieval performance. This paper remedies these shortcomings in the performance evaluation of semantic-based retrieval.
C. Deep Metric Learning
The main work of this paper belongs to the field of deep
metric learning. Deep metric learning aims to construct an
embedding space to reflect the semantic distances among
instances. It has many other applications such as face recog-
nition [37] and image retrieval [38]. Contrastive loss [39] and
triplet loss [40] are two representative pairwise approaches
in deep metric learning. Unlike the contrastive loss, which pushes misaligned pairs apart by a fixed margin and pulls aligned pairs as close together as possible, the triplet loss only forces the similarity of a positive pair to be higher than that of a negative pair by a margin, and thus enjoys more flexibility. To address the potentially slow convergence and unstable performance of these losses, recent work has proposed several variants; for example, the N-pair loss [41] employs multiple negatives for each positive sample. However, the above methods are all applied
to unimodal image retrieval, where relevance degrees of the
instances can be clearly defined as a binary variable. These
loss functions can only guide the model to map images of
the same class to relatively close locations and images of
different classes to distant locations, and cannot be used to
learn continuous semantic relationships between images and
text.
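For completeness, and up to constants and sign conventions, the contrastive loss [39] for a pair with embedding distance D and binary label y (y = 1 for an aligned pair), and the N-pair loss [41] for an anchor embedding f with positive f^{+} and negatives \{f_j\}, can be written as
\[
\mathcal{L}_{\mathrm{cont}}=y\,D^{2}+(1-y)\big[m-D\big]_{+}^{2},\qquad
\mathcal{L}_{\text{N-pair}}=\log\Big(1+\sum_{j}\exp\big(f^{\top}f_{j}-f^{\top}f^{+}\big)\Big),
\]
where m is a fixed margin. Both objectives rely only on binary match/non-match supervision, which is precisely why they cannot encode graded relevance.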
Recently, some methods [42], [43] for directly optimizing
evaluation metrics such as average precision (AP) have been
proposed. Cakir et al. [42] propose FastAP to optimize AP
using a soft histogram binning technique. Brown et al. [43],
on the other hand, optimize a smoothed approximation of AP,
called Smooth-AP. Direct optimization of evaluation metrics takes more samples from the retrieval set into account and has been shown to improve training efficiency and performance [43].
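As an illustration, Smooth-AP [43] replaces the non-differentiable indicator in the rank computation with a temperature-scaled sigmoid. Roughly, with \mathcal{P} the set of positives for a query, Ω the whole retrieval set, and D_{ij} = s_j − s_i the difference of similarity scores to the query,
\[
\mathrm{AP}\approx\frac{1}{|\mathcal{P}|}\sum_{i\in\mathcal{P}}
\frac{1+\sum_{j\in\mathcal{P},\,j\neq i}\sigma(D_{ij}/\tau)}
{1+\sum_{j\in\Omega,\,j\neq i}\sigma(D_{ij}/\tau)},
\]
where σ is the sigmoid and the temperature τ controls the tightness of the approximation; see [43] for the exact formulation.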