hard-negative sampling [26, 7] has become the de-facto
training objective for many VSE approaches [17, 20, 41].
Few innovations have been made in designing the loss func-
tion for learning joint image-text embeddings since then.
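For reference, this de-facto objective is commonly instantiated as the hinge-based triplet loss with the hardest in-batch negatives; a standard formulation (following [7], with generic notation) is

\[
\mathcal{L}_{\text{hard}}(i, t) \;=\; \max_{t'} \big[\alpha + s(i, t') - s(i, t)\big]_{+} \;+\; \max_{i'} \big[\alpha + s(i', t) - s(i, t)\big]_{+},
\]

where $(i, t)$ is a matched image-text pair, $i'$ and $t'$ range over the in-batch negatives, $s(\cdot,\cdot)$ is the learned similarity, $\alpha$ is a margin, and $[\cdot]_{+} = \max(\cdot, 0)$.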
On the other hand, the design of deep metric learning (DML) losses has been well studied for image-to-image retrieval. Many loss functions have been proposed to improve training on image embedding tasks, showing that the triplet loss is not optimal for general metric learning [37, 28, 33, 36, 29]. Early losses such as the triplet loss and the contrastive loss [26, 27] are defined with the intuition that positive pairs should be close while negative pairs should be far apart in the embedding space. However, losses defined this way may not yield desirable gradients, and it is the gradients that directly drive the update of the model parameters. Some attempts have been made to design loss functions that produce desirable gradient updates [37, 29]. However, such approaches lack a systematic view and analysis of how gradient components can be combined, and they are restricted to gradients that are integrable so that the resulting losses remain differentiable. Therefore, these loss functions may be neither optimal for nor directly applicable to the image-text retrieval task.
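To make the gradient perspective concrete, consider a single hinge-based triplet term written on similarities (a standard form, not specific to any one of the cited works):

\[
\ell_{\text{tri}} \;=\; \big[\alpha + s_{an} - s_{ap}\big]_{+}, \qquad
\frac{\partial \ell_{\text{tri}}}{\partial s_{ap}} \;=\; -\,\mathbb{1}\!\left[\alpha + s_{an} - s_{ap} > 0\right], \qquad
\frac{\partial \ell_{\text{tri}}}{\partial s_{an}} \;=\; \mathbb{1}\!\left[\alpha + s_{an} - s_{ap} > 0\right],
\]

where $s_{ap}$ and $s_{an}$ are the anchor-positive and anchor-negative similarities and $\alpha$ is the margin. Whenever the hinge is active, both pairs receive constant, equal-magnitude gradients regardless of how severely the margin is violated; it is precisely this kind of gradient behavior that loss-level designs [37, 29] attempt to reshape.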
Instead of directly applying established loss functions to VSE for image-text matching, in this paper we present the Gradient-based Objective AnaLysis framework (GOAL), a novel gradient-based analysis framework for the VSE problem. We first propose a gradient framework that dissects existing losses at the gradient level and extracts their key gradient elements. We then explore a new training idea: rather than defining a loss function, we directly define the gradient used to update the model at each training step, as shown in Figure 1. This framework allows us to freely combine the key gradient elements of DML losses into a family of new gradients, without the concern of integrating a gradient back into a loss function. Finally, the new gradients further improve the performance of existing VSE methods on image-text retrieval tasks.
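As a purely illustrative sketch (not the authors' implementation; the function and variable names are hypothetical, and only the image-to-text direction is shown), directly defining the gradient on the similarity matrix and backpropagating it could look like this in PyTorch:

```python
import torch

def gradient_objective_step(img_emb, txt_emb, optimizer, margin=0.2):
    """One training step that backpropagates a hand-designed gradient on the
    similarity matrix instead of differentiating a scalar loss."""
    sim = img_emb @ txt_emb.t()          # (B, B) similarities; diagonal = matched pairs
    pos = sim.diag().view(-1, 1)         # positive-pair similarity for each image
    with torch.no_grad():
        # Hand-designed gradient element: constant weight on margin-violating
        # negatives (triplet-like); other gradient elements could be swapped in here.
        grad = (sim - pos + margin > 0).float()
        grad.fill_diagonal_(0.0)
        # Pull each positive pair with a weight matching its pushed negatives.
        grad -= torch.diag(grad.sum(dim=1))
    optimizer.zero_grad()
    sim.backward(gradient=grad)          # inject the designed gradient directly
    optimizer.step()
```

Because the gradient is specified by hand, any combination of gradient elements can be plugged in at this point, whether or not it integrates to a differentiable loss.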
In brief, our contributions can be summarized as follows:
• We propose GOAL, a general framework for comprehensively analyzing the gradient updates of existing deep metric learning loss functions, and apply it to find better objectives for the VSE problem.
• We propose a new method that tackles the image-text retrieval task by directly optimizing the model with a family of gradient objectives instead of a loss function.
• We show consistent improvements over existing methods, achieving state-of-the-art results in image-text retrieval on COCO.
2. Related Work
Visual Semantic Embedding for Image-Text Matching There is a rich line of literature on mapping the visual and text modalities to a joint semantic embedding space for image-text matching [8, 15, 7, 17, 35, 2]. VSE++ [7] is proposed as a fundamental VSE schema in which visual and text embeddings are produced by separately pretrained encoders, projected to a shared space, aggregated with AvgPool, and then jointly optimized with a triplet loss using hard-negative mining. Since then, consistent advances have been made in improving visual and text feature extractors [11, 6, 12, 31, 5] and feature aggregators [14, 19, 32, 35]. In contrast to the dominant use of spatial grids of the feature map as visual features, bottom-up attention [1] has been introduced to learn visual semantic embeddings for image-text matching, commonly realized by stacking region representations from pretrained object detectors [17, 41]. [2] proposed Generalized Pooling Operators (GPO) to learn the best pooling strategy, which outperforms approaches with complex feature aggregators. Inspired by the success of large-scale pretraining of language models [5, 21], there is a recent trend of performing task-agnostic vision-language pretraining (VLP) on massive image-text pairs to obtain generic representations, which are then fine-tuned on task-specific data and losses to achieve state-of-the-art results in downstream tasks including image-text retrieval [23, 30, 4, 20, 40]. However, in contrast to our proposed method, these prevalent approaches still optimize the triplet loss as the de-facto objective for image-text matching. In this paper, we revisit the problem of finding better training objectives for visual semantic embeddings.
Deep Metric Learning is useful in extreme classification settings such as fine-grained recognition [28, 22, 34, 16, 26]. The goal is to train networks that map semantically related images to nearby locations and unrelated images to distant locations in an embedding space. Many loss functions have been proposed for deep metric learning. The triplet loss [13, 26] and its variants such as the circle loss [29] form a triplet of anchor, positive, and negative instances, where the anchor and the positive share the same label while the anchor and the negative have different labels. Pair-wise loss functions such as the contrastive loss [10], binomial deviance loss [37], lifted structure loss [28], and multi-similarity loss [33] penalize large distances between pairs of instances with the same label and small distances between pairs of instances with different labels. All of these loss functions encourage the distances of positive image pairs to be smaller than the distances of negative image pairs. Because the training goal of DML is similar to that of the VSE problem, in this paper we borrow these loss-design ideas from DML to improve VSE.
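For concreteness, a standard formulation of the pair-wise contrastive loss [10], for example, is

\[
\ell_{\text{pair}}(x_i, x_j) \;=\; y_{ij}\, d_{ij}^{2} \;+\; (1 - y_{ij}) \big[\alpha - d_{ij}\big]_{+}^{2},
\]

where $d_{ij}$ is the embedding distance, $y_{ij} = 1$ if the pair shares a label and $0$ otherwise, and $\alpha$ is a margin; the first term pulls same-label pairs together while the second pushes different-label pairs apart until they are separated by the margin.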