Dissecting Deep Metric Learning Losses for Image-Text Retrieval
Hong Xuan, Xi (Stephen) Chen
Microsoft
{Hong.Xuan|Chen.Stephen}@microsoft.com
Abstract
Visual-Semantic Embedding (VSE) is a prevalent approach in image-text retrieval that learns a joint embedding space between the image and language modalities in which semantic similarities are preserved. The triplet loss with hard-negative mining has become the de-facto objective for most VSE methods. Inspired by recent progress in deep metric learning (DML) in the image domain, which has given rise to new loss functions that outperform the triplet loss, in this paper we revisit the problem of finding better objectives for VSE in image-text matching. Despite some attempts at designing losses based on gradient movement, most DML losses are defined empirically in the embedding space. Instead of directly applying these loss functions, which may lead to sub-optimal gradient updates of the model parameters, we present a novel Gradient-based Objective AnaLysis framework, or GOAL, to systematically analyze the combinations and reweightings of the gradients in existing DML functions. With the help of this analysis framework, we further propose a new family of objectives in the gradient space, exploring different gradient combinations. In the event that the gradients are not integrable to a valid loss function, we implement our proposed objectives such that they operate directly in the gradient space instead of on losses in the embedding space. Comprehensive experiments demonstrate that our novel objectives consistently improve performance over baselines across different visual/text features and model frameworks. We also show the generalizability of the GOAL framework by extending it to other models that use triplet-family losses, including vision-language models with heavy cross-modal interactions, and achieve state-of-the-art results on the image-text retrieval tasks on COCO and Flickr30K. Code is available at:
https://github.com/littleredxh/VSE-Gradient.git
1. Introduction
Recognizing and describing the visual world with language is a basic human ability but remains challenging for artificial intelligence. With recent advances in Deep Neural Networks, tremendous progress has been made in bridging the vision and language modalities. Visual-semantic embedding (VSE) [8, 15, 7] is one of the major approaches to building a connection between images and natural language. It aims to map images and their descriptive texts into a joint space, such that a relevant image-text pair is mapped close together while an irrelevant pair is mapped far apart. In this paper, we focus on visual-semantic embedding for the task of image-text matching and retrieval, but our approach generalizes to other image-text retrieval models that use the triplet loss family [17, 4, 20, 40].

Figure 1. To realize a desired visual semantic embedding space, a common method is to design a loss function that can be computed on deep learning platforms such as PyTorch or TensorFlow. The auto-grad mechanism on these platforms then automatically calculates the gradients that update the model parameters to form the desired embedding space. In practice, the goal of visual semantic embedding is to optimize the clustering or separation of feature points extracted from images and text; the loss function is a somewhat indirect means to that goal, while the gradient directly drives the update of the embedding space. We propose a method to directly design the gradient used to train models.

A VSE model usually consists of feature extractors for image and text, a feature aggregator [2], and an objective function used during training. Despite significant advances of VSE in feature extractors [31, 6, 1] and feature aggregators [32, 2], less attention has been paid to the loss function for training the model. A hinge-based triplet ranking loss with hard-negative sampling [26, 7] has become the de-facto training objective for many VSE approaches [17, 20, 41], and few innovations have been made in designing the loss function for learning joint image-text embeddings since then.
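For concreteness, the following is a minimal PyTorch sketch of this hinge-based triplet loss with in-batch hard-negative mining in the style of VSE++ [7]; the function name, the margin value, and the assumption of L2-normalized embeddings are illustrative choices, not details taken from this paper's released code.

import torch

def vse_triplet_hard_negative(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (B, D) L2-normalized embeddings; row i of each is a matched pair.
    scores = img_emb @ txt_emb.t()            # (B, B) cosine-similarity matrix
    pos = scores.diag()                       # similarity of each matched pair, shape (B,)
    # Mask the diagonal so positives cannot be selected as negatives.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    scores = scores.masked_fill(mask, float('-inf'))
    hardest_txt = scores.max(dim=1).values    # hardest negative caption for each image
    hardest_img = scores.max(dim=0).values    # hardest negative image for each caption
    loss = (margin + hardest_txt - pos).clamp(min=0) \
         + (margin + hardest_img - pos).clamp(min=0)
    return loss.mean()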
On the other hand, designing deep metric learning (DML) losses has been well studied for image-to-image retrieval. Many loss functions have been proposed to improve training performance on image embedding tasks, showing that the triplet loss is not optimal for general metric learning [37, 28, 33, 36, 29]. Early losses such as the triplet loss and the contrastive loss [26, 27] are defined with the intuition that positive pairs should be close while negative pairs should be far apart in the embedding space. However, losses defined this way may not produce the gradients that actually drive the model parameters in a desirable direction. Some attempts have been made to define loss functions that achieve desirable gradient updates [37, 29]. However, these approaches lack a systematic view and analysis of the combinations of gradients, and they are limited to integrable gradients so that the resulting losses remain differentiable. Therefore, such loss functions may be neither optimal for nor applicable to the image-text retrieval task.
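To see what this means concretely (a standard analysis restated here for illustration, not a formula quoted from this paper): write the triplet loss on cosine similarities as $L = [\alpha + s_{an} - s_{ap}]_+$. Whenever the margin is violated, $\partial L / \partial s_{ap} = -1$ and $\partial L / \partial s_{an} = +1$, so every active triplet pulls and pushes with the same constant strength regardless of how hard it is. A pair-based loss such as binomial deviance on a negative pair, $L_{neg} = \log(1 + e^{\beta (s_{an} - \lambda)})$, instead gives $\partial L_{neg} / \partial s_{an} = \beta\,\sigma(\beta (s_{an} - \lambda))$, a weight that grows smoothly with pair difficulty. Two losses that look similar in the embedding space can therefore induce very different gradient weightings, which is precisely the level at which GOAL performs its analysis.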
Instead of directly applying established loss functions to VSE for image-text matching, in this paper we present the Gradient-based Objective AnaLysis framework, or GOAL, a novel gradient-based analysis framework for the VSE problem. We first propose a new gradient framework to dissect the losses at the gradient level and extract their key gradient elements. Then, we explore a new training idea: directly defining the gradient used to update the model in each training step instead of defining a loss function, as shown in Figure 1. This new framework allows us to simply combine the key gradient elements of DML losses into a family of new gradients, and it removes the concern of whether a gradient integrates into a valid loss function. Finally, the new gradients further improve the performance of existing VSE methods on image-text retrieval tasks.
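One simple way to realize "define the gradient instead of the loss" on an auto-grad platform is a surrogate objective whose gradient with respect to the embeddings equals a hand-designed one. The sketch below illustrates this generic trick under assumed names (apply_designed_gradient and design_gradient are hypothetical placeholders for a gradient rule extracted by GOAL); it is not the paper's released implementation.

import torch

def apply_designed_gradient(emb, grad):
    # grad: hand-designed gradient of the objective w.r.t. emb, same shape as emb.
    # The surrogate's gradient w.r.t. emb is exactly `grad` (grad is detached),
    # and autograd then backpropagates it through the encoder as usual.
    return (emb * grad.detach()).sum()

# Usage sketch:
# img_emb, txt_emb = model(images, captions)        # embeddings with requires_grad
# g_img, g_txt = design_gradient(img_emb, txt_emb)  # hypothetical gradient rule
# surrogate = apply_designed_gradient(img_emb, g_img) \
#           + apply_designed_gradient(txt_emb, g_txt)
# surrogate.backward()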
In brief, our contributions can be summarized as follows:
- We propose a general framework, GOAL, to comprehensively analyze the gradient updates of existing deep metric learning loss functions, and we apply this framework to find better objectives for the VSE problem.
- We propose a new method that tackles the image-text retrieval task by directly optimizing the model with a family of gradient objectives instead of a loss function.
- We show consistent improvement over existing methods, achieving state-of-the-art results on image-text retrieval tasks on the COCO dataset.
2. Related Work
Visual Semantic Embedding for Image-Text Matching. There is a rich line of literature on mapping the visual and text modalities to a joint semantic embedding space for image-text matching [8, 15, 7, 17, 35, 2]. VSE++ [7] is a fundamental VSE schema in which visual and text embeddings are pretrained separately and then aggregated with AvgPool after being projected into a shared space, and the two branches are jointly optimized by a triplet loss with hard-negative mining. Since then, consistent advances have been made in visual and text feature extractors [11, 6, 12, 31, 5] and feature aggregators [14, 19, 32, 35]. In contrast to the dominant use of the spatial grid of the feature map as visual features, bottom-up attention [1] has been introduced to learn visual semantic embeddings for image-text matching, commonly realized by stacking region representations from pretrained object detectors [17, 41]. [2] proposed Generalized Pooling Operators (GPO) to learn the best pooling strategy, which outperforms approaches with complex feature aggregators. Inspired by the success of large-scale pretraining of language models [5, 21], there is a recent trend of performing task-agnostic vision-language pretraining (VLP) on massive image-text pairs for generic representations and then fine-tuning on task-specific data and losses to achieve state-of-the-art results in downstream tasks including image-text retrieval [23, 30, 4, 20, 40]. However, in contrast to our proposed method, prevalent approaches still optimize the triplet loss as the de-facto objective for the image-text matching task. In this paper, we revisit the problem of finding better training objectives for visual semantic embeddings.
Deep Metric Learning is useful in extreme classification settings such as fine-grained recognition [28, 22, 34, 16, 26]. The goal is to train networks to map semantically related images to nearby locations and unrelated images to distant locations in an embedding space. Many loss functions have been proposed to solve the deep metric learning problem. The triplet loss [13, 26] and its variants, such as the circle loss [29], form a triplet of anchor, positive, and negative instances, where the anchor and positive share the same label while the anchor and negative have different labels. Pair-wise loss functions such as the contrastive loss [10], binomial deviance loss [37], lifted structure loss [28], and multi-similarity loss [33] penalize a large distance between a pair of instances with the same label and a small distance between a pair of instances with different labels. All of these loss functions encourage the distances of positive pairs to be smaller than the distances of negative pairs. Since the training goal of DML is similar to that of the VSE problem, in this paper we borrow these loss-design ideas from DML to improve VSE.
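As a concrete pair-wise example, here is a minimal PyTorch sketch of the classic margin-based contrastive loss [10]; the margin value and function signature are illustrative assumptions, not code from any of the cited works.

import torch

def contrastive_loss(dist, is_positive, margin=0.5):
    # dist: (N,) pairwise embedding distances; is_positive: (N,) boolean pair labels.
    pos_term = is_positive.float() * dist.pow(2)                  # pull positive pairs together
    neg_term = (~is_positive).float() \
             * (margin - dist).clamp(min=0).pow(2)                # push negative pairs past the margin
    return (pos_term + neg_term).mean()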