hard-negative sampling [26, 7] has become the de-facto
training objective for many VSE approaches [17, 20, 41].
Few innovations have been made in designing the loss func-
tion for learning joint image-text embeddings since then.
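For reference, this de-facto objective is commonly instantiated as the hinge-based triplet loss with the hardest in-batch negatives; a standard formulation (following [7], with generic notation) is

\[
\mathcal{L}_{\text{hard}}(i, t) \;=\; \max_{t'} \big[\alpha + s(i, t') - s(i, t)\big]_{+} \;+\; \max_{i'} \big[\alpha + s(i', t) - s(i, t)\big]_{+},
\]

where $(i, t)$ is a matched image-text pair, $i'$ and $t'$ range over the in-batch negatives, $s(\cdot,\cdot)$ is the learned similarity, $\alpha$ is a margin, and $[\cdot]_{+} = \max(\cdot, 0)$.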
On the other hand, the design of deep metric learning (DML) losses has been well studied for image-to-image retrieval. Many loss functions have been proposed to improve training on image embedding tasks, showing that the triplet loss is not optimal for general metric learning [37, 28, 33, 36, 29]. Early losses such as the triplet loss and the contrastive loss [26, 27] are defined with the intuition that positive pairs should be close while negative pairs should be far apart in the embedding space. However, losses defined this way may not yield desirable gradients, and it is the gradients that directly drive the update of the model parameters. Some attempts have been made to design loss functions that produce desirable gradient updates [37, 29]. However, such approaches lack a systematic view and analysis of how gradient components can be combined, and they are restricted to gradients that are integrable so that the resulting losses remain differentiable. Therefore, these loss functions may be neither optimal for nor directly applicable to the image-text retrieval task.
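To make the gradient perspective concrete, consider a single hinge-based triplet term written on similarities (a standard form, not specific to any one of the cited works):

\[
\ell_{\text{tri}} \;=\; \big[\alpha + s_{an} - s_{ap}\big]_{+}, \qquad
\frac{\partial \ell_{\text{tri}}}{\partial s_{ap}} \;=\; -\,\mathbb{1}\!\left[\alpha + s_{an} - s_{ap} > 0\right], \qquad
\frac{\partial \ell_{\text{tri}}}{\partial s_{an}} \;=\; \mathbb{1}\!\left[\alpha + s_{an} - s_{ap} > 0\right],
\]

where $s_{ap}$ and $s_{an}$ are the anchor-positive and anchor-negative similarities and $\alpha$ is the margin. Whenever the hinge is active, both pairs receive constant, equal-magnitude gradients regardless of how severely the margin is violated; it is precisely this kind of gradient behavior that loss-level designs [37, 29] attempt to reshape.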
Instead of directly applying established loss functions to VSE for image-text matching, in this paper we present the Gradient-based Objective AnaLysis framework (GOAL), a novel gradient-based analysis framework for the VSE problem. We first propose a gradient framework that dissects existing losses at the gradient level and extracts their key gradient elements. We then explore a new training idea: rather than defining a loss function, we directly define the gradient used to update the model at each training step, as shown in Figure 1. This framework allows us to freely combine the key gradient elements of DML losses into a family of new gradients, without the concern of integrating a gradient back into a loss function. Finally, the new gradients further improve the performance of existing VSE methods on image-text retrieval tasks.
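As a purely illustrative sketch (not the authors' implementation; the function and variable names are hypothetical, and only the image-to-text direction is shown), directly defining the gradient on the similarity matrix and backpropagating it could look like this in PyTorch:

```python
import torch

def gradient_objective_step(img_emb, txt_emb, optimizer, margin=0.2):
    """One training step that backpropagates a hand-designed gradient on the
    similarity matrix instead of differentiating a scalar loss."""
    sim = img_emb @ txt_emb.t()          # (B, B) similarities; diagonal = matched pairs
    pos = sim.diag().view(-1, 1)         # positive-pair similarity for each image
    with torch.no_grad():
        # Hand-designed gradient element: constant weight on margin-violating
        # negatives (triplet-like); other gradient elements could be swapped in here.
        grad = (sim - pos + margin > 0).float()
        grad.fill_diagonal_(0.0)
        # Pull each positive pair with a weight matching its pushed negatives.
        grad -= torch.diag(grad.sum(dim=1))
    optimizer.zero_grad()
    sim.backward(gradient=grad)          # inject the designed gradient directly
    optimizer.step()
```

Because the gradient is specified by hand, any combination of gradient elements can be plugged in at this point, whether or not it integrates to a differentiable loss.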
In brief, our contributions can be summarized as follows:
• We propose GOAL, a general framework for comprehensively analyzing the gradient updates of existing deep metric learning loss functions, and apply it to find better objectives for the VSE problem.
• We propose a new method that tackles the image-text retrieval task by directly optimizing the model with a family of gradient objectives instead of a loss function.
• We show consistent improvements over existing methods, achieving state-of-the-art results in image-text retrieval on COCO.
2. Related Work
Visual Semantic Embedding for Image-Text Matching There is a rich line of literature on mapping the visual and text modalities to a joint semantic embedding space for image-text matching [8, 15, 7, 17, 35, 2]. VSE++ [7] is proposed as a fundamental VSE schema in which visual and text embeddings are produced by separately pretrained encoders, projected to a shared space, aggregated with AvgPool, and then jointly optimized with a triplet loss using hard-negative mining. Since then, consistent advances have been made in improving visual and text feature extractors [11, 6, 12, 31, 5] and feature aggregators [14, 19, 32, 35]. In contrast to the dominant use of spatial grids of the feature map as visual features, bottom-up attention [1] has been introduced to learn visual semantic embeddings for image-text matching, commonly realized by stacking region representations from pretrained object detectors [17, 41]. [2] proposed Generalized Pooling Operators (GPO) to learn the best pooling strategy, which outperforms approaches with complex feature aggregators. Inspired by the success of large-scale pretraining of language models [5, 21], there is a recent trend of performing task-agnostic vision-language pretraining (VLP) on massive image-text pairs to obtain generic representations, which are then fine-tuned on task-specific data and losses to achieve state-of-the-art results in downstream tasks including image-text retrieval [23, 30, 4, 20, 40]. However, in contrast to our proposed method, these prevalent approaches still optimize the triplet loss as the de-facto objective for image-text matching. In this paper, we revisit the problem of finding better training objectives for visual semantic embeddings.
Deep Metric Learning is useful in extreme classification settings such as fine-grained recognition [28, 22, 34, 16, 26]. The goal is to train networks that map semantically related images to nearby locations and unrelated images to distant locations in an embedding space. Many loss functions have been proposed for deep metric learning. The triplet loss [13, 26] and its variants such as the circle loss [29] form a triplet of anchor, positive, and negative instances, where the anchor and the positive share the same label while the anchor and the negative have different labels. Pair-wise loss functions such as the contrastive loss [10], binomial deviance loss [37], lifted structure loss [28], and multi-similarity loss [33] penalize large distances between pairs of instances with the same label and small distances between pairs of instances with different labels. All of these loss functions encourage the distances of positive image pairs to be smaller than the distances of negative image pairs. Because the training goal of DML is similar to that of the VSE problem, in this paper we borrow these loss-design ideas from DML to improve VSE.
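For concreteness, a standard formulation of the pair-wise contrastive loss [10], for example, is

\[
\ell_{\text{pair}}(x_i, x_j) \;=\; y_{ij}\, d_{ij}^{2} \;+\; (1 - y_{ij}) \big[\alpha - d_{ij}\big]_{+}^{2},
\]

where $d_{ij}$ is the embedding distance, $y_{ij} = 1$ if the pair shares a label and $0$ otherwise, and $\alpha$ is a margin; the first term pulls same-label pairs together while the second pushes different-label pairs apart until they are separated by the margin.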