ing the distance between a target and the hardest
in-batch negative sample. In practice, however,
it is often difficult for the model to find a good
negative sample in the early stages of training (as
instances are randomly distributed in space), result-
ing in slow convergence (see Appendix A.2). To
improve optimization, we propose an adaptive op-
timization objective that selects multiple in-batch
negative samples based on model quality during
training. The intuition is that in the early stages of
training we want to sample more negative samples,
and in the later stages fewer negative samples.
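This schedule can be illustrated with a minimal sketch (our own notation, not the paper's exact formulation; `model_quality` is a hypothetical proxy in [0, 1], e.g. a running validation recall, and `num_negatives` is a name we introduce for illustration):

```python
import math

def num_negatives(batch_size, model_quality):
    """Choose how many in-batch negatives to keep.

    Early in training (model_quality near 0) we keep many
    negatives; as the model improves (quality near 1) we keep
    only the hardest few, recovering the standard hardest-negative
    objective in the limit.
    """
    max_k = batch_size - 1  # every other item in the batch is a candidate
    k = max(1, math.ceil((1.0 - model_quality) * max_k))
    return k
```

For a batch of 128, an untrained model (quality 0.0) uses all 127 in-batch negatives, while a well-trained model (quality 0.9) uses only the 13 hardest.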
Over two public datasets, MS-COCO (Lin et al.,
2014) and Flickr30K (Young et al., 2014), we
show that a standard VSE model using our pro-
posed feature aggregation and optimization strate-
gies outperforms benchmark models substantially.
In particular, our method obtains 1.4% relative
gains on RSUM for MS-COCO and 1.0% for
Flickr30K. Compared with a pre-trained vision-
language model of similar performance (Geigle
et al., 2022), our method is 4.3× faster.
2 Related Work
Depending on whether the image and text features
have any form of cross-modal interaction before
similarity calculation, existing image-text retrieval
methods can be broadly categorized into two types.
The visual semantic embedding (VSE) (Faghri
et al., 2018; Wang et al., 2020; Chun et al., 2021)
methods process the multimodal instances inde-
pendently before projecting them into a joint em-
bedding space for similarity matching. Wang et al.
(2018) design a two-branch neural network, LIWE
(Wehrmann et al., 2019) considers character-based
alignment and embedding methods for the language
encoder, and Faghri et al. (2018) extend these by
using a hard triplet loss for optimization. Following
these ideas, PVSE (Song and Soleymani, 2019)
and CVSE (Wang et al., 2020) are proposed to
consider intra-modal polysemous and consensus
information. Recently, Chun et al. (2021) model
instances as probabilistic distributions and achieve
further improvement. These VSE-based methods
are fast as they do not consider cross-modal inter-
action and as such the visual and text features can
be pre-computed. The non-VSE methods concen-
trate on the interaction of modalities. Specifically,
late-interaction methods explore fusing multi-
modal information via attention (Lee et al., 2018;
Chen et al., 2020), alignment (Zhang et al., 2020),
multi-view representation (Qu et al., 2020) and
fine-grained reasoning (Qu et al., 2021). Early-
interaction methods (Geigle et al., 2022), such as
pre-trained vision-language models (Lu et al., 2019;
Chen et al., 2019; Li et al., 2020; Jia et al., 2021;
Li et al., 2022), focus on maximizing perfor-
mance while sacrificing efficiency.
Our paper focuses on improving feature
aggregation and optimization for VSE. Existing
explorations of these two steps are as follows.
The performance of VSE ultimately depends on
the quality of the joint embedding space, which is
usually learned with simple transformations (e.g.
linear projection or multi-layer perceptron) and
pooling aggregators (e.g. mean pooling (Faghri
et al., 2018; Qu et al., 2020), max pooling (Zhang
and Lu, 2018; Li et al., 2021), or a combination of
them (Lee et al., 2018)). Compared to these simple
aggregation methods, more complex aggregators
that introduce a large number of trainable param-
eters have also been explored, e.g. inter-modal at-
tention (Wehrmann et al., 2020) and self-attention
mechanisms (Han et al., 2021). Zhang et al. (2021)
design a cross-modal guided pooling module that
attends to local information dynamically. These
sophisticated aggregators typically require more
computation time, and do not always outperform
simple pooling strategies. Perhaps the closest study
to our work is GPO (VSE∞) (Chen et al., 2021),
which builds a generalized operator to learn the
best pooling strategy, considering only the position
information of the extracted features.
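The simple pooling aggregators above can be sketched as follows (a minimal pure-Python illustration over a variable-length set of local feature vectors, not any paper's implementation; in practice these reductions run over tensors):

```python
def mean_pool(features):
    """Element-wise mean over a set of region/token feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]

def max_pool(features):
    """Element-wise max over the same feature set."""
    dim = len(features[0])
    return [max(f[d] for f in features) for d in range(dim)]
```

Both are parameter-free, which is why they remain competitive with learned aggregators despite their simplicity.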
Some studies focus on improving the optimiza-
tion objective, and the most widely adopted ob-
jective is the hinge-based hard triplet ranking loss
(Faghri et al., 2018; Wei et al., 2020b; Messina
et al., 2021), which dynamically selects the “hard-
est” negative sample within a mini-batch. Other
studies explore solutions that choose multiple neg-
ative samples. Zhou et al. (2020) introduce a co-
herence metric to rank the “irrelevant” candidates.
Extending the idea, Wei et al. (2020a) assign dif-
ferent weights for positive and negative pairs. To
tackle the issue of noisy labels which impacts mul-
timodal representation, Hu et al. (2021) propose
maximizing the mutual information between differ-
ent modalities. Huang et al. (2021) separate data
into “clean” and “noisy” partitions by co-teaching.
However, the above methods do not adapt their
negative-sample selection to the model’s perfor-
mance as training progresses.
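The hinge-based hard triplet ranking loss discussed above can be sketched as follows (a simplified pure-Python version over a precomputed similarity matrix; real implementations operate on batched tensors with gradients):

```python
def hardest_triplet_loss(sim, margin=0.2):
    """Hinge-based triplet loss with the hardest in-batch negative.

    sim[i][j] is the similarity between image i and caption j;
    diagonal entries are the matched (positive) pairs. Each positive
    pair is hinged against the single hardest negative caption for
    the image and the single hardest negative image for the caption.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_cap = max(sim[i][j] for j in range(n) if j != i)
        hardest_img = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hardest_cap)
        total += max(0.0, margin - pos + hardest_img)
    return total / n
```

The loss is zero whenever every positive pair beats its hardest in-batch negative by at least the margin, which is why a randomly initialized model (where negatives are rarely "hard") yields weak gradients early in training.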