cross-modal information retrieval [16]. However, compared to traditional VSE networks, vision transformer-based cross-modal retrieval networks require a large amount of training data, and the time they need to retrieve the results of a query makes them unsuitable for real-world applications [1]. Hashing-based networks are another active solution for cross-modal information retrieval [17]. For example, Liu et al. [18] were the first to propose a hashing framework that learns hash codes of varying lengths for comparing images and descriptions, and the learned modality-specific hash codes contain richer semantics. Hashing-based networks are concerned with reducing data storage costs and improving retrieval speed. Such networks are out of scope for this paper, whose focus is on VSE networks, which mostly aim to exploit the local information alignment between images and descriptions for improved retrieval performance.
Loss Functions for Cross-modal Information Retrieval.
One of the earliest and most widely used cross-modal information retrieval loss functions is the Sum of Hinges Loss (LSH) [19]. LSH is also known as a negatives loss function: it learns a fixed margin between the similarities of the relevant image–description embedding pairs and those of the irrelevant embedding pairs. A more recent hard negatives loss function, the Max of Hinges Loss (LMH) [3], is adopted by most recent VSE networks owing to its ability to outperform LSH [20, 21]. An improved version of LSH, LMH focuses only on the hard negatives, i.e. the irrelevant image–description embedding pairs that are nearest to the relevant pairs. Song et al. [9] presented a margin-adaptive triplet loss for cross-modal information retrieval that uses a hashing-based method to embed the image and text into a low-dimensional Hamming space. Liu et al. [22] applied a variant of the triplet loss in their novel VSE network for cross-modal information retrieval, where the input text embedding of the loss is replaced by the network's reconstructed image embedding. Recently, Wei et al. [10] proposed a polynomial-based [23] Universal Weighting Metric Loss (LUWM) with flexible objective margins, which has been shown to outperform existing hard negatives loss functions.
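As an illustration of the hard negatives objective described above, a minimal PyTorch-style sketch of LMH over a mini-batch is given below; the cosine similarity function, margin value, and function name are assumptions made for illustration and are not taken from any of the cited implementations.

    import torch

    def lmh_loss(v, u, margin=0.2):
        # Max of Hinges Loss (LMH): for each relevant pair, only the hardest
        # (most similar) irrelevant description and irrelevant image are penalised.
        # v, u: (n, d) L2-normalised image and description embeddings.
        s = v @ u.t()                               # s[i, j] = cosine similarity of v_i and u_j
        pos = s.diag().view(-1, 1)                  # similarities of the relevant pairs
        mask = torch.eye(s.size(0), dtype=torch.bool, device=s.device)
        cost_u = (margin + s - pos).clamp(min=0).masked_fill(mask, 0)      # image vs. irrelevant descriptions
        cost_v = (margin + s - pos.t()).clamp(min=0).masked_fill(mask, 0)  # description vs. irrelevant images
        return cost_u.max(dim=1)[0].sum() + cost_v.max(dim=0)[0].sum()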
A summary of the limitations of the existing loss functions is as follows. (1) The learning objectives of LSH [19] and LMH [3] are not flexible because of their fixed margins. (2) The adaptive margin in [9] is not optimal, because it relies on similarities between irrelevant image–description embedding pairs computed by the network that is still being optimised. (3) The modified ranking loss of [22] cannot be integrated into other networks. (4) LUWM [10] does not consider the optimal semantic information from irrelevant image–description pairs.
VSE Networks with a Negatives Loss Function.
LMH was proposed by Faghri et al. [3] and has since been adopted by other VSE networks. To improve the attention mechanism, Lee et al. [24] proposed an approach that aligns image region features with the keywords of the relevant image–description pair, and Diao et al. [4] built an architecture that deeply extends the attention mechanisms of image-image, text-text, and image-text tasks. To extract high-level semantics, Li et al. [13] utilised a GCN [14] to explore the relations between image objects. To aggregate the image and description embeddings, Chen et al. [8] proposed a special pooling operator. Wang et al. [25] proposed an end-to-end VSE network that does not rely on a pre-trained CNN for image feature extraction.
Methods for Finding the Underlying Meaning of Descriptions.
BERT [26] is a supervised and widely used deep neural network for NLP tasks [27]. Singular Value Decomposition (SVD) [28] is an unsupervised matrix decomposition method and an established approach in NLP and information retrieval [29]. Both BERT and SVD have dimensionality reduction capabilities that enable them to find the underlying semantic similarity between texts (e.g. sentences, image captions, documents).
3 Proposed Semantically-Enhanced Hard Negatives Loss Function
Let $X = \{(I_i, D_i) \mid i = 1 \dots n\}$ denote a training set containing paired images and descriptions, where each image $I_i$ corresponds to its relevant description $D_i$; $i$ is the index and $n$ is the size of set $X$. Let $X' = \{(v_i, u_i) \mid i = 1 \dots n\}$ be a set of image–description embedding pairs output by a VSE network, where each $i$-th relevant pair consists of an image embedding $v_i$ and its relevant description embedding $u_i$. Let $\hat{v}_i = \{v_j \mid j = 1 \dots n,\, j \neq i\}$ denote the set of all image embeddings from $X'$ that are irrelevant to $u_i$, and $\hat{u}_i = \{u_j \mid j = 1 \dots n,\, j \neq i\}$ denote the set of all description embeddings from $X'$ that are irrelevant to $v_i$. LSH, LMH, and the proposed approach and loss function used during the training of VSE networks are computed as follows.
3.1 Related Methods and Notation
LSH Description. The basic negatives loss function, LSH, is shown in Eq. (1):
$$L_{SH}(v_i, u_i) = \sum_{\hat{u}_i} \big[ \alpha + s(v_i, \hat{u}_i) - s(v_i, u_i) \big]_+ + \sum_{\hat{v}_i} \big[ \alpha + s(u_i, \hat{v}_i) - s(v_i, u_i) \big]_+ \quad (1)$$