IMPROVING VISUAL-SEMANTIC EMBEDDINGS BY LEARNING
SEMANTICALLY-ENHANCED HARD NEGATIVES FOR
CROSS-MODAL INFORMATION RETRIEVAL
Yan Gong, Georgina Cosma
Department of Computer Science
Loughborough University
{y.gong2, g.cosma}@lboro.ac.uk
This paper has been accepted for publication in Pattern Recognition (Elsevier).
ABSTRACT
Visual Semantic Embedding (VSE) networks aim to extract the semantics of images and their
descriptions and embed them into the same latent space for cross-modal information retrieval. Most
existing VSE networks are trained by adopting a hard negatives loss function which learns an
objective margin between the similarity of relevant and irrelevant image–description embedding pairs.
However, the objective margin in the hard negatives loss function is set as a fixed hyperparameter that
ignores the semantic differences of the irrelevant image–description pairs. To address the challenge
of measuring the optimal similarities between image–description pairs before obtaining the trained
VSE networks, this paper presents a novel approach that comprises two main parts: (1) finding the underlying semantics of image descriptions; and (2) a novel semantically-enhanced hard negatives loss function, where the learning objective is dynamically determined based on the optimal
similarity scores between irrelevant image–description pairs. Extensive experiments were carried
out by integrating the proposed methods into five state-of-the-art VSE networks that were applied to
three benchmark datasets for cross-modal information retrieval tasks. The results revealed that the
proposed methods achieved the best performance and can also be adopted by existing and future VSE
networks.
Keywords Visual semantic embedding network · Cross-modal · Information retrieval · Hard negatives
1 Introduction
In information retrieval, Visual Semantic Embedding (VSE) networks aim to create joint representations of images and textual descriptions and map them into a joint embedding space (i.e. the same latent space) to enable various information retrieval-related tasks, such as image–text retrieval, image captioning, and visual question answering [1]. Within the shared embedding space, the aim is to position the relevant image–description pairs far away from the irrelevant pairs [2].
Currently, the VSE literature can be summarised into: (1) approaches that extend the cross-modal encoder–decoder network to improve the learning of latent representations across images and descriptions [3]; (2) specifically designed attention architectures that improve the performance of networks [4]; and (3) networks that are modified based on generative adversarial methods for learning the common representation of images and descriptions [5]. The above-mentioned studies aim to improve VSE networks for information retrieval and have been evaluated using the benchmark MS-COCO [6] and Flickr30K [7] datasets. Few studies focus on exploring the learning potential of VSE networks. The hard negatives loss function [3] defines the learning objective of VSE networks, and it is commonly adopted by current VSE architectures [8].
Furthermore, the hard negatives loss function learns a fixed margin that is the optimal difference between the similarity
of the relevant image–description embedding pair and that of the irrelevant embedding pair. However, the fixed margin
ignores the semantic differences between the irrelevant image–description pairs. The hard negatives loss function does
not consider the distance of the irrelevant items to the query and sets the same learning objective (i.e. fixed margin)
for both pairs, image–D1 and image–D2 (the sample in Fig. 1), even though the semantic differences of the irrelevant training pairs are useful for training an information retrieval model [9]. To illustrate this point, consider Fig. 1: in the irrelevant image–D1 pair, the image and description are semantically closer than those of the irrelevant image–D2 pair, yet the hard negatives loss function sets the same learning objective for both pairs, i.e. image–D1 and image–D2, and this is not suitable. To address the limitations of the fixed margin, Wei et al. [10] introduced a polynomial loss function with an adaptive objective margin, but their method does not consider the optimal semantic information from irrelevant image–description pairs.

Figure 1: Sample of irrelevant image–description pairs. Description D1 is the one semantically closer to the image.
Our paper aims to semantically enhance the hard negatives loss function to explore the learning potential of VSE networks. This paper (1) proposes a new loss function for improving the learning efficiency and the cross-modal information retrieval performance of VSE networks; (2) embeds the proposed loss function within state-of-the-art VSE networks; and (3) evaluates its efficiency using benchmark datasets suitable for the task of cross-modal information retrieval. The contributions of our paper are as follows.

• A novel approach that infers the semantics of image descriptions by finding the underlying meaning of descriptions using eigendecomposition and dimensionality reduction (i.e. Singular Value Decomposition). The derived descriptions are then utilised by the proposed semantically-enhanced hard negatives loss function, entitled LSEH, when computing the optimal similarities between irrelevant image–description pairs.

• A semantically-enhanced hard negatives loss function that redefines the learning objective of VSE networks. The proposed loss function dynamically adjusts the learning objective according to the semantic similarities between irrelevant image–description pairs. Ambiguous training pairs with larger optimal similarity scores obtain larger gradients, which the proposed loss function utilises to improve training efficiency.

• The proposed approach and loss function can be integrated into other VSE networks to improve their learning efficiency and cross-modal information retrieval performance. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks applied to the Flickr30K, MS-COCO, and IAPR TC12 datasets, and the results showed that the proposed methods achieved the best performance.
2 Related Work
VSE Networks.
VSE networks aim to align the embeddings of relevant images and descriptions in the same latent space for cross-modal information retrieval [1]. Faghri et al. [3] proposed an Improved Visual Semantic Embedding architecture (VSE++). Image region features extracted by Faster R-CNN [11] and their descriptions were embedded into the same latent space by using a fully connected neural network and a Gated Recurrent Unit (GRU) network [12]. Most state-of-the-art VSE networks improve upon VSE++. Li et al. [13] introduced a Visual Semantic Reasoning Network (VSRN) to enhance image features with image region relationships extracted by a Graph Convolutional Network (GCN) [14]; Liu et al. [15] applied a Graph Structured Matching Network (GSMN) to build a graph of image features and words and learn the fine-grained correspondence between image features and words; and Diao et al. [4] proposed the Similarity Graph Reasoning and Attention Filtration network (SGRAF), which extends the attention mechanisms over image and description sets. SGRAF also provides two individual sub-networks to process the attention results between the image features and the description, where a Similarity Graph Reasoning network (SGR) builds a graph of the attention results for reasoning, and a Similarity Attention Filtration network (SAF) filters the important information from the attention results. Chen et al. [8] proposed a variation of the VSE network, VSE∞, that benefits from a generalized pooling operator which discovers the best strategy for pooling image and description embeddings. Recently, vision transformer-based networks, which do not rely on the hard negatives loss function, have become popular for cross-modal information retrieval [16]. However, compared to traditional VSE networks, vision transformer-based cross-modal retrieval networks require a large amount of data for training, and the time they require for retrieving the results of a query makes them unsuitable for real-world applications [1]. Hashing-based networks are another active solution for cross-modal information retrieval [17]. For example, Liu et al. [18] first proposed a hashing framework for learning varying hash codes of different lengths for the comparison between images and descriptions, where the learned modality-specific hash codes contain more semantics. Hashing-based networks are concerned with reducing data storage costs and improving retrieval speed. Such networks are out of scope for this paper because the focus herein is on VSE networks, which mostly aim to explore the local information alignment between images and descriptions for improved retrieval performance.
Loss Functions for Cross-modal Information Retrieval.
One of the earliest and most used cross-modal information retrieval loss functions is the Sum of Hinges Loss (LSH) [19]. LSH is also known as a negatives loss function, and it learns a fixed margin between the similarities of the relevant image–description embedding pairs and those of the irrelevant embedding pairs. A more recent hard negatives loss function, the Max of Hinges Loss (LMH) [3], is adopted in most recent VSE networks due to its ability to outperform LSH [20, 21]. An improved version of LSH, LMH focuses only on learning from the hard negatives, which are the irrelevant image–description embedding pairs that are nearest to the relevant pairs. Song et al. [9] presented a margin-adaptive triplet loss for the task of cross-modal information retrieval that uses a hashing-based method which embeds the image and text into a low-dimensional Hamming space. Liu et al. [22] applied a variant triplet loss function in their novel VSE network for cross-modal information retrieval, where the input text embedding for the loss is replaced by the reconstructed image embedding of the network. Recently, Wei et al. [10] proposed a polynomial-based [23] Universal Weighting Metric Loss (LUWM) with flexible objective margins, which has been shown to outperform existing hard negatives loss functions.
A summary of the limitations of the existing loss functions is as follows. (1) The learning objectives of LSH [19] and LMH [3] are not flexible because of their fixed margins. (2) The adaptive margin in [9] is not optimal, because it relies on the similarities between irrelevant image–description embedding pairs computed by the training network while that network is still being optimised. (3) The modified ranking loss of [22] cannot be integrated into other networks. (4) LUWM [10] does not consider the optimal semantic information from irrelevant image–description pairs.
VSE Networks with a Negatives Loss Function.
LMH was proposed by Faghri et al. [3], and thereafter other VSE networks adopted it. To improve the attention mechanism, Lee et al. [24] proposed an approach to align image region features with the keywords of the relevant image–description pair, and Diao et al. [4] built an architecture that deeply extends the attention mechanisms of image–image, text–text, and image–text tasks. To extract high-level semantics, Li et al. [13] utilised a GCN [14] to explore the relations between image objects. To aggregate the image and description embeddings, Chen et al. [8] proposed a special pooling operator. Wang et al. [25] proposed an end-to-end VSE network that does not rely on a pre-trained CNN for image feature extraction.
Methods for Finding the Underlying Meaning of Descriptions.
BERT [26] is a supervised and widely used deep neural network for NLP tasks [27]. Singular Value Decomposition (SVD) [28] is an unsupervised matrix decomposition method and an established approach in NLP and information retrieval [29]. BERT and SVD both have dimensionality reduction capabilities that enable them to find the underlying semantic similarity between texts (e.g. sentences, image captions, documents).
3 Proposed Semantically-Enhanced Hard Negatives Loss Function
Let $X = \{(I_i, D_i)\,|\,i = 1 \dots n\}$ denote a training set containing paired images and descriptions, where each image $I_i$ corresponds to its relevant description $D_i$; $i$ is the index and $n$ is the size of set $X$. Let $X' = \{(v_i, u_i)\,|\,i = 1 \dots n\}$ be a set of image–description embedding pairs output by a VSE network, where each $i$th relevant pair consists of an image embedding $v_i$ and its relevant description embedding $u_i$. Let $\hat{v}_i = \{v_j\,|\,j = 1 \dots n, j \neq i\}$ denote the set of all image embeddings from $X'$ that are irrelevant to $u_i$, and $\hat{u}_i = \{u_j\,|\,j = 1 \dots n, j \neq i\}$ denote the set of all description embeddings from $X'$ that are irrelevant to $v_i$. LSH, LMH, and the proposed approach and loss function that are used during the training of VSE networks are computed as follows.
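As a concrete point of reference for this notation, the following is a minimal sketch, assuming cosine similarity over L2-normalised embeddings and batch-wise processing (common choices in VSE implementations, stated here as assumptions, not prescribed by the text), of how a batch of image and description embeddings yields the pairwise similarity scores used by the loss functions below; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(v: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between image and description embeddings.

    v: (n, d) image embeddings; u: (n, d) description embeddings.
    Returns an (n, n) matrix whose diagonal holds s(v_i, u_i) for the relevant
    pairs and whose off-diagonal entries hold the similarities of irrelevant pairs.
    """
    v = F.normalize(v, dim=1)
    u = F.normalize(u, dim=1)
    return v @ u.t()
```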
3.1 Related Methods and Notation
LSH Description. The basic negatives loss function, LSH, is shown in Eq. (1):
$$L_{SH}(v_i, u_i) = \sum_{\hat{u}_i} \left[\alpha + s(v_i, \hat{u}_i) - s(v_i, u_i)\right]_+ + \sum_{\hat{v}_i} \left[\alpha + s(u_i, \hat{v}_i) - s(v_i, u_i)\right]_+ \quad (1)$$

where $[x]_+ \equiv \max(x, 0)$ and $\alpha$ serves as a margin parameter. Let $s(v_i, u_i)$ be the similarity score between the relevant image embedding $v_i$ and description embedding $u_i$; let $s(v_i, \hat{u}_i)$ be the set of similarity scores of the image embedding $v_i$ with all of its irrelevant description embeddings $\hat{u}_i$; and let $s(u_i, \hat{v}_i)$ be the set of similarity scores of the description embedding $u_i$ with all of its irrelevant image embeddings $\hat{v}_i$. Given a relevant pair of image–description embeddings $(v_i, u_i)$, the result of the function takes the sum over the irrelevant pairs $s(v_i, \hat{u}_i)$ and $s(u_i, \hat{v}_i)$ respectively.
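A minimal sketch of Eq. (1), assuming the batch-wise cosine similarity matrix defined earlier (so all other items in the batch act as the irrelevant pairs); the names and the default margin value are illustrative.

```python
import torch

def lsh_loss(sim: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Sum of Hinges Loss (LSH) over an (n, n) batch similarity matrix.

    sim[i, i] = s(v_i, u_i) for relevant pairs; off-diagonals are irrelevant pairs.
    """
    n = sim.size(0)
    pos = sim.diag().view(n, 1)                          # s(v_i, u_i)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    # Image query vs. irrelevant descriptions, and description query vs. irrelevant images.
    cost_d = (alpha + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_i = (alpha + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_d.sum() + cost_i.sum()
```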
LMH Description. The hard negatives loss function, LMH, is an improved version of LSH that only focuses on the hard negatives [3].

$$L_{MH}(v_i, u_i) = \max_{\hat{u}_i} \left[\alpha + s(v_i, \hat{u}_i) - s(v_i, u_i)\right]_+ + \max_{\hat{v}_i} \left[\alpha + s(u_i, \hat{v}_i) - s(v_i, u_i)\right]_+ \quad (2)$$

As shown in Eq. (2), given a relevant image–description pair $(v_i, u_i)$, the result of the function only takes the maximum value over the irrelevant pairs $s(v_i, \hat{u}_i)$ and $s(u_i, \hat{v}_i)$ respectively.
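The corresponding sketch of Eq. (2), under the same batch-wise assumptions as above, differs from the LSH sketch only in taking the maximum over the irrelevant pairs instead of the sum.

```python
import torch

def lmh_loss(sim: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Max of Hinges Loss (LMH): keeps only the hardest negative per query."""
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    cost_d = (alpha + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_i = (alpha + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    # Hardest negatives: max over irrelevant descriptions (rows) and images (columns).
    return cost_d.max(dim=1)[0].sum() + cost_i.max(dim=0)[0].sum()
```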
3.2 Proposed LSEH Loss Function
The proposed Semantically-Enhanced Hard negatives Loss function (LSEH) is an improved version of LMH, and it is defined in Eq. (3):

$$L_{SEH}(v_i, u_i) = \max_{\hat{u}_i} \left[\alpha + \left(s(v_i, \hat{u}_i) + f(v_i, \hat{u}_i)\right) - s(v_i, u_i)\right]_+ + \max_{\hat{v}_i} \left[\alpha + \left(s(u_i, \hat{v}_i) + f(u_i, \hat{v}_i)\right) - s(v_i, u_i)\right]_+ \quad (3)$$

LSEH introduces two sets of semantic factors, $f(v_i, \hat{u}_i)$ and $f(u_i, \hat{v}_i)$, for the image–description embedding pairs $(v_i, \hat{u}_i)$ and the description–image embedding pairs $(u_i, \hat{v}_i)$ respectively, and $f(v_i, \hat{u}_i)$ and $f(u_i, \hat{v}_i)$ can be obtained via Eq. (4):

$$f(v_i, \hat{u}_i) = \lambda \times S(v_i, \hat{u}_i), \qquad f(u_i, \hat{v}_i) = \lambda \times S(u_i, \hat{v}_i) \quad (4)$$

where $\lambda$ serves as a temperature hyperparameter, $S(v_i, \hat{u}_i)$ denotes the optimal semantic similarity scores of the irrelevant image–description embedding pairs $(v_i, \hat{u}_i)$, and $S(u_i, \hat{v}_i)$ denotes the optimal semantic similarity scores of the irrelevant description–image embedding pairs $(u_i, \hat{v}_i)$. The question, therefore, is how to compute the semantic factors $f(v_i, \hat{u}_i)$ and $f(u_i, \hat{v}_i)$.
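A minimal sketch of Eqs. (3) and (4), under the same batch-wise assumptions as the sketches above; here sem is assumed to be an (n, n) matrix of precomputed optimal semantic similarity scores S for the batch (obtained from the reduced description vectors described in the remainder of this section), and the default value of the temperature lam is illustrative rather than taken from the text.

```python
import torch

def lseh_loss(sim: torch.Tensor, sem: torch.Tensor,
              alpha: float = 0.2, lam: float = 0.1) -> torch.Tensor:
    """Semantically-Enhanced Hard negatives loss (LSEH) sketch of Eq. (3).

    sim: (n, n) embedding similarities produced by the VSE network being trained.
    sem: (n, n) precomputed optimal semantic similarities S between the paired items.
    """
    n = sim.size(0)
    pos = sim.diag().view(n, 1)
    f = lam * sem                                        # semantic factors, Eq. (4)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    cost_d = (alpha + sim + f - pos).clamp(min=0).masked_fill(mask, 0)
    cost_i = (alpha + sim + f - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_d.max(dim=1)[0].sum() + cost_i.max(dim=0)[0].sum()
```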
The semantic factors are computed by finding the underlying meaning of descriptions using SVD. After pre-processing [30], the description set $\{D_i\,|\,i = 1 \dots n\}$ is converted to a matrix $A$ of size $n \times w$, where $n$ is the number of descriptions, $w$ is the total number of unique terms found in the set of descriptions, and each $i$th row of $A$ corresponds to the $i$th description $D_i$. Then truncated SVD is applied as shown in Eq. (5) [31]:

$$A_{n \times w} \approx U_{n \times k}\,\Lambda_{k \times k}\,V^{T}_{k \times w}, \qquad B_{n \times k} = A_{n \times w}\,V_{w \times k} \quad (5)$$

where $k$ is the number of singular values. The reduced matrix $B$, containing $n$ rows of $k$-dimensional description vectors, is obtained by multiplying the original description matrix $A_{n \times w}$ with the matrix $V_{w \times k}$.
Let the set $C = \{D'_i\,|\,i = 1 \dots n\}$ contain the reduced description vectors derived from matrix $B_{n \times k}$, where each $i$th element $D'_i$ is the $i$th row vector of $B$; therefore $D'_i$ represents the extracted semantics of the $i$th description $D_i$.
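To illustrate Eq. (5) and the construction of set C, the following is a minimal sketch using scikit-learn; the TF-IDF weighting used to build matrix A, the example descriptions, and the value of k are assumptions made for illustration rather than specifications from the text. The pairwise cosine similarities between the rows of B are one natural way to obtain the semantic similarity scores consumed by the LSEH sketch above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [                                   # hypothetical example descriptions
    "a man rides a bicycle down the street",
    "a person cycling along a city road",
    "two dogs play with a ball on the grass",
]

# Build the n x w description matrix A (TF-IDF weighting is an assumption).
A = TfidfVectorizer().fit_transform(descriptions)

# Truncated SVD, Eq. (5): keep k singular values and obtain B = A V (n x k).
k = 2
B = TruncatedSVD(n_components=k, random_state=0).fit_transform(A)

# Each row of B is a reduced description vector D'_i; their pairwise cosine
# similarities give the semantic similarity scores between descriptions.
S = cosine_similarity(B)
print(np.round(S, 3))
```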
Also, as shown in Fig. 2, description $D_i$ is relevant to image $I_i$, and the embeddings $u_i$ and $v_i$ are output from $D_i$ and $I_i$ respectively; hence $D'_i$ can also simultaneously represent the semantics of $I_i$, $v_i$, and $u_i$.
Figure 2: Based on the joint relations of $D_i$ with $I_i$, $v_i$, and $u_i$, $D'_i$ can simultaneously represent the semantics of $I_i$, $D_i$, $v_i$, and $u_i$.
Therefore, let $\hat{D}'_i = \{D'_j\,|\,j = 1 \dots n, j \neq i\}$ denote the set of reduced description vectors from set $C$, where each $j$th vector $D'_j$ simultaneously represents the optimal semantics of $v_j$ and $u_j$ from sets $\hat{v}_i$ and $\hat{u}_i$ respectively; then $S(v_i, \hat{u}_i)$