cross-modal information retrieval [16]. However, compared to traditional VSE networks, vision transformer-based cross-modal retrieval networks require a large amount of training data, and the time they need to retrieve the results of a query makes them unsuitable for real-world applications [1]. Hashing-based networks are another active solution for cross-modal information retrieval [17]. For example, Liu et al. [18] were the first to propose a hashing framework that learns hash codes of varying lengths for comparing images and descriptions, and the learned modality-specific hash codes contain richer semantics. Hashing-based networks are concerned with reducing data storage costs and improving retrieval speed. Such networks are out of scope for this paper, whose focus is on VSE networks, which mostly aim to exploit the local information alignment between images and descriptions for improved retrieval performance.
Loss Functions for Cross-modal Information Retrieval.
One of the earliest and most widely used cross-modal information retrieval loss functions is the Sum of Hinges Loss (LSH) [19]. LSH is also known as a negatives loss function: it learns a fixed margin between the similarities of the relevant image–description embedding pairs and those of the irrelevant embedding pairs. A more recent hard negatives loss function, the Max of Hinges Loss (LMH) [3], is adopted by most recent VSE networks owing to its ability to outperform LSH [20, 21]. An improved version of LSH, LMH focuses only on the hard negatives, i.e. the irrelevant image–description embedding pairs that are nearest to the relevant pairs. Song et al. [9] presented a margin-adaptive triplet loss for cross-modal information retrieval that uses a hashing-based method to embed the image and text into a low-dimensional Hamming space. Liu et al. [22] applied a variant of the triplet loss in their novel VSE network for cross-modal information retrieval, where the input text embedding of the loss is replaced by the network's reconstructed image embedding. Recently, Wei et al. [10] proposed a polynomial-based [23] Universal Weighting Metric Loss (LUWM) with flexible objective margins, which has been shown to outperform existing hard negatives loss functions.
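As an illustration of the hard negatives objective described above, a minimal PyTorch-style sketch of LMH over a mini-batch is given below; the cosine similarity function, margin value, and function name are assumptions made for illustration and are not taken from any of the cited implementations.

    import torch

    def lmh_loss(v, u, margin=0.2):
        # Max of Hinges Loss (LMH): for each relevant pair, only the hardest
        # (most similar) irrelevant description and irrelevant image are penalised.
        # v, u: (n, d) L2-normalised image and description embeddings.
        s = v @ u.t()                               # s[i, j] = cosine similarity of v_i and u_j
        pos = s.diag().view(-1, 1)                  # similarities of the relevant pairs
        mask = torch.eye(s.size(0), dtype=torch.bool, device=s.device)
        cost_u = (margin + s - pos).clamp(min=0).masked_fill(mask, 0)      # image vs. irrelevant descriptions
        cost_v = (margin + s - pos.t()).clamp(min=0).masked_fill(mask, 0)  # description vs. irrelevant images
        return cost_u.max(dim=1)[0].sum() + cost_v.max(dim=0)[0].sum()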
A summary of the limitations of the existing loss functions is as follows. (1) The learning objectives of LSH [19] and LMH [3] are not flexible because of their fixed margins. (2) The adaptive margin in [9] is not optimal, because it relies on similarities between irrelevant image–description embedding pairs computed by the network that is still being optimised. (3) The modified ranking loss of [22] cannot be integrated into other networks. (4) LUWM [10] does not consider the optimal semantic information from irrelevant image–description pairs.
VSE Networks with a Negatives Loss Function.
LMH was proposed by Faghri et al. [3] and has since been adopted by other VSE networks. To improve the attention mechanism, Lee et al. [24] proposed an approach that aligns image region features with the keywords of the relevant image–description pair, and Diao et al. [4] built an architecture that deeply extends the attention mechanisms of image-image, text-text, and image-text tasks. To extract high-level semantics, Li et al. [13] utilised a GCN [14] to explore the relations between image objects. To aggregate the image and description embeddings, Chen et al. [8] proposed a special pooling operator. Wang et al. [25] proposed an end-to-end VSE network that does not rely on a pre-trained CNN for image feature extraction.
Methods for Finding the Underlying Meaning of Descriptions.
BERT [26] is a supervised and widely used deep neural network for NLP tasks [27]. Singular Value Decomposition (SVD) [28] is an unsupervised matrix decomposition method and an established approach in NLP and information retrieval [29]. Both BERT and SVD have dimensionality reduction capabilities that enable them to find the underlying semantic similarity between texts (e.g. sentences, image captions, documents).
3 Proposed Semantically-Enhanced Hard Negatives Loss Function
Let $X = \{(I_i, D_i) \mid i = 1 \dots n\}$ denote a training set containing paired images and descriptions, where each image $I_i$ corresponds to its relevant description $D_i$; $i$ is the index and $n$ is the size of set $X$. Let $X' = \{(v_i, u_i) \mid i = 1 \dots n\}$ be a set of image–description embedding pairs output by a VSE network, where each $i$-th relevant pair consists of an image embedding $v_i$ and its relevant description embedding $u_i$. Let $\hat{v}_i = \{v_j \mid j = 1 \dots n,\, j \neq i\}$ denote the set of all image embeddings from $X'$ that are irrelevant to $u_i$, and $\hat{u}_i = \{u_j \mid j = 1 \dots n,\, j \neq i\}$ denote the set of all description embeddings from $X'$ that are irrelevant to $v_i$. LSH, LMH, and the proposed approach and loss function used during the training of VSE networks are computed as follows.
3.1 Related Methods and Notation
LSH Description. The basic negatives loss function, LSH, is shown in Eq. (1):
$$L_{SH}(v_i, u_i) = \sum_{\hat{u}_i} \big[ \alpha + s(v_i, \hat{u}_i) - s(v_i, u_i) \big]_+ + \sum_{\hat{v}_i} \big[ \alpha + s(u_i, \hat{v}_i) - s(v_i, u_i) \big]_+ \quad (1)$$