ing the distance between a target and the hardest
in-batch negative sample. In practice, however,
it is often difficult for the model to find a good
negative sample in the early stages of training (as
instances are randomly distributed in space), result-
ing in slow convergence (see Appendix A.2). To
improve optimization, we propose an adaptive op-
timization objective that selects multiple in-batch
negative samples based on model quality during
training. The intuition is that in the early stages of
training we want to sample more negative samples,
and in the later stages fewer negative samples.
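This schedule can be illustrated with a minimal sketch (our own notation, not the paper's exact formulation; `model_quality` is a hypothetical proxy in [0, 1], e.g. a running validation recall, and `num_negatives` is a name we introduce for illustration):

```python
import math

def num_negatives(batch_size, model_quality):
    """Choose how many in-batch negatives to keep.

    Early in training (model_quality near 0) we keep many
    negatives; as the model improves (quality near 1) we keep
    only the hardest few, recovering the standard hardest-negative
    objective in the limit.
    """
    max_k = batch_size - 1  # every other item in the batch is a candidate
    k = max(1, math.ceil((1.0 - model_quality) * max_k))
    return k
```

For a batch of 128, an untrained model (quality 0.0) uses all 127 in-batch negatives, while a well-trained model (quality 0.9) uses only the 13 hardest.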
Over two public datasets, MS-COCO (Lin et al.,
2014) and Flickr30K (Young et al., 2014), we
show that a standard VSE model using our pro-
posed feature aggregation and optimization strate-
gies outperforms benchmark models substantially.
In particular, our method obtains 1.4% relative
gains on RSUM for MS-COCO and 1.0% for
Flickr30K. Compared with a pre-trained vision-
language model of similar performance (Geigle
et al., 2022), our method is 4.3× faster.
2 Related Work
Depending on whether the image and text features
have any form of cross-modal interaction before
similarity calculation, existing image-text retrieval
methods can be broadly categorized into two types.
The visual semantic embedding (VSE) (Faghri
et al., 2018; Wang et al., 2020; Chun et al., 2021)
methods process the multimodal instances inde-
pendently before projecting them into a joint em-
bedding space for similarity matching. Wang et al.
(2018) design a two-branch neural network, LIWE
(Wehrmann et al., 2019) considers character-based
alignment and embedding methods for the language
encoder, and Faghri et al. (2018) extend these by
using a hard triplet loss for optimization. Following
these ideas, PVSE (Song and Soleymani, 2019)
and CVSE (Wang et al., 2020) are proposed to
consider intra-modal polysemous and consensus
information. Recently, Chun et al. (2021) model
instances as probabilistic distributions and achieve
further improvement. These VSE-based methods
are fast as they do not consider cross-modal inter-
action and as such the visual and text features can
be pre-computed. The non-VSE methods concen-
trate on the interaction of modalities. Specifically,
late-interaction methods explore fusing multi-
modal information via attention (Lee et al., 2018;
Chen et al., 2020), alignment (Zhang et al., 2020),
multi-view representation (Qu et al., 2020) and
fine-grained reasoning (Qu et al., 2021). Early-
interaction methods (Geigle et al., 2022), such as
pre-trained vision-language models (Lu et al., 2019;
Chen et al., 2019; Li et al., 2020; Jia et al., 2021;
Li et al., 2022), focus on maximizing perfor-
mance while sacrificing efficiency.
Our paper focuses on improving feature
aggregation and optimization for VSE. Existing
explorations of these two steps are as follows.
The performance of VSE ultimately depends on
the quality of the joint embedding space, which is
usually learned with simple transformations (e.g.
linear projection or multi-layer perceptron) and
pooling aggregators (e.g. mean pooling (Faghri
et al., 2018; Qu et al., 2020), max pooling (Zhang
and Lu, 2018; Li et al., 2021), or a combination of
them (Lee et al., 2018)). Compared to these simple
aggregation methods, more complex aggregators
that introduce a large number of trainable param-
eters have also been explored, e.g. inter-modal at-
tention (Wehrmann et al., 2020) and self-attention
mechanisms (Han et al., 2021). Zhang et al. (2021)
design a cross-modal guided pooling module that
attends to local information dynamically. These
sophisticated aggregators typically require more
computation time, and do not always outperform
simple pooling strategies. Perhaps the closest study
to our work is GPO (VSE∞) (Chen et al., 2021),
which builds a generalized operator to learn the
best pooling strategy, considering only the position
information of the extracted features.
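The simple pooling aggregators above can be sketched as follows (a minimal pure-Python illustration over a variable-length set of local feature vectors, not any paper's implementation; in practice these reductions run over tensors):

```python
def mean_pool(features):
    """Element-wise mean over a set of region/token feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]

def max_pool(features):
    """Element-wise max over the same feature set."""
    dim = len(features[0])
    return [max(f[d] for f in features) for d in range(dim)]
```

Both are parameter-free, which is why they remain competitive with learned aggregators despite their simplicity.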
Some studies focus on improving the optimiza-
tion objective, and the most widely adopted ob-
jective is the hinge-based hard triplet ranking loss
(Faghri et al., 2018; Wei et al., 2020b; Messina
et al., 2021), which dynamically selects the “hard-
est” negative sample within a mini-batch. Other
studies explore solutions that choose multiple neg-
ative samples. Zhou et al. (2020) introduce a co-
herence metric to rank the “irrelevant” candidates.
Extending the idea, Wei et al. (2020a) assign dif-
ferent weights for positive and negative pairs. To
tackle the issue of noisy labels which impacts mul-
timodal representation, Hu et al. (2021) propose
maximizing the mutual information between differ-
ent modalities. Huang et al. (2021) separate data
into “clean” and “noisy” partitions by co-teaching.
However, the above methods do not adapt their
negative-sample selection to the model’s perfor-
mance as training progresses.
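The hinge-based hard triplet ranking loss discussed above can be sketched as follows (a simplified pure-Python version over a precomputed similarity matrix; real implementations operate on batched tensors with gradients):

```python
def hardest_triplet_loss(sim, margin=0.2):
    """Hinge-based triplet loss with the hardest in-batch negative.

    sim[i][j] is the similarity between image i and caption j;
    diagonal entries are the matched (positive) pairs. Each positive
    pair is hinged against the single hardest negative caption for
    the image and the single hardest negative image for the caption.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_cap = max(sim[i][j] for j in range(n) if j != i)
        hardest_img = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hardest_cap)
        total += max(0.0, margin - pos + hardest_img)
    return total / n
```

The loss is zero whenever every positive pair beats its hardest in-batch negative by at least the margin, which is why a randomly initialized model (where negatives are rarely "hard") yields weak gradients early in training.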