Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective
Zijian Zhang1, Chang Shu2,3, Ya Xiao1, Yuan Shen1, Di Zhu1, Jing Xiao2, Youxin Chen2, Jey Han Lau4, Qian Zhang3 and Zheng Lu3
1Meituan, China
2Ping An Technology (Shenzhen) Co., Ltd, China
3University of Nottingham Ningbo, China
4The University of Melbourne, Australia

These authors contributed equally to this work.
Abstract
Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (by at least 1.0% on the recall metrics) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.
1 Introduction
Visual Semantic Embedding (VSE) (Frome et al., 2013; Faghri et al., 2018) is a representation learning method that embeds images and texts for efficient cross-modal retrieval, and typically has the following steps (see Figure 1 for an illustration). The image and text are first encoded into features by separate visual and text encoders. These features are then projected into a joint embedding space and pooled to form fixed-length vectors. Finally, a similarity calculation measures the distance between instances, and a suitable target is chosen for optimization. Our paper focuses on improving the steps of feature aggregation and optimization.
Figure 1: Illustration of VSE (inputs: image and text → feature extraction → feature aggregation → similarity calculation → optimization target).
For feature aggregation, the most commonly used methods are simple pooling aggregators. MaxPool (Wang et al., 2018) and MeanPool (Reimers and Gurevych, 2019) are designed to detect the salient and mean points of features, and K-MaxPool (Kalchbrenner et al., 2014) extracts the mean of the top-K features. More complex aggregation techniques have also been proposed, e.g. local-importance projection (Gao et al., 2019), sequence-to-sequence encoders (Hu et al., 2019), graph convolution networks (Li et al., 2019), exponential adaptive pooling (Stergiou and Poppe, 2021) and self-attention encoders (Wang et al., 2020). However, we found that carefully selected pooling functions can surpass complex methods (see Appendix A.1). Motivated by this, our paper proposes an approach that can automatically discover the best pooling functions. Specifically, we seek to improve the feature aggregation step by proposing a formulation that parameterizes the different pooling strategies and allows the model to learn the best configuration automatically via its objective, alleviating the need for manual tuning. In other words, we turn these hyper-parameters (i.e. the choices of pooling functions) into parameters of the model.
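To make this concrete, below is a minimal PyTorch sketch of one way such a parameterized pooling combination could be implemented. The module name `AdaptivePool`, the softmax weighting, and the fixed candidate set are our assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class AdaptivePool(nn.Module):
    """Illustrative sketch: pool (batch, num_items, dim) features with a
    learnable softmax-weighted mix of MeanPool, MaxPool and K-MaxPool,
    so the choice of pooling becomes a trained parameter."""

    def __init__(self, k: int = 5):
        super().__init__()
        self.k = k
        self.logits = nn.Parameter(torch.zeros(3))  # one logit per candidate pool

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        mean_pool = feats.mean(dim=1)
        max_pool = feats.max(dim=1).values
        k = min(self.k, feats.size(1))
        kmax_pool = feats.topk(k, dim=1).values.mean(dim=1)  # mean of top-k
        w = torch.softmax(self.logits, dim=0)
        return w[0] * mean_pool + w[1] * max_pool + w[2] * kmax_pool
```

Because the mixing weights sit inside the model, they are updated by the same retrieval objective as everything else; this is the sense in which a pooling hyper-parameter becomes a model parameter.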
For optimization, most VSE models use the hinge triplet ranking loss with in-batch negative examples (Faghri et al., 2018). The intuition of the objective is to encourage positive pairs to be embedded in a similar space while widening the distance between a target and the hardest in-batch negative sample. In practice, however, it is often difficult for the model to find a good negative sample in the early stages of training (as instances are randomly distributed in space), resulting in slow convergence (see Appendix A.2). To improve optimization, we propose an adaptive optimization objective that selects multiple in-batch negative samples based on model quality during training. The intuition is that in the early stages of training we want to sample more negative samples, and in the later stages fewer negative samples.
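The following sketch illustrates this idea under stated assumptions: `num_negatives` and `select_negatives` are hypothetical names, and the linear epoch-based schedule is a stand-in for the paper's model-quality-based criterion:

```python
import torch

def num_negatives(epoch: int, max_neg: int = 16, min_neg: int = 1,
                  decay_epochs: int = 10) -> int:
    # Assumed schedule: many negatives early in training, few late.
    frac = min(epoch / decay_epochs, 1.0)
    return max(min_neg, round(max_neg - frac * (max_neg - min_neg)))

def select_negatives(sim: torch.Tensor, n_neg: int) -> torch.Tensor:
    # sim: (B, B) in-batch similarity matrix, positive pairs on the diagonal.
    b = sim.size(0)
    mask = torch.eye(b, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))          # exclude positives
    return neg.topk(min(n_neg, b - 1), dim=1).values    # hardest n_neg per anchor
```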
Over two public datasets, MS-COCO (Lin et al., 2014) and Flickr30K (Young et al., 2014), we show that a standard VSE model using our proposed feature aggregation and optimization strategies outperforms benchmark models substantially. In particular, our method obtains 1.4% relative gains on RSUM for MS-COCO and 1.0% for Flickr30K. Compared with a pre-trained vision-language model of similar performance (Geigle et al., 2022), our method is 4.3× faster.
2 Related Work
Depending on whether the image and text features have any form of cross-modal interaction before similarity calculation, existing image-text retrieval methods can be broadly categorized into two types.
The visual semantic embedding (VSE) methods (Faghri et al., 2018; Wang et al., 2020; Chun et al., 2021) process the multimodal instances independently before projecting them into a joint embedding space for similarity matching. Wang et al. (2018) design a two-branch neural network, LIWE (Wehrmann et al., 2019) considers character-based alignment and embedding methods for the language encoder, and Faghri et al. (2018) extend these approaches by using hard triplet loss for optimization. Following these ideas, PVSE (Song and Soleymani, 2019) and CVSE (Wang et al., 2020) are proposed to consider intra-modal polysemous and consensus information. Recently, Chun et al. (2021) sample instances as probabilistic distributions and achieve further improvement. These VSE-based methods are fast as they do not consider cross-modal interaction, and as such the visual and text features can be pre-computed. The non-VSE methods concentrate on the interaction of modalities. Specifically, late-interaction methods explore fusing multimodal information via attention (Lee et al., 2018; Chen et al., 2020), alignment (Zhang et al., 2020), multi-view representation (Qu et al., 2020) and fine-grained reasoning (Qu et al., 2021). The early-interaction methods (Geigle et al., 2022), like pre-trained vision-language models (Lu et al., 2019; Chen et al., 2019; Li et al., 2020; Jia et al., 2021; Li et al., 2022), focus on maximizing performance while sacrificing efficiency.
Our paper focuses on improving feature aggregation and optimization for VSE. Existing explorations of these two steps are reviewed below.
The performance of VSE ultimately depends on the quality of the joint embedding space, which is usually learned with simple transformations (e.g. linear projection or multi-layer perceptron) and pooling aggregators (e.g. mean pooling (Faghri et al., 2018; Qu et al., 2020), max pooling (Zhang and Lu, 2018; Li et al., 2021), or a combination of them (Lee et al., 2018)). Compared to these simple aggregation methods, more complex aggregators that introduce a large number of trainable parameters have also been explored, e.g. inter-modal attention (Wehrmann et al., 2020) and self-attention mechanisms (Han et al., 2021). Zhang et al. (2021) design a cross-modal guided pooling module that attends to local information dynamically. These sophisticated aggregators typically require more time, and don't always outperform simple pooling strategies. Perhaps the closest study to our work is GPO (VSE∞) (Chen et al., 2021), which builds a generalized operator to learn the best pooling strategy, but considers only the position information of the extracted features.
Some studies focus on improving the optimization objective. The most widely adopted objective is the hinge-based hard triplet ranking loss (Faghri et al., 2018; Wei et al., 2020b; Messina et al., 2021), which dynamically selects the "hardest" negative sample within a mini-batch. Other studies explore solutions that choose multiple negative samples. Zhou et al. (2020) introduce a coherence metric to rank the "irrelevant" candidates. Extending this idea, Wei et al. (2020a) assign different weights to positive and negative pairs. To tackle the issue of noisy labels, which impact multimodal representation, Hu et al. (2021) propose maximizing the mutual information between different modalities, and Huang et al. (2021) separate data into "clean" and "noisy" partitions by co-teaching. However, the above methods do not adapt to model performance when selecting negative samples.
Figure 2: The framework of VSE (inputs → feature extraction, with a ConvNet, object detector, or ViT producing grid/region/patch features for the image and a sequence model or PLM producing token features for the text, e.g. "All three explosions being audible within the stadium" → feature aggregation with MaxPool/MeanPool/K-MaxPool, then normalization → cosine similarity in the joint embedding space → optimization target). The visual and text encoders process the image and text separately at first. The related images and sentences are then directed to a similar space using an appropriate optimization target.
3 Methodology
3.1 Background of VSE
We first discuss the standard formulation of VSE, before introducing our innovations for improving feature aggregation (Section 3.2) and optimization (Section 3.3).
To compute the similarity of a given multimodal instance (image & text), a VSE model (Figure 2) separately encodes them via a visual encoder ($\mathrm{VisEnc}(\cdot)$) and a text encoder ($\mathrm{TextEnc}(\cdot)$). There are three widely used visual features produced by different visual encoders: grid features are the feature maps from convolutional networks (CNNs; He et al. (2016)), region features are the region-of-interest features from object detectors (Anderson et al., 2018), and patch features are the partitions from vision transformers (Dosovitskiy et al., 2021). The text encoders are usually RNNs (Sutskever et al., 2014) or BERT (Devlin et al., 2019). Formally:

$$F_v = \mathrm{VisEnc}(v), \qquad F_t = \mathrm{TextEnc}(t)$$

where $v$ and $t$ are the input image and text.
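As a rough illustration of the grid variant (our own sketch, not the paper's extractor), one can expose the final ResNet-50 feature map as $N = H \times W$ object vectors of dimension $d_1 = 2048$:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class GridVisEnc(nn.Module):
    """Hypothetical grid-feature extractor: drop the classifier head of a
    ResNet-50 and flatten its spatial feature map into object vectors."""

    def __init__(self):
        super().__init__()
        backbone = resnet50()  # weight loading omitted for brevity
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # no avgpool/fc

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        fmap = self.body(images)                # (B, 2048, H, W)
        return fmap.flatten(2).transpose(1, 2)  # (B, N = H*W, 2048)
```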
Assuming the visual feature $F_v$ has $N$ object vectors (represented either as grids, regions or patches) of dimension $d_1$, and the text feature $F_t$ has $M$ token vectors of dimension $d_2$, we next project them into the same $d$-dimensional space:

$$\{v_n\}_{n=1}^{N} = F_v W_v + b_v, \qquad \{t_m\}_{m=1}^{M} = F_t W_t + b_t \tag{1}$$

where $v_n$ and $t_m$ now have the same dimension $d$.
To aggregate the extracted features into fixed-length vectors, domain aggregators $f_{\mathrm{vision}}(\cdot)$ and $f_{\mathrm{text}}(\cdot)$ are used to transform $\{v_n\}_{n=1}^{N} \in \mathbb{R}^{N \times d}$ and $\{t_m\}_{m=1}^{M} \in \mathbb{R}^{M \times d}$ into $\mathbf{v} \in \mathbb{R}^{d}$ and $\mathbf{t} \in \mathbb{R}^{d}$, respectively:

$$\mathbf{v} = f_{\mathrm{vision}}\left(\{v_n\}_{n=1}^{N}\right), \qquad \mathbf{t} = f_{\mathrm{text}}\left(\{t_m\}_{m=1}^{M}\right)$$
And lastly, to measure how related the inputs are, we use cosine similarity:

$$\mathrm{sim}(\mathbf{t}, \mathbf{v}) = \frac{\mathbf{t}^{\top}\mathbf{v}}{\|\mathbf{t}\| \cdot \|\mathbf{v}\|}$$
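Putting Eq. 1, the aggregators, and the cosine similarity together, a bare-bones sketch looks as follows; mean pooling stands in for $f_{\mathrm{vision}}$/$f_{\mathrm{text}}$, and `VSEHead` is a name we introduce:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSEHead(nn.Module):
    """Project precomputed encoder features into the joint d-dim space,
    aggregate, and score all text-image pairs in the batch."""

    def __init__(self, d1: int, d2: int, d: int):
        super().__init__()
        self.proj_v = nn.Linear(d1, d)  # W_v, b_v in Eq. 1
        self.proj_t = nn.Linear(d2, d)  # W_t, b_t in Eq. 1

    def forward(self, F_v: torch.Tensor, F_t: torch.Tensor) -> torch.Tensor:
        # F_v: (B, N, d1) visual features; F_t: (B, M, d2) token features.
        v = self.proj_v(F_v).mean(dim=1)  # aggregate to (B, d)
        t = self.proj_t(F_t).mean(dim=1)  # aggregate to (B, d)
        # Cosine similarity between every text and every image in the batch.
        return F.normalize(t, dim=-1) @ F.normalize(v, dim=-1).T
```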
Existing optimization strategies generally use the hinge-based triplet ranking loss to optimize the VSE model. Given an anchor, it aims to maximize its similarity with positive samples while minimizing its similarity with the most "difficult" negative sample in the mini-batch (i.e. the example that has the highest similarity with the anchor but is not a positive example), and includes both text-to-image and image-to-text retrieval objectives:

$$\mathcal{L}_{\mathrm{HardTriplet}} = \sum_{(\mathbf{t},\mathbf{v},\hat{\mathbf{t}},\hat{\mathbf{v}}) \sim B} \left[\alpha - \mathrm{sim}(\mathbf{t},\mathbf{v}) + \mathrm{sim}(\mathbf{t},\hat{\mathbf{v}})\right]_{+} + \left[\alpha - \mathrm{sim}(\mathbf{t},\mathbf{v}) + \mathrm{sim}(\hat{\mathbf{t}},\mathbf{v})\right]_{+} \tag{2}$$

where $\alpha$ is the margin hyper-parameter and $[x]_{+} = \max(0, x)$. $(\mathbf{t}, \mathbf{v})$ is a positive text-image pair in mini-batch $B$, and $(\hat{\mathbf{t}}, \mathbf{v})$ and $(\mathbf{t}, \hat{\mathbf{v}})$ are negative pairs, where $\hat{\mathbf{t}} = \mathrm{argmax}_{\mathbf{t}' \neq \mathbf{t}}\, \mathrm{sim}(\mathbf{t}', \mathbf{v})$ and $\hat{\mathbf{v}} = \mathrm{argmax}_{\mathbf{v}' \neq \mathbf{v}}\, \mathrm{sim}(\mathbf{t}, \mathbf{v}')$ are the hardest negative sentence and image respectively in $B$.
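For reference, a compact PyTorch rendering of Eq. 2 given the (B, B) similarity matrix from the sketch above (our sketch; `hard_triplet_loss` and the sum reduction are assumptions):

```python
import torch

def hard_triplet_loss(sim: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    # sim[i, j] = sim(t_i, v_j); positive pairs lie on the diagonal.
    b = sim.size(0)
    pos = sim.diag()                                   # sim(t, v)
    mask = torch.eye(b, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))         # exclude positives
    hardest_v = neg.max(dim=1).values                  # sim(t, v_hat) per text
    hardest_t = neg.max(dim=0).values                  # sim(t_hat, v) per image
    loss = torch.clamp(alpha - pos + hardest_v, min=0) \
         + torch.clamp(alpha - pos + hardest_t, min=0)
    return loss.sum()
```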