Prophet Attention: Predicting Attention with
Future Attention for Image Captioning
Fenglin Liu1, Xuancheng Ren2, Xian Wu3, Wei Fan3, Yuexian Zou1,4, Xu Sun2,5
1ADSPLAB, School of ECE, Peking University
2MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University
3Tencent, Beijing, China 4Peng Cheng Laboratory, Shenzhen, China
5Center for Data Science, Peking University
{fenglinliu98, renxc, zouyx, xusun}@pku.edu.cn
xuewei.ma12@gmail.com, {kevinxwu, Davidwfan}@tencent.com
Abstract
Recently, attention based models have been used extensively in many sequence-to-
sequence learning systems. Especially for image captioning, the attention based
models are expected to ground correct image regions with proper generated words.
However, for each time step in the decoding process, the attention based models
usually use the hidden state of the current input to attend to the image regions.
Under this setting, these attention models suffer from a “deviated focus” problem: they calculate the attention weights based on previous words instead of the one to be generated, impairing the performance of both grounding and captioning. In this paper, we propose Prophet Attention, which works in a manner similar to self-supervision. In
the training stage, this module utilizes the future information to calculate the “ideal”
attention weights towards image regions. These calculated “ideal” weights are
further used to regularize the “deviated” attention. In this manner, image regions
are grounded with the correct words. The proposed Prophet Attention can be easily
incorporated into existing image captioning models to improve their performance
of both grounding and captioning. The experiments on the Flickr30k Entities
and the MSCOCO datasets show that the proposed Prophet Attention consistently
outperforms baselines in both automatic metrics and human evaluations. It is worth noting that we set new state-of-the-art results on the two benchmark datasets and achieve 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40.
1 Introduction
The task of image captioning [7] aims to generate a textual description for an input image and has received extensive research interest. Recently, the attention-enhanced encoder-decoder framework [2, 17, 20, 31, 44, 60] has achieved great success in advancing the state of the art. Specifically, such models use a Faster-RCNN [2, 51] to acquire region-based visual representations and an RNN [14, 18] to generate coherent captions, where the attention model [3, 38, 55, 59] guides the decoding process by attending the hidden state to the image regions at each time step. Many sequence-to-sequence learning systems, including machine translation [3, 55] and text summarization [64], have proven the importance of the attention mechanism in generating meaningful sentences. Especially for image captioning, the attention model can ground the salient image regions to generate the next word in the sentence [2, 26, 38, 59].
The current attention model attends to image regions based on the current hidden state [55, 59], which contains the information of past generated words.
[Figure 1: the caption “a woman holding a yellow umbrella wearing a yellow coat in the rain” is generated word by word from <bos> to <eos>; at each step the figure shows the input word, the output word, and the attended image region.]
Figure 1: Illustration of the sequence of the attended image regions from a state-of-the-art system [20] in generating each word of a complete image description. At each time step, only the top-1 attended image region is shown [65]. As we can see, the attended image regions are grounded more on the input words than on the output words, such as at the time steps where yellow and umbrella are the inputs, demonstrating the poor grounding accuracy of the current attention model.
As a result, the attention model has to predict attention weights without knowing the word it should ground. Figure 1 illustrates a generated caption and the attended image regions from a state-of-the-art captioning system [20]. As we can see, the attended image regions are grounded more on the current input word than on the output one. For example, at the time step to generate the 5th word yellow, the attended image region is the woman instead of the umbrella. As a result, the incorrect adjective yellow is generated rather than the correct adjective red. This is mainly because the “focus” of the attention is “deviated” several steps backwards and the conditioned words are woman and holding. Another example is at the time step to generate the 7th word wearing, where the attended image region should be the woman instead of the umbrella. Although the generated word is correct, the unfavorable attended image region impairs the grounding performance [65] and ruins the model interpretability, because the attended image region often serves as a visual interpretation for qualitatively assessing the captioning model [9, 11, 39, 54, 65].
In this paper, to address the “deviated focus” issue of current attention models, we propose the novel
Prophet Attention to ground the image regions with proper generated words in a manner similar
to self-supervision. As shown in Figure 2, in the training stage, for each time step in the decoding
process, we first employ the words that will be generated in the future to calculate the “ideal” attention weights towards image regions. These calculated “ideal” attention weights are then used to guide the attention calculation based on the input words that have already been generated (i.e., without the future words to be generated). In other words, the conventional attention model is regularized by the attention weights calculated from the future words. We evaluate the proposed Prophet Attention on two benchmark image captioning datasets. According to both automatic metrics and human evaluations, the captioning models equipped with Prophet Attention outperform the baselines.
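To make this training-time idea concrete before the formal description in Section 2, below is a minimal PyTorch-style sketch under our own simplifying assumptions: the future words are summarized into a single query vector (for instance, by running a BiLSTM over the words still to be generated), and the “ideal” weights regularize the conventional ones through a simple squared-error penalty. It is an illustration of the idea, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def additive_attention(query, V, W_h, W_V, w_a):
    """Additive attention over N region features V (N, d); query is a vector (h,)."""
    scores = torch.tanh(V @ W_V.T + query @ W_h.T) @ w_a   # (N,) attention scores
    return F.softmax(scores, dim=0)                         # attention distribution over regions

def prophet_regularizer(h_t, future_summary, V, W_h, W_V, w_a):
    """Sketch of the training-time idea: the conventional ("deviated") attention
    queries with the decoder state h_t, which only encodes past words, while the
    "ideal" attention queries with a summary of the words still to be generated.
    The ideal weights then serve as a target for the deviated ones."""
    alpha_deviated = additive_attention(h_t, V, W_h, W_V, w_a)
    alpha_ideal = additive_attention(future_summary, V, W_h, W_V, w_a)
    # Assumption: a squared-error penalty between the two distributions; in a full
    # system the ideal branch would itself be trained (e.g., to predict the caption
    # words) so that its weights are actually "ideal".
    return ((alpha_deviated - alpha_ideal.detach()) ** 2).sum()
```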
Overall, the contributions of this work are as follows:
• We propose Prophet Attention to enable attention models to correctly ground the words that are to be generated to the proper image regions. Prophet Attention can be easily incorporated into existing models to improve their performance of both grounding and captioning.
• We evaluate Prophet Attention for image captioning on the Flickr30k Entities and the MSCOCO datasets. The captioning models equipped with Prophet Attention significantly outperform the ones without it. Besides automatic metrics, we also conduct human evaluations to assess Prophet Attention from the user experience perspective. At the time of submission (2 June 2020), we achieved 1st place on the leaderboard of the MSCOCO online server benchmark in terms of the default ranking score (CIDEr-c40).
• In addition to the image captioning task, we also adapt Prophet Attention to other language generation tasks and obtain positive experimental results on paraphrase generation and video captioning.
[Figure 2 diagram: panel (a) Visual Attention consists of word embedding, LSTM, Attention, Linear, and Softmax modules; panel (b) Prophet Attention consists of word embedding, BiLSTM, Attention, Linear, and Softmax modules.]
Figure 2: Illustration of the conventional attention model (left) and our Prophet Attention (right) approach. As we can see, our approach calculates the “ideal” attention weights $\hat{\alpha}_t$ based on the future generated words $y_{i:j}$ ($j \geq t$) as a target for the attention model that is conditioned on the previously generated words.
2 Approach
We first briefly review the conventional attention-enhanced encoder-decoder framework in image
captioning and then describe the proposed Prophet Attention in detail.
2.1 Background: Attention-Enhanced Encoder-Decoder Framework
The conventional attention-enhanced encoder-decoder framework [2, 20, 38] usually consists of a
visual encoder and an attention-enhanced language decoder.
Visual Encoder
The visual encoder represents the image with a set of region-based visual feature vectors $V = \{v_1, v_2, \ldots, v_N\} \in \mathbb{R}^{d \times N}$, where each feature vector represents a certain aspect of the image. The visual features serve as a guide for the language decoder to describe the salient information in the image. In implementation, the Faster-RCNN model [2, 51] is widely adopted as the region-based visual feature extractor, which has achieved great success in advancing the state of the art [17, 20, 60, 61].
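As a concrete illustration of these shape conventions, the short snippet below loads precomputed region features for one image and forms the matrix $V \in \mathbb{R}^{d \times N}$ together with the averaged feature $\bar{v}$ used later by the decoder. The file name and feature sizes are placeholders of our own, not something specified by the paper.

```python
import numpy as np
import torch

# Hypothetical file of precomputed Faster-RCNN region features for one image
# (e.g., extracted offline with a bottom-up-attention style detector).
features = np.load("example_image_regions.npy")   # assumed shape (N, d), e.g., (36, 2048)

V = torch.from_numpy(features).float().T          # (d, N), matching V in R^{d x N}
v_bar = V.mean(dim=1)                             # averaged visual feature \bar{v}, shape (d,)
print(V.shape, v_bar.shape)
```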
Attention-Enhanced Caption Decoder
The left sub-figure in Figure 2 shows the widely-used attention-enhanced LSTM decoder [20, 38]. For each decoding step $t$, the decoder takes the word embedding of the current input word $y_{t-1}$, concatenated with the averaged visual features $\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i$, as input to the LSTM:

$h_t = \mathrm{LSTM}\left(h_{t-1}, [W_e y_{t-1}; \bar{v}]\right)$,  (1)

where $[;]$ denotes the concatenation operation and $W_e$ denotes the learnable word embedding parameters. Next, the output $h_t$ of the LSTM is used as a query to attend to the relevant image regions in the visual feature set $V$ and to generate the attended visual features $c_t$:
$\alpha_t = f_{\mathrm{Att}}(h_t, V) = \mathrm{softmax}\left(w_\alpha \tanh\left(W_h h_t \oplus W_V V\right)\right), \quad c_t = V \alpha_t^{\top}$,  (2)

where $w_\alpha$, $W_h$ and $W_V$ are the learnable parameters. $\oplus$ denotes the matrix-vector addition, which is calculated by adding the vector to each column of the matrix. Finally, $h_t$ and $c_t$ are passed to a linear layer to predict the next word:
$y_t \sim p_t = \mathrm{softmax}\left(W_p [h_t; c_t] + b_p\right)$,  (3)

where $W_p$ and $b_p$ are the learnable parameters. It is worth noting that some works [2, 39, 60, 61] also attempt to append one more LSTM layer to predict the word; please refer to Anderson et al. [2] for details. Finally, given a target ground-truth sequence $y^*_{1:T}$ and a captioning model with parameters $\theta$, the objective is to minimize the following cross-entropy loss:

$\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y^*_t \mid y^*_{1:t-1}\right)$.  (4)

As we can see from Eq. (2), at each timestep $t$, the attention model relies on $h_t$, which contains the past information of the generated caption words $y_{1:t-1}$, to calculate the attention weights $\alpha_t$. Such reliance on past information makes the attended visual features less grounded on the word to be generated at the current timestep, which impairs both the captioning and grounding performance.
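For readers who prefer code, the following is a compact PyTorch sketch of one step of this conventional attention-enhanced decoder (Eqs. (1)-(3)) together with the teacher-forced cross-entropy objective of Eq. (4). The module sizes, the single-layer LSTM cell, and all names are our own simplifications for illustration, not a faithful reimplementation of any cited system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Conventional attention-enhanced LSTM decoder, Eqs. (1)-(3)."""

    def __init__(self, vocab_size, emb_dim=512, hid_dim=512, feat_dim=2048, att_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # W_e
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)    # Eq. (1)
        self.W_h = nn.Linear(hid_dim, att_dim, bias=False)
        self.W_V = nn.Linear(feat_dim, att_dim, bias=False)
        self.w_a = nn.Linear(att_dim, 1, bias=False)
        self.W_p = nn.Linear(hid_dim + feat_dim, vocab_size)    # Eq. (3)

    def step(self, y_prev, state, V):
        """One decoding step.
        y_prev: (B,) previous word ids; V: (B, N, feat_dim) region features."""
        v_bar = V.mean(dim=1)                                    # averaged visual features
        h, c = self.lstm(torch.cat([self.embed(y_prev), v_bar], dim=-1), state)  # Eq. (1)
        # Eq. (2): additive attention, adding W_h h to every region's W_V v_i.
        scores = self.w_a(torch.tanh(self.W_V(V) + self.W_h(h).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                        # (B, N) attention weights
        c_t = (alpha.unsqueeze(-1) * V).sum(dim=1)               # attended visual features
        logits = self.W_p(torch.cat([h, c_t], dim=-1))           # Eq. (3), pre-softmax scores
        return logits, alpha, (h, c)

def cross_entropy_loss(decoder, V, targets):
    """Teacher-forced objective of Eq. (4); targets: (B, T) word ids starting with <bos>.
    Assumes the default hid_dim=512 for the initial LSTM state."""
    B, T = targets.shape
    state = (V.new_zeros(B, 512), V.new_zeros(B, 512))
    loss = 0.0
    for t in range(1, T):
        logits, _, state = decoder.step(targets[:, t - 1], state, V)
        loss = loss + F.cross_entropy(logits, targets[:, t])     # -log p(y*_t | y*_{1:t-1})
    return loss
```

Note how, in this sketch, the attention weights depend only on the state $h_t$ built from past words, which is exactly the “deviated focus” that Prophet Attention is designed to correct.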