Prophet Attention: Predicting Attention with
Future Attention for Image Captioning
Fenglin Liu1, Xuancheng Ren2, Xian Wu3, Wei Fan3, Yuexian Zou1,4, Xu Sun2,5
1ADSPLAB, School of ECE, Peking University
2MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University
3Tencent, Beijing, China 4Peng Cheng Laboratory, Shenzhen, China
5Center for Data Science, Peking University
{fenglinliu98, renxc, zouyx, xusun}@pku.edu.cn
xuewei.ma12@gmail.com, {kevinxwu, Davidwfan}@tencent.com
Abstract
Recently, attention based models have been used extensively in many sequence-to-
sequence learning systems. Especially for image captioning, the attention based
models are expected to ground correct image regions with proper generated words.
However, for each time step in the decoding process, the attention based models
usually use the hidden state of the current input to attend to the image regions.
Under this setting, these attention models suffer from a “deviated focus” problem: they calculate the attention weights based on previous words instead of the one to be generated, impairing the performance of both grounding and captioning. In this paper, we propose Prophet Attention, which works in a manner similar to self-supervision. In
the training stage, this module utilizes the future information to calculate the “ideal”
attention weights towards image regions. These calculated “ideal” weights are
further used to regularize the “deviated” attention. In this manner, image regions
are grounded with the correct words. The proposed Prophet Attention can be easily
incorporated into existing image captioning models to improve their performance
of both grounding and captioning. The experiments on the Flickr30k Entities
and the MSCOCO datasets show that the proposed Prophet Attention consistently
outperforms baselines in both automatic metrics and human evaluations. It is worth noting that we set new state-of-the-art results on the two benchmark datasets and achieve 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40.
1 Introduction
The task of image captioning [7] aims to generate a textual description for an input image and has received extensive research interest. Recently, the attention-enhanced encoder-decoder framework [2, 17, 20, 31, 44, 60] has achieved great success in advancing the state of the art. Specifically, such models use a Faster-RCNN [2, 51] to acquire region-based visual representations and an RNN [14, 18] to generate coherent captions, where the attention model [3, 38, 55, 59] guides the decoding process by attending the hidden state to the image regions at each time step. Many sequence-to-sequence learning systems, including machine translation [3, 55] and text summarization [64], have proven the importance of the attention mechanism in generating meaningful sentences. Especially for image captioning, the attention model can ground the salient image regions to generate the next word in the sentence [2, 26, 38, 59].
The current attention model attends to image regions based on the current hidden state [55, 59], which contains the information of past generated words.
[Figure 1: the caption “a woman holding a yellow umbrella wearing a yellow coat in the rain” is generated word by word from <bos> to <eos>; at each step the figure shows the input word, the output word, and the attended image region.]
Figure 1: Illustration of the sequence of the attended image regions from a state-of-the-art system [20] in generating each word of a complete image description. At each time step, only the top-1 attended image region is shown [65]. As we can see, the attended image regions are grounded more on the input words than on the output words, such as at the time steps where yellow and umbrella are the inputs, demonstrating the poor grounding accuracy of the current attention model.
As a result, the attention model has to predict attention weights without knowing the word it should ground. Figure 1 illustrates a generated caption and the attended image regions from a state-of-the-art captioning system [20]. As we can see, the attended image regions are grounded more on the current input word than on the output one. For example, at the time step to generate the 5th word yellow, the attended image region is the woman instead of the umbrella. As a result, the incorrect adjective yellow is generated rather than the correct adjective red. This is mainly because the “focus” of the attention is “deviated” several steps backwards and the conditioned words are woman and holding. Another example is at the time step to generate the 7th word wearing, where the attended image region should be the woman instead of the umbrella. Although the generated word is correct, the unfavorable attended image region impairs the grounding performance [65] and ruins the model interpretability, because the attended image region often serves as a visual interpretation for qualitatively assessing the captioning model [9, 11, 39, 54, 65].
In this paper, to address the “deviated focus” issue of current attention models, we propose the novel
Prophet Attention to ground the image regions with proper generated words in a manner similar
to self-supervision. As shown in Figure 2, in the training stage, for each time step in the decoding
process, we first employ the words that will be generated in the future to calculate the “ideal” attention weights towards image regions. These calculated “ideal” attention weights are then used to guide the attention calculation based on the input words that have already been generated (i.e., without the future words to be generated). In other words, the conventional attention model is regularized by the attention weights calculated from the future words. We evaluate the proposed Prophet Attention on two benchmark image captioning datasets. According to both automatic metrics and human evaluations, the captioning models equipped with Prophet Attention outperform the baselines.
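To make this training-time idea concrete before the formal description in Section 2, below is a minimal PyTorch-style sketch under our own simplifying assumptions: the future words are summarized into a single query vector (for instance, by running a BiLSTM over the words still to be generated), and the “ideal” weights regularize the conventional ones through a simple squared-error penalty. It is an illustration of the idea, not the exact formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def additive_attention(query, V, W_h, W_V, w_a):
    """Additive attention over N region features V (N, d); query is a vector (h,)."""
    scores = torch.tanh(V @ W_V.T + query @ W_h.T) @ w_a   # (N,) attention scores
    return F.softmax(scores, dim=0)                         # attention distribution over regions

def prophet_regularizer(h_t, future_summary, V, W_h, W_V, w_a):
    """Sketch of the training-time idea: the conventional ("deviated") attention
    queries with the decoder state h_t, which only encodes past words, while the
    "ideal" attention queries with a summary of the words still to be generated.
    The ideal weights then serve as a target for the deviated ones."""
    alpha_deviated = additive_attention(h_t, V, W_h, W_V, w_a)
    alpha_ideal = additive_attention(future_summary, V, W_h, W_V, w_a)
    # Assumption: a squared-error penalty between the two distributions; in a full
    # system the ideal branch would itself be trained (e.g., to predict the caption
    # words) so that its weights are actually "ideal".
    return ((alpha_deviated - alpha_ideal.detach()) ** 2).sum()
```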
Overall, the contributions of this work are as follows:
• We propose Prophet Attention to enable attention models to correctly ground the words that are to be generated to the proper image regions. Prophet Attention can be easily incorporated into existing models to improve their performance of both grounding and captioning.
• We evaluate Prophet Attention for image captioning on the Flickr30k Entities and the MSCOCO datasets. The captioning models equipped with Prophet Attention significantly outperform the ones without it. Besides automatic metrics, we also conduct human evaluations to assess Prophet Attention from the user experience perspective. At the time of submission (2 June 2020), we achieved 1st place on the leaderboard of the MSCOCO online server benchmark in terms of the default ranking score (CIDEr-c40).
• In addition to the image captioning task, we also adapt Prophet Attention to other language generation tasks and obtain positive experimental results on paraphrase generation and video captioning.
[Figure 2 diagram: panel (a) Visual Attention consists of word embedding, LSTM, Attention, Linear, and Softmax modules; panel (b) Prophet Attention consists of word embedding, BiLSTM, Attention, Linear, and Softmax modules.]
Figure 2: Illustration of the conventional attention model (left) and our Prophet Attention (right) approach. As we can see, our approach calculates the “ideal” attention weights $\hat{\alpha}_t$ based on the future generated words $y_{i:j}$ ($j \geq t$) as a target for the attention model that is conditioned on the previously generated words.
2 Approach
We first briefly review the conventional attention-enhanced encoder-decoder framework in image
captioning and then describe the proposed Prophet Attention in detail.
2.1 Background: Attention-Enhanced Encoder-Decoder Framework
The conventional attention-enhanced encoder-decoder framework [2, 20, 38] usually consists of a
visual encoder and an attention-enhanced language decoder.
Visual Encoder
The visual encoder represents the image with a set of region-based visual feature vectors $V = \{v_1, v_2, \ldots, v_N\} \in \mathbb{R}^{d \times N}$, where each feature vector represents a certain aspect of the image. The visual features serve as a guide for the language decoder to describe the salient information in the image. In implementation, the Faster-RCNN model [2, 51] is widely adopted as the region-based visual feature extractor, which has achieved great success in advancing the state of the art [17, 20, 60, 61].
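As a concrete illustration of these shape conventions, the short snippet below loads precomputed region features for one image and forms the matrix $V \in \mathbb{R}^{d \times N}$ together with the averaged feature $\bar{v}$ used later by the decoder. The file name and feature sizes are placeholders of our own, not something specified by the paper.

```python
import numpy as np
import torch

# Hypothetical file of precomputed Faster-RCNN region features for one image
# (e.g., extracted offline with a bottom-up-attention style detector).
features = np.load("example_image_regions.npy")   # assumed shape (N, d), e.g., (36, 2048)

V = torch.from_numpy(features).float().T          # (d, N), matching V in R^{d x N}
v_bar = V.mean(dim=1)                             # averaged visual feature \bar{v}, shape (d,)
print(V.shape, v_bar.shape)
```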
Attention-Enhanced Caption Decoder
The left sub-figure in Figure 2 shows the widely-used attention-enhanced LSTM decoder [20, 38]. For each decoding step $t$, the decoder takes the word embedding of the current input word $y_{t-1}$, concatenated with the averaged visual features $\bar{v} = \frac{1}{k}\sum_{i=1}^{k} v_i$, as input to the LSTM:

$h_t = \mathrm{LSTM}\left(h_{t-1}, [W_e y_{t-1}; \bar{v}]\right)$,  (1)

where $[;]$ denotes the concatenation operation and $W_e$ denotes the learnable word embedding parameters. Next, the output $h_t$ of the LSTM is used as a query to attend to the relevant image regions in the visual feature set $V$ and to generate the attended visual features $c_t$:
$\alpha_t = f_{\mathrm{Att}}(h_t, V) = \mathrm{softmax}\left(w_\alpha \tanh\left(W_h h_t \oplus W_V V\right)\right), \quad c_t = V \alpha_t^{\top}$,  (2)

where $w_\alpha$, $W_h$ and $W_V$ are the learnable parameters. $\oplus$ denotes the matrix-vector addition, which is calculated by adding the vector to each column of the matrix. Finally, $h_t$ and $c_t$ are passed to a linear layer to predict the next word:
$y_t \sim p_t = \mathrm{softmax}\left(W_p [h_t; c_t] + b_p\right)$,  (3)

where $W_p$ and $b_p$ are the learnable parameters. It is worth noting that some works [2, 39, 60, 61] also attempt to append one more LSTM layer to predict the word; please refer to Anderson et al. [2] for details. Finally, given a target ground-truth sequence $y^*_{1:T}$ and a captioning model with parameters $\theta$, the objective is to minimize the following cross-entropy loss:

$\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y^*_t \mid y^*_{1:t-1}\right)$.  (4)

As we can see from Eq. (2), at each timestep $t$, the attention model relies on $h_t$, which contains the past information of the generated caption words $y_{1:t-1}$, to calculate the attention weights $\alpha_t$. Such reliance on past information makes the attended visual features less grounded on the word to be generated at the current timestep, which impairs both the captioning and grounding performance.
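For readers who prefer code, the following is a compact PyTorch sketch of one step of this conventional attention-enhanced decoder (Eqs. (1)-(3)) together with the teacher-forced cross-entropy objective of Eq. (4). The module sizes, the single-layer LSTM cell, and all names are our own simplifications for illustration, not a faithful reimplementation of any cited system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Conventional attention-enhanced LSTM decoder, Eqs. (1)-(3)."""

    def __init__(self, vocab_size, emb_dim=512, hid_dim=512, feat_dim=2048, att_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # W_e
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)    # Eq. (1)
        self.W_h = nn.Linear(hid_dim, att_dim, bias=False)
        self.W_V = nn.Linear(feat_dim, att_dim, bias=False)
        self.w_a = nn.Linear(att_dim, 1, bias=False)
        self.W_p = nn.Linear(hid_dim + feat_dim, vocab_size)    # Eq. (3)

    def step(self, y_prev, state, V):
        """One decoding step.
        y_prev: (B,) previous word ids; V: (B, N, feat_dim) region features."""
        v_bar = V.mean(dim=1)                                    # averaged visual features
        h, c = self.lstm(torch.cat([self.embed(y_prev), v_bar], dim=-1), state)  # Eq. (1)
        # Eq. (2): additive attention, adding W_h h to every region's W_V v_i.
        scores = self.w_a(torch.tanh(self.W_V(V) + self.W_h(h).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                        # (B, N) attention weights
        c_t = (alpha.unsqueeze(-1) * V).sum(dim=1)               # attended visual features
        logits = self.W_p(torch.cat([h, c_t], dim=-1))           # Eq. (3), pre-softmax scores
        return logits, alpha, (h, c)

def cross_entropy_loss(decoder, V, targets):
    """Teacher-forced objective of Eq. (4); targets: (B, T) word ids starting with <bos>.
    Assumes the default hid_dim=512 for the initial LSTM state."""
    B, T = targets.shape
    state = (V.new_zeros(B, 512), V.new_zeros(B, 512))
    loss = 0.0
    for t in range(1, T):
        logits, _, state = decoder.step(targets[:, t - 1], state, V)
        loss = loss + F.cross_entropy(logits, targets[:, t])     # -log p(y*_t | y*_{1:t-1})
    return loss
```

Note how, in this sketch, the attention weights depend only on the state $h_t$ built from past words, which is exactly the “deviated focus” that Prophet Attention is designed to correct.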