
Prophet Attention: Predicting Attention with
Future Attention for Image Captioning
Fenglin Liu1, Xuancheng Ren2, Xian Wu3, Wei Fan3, Yuexian Zou1,4, Xu Sun2,5
1ADSPLAB, School of ECE, Peking University
2MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University
3Tencent, Beijing, China 4Peng Cheng Laboratory, Shenzhen, China
5Center for Data Science, Peking University
{fenglinliu98, renxc, zouyx, xusun}@pku.edu.cn
{kevinxwu, Davidwfan}@tencent.com
Abstract
Recently, attention-based models have been used extensively in many sequence-to-sequence
learning systems. Especially for image captioning, attention-based models are expected to
ground the correct image regions to the proper generated words. However, for each time step
in the decoding process, attention-based models usually use the hidden state of the current
input to attend to the image regions. Under this setting, these attention models suffer from
a “deviated focus” problem: they calculate the attention weights based on previous words
instead of the one to be generated, impairing the performance of both grounding and captioning.
In this paper, we propose Prophet Attention, which works in a manner similar to self-supervision.
In the training stage, this module utilizes future information to calculate the “ideal”
attention weights over image regions. These calculated “ideal” weights are further used to
regularize the otherwise “deviated” attention. In this manner, image regions are grounded with
the correct words. The proposed Prophet Attention can be easily incorporated into existing
image captioning models to improve their performance in both grounding and captioning.
Experiments on the Flickr30k Entities and MSCOCO datasets show that the proposed Prophet
Attention consistently outperforms baselines in both automatic metrics and human evaluations.
It is worth noting that we set a new state of the art on both benchmark datasets and achieve
first place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking
score, i.e., CIDEr-c40.
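To make the description above concrete, the following is a minimal sketch (not the released implementation) of how the proposed regularization could be wired into training: the attention weights computed from the current hidden state are pulled toward “ideal” weights computed with the word that is about to be generated, which is available at training time. The function name, the L1 distance, and the stop-gradient on the ideal weights are illustrative assumptions.

```python
import torch

def prophet_regularization(att_current, att_ideal):
    # att_current, att_ideal: (batch, num_regions) attention distributions over image regions.
    # L1 distance between the "deviated" attention computed from the current hidden state
    # and the "ideal" attention computed with the future (to-be-generated) word;
    # the choice of L1 and detaching the ideal weights are assumptions of this sketch.
    return torch.abs(att_current - att_ideal.detach()).sum(dim=-1).mean()

# Training-time usage (illustrative names):
#   att_current = attention(regions, query=h_t)              # driven by past words
#   att_ideal   = attention(regions, query=future_word_emb)  # uses the word to be generated
#   loss = caption_loss + lambda_reg * prophet_regularization(att_current, att_ideal)
```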
1 Introduction
The task of image captioning [7] aims to generate a textual description for an input image and has
received extensive research interest. Recently, attention-enhanced encoder-decoder frameworks
[2, 17, 20, 31, 44, 60] have achieved great success in advancing the state of the art. Specifically,
they use a Faster-RCNN [2, 51] to acquire region-based visual representations and an RNN [14, 18] to
generate coherent captions, where the attention model [3, 38, 55, 59] guides the decoding process
by attending the hidden state to the image regions at each time step. Many sequence-to-sequence
learning systems, including machine translation [3, 55] and text summarization [64], have proven the
importance of the attention mechanism in generating meaningful sentences. Especially for image
captioning, the attention model can ground the salient image regions to generate the next word in the
sentence [2, 26, 38, 59].
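For reference, a minimal sketch of the conventional soft attention used in such captioning decoders is shown below: it attends the decoder hidden state to the Faster-RCNN region features at each step. The class name, layer sizes, and additive scoring function are illustrative assumptions, not taken from any specific cited model.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Soft attention over image-region features, queried by the decoder hidden state."""
    def __init__(self, region_dim, hidden_dim, att_dim):
        super().__init__()
        self.proj_v = nn.Linear(region_dim, att_dim)   # project region features
        self.proj_h = nn.Linear(hidden_dim, att_dim)   # project decoder state
        self.score = nn.Linear(att_dim, 1)             # scalar attention score per region

    def forward(self, regions, h_t):
        # regions: (batch, num_regions, region_dim); h_t: (batch, hidden_dim)
        e = self.score(torch.tanh(self.proj_v(regions) + self.proj_h(h_t).unsqueeze(1)))
        alpha = torch.softmax(e.squeeze(-1), dim=-1)           # attention weights over regions
        context = (alpha.unsqueeze(-1) * regions).sum(dim=1)   # attended visual context
        return context, alpha
```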
Current attention models attend to image regions based on the current hidden state [55, 59], which
contains the information of past generated words. As a result, the attention model has to predict