Generating image captions with external encyclopedic knowledge
Sofia Nikiforova, Tejaswini Deoskar, Denis Paperno, Yoad Winter
Utrecht University, Utrecht, the Netherlands
{s.nikiforova, t.deoskar, d.paperno, y.winter}@uu.nl
Abstract
Accurately reporting what objects are depicted in an image is largely a solved problem in automatic caption generation. The next big challenge on the way to truly humanlike captioning is being able to incorporate the context of the image and related real world knowledge. We tackle this challenge by creating an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data. Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base, with their subsequent integration into the captioning pipeline at both the encoding and decoding stages. Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions, and achieves significant improvements over multiple baselines. We empirically demonstrate that our approach is effective for generating contextualized captions with encyclopedic knowledge that is both factually accurate and relevant to the image.
1 Introduction
Most modern image captioning systems are designed to produce a straightforward description of the visual content of the image. They abstract away from the image context and have no access to information that is not directly inferred from the image. By contrast, humans can extend beyond a purely visual description and produce a caption that is influenced by the image context and informed by real world knowledge. As a result, human-generated and automatically generated captions for the same image can be drastically different, as seen in Figure 1. This contrast motivates the task of contextualized image captioning, which sets the goal of automatically generating captions that would account for relevant image-external knowledge.
Human: Clock Tower, Palace of Westminster. Completed in 1859, the clock tower houses the bell known as Big Ben.
BUTD (Anderson et al., 2018): a tall clock tower towering over a city
ClipCap (Mokady et al., 2021): A clock tower is seen against a cloudy sky.

Figure 1: Human-generated caption vs. captions generated by standard automatic captioning systems.

In this paper we propose an approach to contextualizing an image captioning system and enriching it with a wide range of encyclopedic knowledge. Our key contribution is in constructing an image-specific knowledge context that consists of relevant facts from an open-domain knowledge base, retrieved based on image location metadata. We bridge the gap between two lines of research in contextualized image captioning: one utilizes knowledge from external data sources, establishing its relevance through image recognition (Bai et al., 2021; Huang et al., 2020); the other uses geographic metadata to identify the names of geographic entities in and around the image (Nikiforova et al., 2020). To the best of our knowledge, we are the first to combine these two approaches: we use geographic metadata to access an external knowledge base and obtain various kinds of non-geographic encyclopedic facts relevant to the image, with their subsequent integration into the captioning pipeline.
We implement the proposed approach in a knowledge-aware caption generation system with image-external knowledge incorporated at both the encoding and decoding stages. We also present a new dataset of images with naturally created knowledge-rich captions and corresponding geographic metadata. Our experiments on this dataset show the effectiveness of our approach: the system greatly outperforms multiple baselines in commonly used captioning metrics and, crucially, in the correctness of the generated facts.
2 Related Work
Two main aspects that define each approach to contextualized image captioning are (i) the source of external knowledge and the way to identify the data relevant to a particular image, and (ii) the method of incorporating external knowledge into caption generation.
External knowledge source. In a popular subtask of news image captioning, captions are generated for images that accompany news articles (Zhao et al., 2021; Hu et al., 2020; Tran et al., 2020; Jing et al., 2020; Chen and Zhuge, 2020; Biten et al., 2019). Naturally, the article texts themselves are the main source of context for captioning, supplying information about important events and entities.

In a more general case, images are not paired with the relevant context directly. A common way to connect images to an external source of knowledge is to use an object detection mechanism to identify objects in the image and then use their labels to query a database. In Huang et al. (2020); Zhou et al. (2019); Wu et al. (2017), detected labels are used to extract useful information about common objects featured in the images (such as “dog”, “pot”, “surfboard”) from ConceptNet (Speer et al., 2017) and DBpedia (Auer et al., 2007). Zhao et al. (2019) use Google Cloud Vision APIs to identify not only common objects but also entities (people, car brands, etc.). In Bai et al. (2021), custom classifiers are trained to detect specific image attributes (e.g. the author, the artistic style), which are then used to retrieve relevant information from Wikipedia. These approaches exclusively use the visual content of the images to contextualize the captioning process. Thus, the extent of contextualization is limited by the quality of the object detection algorithms, and the potential benefit of utilizing additional data (e.g. image metadata) is left unexplored.
Certain types of image metadata can be used to build upon general object detection and identify specific entities and events in the image. Lu et al. (2018) use the time metadata of the image (the date when a given photo was taken) and its associated tags to collect similar photographs and to retrieve the names of relevant entities (e.g. people depicted in the image) from their captions. In Nikiforova et al. (2020), geographic metadata (latitude and longitude coordinates of the image location) is used to extract information about the surrounding objects from a geographic database, which allows their system to refer to concrete locations relevant to the image in the generated captions. These papers demonstrate the effectiveness of using image metadata for contextualized captioning but are limited to establishing the names of relevant entities, and do not utilize them to get further data that could be useful for generating even more informative captions.
In contrast to the works described above, we use image metadata as a grounding “anchor”, with which we can not only identify entities relevant to the image, but also retrieve a wide range of related encyclopedic knowledge from an external database. Specifically, we use geographic metadata, which has the benefit of being easily available for many real-life photographs due to the built-in GPS in modern cameras and phones, making it easier to collect the data for training and testing the system.
Incorporating external knowledge into caption generation. There are two dominant methods of incorporating external knowledge into the caption generation process: template-based and context-based. In template-based approaches, a caption is generated with placeholder token slots that are later filled with the most fitting named entities extracted from an external knowledge source (Bai et al., 2021; Jing et al., 2020; Hu et al., 2020; Biten et al., 2019). This is an especially common technique in news image captioning, where named entities are taken from the news article associated with the image. Still, the straightforward fill-in-the-slot method can be problematic if none of the available entities fit the already generated placeholder slot.

In context-based approaches, external knowledge informs the caption generation process along with the image features, influencing the choice of produced tokens. For example, Zhou et al. (2019) extract ConceptNet terms related to the image and use their embeddings to initialize the caption generation module. Huang et al. (2020) also use ConceptNet to identify relevant external knowledge and increase the output probabilities of the vocabulary tokens if they match the extracted entities. The downside of context-based models is their inability to generate tokens that are present in the external knowledge but happen to be out-of-vocabulary for the generator’s language model, which is common for named entities.
Our model, like several other approaches (Chen and Zhuge, 2020; Tran et al., 2020; Nikiforova et al., 2020; Whitehead et al., 2018), combines characteristics of both template-based and context-based methods. Similarly to template-based architectures, some of the tokens produced by our caption generation model are taken directly from external knowledge sources, and as in the context-based methods, external knowledge in our model influences the generation of regular vocabulary words.
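As an illustration, one common way to realize such a hybrid is a copy- or pointer-style generation step that mixes a distribution over the regular vocabulary with a distribution over candidate tokens drawn from external knowledge. The sketch below shows this generic idea only; it is not our exact architecture (our integration of knowledge is described in Section 4.3), and the function and parameter names are illustrative.

```python
# Generic sketch of a hybrid generation step: a gate decides how much
# probability mass goes to copying a knowledge token (e.g. an entity name)
# versus generating a regular vocabulary word. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def hybrid_step(decoder_state, vocab_logits, knowledge_scores, gate_layer):
    """
    decoder_state:    (H,)  hidden state of the caption decoder at this step
    vocab_logits:     (V,)  scores over the regular vocabulary
    knowledge_scores: (K,)  scores over candidate knowledge tokens
    gate_layer:       nn.Linear(H, 1) producing a copy-gate logit
    Returns a pair of distributions whose total mass sums to 1.
    """
    p_copy = torch.sigmoid(gate_layer(decoder_state)).squeeze(-1)
    p_vocab = (1.0 - p_copy) * F.softmax(vocab_logits, dim=-1)
    p_knowledge = p_copy * F.softmax(knowledge_scores, dim=-1)
    return p_vocab, p_knowledge
```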
3 Image caption dataset
For the purposes of training a knowledge-aware image captioning system, we require a dataset of images with knowledge-rich captions. Standard image captioning datasets, such as MSCOCO (Lin et al., 2014) or Flickr30k (Young et al., 2014), contain heavily curated captions, purposefully stripped of any contextualization and references to encyclopedic knowledge. Moreover, they lack image metadata, which is critical for our approach. Therefore, we compile our own dataset of 7128 images with the associated naturally created captions and image location metadata. The source of our data is the website of the Geograph project [1], which contains a large number of photographs with English-language captions and rich metadata, including the geographic coordinates of the photograph locations. The images we selected for our dataset depict geographic entities, mostly historical man-made structures and buildings, to ensure that each caption contains at least one encyclopedic fact, for example, “Theatre Royal Haymarket. Dating back to 1720”. More details of the dataset, including its split into train, validation and test sets, are given in Appendix A.
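For concreteness, each entry in the dataset pairs an image with its naturally created caption and its location coordinates. The record structure below is an illustrative representation of our own, not the Geograph schema, and the path and coordinate values are hypothetical.

```python
# Illustrative structure of one dataset entry; the field names are our own
# and the path/coordinate values are hypothetical, not taken from the dataset.
from dataclasses import dataclass

@dataclass
class CaptionedImage:
    image_path: str   # path to the photograph
    caption: str      # naturally created, knowledge-rich caption
    latitude: float   # image location metadata
    longitude: float

example = CaptionedImage(
    image_path="images/000001.jpg",
    caption="Theatre Royal Haymarket. Dating back to 1720",
    latitude=51.5088,
    longitude=-0.1317,
)
```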
4 Approach
Our approach begins with identifying real world entities that are related to the image. Specifically, we use the coordinates of where the photograph was taken to retrieve a set of geographic entities around that location (Section 4.1). Then, we collect a wide range of relevant encyclopedic facts. We extract them from an open-domain knowledge base in the form of <subject, predicate, object> triples, where subjects are the previously identified geographic entities (Section 4.2). Finally, we integrate the collected knowledge into an otherwise standard encoder-decoder image captioning pipeline to produce knowledge-aware captions (Section 4.3).
[1] http://www.geograph.org.uk/
4.1 Geographic Context
We adapt an approach from Nikiforova et al. (2020) to retrieve geographic entities relevant to the image from the OpenStreetMap [2] database and thus to construct an image-specific geographic context. Formally, the geographic context $G$ of a given image is a set of $n$ geographic entities $(e_1 \dots e_n)$ located within a radius $r$ from the image location. Based on preliminary experiments, we set $n$ at 300 and $r$ at 1 kilometer as the hyperparameters of our system.

Each geographic entity $e_i$ is associated with its name and a set of geographic features: distance $d_i$ and azimuth $a_i$ between the entity and the image location, the entity's size $s_i$ and type $t_i$. In addition to that, we introduce two new features, intended to reflect the salience of the entity through the amount of information available about it in a knowledge base: a binary indicator $f_i$ that shows whether or not the entity corresponds to any facts in the knowledge context, and the number of facts $\#f_i$ that correspond to the entity in the knowledge context. A sample fragment of a geographic context, with the entities mapped to their names and features, is shown in Figure 2.
e_1 – Theatre Royal – (d_1 = 0.02 km, a_1 = -132°, s_1 = 0.001 km², t_1 = theatre, f_1 = 1, #f_1 = 2)
e_2 – Haymarket – (d_2 = 0.05 km, a_2 = -30°, s_2 = 0.0004 km², t_2 = primary, f_2 = 0, #f_2 = 0)
...
e_n – Charing Cross – (d_n = 0.37 km, a_n = -81°, s_n = 0.0 km², t_n = station, f_n = 1, #f_n = 1)

Figure 2: A fragment of a geographic context.
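As a minimal sketch of this step, the code below derives the per-entity features from coordinates and collects the entities within the radius $r$. The haversine distance and great-circle bearing formulas are assumed, standard choices rather than our exact implementation, and the dictionary-based entity and fact representations are illustrative rather than the OpenStreetMap schema.

```python
# Sketch of constructing the geographic context G from image coordinates.
# Haversine distance and great-circle bearing are assumptions; entity/fact
# dictionaries are illustrative placeholders for the database records.
import math

def distance_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def azimuth_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from the image location to the entity, in degrees in (-180, 180]."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return math.degrees(math.atan2(y, x))

def build_geo_context(image_lat, image_lon, entities, facts, r=1.0, n=300):
    """Keep up to n entities within r km of the image, each with its features."""
    context = []
    for e in entities:  # each e: {"name", "lat", "lon", "size", "type"}
        d = distance_km(image_lat, image_lon, e["lat"], e["lon"])
        if d > r:
            continue
        entity_facts = [f for f in facts if f[0] == e["name"]]  # (s, p, o) triples
        context.append({
            "name": e["name"],
            "distance": d,                                                     # d_i
            "azimuth": azimuth_deg(image_lat, image_lon, e["lat"], e["lon"]),  # a_i
            "size": e["size"],                                                 # s_i
            "type": e["type"],                                                 # t_i
            "has_facts": int(len(entity_facts) > 0),                           # f_i
            "num_facts": len(entity_facts),                                    # #f_i
        })
    context.sort(key=lambda c: c["distance"])
    return context[:n]
```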
The features are combined in vector representations for the entities, called “geographic embeddings”. For an entity $e_i$, a geographic embedding is computed as follows:

$$\mathrm{GEOEMB}(e_i) = \mathrm{Concat}[d_i, \mathrm{norm}(a_i), s_i, f_i, \#f_i, \mathrm{Emb}_t(t_i)] \quad (1)$$

where $\mathrm{norm}$ is an azimuth normalization function and $\mathrm{Emb}_t$ is an embedding function for the entities' types. For implementation details, see Appendix C. Our experiments show that our straightforward concatenation strategy achieves the same performance level as the more complicated approach employed by Nikiforova et al. (2020).
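As an illustration of equation (1), the module below concatenates the numeric features with a learned type embedding. The azimuth normalization (scaling to [-1, 1]) and the type embedding dimension are assumptions made here for the sketch; the actual choices are those of Appendix C, which is not reproduced in this section.

```python
# Sketch of equation (1). The azimuth scaling and the 16-dimensional type
# embedding are assumptions, not necessarily the paper's exact setup.
import torch
import torch.nn as nn

class GeoEmbedding(nn.Module):
    def __init__(self, entity_types, type_emb_dim=16):
        super().__init__()
        self.type_to_id = {t: i for i, t in enumerate(entity_types)}
        self.type_emb = nn.Embedding(len(entity_types), type_emb_dim)  # Emb_t

    def forward(self, entity):
        # entity: one dictionary produced by the geographic context construction
        numeric = torch.tensor([
            entity["distance"],              # d_i
            entity["azimuth"] / 180.0,       # norm(a_i)
            entity["size"],                  # s_i
            float(entity["has_facts"]),      # f_i
            float(entity["num_facts"]),      # #f_i
        ])
        type_vec = self.type_emb(torch.tensor(self.type_to_id[entity["type"]]))
        # GEOEMB(e_i) = Concat[d_i, norm(a_i), s_i, f_i, #f_i, Emb_t(t_i)]
        return torch.cat([numeric, type_vec], dim=-1)
```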
4.2 Knowledge Context
For a given image with the geographic context $G$, the knowledge context $K$ is defined as a set of $m$ facts $(f_1 \dots f_m)$ about the entities in $G$.

[2] https://www.openstreetmap.org/
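A minimal sketch of assembling $K$ from the previously constructed geographic context is given below. It assumes the knowledge base has already been loaded as a list of <subject, predicate, object> tuples and that subjects are matched to entity names by exact string comparison, which is a simplification; the example triples are illustrative.

```python
# Minimal sketch: the knowledge context K keeps the triples whose subject is
# an entity from the geographic context G. Exact-string matching of subjects
# to entity names is a simplifying assumption; the triples below are illustrative.
def build_knowledge_context(geo_context, triples):
    entity_names = {e["name"] for e in geo_context}
    return [(s, p, o) for (s, p, o) in triples if s in entity_names]

# Hypothetical knowledge base fragment:
triples = [
    ("Theatre Royal", "built_in", "1720"),
    ("Theatre Royal", "rebuilt", "1821"),
]
```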