correctness of the generated facts.
2 Related Work
Two main aspects define each approach to contextualized image captioning: (i) the source of external knowledge and the way data relevant to a particular image is identified, and (ii) the method of incorporating external knowledge into caption generation.
External knowledge source
In a popular sub-task of news image captioning, captions are generated for images that accompany news articles (Zhao et al., 2021; Hu et al., 2020; Tran et al., 2020; Jing et al., 2020; Chen and Zhuge, 2020; Biten et al., 2019). Naturally, the article texts themselves are the main source of context for captioning, supplying information about important events and entities.
In a more general case, images are not paired with the relevant context directly. A common way to connect images to an external source of knowledge is to use an object detection mechanism to identify objects in the image and then use their labels to query a database. In Huang et al. (2020), Zhou et al. (2019), and Wu et al. (2017), detected labels are used to extract useful information about common objects featured in the images (such as “dog”, “pot”, “surfboard”) from ConceptNet (Speer et al., 2017) and DBpedia (Auer et al., 2007). Zhao et al. (2019) use Google Cloud Vision APIs to identify not only common objects but also entities (people, car brands, etc.). In Bai et al. (2021), custom classifiers are trained to detect specific image attributes (e.g. the author, the artistic style), which are then used to retrieve relevant information from Wikipedia. These approaches rely exclusively on the visual content of the images to contextualize the captioning process. Thus, the extent of contextualization is limited by the quality of the object detection algorithms, and the potential benefit of utilizing additional data (e.g. image metadata) is left unexplored.
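To make the label-to-knowledge step concrete, the sketch below queries the public ConceptNet API with detected object labels; the endpoint and the choice of returned fields are our assumptions for illustration, not the exact pipelines of the cited systems.

```python
import requests

def conceptnet_facts(label, limit=5):
    """Retrieve (start, relation, end) triples about a detected object label
    from the public ConceptNet API. Illustrative sketch only: the endpoint and
    the fields kept here are assumptions, not the cited systems' pipelines."""
    url = "http://api.conceptnet.io/c/en/" + label.lower().replace(" ", "_")
    edges = requests.get(url, params={"limit": limit}, timeout=10).json().get("edges", [])
    return [(e["start"]["label"], e["rel"]["label"], e["end"]["label"]) for e in edges]

# The labels would normally come from an object detector run on the image.
detected_labels = ["dog", "surfboard"]
knowledge = {label: conceptnet_facts(label) for label in detected_labels}
```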
Certain types of image metadata can be used to build upon general object detection and identify specific entities and events in the image. Lu et al. (2018) use the time metadata of the image (the date when a given photo was taken) and its associated tags to collect similar photographs and to retrieve the names of relevant entities (e.g. people depicted in the image) from their captions. In Nikiforova et al. (2020), geographic metadata (latitude and longitude coordinates of the image location) is used to extract information about the surrounding objects from a geographic database, which allows their system to refer to concrete locations relevant to the image in the generated captions. These papers demonstrate the effectiveness of using image metadata for contextualized captioning, but they are limited to establishing the names of relevant entities and do not use those names to retrieve further data that could make the generated captions even more informative.
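As a rough illustration of this kind of metadata-driven retrieval, the sketch below looks up named map objects around a photo's coordinates via OpenStreetMap's Overpass API; the cited systems use their own geographic databases and retrieval criteria, so the endpoint and query here are assumptions.

```python
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public OpenStreetMap endpoint

def nearby_named_objects(lat, lon, radius_m=300):
    """Return names of OpenStreetMap nodes within radius_m metres of the photo
    location. Illustrative only; not the database used in the cited work."""
    query = f"[out:json];node(around:{radius_m},{lat},{lon})[name];out;"
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=30)
    return [el["tags"]["name"] for el in response.json().get("elements", [])]

# Coordinates would come from the image's GPS metadata.
print(nearby_named_objects(48.8583, 2.2945))  # named objects around the Eiffel Tower
```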
In contrast to the works described above, we
use image metadata as a grounding “anchor”, with
which we can not only identify entities relevant to
the image, but also retrieve a wide range of related
encyclopedic knowledge from an external database.
Specifically, we use geographic metadata, which
has the benefit of being easily available for many
real-life photographs due to the built-in GPS in
modern cameras and phones, making it easier to
collect the data for training and testing the system.
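For instance, the coordinates can usually be read straight from a photo's EXIF metadata; the minimal sketch below does this with Pillow's Exif interface (the library choice and file name are hypothetical, the tag numbers follow the EXIF specification, and error handling is omitted).

```python
from PIL import Image  # Pillow

def gps_coordinates(path):
    """Read (latitude, longitude) in decimal degrees from a JPEG's EXIF GPS IFD.
    Minimal sketch: assumes the GPS tags are present and omits error handling."""
    gps = Image.open(path).getexif().get_ifd(0x8825)  # 0x8825 = GPSInfo IFD
    def to_degrees(dms, ref):
        deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -deg if ref in ("S", "W") else deg
    # Tags 1-4: GPSLatitudeRef, GPSLatitude, GPSLongitudeRef, GPSLongitude.
    return to_degrees(gps[2], gps[1]), to_degrees(gps[4], gps[3])

lat, lon = gps_coordinates("photo.jpg")  # hypothetical file name
```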
Incorporating external knowledge into caption
generation
There are two dominant methods of incorporating external knowledge into the caption generation process: template-based and context-based. In template-based approaches, a caption is generated with placeholder token slots that are later filled with the most fitting named entities extracted from an external knowledge source (Bai et al., 2021; Jing et al., 2020; Hu et al., 2020; Biten et al., 2019). This technique is especially common in news image captioning, where named entities are taken from the news article associated with the image. Still, the straightforward fill-in-the-slot method can be problematic if none of the available entities fits the already generated placeholder slot.
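A minimal sketch of the fill-in-the-slot step is shown below; the slot types and the entity source are hypothetical, and real systems rank candidate entities rather than taking the first available one.

```python
import re

# Named entities extracted from the external source (e.g. a news article),
# grouped by type; the grouping scheme here is hypothetical.
entities = {"PERSON": ["Angela Merkel"], "GPE": ["Berlin"], "ORG": []}

def fill_template(template):
    """Replace each placeholder slot with an entity of the matching type."""
    def substitute(match):
        candidates = entities.get(match.group(1), [])
        # The failure mode discussed above: no entity of the required type
        # is available for an already generated slot.
        return candidates[0] if candidates else "<unfilled:" + match.group(1) + ">"
    return re.sub(r"<(PERSON|GPE|ORG)>", substitute, template)

print(fill_template("<PERSON> speaks at a rally in <GPE>."))
# -> "Angela Merkel speaks at a rally in Berlin."
```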
In context-based approaches, external knowledge informs the caption generation process along with the image features, influencing the choice of produced tokens. For example, Zhou et al. (2019) extract ConceptNet terms related to the image and use their embeddings to initialize the caption generation module. Huang et al. (2020) also use ConceptNet to identify relevant external knowledge and increase the output probabilities of the vocabulary tokens if they match the extracted entities. The downside of context-based models is their inability to generate tokens that are present in the external knowledge but happen to be out-of-vocabulary for the generator’s language model, which is common for named entities.
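The sketch below illustrates the general idea of such token boosting with a constant additive bonus on the pre-softmax scores; the cited models compute and condition the boost differently, so this is only an approximation of the mechanism. It also shows the out-of-vocabulary limitation: an entity absent from the vocabulary cannot be boosted at all.

```python
import numpy as np

def boost_known_tokens(logits, vocab, knowledge_terms, bonus=2.0):
    """Increase the pre-softmax scores of vocabulary tokens that appear in the
    retrieved external knowledge, then renormalise. Sketch only."""
    boosted = logits.astype(float).copy()
    for i, token in enumerate(vocab):
        if token in knowledge_terms:
            boosted[i] += bonus
    probs = np.exp(boosted - boosted.max())
    return probs / probs.sum()

vocab = ["a", "dog", "runs", "on", "the", "beach"]
logits = np.array([1.2, 0.3, 0.8, 0.4, 1.0, 0.1])
# "Scheveningen" is in the external knowledge but not in the vocabulary,
# so it can never be generated, no matter how relevant it is.
print(boost_known_tokens(logits, vocab, {"beach", "dog", "Scheveningen"}))
```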
Our model, like several other approaches (Chen and Zhuge, 2020; Tran et al., 2020; Nikiforova