correctness of the generated facts.
2 Related Work
Two main aspects define each approach to contextualized image captioning: (i) the source of external knowledge and the way data relevant to a particular image is identified, and (ii) the method of incorporating external knowledge into caption generation.
External knowledge source
In a popular sub-task of news image captioning, captions are generated for images that accompany news articles (Zhao et al., 2021; Hu et al., 2020; Tran et al., 2020; Jing et al., 2020; Chen and Zhuge, 2020; Biten et al., 2019). Naturally, the article texts themselves are the main source of context for captioning, supplying information about important events and entities.
In a more general case, images are not paired with the relevant context directly. A common way to connect images to an external source of knowledge is to use an object detection mechanism to identify objects in the image and then use their labels to query a database. In Huang et al. (2020), Zhou et al. (2019), and Wu et al. (2017), detected labels are used to extract useful information about common objects featured in the images (such as “dog”, “pot”, “surfboard”) from ConceptNet (Speer et al., 2017) and DBpedia (Auer et al., 2007). Zhao et al. (2019) use Google Cloud Vision APIs to identify not only common objects but also entities (people, car brands, etc.). In Bai et al. (2021), custom classifiers are trained to detect specific image attributes (e.g. the author, the artistic style), which are then used to retrieve relevant information from Wikipedia. These approaches rely exclusively on the visual content of the images to contextualize the captioning process. Thus, the extent of contextualization is limited by the quality of the object detection algorithms, and the potential benefit of utilizing additional data (e.g. image metadata) is left unexplored.
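To make the label-to-knowledge step concrete, the sketch below queries the public ConceptNet API with detected object labels; the endpoint and the choice of returned fields are our assumptions for illustration, not the exact pipelines of the cited systems.

```python
import requests

def conceptnet_facts(label, limit=5):
    """Retrieve (start, relation, end) triples about a detected object label
    from the public ConceptNet API. Illustrative sketch only: the endpoint and
    the fields kept here are assumptions, not the cited systems' pipelines."""
    url = "http://api.conceptnet.io/c/en/" + label.lower().replace(" ", "_")
    edges = requests.get(url, params={"limit": limit}, timeout=10).json().get("edges", [])
    return [(e["start"]["label"], e["rel"]["label"], e["end"]["label"]) for e in edges]

# The labels would normally come from an object detector run on the image.
detected_labels = ["dog", "surfboard"]
knowledge = {label: conceptnet_facts(label) for label in detected_labels}
```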
Certain types of image metadata can be used to build upon general object detection and identify specific entities and events in the image. Lu et al. (2018) use the time metadata of the image (the date when a given photo was taken) and its associated tags to collect similar photographs and to retrieve the names of relevant entities (e.g. people depicted in the image) from their captions. In Nikiforova et al. (2020), geographic metadata (latitude and longitude coordinates of the image location) is used to extract information about the surrounding objects from a geographic database, which allows their system to refer to concrete locations relevant to the image in the generated captions. These papers demonstrate the effectiveness of using image metadata for contextualized captioning, but they are limited to establishing the names of relevant entities and do not use those names to retrieve further data that could make the generated captions even more informative.
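As a rough illustration of this kind of metadata-driven retrieval, the sketch below looks up named map objects around a photo's coordinates via OpenStreetMap's Overpass API; the cited systems use their own geographic databases and retrieval criteria, so the endpoint and query here are assumptions.

```python
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public OpenStreetMap endpoint

def nearby_named_objects(lat, lon, radius_m=300):
    """Return names of OpenStreetMap nodes within radius_m metres of the photo
    location. Illustrative only; not the database used in the cited work."""
    query = f"[out:json];node(around:{radius_m},{lat},{lon})[name];out;"
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=30)
    return [el["tags"]["name"] for el in response.json().get("elements", [])]

# Coordinates would come from the image's GPS metadata.
print(nearby_named_objects(48.8583, 2.2945))  # named objects around the Eiffel Tower
```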
In contrast to the works described above, we
use image metadata as a grounding “anchor”, with
which we can not only identify entities relevant to
the image, but also retrieve a wide range of related
encyclopedic knowledge from an external database.
Specifically, we use geographic metadata, which
has the benefit of being easily available for many
real-life photographs due to the built-in GPS in
modern cameras and phones, making it easier to
collect the data for training and testing the system.
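For instance, the coordinates can usually be read straight from a photo's EXIF metadata; the minimal sketch below does this with Pillow's Exif interface (the library choice and file name are hypothetical, the tag numbers follow the EXIF specification, and error handling is omitted).

```python
from PIL import Image  # Pillow

def gps_coordinates(path):
    """Read (latitude, longitude) in decimal degrees from a JPEG's EXIF GPS IFD.
    Minimal sketch: assumes the GPS tags are present and omits error handling."""
    gps = Image.open(path).getexif().get_ifd(0x8825)  # 0x8825 = GPSInfo IFD
    def to_degrees(dms, ref):
        deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -deg if ref in ("S", "W") else deg
    # Tags 1-4: GPSLatitudeRef, GPSLatitude, GPSLongitudeRef, GPSLongitude.
    return to_degrees(gps[2], gps[1]), to_degrees(gps[4], gps[3])

lat, lon = gps_coordinates("photo.jpg")  # hypothetical file name
```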
Incorporating external knowledge into caption
generation
There are two dominant methods of incorporating external knowledge into the caption generation process: template-based and context-based. In template-based approaches, a caption is generated with placeholder token slots that are later filled with the most fitting named entities extracted from an external knowledge source (Bai et al., 2021; Jing et al., 2020; Hu et al., 2020; Biten et al., 2019). This technique is especially common in news image captioning, where named entities are taken from the news article associated with the image. Still, the straightforward fill-in-the-slot method can be problematic if none of the available entities fits the already generated placeholder slot.
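A minimal sketch of the fill-in-the-slot step is shown below; the slot types and the entity source are hypothetical, and real systems rank candidate entities rather than taking the first available one.

```python
import re

# Named entities extracted from the external source (e.g. a news article),
# grouped by type; the grouping scheme here is hypothetical.
entities = {"PERSON": ["Angela Merkel"], "GPE": ["Berlin"], "ORG": []}

def fill_template(template):
    """Replace each placeholder slot with an entity of the matching type."""
    def substitute(match):
        candidates = entities.get(match.group(1), [])
        # The failure mode discussed above: no entity of the required type
        # is available for an already generated slot.
        return candidates[0] if candidates else "<unfilled:" + match.group(1) + ">"
    return re.sub(r"<(PERSON|GPE|ORG)>", substitute, template)

print(fill_template("<PERSON> speaks at a rally in <GPE>."))
# -> "Angela Merkel speaks at a rally in Berlin."
```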
In context-based approaches, external knowledge informs the caption generation process along with the image features, influencing the choice of produced tokens. For example, Zhou et al. (2019) extract ConceptNet terms related to the image and use their embeddings to initialize the caption generation module. Huang et al. (2020) also use ConceptNet to identify relevant external knowledge and increase the output probabilities of the vocabulary tokens if they match the extracted entities. The downside of context-based models is their inability to generate tokens that are present in the external knowledge but happen to be out-of-vocabulary for the generator’s language model, which is common for named entities.
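The sketch below illustrates the general idea of such token boosting with a constant additive bonus on the pre-softmax scores; the cited models compute and condition the boost differently, so this is only an approximation of the mechanism. It also shows the out-of-vocabulary limitation: an entity absent from the vocabulary cannot be boosted at all.

```python
import numpy as np

def boost_known_tokens(logits, vocab, knowledge_terms, bonus=2.0):
    """Increase the pre-softmax scores of vocabulary tokens that appear in the
    retrieved external knowledge, then renormalise. Sketch only."""
    boosted = logits.astype(float).copy()
    for i, token in enumerate(vocab):
        if token in knowledge_terms:
            boosted[i] += bonus
    probs = np.exp(boosted - boosted.max())
    return probs / probs.sum()

vocab = ["a", "dog", "runs", "on", "the", "beach"]
logits = np.array([1.2, 0.3, 0.8, 0.4, 1.0, 0.1])
# "Scheveningen" is in the external knowledge but not in the vocabulary,
# so it can never be generated, no matter how relevant it is.
print(boost_known_tokens(logits, vocab, {"beach", "dog", "Scheveningen"}))
```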
Our model, like several other approaches (Chen and Zhuge, 2020; Tran et al., 2020; Nikiforova