
2 Background
2.1 Image meanings
Like texts, images are used in communicative
contexts to convey concepts. Images often con-
vey meaning via resemblance, whereas the cor-
respondence between language and meaning is
largely conventional (“icons” vs “symbols” in the
vocabulary of semiotics (e.g., de Saussure, [1916]
1983; Hartshorne et al., 1958; Jappy, 2013; Chandler,
2007)). For example, both the English word
“cat” and images of a cat (including photographs,
sketches, etc.) can signify the concept of a cat. Furthermore,
each can be used in context to represent
either the general concept of cats or a specific
instance of a cat. That is, images can have both (i)
concepts/senses and (ii) objects/referents in
the world. As such, both images and text can direct
the mind of the viewer/reader towards objects and
affairs in the world (also known as “intentionality”
in the philosophy of language (e.g., Searle, 1995)),
albeit in different ways. Despite the adage that a
picture is worth a thousand words, even relatively
simple diagrams may not be reducible to textual
descriptions (Griesemer, 1991).
Like texts, images can also indirectly convey
meaning about the agent who produced the image,
or about the technology used to create or transmit it
(cf. the model of communication of Jakobson and
Sebeok, 1960). Also like language, the meanings
of images can be at least partly conventional and
cultural; for example, logos, iconography, tattoos, crests,
hand gestures, etc. can each convey meaning despite
having no visual resemblance to the concept
or thing being denoted. Shatford (1986) describes
this in terms of images being Of one thing yet po-
tentially About another thing. Such “aboutness” is
not limited to iconography, for photographic im-
agery can convey cultural meanings too—Barthes
(1977) uses the example of a photograph of a red
chequered tablecloth and fresh produce conveying
the idea of Italianicity.
2.2 Text-image relationships
A variety of relationships between text and image
are possible, and have been widely discussed in
creative and cultural fields (e.g., Barthes, 1977; Berger,
2008). The Cooper Hewitt Design Museum has, for
example, published extensive guidelines on accessible
image descriptions (https://www.cooperhewitt.org/cooper-hewitt-guidelines-for-image-description/).
These make a fundamental
distinction between image descriptions, which
provide visual information about what is depicted
in the image, and captions, which explain the im-
age or provide additional information. For example,
the following texts could apply to the same image,
while serving these different purposes:
• description: “Portrait of former First Lady
Michelle Obama seated looking directly at us.”
• caption: “Michelle LaVaughn Robinson Obama,
born 1964, Chicago, Illinois.”
This distinction is closely related to that between
conceptual descriptions and non-visual descrip-
tions made by Hodosh et al. (2013), building on
prior work on image indexing (Jaimes and Chang,
2000). Hodosh et al. subdivide conceptual descrip-
tions into concrete or abstract according to whether
they describe the scene and its entities or the over-
all mood, and also further differentiate a category
of perceptual descriptions which concern the vi-
sual properties of the image itself such as color
and shape. van Miltenburg (2019, Chapter 2) has a
more detailed review of these distinctions.
As images have meanings (see §2.1), describing
an image often involves a degree of interpretation
(van Miltenburg, 2020). Although often presented
as neutral labels, captions on photographs
commonly tell us how visual elements “ought to
be read” (Hall, 2019, p. 229). Literary theorist
Barthes distinguishes two relationships between
texts and images: anchorage and relay. With an-
chorage, the text guides the viewer towards certain
interpretations of the image, whereas for relay, the
text and image complement each other (Barthes,
1977, pp. 38–41). McCloud’s theory of comics
elaborates on this to distinguish four flavours of
word-image combinations (McCloud, 1993): (1)
the image supplements the text, (2) the text sup-
plements the image, (3) the text and image con-
tribute the same information, (4) the text and image
operate in parallel without their meanings inter-
secting. Since language is interpreted contextually,
these image-accompanying texts might depend on
the multimodal discourse context, the writer, and
the intended audience. The strong dependence on
the writer, in particular, highlights the socially and
culturally subjective nature of image descriptions
(van Miltenburg et al., 2017; Bhargava and Forsyth,
2019). This subjectivity can result in speculation
(or abductive inference), for example when people
describing images fill in missing details (van Miltenburg,
2020), in human reporting biases regard-