Underspecification in Scene Description-to-Depiction Tasks
Ben Hutchinson
Google Research, Australia
benhutch@google.com
Jason Baldridge
Google Research, USA
jasonbaldridge@google.com
Vinodkumar Prabhakaran
Google Research, USA
vinodkpg@google.com
Abstract
Questions regarding implicitness, ambiguity and underspecification are crucial for understanding the task validity and ethical concerns of multimodal image+text systems, yet have received little attention to date. This position paper maps out a conceptual framework to address this gap, focusing on systems which generate images depicting scenes from scene descriptions. In doing so, we account for how texts and images convey meaning differently. We outline a set of core challenges concerning textual and visual ambiguity, as well as risks that may be amplified by ambiguous and underspecified elements. We propose and discuss strategies for addressing these challenges, including generating visually ambiguous images, and generating a set of diverse images.
1 Introduction
The classic Grounding Problem in AI asks how it is that language can be interpreted as referring to things in the world. It has been argued that demonstrating natural language understanding requires mapping text to something that is non-text and that functions as a model of meaning (e.g., Bender and Koller, 2020). In this view, multimodal models that relate images and language have an important role in pursuing contextualized language understanding. Indeed, joint modeling of linguistic and visual signals has been argued to play a critical role in progress towards this ultimate goal, as a precursor to modeling relationships between language and the social and physical worlds (Bisk et al., 2020).
Recent text-to-image generation systems have demonstrated impressive capabilities (Zhang et al., 2021; Ramesh et al., 2021; Ding et al., 2021; Nichol et al., 2021; Gafni et al., 2022; Ramesh et al., 2022; Saharia et al., 2022; Yu et al., 2022). These employ deep learning methods such as generative adversarial networks (Goodfellow et al., 2014), neural discrete representation learning (van den Oord et al., 2017) combined with auto-regressive models (Brown et al., 2020), and diffusion models (Sohl-Dickstein et al., 2015), trained on large datasets of images and aligned texts (Radford et al., 2021; Jia et al., 2021).

Figure 1: Generated depictions of the scene "A robot and its pet in a tree". Many elements are underspecified in the text, e.g., pet type, perspective, and visual style.
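To make the underspecification concrete, the following minimal sketch samples several depictions of the Figure 1 prompt. It assumes the open-source diffusers library and a publicly released Stable Diffusion checkpoint; these are hypothetical choices for illustration, not the systems cited above.

```python
# Illustrative sketch only: sample several depictions of an underspecified scene
# description. Assumes the `diffusers` library and a public Stable Diffusion
# checkpoint (hypothetical choices for illustration, not the systems cited above).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "A robot and its pet in a tree"  # pet type, perspective, and style are all unspecified

# Different random seeds resolve the underspecified elements differently, so even
# a small sample exposes part of the space of plausible depictions.
images = [
    pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
    for seed in (0, 1, 2, 3)
]
for i, image in enumerate(images):
    image.save(f"robot_pet_{i}.png")
```

Sampling a set of diverse outputs in this way anticipates one of the mitigation strategies discussed later in the paper.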
With such developments in multimodal modeling and further aspirations towards contextualized language understanding, it is important to better understand both task validity and construct validity in text-to-image systems (Raji et al., 2021). Ethical questions concerning bias, safety and misinformation are increasingly recognized (Saharia et al., 2022; Cho et al., 2022); nevertheless, understanding which system behaviors are desirable requires a vocabulary and framework for understanding the diverse and quickly expanding capabilities of these systems. This position paper addresses these issues by focusing on classic problems (in both linguistic theory and NLP) of ambiguity and underspecification (e.g., Poesio, 1994; Copestake et al., 2005; Frisson, 2009). Little previous work has looked into how underspecification impacts multimodal systems, or what challenges and risks it poses.

This position paper presents a model of task formulation in text-to-image tasks by considering the relationships between images and texts. We use this foundation to identify challenges and risks when generating images of scenes from text descriptions, and discuss possible mitigations and strategies for addressing them.
2 Background
2.1 Image meanings
Like texts, images are used in communicative contexts to convey concepts. Images often convey meaning via resemblance, whereas the correspondence between language and meaning is largely conventional ("icons" vs "symbols" in the vocabulary of semiotics (e.g., de Saussure, [1916] 1983; Hartshorne et al., 1958; Jappy, 2013; Chandler, 2007)). For example, both the English word "cat" and images of a cat—including photographs, sketches, etc.—can signify the concept of a cat. Furthermore, each can be used in context to represent either the general concept of cats, or a specific instance of a cat. That is, images can have both i) concepts/senses, as well as ii) objects/referents in the world. As such, both images and text can direct the mind of the viewer/reader towards objects and affairs in the world (also known as "intentionality" in the philosophy of language (e.g., Searle, 1995)), albeit in different ways. Despite the adage that a picture is worth a thousand words, even relatively simple diagrams may not be reducible to textual descriptions (Griesemer, 1991).

Like texts, images can also indirectly convey meaning about the agent who produced the image, or about the technology used to create or transmit it (cf. the model of communication of Jakobson and Sebeok, 1960). Also like language, the meanings of images can be at least partly conventional and cultural: logos, iconography, tattoos, crests, hand gestures, etc. can each convey meaning despite having no visual resemblance to the concept or thing being denoted. Shatford (1986) describes this in terms of images being Of one thing yet potentially About another thing. Such "aboutness" is not limited to iconography, for photographic imagery can convey cultural meanings too—Barthes (1977) uses the example of a photograph of a red chequered tablecloth and fresh produce conveying the idea of Italianicity.
2.2 Text-image relationships
A variety of relationships between text and image are possible, and have been widely discussed in creative and cultural fields (e.g., Barthes, 1977; Berger, 2008). The Cooper Hewitt Design Museum has, for example, published extensive guidelines on accessible image descriptions.[1] These make a fundamental distinction between image descriptions, which provide visual information about what is depicted in the image, and captions, which explain the image or provide additional information. For example, the following texts could apply to the same image, while serving these different purposes:

description: "Portrait of former First Lady Michelle Obama seated looking directly at us."

caption: "Michelle LaVaughn Robinson Obama, born 1964, Chicago, Illinois."

[1] https://www.cooperhewitt.org/cooper-hewitt-guidelines-for-image-description/
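As a concrete illustration of keeping the two text types separate, here is a minimal sketch of an annotation record; the class and field names are hypothetical and are not taken from the Cooper Hewitt guidelines.

```python
# Hypothetical annotation record that keeps descriptions and captions separate.
# Class and field names are illustrative only, not from the Cooper Hewitt guidelines.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ImageText:
    image_id: str
    description: Optional[str] = None  # visual information: what is depicted in the image
    caption: Optional[str] = None      # explanation or additional, possibly non-visual, information

record = ImageText(
    image_id="obama_portrait",
    description="Portrait of former First Lady Michelle Obama seated looking directly at us.",
    caption="Michelle LaVaughn Robinson Obama, born 1964, Chicago, Illinois.",
)
```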
This distinction is closely related to that between conceptual descriptions and non-visual descriptions made by Hodosh et al. (2013), building on prior work on image indexing (Jaimes and Chang, 2000). Hodosh et al. subdivide conceptual descriptions into concrete or abstract according to whether they describe the scene and its entities or the overall mood, and further differentiate a category of perceptual descriptions which concern the visual properties of the image itself, such as color and shape. van Miltenburg (2019, Chapter 2) provides a more detailed review of these distinctions.
As images have meanings (see §2.1), describing an image often involves a degree of interpretation (van Miltenburg, 2020). Although often presented as neutral labels, captions on photographs commonly tell us how visual elements "ought to be read" (Hall, 2019, p. 229). Literary theorist Barthes distinguishes two relationships between texts and images: anchorage and relay. With anchorage, the text guides the viewer towards certain interpretations of the image, whereas for relay, the text and image complement each other (Barthes, 1977, pp. 38–41). McCloud's theory of comics elaborates on this to distinguish four flavours of word-image combinations (McCloud, 1993): (1) the image supplements the text, (2) the text supplements the image, (3) the text and image contribute the same information, (4) the text and image operate in parallel without their meanings intersecting. Since language is interpreted contextually, these image-accompanying texts might depend on the multimodal discourse context, the writer, and the intended audience. The strong dependence on the writer, in particular, highlights the socially and culturally subjective nature of image descriptions (van Miltenburg et al., 2017; Bhargava and Forsyth, 2019). This subjectivity can result in speculation (or abductive inference), for example when people describing images fill in missing details (van Miltenburg, 2020); in human reporting biases regarding what is considered noteworthy or unexpected (van Miltenburg et al., 2016; Misra et al., 2016); in social and cultural stereotyping (van Miltenburg, 2016; Zhao et al., 2017; Otterbacher et al., 2019); and in derogatory and offensive image associations (Birhane et al., 2021; Crawford and Paglen, 2019).

Families of multimodal (text and image) tasks:
- Image-to-text tasks: generating descriptions of scenes; optical character recognition; search index term generation; ...
- Text-to-image tasks: generating depictions of scenes; story illustration; art generation; ...
- Image+text-to-text tasks: visual question answering; ...
- Image+text-to-image tasks: image editing using verbal prompts; ...

Figure 2: Sketch of a taxonomy of text+image tasks. The taxonomy has gaps which suggest novel tasks, e.g., "optical character generation" (generating images of texts), or querying text collections using images.
Despite the frequently stated motivation of ML-based multimodal image+text technologies as assisting the visually impaired, the distinction between captions and descriptions—relevant to accessibility—is mostly ignored in the text-to-image literature (van Miltenburg, 2019, 2020). It is common for systems that generate image descriptions to be described as "image-captioning" (e.g., Nie et al., 2020; Agrawal et al., 2019; Srinivasan et al., 2021; Lin et al., 2014; Sharma et al., 2018), without making a distinction between captions and descriptions. An exception is a recent paper explicitly aimed at addressing image accessibility (Kreiss et al., 2021). Other NLP work uses "caption" to denote characterizations of image content, using "depiction" for more general relations between texts and images (Alikhani and Stone, 2019).
Within multimodal NLP, building on annotation efforts, Alikhani et al. have distinguished five types of coherence relationships in aligned images and texts, of which multiple can hold concurrently (Alikhani et al., 2019, 2020): (1) the text presents what is depicted in the image, (2) the text describes the speaker's reaction to the image, (3) the text describes a bigger event of which the image captures only a moment, (4) the text describes background information or other circumstances relevant to the image, and (5) the text concerns the production and presentation of the image itself.
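Since several of these relations can hold for the same text-image pair, set-valued labels are a natural encoding. The sketch below is one possible representation; the enum member names are our paraphrases, not the label set used by Alikhani et al.

```python
# One possible set-valued encoding of the five coherence relations; member names
# are paraphrases for illustration, not the labels used by Alikhani et al.
from enum import Flag, auto

class CoherenceRelation(Flag):
    PRESENTS_DEPICTED_CONTENT = auto()  # (1) text presents what is depicted in the image
    SPEAKER_REACTION = auto()           # (2) text describes the speaker's reaction to the image
    LARGER_EVENT = auto()               # (3) image captures only a moment of a bigger described event
    BACKGROUND_INFO = auto()            # (4) text gives background or other relevant circumstances
    PRODUCTION_PRESENTATION = auto()    # (5) text concerns the image's production and presentation

# Multiple relations can hold concurrently for one text-image pair:
labels = CoherenceRelation.PRESENTS_DEPICTED_CONTENT | CoherenceRelation.BACKGROUND_INFO
assert labels & CoherenceRelation.BACKGROUND_INFO
```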
Finally, we also note the case where the image is of (or contains) text itself. Not only is this relevant to OCR tasks, but also to visual analysis of web pages (e.g., Mei et al., 2016), memes (e.g., Kiela et al., 2020), and advertising imagery (e.g., Lim-Fei et al., 2017), as well as being a challenging aspect of image generation when the image is desired to have embedded text (for example, on a book cover). (Prior to movable type printing, the distinction between texts and images-of-texts was likely less culturally important (Ong, 2013; Sproat, 2010).)
2.3 Text-to-image tasks
Figure 2 situates the family of text-to-image tasks within the greater family of multimodal (text and image) tasks. One of the important factors distinguishing different flavors of text-to-image tasks is the semantic and pragmatic relationship between the input text and the output image. Although commonly used as if it describes a single task, we posit that "text-to-image" describes a family of tasks, since it only denotes a structural relationship: a text goes in and an image comes out. Although some relationship between input and output is perhaps implied, it is just as implicit as if one were to speak of a "text-to-text" task without mentioning whether the task involves translation, paraphrase, summarization, etc. It is important to emphasise that tasks and models are typically not in a 1:1 relationship: even without multi-head architectures, a model may be used for many tasks (e.g., Raffel et al., 2020; Chen et al., 2022), while many (single-task) NLP architectures employ multiple models in sequence. As van Miltenburg (2020) argues, the dataset annotations which often act as extensional definitions of the task of interest (Schlangen, 2021) are often produced via underspecified crowdsourcing tasks that do not pay full attention to the rich space of possible text-image relationships described above. Similarly, text-image pairs repurposed from the web often have poorly specified relationships: although the Web Content Accessibility Guidelines recommend that "alt" tags "convey the same function or purpose as the image" (Chisholm et al., 2001) (for a survey of guidelines,