Underspecification in Scene Description-to-Depiction Tasks
Ben Hutchinson
Google Research, Australia
benhutch@google.com
Jason Baldridge
Google Research, USA
jasonbaldridge@google.com
Vinodkumar Prabhakaran
Google Research, USA
vinodkpg@google.com
Abstract
Questions regarding implicitness, ambiguity and underspecification are crucial for understanding the task validity and ethical concerns of multimodal image+text systems, yet have received little attention to date. This position paper maps out a conceptual framework to address this gap, focusing on systems which generate images depicting scenes from scene descriptions. In doing so, we account for how texts and images convey meaning differently. We outline a set of core challenges concerning textual and visual ambiguity, as well as risks that may be amplified by ambiguous and underspecified elements. We propose and discuss strategies for addressing these challenges, including generating visually ambiguous images, and generating a set of diverse images.
1 Introduction
The classic Grounding Problem in AI asks how it is that language can be interpreted as referring to things in the world. It has been argued that demonstrating natural language understanding requires mapping text to something that is non-text and that functions as a model of meaning (e.g., Bender and Koller, 2020). In this view, multimodal models that relate images and language have an important role in pursuing contextualized language understanding. Indeed, joint modeling of linguistic and visual signals has been argued to play a critical role in progress towards this ultimate goal, as a precursor to modeling relationships between language and the social and physical worlds (Bisk et al., 2020).
Recent text-to-image generation systems have demonstrated impressive capabilities (Zhang et al., 2021; Ramesh et al., 2021; Ding et al., 2021; Nichol et al., 2021; Gafni et al., 2022; Ramesh et al., 2022; Saharia et al., 2022; Yu et al., 2022). These employ deep learning methods such as generative adversarial networks (Goodfellow et al., 2014), neural discrete representation learning (van den Oord et al., 2017) combined with auto-regressive models (Brown et al., 2020), and diffusion models (Sohl-Dickstein et al., 2015), trained on large datasets of images and aligned texts (Radford et al., 2021; Jia et al., 2021).

Figure 1: Generated depictions of the scene "A robot and its pet in a tree". Many elements are underspecified in the text, e.g., pet type, perspective, and visual style.
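To make the underspecification concrete, the following minimal sketch samples several depictions of the Figure 1 prompt. It assumes the open-source diffusers library and a publicly released Stable Diffusion checkpoint; these are hypothetical choices for illustration, not the systems cited above.

```python
# Illustrative sketch only: sample several depictions of an underspecified scene
# description. Assumes the `diffusers` library and a public Stable Diffusion
# checkpoint (hypothetical choices for illustration, not the systems cited above).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "A robot and its pet in a tree"  # pet type, perspective, and style are all unspecified

# Different random seeds resolve the underspecified elements differently, so even
# a small sample exposes part of the space of plausible depictions.
images = [
    pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
    for seed in (0, 1, 2, 3)
]
for i, image in enumerate(images):
    image.save(f"robot_pet_{i}.png")
```

Sampling a set of diverse outputs in this way anticipates one of the mitigation strategies discussed later in the paper.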
With such developments in multimodal modeling and further aspirations towards contextualized language understanding, it is important to better understand both task validity and construct validity in text-to-image systems (Raji et al., 2021). Ethical questions concerning bias, safety and misinformation are increasingly recognized (Saharia et al., 2022; Cho et al., 2022); nevertheless, understanding which system behaviors are desirable requires a vocabulary and framework for understanding the diverse and quickly expanding capabilities of these systems. This position paper addresses these issues by focusing on classic problems (in both linguistic theory and NLP) of ambiguity and underspecification (e.g., Poesio, 1994; Copestake et al., 2005; Frisson, 2009). Little previous work has looked into how underspecification impacts multimodal systems, or what challenges and risks it poses.

This position paper presents a model of task formulation in text-to-image tasks by considering the relationships between images and texts. We use this foundation to identify challenges and risks when generating images of scenes from text descriptions, and discuss possible mitigations and strategies for addressing them.
2 Background
2.1 Image meanings
Like texts, images are used in communicative contexts to convey concepts. Images often convey meaning via resemblance, whereas the correspondence between language and meaning is largely conventional ("icons" vs "symbols" in the vocabulary of semiotics (e.g., de Saussure, [1916] 1983; Hartshorne et al., 1958; Jappy, 2013; Chandler, 2007)). For example, both the English word "cat" and images of a cat—including photographs, sketches, etc.—can signify the concept of a cat. Furthermore, each can be used in context to represent either the general concept of cats, or a specific instance of a cat. That is, images can have both i) concepts/senses, as well as ii) objects/referents in the world. As such, both images and text can direct the mind of the viewer/reader towards objects and affairs in the world (also known as "intentionality" in the philosophy of language (e.g., Searle, 1995)), albeit in different ways. Despite the adage that a picture is worth a thousand words, even relatively simple diagrams may not be reducible to textual descriptions (Griesemer, 1991).

Like texts, images can also indirectly convey meaning about the agent who produced the image, or about the technology used to create or transmit it (cf. the model of communication of Jakobson and Sebeok, 1960). Also like language, the meanings of images can be at least partly conventional and cultural: logos, iconography, tattoos, crests, hand gestures, etc. can each convey meaning despite having no visual resemblance to the concept or thing being denoted. Shatford (1986) describes this in terms of images being Of one thing yet potentially About another thing. Such "aboutness" is not limited to iconography, for photographic imagery can convey cultural meanings too—Barthes (1977) uses the example of a photograph of a red chequered tablecloth and fresh produce conveying the idea of Italianicity.
2.2 Text-image relationships
A variety of relationships between text and image are possible, and have been widely discussed in creative and cultural fields (e.g., Barthes, 1977; Berger, 2008). The Cooper Hewitt Design Museum has, for example, published extensive guidelines on accessible image descriptions.[1] These make a fundamental distinction between image descriptions, which provide visual information about what is depicted in the image, and captions, which explain the image or provide additional information. For example, the following texts could apply to the same image, while serving these different purposes:

description: "Portrait of former First Lady Michelle Obama seated looking directly at us."

caption: "Michelle LaVaughn Robinson Obama, born 1964, Chicago, Illinois."

[1] https://www.cooperhewitt.org/cooper-hewitt-guidelines-for-image-description/
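As a concrete illustration of keeping the two text types separate, here is a minimal sketch of an annotation record; the class and field names are hypothetical and are not taken from the Cooper Hewitt guidelines.

```python
# Hypothetical annotation record that keeps descriptions and captions separate.
# Class and field names are illustrative only, not from the Cooper Hewitt guidelines.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ImageText:
    image_id: str
    description: Optional[str] = None  # visual information: what is depicted in the image
    caption: Optional[str] = None      # explanation or additional, possibly non-visual, information

record = ImageText(
    image_id="obama_portrait",
    description="Portrait of former First Lady Michelle Obama seated looking directly at us.",
    caption="Michelle LaVaughn Robinson Obama, born 1964, Chicago, Illinois.",
)
```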
This distinction is closely related to that between conceptual descriptions and non-visual descriptions made by Hodosh et al. (2013), building on prior work on image indexing (Jaimes and Chang, 2000). Hodosh et al. subdivide conceptual descriptions into concrete or abstract according to whether they describe the scene and its entities or the overall mood, and further differentiate a category of perceptual descriptions which concern the visual properties of the image itself, such as color and shape. van Miltenburg (2019, Chapter 2) provides a more detailed review of these distinctions.
As images have meanings (see §2.1), describing an image often involves a degree of interpretation (van Miltenburg, 2020). Although often presented as neutral labels, captions on photographs commonly tell us how visual elements "ought to be read" (Hall, 2019, p. 229). Literary theorist Barthes distinguishes two relationships between texts and images: anchorage and relay. With anchorage, the text guides the viewer towards certain interpretations of the image, whereas for relay, the text and image complement each other (Barthes, 1977, pp. 38–41). McCloud's theory of comics elaborates on this to distinguish four flavours of word-image combinations (McCloud, 1993): (1) the image supplements the text, (2) the text supplements the image, (3) the text and image contribute the same information, (4) the text and image operate in parallel without their meanings intersecting. Since language is interpreted contextually, these image-accompanying texts might depend on the multimodal discourse context, the writer, and the intended audience. The strong dependence on the writer, in particular, highlights the socially and culturally subjective nature of image descriptions (van Miltenburg et al., 2017; Bhargava and Forsyth, 2019). This subjectivity can result in speculation (or abductive inference), for example when people describing images fill in missing details (van Miltenburg, 2020); in human reporting biases regarding what is considered noteworthy or unexpected (van Miltenburg et al., 2016; Misra et al., 2016); in social and cultural stereotyping (van Miltenburg, 2016; Zhao et al., 2017; Otterbacher et al., 2019); and in derogatory and offensive image associations (Birhane et al., 2021; Crawford and Paglen, 2019).

Families of multimodal (text and image) tasks:
- Image-to-text tasks: generating descriptions of scenes; optical character recognition; search index term generation; ...
- Text-to-image tasks: generating depictions of scenes; story illustration; art generation; ...
- Image+text-to-text tasks: visual question answering; ...
- Image+text-to-image tasks: image editing using verbal prompts; ...

Figure 2: Sketch of a taxonomy of text+image tasks. The taxonomy has gaps which suggest novel tasks, e.g., "optical character generation" (generating images of texts), or querying text collections using images.
Despite the frequently stated motivation of ML-based multimodal image+text technologies as assisting the visually impaired, the distinction between captions and descriptions—relevant to accessibility—is mostly ignored in the text-to-image literature (van Miltenburg, 2019, 2020). It is common for systems that generate image descriptions to be described as "image-captioning" (e.g., Nie et al., 2020; Agrawal et al., 2019; Srinivasan et al., 2021; Lin et al., 2014; Sharma et al., 2018), without making a distinction between captions and descriptions. An exception is a recent paper explicitly aimed at addressing image accessibility (Kreiss et al., 2021). Other NLP work uses "caption" to denote characterizations of image content, using "depiction" for more general relations between texts and images (Alikhani and Stone, 2019).
Within multimodal NLP, building on annotation efforts, Alikhani et al. have distinguished five types of coherence relationships in aligned images and texts, of which multiple can hold concurrently (Alikhani et al., 2019, 2020): (1) the text presents what is depicted in the image, (2) the text describes the speaker's reaction to the image, (3) the text describes a bigger event of which the image captures only a moment, (4) the text describes background information or other circumstances relevant to the image, and (5) the text concerns the production and presentation of the image itself.
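Since several of these relations can hold for the same text-image pair, set-valued labels are a natural encoding. The sketch below is one possible representation; the enum member names are our paraphrases, not the label set used by Alikhani et al.

```python
# One possible set-valued encoding of the five coherence relations; member names
# are paraphrases for illustration, not the labels used by Alikhani et al.
from enum import Flag, auto

class CoherenceRelation(Flag):
    PRESENTS_DEPICTED_CONTENT = auto()  # (1) text presents what is depicted in the image
    SPEAKER_REACTION = auto()           # (2) text describes the speaker's reaction to the image
    LARGER_EVENT = auto()               # (3) image captures only a moment of a bigger described event
    BACKGROUND_INFO = auto()            # (4) text gives background or other relevant circumstances
    PRODUCTION_PRESENTATION = auto()    # (5) text concerns the image's production and presentation

# Multiple relations can hold concurrently for one text-image pair:
labels = CoherenceRelation.PRESENTS_DEPICTED_CONTENT | CoherenceRelation.BACKGROUND_INFO
assert labels & CoherenceRelation.BACKGROUND_INFO
```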
Finally, we also note the case where the image is of (or contains) text itself. Not only is this relevant to OCR tasks, but also to visual analysis of web pages (e.g., Mei et al., 2016), memes (e.g., Kiela et al., 2020), and advertising imagery (e.g., Lim-Fei et al., 2017), as well as being a challenging aspect of image generation when the image is desired to have embedded text (for example, on a book cover). (Prior to movable type printing, the distinction between texts and images-of-texts was likely less culturally important (Ong, 2013; Sproat, 2010).)
2.3 Text-to-image tasks
Figure 2 situates the family of text-to-image tasks within the greater family of multimodal (text and image) tasks. One of the important factors distinguishing different flavors of text-to-image tasks is the semantic and pragmatic relationship between the input text and the output image. Although commonly used as if it describes a single task, we posit that "text-to-image" describes a family of tasks, since it only denotes a structural relationship: a text goes in and an image comes out. Although some relationship between input and output is perhaps implied, it is just as implicit as if one were to speak of a "text-to-text" task without mentioning whether the task involves translation, paraphrase, summarization, etc. It is important to emphasise that tasks and models are typically not in a 1:1 relationship: even without multi-head architectures, a model may be used for many tasks (e.g., Raffel et al., 2020; Chen et al., 2022), while many (single-task) NLP architectures employ multiple models in sequence. As van Miltenburg (2020) argues, the dataset annotations which often act as extensional definitions of the task of interest (Schlangen, 2021) are often produced via underspecified crowdsourcing tasks that do not pay full attention to the rich space of possible text-image relationships described above. Similarly, text-image pairs repurposed from the web often have poorly specified relationships: although the Web Content Accessibility Guidelines recommend that "alt" tags "convey the same function or purpose as the image" (Chisholm et al., 2001) (for a survey of guidelines,