Communication breakdown: On the low mutual intelligibility between human and neural captioning

Roberto Dessì
Meta AI, Universitat Pompeu Fabra
rdessi@meta.com

Eleonora Gualdoni and Francesca Franzon
Universitat Pompeu Fabra
{name.lastname}@upf.edu

Gemma Boleda and Marco Baroni
ICREA, Universitat Pompeu Fabra
{name.lastname}@upf.edu
Abstract

We compare the 0-shot performance of a neural caption-based image retriever when given as input either human-produced captions or captions generated by a neural captioner. We conduct this comparison on the recently introduced IMAGECODE data-set (Krojer et al., 2022), which contains hard distractors nearly identical to the images to be retrieved. We find that the neural retriever has much higher performance when fed neural rather than human captions, despite the fact that the former, unlike the latter, were generated without awareness of the distractors that make the task hard. Even more remarkably, when the same neural captions are given to human subjects, their retrieval performance is almost at chance level. Our results thus add to the growing body of evidence that, even when the “language” of neural models resembles English, this superficial resemblance might be deeply misleading.
1 Introduction

Neural vision-and-language models have achieved impressive results in tasks such as visual commonsense reasoning and question answering (e.g., Chen et al., 2019; Lu et al., 2019). However, Krojer et al. (2022) recently showed, in the context of caption-based image retrieval, that state-of-the-art multimodal models still perform poorly when the candidate pool contains very similar distractor images (such as close frames from the same video).

Here, we show that, when the best pre-trained image retrieval system of Krojer et al. (2022) is fed captions produced by an out-of-the-box neural caption generator, its performance makes a big jump forward. 0-shot image retrieval accuracy improves by almost 6% compared to the highest previously reported human-caption-based performance by the same model, obtained with fine-tuning and various ad-hoc architectural adaptations. This is remarkable, because the off-the-shelf caption generator we use (unlike the humans who wrote the original captions in the data-set) does not take the set of distractor images into account. Even more remarkably, we show that, when human subjects are tasked with retrieving the right image using the same neural captions that help the model so much, their performance is only marginally above chance level.
2 Setup

Data
We use the more challenging video section of the IMAGECODE data-set (Krojer et al., 2022). Since we do not fine-tune our model, we only use the validation set, which includes 1,872 data points (the IMAGECODE test set annotations are not publicly available). Henceforth, when we employ the term IMAGECODE, we are referring to this subset.
Each data point consists of a target image and 9 distractors, where the target and the distractors are frames from the same (automatically segmented) scene in a video. We also use the human captions in the data-set, which were produced by subjects who had access to the distractors while annotating each target (they were instructed to take the distractors into account, without explicitly referring to them). Having access to this “common ground” (Brennan and Clark, 1996), annotators produced highly context-dependent descriptions (see example human captions in Fig. 1). The data-set contains a single caption per image.
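To fix ideas, here is a minimal sketch of how one such retrieval item could be represented in code; the class name, field names and example values are our own illustrative assumptions, not the data-set's actual schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ImageCoDeItem:
        """One validation item: 10 frames from the same video scene, one of which is the target."""
        image_paths: List[str]   # the target frame plus its 9 hard distractors
        target_index: int        # index of the target frame within image_paths
        human_caption: str       # the single human caption, written with the distractors in view

    # Hypothetical example (paths and caption are placeholders):
    item = ImageCoDeItem(
        image_paths=[f"scene_042/frame_{i}.jpg" for i in range(10)],
        target_index=3,
        human_caption="The blue backpack is barely visible in this frame.",
    )
    assert len(item.image_paths) == 10   # hence the 10% chance level for retrieval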
Neural caption generation
We use the ClipCap caption generation system (Mokady et al., 2021) without fine-tuning. For details and hyperparameters of the generation process, see Appendix A. In short, ClipCap processes an image with a CLIP visual encoder (Radford et al., 2021) and learns a mapping from the resulting visual embedding to a sequence of embeddings in GPT-2 space (Radford et al., 2019), which are used to kickstart the generation of a sequence of tokens. We report experiments with the ClipCap variant fine-tuned on the COCO data-set (Lin et al., 2014), in which the weights of the multimodal mapper were updated and those of the language model (GPT-2) were kept frozen. We obtained very similar results with the other publicly available ClipCap variants. We generate a single neural caption for each IMAGECODE target image by passing it through ClipCap. Note that, since there is no way to make this out-of-the-box architecture distractor-aware, the neural captions do not take distractors into account.

setup                                  acc
neural captions, 0-shot                27.9
human captions, 0-shot                 17.4
human captions, Krojer et al.'s best   22.3

Table 1: Percentage IMAGECODE accuracy of the 0-shot image retriever when given neural vs. human captions as input. The last row reports the accuracy of the best fine-tuned, architecturally-adjusted model from Krojer et al. (2022) (featuring a context module, temporal embeddings and a ViT-B/16 backbone).
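To make the generation pipeline just described more concrete, the following is a minimal, hedged sketch of ClipCap-style prefix captioning. It is not the authors' or ClipCap's actual code: the linear mapper is an untrained placeholder standing in for ClipCap's trained multimodal mapper, the CLIP backbone, prefix length and greedy decoding are arbitrary illustrative choices, and the image path is hypothetical.

    import torch
    import clip                                    # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, preprocess = clip.load("ViT-B/32", device=device)
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    prefix_len = 10                                # number of "visual" prefix embeddings
    # Placeholder for ClipCap's trained mapper from CLIP space to GPT-2 input space:
    mapper = torch.nn.Linear(512, prefix_len * gpt2.config.n_embd).to(device)

    @torch.no_grad()
    def generate_caption(image_path: str, max_new_tokens: int = 20) -> str:
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        img_emb = clip_model.encode_image(image).float()          # (1, 512) CLIP image embedding
        prefix = mapper(img_emb).view(1, prefix_len, gpt2.config.n_embd)
        generated = prefix                                        # running sequence of input embeddings
        tokens = []
        for _ in range(max_new_tokens):                           # greedy left-to-right decoding
            logits = gpt2(inputs_embeds=generated).logits[:, -1, :]
            next_id = logits.argmax(dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
            tokens.append(next_id.item())
            next_emb = gpt2.transformer.wte(next_id).unsqueeze(1) # embed new token, append to sequence
            generated = torch.cat([generated, next_emb], dim=1)
        return tokenizer.decode(tokens)

    # print(generate_caption("scene_042/frame_3.jpg"))            # hypothetical frame path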
Image retrieval
We use the simplest architecture proposed by Krojer et al. (2022) (the one without context module and temporal embeddings), which amounts to a standard CLIP retriever from Radford et al. (2021). The caption and each image in the set are passed through textual and visual encoders, respectively. Retrieval is successful if the dot product between the resulting caption and target image representations is larger than that of the embedded caption with any distractor representation. We use the ResNet-based CLIP visual encoder (He et al., 2015), whereas Krojer et al. (2022) used the ViT-B/16 architecture. We found the former to have a slightly higher 0-shot retrieval accuracy compared to the one they used (17.4% in Table 1 here vs. 14.9% in their paper).
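Since the retriever is a standard CLIP model, the retrieval step is simple to sketch. The code below is a hedged illustration reusing the hypothetical ImageCoDeItem structure from the Data section; the RN50x4 backbone and the L2-normalization before the dot product are our assumptions, not necessarily the exact configuration used in the experiments.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("RN50x4", device=device)    # a ResNet-based CLIP backbone

    @torch.no_grad()
    def retrieve(caption: str, image_paths) -> int:
        """Return the index of the candidate image with the highest caption-image score."""
        text = clip.tokenize([caption], truncate=True).to(device)
        images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
        text_emb = model.encode_text(text)
        image_embs = model.encode_image(images)
        # Normalize so the dot product is a cosine similarity; retrieval succeeds
        # when the target frame scores higher than all nine distractors.
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
        scores = image_embs @ text_emb.T                       # (10, 1) caption-image scores
        return scores.squeeze(1).argmax().item()

    def accuracy(items, captions) -> float:
        """Fraction of items whose predicted frame is the target (chance level: 10%)."""
        hits = sum(retrieve(cap, item.image_paths) == item.target_index
                   for item, cap in zip(items, captions))
        return hits / len(items)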
3 Results and analysis

Neural vs. human caption performance
As shown in Table 1, the out-of-the-box neural image retrieval model has a clear preference for neural captions. It reaches 27.9% IMAGECODE accuracy when taking neural captions as input, vs. 17.4% with human captions (chance level is at 10%). For comparison, the best fine-tuned, architecture-adjusted model of Krojer et al. (2022) reached 22.3% performance with human captions.
A concrete sense of the differences between the two types of captions is given by the examples in Fig. 1. The examples in this figure are picked randomly; based on manual inspection of a larger set, we are confident they are representative of the full data. Clearly, neural captions are shorter (avg. length of 11.4 tokens vs. 23.2 for human captions) and more plainly descriptive (although the description is mostly only vaguely related to what is actually depicted). Since there is no way to make the out-of-the-box ClipCap system distractor-aware, the neural captions do not highlight discriminative aspects of a target image compared to the distractors. Human captions, on the other hand, use very articulated language to highlight what is unique about the target compared to the closest distractors (often focusing on rather marginal aspects of the image because of their discriminativeness, e.g., for the first example in the figure, the fact that the blue backpack is hardly visible). It is not surprising that a generic image retriever, which was not trained to handle this highly context-based linguistic style, would not get much useful information out of the human captions. It is interesting, however, that this generic system performs relatively well with the neural captions, given how off-the-mark and non-discriminative the latter typically are.
As more quantitative cues of the differences between caption types, we observe that human captions make more use of both rare lemmas and function words (see frequency plots in Appendix B; code to reproduce our analysis with human and model-generated captions is available at https://github.com/franfranz/emecomm_context). Extracting the lemmas that are statistically most strongly associated with the human caption set (see Appendix C for the method and the full top list), we observe “meta-visual” words such as visible and see, pronouns and determiners cuing anaphoric structure (the, her, his), and function words signaling a more complex sentence structure, such as auxiliaries, negation and connectives. Among the most typical neural lemmas, we find instead general terms for concrete entities, such as people, woman, table and food.
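As an illustration of the kind of corpus comparison described in this paragraph, here is a minimal, hedged sketch. It assumes spaCy for tokenization and lemmatization, and it uses a generic smoothed log-odds score as a stand-in for the association measure detailed in Appendix C, which may well differ; the function and helper names are our own.

    import math
    from collections import Counter
    import spacy                     # pip install spacy; python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")

    def lemma_counts(captions):
        """Count alphabetic lemmas over a list of caption strings."""
        counts, n_tokens = Counter(), 0
        for doc in nlp.pipe(captions):
            for tok in doc:
                if tok.is_alpha:
                    counts[tok.lemma_.lower()] += 1
                    n_tokens += 1
        return counts, n_tokens

    def top_associated_lemmas(human_caps, neural_caps, k=20):
        """Lemmas most over-represented in the human captions relative to the neural ones."""
        h_counts, h_total = lemma_counts(human_caps)
        n_counts, n_total = lemma_counts(neural_caps)
        vocab = set(h_counts) | set(n_counts)
        score = {
            lemma: math.log((h_counts[lemma] + 1) / (h_total + len(vocab)))
                 - math.log((n_counts[lemma] + 1) / (n_total + len(vocab)))
            for lemma in vocab
        }
        return sorted(score, key=score.get, reverse=True)[:k]

    def avg_length_in_tokens(captions):
        """Average caption length in tokens (the paper reports 11.4 neural vs. 23.2 human)."""
        docs = list(nlp.pipe(captions))
        return sum(len(doc) for doc in docs) / len(docs)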
Are neural captions really discriminative?
By looking at Figure 1, we see that neural captions might be (very noisily) descriptive of the target, but they seem hardly discriminative with respect to the nearest distractors. Recall that each IMAGECODE set contains a sequence of 10 frames from the same scene. In general, the frames that are farther away in time might be easier to discriminate than the temporally adjacent ones.
Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective Zijian Zhang1 Chang Shu23 Ya Xiao1 Yuan Shen1 Di Zhu1 Jing Xiao210 玖币0人下载