Communication breakdown: On the low mutual intelligibility between human and neural captioning

Roberto Dessì
Meta AI, Universitat Pompeu Fabra
rdessi@meta.com

Eleonora Gualdoni and Francesca Franzon
Universitat Pompeu Fabra
{name.lastname}@upf.edu

Gemma Boleda and Marco Baroni
ICREA, Universitat Pompeu Fabra
{name.lastname}@upf.edu
Abstract

We compare the 0-shot performance of a neural caption-based image retriever when given as input either human-produced captions or captions generated by a neural captioner. We conduct this comparison on the recently introduced IMAGECODE data-set (Krojer et al., 2022), which contains hard distractors nearly identical to the images to be retrieved. We find that the neural retriever has much higher performance when fed neural rather than human captions, despite the fact that the former, unlike the latter, were generated without awareness of the distractors that make the task hard. Even more remarkably, when the same neural captions are given to human subjects, their retrieval performance is almost at chance level. Our results thus add to the growing body of evidence that, even when the “language” of neural models resembles English, this superficial resemblance might be deeply misleading.
1 Introduction

Neural vision-and-language models have achieved impressive results in tasks such as visual commonsense reasoning and question answering (e.g., Chen et al., 2019; Lu et al., 2019). However, Krojer et al. (2022) recently showed, in the context of caption-based image retrieval, that state-of-the-art multimodal models still perform poorly when the candidate pool contains very similar distractor images (such as close frames from the same video).

Here, we show that, when the best pre-trained image retrieval system of Krojer et al. (2022) is fed captions produced by an out-of-the-box neural caption generator, its performance makes a big jump forward. 0-shot image retrieval accuracy improves by almost 6% compared to the highest previously reported human-caption-based performance by the same model, obtained with fine-tuning and various ad-hoc architectural adaptations. This is remarkable, because the off-the-shelf caption generator we use (unlike the humans who wrote the original captions in the data-set) does not take the set of distractor images into account. Even more remarkably, we show that, when human subjects are tasked with retrieving the right image using the same neural captions that help the model so much, their performance is only marginally above chance level.
2 Setup

Data
We use the more challenging video section of the IMAGECODE data-set (Krojer et al., 2022). Since we do not fine-tune our model, we only use the validation set, which includes 1,872 data points (the IMAGECODE test set annotations are not publicly available). Henceforth, when we employ the term IMAGECODE, we are referring to this subset.
Each data point consists of a target image and 9 distractors, where the target and the distractors are frames from the same (automatically segmented) scene in a video. We also use the human captions in the data-set, which were produced by subjects who had access to the distractors while annotating each target (they were instructed to take the distractors into account, without explicitly referring to them). Having access to this “common ground” (Brennan and Clark, 1996), annotators produced highly context-dependent descriptions (see example human captions in Fig. 1). The data-set contains a single caption per image.
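To fix ideas, here is a minimal sketch of how one such retrieval item could be represented in code; the class name, field names and example values are our own illustrative assumptions, not the data-set's actual schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ImageCoDeItem:
        """One validation item: 10 frames from the same video scene, one of which is the target."""
        image_paths: List[str]   # the target frame plus its 9 hard distractors
        target_index: int        # index of the target frame within image_paths
        human_caption: str       # the single human caption, written with the distractors in view

    # Hypothetical example (paths and caption are placeholders):
    item = ImageCoDeItem(
        image_paths=[f"scene_042/frame_{i}.jpg" for i in range(10)],
        target_index=3,
        human_caption="The blue backpack is barely visible in this frame.",
    )
    assert len(item.image_paths) == 10   # hence the 10% chance level for retrieval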
Neural caption generation
We use the ClipCap caption generation system (Mokady et al., 2021) without fine-tuning. For details and hyperparameters of the generation process, see Appendix A. In short, ClipCap processes an image with a CLIP visual encoder (Radford et al., 2021) and learns a mapping from the resulting visual embedding to a sequence of embeddings in GPT-2 space (Radford et al., 2019), which are used to kickstart the generation of a sequence of tokens. We report experiments with the ClipCap variant fine-tuned on the COCO data-set (Lin et al., 2014), in which the weights of the multimodal mapper were updated and those of the language model (GPT-2) were kept frozen. We obtained very similar results with the other publicly available ClipCap variants. We generate a single neural caption for each IMAGECODE target image by passing it through ClipCap. Note that, since there is no way to make this out-of-the-box architecture distractor-aware, the neural captions do not take distractors into account.

setup                                  acc
neural captions, 0-shot                27.9
human captions, 0-shot                 17.4
human captions, Krojer et al.'s best   22.3

Table 1: Percentage IMAGECODE accuracy of the 0-shot image retriever when given neural vs. human captions as input. The last row reports the accuracy of the best fine-tuned, architecturally-adjusted model from Krojer et al. (2022) (featuring a context module, temporal embeddings and a ViT-B/16 backbone).
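To make the generation pipeline just described more concrete, the following is a minimal, hedged sketch of ClipCap-style prefix captioning. It is not the authors' or ClipCap's actual code: the linear mapper is an untrained placeholder standing in for ClipCap's trained multimodal mapper, the CLIP backbone, prefix length and greedy decoding are arbitrary illustrative choices, and the image path is hypothetical.

    import torch
    import clip                                    # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, preprocess = clip.load("ViT-B/32", device=device)
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    prefix_len = 10                                # number of "visual" prefix embeddings
    # Placeholder for ClipCap's trained mapper from CLIP space to GPT-2 input space:
    mapper = torch.nn.Linear(512, prefix_len * gpt2.config.n_embd).to(device)

    @torch.no_grad()
    def generate_caption(image_path: str, max_new_tokens: int = 20) -> str:
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        img_emb = clip_model.encode_image(image).float()          # (1, 512) CLIP image embedding
        prefix = mapper(img_emb).view(1, prefix_len, gpt2.config.n_embd)
        generated = prefix                                        # running sequence of input embeddings
        tokens = []
        for _ in range(max_new_tokens):                           # greedy left-to-right decoding
            logits = gpt2(inputs_embeds=generated).logits[:, -1, :]
            next_id = logits.argmax(dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
            tokens.append(next_id.item())
            next_emb = gpt2.transformer.wte(next_id).unsqueeze(1) # embed new token, append to sequence
            generated = torch.cat([generated, next_emb], dim=1)
        return tokenizer.decode(tokens)

    # print(generate_caption("scene_042/frame_3.jpg"))            # hypothetical frame path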
Image retrieval
We use the simplest architecture proposed by Krojer et al. (2022) (the one without context module and temporal embeddings), which amounts to a standard CLIP retriever from Radford et al. (2021). The caption and each image in the set are passed through textual and visual encoders, respectively. Retrieval is successful if the dot product between the resulting caption and target image representations is larger than that of the embedded caption with any distractor representation. We use the ResNet-based CLIP visual encoder (He et al., 2015), whereas Krojer et al. (2022) used the ViT-B/16 architecture. We found the former to have a slightly higher 0-shot retrieval accuracy compared to the one they used (17.4% in Table 1 here vs. 14.9% in their paper).
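Since the retriever is a standard CLIP model, the retrieval step is simple to sketch. The code below is a hedged illustration reusing the hypothetical ImageCoDeItem structure from the Data section; the RN50x4 backbone and the L2-normalization before the dot product are our assumptions, not necessarily the exact configuration used in the experiments.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("RN50x4", device=device)    # a ResNet-based CLIP backbone

    @torch.no_grad()
    def retrieve(caption: str, image_paths) -> int:
        """Return the index of the candidate image with the highest caption-image score."""
        text = clip.tokenize([caption], truncate=True).to(device)
        images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
        text_emb = model.encode_text(text)
        image_embs = model.encode_image(images)
        # Normalize so the dot product is a cosine similarity; retrieval succeeds
        # when the target frame scores higher than all nine distractors.
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
        scores = image_embs @ text_emb.T                       # (10, 1) caption-image scores
        return scores.squeeze(1).argmax().item()

    def accuracy(items, captions) -> float:
        """Fraction of items whose predicted frame is the target (chance level: 10%)."""
        hits = sum(retrieve(cap, item.image_paths) == item.target_index
                   for item, cap in zip(items, captions))
        return hits / len(items)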
3 Results and analysis

Neural vs. human caption performance
As shown in Table 1, the out-of-the-box neural image retrieval model has a clear preference for neural captions. It reaches 27.9% IMAGECODE accuracy when taking neural captions as input, vs. 17.4% with human captions (chance level is at 10%). For comparison, the best fine-tuned, architecture-adjusted model of Krojer et al. (2022) reached 22.3% performance with human captions.
A concrete sense of the differences between the two types of captions is given by the examples in Fig. 1. The examples in this figure are picked randomly; based on manual inspection of a larger set, we are confident they are representative of the full data. Clearly, neural captions are shorter (avg. length of 11.4 tokens vs. 23.2 for human captions) and more plainly descriptive (although the description is mostly only vaguely related to what is actually depicted). Since there is no way to make the out-of-the-box ClipCap system distractor-aware, the neural captions do not highlight discriminative aspects of a target image compared to the distractors. Human captions, on the other hand, use very articulated language to highlight what is unique about the target compared to the closest distractors (often focusing on rather marginal aspects of the image because of their discriminativeness, e.g., for the first example in the figure, the fact that the blue backpack is hardly visible). It is not surprising that a generic image retriever, which was not trained to handle this highly context-based linguistic style, would not get much useful information out of the human captions. It is interesting, however, that this generic system performs relatively well with the neural captions, given how off-the-mark and non-discriminative the latter typically are.
As more quantitative cues of the differences between caption types, we observe that human captions make more use of both rare lemmas and function words (see frequency plots in Appendix B; code to reproduce our analysis with human and model-generated captions is available at https://github.com/franfranz/emecomm_context). Extracting the lemmas that are statistically most strongly associated with the human caption set (see Appendix C for the method and the full top list), we observe “meta-visual” words such as visible and see, pronouns and determiners cuing anaphoric structure (the, her, his), and function words signaling a more complex sentence structure, such as auxiliaries, negation and connectives. Among the most typical neural lemmas, we find instead general terms for concrete entities, such as people, woman, table and food.
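As an illustration of the kind of corpus comparison described in this paragraph, here is a minimal, hedged sketch. It assumes spaCy for tokenization and lemmatization, and it uses a generic smoothed log-odds score as a stand-in for the association measure detailed in Appendix C, which may well differ; the function and helper names are our own.

    import math
    from collections import Counter
    import spacy                     # pip install spacy; python -m spacy download en_core_web_sm

    nlp = spacy.load("en_core_web_sm")

    def lemma_counts(captions):
        """Count alphabetic lemmas over a list of caption strings."""
        counts, n_tokens = Counter(), 0
        for doc in nlp.pipe(captions):
            for tok in doc:
                if tok.is_alpha:
                    counts[tok.lemma_.lower()] += 1
                    n_tokens += 1
        return counts, n_tokens

    def top_associated_lemmas(human_caps, neural_caps, k=20):
        """Lemmas most over-represented in the human captions relative to the neural ones."""
        h_counts, h_total = lemma_counts(human_caps)
        n_counts, n_total = lemma_counts(neural_caps)
        vocab = set(h_counts) | set(n_counts)
        score = {
            lemma: math.log((h_counts[lemma] + 1) / (h_total + len(vocab)))
                 - math.log((n_counts[lemma] + 1) / (n_total + len(vocab)))
            for lemma in vocab
        }
        return sorted(score, key=score.get, reverse=True)[:k]

    def avg_length_in_tokens(captions):
        """Average caption length in tokens (the paper reports 11.4 neural vs. 23.2 human)."""
        docs = list(nlp.pipe(captions))
        return sum(len(doc) for doc in docs) / len(docs)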
Are neural captions really discriminative?
By looking at Figure 1, we see that neural captions might be (very noisily) descriptive of the target, but they seem hardly discriminative with respect to the nearest distractors. Recall that each IMAGECODE set contains a sequence of 10 frames from the same scene. In general, the frames that are farther away in time might be easier to discriminate than the temporally adjacent ones.
Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective Zijian Zhang1 Chang Shu23 Ya Xiao1 Yuan Shen1 Di Zhu1 Jing Xiao210 玖币0人下载