Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?

Mitja Nikolaus 1,2
mitja.nikolaus@univ-amu.fr
Emmanuelle Salin 1 and Stephane Ayache 1 and Abdellah Fourtassi 1 and Benoit Favre 1
1 Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
2 Aix-Marseille Univ, CNRS, LPL, Aix-en-Provence, France
Abstract
Recent advances in vision-and-language modeling have seen the development of Transformer architectures that achieve remarkable performance on multimodal reasoning tasks. Yet, the exact capabilities of these black-box models are still poorly understood. While much of previous work has focused on studying their ability to learn meaning at the word level, their ability to track syntactic dependencies between words has received less attention. We take a first step in closing this gap by creating a new multimodal task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup. We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably, with some models performing relatively well and others at chance level. In an effort to explain this variability, our analyses indicate that the quality (and not only sheer quantity) of pretraining data is essential. Additionally, the best performing models leverage fine-grained multimodal pretraining objectives in addition to the standard image-text matching objectives.

This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
1 Introduction
Vision-and-language (V&L) models have recently shown substantial improvement on a range of multimodal reasoning tasks. Taking inspiration from successes in text-only Natural Language Processing (Devlin et al., 2019; Brown et al., 2020), state-of-the-art V&L models are usually composed of a Transformer-based architecture pre-trained in a self-supervised manner on large-scale data, and then fine-tuned on downstream tasks.
While these models show remarkable performance on a range of tasks, more controlled and systematic analyses are necessary in order to obtain a better understanding of their exact multimodal knowledge.

Target sentence: A man is wearing a hat.
Distractor sentence: A man is wearing glasses.
Figure 1: We evaluate V&L models on their ability to track predicate-noun dependencies that require a joint understanding of the linguistic and visual modalities. The task is to find the correct sentence (choosing between the target and distractor) that corresponds to the scene in the image. In this example, the models should connect the predicate "is wearing a hat" to "man". A model that does not track dependencies would judge the distractor sentence "A man is wearing glasses" as equally likely, as there is a man in the image, as well as a person who is wearing glasses.
A range of studies has investigated their ability to map words to their visual referents for nouns (Kazemzadeh et al., 2014; Mao et al., 2016; Shekhar et al., 2017) and verbs (Ronchi and Perona, 2015; Yatskar et al., 2016; Pratt et al., 2020; Hendricks and Nematzadeh, 2021), but there are only a few studies on whether recent V&L models can capture multimodal syntactic dependencies between words and concepts.
In this paper, we explore how well V&L models learn predicate-noun dependencies across modalities (see example in Figure 1). To this end, we create an evaluation set that contains carefully selected images and pairs of sentences with minimal differences. Given an image and two predicate-noun sentences, the models need to find the correct sentence corresponding to the image. Crucially, they can only succeed by taking into account the dependencies between the visual concepts in the image corresponding to the noun and predicate in the sentence.
As it has been shown that visual reasoning performance in several tasks can be spuriously augmented by capitalizing on textual biases in the training data (Goyal et al., 2017; Agrawal et al., 2018; Hendricks et al., 2018; Cao et al., 2020), we counter-balance our evaluation dataset in a way that controls for such exploitation of linguistic biases.
We evaluate pre-trained state-of-the-art V&L models in a zero-shot setting and find that the ability to track predicate-noun dependencies varies considerably from model to model. Of all models tested, UNITER (Chen et al., 2019) and LXMERT (Tan and Bansal, 2019) show the highest scores, but their performance is still far from optimal. Other models such as ViLBERT (Lu et al., 2019) and CLIP (Radford et al., 2021) perform at chance level. We discuss how differences in the models could explain their performance variability, highlighting the role of pretraining data quality and fine-grained multimodal pretraining objectives.
Code to reproduce the analyses and run the evaluation on new models is publicly available at https://github.com/mitjanikolaus/multimodal-predicate-noun-dependencies.
2 Related Work
Targeted evaluation of V&L models
Recently, a growing number of tasks have been created for the targeted evaluation of V&L models' abilities to perform various kinds of multimodal reasoning. Shekhar et al. (2017) create sets of distractor captions to analyze whether V&L models are sensitive to single-word replacements (with a focus on nouns). Similar targeted evaluation datasets have also been proposed for referring expressions (Chen et al., 2020), image-sentence matching (Hu et al., 2019), and Visual Question Answering (VQA; Bogin et al., 2021), with a focus on compositional reasoning.
Tasks such as visual semantic role labeling or situation recognition typically involve classifying the primary activity depicted in an image, as well as the semantic roles of the involved entities (Ronchi and Perona, 2015; Lu et al., 2016; Chao et al., 2015; Gupta and Malik, 2015; Yatskar et al., 2016; Pratt et al., 2020). While these studies demonstrate that V&L models can learn semantic roles to some degree in a supervised learning setup, such tasks do not allow for a controlled evaluation of models in a zero-shot setting.
In Hendricks and Nematzadeh (2021), the authors evaluate state-of-the-art V&L models in a controlled zero-shot setup and find that they still have more trouble understanding verbs compared to subjects or objects. They also observe that models trained on larger datasets with less descriptive captions perform worse than models trained on smaller, manually-annotated datasets.
Several works have also tried to shed more light on the precise multimodal semantic capabilities of V&L models using probing techniques. Salin et al. (2022) show that although state-of-the-art V&L models can grasp some multimodal concepts such as color, they still do not fully understand more difficult concepts such as object size and position in the image. Parcalabescu et al. (2021) use probing to demonstrate that such models still lack the capability to correctly count entities in an image.
Evaluation of grounded syntax
Akula et al. (2020) test for sensitivity to word order in referring expressions. Similarly, Thrush et al. (2022) study the ability of V&L models to take word order into account by designing adversarial examples that require differentiating between similar image and text pairs, where the text pairs differ only in their word order. Their results suggest that state-of-the-art models still lack precise compositional reasoning abilities.
Li et al. (2020a) study the so-called syntactic grounding of VisualBERT. They show that certain attention heads of the Transformer architecture attend to entities that are connected via syntactic dependency relationships. However, such probing experiments do not necessarily indicate to what degree a model actually uses the encoded information when making predictions.
In our work, we test a range of state-of-the-art models specifically on their ability to track predicate-noun dependencies. Crucially, we test the models in a much more controlled setting compared to previous work: our setup involves visual distractors as well as a control task, disentangling the challenge of understanding syntactic dependencies from simpler object and predicate recognition. Additionally, we strictly control for possible linguistic biases by counter-balancing all evaluation examples.
3 Methods
3.1 Evaluation Dataset
We construct an evaluation dataset that is suited for evaluating the sensitivity to visually grounded predicate-noun dependencies in a zero-shot setup. The data consists of pairs of triplets, where each triplet consists of an image $I$, a target sentence $S_1$, and a distractor sentence $S_2$. Target and distractor sentences are minimal pairs, i.e., one sentence differs from the other only with regard to either the noun (e.g., "A girl is sitting." vs. "A man is sitting.", Figure 2) or the predicate (e.g., "A man is wearing a hat." vs. "A man is wearing glasses.", Figure 1).
Crucially, the images always contain visual distractors, meaning that both the noun and the predicate of the distractor sentence are present in the image, but they do not have a noun-predicate relationship (e.g., for the distractor sentence "A man is wearing glasses", there is a man in the image who is not wearing glasses, and a person wearing glasses who is not a man). Thus, it is necessary to take into account the dependency between noun and predicate to distinguish between the target and the distractor sentence (Figure 1).
Controlling for linguistic biases
V&L models have been shown to sometimes rely on textual bias instead of using visual information (Goyal et al., 2017; Agrawal et al., 2018; Hendricks et al., 2018; Cao et al., 2020). For example, if a training dataset contains the phrase "a girl is sitting" more often than "a man is sitting", a model might prefer the caption "a girl is sitting" during evaluation based only on linguistic co-occurrence heuristics, irrespective of the visual content. In our evaluation dataset, we control for potential linguistic biases in the training datasets by pairing every triplet with a corresponding counter-balanced example in which target and distractor sentence are flipped. More specifically, for every triplet $(I_1, S_1, S_2)$, there exists a corresponding triplet $(I_2, S_2, S_1)$, as depicted in Figure 2. In this way, a model that does not take into account the visual modality cannot succeed at the task (see also Nikolaus and Fourtassi, 2021).
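To make the counter-balancing concrete, the following minimal sketch (in Python) shows how one example and its counter-example can be represented; the image identifiers and captions are invented for illustration, and the actual evaluation set is built from Open Images annotations rather than this exact structure.

```python
# Illustrative representation of one counter-balanced pair of triplets.
# Image file names are hypothetical; the real evaluation set stores
# Open Images IDs and template-generated sentences.
counterbalanced_pair = (
    {"image": "image_1.jpg",              # I1: a girl sitting, a man standing
     "target": "A girl is sitting.",      # S1
     "distractor": "A man is sitting."},  # S2
    {"image": "image_2.jpg",              # I2: a man sitting, a girl standing
     "target": "A man is sitting.",       # S2, now the target
     "distractor": "A girl is sitting."}, # S1, now the distractor
)

# A model that ignores the image assigns each sentence the same score for
# both images, so it can be correct on at most one of the two triplets and
# cannot exceed chance-level pairwise accuracy.
```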
Target sentence: A girl is sitting.
Distractor sentence: A man is sitting.
Target sentence: A man is sitting.
Distractor sentence: A girl is sitting.
Figure 2: Counter-balanced evaluation: each triplet has a corresponding counter-example, where target and distractor sentence are flipped.

Automatic pre-filtering
Our evaluation dataset is based on Open Images (Kuznetsova et al., 2020). We pre-filter the images based on existing human-annotated object and relationship labels and bounding boxes. The objects refer to persons, animals, as well as inanimate objects. The relationships can either describe an action that an object is engaged in (e.g., WOMAN-SIT) or an action linking two objects (e.g., MAN-WEAR-GLASSES). All nouns in the relationships selected for our dataset refer to persons, due to the lack of sufficient annotations for other kinds of agents.

We look for images that contain a target object-relationship pair as well as a distractor object-relationship pair for which either the target and distractor object are the same but the relationships differ, or vice versa (as in the example in Figure 1). Additional details on the pre-filtering can be found in Appendix A.1.
Manual selection
We manually select suitable images after the automated pre-filtering, in order to ensure high quality of each example and in particular to verify that the distractor sentences are indeed incorrect given the images. This step is crucial, because many of the annotations in Open Images are incomplete, and an image may contain, for example, a woman that is sitting but is not annotated as such (in this case, we disregard the image for our evaluation set).

We select pairs of examples and counter-examples and ensure that there are no duplicate images within the set of images for each object-relationship pair.
Sentence generation
We generate target and distractor sentences based on the verified object and relationship annotations from Open Images. We construct English sentences using a template-based approach. Given an object and a relationship, we add the indefinite article (a/an) in front of each noun and use all verbs in the present progressive tense, as this is most frequent in image-text datasets.[1] For example, from WOMAN-IS-SIT we generate "a woman is sitting.", and from MAN-HOLD-CAMERA "a man is holding a camera.".

This template-based approach is necessary for our controlled evaluation. As the choice of the exact template for the construction of the sentences may influence the results,[2] we additionally evaluate the models using a slightly different template and show that the overall result patterns remain largely similar (see Appendix A.4.2).

[1] In cases where multiple connecting predicates between a verb and a noun are plausible (e.g., "a man wearing glasses" vs. "a man with glasses"), we choose the construction that occurs most frequently in the Conceptual Captions training data (Sharma et al., 2018). This dataset is most commonly used for training V&L Transformers.
[2] For example, Ravichander et al. (2020) found that the results of some probing experiments can vary substantially with slight changes in wording.
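To make the template-based generation concrete, here is a minimal sketch of such a generator. The small vocabulary table and the hard-coded present-progressive forms are invented for illustration and are not the ones used in the paper.

```python
from typing import Optional

# Hypothetical, simplified sentence templates; the actual implementation
# derives its vocabulary from the Open Images relationship annotations.
PROGRESSIVE = {"sit": "sitting", "stand": "standing",
               "wear": "wearing", "hold": "holding"}

def indefinite(noun: str) -> str:
    """Prepend the indefinite article a/an (simple vowel heuristic)."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"

def generate_sentence(noun: str, verb: str, obj: Optional[str] = None) -> str:
    """Build a present-progressive sentence from an object-relationship pair."""
    parts = [indefinite(noun), "is", PROGRESSIVE[verb]]
    if obj is not None:
        parts.append(indefinite(obj))
    return " ".join(parts) + "."

print(generate_sentence("woman", "sit"))           # a woman is sitting.
print(generate_sentence("man", "hold", "camera"))  # a man is holding a camera.
```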
Final evaluation set
The final evaluation set contains 2584 triplets. For 1486 of these triplets, the distractor sentence contains an incorrect predicate, and for the other 1098 triplets, the distractor contains an incorrect noun. More detailed statistics regarding the number of triplets concerning specific concepts are provided in Appendix A.2.
A note on perceived gender annotations
Our evaluation dataset uses annotations from the Open Images dataset, which rely on the physical appearance of persons to annotate their perceived gender. We use the provided annotations, and the resulting biases are unfortunately reproduced in our evaluation set. We discuss this issue in further detail in the Ethics Statement (Section 8).

In Salminen et al. (2018), gender classification from face pictures by human annotators shows an inter-annotator agreement greater than 95%. True gender cannot be classified, and high inter-annotator agreement does not imply a correct gender choice, but we expect the gender annotations of Open Images to be reliable enough to be used as a basis for our analyses.
3.2 Metric
We evaluate pre-trained models on their image-text matching performance in a zero-shot setting, i.e., without any further training. For each triplet, we test whether the models give a higher similarity score to the correct sentence than to the distractor sentence. We calculate accuracy for each pair, i.e., the model needs to succeed for both the example and the counter-balanced example triplet.
For each pair of triplets $(t_1, t_2) = ([I_1, S_1, S_2], [I_2, S_2, S_1])$, we calculate the following score:

$$
f(t_1, t_2) =
\begin{cases}
1, & \text{if } s(I_1, S_1) > s(I_1, S_2) \text{ and } s(I_2, S_2) > s(I_2, S_1) \\
0, & \text{otherwise}
\end{cases}
$$

where $s(I, S)$ denotes the similarity between an image $I$ and a sentence $S$. To obtain the similarity score, we use the softmaxed output of the image-text matching pretraining heads of the models.[3] The final accuracy is the average score over all pairs in the evaluation set. Chance performance is at 25%.[4]
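The pairwise score can be computed along the following lines. This is a schematic sketch: `similarity` stands in for a model's softmaxed image-text matching score, and each triplet is assumed to be a dictionary with "image", "target", and "distractor" fields (an assumption for this example, not the format of the released code).

```python
from typing import Callable, Dict, List, Tuple

Triplet = Dict[str, str]  # keys: "image", "target", "distractor"
Similarity = Callable[[str, str], float]  # s(image, sentence)

def pair_score(t1: Triplet, t2: Triplet, similarity: Similarity) -> int:
    """f(t1, t2): 1 if the target beats the distractor for BOTH triplets
    of a counter-balanced pair, 0 otherwise."""
    ok1 = similarity(t1["image"], t1["target"]) > similarity(t1["image"], t1["distractor"])
    ok2 = similarity(t2["image"], t2["target"]) > similarity(t2["image"], t2["distractor"])
    return int(ok1 and ok2)

def accuracy(pairs: List[Tuple[Triplet, Triplet]], similarity: Similarity) -> float:
    """Average pairwise score over the evaluation set; chance level is 0.25."""
    return sum(pair_score(t1, t2, similarity) for t1, t2 in pairs) / len(pairs)
```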
As the dataset was manually filtered and requires only a rather simple understanding of the images, we assume human performance to be close to 100%. To verify this claim, we had one person annotate a randomly sampled subset of 500 triplets. For each triplet, the annotator was asked to judge which of the two sentences describes the image better. The resulting performance was 100%.
A topline: the cropped task
In order to explore the effect of the visual distractors on this noun-predicate dependency task, we additionally evaluate all models in a cropped task: we reduce the image to the bounding box of the target object. Thus, the cropped image usually[5] contains only the target object and no visual distractors (i.e., the referent of the noun or the predicate in the distractor sentence is no longer present in the cropped image). To succeed at this (simpler) task, the model no longer needs to capture the predicate-noun dependency; it just needs to ground the single words correctly. We use this task to estimate how much the performance of the models is affected by the ability to ground nouns and predicates in our evaluation dataset, in comparison to the (more difficult) main task.
[3] For the model CLIP, we feed the image and both sentences at the same time and obtain a similarity score for both sentences, where $s(I_1, S_1) = 1 - s(I_1, S_2)$.
[4] The similarity scores fall into one of four possible configurations: $s(I_1, S_1) > s(I_1, S_2) \wedge s(I_2, S_1) > s(I_2, S_2)$; $s(I_1, S_1) < s(I_1, S_2) \wedge s(I_2, S_1) < s(I_2, S_2)$; $s(I_1, S_1) > s(I_1, S_2) \wedge s(I_2, S_1) < s(I_2, S_2)$; $s(I_1, S_1) < s(I_1, S_2) \wedge s(I_2, S_1) > s(I_2, S_2)$. The model succeeds only in the third configuration, hence chance performance is at 25%.
[5] If the bounding boxes of the target and visual distractor object overlap to a high degree, the cropped image might still contain (parts of) the distractor object.
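A minimal sketch of the cropping step, assuming Open Images-style normalized bounding-box coordinates (XMin, XMax, YMin, YMax in [0, 1]) and using Pillow; the preprocessing in the released code may differ in its details.

```python
from PIL import Image

def crop_to_target(image_path: str, x_min: float, x_max: float,
                   y_min: float, y_max: float) -> Image.Image:
    """Crop an image to the (normalized) bounding box of the target object,
    removing visual distractors that lie outside the box."""
    image = Image.open(image_path)
    width, height = image.size
    box = (int(x_min * width), int(y_min * height),
           int(x_max * width), int(y_max * height))
    return image.crop(box)

# Example with hypothetical file name and coordinates:
# cropped = crop_to_target("image_1.jpg", 0.10, 0.55, 0.20, 0.95)
# cropped.save("image_1_cropped.jpg")
```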