
we add the indefinite article (a/an) in front of each noun and use all verbs in the present progressive tense, as this is most frequent in image-text datasets.1 For example, from WOMAN-IS-SIT we generate “a woman is sitting.”, and from MAN-HOLD-CAMERA “a man is holding a camera.”.
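To illustrate the template-based generation, here is a minimal Python sketch; the conjugation table, the a/an heuristic, and the function names are illustrative assumptions rather than the authors' actual implementation:

from typing import Tuple

# Hypothetical present-progressive forms for a few verbs (illustrative only).
PROGRESSIVE = {"sit": "sitting", "hold": "holding", "wear": "wearing"}

def article(noun: str) -> str:
    """Simple a/an heuristic based on the first letter."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def triplet_to_sentence(triplet: Tuple[str, str, str]) -> str:
    """Render a SUBJECT-PREDICATE-ARGUMENT triplet as a template sentence."""
    subject, predicate, argument = (w.lower() for w in triplet)
    if predicate == "is":
        # Intransitive case: WOMAN-IS-SIT -> "a woman is sitting."
        return f"{article(subject)} {subject} is {PROGRESSIVE[argument]}."
    # Transitive case: MAN-HOLD-CAMERA -> "a man is holding a camera."
    return (f"{article(subject)} {subject} is {PROGRESSIVE[predicate]} "
            f"{article(argument)} {argument}.")

print(triplet_to_sentence(("WOMAN", "IS", "SIT")))     # a woman is sitting.
print(triplet_to_sentence(("MAN", "HOLD", "CAMERA")))  # a man is holding a camera.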
This template-based approach is necessary for our controlled evaluation. As the choice of the exact template for the construction of the sentences may influence the results,2 we additionally evaluate the models using a slightly different template and show that the overall result patterns remain largely similar (see Appendix A.4.2).
Final evaluation set   The final evaluation set contains 2584 triplets. For 1486 of these triplets, the distractor sentence contains an incorrect predicate, and for the other 1098 triplets, the distractor contains an incorrect noun. More detailed statistics regarding the number of triplets concerning specific concepts are provided in Appendix A.2.
A note on perceived gender annotations   Our evaluation dataset uses annotations from the Open Images dataset, which rely on the physical appearance of persons to annotate their perceived gender. We use the provided annotations, and the resulting biases are unfortunately reproduced in our evaluation set. We discuss this issue in further detail in the Ethics Statement (Section 8).
Salminen et al. (2018) report that gender classification from face pictures by human annotators reaches an inter-annotator agreement greater than 95%. True gender cannot be classified, and high inter-annotator agreement does not imply a correct gender choice, but we expect the gender annotations of Open Images to be reliable enough to be used as a basis for our analyses.
1 In cases where multiple connecting predicates between a verb and a noun are plausible (e.g. “a man wearing glasses” vs. “a man with glasses”), we choose the construction that occurs most frequently in the Conceptual Captions training data (Sharma et al., 2018). This dataset is most commonly used for training V&L transformers.
2 For example, Ravichander et al. (2020) found that the results of some probing experiments can vary substantially with slight changes in wording.

3.2 Metric
We evaluate pre-trained models on their image-text matching performance in a zero-shot setting, i.e. without any further training. For each triplet, we test whether the models give a higher similarity score for the correct sentence than for the distractor sentence. We calculate accuracy for each pair, i.e. the model needs to succeed for both the example and the counter-balanced example triplet.
For each pair of triplets $(t_1, t_2) = ([I_1, S_1, S_2], [I_2, S_2, S_1])$, we calculate the following score:
$$
f(t_1, t_2) =
\begin{cases}
1, & \text{if } s(I_1, S_1) > s(I_1, S_2) \text{ and } s(I_2, S_2) > s(I_2, S_1) \\
0, & \text{otherwise}
\end{cases}
$$
where $s(I, S)$ denotes the similarity between an image $I$ and a sentence $S$. To obtain the similarity score, we use the softmaxed output of the image-text matching pretraining heads of the models.3
The final accuracy is the average score over all
pairs in the evaluation set. Chance performance is
at 25%.4
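To make the pair-level scoring explicit, the following minimal Python sketch computes f and the final accuracy; the score argument stands in for a model's softmaxed image-text matching output, and the data structures are illustrative assumptions, not the authors' implementation:

from typing import Callable, List, Tuple

# A triplet is (image, correct sentence, distractor sentence); the two triplets
# of a pair share their sentences with swapped roles: t1 = [I1, S1, S2], t2 = [I2, S2, S1].
Triplet = Tuple[object, str, str]

def pair_score(score: Callable[[object, str], float],
               t1: Triplet, t2: Triplet) -> int:
    """Return 1 only if the model prefers the correct sentence for BOTH triplets."""
    i1, s1, s2 = t1   # t1 = [I1, S1, S2]
    i2, _, _ = t2     # t2 = [I2, S2, S1] shares S1 and S2 with t1
    return int(score(i1, s1) > score(i1, s2) and score(i2, s2) > score(i2, s1))

def pairwise_accuracy(score: Callable[[object, str], float],
                      pairs: List[Tuple[Triplet, Triplet]]) -> float:
    """Average pair score over the evaluation set; chance level is 25%."""
    return sum(pair_score(score, t1, t2) for t1, t2 in pairs) / len(pairs)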
As the dataset was manually filtered and requires only a rather simple understanding of the images, we assume human performance to be close to 100%. To verify this claim, we had one person annotate a randomly sampled subset of 500 triplets. For each triplet, the annotator was asked to judge which of the two sentences describes the image better. The resulting performance was 100%.
A topline: the cropped task   In order to explore the effect of the visual distractors on this noun-predicate dependency task, we additionally evaluate all models in a cropped task: we reduce the image to the bounding box of the target object. Thus, the cropped image usually5 contains only the target object and no further visual distractors (i.e., the referent of the noun or the predicate in the distractor sentence is no longer present in the cropped image). To succeed at this (simpler) task, the model no longer needs to capture the predicate-noun dependency; it just needs to ground the single words correctly. We use this task to estimate how much the performance of the models is affected by the ability to ground nouns and predicates in our evaluation dataset, in comparison to the (more
3 For the model CLIP, we feed the image and both sentences at the same time and obtain a similarity score for both sentences, where $s(I_1, S_1) = 1 - s(I_1, S_2)$.
4 The similarity scores can fall into one of four possible configurations: $s(I_1, S_1) > s(I_1, S_2) \wedge s(I_2, S_1) > s(I_2, S_2)$; $s(I_1, S_1) < s(I_1, S_2) \wedge s(I_2, S_1) < s(I_2, S_2)$; $s(I_1, S_1) > s(I_1, S_2) \wedge s(I_2, S_1) < s(I_2, S_2)$; $s(I_1, S_1) < s(I_1, S_2) \wedge s(I_2, S_1) > s(I_2, S_2)$. The model succeeds only in the third configuration, hence chance performance is at 25%.
5
If the bounding boxes of the target and visual distractor
object overlap to a high degree, the cropped image might still
contain (parts of) the distractor object.
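As an illustration of the cropping step used for the topline task above, here is a minimal sketch assuming Pillow and Open Images-style normalized box coordinates; the function name and example values are hypothetical, not the authors' code:

from PIL import Image  # Pillow

def crop_to_target(image_path: str, bbox_norm: tuple) -> Image.Image:
    """Crop an image to the (normalized) bounding box of the target object.

    bbox_norm = (x_min, x_max, y_min, y_max) with values in [0, 1],
    following the Open Images box format (an assumption for this sketch).
    """
    image = Image.open(image_path)
    width, height = image.size
    x_min, x_max, y_min, y_max = bbox_norm
    # Convert normalized coordinates to pixel coordinates; PIL expects
    # (left, upper, right, lower).
    return image.crop((int(x_min * width), int(y_min * height),
                       int(x_max * width), int(y_max * height)))

# Example (hypothetical path and box): the cropped image is then paired with the
# same two sentences and scored exactly as in the full task.
# cropped = crop_to_target("example.jpg", (0.2, 0.6, 0.1, 0.9))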