
we add the indefinite article (a/an) in front of each noun and use all verbs in the present progressive tense, as this is most frequent in image-text datasets.1 For example, from WOMAN-IS-SIT we generate “a woman is sitting.”, and from MAN-HOLD-CAMERA “a man is holding a camera.”.
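To illustrate the template-based generation, here is a minimal Python sketch; the conjugation table, the a/an heuristic, and the function names are illustrative assumptions rather than the authors' actual implementation:

from typing import Tuple

# Hypothetical present-progressive forms for a few verbs (illustrative only).
PROGRESSIVE = {"sit": "sitting", "hold": "holding", "wear": "wearing"}

def article(noun: str) -> str:
    """Simple a/an heuristic based on the first letter."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def triplet_to_sentence(triplet: Tuple[str, str, str]) -> str:
    """Render a SUBJECT-PREDICATE-ARGUMENT triplet as a template sentence."""
    subject, predicate, argument = (w.lower() for w in triplet)
    if predicate == "is":
        # Intransitive case: WOMAN-IS-SIT -> "a woman is sitting."
        return f"{article(subject)} {subject} is {PROGRESSIVE[argument]}."
    # Transitive case: MAN-HOLD-CAMERA -> "a man is holding a camera."
    return (f"{article(subject)} {subject} is {PROGRESSIVE[predicate]} "
            f"{article(argument)} {argument}.")

print(triplet_to_sentence(("WOMAN", "IS", "SIT")))     # a woman is sitting.
print(triplet_to_sentence(("MAN", "HOLD", "CAMERA")))  # a man is holding a camera.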
This template-based approach is necessary for our controlled evaluation. As the choice of the exact template for the construction of the sentences may influence the results,2 we additionally evaluate the models using a slightly different template and show that the overall result patterns remain largely similar (see Appendix A.4.2).
Final evaluation set   The final evaluation set contains 2584 triplets. For 1486 of these triplets, the distractor sentence contains an incorrect predicate, and for the other 1098 triplets, the distractor contains an incorrect noun. More detailed statistics regarding the number of triplets concerning specific concepts are provided in Appendix A.2.
A note on perceived gender annotations   Our evaluation dataset uses annotations from the Open Images dataset, which rely on the physical appearance of persons to annotate their perceived gender. We use the provided annotations, and the resulting biases are unfortunately reproduced in our evaluation set. We discuss this issue in further detail in the Ethics Statement (Section 8).
Salminen et al. (2018) report that gender classification from face pictures by human annotators reaches an inter-annotator agreement greater than 95%. True gender cannot be classified, and high inter-annotator agreement does not imply a correct gender choice, but we expect the gender annotations of Open Images to be reliable enough to be used as a basis for our analyses.
1 In cases where multiple connecting predicates between a verb and a noun are plausible (e.g. “a man wearing glasses” vs. “a man with glasses”), we choose the construction that occurs most frequently in the Conceptual Captions training data (Sharma et al., 2018). This dataset is most commonly used for training V&L transformers.
2 For example, Ravichander et al. (2020) found that the results of some probing experiments can vary substantially with slight changes in wording.

3.2 Metric
We evaluate pre-trained models on their image-text matching performance in a zero-shot setting, i.e. without any further training. For each triplet, we test whether the models give a higher similarity score for the correct sentence than for the distractor sentence. We calculate accuracy for each pair, i.e. the model needs to succeed for both the example and the counter-balanced example triplet.
For each pair of triplets $(t_1, t_2) = ([I_1, S_1, S_2], [I_2, S_2, S_1])$, we calculate the following score:
$$
f(t_1, t_2) =
\begin{cases}
1, & \text{if } s(I_1, S_1) > s(I_1, S_2) \text{ and } s(I_2, S_2) > s(I_2, S_1) \\
0, & \text{otherwise}
\end{cases}
$$
where $s(I, S)$ denotes the similarity between an image $I$ and a sentence $S$. To obtain the similarity score, we use the softmaxed output of the image-text matching pretraining heads of the models.3
The final accuracy is the average score over all
pairs in the evaluation set. Chance performance is
at 25%.4
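To make the pair-level scoring explicit, the following minimal Python sketch computes f and the final accuracy; the score argument stands in for a model's softmaxed image-text matching output, and the data structures are illustrative assumptions, not the authors' implementation:

from typing import Callable, List, Tuple

# A triplet is (image, correct sentence, distractor sentence); the two triplets
# of a pair share their sentences with swapped roles: t1 = [I1, S1, S2], t2 = [I2, S2, S1].
Triplet = Tuple[object, str, str]

def pair_score(score: Callable[[object, str], float],
               t1: Triplet, t2: Triplet) -> int:
    """Return 1 only if the model prefers the correct sentence for BOTH triplets."""
    i1, s1, s2 = t1   # t1 = [I1, S1, S2]
    i2, _, _ = t2     # t2 = [I2, S2, S1] shares S1 and S2 with t1
    return int(score(i1, s1) > score(i1, s2) and score(i2, s2) > score(i2, s1))

def pairwise_accuracy(score: Callable[[object, str], float],
                      pairs: List[Tuple[Triplet, Triplet]]) -> float:
    """Average pair score over the evaluation set; chance level is 25%."""
    return sum(pair_score(score, t1, t2) for t1, t2 in pairs) / len(pairs)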
As the dataset was manually filtered and requires only a rather simple understanding of the images, we assume human performance to be close to 100%. To verify this claim, we had one person annotate a randomly sampled subset of 500 triplets. For each triplet, the annotator was asked to judge which of the two sentences describes the image better. The resulting performance was 100%.
A topline: the cropped task   In order to explore the effect of the visual distractors on this noun-predicate dependency task, we additionally evaluate all models in a cropped task: we reduce the image to the bounding box of the target object. Thus, the cropped image usually5 contains only the target object and no further visual distractors (i.e., the referent of the noun or the predicate in the distractor sentence is no longer present in the cropped image). To succeed at this (simpler) task, the model no longer needs to capture the predicate-noun dependency; it just needs to ground the single words correctly. We use this task to estimate how much the performance of the models is affected by the ability to ground nouns and predicates in our evaluation dataset, in comparison to the (more
3 For the model CLIP, we feed the image and both sentences at the same time and obtain a similarity score for both sentences, where $s(I_1, S_1) = 1 - s(I_1, S_2)$.
4 The similarity scores can fall into one of four possible configurations: $s(I_1, S_1) > s(I_1, S_2) \wedge s(I_2, S_1) > s(I_2, S_2)$; $s(I_1, S_1) < s(I_1, S_2) \wedge s(I_2, S_1) < s(I_2, S_2)$; $s(I_1, S_1) > s(I_1, S_2) \wedge s(I_2, S_1) < s(I_2, S_2)$; $s(I_1, S_1) < s(I_1, S_2) \wedge s(I_2, S_1) > s(I_2, S_2)$. The model succeeds only in the third configuration, hence chance performance is at 25%.
5
If the bounding boxes of the target and visual distractor
object overlap to a high degree, the cropped image might still
contain (parts of) the distractor object.
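As an illustration of the cropping step used for the topline task above, here is a minimal sketch assuming Pillow and Open Images-style normalized box coordinates; the function name and example values are hypothetical, not the authors' code:

from PIL import Image  # Pillow

def crop_to_target(image_path: str, bbox_norm: tuple) -> Image.Image:
    """Crop an image to the (normalized) bounding box of the target object.

    bbox_norm = (x_min, x_max, y_min, y_max) with values in [0, 1],
    following the Open Images box format (an assumption for this sketch).
    """
    image = Image.open(image_path)
    width, height = image.size
    x_min, x_max, y_min, y_max = bbox_norm
    # Convert normalized coordinates to pixel coordinates; PIL expects
    # (left, upper, right, lower).
    return image.crop((int(x_min * width), int(y_min * height),
                       int(x_max * width), int(y_max * height)))

# Example (hypothetical path and box): the cropped image is then paired with the
# same two sentences and scored exactly as in the full task.
# cropped = crop_to_target("example.jpg", (0.2, 0.6, 0.1, 0.9))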