improve performance in downstream NLP tasks
(Hewitt et al., 2018). Inspired by this line of work,
we expect concrete visual properties of nouns to
be more accessible through images, and text-based
language models to better encode abstract semantic
properties. We propose an ensemble model which
combines information from these two sources for
English noun property prediction.
We frame property identification as a ranking
task, where relevant properties for a noun need
to be retrieved from a set of candidate properties
found in association norm datasets (McRae et al.,
2005; Devereux et al., 2014; Norlund et al., 2021).
We experiment with text-based language models (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019) and with CLIP (Radford et al., 2021), which we query using a slot filling task, as shown in Figures 1(a) and (b). Our ensemble model (Figure 1(c)) combines the strengths of the language and vision models by privileging the former or the latter type of representation depending on the concreteness of the processed properties (Brysbaert et al., 2014). Given that concrete properties are characterized by a higher degree of imageability (Friendly et al., 1982), our model trusts the visual model for perceptual and highly concrete properties (e.g., color adjectives: red, green), and the language model for abstract properties (e.g., free, infinite). Our results confirm that CLIP can identify nouns’ perceptual properties better than language models, which in turn contain higher-quality information about abstract properties. Our ensemble model, which combines the two sources of knowledge, outperforms the individual models on the property ranking task by a significant margin.
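To make the slot filling setup concrete, the sketch below ranks candidate properties for a noun with a masked language model; the prompt template, the candidate set, and the assumption that each property is a single token are illustrative choices, not necessarily those used in our experiments.

```python
# Minimal sketch of the slot filling query in Figure 1(a): score each
# candidate property in the [MASK] slot of a prompt and rank the
# candidates by log-probability. Template and candidates are illustrative.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def rank_properties(noun, candidates, template="{} can be [MASK]."):
    inputs = tokenizer(template.format(noun), return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        log_probs = model(**inputs).logits[0, mask_pos].log_softmax(-1)
    # Assumes each candidate property is a single wordpiece.
    scores = {p: log_probs[tokenizer.convert_tokens_to_ids(p)].item()
              for p in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_properties("strawberries", ["red", "sweet", "free", "infinite"]))
```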
2 Related Work
Probing has been widely used in previous work
for exploring the semantic knowledge that is encoded in language models. A common approach has been to convert the facts, properties, and relations found in external knowledge sources into “fill-in-the-blank” cloze statements, and to use them to
query language models. Apidianaki and Garí Soler
(2021) do so for nouns’ semantic properties and
highlight how challenging it is to retrieve this kind
of information from BERT representations (Devlin et al., 2019).
Furthermore, slightly different prompts tend to
retrieve different semantic information (Ettinger,
2020), compromising the robustness of semantic
probing tasks. We propose to mitigate these problems by also relying on images.
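As an illustration of how images can complement text prompts, the sketch below ranks candidate properties with CLIP by comparing property-bearing captions against images of the noun; the caption template, the image source, and the averaging over images are illustrative assumptions rather than our exact setup.

```python
# Sketch of a CLIP-based property query (cf. Figure 1(b)): rank
# candidate properties by average image-text similarity. The caption
# template and image files are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_rank(noun, candidates, image_paths, template="a photo of a {} {}"):
    texts = [template.format(prop, noun) for prop in candidates]
    images = [Image.open(path) for path in image_paths]
    inputs = processor(text=texts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text  # (n_candidates, n_images)
    scores = sims.mean(dim=1)  # average similarity over the noun's images
    return sorted(zip(candidates, scores.tolist()),
                  key=lambda kv: kv[1], reverse=True)
```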
Features extracted from different modalities can
complement the information found in texts. Multimodal distributional models, for example, have
been shown to outperform text-based approaches
on semantic benchmarks (Silberer et al., 2013; Bruni et al., 2014; Lazaridou et al., 2015). Similarly, ensemble models that integrate multimodal and text-based models outperform models that only rely on one modality in tasks such as visual question answering (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Yang et al., 2021b), visual entailment (Song et al., 2022), reading comprehension, natural language inference (Zhang et al., 2021; Kiros et al., 2018), text generation (Su et al., 2022), word sense disambiguation (Barnard and Johnson, 2005), and video retrieval (Yang et al., 2021a).
We extend this investigation to noun property
prediction. We propose a novel noun property retrieval model which combines information from language and vision models, and tunes their respective contributions based on property concreteness (Brysbaert et al., 2014).
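A minimal sketch of one way such a concreteness-gated combination can be instantiated is given below; the linear interpolation over a min-max normalized concreteness rating is an assumed weighting scheme for illustration, not necessarily the exact one we adopt.

```python
# Sketch of a concreteness-gated ensemble: interpolate between the
# language-model and CLIP scores for a candidate property, weighting
# CLIP more heavily as the property gets more concrete. Assumes both
# scores have been normalized to a comparable range (e.g., ranks).
def ensemble_score(lm_score, clip_score, concreteness,
                   c_min=1.0, c_max=5.0):
    # Map a 1-5 concreteness rating (Brysbaert et al., 2014), or a
    # value predicted by a concreteness regression model, to [0, 1].
    w = (concreteness - c_min) / (c_max - c_min)
    return w * clip_score + (1.0 - w) * lm_score

# A highly concrete property (e.g., "red") leans on CLIP, while an
# abstract one (e.g., "free") leans on the language model.
print(ensemble_score(lm_score=0.3, clip_score=0.9, concreteness=4.2))
```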
Concreteness is a graded notion that strongly
correlates with the degree of imageability (Friendly
et al., 1982; Byrne, 1974); concrete words generally tend to refer to tangible objects that the senses can easily perceive (Paivio et al., 1968). We extend this
idea to noun properties and hypothesize that vision
models would have better knowledge of perceptual,
and more concrete, properties (e.g., red, flat, round) than text-based language models, which would better capture abstract properties (e.g., free, inspiring,
promising). We evaluate our ensemble model using concreteness scores automatically predicted by a regression model (Charbonnier and Wartena, 2019). We compare these results to the performance of the ensemble model with manual (gold) concreteness ratings (Brysbaert et al., 2014). In
previous work, concreteness was measured based
on the idea that abstract concepts relate to varied
and composite situations (Barsalou and Wiemer-Hastings, 2005). Consequently, visually grounded
representations of abstract concepts (e.g., freedom)
should be more complex and diverse than those of
concrete words (e.g., dog) (Lazaridou et al., 2015; Kiela et al., 2014).
Lazaridou et al. (2015) specifically measure the
entropy of the vectors induced by multimodal models, which serves as an expression of how varied the