Visualizing the Obvious: A Concreteness-based Ensemble Model
for Noun Property Prediction
Yue Yang, Artemis Panagopoulou,
Marianna Apidianaki, Mark Yatskar, Chris Callison-Burch
Department of Computer and Information Science, University of Pennsylvania
{yueyang1, artemisp, marapi, ccb, myatskar}@seas.upenn.edu
Abstract

Neural language models encode rich knowledge about entities and their relationships which can be extracted from their representations using probing. Common properties of nouns (e.g., red strawberries, small ant) are, however, more challenging to extract compared to other types of knowledge because they are rarely explicitly stated in texts. We hypothesize this to mainly be the case for perceptual properties which are obvious to the participants in the communication. We propose to extract these properties from images and use them in an ensemble model, in order to complement the information that is extracted from language models. We consider perceptual properties to be more concrete than abstract properties (e.g., interesting, flawless). We propose to use the adjectives' concreteness score as a lever to calibrate the contribution of each source (text vs. images). We evaluate our ensemble model in a ranking task where the actual properties of a noun need to be ranked higher than other non-relevant properties. Our results show that the proposed combination of text and images greatly improves noun property prediction compared to powerful text-based language models.¹
1 Introduction

Common properties of concepts or entities (e.g., "These strawberries are red") are rarely explicitly stated in texts, contrary to more specific properties which bring new information in the communication (e.g., "These strawberries are delicious"). This phenomenon, known as "reporting bias" (Gordon and Van Durme, 2013; Shwartz and Choi, 2020), makes it difficult to learn, or retrieve, perceptual properties from text. However, noun property identification is an important task which may allow AI applications to perform commonsense reasoning in a way that matches people's psychological or cognitive predispositions and can improve agent communication (Lazaridou et al., 2016). Furthermore, identifying noun properties can contribute to better modeling concepts and entities, learning affordances (i.e., defining the possible uses of an object based on its qualities or properties), and understanding models' knowledge about the world.

[Figure 1 illustration: for the noun "cat", three panels show ranked candidate properties produced by (a) a language model probed with the prompt "A cat is generally [MASK].", (b) CLIP, which scores the prompt "An object with the property of [MASK]." against images of the noun, and (c) CEM, which ensembles the LM and CLIP predictions.]

Figure 1: Our task is to retrieve relevant properties of nouns from a set of candidates. We tackle the task using (a) Cloze-task probing; (b) CLIP, to compute the similarity between the properties and images of the noun; (c) a Concreteness Ensemble Model (CEM), which ensembles the language-model and CLIP predictions based on the properties' concreteness ratings.

* Equal Contribution.

Models that combine different modalities provide a sort of grounding which helps to alleviate the reporting bias problem (Kiela et al., 2014; Lazaridou et al., 2015; Zhang et al., 2022). For example, multimodal models are better at predicting color attributes compared to text-based language models (Paik et al., 2021; Norlund et al., 2021). Furthermore, visual representations of concrete objects
improve performance in downstream NLP tasks (Hewitt et al., 2018). Inspired by this line of work, we expect concrete visual properties of nouns to be more accessible through images, and text-based language models to better encode abstract semantic properties. We propose an ensemble model which combines information from these two sources for English noun property prediction.

¹ Code and data are available at https://github.com/artemisp/semantic-norms
We frame property identification as a ranking task, where relevant properties for a noun need to be retrieved from a set of candidate properties found in association norm datasets (McRae et al., 2005; Devereux et al., 2014; Norlund et al., 2021). We experiment with text-based language models (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019) and with CLIP (Radford et al., 2021), which we query using a slot filling task, as shown in Figures 1(a) and (b). Our ensemble model (Figure 1(c)) combines the strengths of the language and vision models, by specifically privileging the former or latter type of representation depending on the concreteness of the processed properties (Brysbaert et al., 2014). Given that concrete properties are characterized by a higher degree of imageability (Friendly et al., 1982), our model trusts the visual model for perceptual and highly concrete properties (e.g., color adjectives: red, green), and the language model for abstract properties (e.g., free, infinite). Our results confirm that CLIP can identify nouns' perceptual properties better than language models, which contain higher-quality information about abstract properties. Our ensemble model, which combines the two sources of knowledge, outperforms the individual models on the property ranking task by a significant margin.
2 Related Work

Probing has been widely used in previous work for exploring the semantic knowledge that is encoded in language models. A common approach has been to convert the facts, properties, and relations found in external knowledge sources into "fill-in-the-blank" cloze statements, and to use them to query language models. Apidianaki and Garí Soler (2021) do so for nouns' semantic properties and highlight how challenging it is to retrieve this kind of information from BERT representations (Devlin et al., 2019).

Furthermore, slightly different prompts tend to retrieve different semantic information (Ettinger, 2020), compromising the robustness of semantic probing tasks. We propose to mitigate these problems by also relying on images.

Features extracted from different modalities can complement the information found in texts. Multimodal distributional models, for example, have been shown to outperform text-based approaches on semantic benchmarks (Silberer et al., 2013; Bruni et al., 2014; Lazaridou et al., 2015). Similarly, ensemble models that integrate multimodal and text-based models outperform models that only rely on one modality in tasks such as visual question answering (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Yang et al., 2021b), visual entailment (Song et al., 2022), reading comprehension, natural language inference (Zhang et al., 2021; Kiros et al., 2018), text generation (Su et al., 2022), word sense disambiguation (Barnard and Johnson, 2005), and video retrieval (Yang et al., 2021a).

We extend this investigation to noun property prediction. We propose a novel noun property retrieval model which combines information from language and vision models, and tunes their respective contributions based on property concreteness (Brysbaert et al., 2014).

Concreteness is a graded notion that strongly correlates with the degree of imageability (Friendly et al., 1982; Byrne, 1974); concrete words generally tend to refer to tangible objects that the senses can easily perceive (Paivio et al., 1968). We extend this idea to noun properties and hypothesize that vision models would have better knowledge of perceptual, and more concrete, properties (e.g., red, flat, round) than text-based language models, which would better capture abstract properties (e.g., free, inspiring, promising). We evaluate our ensemble model using concreteness scores automatically predicted by a regression model (Charbonnier and Wartena, 2019). We compare these results to the performance of the ensemble model with manual (gold) concreteness ratings (Brysbaert et al., 2014). In previous work, concreteness was measured based on the idea that abstract concepts relate to varied and composite situations (Barsalou and Wiemer-Hastings, 2005). Consequently, visually grounded representations of abstract concepts (e.g., freedom) should be more complex and diverse than those of concrete words (e.g., dog) (Lazaridou et al., 2015; Kiela et al., 2014).

Lazaridou et al. (2015) specifically measure the entropy of the vectors induced by multimodal models, which serves as an expression of how varied the information they encode is. They demonstrate that the entropy of multimodal vectors strongly correlates with the degree of abstractness of words.
3 Experimental Setup

3.1 Task Formulation

Given a noun N and a set of candidate properties 𝒫, a model needs to select the properties P_N ⊂ 𝒫 that apply to N. The candidate properties are the set of all adjectives retained from a resource (cf. Section 3.2), which characterize different nouns. A model needs to rank properties that apply to N higher than properties that apply to other nouns in the resource. We consider that a property correctly characterizes a noun if this property has been proposed for that noun by the annotators.
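To make the ranking setup concrete, the sketch below ranks a noun's candidate properties by model score and evaluates the ranking with average precision; the function names, the toy scores, and the choice of average precision as the metric are illustrative, not taken from the paper.

```python
# Minimal sketch of the ranking-based setup (illustrative names and metric).
from typing import Dict, List, Set

def rank_properties(scores: Dict[str, float]) -> List[str]:
    """Sort candidate properties by model score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)

def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Average precision of the gold properties within the ranked list."""
    hits, precision_sum = 0, 0.0
    for i, prop in enumerate(ranked, start=1):
        if prop in gold:
            hits += 1
            precision_sum += hits / i
    return precision_sum / max(len(gold), 1)

# Toy example for the noun "strawberry".
scores = {"red": 2.1, "sweet": 1.7, "metallic": -0.3, "loud": -1.2}
print(rank_properties(scores))                                        # ['red', 'sweet', 'metallic', 'loud']
print(average_precision(rank_properties(scores), {"red", "sweet"}))   # 1.0
```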
3.2 Datasets

FEATURE NORMS: The McRae et al. (2005) dataset contains feature norms for 541 objects annotated by 725 participants. We follow Apidianaki and Garí Soler (2021) and only use the IS_ADJ features of noun concepts, where the adjective describes a noun property. In total, there are 509 noun concepts with at least one IS_ADJ feature, and 209 unique properties. The FEATURE NORMS dataset contains both perceptual properties (e.g., tall, fluffy) and non-perceptual ones (e.g., intelligent, expensive).
MEMORY COLORS: This dataset (Norlund et al., 2021) contains 109 nouns with an associated image and its corresponding prototypical color; there are 11 colors in total. The data were scraped from existing knowledge bases on the web.
CONCEPT PROPERTIES: This dataset was created at the Centre for Speech, Language and Brain (Devereux et al., 2014). It contains concept property norm annotations collected from 30 participants. The data comprise 601 nouns with 400 unique properties. We keep aside 50 nouns (which are not in FEATURE NORMS and MEMORY COLORS) as our development set (dev). We use the dev set for prompt selection and hyper-parameter tuning. We call the rest of the dataset CONCEPT PROPERTIES-test and use it for evaluation.
CONCRETENESS DATASET: The Brysbaert et al. (2014) dataset contains manual concreteness ratings for 37,058 English word lemmas and 2,896 two-word expressions, gathered through crowdsourcing. The original concreteness scores range from 0 to 5. We map them to [0,1] by dividing each score by 5.

Dataset              # Ns   # Ps   N-P pairs   Ps per N
FEATURE NORMS         509    209        1592        3.1
CONCEPT PROPERTIES    601    400        3983        6.6
MEMORY COLORS         109     11         109        1.0

Table 1: Statistics of the ground-truth datasets. We show the number of nouns (# Ns), properties (# Ps) and noun-property pairs (N-P pairs), as well as the average number of properties per noun in each dataset.
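For reference, a minimal sketch of the rescaling step described above; the file name and column headers are assumptions about how the Brysbaert et al. (2014) ratings are stored, not a documented format.

```python
# Sketch of the concreteness rescaling step (hypothetical file name and columns).
import csv

def load_concreteness(path: str) -> dict:
    """Return a word -> concreteness mapping, rescaled from [0, 5] to [0, 1]."""
    scores = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            scores[row["Word"].lower()] = float(row["Conc.M"]) / 5.0
    return scores

concreteness = load_concreteness("brysbaert_concreteness.tsv")  # hypothetical path
```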
3.3 Models

3.3.1 Language Models (LMs)

We query language models about their knowledge of noun properties using cloze-style prompts (cf. Appendix A.1). These contain the nouns in singular or plural form, and the [MASK] token at the position where the property should appear (e.g., "Strawberries are [MASK]"). A language model assigns a probability score to a candidate property by relying on the wordpieces preceding and following the [MASK] token,² W\t = (w_1, ..., w_{t-1}, w_{t+1}, ..., w_{|W|}):

    Score_LM(P) = log P_LM(w_t = P | W\t)    (1)

where P_LM(·) is the probability assigned by the language model. We experiment with BERT-LARGE (Devlin et al., 2019), ROBERTA-LARGE (Liu et al., 2019), GPT2-LARGE (Radford et al., 2019) and GPT3-DAVINCI, which have been shown to deliver impressive performance in Natural Language Understanding tasks (Yamada et al., 2020; Takase and Kiyono, 2021; Aghajanyan et al., 2021).
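As an illustration of this probing setup, here is a minimal sketch of scoring a single-wordpiece property with a masked language model via HuggingFace Transformers; the exact prompt wording is illustrative, and the paper's prompts are listed in its Appendix A.1.

```python
# Minimal sketch of cloze-style probing with a masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased").eval()

def score_single_piece_property(prompt: str, prop: str) -> float:
    """Log-probability of a single-wordpiece property at the [MASK] position."""
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs[tokenizer.convert_tokens_to_ids(prop)].item()

print(score_single_piece_property("Strawberries are generally [MASK].", "red"))
```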
Our property ranking setup allows us to consider multi-piece adjectives (properties)³ which were excluded from open-vocabulary masking experiments (Petroni et al., 2019; Bouraoui et al., 2020; Apidianaki and Garí Soler, 2021). Since the candidate properties are known, we can obtain a score for a property composed of k pieces (P = (w_t, ..., w_{t+k}), k ≥ 1) by taking the average of the scores assigned by the LM to each piece:

    Score_LM(P) = (1/k) Σ_{i=0}^{k} log P_LM(w_{t+i} | W\(t+i))    (2)
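One possible implementation of the averaging in Equation (2) is sketched below, reusing the tokenizer, model, and torch import from the previous sketch: all k pieces are masked at once and the log-probabilities of the gold pieces at their positions are averaged. This is only one reading of the formula; the paper's code may condition each piece differently.

```python
# A possible implementation of the multi-piece averaging in Equation (2).
def score_multi_piece_property(noun: str, prop: str) -> float:
    pieces = tokenizer.tokenize(prop)                      # e.g., ['color', '##ful']
    prompt = f"{noun} are generally {' '.join([tokenizer.mask_token] * len(pieces))}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_positions = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().squeeze(-1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(**inputs).logits[0], dim=-1)
    piece_ids = tokenizer.convert_tokens_to_ids(pieces)
    scores = [log_probs[pos, pid].item() for pos, pid in zip(mask_positions, piece_ids)]
    return sum(scores) / len(scores)

print(score_multi_piece_property("Parrots", "colorful"))
```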
We report the results in Appendix E.4 and show that our model is better than other models at retrieving multi-piece properties.

² We also experiment with the Unidirectional Language Model (ULM), which yields the probability of the masked token conditioned on the past tokens W_{<t} = (w_1, ..., w_{t-1}).
³ BERT-type models split some words into multiple word pieces during tokenization (e.g., colorful → ['color', 'ful']) (Wu et al., 2016).
3.3.2 Multimodal Language Models (MLMs)

Vision Encoder-Decoder  MLMs are language models conditioned on other modalities than text, for example, images. For each noun N in our datasets, we collect a set of images I from the web.⁴ We probe an MLM similarly to LMs, using the same set of prompts. An MLM yields a score for each property given an image i ∈ I using Formula 3:

    Score_MLM(P, i) = log P_MLM(w_t = P | W\t, i)    (3)

where P_MLM(·) is the probability assigned by the multimodal language model. In addition to the context W\t, the MLM conditions on the image i.⁵ Then we aggregate over all the images I for the noun N to get the score for the property:

    Score_MLM(P) = (1/|I|) Σ_{i ∈ I} Score_MLM(P, i)    (4)
ViLT  We experiment with the Transformer-based (Vaswani et al., 2017) VILT model (Kim et al., 2021) as an MLM. VILT uses the same tokenizer as BERT and is pretrained on the Google Conceptual Captions (GCC) dataset, which contains more than 3 million image-caption pairs for about 50k words (Sharma et al., 2018). Most other vision-language datasets contain a significantly smaller vocabulary (10k words).⁶ In addition, VILT requires minimal image pre-processing and is an open visual vocabulary model.⁷ This contrasts with other multimodal architectures which require visual predictions before passing the images on to the multimodal layers (Li et al., 2019; Lu et al., 2019; Tan and Bansal, 2019). These have been shown to only marginally surpass text-only models (Yun et al., 2021).
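A sketch of the image-conditioned scoring of Equations (3) and (4) with ViLT's masked-LM head is given below; the HuggingFace checkpoint name and the image file names are assumptions, and the paper collects its own web images per noun.

```python
# Sketch of Equations (3)-(4): cloze scoring conditioned on images, averaged
# over a noun's image set. Checkpoint and image paths are illustrative.
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForMaskedLM

vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
vilt_model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm").eval()

def score_property_with_images(prompt: str, prop: str, image_paths: list) -> float:
    """Average log-probability of a single-piece property over a noun's images."""
    prop_id = vilt_processor.tokenizer.convert_tokens_to_ids(prop)
    scores = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = vilt_processor(image, prompt, return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == vilt_processor.tokenizer.mask_token_id).nonzero()[0].item()
        with torch.no_grad():
            logits = vilt_model(**inputs).logits[0, mask_pos]
        scores.append(torch.log_softmax(logits, dim=-1)[prop_id].item())
    return sum(scores) / len(scores)

# e.g., score_property_with_images("A cat is generally [MASK].", "cute", ["cat1.jpg", "cat2.jpg"])
```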
CLIP  We also use the CLIP model, which is pretrained on 400M image-caption pairs (Radford et al., 2021). CLIP is trained to align the embedding spaces learned from images and text using a contrastive loss as a learning objective. The CLIP model integrates a text encoder f_T and a visual encoder f_V which separately encode the text and image to vectors with the same dimension. Given a batch of image-text pairs, CLIP maximizes the cosine similarity for matched pairs while minimizing the cosine similarity for unmatched pairs.

[Figure 2 illustration: an image of a peacock and an image of a sunflower with their highest- and lowest-ranked prompts. Peacock, Top-1: "An object with the property of showy."; Bottom-1: "An object with the property of tartan." Sunflower, Top-1: "An object with the property of yellow."; Bottom-1: "An object with the property of kneaded."]

Figure 2: Examples of Top-1 and Bottom-1 prompts ranked by CLIP.

⁴ More details about the image collection procedure are given in Section 3.5.
⁵ We use the same averaging method shown in Equation 2 to handle multi-piece adjectives for MLMs.
⁶ The vocabulary size is much smaller than in BERT-like models, which are trained on a minimum of 8M words.
⁷ Open visual vocabulary models do not need elaborate image pre-processing via an image detection pipeline. As such, they are not restricted to the object classes that are recognized by the pre-processing pipeline.
We use CLIP to compute the cosine similarity of an image i ∈ I and the text prompt s_P: "An object with the property of [MASK]", where the [MASK] token is replaced with a candidate property P ∈ 𝒫. The score for each property P is the mean similarity between the sentence prompt s_P and all images I collected for a noun:

    Score_CLIP(P) = (1/|I|) Σ_{i ∈ I} cos(f_T(s_P), f_V(i))    (5)

This score serves to rank the candidate properties according to their relevance for a specific noun. Figure 2 shows the most and least relevant properties for the nouns peacock and sunflower.
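The scoring in Equation (5) can be sketched as follows with the HuggingFace CLIP implementation; the checkpoint and image paths are illustrative.

```python
# Sketch of the CLIP scoring in Equation (5).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(candidate: str, image_paths: list) -> float:
    """Mean cosine similarity between the property prompt and a noun's images."""
    prompt = f"An object with the property of {candidate}."
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = clip_processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = clip_model.get_text_features(input_ids=inputs.input_ids,
                                                attention_mask=inputs.attention_mask)
        image_emb = clip_model.get_image_features(pixel_values=inputs.pixel_values)
    sims = torch.cosine_similarity(text_emb, image_emb)   # one similarity per image
    return sims.mean().item()

# Rank candidates for a noun, e.g.:
# sorted(candidates, key=lambda p: clip_score(p, peacock_images), reverse=True)
```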
3.3.3 Concreteness Ensemble Model (CEM)

The concreteness score for a property guides CEM towards "trusting" the language or the vision model more. We propose two CEM flavors, which we describe as CEM-PRED and CEM-GOLD. CEM-PRED uses the score (c_P ∈ [0,1]) that is proposed by our concreteness prediction model for every candidate property P ∈ 𝒫, while CEM-GOLD uses the score for P in the Brysbaert et al. (2014) dataset.⁸ If there is no gold score for a property, we use

⁸ Properties in MEMORY COLORS have the highest average concreteness scores (0.82), followed by properties in FEATURE NORMS (0.64) and CONCEPT PROPERTIES (0.62).
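This excerpt does not spell out CEM's exact combination rule, so the sketch below is only an illustrative guess: a linear interpolation between rank-normalized LM and CLIP scores, gated by the property's concreteness score c_P, with an assumed default of 0.5 when no score is available.

```python
# Illustrative sketch only: the combination rule and the 0.5 default are our
# assumptions, not the paper's formula.
def rank_normalize(scores: dict) -> dict:
    """Map raw scores to (0, 1] by rank so LM and CLIP scores are comparable."""
    ordered = sorted(scores, key=scores.get)
    return {p: (i + 1) / len(ordered) for i, p in enumerate(ordered)}

def cem_scores(lm_scores: dict, clip_scores: dict, concreteness: dict, default_c: float = 0.5) -> dict:
    """Trust CLIP more for concrete properties and the LM more for abstract ones."""
    lm_n, clip_n = rank_normalize(lm_scores), rank_normalize(clip_scores)
    return {p: concreteness.get(p, default_c) * clip_n[p]
               + (1 - concreteness.get(p, default_c)) * lm_n[p]
            for p in lm_scores}
```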