improve performance in downstream NLP tasks
(Hewitt et al., 2018). Inspired by this line of work,
we expect concrete visual properties of nouns to
be more accessible through images, and text-based
language models to better encode abstract semantic
properties. We propose an ensemble model which
combines information from these two sources for
English noun property prediction.
We frame property identification as a ranking
task, where relevant properties for a noun need
to be retrieved from a set of candidate properties
found in association norm datasets (McRae et al.,
2005; Devereux et al., 2014; Norlund et al., 2021).
We experiment with text-based language models (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019) and with CLIP (Radford et al., 2021), which we query using a slot filling task, as shown in Figures 1(a) and (b). Our ensemble model (Figure 1(c)) combines the strengths of the language and vision models by privileging the former or the latter type of representation depending on the concreteness of the processed properties (Brysbaert et al., 2014). Given that concrete properties are characterized by a higher degree of imageability (Friendly et al., 1982), our model trusts the visual model for perceptual and highly concrete properties (e.g., color adjectives: red, green), and the language model for abstract properties (e.g., free, infinite). Our results confirm that CLIP can identify nouns’ perceptual properties better than language models, which in turn contain higher-quality information about abstract properties. Our ensemble model, which combines the two sources of knowledge, outperforms the individual models on the property ranking task by a significant margin.
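To make the slot filling setup concrete, the sketch below ranks candidate properties for a noun with a masked language model; the prompt template, the candidate set, and the assumption that each property is a single token are illustrative choices, not necessarily those used in our experiments.

```python
# Minimal sketch of the slot filling query in Figure 1(a): score each
# candidate property in the [MASK] slot of a prompt and rank the
# candidates by log-probability. Template and candidates are illustrative.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def rank_properties(noun, candidates, template="{} can be [MASK]."):
    inputs = tokenizer(template.format(noun), return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        log_probs = model(**inputs).logits[0, mask_pos].log_softmax(-1)
    # Assumes each candidate property is a single wordpiece.
    scores = {p: log_probs[tokenizer.convert_tokens_to_ids(p)].item()
              for p in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_properties("strawberries", ["red", "sweet", "free", "infinite"]))
```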
2 Related Work
Probing has been widely used in previous work
for exploring the semantic knowledge that is encoded in language models. A common approach has been to convert the facts, properties, and relations found in external knowledge sources into “fill-in-the-blank” cloze statements, and to use them to
query language models. Apidianaki and Garí Soler
(2021) do so for nouns’ semantic properties and
highlight how challenging it is to retrieve this kind
of information from BERT representations (Devlin et al., 2019).
Furthermore, slightly different prompts tend to
retrieve different semantic information (Ettinger,
2020), compromising the robustness of semantic
probing tasks. We propose to mitigate these problems by also relying on images.
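As an illustration of how images can complement text prompts, the sketch below ranks candidate properties with CLIP by comparing property-bearing captions against images of the noun; the caption template, the image source, and the averaging over images are illustrative assumptions rather than our exact setup.

```python
# Sketch of a CLIP-based property query (cf. Figure 1(b)): rank
# candidate properties by average image-text similarity. The caption
# template and image files are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_rank(noun, candidates, image_paths, template="a photo of a {} {}"):
    texts = [template.format(prop, noun) for prop in candidates]
    images = [Image.open(path) for path in image_paths]
    inputs = processor(text=texts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text  # (n_candidates, n_images)
    scores = sims.mean(dim=1)  # average similarity over the noun's images
    return sorted(zip(candidates, scores.tolist()),
                  key=lambda kv: kv[1], reverse=True)
```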
Features extracted from different modalities can
complement the information found in texts. Multimodal distributional models, for example, have
been shown to outperform text-based approaches
on semantic benchmarks (Silberer et al., 2013; Bruni et al., 2014; Lazaridou et al., 2015). Similarly, ensemble models that integrate multimodal and text-based models outperform models that only rely on one modality in tasks such as visual question answering (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Yang et al., 2021b), visual entailment (Song et al., 2022), reading comprehension, natural language inference (Zhang et al., 2021; Kiros et al., 2018), text generation (Su et al., 2022), word sense disambiguation (Barnard and Johnson, 2005), and video retrieval (Yang et al., 2021a).
We extend this investigation to noun property
prediction. We propose a novel noun property retrieval model which combines information from language and vision models, and tunes their respective contributions based on property concreteness (Brysbaert et al., 2014).
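A minimal sketch of one way such a concreteness-gated combination can be instantiated is given below; the linear interpolation over a min-max normalized concreteness rating is an assumed weighting scheme for illustration, not necessarily the exact one we adopt.

```python
# Sketch of a concreteness-gated ensemble: interpolate between the
# language-model and CLIP scores for a candidate property, weighting
# CLIP more heavily as the property gets more concrete. Assumes both
# scores have been normalized to a comparable range (e.g., ranks).
def ensemble_score(lm_score, clip_score, concreteness,
                   c_min=1.0, c_max=5.0):
    # Map a 1-5 concreteness rating (Brysbaert et al., 2014), or a
    # value predicted by a concreteness regression model, to [0, 1].
    w = (concreteness - c_min) / (c_max - c_min)
    return w * clip_score + (1.0 - w) * lm_score

# A highly concrete property (e.g., "red") leans on CLIP, while an
# abstract one (e.g., "free") leans on the language model.
print(ensemble_score(lm_score=0.3, clip_score=0.9, concreteness=4.2))
```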
Concreteness is a graded notion that strongly
correlates with the degree of imageability (Friendly
et al., 1982; Byrne, 1974); concrete words generally tend to refer to tangible objects that the senses can easily perceive (Paivio et al., 1968). We extend this
idea to noun properties and hypothesize that vision
models would have better knowledge of perceptual,
and more concrete, properties (e.g., red, flat, round) than text-based language models, which would better capture abstract properties (e.g., free, inspiring,
promising). We evaluate our ensemble model using concreteness scores automatically predicted by a regression model (Charbonnier and Wartena, 2019). We compare these results to the performance of the ensemble model with manual (gold) concreteness ratings (Brysbaert et al., 2014). In
previous work, concreteness was measured based
on the idea that abstract concepts relate to varied
and composite situations (Barsalou and Wiemer-Hastings, 2005). Consequently, visually grounded
representations of abstract concepts (e.g., freedom)
should be more complex and diverse than those of
concrete words (e.g., dog) (Lazaridou et al., 2015; Kiela et al., 2014).
Lazaridou et al. (2015) specifically measure the
entropy of the vectors induced by multimodal models, which serves as an expression of how varied the