
banana
  ConceptNet:  yellow, good to eat
  COMET-2020:  one of the main ingredients, eaten as a snack, one of many fruits, found in garden, black
  Ascent++:    rich, ripe, yellow, green, brown, sweet, great, black, useful, safe, delicious, healthy, nutricious, ...

lion
  ConceptNet:  a feline
  COMET-2020:  found in jungle, one of many animals, one of many species, two legs, very large
  Ascent++:    free, extinct, hungry, close, unique, active, nocturnal, old, dangerous, great, happy, right, ...

airplane
  ConceptNet:  good for quickly travelling long distances
  COMET-2020:  flying, air travel, flying machine, very small, flight
  Ascent++:    heavy, new, important, white, safe, unique, full, larger, clean, slow, low, unstable, electric, ...

Table 1: Properties of some example concepts, according to three commonsense knowledge resources.
predicts triples using a generative language model
that was trained on several commonsense knowl-
edge graphs, and Ascent++ (Nguyen et al., 2021),
which is a commonsense knowledge base that was
extracted from web text. Given the noisy nature of
such resources, we instead rely on a database of hyper-
nyms. The underlying intuition is that hy-
pernyms can be extracted from text relatively easily,
while fine-grained hypernyms often implicitly de-
scribe commonsense properties. For instance, Mi-
crosoft Concept Graph (Ji et al., 2019) lists potas-
sium rich food as a hypernym of banana and large
and dangerous carnivore as a hypernym of lion.
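To make this intuition concrete, the sketch below shows one way of turning fine-grained hypernyms into concept-property pairs; the toy hypernym dictionary and the simple head-noun heuristic are illustrative assumptions, not the exact extraction procedure used in our experiments.

    # Illustrative sketch: deriving (concept, property) pre-training pairs from
    # fine-grained hypernyms. The toy data stands in for a resource such as the
    # Microsoft Concept Graph; the head-noun heuristic is an assumption.
    hypernyms = {
        "banana": ["potassium rich food", "tropical fruit"],
        "lion": ["large and dangerous carnivore", "big cat"],
    }

    def pairs_from_hypernyms(hypernyms):
        """Treat the final token of a hypernym phrase as the head noun and the
        preceding modifiers as an implicit commonsense property."""
        pairs = []
        for concept, phrases in hypernyms.items():
            for phrase in phrases:
                tokens = phrase.split()
                if len(tokens) > 1:               # skip bare hypernyms such as "food"
                    prop = " ".join(tokens[:-1])  # e.g. "potassium rich"
                    pairs.append((concept, prop))
        return pairs

    print(pairs_from_hypernyms(hypernyms))
    # [('banana', 'potassium rich'), ('banana', 'tropical'),
    #  ('lion', 'large and dangerous'), ('lion', 'big')]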
We also experiment with GenericsKB (Bhakthavat-
salam et al., 2020), a large collection of generic
sentences (e.g. “Coffee contains minerals and an-
tioxidants which help prevent diabetes”), to ob-
tain concept-property pairs for pre-training. Given
such pre-training data, we then train a concept encoder Φ_con and a property encoder Φ_prop such that σ(Φ_con(c) · Φ_prop(p)) indicates the probability that concept c has property p.
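The following is a minimal sketch of this bi-encoder scoring function; the choice of bert-base-uncased as the underlying encoder and the use of the [CLS] vector as the phrase representation are illustrative assumptions rather than the exact configuration used in our experiments.

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Bi-encoder sketch: separate encoders map a concept and a property to
    # vectors; a sigmoid over their dot product scores the pair.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    concept_encoder = AutoModel.from_pretrained("bert-base-uncased")   # Φ_con
    property_encoder = AutoModel.from_pretrained("bert-base-uncased")  # Φ_prop

    def encode(encoder, text):
        inputs = tokenizer(text, return_tensors="pt")
        # Use the [CLS] token embedding as the phrase representation.
        return encoder(**inputs).last_hidden_state[:, 0, :]

    def property_probability(concept, prop):
        c_vec = encode(concept_encoder, concept)    # Φ_con(c)
        p_vec = encode(property_encoder, prop)      # Φ_prop(p)
        return torch.sigmoid((c_vec * p_vec).sum(dim=-1))  # σ(Φ_con(c) · Φ_prop(p))

    print(property_probability("banana", "rich in potassium"))

A natural way to fit such a model is to fine-tune both encoders on positive and sampled negative concept-property pairs with a binary cross-entropy objective, although that training choice is an assumption here rather than a detail stated above.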
In summary, our main contributions are as fol-
lows: (i) we propose a new evaluation setting which
is more realistic than the standard benchmarks for
predicting commonsense properties; (ii) we anal-
yse the potential of hypernymy datasets and generic
sentences to act as pre-training data; and (iii) we de-
velop a simple but effective bi-encoder architecture
for modelling commonsense properties.
2 Related Work
Several authors have analysed the extent to which
language models such as BERT capture common-
sense knowledge. As already mentioned, Forbes
et al. (2019) evaluated the ability of BERT to
predict commonsense properties from the McRae
dataset (McRae et al., 2005), which we also use
in our experiments. The same dataset was used by
Weir et al. (2020) to analyse whether BERT-based
language models could generate concept names
from their associated properties; e.g. given the
input “A ⟨mask⟩ has fur, is big, and has claws”,
the model is expected to predict that ⟨mask⟩ cor-
responds to the word bear. Conversely, Apidi-
anaki and Garí Soler (2021) considered the problem
of generating adjectival properties from prompts
such as “mittens are generally ⟨mask⟩”. Note that
the latter two works evaluated pre-trained models
directly, without fine-tuning, whereas the experi-
ments of Forbes et al. (2019) involved fine-tuning the
language model on a task-specific training set first.
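As a small illustration of this kind of zero-shot probing, a masked language model can simply be asked to fill the placeholder in such a prompt; the specific model and prompt wording below are illustrative choices rather than the exact setup of the works discussed above.

    from transformers import pipeline

    # Zero-shot probe in the style of the prompts above: ask a pre-trained
    # masked language model for likely adjectival properties of a concept.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in fill_mask("mittens are generally [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))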
When the main motivation is to probe the abilities
of language models, avoiding fine-tuning has the
advantage that any observed abilities reflect what is
captured by the pre-trained language model itself,
rather than learned during the fine-tuning phase.
However, Li et al. (2021) argue that the extent to
which pre-trained language models capture com-
monsense knowledge is limited, suggesting that
some form of fine-tuning is essential in practice.
Interestingly, this remains the case for large language models: for instance, the model studied by Li et al. (2021) had 7 billion parameters, while West et al. (2021) report that the predictions from GPT-3 (Brown et al., 2020)
had to be filtered by a so-called critic model when
distilling a commonsense knowledge graph.
The strategy taken by COMET (Bosselut et al.,
2019) is to fine-tune a GPT model (Radford et al.)
on triples from commonsense knowledge graphs.
Being based on an autoregressive language model,
COMET can be used to predict concepts that take
the form of short phrases, which is often needed
when reasoning about events (e.g. to express moti-
vations or effects). However, as illustrated in Table
1, COMET is less suitable for modelling the com-
monsense properties of concepts. Other approaches