
Modelling Commonsense Properties using Pre-Trained Bi-Encoders
Amit Gajbhiye, Luis Espinosa-Anke, Steven Schockaert
CardiffNLP, Cardiff University, United Kingdom
AMPLYFI, United Kingdom
{gajbhiyea, espinosa-ankel, schockaerts1}@cardiff.ac.uk
Abstract

Grasping the commonsense properties of everyday concepts is an important prerequisite to language understanding. While contextualised language models are reportedly capable of predicting such commonsense properties with human-level accuracy, we argue that such results have been inflated because of the high similarity between training and test concepts. This means that models which capture concept similarity can perform well, even if they do not capture any knowledge of the commonsense properties themselves. In settings where there is no overlap between the properties that are considered during training and testing, we find that the empirical performance of standard language models drops dramatically. To address this, we study the possibility of fine-tuning language models to explicitly model concepts and their properties. In particular, we train separate concept and property encoders on two types of readily available data: extracted hyponym-hypernym pairs and generic sentences. Our experimental results show that the resulting encoders allow us to predict commonsense properties with much higher accuracy than is possible by directly fine-tuning language models. We also present experimental results for the related task of unsupervised hypernym discovery.

Code and datasets are available at https://github.com/amitgajbhiye/biencoder_concept_property.
1 Introduction
Pre-trained language models (Devlin et al., 2019) have been found to capture a surprisingly rich amount of knowledge about the world (Petroni et al., 2019). Focusing on commonsense knowledge, Forbes et al. (2019) used BERT to predict whether a given concept (e.g. teddy bear) satisfies a given commonsense property (e.g. is dangerous). To this end, they convert the input into a simple sentence (e.g. "A teddy bear is dangerous") and treat the task as a standard sentence classification task. Remarkably, they found this approach to surpass human performance. Shwartz and Choi (2020) moreover found that language models can, to some extent, capture commonsense properties that are rarely expressed in text, thus mitigating the issue of reporting bias that has traditionally plagued initiatives for learning commonsense knowledge from text (Gordon and Van Durme, 2013).
Despite these encouraging signs, however, modelling commonsense properties remains highly challenging. A key concern is that language models are typically fine-tuned on a training set that contains the same properties as those in the test set. For instance, the test data from Forbes et al. (2019) includes the question whether peach has the property eaten in summer, while the training data asserts that apple, banana, orange and pear all have this property. To do well on this task, the model does not actually need to capture the knowledge that peaches are eaten in summer; it is sufficient to capture that peach is similar to apple, banana, orange and pear. For this reason, we propose new training-test splits, which ensure that the properties occurring in the test data do not occur in the training data. Our experiments show that the ability of language models to predict commonsense properties drops dramatically in this setting.
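To make this evaluation setting concrete, the sketch below shows one way a property-disjoint training-test split could be constructed; the tuple format, split ratio and random seed are illustrative assumptions, not the exact setup used in our experiments.

```python
import random
from collections import defaultdict

def property_disjoint_split(examples, test_ratio=0.2, seed=42):
    """Split (concept, property, label) examples so that no property that
    appears in the test set also appears in the training set."""
    rng = random.Random(seed)
    by_property = defaultdict(list)
    for concept, prop, label in examples:
        by_property[prop].append((concept, prop, label))

    properties = sorted(by_property)
    rng.shuffle(properties)
    test_props = set(properties[: int(len(properties) * test_ratio)])

    train, test = [], []
    for prop, items in by_property.items():
        (test if prop in test_props else train).extend(items)
    return train, test

# With such a split, "eaten in summer" ends up entirely in train or in test,
# so a model cannot succeed merely by noticing that peach resembles apple.
examples = [("apple", "eaten in summer", 1), ("peach", "eaten in summer", 1),
            ("lion", "is dangerous", 1), ("teddy bear", "is dangerous", 0)]
train, test = property_disjoint_split(examples, test_ratio=0.5)
```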
Our aim is to develop a strategy for modelling the commonsense properties of concepts. Given the limitations that arise when language models are used directly, a natural approach is to pre-train a language model on some kind of auxiliary data. Unfortunately, resources encoding the commonsense properties of concepts tend to be prohibitively noisy. To illustrate this point, Table 1 lists the properties of some everyday concepts according to three well-known resources: ConceptNet (Speer et al., 2017), which is a large commonsense knowledge graph; COMET-2020 (Hwang et al., 2021), which predicts triples using a generative language model that was trained on several commonsense knowledge graphs (we used the demo at https://mosaickg.apps.allenai.org/model_comet2020_entities); and Ascent++ (Nguyen et al., 2021), which is a commonsense knowledge base that was extracted from web text.

Concept | ConceptNet | COMET-2020 | Ascent++
banana | yellow, good to eat | one of the main ingredients, eaten as a snack, one of many fruits, found in garden, black | rich, ripe, yellow, green, brown, sweet, great, black, useful, safe, delicious, healthy, nutricious, ...
lion | a feline | found in jungle, one of many animals, one of many species, two legs, very large | free, extinct, hungry, close, unique, active, nocturnal, old, dangerous, great, happy, right, ...
airplane | good for quickly travelling long distances | flying, air travel, flying machine, very small, flight | heavy, new, important, white, safe, unique, full, larger, clean, slow, low, unstable, electric, ...

Table 1: Properties of some example concepts, according to three commonsense knowledge resources.

Given the noisy nature of such resources, we rely on a database with hypernyms instead. The underlying intuition is that hypernyms can be extracted from text relatively easily, while fine-grained hypernyms often implicitly describe commonsense properties. For instance, Microsoft Concept Graph (Ji et al., 2019) lists potassium rich food as a hypernym of banana and large and dangerous carnivore as a hypernym of lion. We also experiment with GenericsKB (Bhakthavatsalam et al., 2020), a large collection of generic sentences (e.g. "Coffee contains minerals and antioxidants which help prevent diabetes"), to obtain concept-property pairs for pre-training. Given such pre-training data, we then train a concept encoder $\Phi_{con}$ and a property encoder $\Phi_{prop}$ such that $\sigma(\Phi_{con}(c) \cdot \Phi_{prop}(p))$ indicates the probability that concept $c$ has property $p$.
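As a minimal sketch of this scoring function, the snippet below pairs two BERT-based encoders and combines their outputs with a dot product followed by a sigmoid; the choice of bert-base-uncased and mean pooling are illustrative assumptions rather than the exact configuration we use.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BiEncoder(nn.Module):
    """sigma(Phi_con(c) . Phi_prop(p)) is read as the probability that
    concept c has property p."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.concept_encoder = AutoModel.from_pretrained(model_name)   # Phi_con
        self.property_encoder = AutoModel.from_pretrained(model_name)  # Phi_prop

    def _encode(self, encoder, texts):
        batch = self.tokenizer(texts, padding=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(1) / mask.sum(1)  # mean pooling over tokens

    def forward(self, concepts, properties):
        c_vec = self._encode(self.concept_encoder, concepts)
        p_vec = self._encode(self.property_encoder, properties)
        return torch.sigmoid((c_vec * p_vec).sum(dim=-1))

model = BiEncoder()
prob = model(["banana"], ["sweet fruit"])  # tensor of shape (1,)
```

At training time, such probabilities can be fed into a binary cross-entropy loss over valid and corrupted concept-property pairs.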
In summary, our main contributions are as follows: (i) we propose a new evaluation setting which is more realistic than the standard benchmarks for predicting commonsense properties; (ii) we analyse the potential of hypernymy datasets and generic sentences to act as pre-training data; and (iii) we develop a simple but effective bi-encoder architecture for modelling commonsense properties.
2 Related Work
Several authors have analysed the extent to which language models such as BERT capture commonsense knowledge. As already mentioned, Forbes et al. (2019) evaluated the ability of BERT to predict commonsense properties from the McRae dataset (McRae et al., 2005), which we also use in our experiments. The same dataset was used by Weir et al. (2020) to analyse whether BERT-based language models could generate concept names from their associated properties; e.g. given the input "A ⟨mask⟩ has fur, is big, and has claws", the model is expected to predict that ⟨mask⟩ corresponds to the word bear. Conversely, Apidianaki and Garí Soler (2021) considered the problem of generating adjectival properties from prompts such as "mittens are generally ⟨mask⟩". Note that the latter two works evaluated pre-trained models directly, without fine-tuning, whereas the experiments of Forbes et al. (2019) involved fine-tuning the language model on a task-specific training set first. When the main motivation is to probe the abilities of language models, avoiding fine-tuning has the advantage that any observed abilities reflect what is captured by the pre-trained language model itself, rather than learned during the fine-tuning phase. However, Li et al. (2021) argue that the extent to which pre-trained language models capture commonsense knowledge is limited, suggesting that some form of fine-tuning is essential in practice. Interestingly, this remains the case for large language models. For instance, the model considered by Li et al. (2021) had 7 billion parameters, while West et al. (2021) report that the predictions from GPT-3 (Brown et al., 2020) had to be filtered by a so-called critic model when distilling a commonsense knowledge graph.
The strategy taken by COMET (Bosselut et al., 2019) is to fine-tune a GPT model (Radford et al.) on triples from commonsense knowledge graphs. Being based on an autoregressive language model, COMET can be used to predict concepts that take the form of short phrases, which is often needed when reasoning about events (e.g. to express motivations or effects). However, as illustrated in Table 1, COMET is less suitable for modelling the commonsense properties of concepts. Other approaches have focused on improving the commonsense reasoning abilities of general-purpose language models. For instance, Zhou et al. (2021) introduce a self-supervised pre-training task to encourage language models to better capture the commonsense relations between everyday concepts.
A final line of related work concerns the modelling of hypernymy. Several authors have proposed specialised embedding models for this task (Dasgupta et al., 2021; Le et al., 2019). Most relevant to our work, Takeoka et al. (2021) fine-tune a BERT-based language model to predict the validity of a concept–hypernym pair. Inspired by the effectiveness of Hearst patterns (Hearst, 1992), they use prompts of the form "[HYPERNYM] such as [CONCEPT]" (and similar for other Hearst patterns). The extent to which the pre-trained BERT model captures hypernymy has also been studied. For instance, Hanna and Mareček (2021) use prompts where the prediction of the ⟨mask⟩ token can be interpreted as the prediction of a hypernym, to avoid the need for fine-tuning the model.
3 Methodology
Let a set of concept–property pairs $\mathcal{K}$ be given, where $(c, p) \in \mathcal{K}$ means that concept $c$ is asserted to have the property $p$. We write $\mathcal{C}$ and $\mathcal{P}$ for the sets of concepts and properties in $\mathcal{K}$, i.e. $\mathcal{C} = \{c \mid (c, p) \in \mathcal{K}\}$ and $\mathcal{P} = \{p \mid (c, p) \in \mathcal{K}\}$. We use the term "property" in a broad sense, covering both semantic attributes, as in the pair (banana, sweet), and hypernyms, as in the pair (banana, fruit). This is motivated by the fact that hypernyms often encode knowledge about semantic attributes, as in the pair (banana, sweet fruit). In particular, our hypothesis is that, by treating hypernyms and semantic attributes in a unified way, we can pre-train a model on readily available hypernym datasets and use it to predict semantic attributes.
We want to train a model that can predict for a given pair $(c, p)$ whether $c$ has property $p$. Two general strategies can be pursued when developing such models. The first strategy is to use a so-called cross-encoder, which amounts to fine-tuning a single language model to predict whether a given input $(c, p)$ represents a valid pair or not. The second strategy is to use a so-called bi-encoder, which amounts to the idea that $c$ and $p$ are separately encoded, with the resulting vectors then being used to predict whether $(c, p)$ is a valid pair. In this paper, we pursue the latter strategy. This is primarily motivated by the fact that the concept and property encoders enable a wider range of applications. A cross-encoder can only be used to predict whether a given pair $(c, p)$ is valid or not. In contrast, a bi-encoder model can also be used to efficiently find the properties $p$ of a given concept $c$. Moreover, the resulting concept and property embeddings may themselves be useful as static representations of word meaning, e.g. as label embeddings for zero-shot or few-shot learning (Socher et al., 2013; Ma et al.; Xing et al., 2019; Li et al., 2020; Yan et al., 2021). Finally, bi-encoders can be trained more efficiently than cross-encoders.
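To make the efficiency argument concrete, the sketch below (reusing the hypothetical BiEncoder class from the sketch in Section 1) embeds all candidate properties once and scores a new concept against them with a single matrix product, something a cross-encoder cannot do without re-running the full model on every pair.

```python
import torch

@torch.no_grad()
def top_properties(model, concept, candidate_properties, k=5):
    """Rank candidate properties for a concept with one matrix product.
    The property embeddings could be precomputed and cached offline."""
    p_mat = model._encode(model.property_encoder, candidate_properties)  # (N, d)
    c_vec = model._encode(model.concept_encoder, [concept]).squeeze(0)   # (d,)
    scores = torch.sigmoid(p_mat @ c_vec)                                # (N,)
    best = torch.topk(scores, k=min(k, len(candidate_properties)))
    return [(candidate_properties[i], scores[i].item()) for i in best.indices]

# Usage, given the BiEncoder sketch from Section 1:
# model = BiEncoder()
# top_properties(model, "banana", ["sweet", "dangerous", "yellow", "a feline"], k=2)
```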
Datasets
To train our model, we need a large set of concept–property pairs $\mathcal{K}$. Unfortunately, high-quality knowledge of this kind is not readily available. Part of the underlying issue is that properties of concepts are rarely explicitly stated in text, which is why directly using information extraction techniques is not straightforward. However, initiatives for extracting hypernyms from text have been much more successful, starting with the seminal work of Hearst (1992). A key observation is that fine-grained hypernyms often express commonsense properties, typically as a mechanism for refining hypernyms that would otherwise be too broad. For instance, Microsoft Concept Graph (Ji et al., 2019) lists vitamin C rich food as a hypernym of strawberry, as a refinement of the more general hypernym food. By pre-training our model on concept–hypernym pairs, we may thus expect it to learn about commonsense properties as well. To directly test this hypothesis, we use a set of such concept–hypernym pairs as our pre-training set $\mathcal{K}$. Specifically, we collect the 100K concept–hypernym pairs from the Microsoft Concept Graph (https://concept.research.microsoft.com/Home/Download) with the highest confidence score, i.e. those pairs maximising the Relations frequency. We will refer to this dataset as MSCG.
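As an illustration of this filtering step, the snippet below assumes the Microsoft Concept Graph dump is a tab-separated file listing hypernym, concept and a raw Relations count per line; the file name and column layout are assumptions about the released data rather than guaranteed details.

```python
import pandas as pd

# Assumed layout of the Microsoft Concept Graph dump: one hypernym-concept
# pair per line, tab-separated, with a raw co-occurrence ("Relations") count.
columns = ["hypernym", "concept", "relations"]
mscg = pd.read_csv("data-concept-instance-relations.txt", sep="\t", names=columns)

# Keep the 100K concept-hypernym pairs with the highest Relations frequency.
mscg_top = (mscg.sort_values("relations", ascending=False)
                .head(100_000)[["concept", "hypernym"]])
mscg_top.to_csv("mscg_pretraining_pairs.tsv", sep="\t", index=False)
```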
As a second strategy, we attempt to convert the MSCG dataset into a set of concept–property pairs. To this end, we look for pairs $(c, h_1)$ and $(c, h_2)$ where $h_2$ is a suffix of $h_1$. Specifically, if $h_1 = m\,h_2$ and $m$ is an adjectival phrase, then we assume that $m$ describes a property of $c$. For instance, MSCG contains the pairs (strawberry, vitamin C rich food) and (strawberry, food). Based on this, we would include the pair (strawberry, vitamin C rich).
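A minimal sketch of this conversion is shown below; it uses spaCy part-of-speech tags as a stand-in for the adjectival-phrase check, which is an assumption about how such a test could be implemented rather than a description of our exact procedure.

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def is_adjectival(phrase):
    """Heuristic: accept the prefix m if it contains no verbs, i.e. all of its
    tokens are adjectives, adverbs, numbers or nominal modifiers."""
    doc = nlp(phrase)
    return len(doc) > 0 and all(
        tok.pos_ in {"ADJ", "ADV", "NOUN", "PROPN", "NUM"} for tok in doc)

def hypernyms_to_properties(concept_hypernym_pairs):
    """From pairs such as (strawberry, "vitamin C rich food") and (strawberry, "food"),
    derive (strawberry, "vitamin C rich") whenever h1 = m + h2 and m is adjectival."""
    hypernyms = defaultdict(set)
    for concept, hypernym in concept_hypernym_pairs:
        hypernyms[concept].add(hypernym)

    properties = set()
    for concept, hyps in hypernyms.items():
        for h1 in hyps:
            for h2 in hyps:
                if h1 != h2 and h1.endswith(" " + h2):
                    m = h1[: -len(h2)].strip()
                    if is_adjectival(m):
                        properties.add((concept, m))
    return properties

pairs = [("strawberry", "vitamin C rich food"), ("strawberry", "food")]
print(hypernyms_to_properties(pairs))  # {('strawberry', 'vitamin C rich')}
```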