COMPS: Conceptual Minimal Pair Sentences for testing Robust Property
Knowledge and its Inheritance in Pre-trained Language Models
Kanishka Misra
Purdue University
kmisra@purdue.edu
Julia Rayz
Purdue University
jtaylor1@purdue.edu
Allyson Ettinger
University of Chicago
aettinger@uchicago.edu
Abstract
A characteristic feature of human semantic
cognition is its ability to not only store and
retrieve the properties of concepts observed
through experience, but to also facilitate the in-
heritance of properties (can breathe) from su-
perordinate concepts (ANIMAL) to their subor-
dinates (DOG)—i.e. demonstrate property in-
heritance. In this paper, we present COMPS,
a collection of English minimal pair sentences
that jointly tests pre-trained language models
(PLMs) on their ability to attribute properties
to concepts and their ability to demonstrate
property inheritance behavior. Analyses of
22 different PLMs on COMPS reveal that they
can easily distinguish between concepts on
the basis of a property when they are trivially
different, but find it relatively difficult when
concepts are related on the basis of nuanced
knowledge representations. Furthermore, we
find that PLMs can show behaviors suggest-
ing successful property inheritance in simple
contexts, but fail in the presence of distracting
information, which decreases the performance
of many models sometimes even below chance.
This lack of robustness in demonstrating sim-
ple reasoning raises important questions about
PLMs’ capacity to make correct inferences
even when they appear to possess the prereq-
uisite knowledge.
1 Introduction
The ability to learn, update and deploy one’s knowl-
edge about concepts (ROBIN, CHAIR) and their
properties (can fly, can be sat on), observed dur-
ing everyday experience is fundamental to human
semantic cognition (Murphy,2002;Rogers and Mc-
Clelland,2004;Rips et al.,2012). Knowledge of
a concept’s properties, combined with the ability
to infer the IsA relation (Sloman, 1998; Murphy, 2003) leads to an important behavior known as
property inheritance (Quillian,1967;Smith and
Estes,1978;Murphy,2002), where subordinates
of a concept inherit its properties. For instance,
one is likely to infer that an entity called luna can
meow, has a tail, is a mammal, etc., even if all
they know is that it is a cat. The close connection
between a word’s meaning and its conceptual repre-
sentation makes these abilities crucial to language
understanding (Murphy,2002;Lake and Murphy,
2021), making it critical for computational mod-
els of language processing to also exhibit behav-
ior consistent with these capacities. Indeed, mod-
ern pre-trained language models (PLMs; Devlin
et al.,2019;Brown et al.,2020, etc.) have made
impressive empirical strides in eliciting general
knowledge about real world concepts and entities
(Petroni et al.,2019;Weir et al.,2020,i.a.), as well
as in demonstrating isomorphism with real world
abstractions like direction and color (Abdou et al.,
2021; Patel and Pavlick, 2022), oftentimes without
even having been explicitly trained to do so. At
the same time, their ability to robustly demonstrate
such capacities has recently been called into question,
owing to failures due to reporting bias (Gordon and
Van Durme,2013;Shwartz and Choi,2020), lack
of consistency (Elazar et al.,2021;Ravichander
et al.,2020), and sensitivity to lexical cues (Kass-
ner and Schütze,2020;Misra et al.,2020;Pandia
and Ettinger,2021).
In this work, we cast further light on PLMs’
ability to robustly demonstrate knowledge about
concepts and their properties. To this end, we intro-
duce Conceptual Minimal Pair Sentences (COMPS),
a collection of English minimal pair sentences,
where each pair attributes a property (can fly) to
two noun concepts: one which actually possesses
the property (ROBIN), and one which does not
(PENGUIN). Following standard practice in the
minimal pairs evaluation paradigm (Warstadt et al.,
2020, etc.), we test whether PLMs prefer sentence
stimuli expressing correct property knowledge over
those expressing incorrect ones. COMPS can be de-
composed into three subsets, each containing stim-
uli that progressively isolate deeper understanding
of the task of attributing properties to concepts,
by adding controls for more superficial heuristics.
Our first subset—COMPS-BASE—measures the ex-
tent to which PLMs attribute properties to the right
concepts, while varying the similarity of the posi-
tive (ROBIN) and the negative concepts (PENGUIN
[high] vs. TABLE [low]). This controls for the pos-
sibility that models are relying on coarse-grained
concept distinctions. For instance, in this setup a
model should prefer (1a) over both versions of (1b).
(1) a. A robin can fly.
b. *A (penguin/table) can fly.
Next, drawing on the phenomenon of property in-
heritance, the COMPS-WUGS set introduces a novel
concept, WUG, expressed as the subordinate of the
positive and negative concepts from a subset of
the COMPS-BASE set, and tests the extent to which
PLMs successfully attribute it the given property
when it is associated with the positive concept. This
increases the complexity of the reasoning task, as
well as the distance between the associated concept
(ROBIN) and property (can fly). These manipula-
tions help to control for memorization of the literal
phrases being tested, forcing models to judge prop-
erties for a novel concept that inherits the property
from a known concept. In this task, given that a
model successfully prefers (1a) over (1b), it should
also prefer (2a) over (2b):
(2) a. A wug is a robin. Therefore, a wug can fly.
b. *A wug is a penguin. Therefore, a wug can fly.
The final subset, COMPS-WUGS-DIST, combines
the aforementioned controls by using negative con-
cepts as distracting content and inserting them into
the COMPS-WUGS stimuli. Specifically, we trans-
form the stimuli of COMPS-WUGS by creating two
subordinates for every minimal pair; one for the
positive concept (ROBIN, subordinate: WUG) and
the other for the negative concept (PENGUIN, sub-
ordinate: DAX), which acts as a distractor. This
way, we control for the possibility that models may
be relying on simple word associations between
content words—of which there are only two in the
prior tests—by introducing additional, irrelevant
but contentful words into the context. Here, we
consider models to be correct if they prefer (3a)
over (3b), given that they prefer (1a) over (1b):
(3) a. A wug is a robin. A dax is a penguin. Therefore, a wug can fly.
b. *A wug is a robin. A dax is a penguin. Therefore, a dax can fly.
Together, the three sets of stimuli tease apart more
superficial predictive behaviors, such as contex-
tual word associations, from more robust reasoning
behaviors based on understanding of concept prop-
erties. While we can expect superficial predictive
strategies to be brittle in the face of shallow pertur-
bations and irrelevant distractions, robust property
knowledge and reasoning behaviors should not.
We use COMPS to analyze robust property knowl-
edge and its inheritance in 22 different PLMs,
ranging from small masked language models to
billion-parameter autoregressive language models.
In our experiments with COMPS-BASE, we find
PLMs to demonstrate strong performance in at-
tributing properties to the correct concepts in our
minimal pairs. However, we observe this strong
performance largely when the concepts in the min-
imal pairs are trivially different (e.g., LION and
TEA for the property is a mammal). When the
concept pairs are similar (on the basis of differ-
ent knowledge representations), we find models’
performance to degrade substantially, by as much
as 25 points. We observe a similar trend in our
analyses on COMPS-WUGS—models first appear
to show desirable behavior, potentially indicating
proficiency in the more complex property inher-
itance reasoning. However, their overall perfor-
mance declines drastically when investigated in
the presence of distractors (i.e., on COMPS-WUGS-
DIST). This failure is particularly pronounced in
larger autoregressive PLMs, whose performance
in fact drops below chance in cases where distract-
ing information is proximal to the queried prop-
erty, indicating the presence of a proximity ef-
fect. Together, our findings highlight brittleness
of PLMs with conceptual knowledge and reason-
ing, as evidenced by failures in the face of simple
controls. We make our code and data available at:
https://github.com/kanishkamisra/comps.
2 Conceptual Minimal Pair Sentences
(COMPS)
2.1 Connections to prior work
Prior work in exploring property knowledge in
PLMs has adopted two different paradigms: one
which uses probing classifiers to test if the applica-
bility of a property can be decoded from the repre-
sentations of LMs (Forbes et al.,2019;Da and Ka-
sai,2019;Derby et al.,2021); and the other which
uses cloze-testing, in which LMs are tasked to fill
in the blank in prompts that describe specific prop-
erties/factual knowledge about the world (Petroni
et al.,2019;Weir et al.,2020). We argue that both
approaches—though insightful—have key limita-
tions for evaluating property knowledge, and that
minimal pair testing overcomes these limitations to
a beneficial extent.
Apart from ongoing debates surrounding the va-
lidity of probing classifiers (see Hewitt and Liang,
2019;Ravichander et al.,2021;Belinkov,2022),
the probing setup does not allow the testing of prop-
erty knowledge in a precise manner. Specifically,
several properties are often perfectly correlated in
datasets such as the one we use here (see §2.2). For
example, the properties of being an animal, being
able to breathe and grow, etc., are all perfectly cor-
related with one another. Even if the model’s true
knowledge of these properties is highly variable,
probing its representations for them would yield the
exact same result, leading to conclusions that over-
estimate the model’s capacity for some properties,
while underestimating for others. Evaluation using
minimal pair sentences overcomes this limitation
by allowing us to explicitly represent the proper-
ties of interest in language form, thereby allowing
precise testing of property knowledge.
Similarly, standard cloze-testing of PLMs
(Petroni et al.,2019;Weir et al.,2020;Jiang et al.,
2021) also faces multiple limitations. First, it does
not allow for testing of multi-word expressions,
as by definition, it involves prediction of a sin-
gle word/token. Second, it does not yield faithful
conclusions about one-to-many or many-to-many
relations: e.g., the cloze prompts “Ravens can ___.”
and “___ can fly.” do not have a single correct
answer. This makes our conclusions about mod-
els’ knowledge contingent on the choice of one correct
completion over the other. The minimal pair eval-
uation paradigm overcomes these issues by gen-
eralizing the cloze-testing method to multi-word
expressions—by focusing on entire sentences—
and at the same time, pairing every prompt with
a negative instance. This allows for a straightfor-
ward way to assess correctness: the choice between
multiple correct completions is transformed into
one between correct and incorrect, at the cost of
having several different instances (pairs) for test-
ing knowledge of the same property. Additionally,
the minimal pairs paradigm also allows us to shed
light on how the nature of negative samples affects
model behavior, which has been missing in ap-
proaches using probing and cloze-testing. The us-
age of minimal pairs is a well-established practice
in the literature, having been widely used in works
that analyze syntactic knowledge of LMs (Marvin
and Linzen,2018;Futrell et al.,2019;Warstadt
et al.,2020). We complement this growing liter-
ature by introducing minimal-pair testing to the
study of conceptual knowledge in PLMs.
Our property inheritance analyses closely relate
to the ‘Leap-of-Thought’ (LoT) framework of Tal-
mor et al. (2020). In particular, LoT holds the
taxonomic relations between concepts implicit and
tests whether models can abstract over them to
make property inferences—e.g., testing the extent
to which models assign Whales have bellybuttons
the ‘True’ label, given that Mammals have belly-
buttons (with the implicit knowledge here being
Whales are mammals). With COMPS-WUGS (and
COMPS-WUGS-DIST), we instead explicitly pro-
vide the relevant taxonomic knowledge in the con-
text and target whether PLMs can behave consis-
tently with knowledge they have already demon-
strated (in the base case, COMPS-BASE) and at-
tribute the property in question to the correct subor-
dinate concept. This also relates to recent work that
measures consistency of PLMs’ word prediction
capacities in eliciting factual knowledge (Elazar
et al.,2021;Ravichander et al.,2020).
2.2 Ground-truth Property Knowledge data
For our ground-truth property knowledge resource,
we use a subset of the CSLB property norms col-
lected by Devereux et al. (2014), which was fur-
ther extended by Misra et al. (2022). The origi-
nal dataset was constructed by asking 123 human
participants to generate properties for 638 every-
day concepts. Contemporary work has used this
dataset by taking as positive instances all concepts
for which a property was generated, while taking
the rest as negative instances (Lucy and Gauthier,
2017;Da and Kasai,2019, etc.) for each prop-
erty. While this dataset has been popularly used in
related literature, Misra et al. (2022) recently dis-
covered striking gaps in coverage among the prop-
erties included in the dataset (see also Sommerauer
and Fokkens, 2018, and Sommerauer, 2022, who also
discuss this limitation). For example, the property
can breathe was only generated for 6 out of 152
animal concepts, despite being applicable for all of
them—as a result, contemporary work can
be expected to have wrongfully penalized models
that attributed this property to animals that could
indeed breathe, and similarly for other properties.
To remedy this issue, Misra et al. (2022) manually
extended CSLB’s coverage for 521 concepts and
3,645 properties. We refer to this extended CSLB
dataset as XCSLB, and we use it as our source for
ground-truth property knowledge.
2.3 Choosing negative samples
We rely on a diverse set of knowledge represen-
tation sources to construct negative samples for
COMPS. Each source has a unique representational
structure which gives rise to different pairwise sim-
ilarity metrics, on the basis of which we pick out
negative samples for each property:
Taxonomy
We consider a hierarchical organiza-
tion of our concepts, by taking a subset of WordNet
(Miller,1995) consisting of our 521 concepts. We
use the wup similarity (Wu and Palmer, 1994) as
our choice of taxonomic similarity.
Property Norms
We use the XCSLB dataset and
organize it as a matrix whose rows indicate con-
cepts and columns indicate properties that are ei-
ther present (indicated as 1) or absent (indicated
as 0) for each concept. As our similarity measure,
we consider the Jaccard similarity between the row
vectors of concepts. This reflects the overlap in
properties between concepts, and is prevalent in
studies utilizing conceptual similarity in cognitive
science (Tversky,1977;Sloman,1993, etc.).
Co-occurrence
We use the co-occurrence be-
tween concept words as an unstructured knowledge
representation. For quantifying similarity, we use
the cosine similarity of the GloVe vectors (Penning-
ton et al.,2014) of our concept words.
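To make the three similarity measures concrete, here is a minimal Python sketch of how they could be computed. It is illustrative rather than the authors' implementation; the choice of first WordNet noun synset and the gensim route for loading GloVe vectors are assumptions made for the example.

```python
import numpy as np
import gensim.downloader as api                 # assumed route for loading GloVe vectors
from nltk.corpus import wordnet as wn           # requires nltk.download("wordnet")

def taxonomy_similarity(concept_a, concept_b):
    """Taxonomic similarity: Wu-Palmer (wup) similarity over WordNet,
    naively using the first noun synset of each concept word."""
    syn_a = wn.synsets(concept_a, pos=wn.NOUN)[0]
    syn_b = wn.synsets(concept_b, pos=wn.NOUN)[0]
    return syn_a.wup_similarity(syn_b)

def property_norm_similarity(props_a, props_b):
    """Property-norm similarity: Jaccard overlap between the sets of
    properties marked as present (1) for each concept in XCSLB."""
    a, b = set(props_a), set(props_b)
    return len(a & b) / len(a | b)

glove = api.load("glove-wiki-gigaword-300")     # pre-trained GloVe embeddings

def cooccurrence_similarity(concept_a, concept_b):
    """Co-occurrence similarity: cosine between GloVe vectors."""
    u, v = glove[concept_a], glove[concept_b]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# e.g., taxonomy_similarity("zebra", "horse")
```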
Sampling Strategy
Each property (p_i) in our dataset splits the set of
concepts into two: a set of concepts that possess the
property (Q_{p_i}), and a set of concepts that do not
(¬Q_{p_i}). We sample min(|Q_{p_i}|, 10)—i.e., at most
10—concepts from Q_{p_i} and take them to be our
positive set. Then for each concept in the positive set,
we sample from ¬Q_{p_i} the concept that is most similar
(depending on the source) to the positive concept and
take it as a negative concept for the property. We
additionally include a negative concept that is randomly
sampled from ¬Q_{p_i}, leaving out the concepts sampled
on the basis of the three previously described knowledge
sources. Examples of the four types of negative samples
for the concept ZEBRA and the property has striped
patterns are shown in Table 1.

Knowledge Rep.    Negative Concept   Similarity
Taxonomy          HORSE              0.88
Property Norms    DEER               0.63
Co-occurrence     GIRAFFE            0.75
Random            BAT                -

Table 1: Negatively sampled concepts selected on the
basis of various knowledge representational mechanisms,
where the property is has striped patterns, and the
positive concept is ZEBRA.
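The sampling procedure can be summarized in a short sketch. The data structures here (a property-to-concepts mapping, the full concept inventory, and a pairwise similarity function per knowledge source such as the ones sketched above) are assumptions for illustration, not the authors' actual code.

```python
import random

MAX_POSITIVES = 10

def sample_pairs(property_to_concepts, all_concepts, similarity, sources):
    """For each property p_i: sample at most 10 positive concepts from Q_{p_i};
    pair each with (a) its most similar concept in ¬Q_{p_i} under each knowledge
    source, and (b) a random concept from ¬Q_{p_i} not already chosen."""
    pairs = []
    for prop, q in property_to_concepts.items():
        not_q = [c for c in all_concepts if c not in q]
        positives = random.sample(sorted(q), min(len(q), MAX_POSITIVES))
        for pos in positives:
            negatives = {
                src: max(not_q, key=lambda neg: similarity(pos, neg, src))
                for src in sources      # taxonomy, property norms, co-occurrence
            }
            remaining = [c for c in not_q if c not in negatives.values()]
            negatives["random"] = random.choice(remaining)
            for src, neg in negatives.items():
                pairs.append({"property": prop, "positive": pos,
                              "negative": neg, "source": src})
    return pairs
```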
2.4 Minimal Pair Construction
Following our negative sample generation process,
we end up with a total of 49,280 pairs of positive and
negative concepts that span across 3,645 properties
(14 pairs per property, on average). Every prop-
erty is associated with a property phrase—a verb
phrase which expresses the property in English, as
provided in XCSLB. Using these materials, we con-
struct our three datasets of minimal pair sentence
stimuli, examples of which are shown in Figure 1.
COMPS-BASE
The COMPS-BASE dataset con-
tains minimal pair sentences that follow the tem-
plate: “[DET] [CONCEPT] [property-phrase].”,
where [DET] is an optional determiner, and
[CONCEPT] is the noun concept. Applying this
template to our generated pairs results in 49,280
instances. See Figure 1a for an example.
COMPS-WUGS
We test property inheritance in
PLMs using only the animal kingdom subset of
COMPS-BASE (152 concepts, 944 properties, and
13,888 pairs), keeping the same negative samples.
We convert the original minimal pair sentences in
COMPS-BASE, in which the positive concept is an
animal, into pairs of two-sentence stimuli by first
introducing a new concept (WUG) to be the sub-
ordinate of the concepts in the original minimal
pair. We then express its property inheritance in
a separate sentence. Our two-sentence stimuli follow
the template: “A wug is a [CONCEPT]. Therefore, a
wug [property-phrase].” Although we use wug as our
running example for the subordinate concept, we use
four different nonsense words {wug, dax, blicket, fep}
equal numbers of times, to avoid making spurious
conclusions based on a single nonsense word. (As we
describe in §4, we also tried a different set of nonce
words, to address concerns about possible impacts of
using nonce words from existing literature, e.g., wug.)
Figure 1: Examples of materials used in our experiments.
In this example, ROBIN is the positive concept (property:
can fly; negative concept: PENGUIN; subordinate: WUG).
(a) Instances of COMPS-BASE (“A (robin/penguin) can
fly.”) and COMPS-WUGS (“A wug is a (robin/penguin).
Therefore, a wug can fly.”).
(b) Distraction scheme for stimuli in COMPS-WUGS-DIST,
where the distractor (“A dax is a penguin.”) is inserted
either before or in between each COMPS-WUGS stimulus
(“A wug is a robin. Therefore, a (wug/dax) can fly.”).
Introducing an intervening novel concept allows us to robustly control for sim-
ple word-level associations between concepts and
properties that models might have picked up during
training. Figure 1a shows an example.
COMPS-WUGS-DIST
To add distracting informa-
tion, we follow Pandia and Ettinger (2021) and
convert the COMPS-WUGS stimuli by associating
a different subordinate concept (DAX) with the
negative concept ([NEG-CONCEPT]), and inserting it
before or in between the sentence containing the
positive concept and its subordinate, separately. This
results in two subsets (before and in-between) of
three-sentence minimal pair stimuli, which differ in the
subordinate to which the property is attributed. We use
the following template to create our stimuli: “A wug is
a [CONCEPT]. A dax is a [NEG-CONCEPT]. Therefore, a
(wug/dax) [property-phrase].” That is, we have stimuli
that resemble COMPS-WUGS but instead deal with a pair
of competing subordinate concepts in context. (We again
choose from our list of four nonsense words, wug, dax,
blicket, and fep, which amounts to 12 unique ordered
pairs, after accounting for counterbalancing.) See
Figure 1b for an example.
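A minimal sketch of how the three templates could be instantiated from a (positive concept, negative concept, property phrase) triple is shown below. The function names and the simplified determiner and nonce-word handling are illustrative assumptions, not the released dataset-construction code.

```python
def comps_base(concept, property_phrase, det="A"):
    """COMPS-BASE: '[DET] [CONCEPT] [property-phrase].'"""
    return f"{det} {concept} {property_phrase}."

def comps_wugs(concept, property_phrase, sub="wug"):
    """COMPS-WUGS: a novel subordinate inherits the property from [CONCEPT]."""
    return f"A {sub} is a {concept}. Therefore, a {sub} {property_phrase}."

def comps_wugs_dist(concept, neg_concept, property_phrase,
                    sub="wug", distractor_sub="dax",
                    queried=None, position="in-between"):
    """COMPS-WUGS-DIST: a distractor subordinate (tied to the negative concept)
    is inserted before or in between the COMPS-WUGS stimulus."""
    queried = queried or sub            # subordinate the property is attributed to
    positive = f"A {sub} is a {concept}."
    distractor = f"A {distractor_sub} is a {neg_concept}."
    conclusion = f"Therefore, a {queried} {property_phrase}."
    if position == "before":
        return " ".join([distractor, positive, conclusion])
    return " ".join([positive, distractor, conclusion])    # in-between

# e.g., the minimal pair in (3):
good = comps_wugs_dist("robin", "penguin", "can fly", queried="wug")
bad = comps_wugs_dist("robin", "penguin", "can fly", queried="dax")
```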
3 Methodology
3.1 Models Investigated
We investigate property knowledge and property
inheritance capacities of 22 different PLMs, be-
longing to six different families. We evaluate four
widely used masked language modeling (MLM)
families: (1) ALBERT (Lan et al.,2020), (2) BERT
(Devlin et al.,2019), (3) ELECTRA (Clark et al.,
2020), and (4) RoBERTa (Liu et al.,2019); as well
as two auto-regressive language modeling families:
(1) GPT2 (Radford et al.,2019), and (2) the GPT-
Neo (Black et al.,2021) and GPT-J models (Wang
and Komatsuzaki,2021) from EleutherAI. We also
use distilled versions of BERT-base, RoBERTa-
base, and GPT2, trained using the method described
by Sanh et al. (2019). We list each model's
parameters, vocabulary size, and training corpora
in Table 3 (Appendix A).
3.2 Measuring Performance
To evaluate models on COMPS, we compare
their log-probabilities for the property phrase—
conditioned on contexts (to the left) containing the
positive and negative noun concepts. That is, we
hold the property phrase constant, and compare
across minimally differing conditions to evaluate
the probability with which a property is attributed
to each concept. For example, we score stimuli in
COMPS-BASE, e.g., “A dog can bark.” as:
log p(can bark. | A dog),
its corresponding stimulus in COMPS-WUGS, “A
wug is a dog. Therefore, a wug can bark.” as:
log p(can bark. | A wug is a dog. Therefore, a wug),
and similarly—assuming CAT as the negative
concept—the corresponding stimuli in our COMPS-
WUGS-DIST subset, “A wug is a dog. A dax is a cat.
Therefore, a wug can bark.” (here the distractor is
added in between the context specifying the positive
concept and the queried property knowledge) as:
log p(can bark. | A wug is a dog. A dax is a cat. Therefore, a wug).
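For autoregressive PLMs, such conditional scores can be obtained by summing the token log-probabilities of the property phrase given the left context. The sketch below uses HuggingFace Transformers with GPT-2 as an illustration; it is a simplified approximation of this scoring scheme (e.g., it tokenizes context and continuation separately), not the authors' evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def conditional_logprob(context, continuation):
    """log p(continuation | context), summed over continuation tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, cont_ids], dim=1)
    logits = model(input_ids).logits
    # log-probability of each token given everything to its left
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the scores of the continuation (property phrase) tokens
    return token_logprobs[:, -cont_ids.shape[1]:].sum().item()

# Minimal-pair comparison: the model is counted as correct if it assigns a
# higher conditional score to the property phrase given the positive concept.
good = conditional_logprob("A robin", " can fly.")
bad = conditional_logprob("A penguin", " can fly.")
correct = good > bad
```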
This approach to eliciting conditional LM judg-
ments is equivalent to the “scoring by premise”
method (Holtzman et al.,2021), which has been
shown to result in stable comparisons across items.
Additionally, this also takes into account the poten-
tial noise due to frequency effects or tokenization
differences (Misra et al.,2021). Estimating these
conditional log-probabilities using auto-regressive
PLMs can be directly computed in a left-to-right
manner. For MLMs, we use their conditional