
sai, 2019; Derby et al., 2021); and the other which uses cloze-testing, in which LMs are tasked to fill in the blank in prompts that describe specific properties/factual knowledge about the world (Petroni et al., 2019; Weir et al., 2020). We argue that both approaches—though insightful—have key limitations for evaluating property knowledge, and that minimal pair testing overcomes these limitations to a beneficial extent.
Apart from ongoing debates surrounding the validity of probing classifiers (see Hewitt and Liang, 2019; Ravichander et al., 2021; Belinkov, 2022), the probing setup does not allow the testing of property knowledge in a precise manner. Specifically, several properties are often perfectly correlated in datasets such as the one we use here (see §2.2). For example, the properties of being an animal, being able to breathe, being able to grow, etc., are all perfectly correlated with one another. Even if the model's true knowledge of these properties is highly variable, probing its representations for them would yield the exact same result, leading to conclusions that overestimate the model's capacity for some properties while underestimating it for others. Evaluation using minimal pair sentences overcomes this limitation by allowing us to explicitly represent the properties of interest in language form, thereby allowing precise testing of property knowledge.
Similarly, standard cloze-testing of PLMs (Petroni et al., 2019; Weir et al., 2020; Jiang et al., 2021) also faces multiple limitations. First, it does not allow for testing of multi-word expressions, as by definition it involves prediction of a single word/token. Second, it does not yield faithful conclusions about one-to-many or many-to-many relations: e.g., the cloze prompts "Ravens can ___." and "___ can fly." do not have a single correct answer. This makes our conclusions about models' knowledge contingent on the choice of one correct completion over another. The minimal pair evaluation paradigm overcomes these issues by generalizing the cloze-testing method to multi-word expressions—by focusing on entire sentences—and at the same time pairing every prompt with a negative instance. This allows for a straightforward way to assess correctness: the choice between multiple correct completions is transformed into one between correct and incorrect, at the cost of having several different instances (pairs) for testing knowledge of the same property. Additionally, the minimal pairs paradigm also allows us to shed light on how the nature of negative samples affects model behavior, which has been missing in approaches using probing and cloze-testing. The usage of minimal pairs is a well-established practice in the literature, having been widely used in works that analyze syntactic knowledge of LMs (Marvin and Linzen, 2018; Futrell et al., 2019; Warstadt et al., 2020). We complement this growing literature by introducing minimal-pair testing to the study of conceptual knowledge in PLMs.
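As an illustration of the paradigm (a minimal sketch, not our actual evaluation code), the comparison underlying a single minimal pair can be implemented with an off-the-shelf causal LM from the HuggingFace transformers library as follows; the checkpoint name and the example pair are illustrative placeholders rather than COMPS items:

```python
# Minimal sketch: score both members of a minimal pair with a causal LM and
# check whether the acceptable sentence receives the higher log-probability.
# The model checkpoint and the sentence pair below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM checkpoint would do here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sentence_log_prob(sentence: str) -> float:
    """Summed log-probability assigned to `sentence` by the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels returns the mean cross-entropy over predicted tokens;
        # multiplying back by the number of predictions gives a summed log-prob.
        loss = model(ids, labels=ids).loss
    n_predicted = ids.size(1) - 1  # the first token has no prediction target
    return -loss.item() * n_predicted


acceptable = "A robin can fly."      # property attributed to an acceptable concept
unacceptable = "A penguin can fly."  # same property, unacceptable concept

correct = sentence_log_prob(acceptable) > sentence_log_prob(unacceptable)
print(f"Model prefers the acceptable sentence: {correct}")
```

For masked LMs, an analogous comparison can be made using pseudo-log-likelihood-style scores; all the paradigm requires is a relative score for each member of the pair.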
Our property inheritance analyses closely relate to the 'Leap-of-Thought' (LoT) framework of Talmor et al. (2020). In particular, LoT holds the taxonomic relations between concepts implicit and tests whether models can abstract over them to make property inferences—e.g., testing the extent to which models assign Whales have bellybuttons the 'True' label, given that Mammals have bellybuttons (with the implicit knowledge here being Whales are mammals). With COMPS-WUGS (and COMPS-WUGS-DIST), we instead explicitly provide the relevant taxonomic knowledge in the context and target whether PLMs can behave consistently with knowledge they have already demonstrated (in the base case, COMPS-BASE) and attribute the property in question to the correct subordinate concept. This also relates to recent work that measures consistency of PLMs' word prediction capacities in eliciting factual knowledge (Elazar et al., 2021; Ravichander et al., 2020).
2.2 Ground-truth Property Knowledge Data
For our ground-truth property knowledge resource, we use a subset of the CSLB property norms collected by Devereux et al. (2014), which was further extended by Misra et al. (2022). The original dataset was constructed by asking 123 human participants to generate properties for 638 everyday concepts. Contemporary work has used this dataset by taking, for each property, all concepts for which the property was generated as positive instances, and the rest as negative instances (Lucy and Gauthier, 2017; Da and Kasai, 2019, etc.).
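To make this construction concrete, a minimal sketch is given below (assuming, hypothetically, a long-format table with columns "concept" and "property"; this is not the dataset's actual schema):

```python
# Sketch of the positive/negative split used in prior work on property norms.
# The toy table and the column names are hypothetical placeholders; the real
# norms are far larger and richer than this.
import pandas as pd

norms = pd.DataFrame(
    {
        "concept":  ["dog",         "dog",           "canary",  "sofa"],
        "property": ["can breathe", "has four legs", "can fly", "has legs"],
    }
)

all_concepts = set(norms["concept"])


def split_concepts(prop: str) -> tuple[set, set]:
    """Concepts that did vs. did not have `prop` generated for them."""
    positives = set(norms.loc[norms["property"] == prop, "concept"])
    negatives = all_concepts - positives  # every remaining concept is a negative
    return positives, negatives


pos, neg = split_concepts("can breathe")
print(pos, neg)  # positives: {'dog'}; negatives: all remaining concepts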
While this dataset has been popularly used in related literature, Misra et al. (2022) recently discovered striking gaps in coverage among the properties included in the dataset.[1] For example, the property can breathe was only generated for 6 out of 152 animal concepts, despite being applicable

[1] See also Sommerauer and Fokkens (2018) and Sommerauer (2022), who also discuss this limitation.