The Better Your Syntax, the Better Your Semantics? Probing Pretrained
Language Models for the English Comparative Correlative
Leonie Weissweiler*, Valentin Hofmann*†, Abdullatif Köksal*, Hinrich Schütze*
*Center for Information and Language Processing, LMU Munich
*Munich Center of Machine Learning
†Faculty of Linguistics, University of Oxford
{weissweiler,akoksal}@cis.lmu.de
valentin.hofmann@ling-phil.ox.ac.uk
Abstract
Construction Grammar (CxG) is a paradigm
from cognitive linguistics emphasising the
connection between syntax and semantics.
Rather than rules that operate on lexical items,
it posits constructions as the central building
blocks of language, i.e., linguistic units of dif-
ferent granularity that combine syntax and se-
mantics. As a first step towards assessing the
compatibility of CxG with the syntactic and
semantic knowledge demonstrated by state-of-
the-art pretrained language models (PLMs),
we present an investigation of their capabil-
ity to classify and understand one of the most
commonly studied constructions, the English
comparative correlative (CC). We conduct ex-
periments examining the classification accur-
acy of a syntactic probe on the one hand and
the models’ behaviour in a semantic applica-
tion task on the other, with BERT, RoBERTa,
and DeBERTa as the example PLMs. Our res-
ults show that all three investigated PLMs are
able to recognise the structure of the CC but
fail to use its meaning. While human-like per-
formance of PLMs on many NLP tasks has
been alleged, this indicates that PLMs still suf-
fer from substantial shortcomings in central
domains of linguistic knowledge.
1 Introduction
The sentence “The better your syntax, the better
your semantics.” contains a construction called
the English comparative correlative (CC; Fillmore,
1986). Paraphrased, it could be read as “If your
syntax is better, your semantics will also be better.”
Humans reading this sentence are capable of doing
two things: (i) recognising that two instances of
“the” followed by an adjective/adverb in the compar-
ative as well as a phrase of the given structure (i.e.,
the syntax of the CC) express a specific meaning
(i.e., the semantics of the CC); (ii) understanding
the semantic meaning conveyed by the CC, i.e.,
understanding that in a sentence of the given struc-
ture, the second half is somehow correlated with
the first.
In this paper, we ask the following question: are
pretrained language models (PLMs) able to achieve
these two steps? This question is important for
two reasons. Firstly, we hope that recognising the
CC and understanding its meaning is challenging
for PLMs, helping to set the research agenda for
further improvements. Secondly, the CC is one
of the most commonly studied constructions in
construction grammar (CxG), a usage-based syntax
paradigm from cognitive linguistics, thus providing
an interesting alternative to the currently prevailing
practice of analysing the syntactic capabilities of
PLMs with theories from generative grammar (e.g.,
Marvin and Linzen, 2018).
We divide our investigation into two parts. In
the first part, we examine the CC’s syntactic prop-
erties and how they are represented by PLMs, with
the objective to determine whether PLMs can re-
cognise an instance of the CC. More specifically,
we construct two syntactic probes with different
properties: one is inspired by recent probing meth-
odology (e.g., Belinkov et al., 2017; Conneau et al.,
2018) and draws upon minimal pairs to quantify
the amount of information contained in each PLM
layer; for the other one, we write a context-free
grammar (CFG) to construct approximate minimal
pairs in which only the word order determines if
the sentences are an instance of the CC or not. We
find that starting from the third layer, all invest-
igated PLMs are able to distinguish positive from
negative instances of the CC. However, this method
only covers one specific subtype of comparative
sentences. To cover the full diversity of instances,
we conduct an additional experiment for which we
collect and manually label sentences from C4 (Raf-
fel et al., 2020) that resemble instances of the CC,
resulting in a diverse set of sentences that either
are instances of the CC or resemble them closely
without being instances of the CC. Applying the
same methodology to this set of sentences, we ob-
serve that all examined PLMs are still able to sep-
arate the examples very well.

[arXiv:2210.13181v1 [cs.CL] 24 Oct 2022]
In the second part of the paper, we aim to de-
termine if the PLMs are able to understand the
meaning of the CC. We generate test scenarios in
which a statement containing the CC is given to the
PLMs, which they then have to apply in a zero-shot
manner. As this way of testing PLMs is prone to a
variety of biases, we introduce several mitigating
methods in order to determine the full capability
of the PLMs. We find that none of the PLMs we
investigate perform above chance level, indicating
that they are not able to understand and apply the
CC in a measurable way in this context.
We make three main contributions:
• We present the first comprehensive study examin-
ing how well PLMs can recognise and understand
a CxG construction, specifically the English com-
parative correlative.
• We develop a way of testing the PLMs’ recog-
nition of the CC that overcomes the challenge
of probing for linguistic phenomena not lending
themselves to minimal pairs.
• We adapt methods from zero-shot prompting and
calibration to develop a way of testing PLMs for
their understanding of the CC.¹
2 Construction Grammar
2.1 Overview
A core assumption of generative grammar (Chom-
sky, 1988), which can be already found in Bloom-
fieldian structural linguistics (Bloomfield, 1933), is
a strict separation of lexicon and grammar: gram-
mar is conceptualized as a set of compositional
and general rules that operate on a list of arbit-
rary and specific lexical items in generating syn-
tactically well-formed sentences. This dichotom-
ous view was increasingly questioned in the 1980s
when several studies drew attention to the fact
that linguistic units larger than lexical items (e.g.,
idioms) can also possess non-compositional mean-
ings (Langacker, 1987; Lakoff, 1987; Fillmore
et al., 1988; Fillmore, 1989). For instance, it is
not clear how the effect of the words “let alone” (as
in “she doesn’t eat fish, let alone meat”) on both the
syntax and the semantics of the rest of the sentence
could be inferred from general syntactic rules (Fill-
more et al., 1988).

[¹In order to foster research at the intersection of NLP and construction grammar, we will make our data and code available at https://github.com/LeonieWeissweiler/ComparativeCorrelative.]

This insight about the ubiquity
of stored form-meaning pairings in language is ad-
opted as the central tenet of grammatical theory by
Construction Grammar (CxG; see Hoffmann and
Trousdale (2013) for a comprehensive overview).
Rather than a system divided into non-overlapping
syntactic rules and lexical items, CxG views lan-
guage as a structured system of constructions with
varying granularities that encapsulate syntactic and
semantic components as single linguistic signs—
ranging from individual morphemes up to phrasal
elements and fixed expressions (Kay and Fillmore,
1999; Goldberg, 1995). In this framework, syn-
tactic rules can be seen as emergent abstractions
over similar stored constructions (Goldberg, 2003,
2006). A different set of stored constructions can
result in different abstractions and thus different
syntactic rules, which allows CxG to naturally ac-
commodate the dynamic nature of grammar as
evidenced, for instance, by inter-speaker variability
and linguistic change (Hilpert, 2006).
2.2 Construction Grammar and NLP
We see three main motivations for the development
of a first probing approach for CxG:
We believe that the active discourse in (cognit-
ive) linguistics about the best description of hu-
man language capability can be supported and
enriched through a computational exploration of
a wide array of phenomena and viewpoints. We
think that the probing literature in NLP investig-
ating linguistic phenomena with computational
methods should be diversified to include theor-
ies and problems from all points on the broad
spectrum of linguistic scholarship.
We hope that the investigation of large PLMs’ ap-
parent capabilities to imitate human language and
the mechanisms responsible for these capabilit-
ies will be enriched by introducing a usage-based
approach to grammar. This is especially import-
ant as some of the discourse in recent years has
focused on the question of whether PLMs are
constructing syntactically acceptable sentences
for the correct reasons and with the correct under-
lying representations (e.g., McCoy et al., 2019).
We would like to suggest that considering altern-
ative theories of grammar, specifically CxG with
its incorporation of slots in constructions that
may be filled by specific word types and its focus
on learning without an innate, universal grammar,
may be beneficial to understanding the learning
process of PLMs as their capabilities advance
further.
Many constructions present an interesting chal-
lenge for PLMs. In fact, recent work in challenge
datasets (Ribeiro et al.,2020) has already started
using what could be considered constructions,
in an attempt to identify types of sentences that
models struggle with, and to point out a potential
direction for improvement. One of the central
tenets of CxG is the relation between the form of
a construction and its meaning, or to put it in NLP
terms, a model must learn to infer parts of the
sentence meaning from patterns that are present
in it, as opposed to words. We believe this to be
an interesting challenge for future PLMs.
2.3 The English Comparative Correlative
The English comparative correlative (CC) is one
of the most commonly studied constructions in lin-
guistics, for several reasons. Firstly, it constitutes
a clear example of a linguistic phenomenon that
is challenging to explain in the framework of gen-
erative grammar (Culicover and Jackendoff, 1999;
Abeillé and Borsley, 2008), even though there have
been approaches following that school of thought
(Den Dikken, 2005; Iwasaki and Radford, 2009).
Secondly, it exhibits a range of interesting syntactic
and semantic features, as detailed below. These
reasons, we believe, also make the CC an ideal
testbed for a first study attempting to extend the
current trend of syntax probing for rules by devel-
oping methods for probing according to CxG.
The CC can take many different forms, some of
which are exemplified here:
(1) The more, the merrier.
(2) The longer the bake, the browner the colour.
(3) The more she practiced, the better she became.
Semantically, the CC consists of two clauses, where
the second clause can be seen as the dependent vari-
able for the independent variable specified in the
first one (Goldberg, 2003). It can be seen on the one
hand as a statement of a general cause-and-effect
relationship, as in a general conditional statement
(e.g., (2) could be paraphrased as “If the bake is
longer, the colour will be more brown”), and on the
other as a temporal development in a comparative
sentence (paraphrasing (3) as “She became better
over time, and she practiced more over time”). Us-
age of the CC typically implies both readings at the
same time. Syntactically, the CC is characterised
in both clauses by an instance of “the” followed
by an adverb or an adjective in the comparative,
either with “-er” for some adjectives and adverbs,
or with “more” for others, or special forms like
“better”. Special features of the comparative sen-
tences following this are the optional omission of
the future “will” and of “be”, as in (1). Crucially,
“the” in this construction does not function as a de-
terminer of noun phrases (Goldberg, 2003); rather,
it has a function specific to the CC and has vari-
ously been called a “degree word” (Den Dikken,
2005) or “fixed material” (Hoffmann et al., 2019).
3 Syntax
Our investigation of PLMs’ knowledge of the CC
is split into two parts. First, we probe for the PLMs’
knowledge of the syntactic aspects of the CC, to
determine if they recognise its structure. Then we
devise a test of their understanding of its semantic
aspects by investigating their ability to apply, in a
given context, information conveyed by a CC.
3.1 Probing Methods
As the first half of our analysis of PLMs’ know-
ledge of the CC, we investigate its syntactic aspects.
Translated into probing questions, this means that
we ask: can a PLM recognise an instance of the
CC? Can it distinguish instances of the CC from
similar-looking non-instances? Is it able to go bey-
ond the simple recognition of its fixed parts (“The
COMP-ADJ/ADV, the ...”) and group all ways of com-
pleting the sentences that are instances of the CC
separately from all those that are not? And to frame
all of these questions in a syntactic probing frame-
work: will we be able to recover, using a logistic
regression as the probe, this distinguishing inform-
ation from a PLM’s embeddings?
The established way of testing a PLM for its
syntactic knowledge has in recent years become
minimal pairs (e.g., Warstadt et al., 2020; Dem-
szky et al., 2021). This would mean pairs of sen-
tences which are indistinguishable except for the
fact that one of them is an instance of the CC and
the other is not, allowing us to perfectly separate
a model’s knowledge of the CC from other con-
founding factors. While this is indeed possible for
simpler syntactic phenomena such as verb-noun
number agreement, there is no obvious way to con-
struct minimal pairs for the CC. We therefore con-
struct minimal pairs in two ways: one with artificial
data based on a context-free grammar (CFG), and
one with sentences extracted from C4.
3.1.1 Synthetic Data
In order to find a pair of sentences that is as close
as possible to a minimal pair, we devise a way to
modify the words following “The X-er” such that
the sentence is no longer an instance of the con-
struction. The pattern for a positive instance is
“The ADV-er the NUM NOUN VERB”, e.g., “The harder
the two cats fight”. To create a negative instance,
we reorder the pattern to “The ADJ-er NUM VERB
the NOUN”, e.g., “The harder two fight the cats”. The
change in role of the numeral from the depend-
ent of a head to a head itself, made possible by
choosing a verb that can be either transitive or in-
transitive, as well as the change from an adverb
to an adjective, allows us to construct a negative
instance that uses the same words as the positive
one, but in a different order.² In order to generate
a large number of instances, we collect two sets
each of adverbs, numerals, nouns and verbs that
are mutually exclusive between training and test
sets. To investigate if the model is confused by ad-
ditional content in the sentences, we write a CFG
to insert phrases before the start of the first half, in
between the two halves, and after the second half
of the CC (see Appendix, Algorithms 1 and 2, for
the complete CFG).
While this setup is rigorous in the sense
that positive and negative sentences are exactly
matched, it comes with the drawback of only con-
sidering one type of CC. To be able to conduct
a more comprehensive investigation, we adopt a
complementary approach and turn to pairs extrac-
ted from C4 (see Appendix, Tables 6 and 7, for
examples of training and test data). These cover a
broad range of CC patterns, albeit without meeting
the criterion that positive and negative samples are
exactly matched.
3.1.2 Corpus-based Minimal Pairs
[²Note that an alternative reading of this sentence exists: the numeral “two” forms the noun phrase by itself and “The harder” is still interpreted as part of the CC. The sentence is actually a positive instance on this interpretation. We regard this reading as very improbable.]

While accepting that positive and negative in-
stances extracted from a corpus will automatically
not be minimal and therefore contain some lexical
overlap and context cues, we attempt to regularise
our retrieved instances as far as possible. To form
a first candidate set, we POS tag C4 using spaCy
(Honnibal and Montani, 2018) and extract all sen-
tences that follow the pattern “The” (DET) followed
by either “more” and an adjective or adverb, or an
adjective or adverb ending in “-er”, and at any point
later in the sentence again the same pattern. We dis-
card examples with adverbs or adjectives that were
falsely labelled as comparative, such as “other”.
We then group these sentences by their sequence of
POS tags, and manually classify the sequences as
either positive or negative instances. We observe
that sentences sharing a POS tag pattern tend to be
either all negative or all positive instances, allowing
us to save annotation time by working at the POS
tag pattern level instead of the sentence level. To
make the final set as diverse as possible, we sort the
patterns randomly and label as many as possible.
In order to further reduce interfering factors in our
probe, we separate the POS tag patterns between
training and test sets (see Appendix, Table 8, for
examples).
3.1.3 The Probe
For both datasets, we investigate the overall ac-
curacy of our probe as well as the impact of sev-
eral factors. The probe consists of training a
simple logistic regression model on top of the
mean-pooled sentence embeddings (Vulić et al.,
2020). To quantify the impact of the length of
the sentence, the start position of the construction,
the position of its second half, and the distance
between them, we construct four different subsets
$D^{\mathrm{train}}_f$ and $D^{\mathrm{test}}_f$ from both the artificially construc-
ted and the corpus-based dataset. For each subset,
we sample sentences such that both the positive and
the negative class is balanced across every value of
the feature within a certain range of values. This
ensures that the probes are unable to exploit correla-
tions between a class and any of the above features.
We create the dataset as follows:

$$D_f = \bigcup_{v \in f_v} \bigcup_{l \in L} S(D, v, l, n),$$

where $f$ is the feature, $f_v$ is the set of values for
$f$, $L = \{\text{positive}, \text{negative}\}$ are the labels, and $S$
is a function that returns $n$ elements from $D$ that
have value $v$ and label $l$.
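A minimal sketch of this balanced sampling scheme, including the quartile-restricted training split described in the following paragraph, might look like this. The toy records and their field names (`value`, `label`, `sent`) are invented for the example; `value` stands in for a feature such as sentence length.

```python
import random

def S(D, v, l, n):
    """Return n examples from D that have feature value v and label l."""
    pool = [x for x in D if x["value"] == v and x["label"] == l]
    return random.sample(pool, n)

def build_subset(D, values, n):
    """D_f = union over feature values v and labels l of S(D, v, l, n)."""
    return [x for v in values for l in ("positive", "negative")
            for x in S(D, v, l, n)]

def split(D, n_train, n_test):
    values = sorted({x["value"] for x in D})
    vmin, vmax = values[0], values[-1]
    # Training only uses the lowest quartile of the feature range...
    train_vals = [v for v in values if v <= vmin + (vmax - vmin) / 4]
    train = build_subset(D, train_vals, n_train)
    # ...while testing covers the full range of values.
    test = build_subset(D, values, n_test)
    return train, test

# Toy data: 8 feature values, 20 examples per (value, label) pair.
random.seed(0)
D = [{"value": v, "label": l, "sent": f"s{i}"}
     for i, (v, l) in enumerate((v, l) for v in range(1, 9)
                                for l in ("positive", "negative")
                                for _ in range(20))]
train, test = split(D, n_train=3, n_test=2)
```

By construction, each feature value contributes equally many positive and negative examples, so the probe cannot exploit a correlation between the feature and the class.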
[Figure 1: Overall accuracy per layer for $D_{\mathrm{length}}$. All shown models are the large model variants. The models can easily distinguish between positive and negative examples in at least some of their layers.]

To make this task more cognitively realistic,
we aim to test if a model is able to general-
ise from shorter sentences, which contain relat-
ively little additional information besides the parts
relevant to the classification task, to those with
greater potential interference due to more addi-
tional content that is not useful for classification.
Thus, we restrict the training set to samples from
the lowest quartile of each feature, so that $f_v$ be-
comes $[v^{\min}_f, v^{\min}_f + \frac{1}{4}(v^{\max}_f - v^{\min}_f)]$ for $D^{\mathrm{train}}_f$
and $[v^{\min}_f, v^{\max}_f]$ for $D^{\mathrm{test}}_f$. We report the test perform-
ance for every value of a given feature separately to
recognise patterns. For the artificial syntax probing,
we generate 1000 data points for each value of each
feature, for both the training and the test set of each
subset associated with a feature. For the corpus syntax
probing, we collect 9710 positive and 533 negat-
ive sentences in total, from which we choose 10
training and 5 test sentences for each value of each
feature in a similar manner. To improve compar-
ability and make the experiment computationally
feasible, we test the “large” size of each of our
three models, using the Huggingface Transformers
library (Wolf et al., 2019). Our logistic regression
probes are implemented using scikit-learn (Pedre-
gosa et al., 2011).
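The probe itself is just a linear classifier over frozen, mean-pooled features. A self-contained stand-in for the per-layer procedure could be sketched as follows, with two substitutions to keep it runnable anywhere: synthetic token embeddings replace the actual PLM hidden states, and a hand-rolled gradient-descent logistic regression replaces scikit-learn’s `LogisticRegression`.

```python
import math
import random

def mean_pool(token_vecs):
    """Mean-pool a list of token vectors into one sentence vector."""
    dim = len(token_vecs[0])
    return [sum(v[d] for v in token_vecs) / len(token_vecs) for d in range(dim)]

def train_logreg(X, y, lr=0.1, epochs=100):
    """Plain logistic regression via SGD (stand-in for scikit-learn)."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wd * xd for wd, xd in zip(w, xi)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - yi   # dLoss/dz
            for d in range(dim):
                w[d] -= lr * g * xi[d]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return int(sum(wd * xd for wd, xd in zip(w, x)) + b > 0)

rng = random.Random(0)

def fake_sentence(label, n_tokens=8, dim=16):
    # Stand-in for one layer's hidden states: token embeddings of
    # "positive" sentences are shifted by +0.5 in every dimension.
    return [[rng.gauss(0.5 * label, 1.0) for _ in range(dim)] for _ in range(n_tokens)]

labels = [0, 1] * 50
X = [mean_pool(fake_sentence(l)) for l in labels]
w, b = train_logreg(X, labels)
acc = sum(predict(w, b, x) == yl for x, yl in zip(X, labels)) / len(labels)
print(f"probe accuracy on this layer: {acc:.2f}")
```

Running this once per layer of a real PLM (with the layer’s actual hidden states in place of `fake_sentence`) yields the per-layer accuracy curves reported in Figure 1.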
3.2 Probing Results
3.2.1 Artificial Data
As shown in Figure 1, the results of our syntactic
probe indicate that all models can easily distin-
guish between positive and negative examples in
at least some of their layers, independently of any
of the sentence properties that we have investig-
ated. We report full results in the Appendix in
Figures 2, 3, and 4. We find a clear trend that De-
BERTa performs better than RoBERTa, which in
turn performs better than BERT across the board.
As DeBERTa’s performance in all layers is nearly
perfect, we are unable to observe patterns related to
the length of the sentence, the start position of the
CC, the start position of the second half of the CC,
and the distance between them. By contrast, we ob-
serve interesting patterns for BERT and RoBERTa.
For $D_{\mathrm{length}}$, and to a lesser degree $D_{\mathrm{distance}}$ (which
correlates with it), we observe that at first, perform-
ance goes down with increased length as we would
expect—the model struggles to generalise to longer
sentences with more interference since it was only
trained on short ones. However, this trend is re-
versed in the last few layers. We hypothesize this
may be due to an increased focus on semantics in
the last layers (Peters et al., 2018; Tenney et al.,
2019), which could lead to interfering features par-
ticularly in shorter sentences.
3.2.2 Corpus Data
In contrast, the results of our probe on more nat-
ural data from C4 indicate two different trends:
first, as the positive and negative instances are not
identical on a bag-of-word level, performance is
not uniformly at 50% (i.e., chance) level in the first
layers, indicating that the model can exploit lexical
cues to some degree. We observe a similar trend as
with the artificial experiment, which showed that
DeBERTa performs best and BERT worst. The cor-
responding graphs can be found in the Appendix in
Figures 5, 6, and 7.
Generally, this additional corpus-based experi-
ment validates our findings from the experiment
with artificially generated data, as all models per-
form at 80% or better from the middle layers on,
indicating that the models are able to classify in-
stances of the construction even when they are very
diverse and use unseen POS tag patterns.
Comparing the average accuracies on $D_{\mathrm{length}}$
for both data sources in Figure 1, we observe that
all models perform better on artificial than on cor-
pus data from the fifth layer on, with the notable
exception of a dip in performance for BERT large
around layer 10.
4 Semantics
4.1 Probing Methods
4.1.1 Usage-based Testing
For the second half of our investigation, we turn
to semantics. In order to determine if a model has
understood the meaning of the CC, i.e., if it has
understood that in any sentence, “the COMP ... the