The Better Your Syntax, the Better Your Semantics? Probing Pretrained
Language Models for the English Comparative Correlative
Leonie Weissweiler*, Valentin Hofmann*†, Abdullatif Köksal*, Hinrich Schütze*
*Center for Information and Language Processing, LMU Munich
*Munich Center of Machine Learning
†Faculty of Linguistics, University of Oxford
{weissweiler,akoksal}@cis.lmu.de
valentin.hofmann@ling-phil.ox.ac.uk
Abstract
Construction Grammar (CxG) is a paradigm
from cognitive linguistics emphasising the
connection between syntax and semantics.
Rather than rules that operate on lexical items,
it posits constructions as the central building
blocks of language, i.e., linguistic units of dif-
ferent granularity that combine syntax and se-
mantics. As a first step towards assessing the
compatibility of CxG with the syntactic and
semantic knowledge demonstrated by state-of-
the-art pretrained language models (PLMs),
we present an investigation of their capabil-
ity to classify and understand one of the most
commonly studied constructions, the English
comparative correlative (CC). We conduct ex-
periments examining the classification accur-
acy of a syntactic probe on the one hand and
the models’ behaviour in a semantic applica-
tion task on the other, with BERT, RoBERTa,
and DeBERTa as the example PLMs. Our res-
ults show that all three investigated PLMs are
able to recognise the structure of the CC but
fail to use its meaning. While human-like per-
formance of PLMs on many NLP tasks has
been alleged, this indicates that PLMs still suf-
fer from substantial shortcomings in central
domains of linguistic knowledge.
1 Introduction
The sentence “The better your syntax, the better
your semantics.” contains a construction called
the English comparative correlative (CC; Fillmore,
1986). Paraphrased, it could be read as “If your
syntax is better, your semantics will also be better.”
Humans reading this sentence are capable of doing
two things: (i) recognising that two instances of
“the” followed by an adjective/adverb in the compar-
ative as well as a phrase of the given structure (i.e.,
the syntax of the CC) express a specific meaning
(i.e., the semantics of the CC); (ii) understanding
the semantic meaning conveyed by the CC, i.e.,
understanding that in a sentence of the given struc-
ture, the second half is somehow correlated with
the first.
In this paper, we ask the following question: are
pretrained language models (PLMs) able to achieve
these two steps? This question is important for
two reasons. Firstly, we hope that recognising the
CC and understanding its meaning is challenging
for PLMs, helping to set the research agenda for
further improvements. Secondly, the CC is one
of the most commonly studied constructions in
construction grammar (CxG), a usage-based syntax
paradigm from cognitive linguistics, thus providing
an interesting alternative to the currently prevailing
practice of analysing the syntactic capabilities of
PLMs with theories from generative grammar (e.g.,
Marvin and Linzen, 2018).
We divide our investigation into two parts. In
the first part, we examine the CC’s syntactic prop-
erties and how they are represented by PLMs, with
the objective to determine whether PLMs can re-
cognise an instance of the CC. More specifically,
we construct two syntactic probes with different
properties: one is inspired by recent probing meth-
odology (e.g., Belinkov et al., 2017; Conneau et al.,
2018) and draws upon minimal pairs to quantify
the amount of information contained in each PLM
layer; for the other one, we write a context-free
grammar (CFG) to construct approximate minimal
pairs in which only the word order determines if
the sentences are an instance of the CC or not. We
find that starting from the third layer, all invest-
igated PLMs are able to distinguish positive from
negative instances of the CC. However, this method
only covers one specific subtype of comparative
sentences. To cover the full diversity of instances,
we conduct an additional experiment for which we
collect and manually label sentences from C4 (Raf-
fel et al., 2020) that resemble instances of the CC,
resulting in a diverse set of sentences that either
are instances of the CC or resemble them closely
without being instances of the CC. Applying the
same methodology to this set of sentences, we ob-
serve that all examined PLMs are still able to sep-
arate the examples very well.

[arXiv:2210.13181v1 [cs.CL] 24 Oct 2022]
In the second part of the paper, we aim to de-
termine if the PLMs are able to understand the
meaning of the CC. We generate test scenarios in
which a statement containing the CC is given to the
PLMs, which they then have to apply in a zero-shot
manner. As this way of testing PLMs is prone to a
variety of biases, we introduce several mitigating
methods in order to determine the full capability
of the PLMs. We find that none of the PLMs we
investigate perform above chance level, indicating
that they are not able to understand and apply the
CC in a measurable way in this context.
We make three main contributions:
• We present the first comprehensive study examin-
ing how well PLMs can recognise and understand
a CxG construction, specifically the English com-
parative correlative.
• We develop a way of testing the PLMs’ recog-
nition of the CC that overcomes the challenge
of probing for linguistic phenomena not lending
themselves to minimal pairs.
• We adapt methods from zero-shot prompting and
calibration to develop a way of testing PLMs for
their understanding of the CC.¹
2 Construction Grammar
2.1 Overview
A core assumption of generative grammar (Chom-
sky, 1988), which can be already found in Bloom-
fieldian structural linguistics (Bloomfield, 1933), is
a strict separation of lexicon and grammar: gram-
mar is conceptualized as a set of compositional
and general rules that operate on a list of arbit-
rary and specific lexical items in generating syn-
tactically well-formed sentences. This dichotom-
ous view was increasingly questioned in the 1980s
when several studies drew attention to the fact
that linguistic units larger than lexical items (e.g.,
idioms) can also possess non-compositional mean-
ings (Langacker, 1987; Lakoff, 1987; Fillmore
et al., 1988; Fillmore, 1989). For instance, it is
not clear how the effect of the words “let alone” (as
in “she doesn’t eat fish, let alone meat”) on both the
syntax and the semantics of the rest of the sentence
could be inferred from general syntactic rules (Fill-
more et al., 1988).

[¹In order to foster research at the intersection of NLP and construction grammar, we will make our data and code available at https://github.com/LeonieWeissweiler/ComparativeCorrelative.]

This insight about the ubiquity
of stored form-meaning pairings in language is ad-
opted as the central tenet of grammatical theory by
Construction Grammar (CxG; see Hoffmann and
Trousdale (2013) for a comprehensive overview).
Rather than a system divided into non-overlapping
syntactic rules and lexical items, CxG views lan-
guage as a structured system of constructions with
varying granularities that encapsulate syntactic and
semantic components as single linguistic signs—
ranging from individual morphemes up to phrasal
elements and fixed expressions (Kay and Fillmore,
1999; Goldberg, 1995). In this framework, syn-
tactic rules can be seen as emergent abstractions
over similar stored constructions (Goldberg, 2003,
2006). A different set of stored constructions can
result in different abstractions and thus different
syntactic rules, which allows CxG to naturally ac-
commodate the dynamic nature of grammar as
evidenced, for instance, by inter-speaker variability
and linguistic change (Hilpert, 2006).
2.2 Construction Grammar and NLP
We see three main motivations for the development
of a first probing approach for CxG:
We believe that the active discourse in (cognit-
ive) linguistics about the best description of hu-
man language capability can be supported and
enriched through a computational exploration of
a wide array of phenomena and viewpoints. We
think that the probing literature in NLP investig-
ating linguistic phenomena with computational
methods should be diversified to include theor-
ies and problems from all points on the broad
spectrum of linguistic scholarship.
We hope that the investigation of large PLMs’ ap-
parent capabilities to imitate human language and
the mechanisms responsible for these capabilit-
ies will be enriched by introducing a usage-based
approach to grammar. This is especially import-
ant as some of the discourse in recent years has
focused on the question of whether PLMs are
constructing syntactically acceptable sentences
for the correct reasons and with the correct under-
lying representations (e.g., McCoy et al., 2019).
We would like to suggest that considering altern-
ative theories of grammar, specifically CxG with
its incorporation of slots in constructions that
may be filled by specific word types and its focus
on learning without an innate, universal grammar,
may be beneficial to understanding the learning
process of PLMs as their capabilities advance
further.
Many constructions present an interesting chal-
lenge for PLMs. In fact, recent work in challenge
datasets (Ribeiro et al.,2020) has already started
using what could be considered constructions,
in an attempt to identify types of sentences that
models struggle with, and to point out a potential
direction for improvement. One of the central
tenets of CxG is the relation between the form of
a construction and its meaning, or to put it in NLP
terms, a model must learn to infer parts of the
sentence meaning from patterns that are present
in it, as opposed to words. We believe this to be
an interesting challenge for future PLMs.
2.3 The English Comparative Correlative
The English comparative correlative (CC) is one
of the most commonly studied constructions in lin-
guistics, for several reasons. Firstly, it constitutes
a clear example of a linguistic phenomenon that
is challenging to explain in the framework of gen-
erative grammar (Culicover and Jackendoff, 1999;
Abeillé and Borsley, 2008), even though there have
been approaches following that school of thought
(Den Dikken, 2005; Iwasaki and Radford, 2009).
Secondly, it exhibits a range of interesting syntactic
and semantic features, as detailed below. These
reasons, we believe, also make the CC an ideal
testbed for a first study attempting to extend the
current trend of syntax probing for rules by devel-
oping methods for probing according to CxG.
The CC can take many different forms, some of
which are exemplified here:
(1) The more, the merrier.
(2) The longer the bake, the browner the colour.
(3) The more she practiced, the better she became.
Semantically, the CC consists of two clauses, where
the second clause can be seen as the dependent vari-
able for the independent variable specified in the
first one (Goldberg, 2003). It can be seen on the one
hand as a statement of a general cause-and-effect
relationship, as in a general conditional statement
(e.g., (2) could be paraphrased as “If the bake is
longer, the colour will be more brown”), and on the
other as a temporal development in a comparative
sentence (paraphrasing (3) as “She became better
over time, and she practiced more over time”). Us-
age of the CC typically implies both readings at the
same time. Syntactically, the CC is characterised
in both clauses by an instance of “the” followed
by an adverb or an adjective in the comparative,
either with “-er” for some adjectives and adverbs,
or with “more” for others, or special forms like
“better”. Special features of the comparative sen-
tences following this are the optional omission of
the future “will” and of “be”, as in (1). Crucially,
“the” in this construction does not function as a de-
terminer of noun phrases (Goldberg, 2003); rather,
it has a function specific to the CC and has vari-
ously been called a “degree word” (Den Dikken,
2005) or “fixed material” (Hoffmann et al., 2019).
3 Syntax
Our investigation of PLMs’ knowledge of the CC
is split into two parts. First, we probe for the PLMs’
knowledge of the syntactic aspects of the CC, to
determine if they recognise its structure. Then we
devise a test of their understanding of its semantic
aspects by investigating their ability to apply, in a
given context, information conveyed by a CC.
3.1 Probing Methods
As the first half of our analysis of PLMs’ know-
ledge of the CC, we investigate its syntactic aspects.
Translated into probing questions, this means that
we ask: can a PLM recognise an instance of the
CC? Can it distinguish instances of the CC from
similar-looking non-instances? Is it able to go bey-
ond the simple recognition of its fixed parts (“The
COMP-ADJ/ADV, the ...”) and group all ways of com-
pleting the sentences that are instances of the CC
separately from all those that are not? And to frame
all of these questions in a syntactic probing frame-
work: will we be able to recover, using a logistic
regression as the probe, this distinguishing inform-
ation from a PLM’s embeddings?
The established way of testing a PLM for its
syntactic knowledge has in recent years become
minimal pairs (e.g., Warstadt et al., 2020; Dem-
szky et al., 2021). This would mean pairs of sen-
tences which are indistinguishable except for the
fact that one of them is an instance of the CC and
the other is not, allowing us to perfectly separate
a model’s knowledge of the CC from other con-
founding factors. While this is indeed possible for
simpler syntactic phenomena such as verb-noun
number agreement, there is no obvious way to con-
struct minimal pairs for the CC. We therefore con-
struct minimal pairs in two ways: one with artificial
data based on a context-free grammar (CFG), and
one with sentences extracted from C4.
3.1.1 Synthetic Data
In order to find a pair of sentences that is as close
as possible to a minimal pair, we devise a way to
modify the words following “The X-er” such that
the sentence is no longer an instance of the con-
struction. The pattern for a positive instance is
“The ADV-er the NUM NOUN VERB”, e.g., “The harder
the two cats fight”. To create a negative instance,
we reorder the pattern to “The ADJ-er NUM VERB
the NOUN”, e.g., “The harder two fight the cats”. The
change in role of the numeral from the depend-
ent of a head to a head itself, made possible by
choosing a verb that can be either transitive or in-
transitive, as well as the change from an adverb
to an adjective, allows us to construct a negative
instance that uses the same words as the positive
one, but in a different order.² In order to generate
a large number of instances, we collect two sets
each of adverbs, numerals, nouns and verbs that
are mutually exclusive between training and test
sets. To investigate if the model is confused by ad-
ditional content in the sentences, we write a CFG
to insert phrases before the start of the first half, in
between the two halves, and after the second half
of the CC (see Appendix, Algorithms 1 and 2, for
the complete CFG).
While this setup is rigorous in the sense
that positive and negative sentences are exactly
matched, it comes with the drawback of only con-
sidering one type of CC. To be able to conduct
a more comprehensive investigation, we adopt a
complementary approach and turn to pairs extrac-
ted from C4 (see Appendix, Tables 6 and 7, for
examples of training and test data). These cover a
broad range of CC patterns, albeit without meeting
the criterion that positive and negative samples are
exactly matched.
3.1.2 Corpus-based Minimal Pairs
[²Note that an alternative reading of this sentence exists: the numeral “two” forms the noun phrase by itself and “The harder” is still interpreted as part of the CC. The sentence is actually a positive instance on this interpretation. We regard this reading as very improbable.]

While accepting that positive and negative in-
stances extracted from a corpus will automatically
not be minimal and therefore contain some lexical
overlap and context cues, we attempt to regularise
our retrieved instances as far as possible. To form
a first candidate set, we POS tag C4 using spaCy
(Honnibal and Montani, 2018) and extract all sen-
tences that follow the pattern “The” (DET) followed
by either “more” and an adjective or adverb, or an
adjective or adverb ending in “-er”, and at any point
later in the sentence again the same pattern. We dis-
card examples with adverbs or adjectives that were
falsely labelled as comparative, such as “other”.
We then group these sentences by their sequence of
POS tags, and manually classify the sequences as
either positive or negative instances. We observe
that sentences sharing a POS tag pattern tend to be
either all negative or all positive instances, allowing
us to save annotation time by working at the POS
tag pattern level instead of the sentence level. To
make the final set as diverse as possible, we sort the
patterns randomly and label as many as possible.
In order to further reduce interfering factors in our
probe, we separate the POS tag patterns between
training and test sets (see Appendix, Table 8, for
examples).
3.1.3 The Probe
For both datasets, we investigate the overall ac-
curacy of our probe as well as the impact of sev-
eral factors. The probe consists of training a
simple logistic regression model on top of the
mean-pooled sentence embeddings (Vulić et al.,
2020). To quantify the impact of the length of
the sentence, the start position of the construction,
the position of its second half, and the distance
between them, we construct four different subsets
$D^{\mathrm{train}}_f$ and $D^{\mathrm{test}}_f$ from both the artificially construc-
ted and the corpus-based dataset. For each subset,
we sample sentences such that both the positive and
the negative class is balanced across every value of
the feature within a certain range of values. This
ensures that the probes are unable to exploit correla-
tions between a class and any of the above features.
We create the dataset as follows:

$$D_f = \bigcup_{v \in f_v} \bigcup_{l \in L} S(D, v, l, n),$$

where $f$ is the feature, $f_v$ is the set of values for
$f$, $L = \{\text{positive}, \text{negative}\}$ are the labels, and $S$
is a function that returns $n$ elements from $D$ that
have value $v$ and label $l$.
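A minimal sketch of this balanced sampling scheme, including the quartile-restricted training split described in the following paragraph, might look like this. The toy records and their field names (`value`, `label`, `sent`) are invented for the example; `value` stands in for a feature such as sentence length.

```python
import random

def S(D, v, l, n):
    """Return n examples from D that have feature value v and label l."""
    pool = [x for x in D if x["value"] == v and x["label"] == l]
    return random.sample(pool, n)

def build_subset(D, values, n):
    """D_f = union over feature values v and labels l of S(D, v, l, n)."""
    return [x for v in values for l in ("positive", "negative")
            for x in S(D, v, l, n)]

def split(D, n_train, n_test):
    values = sorted({x["value"] for x in D})
    vmin, vmax = values[0], values[-1]
    # Training only uses the lowest quartile of the feature range...
    train_vals = [v for v in values if v <= vmin + (vmax - vmin) / 4]
    train = build_subset(D, train_vals, n_train)
    # ...while testing covers the full range of values.
    test = build_subset(D, values, n_test)
    return train, test

# Toy data: 8 feature values, 20 examples per (value, label) pair.
random.seed(0)
D = [{"value": v, "label": l, "sent": f"s{i}"}
     for i, (v, l) in enumerate((v, l) for v in range(1, 9)
                                for l in ("positive", "negative")
                                for _ in range(20))]
train, test = split(D, n_train=3, n_test=2)
```

By construction, each feature value contributes equally many positive and negative examples, so the probe cannot exploit a correlation between the feature and the class.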
[Figure 1: Overall accuracy per layer for $D_{\mathrm{length}}$. All shown models are the large model variants. The models can easily distinguish between positive and negative examples in at least some of their layers.]

To make this task more cognitively realistic,
we aim to test if a model is able to general-
ise from shorter sentences, which contain relat-
ively little additional information besides the parts
relevant to the classification task, to those with
greater potential interference due to more addi-
tional content that is not useful for classification.
Thus, we restrict the training set to samples from
the lowest quartile of each feature, so that $f_v$ be-
comes $[v^{\min}_f, v^{\min}_f + \frac{1}{4}(v^{\max}_f - v^{\min}_f)]$ for $D^{\mathrm{train}}_f$
and $[v^{\min}_f, v^{\max}_f]$ for $D^{\mathrm{test}}_f$. We report the test perform-
ance for every value of a given feature separately to
recognise patterns. For the artificial syntax probing,
we generate 1000 data points for each value of each
feature, for both the training and the test set of each
subset associated with a feature. For the corpus syntax
probing, we collect 9710 positive and 533 negat-
ive sentences in total, from which we choose 10
training and 5 test sentences for each value of each
feature in a similar manner. To improve compar-
ability and make the experiment computationally
feasible, we test the “large” size of each of our
three models, using the Huggingface Transformers
library (Wolf et al., 2019). Our logistic regression
probes are implemented using scikit-learn (Pedre-
gosa et al., 2011).
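The probe itself is just a linear classifier over frozen, mean-pooled features. A self-contained stand-in for the per-layer procedure could be sketched as follows, with two substitutions to keep it runnable anywhere: synthetic token embeddings replace the actual PLM hidden states, and a hand-rolled gradient-descent logistic regression replaces scikit-learn’s `LogisticRegression`.

```python
import math
import random

def mean_pool(token_vecs):
    """Mean-pool a list of token vectors into one sentence vector."""
    dim = len(token_vecs[0])
    return [sum(v[d] for v in token_vecs) / len(token_vecs) for d in range(dim)]

def train_logreg(X, y, lr=0.1, epochs=100):
    """Plain logistic regression via SGD (stand-in for scikit-learn)."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wd * xd for wd, xd in zip(w, xi)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - yi   # dLoss/dz
            for d in range(dim):
                w[d] -= lr * g * xi[d]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return int(sum(wd * xd for wd, xd in zip(w, x)) + b > 0)

rng = random.Random(0)

def fake_sentence(label, n_tokens=8, dim=16):
    # Stand-in for one layer's hidden states: token embeddings of
    # "positive" sentences are shifted by +0.5 in every dimension.
    return [[rng.gauss(0.5 * label, 1.0) for _ in range(dim)] for _ in range(n_tokens)]

labels = [0, 1] * 50
X = [mean_pool(fake_sentence(l)) for l in labels]
w, b = train_logreg(X, labels)
acc = sum(predict(w, b, x) == yl for x, yl in zip(X, labels)) / len(labels)
print(f"probe accuracy on this layer: {acc:.2f}")
```

Running this once per layer of a real PLM (with the layer’s actual hidden states in place of `fake_sentence`) yields the per-layer accuracy curves reported in Figure 1.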
3.2 Probing Results
3.2.1 Artificial Data
As shown in Figure 1, the results of our syntactic
probe indicate that all models can easily distin-
guish between positive and negative examples in
at least some of their layers, independently of any
of the sentence properties that we have investig-
ated. We report full results in the Appendix in
Figures 2, 3, and 4. We find a clear trend that De-
BERTa performs better than RoBERTa, which in
turn performs better than BERT across the board.
As DeBERTa’s performance in all layers is nearly
perfect, we are unable to observe patterns related to
the length of the sentence, the start position of the
CC, the start position of the second half of the CC,
and the distance between them. By contrast, we ob-
serve interesting patterns for BERT and RoBERTa.
For $D_{\mathrm{length}}$, and to a lesser degree $D_{\mathrm{distance}}$ (which
correlates with it), we observe that at first, perform-
ance goes down with increased length as we would
expect—the model struggles to generalise to longer
sentences with more interference since it was only
trained on short ones. However, this trend is re-
versed in the last few layers. We hypothesize this
may be due to an increased focus on semantics in
the last layers (Peters et al., 2018; Tenney et al.,
2019), which could lead to interfering features par-
ticularly in shorter sentences.
3.2.2 Corpus Data
In contrast, the results of our probe on more nat-
ural data from C4 indicate two different trends:
first, as the positive and negative instances are not
identical on a bag-of-word level, performance is
not uniformly at 50% (i.e., chance) level in the first
layers, indicating that the model can exploit lexical
cues to some degree. We observe a similar trend as
with the artificial experiment, which showed that
DeBERTa performs best and BERT worst. The cor-
responding graphs can be found in the Appendix in
Figures 5, 6, and 7.
Generally, this additional corpus-based experi-
ment validates our findings from the experiment
with artificially generated data, as all models per-
form at 80% or better from the middle layers on,
indicating that the models are able to classify in-
stances of the construction even when they are very
diverse and use unseen POS tag patterns.
Comparing the average accuracies on $D_{\mathrm{length}}$
for both data sources in Figure 1, we observe that
all models perform better on artificial than on cor-
pus data from the fifth layer on, with the notable
exception of a dip in performance for BERT large
around layer 10.
4 Semantics
4.1 Probing Methods
4.1.1 Usage-based Testing
For the second half of our investigation, we turn
to semantics. In order to determine if a model has
understood the meaning of the CC, i.e., if it has
understood that in any sentence, “the COMP ... the