Can language models handle recursively nested grammatical structures?
A case study on comparing models and humans
Andrew Lampinen
DeepMind
lampinen@deepmind.com
Abstract
How should we compare the capabilities of language models (LMs) and humans? I draw inspiration from comparative psychology to highlight some challenges. In particular, I consider a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt—substantially less content than the human training—allows the LMs to consistently outperform the human results, and even to extrapolate to more deeply nested conditions than were tested with humans. Further, reanalyzing the prior human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans. This case study highlights how discrepancies in the evaluation can confound comparisons of language models and humans. I therefore reflect on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.
1 Introduction
There is an increasing interest in comparing the capabilities of large language models to human capabilities (e.g. Lakretz et al., 2022; Binz and Schulz, 2022; Misra et al., 2022; Dasgupta et al., 2022). Here, I argue that such comparisons require careful consideration of differences in experimental paradigms used to test the humans and models. This challenge is analogous to the challenges in comparative psychology when attempting to compare cognitive capabilities between humans and animals; careful methods are required to ensure that differences in performance correspond to real capability differences rather than confounding differences in motivation, task understanding, etc. (Smith et al., 2018). I illustrate the importance of careful comparisons to LMs with a case study: some new experiments and analyses following recent work comparing the capability of language models and humans to process recursively-nested grammatical structures (Lakretz et al., 2021, 2022).

Recursion is fundamental to linguistic theories of syntax. In the minimalist program (Chomsky, 2014b), the ability to recursively Merge smaller constituents into a single larger syntactic structure is considered a key, evolved component (Chomsky, 1999; Coolidge et al., 2011; Berwick and Chomsky, 2019). Merge is generally described as a classical symbolic operation over structures—one that is “indispensable” (Chomsky, 1999).

Thus, there has been substantial interest in the recursive abilities of neural networks. Elman (1991) showed that simple recurrent networks can learn recursive compositional syntactic structures in a simplified language, which initiated a line of research (e.g. Christiansen and Chater, 1999; Christiansen and MacDonald, 2009). Some recent work has explored whether networks trained on natural language can learn syntactic structures (e.g. Futrell et al., 2019; Wilcox et al., 2020), and in particular recursive dependencies (Wilcox et al., 2019; Lakretz et al., 2021; Hahn et al., 2022). Other researchers have attempted to build neural networks with explicit recursive structure for language processing (e.g. Socher et al., 2013; Dyer et al., 2016).

However, the most broadly successful neural-network language-processing systems have not incorporated explicit recursion; instead they have used recurrence, and more recently, attention (Bahdanau et al., 2014; Vaswani et al., 2017). When attention-based models are trained simply to predict the next token on large corpora of human-generated text, they can exhibit emergent capabilities (e.g. Wei et al., 2022a), such as answering college questions in subjects like medicine with some accuracy (e.g. Hoffmann et al., 2022).
[Figure 1 image: (a) the example sentence “The actors that the mother near the students watches meet the author.”, with the inner (singular), outer (plural), and distractor (plural) dependencies marked; (b) the inner agreement task, a choice between “watch” and “watches” after the prefix “The actors that the mother near the students”.]

Figure 1: Subject-verb agreement in nested, center-embedded sentences. (a) An example sentence with a three-layer nested structure. Grammatical dependencies are highlighted. (b) The task for the models is to choose the next word in the sentence; in this case, completing the inner dependency with a verb that matches whether the noun is plural or singular. Models (and humans) make relatively more errors on the inner dependency (green), particularly when it is singular and the other two nouns are plural. (Example based on the dataset of Lakretz et al. 2022, as released in Srivastava et al. 2022.)
The LMs’ representations even predict aspects of human neural activity during language processing (e.g. Schrimpf et al., 2021). But can such models process recursively nested structures?

Lakretz et al. (2021, 2022) explored this question by considering recursively center-embedded sentences (Fig. 1). These structures are rare in written language—especially with multiple levels of embedding—and almost absent in speech (Karlsson, 2007). Thus, these structures offer a challenging test of syntactic knowledge.

Lakretz et al. (2021) found that LSTM-based models could learn subject-verb agreement over short spans, but failed at some longer nested dependencies. Subsequently, Lakretz et al. (2022) evaluated modern transformer LMs—including GPT-2 XL (Radford et al., 2019)—on the same task.[1] While transformers performed more similarly to humans than LSTMs did, and performed above chance overall, they still performed below chance in one key condition. The authors also contributed their task to Srivastava et al. (2022), where other models were evaluated; even GPT-3 (Brown et al., 2020) did not perform above chance in the most difficult condition.[2] Thus, are large language models incapable of processing such structures?

[1] N.B. the human experiments used Italian, while Lakretz et al. (2022) switched to English; see below.
[2] See the plots for this subtask within https://github.com/google/BIG-bench/
I contribute a few new observations and experiments on this question. First, I highlight substantial differences in the evaluation of the models and humans in prior work (cf. Firestone, 2020). The humans were evaluated in the laboratory, a context where they were motivated to follow instructions, and were given substantial training. By contrast, the language models were evaluated zero-shot.

To be clear, these are reasonable choices to make in each case—the researchers followed established practices in (respectively) the fields of cognitive psychology and natural language processing. However, the observed differences in performance could potentially originate from differences in task-specific context; perhaps the humans would also perform below chance without training.

I therefore attempted to more closely match the comparison, by providing simple prompts to LMs (cf. Brown et al., 2020). I show that prompts with substantially less content than the human training allow state-of-the-art LMs (Hoffmann et al., 2022) to perform as well in the hardest conditions as human subjects performed in the easiest conditions. Furthermore, the largest model can even extrapolate to more challenging conditions than have been evaluated with the human subjects, while performing better than humans do in easier, shallower conditions. Thus, transformer language models appear capable of handling these dependencies at least as well as humans do (though it remains to be seen whether they could do so from a more human-like amount of experience; cf. Hosseini et al., 2022).

I then reanalyze the human results of Lakretz et al. (2021), and show suggestive—though far from conclusive—evidence that the human subjects may be learning about the difficult syntactic structures during the experiment. The human subjects seem to perform near chance on early encounters with difficult structures, even after training; thus, humans may also require experience with the task in order to perform well in the hardest conditions.

I use this case study to reflect on the more general issue of comparing human capabilities to those of a model. In particular, I argue that foundation models (Bommasani et al., 2021)—those given broad training—require fundamentally different evaluation than narrow cognitive models of a particular phenomenon. Foundation models need to be guided into an experiment-appropriate behavioral context, analogously to the way cognitive researchers place humans in an experimental context, and orient them toward the task with instructions and examples. By accounting for these factors, we can make more precise comparisons between humans and models.
1.1 Re-examining the prior work
The prior work on recursive grammatical structures in humans (Lakretz et al., 2021) gave the humans substantial instruction, training, and feedback throughout the experiment (emphasis mine):

    After each trial, participants received feedback concerning their response: [...] At the beginning of each session, participants performed a training block comprising 40 trials. The training section included all stimulus types, which were constructed from a different lexicon than that used for the main experiment.

By contrast, the comparative work on LMs (Lakretz et al., 2022) did not provide the models with any examples, instructions, or other context.
Furthermore, the direct effect of training on performance is not the only factor to consider. Humans exhibit structural priming (e.g. Pickering and Branigan, 1999; Pickering and Ferreira, 2008)—a tendency to reproduce syntactic structures that they have recently encountered. Structural priming might allow humans to perform better on the task when exposed repeatedly to the same structures, even if they were not receiving feedback. Similarly, LMs exhibit some analogous structural priming effects (Sinclair et al., 2022), and might therefore likewise benefit from exposure to related structures before testing.

An additional consideration is that the humans knew they were participating in an experiment, and were presumably motivated by social demands to perform their best. By contrast, zero-shot evaluation is a difficult context for language models—the models are trained to reproduce a wide variety of text from the internet, and it may be hard to tell from a single question what the context is supposed to be. Internet text conveys some of the breadth of human communication and motivations (e.g. joking, memes, obfuscation). Thus, models trained on this broad distribution are not necessarily as “motivated” as human experimental participants to carefully answer difficult questions that are posed to them zero-shot—that is, the models may produce answers consistent with a different part of the natural language distribution on the internet, such as the more frequent grammatical errors made in hurried forum comments.

Finally, the language models used in prior work were far from the most capable extant models; as Bowman (2022) notes, it may be misleading to make broad claims about the failures of a model class based only on weaker instances of that class.

Due to these limitations, it is unclear whether previous results showing differences of performance between models and humans should be interpreted as differences in a fundamental capability. I therefore attempted to make more closely matched comparisons of the humans and models.
2 Methods
Models: I evaluated two LMs from Hoffmann et al. (2022): Chinchilla, with 70 billion parameters, and a smaller 7 billion parameter model.[3] The models are decoder-only Transformers (Vaswani et al., 2017), using SentencePiece (Kudo and Richardson, 2018) with 32000 tokens; they are trained to predict the next token in context on a large text corpus.

Tasks: I used the task stimuli Lakretz et al. (2022) released in Srivastava et al. (2022). The model is given the beginning of a sentence as context, and is given a forced choice between singular and plural forms of a verb (Fig. 1b). The goal is to choose the syntactically appropriate match for the relevant dependency. I focus on the Long-Nested condition, as this condition is the most difficult, and the only case where transformers performed below chance in the prior experiments. In the long-nested sentence structures (Fig. 1a) there are three nouns: one that is part of an outer dependency, one in an inner dependency, and one that is part of a prepositional phrase in the center (which I term a distractor). The sub-conditions can be labelled by whether each noun is singular (S) or plural (P). Lakretz et al. (2021) found that the most difficult condition for the models was the inner dependency of the PSP condition—that is, the condition where the model must choose the singular verb form despite the distractor and outer nouns being plural.

Note that the human task in Lakretz et al. (2021) was slightly different than the language model task (Lakretz et al., 2021, 2022)—the humans were asked to identify grammatical violations either when they occurred, or up to the end of the sentence.

[3] Note that even the smaller model is larger than those that Lakretz et al. (2022) used—larger models respond better to prompting, so the original models might show less benefit.
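To make the condition labels concrete, the following sketch (not part of the original materials) enumerates the eight sub-conditions from the three noun slots, along with the verb form each requires for the inner dependency; the slot ordering (outer, inner, distractor) is inferred from the description of the PSP condition above, and the verbs are just the Fig. 1 example pair.

```python
from itertools import product

# Noun slots in the long-nested condition, in the order used by the labels:
# outer subject, inner subject, center distractor (inferred from the PSP
# description above, where the inner noun is singular and the others plural).
SLOTS = ("outer", "inner", "distractor")

# Enumerate the eight sub-conditions by whether each noun is singular (S)
# or plural (P).
conditions = ["".join(c) for c in product("SP", repeat=len(SLOTS))]
# ['SSS', 'SSP', 'SPS', 'SPP', 'PSS', 'PSP', 'PPS', 'PPP']

# For the inner-agreement task (Fig. 1b), the correct verb form depends only
# on the inner noun; the other two nouns merely add interference.
for cond in conditions:
    correct_inner_verb = "watches" if cond[1] == "S" else "watch"
    print(cond, correct_inner_verb)
```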
[Figure 2 plots: error rate (%) by sentence type (SSS, SSP, SPP, SPS, PPP, PPS, PSS, PSP) for panels (a) no prompt, (b) two shots, and (c) eight shots, with dashed lines marking chance and, in panels b and c, human performance.]

Figure 2: Error rates by prompt condition—Chinchilla performs well at the long embedded clauses when given a brief prompt. The plots show error rates (lower is better). Dashed lines show human performance in each condition, after 40 training trials, from Lakretz et al. (2021). (a) With no prompt, as in Lakretz et al. (2022), Chinchilla performs poorly on two challenging conditions. (b) With a two-shot prompt, Chinchilla performs comparably or better than humans in all conditions, and better than humans in the key PSP condition. (c) With eight shots, Chinchilla performs much better, consistently exhibiting error rates of less than 10% across all conditions. (Error bars are bootstrap 95%-CIs.)
[Figure 3 plots: (a) the task modifications applied to “The actors that the mother near the students watches [...]”—prepending “The writers whom” for deeper nesting, or inserting “in the courtyards outside the buildings” for more center distractors; (b) error rate (%) by sentence type (SSSPP, SSPPP, SPPPP, SPSPP, PPPPP, PPSPP, PSSPP, PSPPP) with more center distractors; (c) error rate (%) by sentence type (PSSS, PSSP, PSPP, PSPS, PPPP, PPPS, PPSS, PPSP) with deeper nesting. Dashed lines mark chance and human performance on the easier conditions.]

Figure 3: Chinchilla, with the same eight-shot prompt, evaluated on more challenging conditions. (a) The modifications to the tasks—either nesting the sentence more deeply (top), or inserting more center distractors (bottom). (b) Adding two more distractor plural prepositional phrases in the center does not substantially change the error rates. (c) Increasing the embedding depth, by prepending an additional plural prefix, does increase error rates in the most challenging condition—however, the model still performs better than humans do in easier conditions, indicated as dashed lines. (The dashed lines show human performance in the hardest conditions that Lakretz et al. (2021) evaluated with humans; the conditions we evaluate the model on here are alterations intended to make the task even more difficult, especially in the PSP condition. Error bars are bootstrap 95%-CIs.)
[Figure 4 plots: (a) first-encounter error rates by sentence type (SSS through PSP), with a chance line; (b) PSP first-encounter error rate as a function of the trial at which it was first encountered; (c) PSP error rate as a function of the number of encounters.]

Figure 4: Human error rates on inner/embedded dependencies from Lakretz et al. (2021), reanalyzed to explore learning effects. Note that these plots are after the humans have completed their training phase. (a) Performance on all structures the first time they are encountered after training. Humans do not appear to perform better than chance on the most difficult structures. (b) Performance on the key PSP structure when it is first encountered, as a function of trial—a proxy for experience on related structures. (c) Performance on the PSP structure, with the target grammar violation, as a function of the number of times it has been encountered. The results are suggestive of a learning effect, but are not conclusive due to the small sample size. (Points/bars are aggregates across subjects—in panel b, all subjects who first encountered that structure at that trial. Error bars in panel a are bootstrap 95%-CIs; lines/ranges in panels b-c are logistic regression fits.)
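For readers who want to run the same style of reanalysis on comparable data, a minimal sketch of the first-encounter and learning-curve analyses described in this caption is given below; the data-frame columns (subject, trial, structure, error) are assumptions about how per-trial responses might be organized, not the actual format of the shared dataset.

```python
# Sketch of a Fig. 4-style reanalysis, assuming a per-trial table with
# hypothetical columns: subject, trial, structure (e.g. "PSP"), error (0/1).
import pandas as pd
import statsmodels.formula.api as smf


def first_encounter_errors(df: pd.DataFrame) -> pd.DataFrame:
    """Error rate per structure on each subject's first post-training trial of it."""
    firsts = (df.sort_values("trial")
                .groupby(["subject", "structure"], as_index=False)
                .first())
    return firsts.groupby("structure", as_index=False)["error"].mean()


def psp_learning_fit(df: pd.DataFrame):
    """Logistic regression of PSP errors on the number of prior PSP encounters."""
    psp = df[df["structure"] == "PSP"].sort_values("trial").copy()
    psp["encounter"] = psp.groupby("subject").cumcount() + 1
    return smf.logit("error ~ encounter", data=psp).fit()
```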
Because the language model is predicting the next word, it is restricted to responding immediately, and cannot respond after the remainder of the sentence. Another important difference between the humans and models is that the humans were evaluated in Italian, while the models were evaluated in English. As Lakretz et al. note, this difference makes exact comparisons challenging because of differences in verb marking and sentence structures between the languages. Nevertheless, I follow the evaluation approach of Lakretz et al. (2022) here, in order to compare directly to their results.
Prompts: To provide the language models with an analog of the human training for the experiment, I augmented the task data with one of several simple prompts, which precede each evaluation question. The prompts consisted of two or eight grammatically-correct sentences with nested structures (“two-shot” or “eight-shot” prompts). Each sentence appeared on a separate line. Following Lakretz et al. (2021), these prompts use a lexicon that does not overlap with the test stimuli.

The two-shot prompt involved two sentences with structures corresponding to the easier “short” conditions from Lakretz et al. (2022):

    The computer that the children build is fancy.
    The chairs that the professor buys are uncomfortable.

The eight-shot prompt included eight sentences of varying length, with only one from the “long” condition (and from the easier PPS case). The full prompt is provided in Appendix A. Plural and singular were roughly balanced within and across sentences. I did not find it necessary to tune the prompts for performance, although doing so would likely further improve results. Further prompt variations are explored in Appendix B—including experiments in which the number of nested examples in the prompt is systematically varied, to evaluate the effect of priming the particular structures in question compared to simply supplying grammatically accurate examples (Appendix B.6).
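As a concrete illustration of how such a prompt is combined with a test item, here is a minimal sketch; the two example sentences are the two-shot prompt quoted above, while the function name and the test prefix (taken from the Fig. 1 example) are illustrative rather than drawn from the released evaluation code.

```python
# Minimal sketch of few-shot prompt assembly for the agreement task: each
# grammatical example sentence appears on its own line, followed by the
# beginning of the target sentence on a new line.

TWO_SHOT_PROMPT = "\n".join([
    "The computer that the children build is fancy.",
    "The chairs that the professor buys are uncomfortable.",
])


def build_context(prompt: str, sentence_prefix: str) -> str:
    """Prepend the few-shot prompt to the (incomplete) test sentence."""
    return prompt + "\n" + sentence_prefix


# Fig. 1 example, truncated just before the inner verb.
context = build_context(TWO_SHOT_PROMPT,
                        "The actors that the mother near the students")
print(context)
```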
To score a language model on a question, I provided the prompt and then (on a new line) the beginning of the target sentence as context. The models are trained to predict the next token, so following Lakretz et al. (2022) I calculated the likelihood of each version of the verb (plural and singular) at the appropriate place in the sentence (Fig. 1b), and marked the model as correct if it assigned higher likelihood to the correct answer. Thus, the grammatically-correct complex sentences in the prompt might instill in the model the expectation that future sentences in this context should be grammatically correct, just as the training the humans receive is intended to help them learn the task.
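Written out as code, the scoring rule looks roughly like the sketch below; `sequence_log_prob` is a hypothetical stand-in for whatever API returns a model's log-probability of a continuation given a context, not Chinchilla's actual interface.

```python
# Sketch of the forced-choice scoring rule: the model is marked correct when
# it assigns higher likelihood to the grammatical verb form than to the
# ungrammatical one, given the prompt plus sentence prefix as context.


def sequence_log_prob(context: str, continuation: str) -> float:
    """Hypothetical model call: log p(continuation | context)."""
    raise NotImplementedError("replace with the scoring call of your LM")


def score_item(context: str, correct_verb: str, incorrect_verb: str) -> bool:
    lp_correct = sequence_log_prob(context, " " + correct_verb)
    lp_incorrect = sequence_log_prob(context, " " + incorrect_verb)
    return lp_correct > lp_incorrect


# Fig. 1b example: the inner noun ("mother") is singular, so "watches" is correct.
# score_item(context, correct_verb="watches", incorrect_verb="watch")
```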
3 Results
In Fig. 2 I show the main results with the two prompts described above. With the two-shot prompt—which only contains short structures—Chinchilla performs better than humans in the most challenging long conditions. With the eight-shot prompt—which contains only one long structure, and from an easier condition—Chinchilla performs better in every long condition than humans in the easiest condition (PPP). Chinchilla makes no mistakes across the four easiest conditions (PPP, PPS, SPP, SPS) with the eight-shot prompt (only one of these conditions is represented in the prompt). In Appendix B I show supplemental analyses: a smaller 7-billion parameter model also performs well, both models also handle the outer dependency, zero-shot instruction prompts are less effective (but show some intriguing results), and finally, performance is only mildly improved by having nested prompt examples, rather than examples from the successive conditions from Lakretz et al. (2021).
Because Chinchilla performs exceptionally well with the eight-shot prompt, I attempted to increase the task difficulty. I tried two manipulations (Fig. 3a): either appending two more plural prepositional phrases to the center distractor, or increasing the center embedding depth by prepending an unresolved plural phrase (see Appendix B.3 for details). I targeted these manipulations to increase the difficulty of the key PSP condition; therefore all additions are plural, on the assumption that they would most increase the difficulty of the challenging singular dependency. The results are shown in Fig. 3. Adding more center distractors does not substantially affect error rates. Increasing the embedding depth does dramatically increase model error rates in the PPSP condition, but the model still performs better than humans do in the easier PSP condition.
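To make the two manipulations concrete, the sketch below builds illustrative modified prefixes from the Fig. 3a fragments; the exact wording of the generated stimuli is an assumption based on the figure, not the released data.

```python
# Illustrative construction of the two difficulty manipulations (Fig. 3a),
# applied to the Fig. 1 example prefix. Wording is inferred from the figure
# fragments and may differ from the actual stimuli.

base = "The actors that the mother near the students"

# (1) More center distractors: add two more plural prepositional phrases
# inside the center embedding, before the inner verb.
more_distractors = base + " in the courtyards outside the buildings"

# (2) Deeper nesting: prepend an additional, unresolved plural subject and
# relative pronoun, adding one more level of center embedding.
deeper_nesting = "The writers whom " + base[0].lower() + base[1:]

print(more_distractors)  # ...then the inner verb ("watches"/"watch") follows
print(deeper_nesting)
```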
4 Reanalyzing the human data
While the above experiments describe an attempt to give the models an experimental context closer to the humans, we can also attempt to analyze the humans more like the models. Lakretz et al. (2021) kindly shared their data. The analyses presented here take place only after the participants have en-