Can language models handle recursively nested grammatical structures?
A case study on comparing models and humans
Andrew Lampinen
DeepMind
lampinen@deepmind.com
Abstract
How should we compare the capabilities of language models (LMs) and humans? I draw inspiration from comparative psychology to highlight some challenges. In particular, I consider a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt—substantially less content than the human training—allows the LMs to consistently outperform the human results, and even to extrapolate to more deeply nested conditions than were tested with humans. Further, reanalyzing the prior human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans. This case study highlights how discrepancies in the evaluation can confound comparisons of language models and humans. I therefore reflect on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.
1 Introduction
There is an increasing interest in comparing the capabilities of large language models to human capabilities (e.g. Lakretz et al., 2022; Binz and Schulz, 2022; Misra et al., 2022; Dasgupta et al., 2022). Here, I argue that such comparisons require careful consideration of differences in experimental paradigms used to test the humans and models. This challenge is analogous to the challenges in comparative psychology when attempting to compare cognitive capabilities between humans and animals; careful methods are required to ensure that differences in performance correspond to real capability differences rather than confounding differences in motivation, task understanding, etc. (Smith et al., 2018). I illustrate the importance of careful comparisons to LMs with a case study: some new experiments and analyses following recent work comparing the capability of language models and humans to process recursively-nested grammatical structures (Lakretz et al., 2021, 2022).

Recursion is fundamental to linguistic theories of syntax. In the minimalist program (Chomsky, 2014b), the ability to recursively Merge smaller constituents into a single larger syntactic structure is considered a key, evolved component (Chomsky, 1999; Coolidge et al., 2011; Berwick and Chomsky, 2019). Merge is generally described as a classical symbolic operation over structures—one that is “indispensable” (Chomsky, 1999).

Thus, there has been substantial interest in the recursive abilities of neural networks. Elman (1991) showed that simple recurrent networks can learn recursive compositional syntactic structures in a simplified language, which initiated a line of research (e.g. Christiansen and Chater, 1999; Christiansen and MacDonald, 2009). Some recent work has explored whether networks trained on natural language can learn syntactic structures (e.g. Futrell et al., 2019; Wilcox et al., 2020), and in particular recursive dependencies (Wilcox et al., 2019; Lakretz et al., 2021; Hahn et al., 2022). Other researchers have attempted to build neural networks with explicit recursive structure for language processing (e.g. Socher et al., 2013; Dyer et al., 2016).

However, the most broadly successful neural-network language-processing systems have not incorporated explicit recursion; instead they have used recurrence, and more recently, attention (Bahdanau et al., 2014; Vaswani et al., 2017). When attention-based models are trained simply to predict the next token on large corpora of human-generated text, they can exhibit emergent capabilities (e.g. Wei et al., 2022a), such as answering college questions in subjects like medicine with some accuracy (e.g. Hoffmann et al., 2022).
[Figure 1 image: (a) the example sentence “The actors that the mother near the students watches meet the author.”, with the inner (singular), outer (plural), and distractor (plural) dependencies marked; (b) the inner agreement task, a choice between “watch” and “watches” after the prefix “The actors that the mother near the students”.]

Figure 1: Subject-verb agreement in nested, center-embedded sentences. (a) An example sentence with a three-layer nested structure. Grammatical dependencies are highlighted. (b) The task for the models is to choose the next word in the sentence; in this case, completing the inner dependency with a verb that matches whether the noun is plural or singular. Models (and humans) make relatively more errors on the inner dependency (green), particularly when it is singular and the other two nouns are plural. (Example based on the dataset of Lakretz et al. 2022, as released in Srivastava et al. 2022.)
The LMs’ representations even predict aspects of human neural activity during language processing (e.g. Schrimpf et al., 2021). But can such models process recursively nested structures?

Lakretz et al. (2021, 2022) explored this question by considering recursively center-embedded sentences (Fig. 1). These structures are rare in written language—especially with multiple levels of embedding—and almost absent in speech (Karlsson, 2007). Thus, these structures offer a challenging test of syntactic knowledge.

Lakretz et al. (2021) found that LSTM-based models could learn subject-verb agreement over short spans, but failed at some longer nested dependencies. Subsequently, Lakretz et al. (2022) evaluated modern transformer LMs—including GPT-2 XL (Radford et al., 2019)—on the same task.[1] While transformers performed more similarly to humans than LSTMs did, and performed above chance overall, they still performed below chance in one key condition. The authors also contributed their task to Srivastava et al. (2022), where other models were evaluated; even GPT-3 (Brown et al., 2020) did not perform above chance in the most difficult condition.[2] Thus, are large language models incapable of processing such structures?

[1] N.B. the human experiments used Italian, while Lakretz et al. (2022) switched to English; see below.
[2] See the plots for this subtask within https://github.com/google/BIG-bench/
I contribute a few new observations and experiments on this question. First, I highlight substantial differences in the evaluation of the models and humans in prior work (cf. Firestone, 2020). The humans were evaluated in the laboratory, a context where they were motivated to follow instructions, and were given substantial training. By contrast, the language models were evaluated zero-shot.

To be clear, these are reasonable choices to make in each case—the researchers followed established practices in (respectively) the fields of cognitive psychology and natural language processing. However, the observed differences in performance could potentially originate from differences in task-specific context; perhaps the humans would also perform below chance without training.

I therefore attempted to more closely match the comparison, by providing simple prompts to LMs (cf. Brown et al., 2020). I show that prompts with substantially less content than the human training allow state-of-the-art LMs (Hoffmann et al., 2022) to perform as well in the hardest conditions as human subjects performed in the easiest conditions. Furthermore, the largest model can even extrapolate to more challenging conditions than have been evaluated with the human subjects, while performing better than humans do in easier, shallower conditions. Thus, transformer language models appear capable of handling these dependencies at least as well as humans do (though it remains to be seen whether they could do so from a more human-like amount of experience; cf. Hosseini et al., 2022).

I then reanalyze the human results of Lakretz et al. (2021), and show suggestive—though far from conclusive—evidence that the human subjects may be learning about the difficult syntactic structures during the experiment. The human subjects seem to perform near chance on early encounters with difficult structures, even after training; thus, humans may also require experience with the task in order to perform well in the hardest conditions.

I use this case study to reflect on the more general issue of comparing human capabilities to those of a model. In particular, I argue that foundation models (Bommasani et al., 2021)—those given broad training—require fundamentally different evaluation than narrow cognitive models of a particular phenomenon. Foundation models need to be guided into an experiment-appropriate behavioral context, analogously to the way cognitive researchers place humans in an experimental context, and orient them toward the task with instructions and examples. By accounting for these factors, we can make more precise comparisons between humans and models.
1.1 Re-examining the prior work
The prior work on recursive grammatical structures in humans (Lakretz et al., 2021) gave the humans substantial instruction, training, and feedback throughout the experiment (emphasis mine):

    After each trial, participants received feedback concerning their response: [...] At the beginning of each session, participants performed a training block comprising 40 trials. The training section included all stimulus types, which were constructed from a different lexicon than that used for the main experiment.

By contrast, the comparative work on LMs (Lakretz et al., 2022) did not provide the models with any examples, instructions, or other context.
Furthermore, the direct effect of training on performance is not the only factor to consider. Humans exhibit structural priming (e.g. Pickering and Branigan, 1999; Pickering and Ferreira, 2008)—a tendency to reproduce syntactic structures that they have recently encountered. Structural priming might allow humans to perform better on the task when exposed repeatedly to the same structures, even if they were not receiving feedback. Similarly, LMs exhibit some analogous structural priming effects (Sinclair et al., 2022), and might therefore likewise benefit from exposure to related structures before testing.

An additional consideration is that the humans knew they were participating in an experiment, and were presumably motivated by social demands to perform their best. By contrast, zero-shot evaluation is a difficult context for language models—the models are trained to reproduce a wide variety of text from the internet, and it may be hard to tell from a single question what the context is supposed to be. Internet text conveys some of the breadth of human communication and motivations (e.g. joking, memes, obfuscation). Thus, models trained on this broad distribution are not necessarily as “motivated” as human experimental participants to carefully answer difficult questions that are posed to them zero-shot—that is, the models may produce answers consistent with a different part of the natural language distribution on the internet, such as the more frequent grammatical errors made in hurried forum comments.

Finally, the language models used in prior work were far from the most capable extant models; as Bowman (2022) notes, it may be misleading to make broad claims about the failures of a model class based only on weaker instances of that class.

Due to these limitations, it is unclear whether previous results showing differences of performance between models and humans should be interpreted as differences in a fundamental capability. I therefore attempted to make more closely matched comparisons of the humans and models.
2 Methods
Models: I evaluated two LMs from Hoffmann et al. (2022): Chinchilla, with 70 billion parameters, and a smaller 7 billion parameter model.[3] The models are decoder-only Transformers (Vaswani et al., 2017), using SentencePiece (Kudo and Richardson, 2018) with 32000 tokens; they are trained to predict the next token in context on a large text corpus.

Tasks: I used the task stimuli Lakretz et al. (2022) released in Srivastava et al. (2022). The model is given the beginning of a sentence as context, and is given a forced choice between singular and plural forms of a verb (Fig. 1b). The goal is to choose the syntactically appropriate match for the relevant dependency. I focus on the Long-Nested condition, as this condition is the most difficult, and the only case where transformers performed below chance in the prior experiments. In the long-nested sentence structures (Fig. 1a) there are three nouns: one that is part of an outer dependency, one in an inner dependency, and one that is part of a prepositional phrase in the center (which I term a distractor). The sub-conditions can be labelled by whether each noun is singular (S) or plural (P). Lakretz et al. (2021) found that the most difficult condition for the models was the inner dependency of the PSP condition—that is, the condition where the model must choose the singular verb form despite the distractor and outer nouns being plural.

Note that the human task in Lakretz et al. (2021) was slightly different than the language model task (Lakretz et al., 2021, 2022)—the humans were asked to identify grammatical violations either when they occurred, or up to the end of the sentence.

[3] Note that even the smaller model is larger than those that Lakretz et al. (2022) used—larger models respond better to prompting, so the original models might show less benefit.
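To make the condition labels concrete, the following sketch (not part of the original materials) enumerates the eight sub-conditions from the three noun slots, along with the verb form each requires for the inner dependency; the slot ordering (outer, inner, distractor) is inferred from the description of the PSP condition above, and the verbs are just the Fig. 1 example pair.

```python
from itertools import product

# Noun slots in the long-nested condition, in the order used by the labels:
# outer subject, inner subject, center distractor (inferred from the PSP
# description above, where the inner noun is singular and the others plural).
SLOTS = ("outer", "inner", "distractor")

# Enumerate the eight sub-conditions by whether each noun is singular (S)
# or plural (P).
conditions = ["".join(c) for c in product("SP", repeat=len(SLOTS))]
# ['SSS', 'SSP', 'SPS', 'SPP', 'PSS', 'PSP', 'PPS', 'PPP']

# For the inner-agreement task (Fig. 1b), the correct verb form depends only
# on the inner noun; the other two nouns merely add interference.
for cond in conditions:
    correct_inner_verb = "watches" if cond[1] == "S" else "watch"
    print(cond, correct_inner_verb)
```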
[Figure 2 plots: error rate (%) by sentence type (SSS, SSP, SPP, SPS, PPP, PPS, PSS, PSP) for panels (a) no prompt, (b) two shots, and (c) eight shots, with dashed lines marking chance and, in panels b and c, human performance.]

Figure 2: Error rates by prompt condition—Chinchilla performs well at the long embedded clauses when given a brief prompt. The plots show error rates (lower is better). Dashed lines show human performance in each condition, after 40 training trials, from Lakretz et al. (2021). (a) With no prompt, as in Lakretz et al. (2022), Chinchilla performs poorly on two challenging conditions. (b) With a two-shot prompt, Chinchilla performs comparably or better than humans in all conditions, and better than humans in the key PSP condition. (c) With eight shots, Chinchilla performs much better, consistently exhibiting error rates of less than 10% across all conditions. (Error bars are bootstrap 95%-CIs.)
[Figure 3 plots: (a) the task modifications applied to “The actors that the mother near the students watches [...]”—prepending “The writers whom” for deeper nesting, or inserting “in the courtyards outside the buildings” for more center distractors; (b) error rate (%) by sentence type (SSSPP, SSPPP, SPPPP, SPSPP, PPPPP, PPSPP, PSSPP, PSPPP) with more center distractors; (c) error rate (%) by sentence type (PSSS, PSSP, PSPP, PSPS, PPPP, PPPS, PPSS, PPSP) with deeper nesting. Dashed lines mark chance and human performance on the easier conditions.]

Figure 3: Chinchilla, with the same eight-shot prompt, evaluated on more challenging conditions. (a) The modifications to the tasks—either nesting the sentence more deeply (top), or inserting more center distractors (bottom). (b) Adding two more distractor plural prepositional phrases in the center does not substantially change the error rates. (c) Increasing the embedding depth, by prepending an additional plural prefix, does increase error rates in the most challenging condition—however, the model still performs better than humans do in easier conditions, indicated as dashed lines. (The dashed lines show human performance in the hardest conditions that Lakretz et al. (2021) evaluated with humans; the conditions we evaluate the model on here are alterations intended to make the task even more difficult, especially in the PSP condition. Error bars are bootstrap 95%-CIs.)
[Figure 4 plots: (a) first-encounter error rates by sentence type (SSS through PSP), with a chance line; (b) PSP first-encounter error rate as a function of the trial at which it was first encountered; (c) PSP error rate as a function of the number of encounters.]

Figure 4: Human error rates on inner/embedded dependencies from Lakretz et al. (2021), reanalyzed to explore learning effects. Note that these plots are after the humans have completed their training phase. (a) Performance on all structures the first time they are encountered after training. Humans do not appear to perform better than chance on the most difficult structures. (b) Performance on the key PSP structure when it is first encountered, as a function of trial—a proxy for experience on related structures. (c) Performance on the PSP structure, with the target grammar violation, as a function of the number of times it has been encountered. The results are suggestive of a learning effect, but are not conclusive due to the small sample size. (Points/bars are aggregates across subjects—in panel b, all subjects who first encountered that structure at that trial. Error bars in panel a are bootstrap 95%-CIs; lines/ranges in panels b-c are logistic regression fits.)
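For readers who want to run the same style of reanalysis on comparable data, a minimal sketch of the first-encounter and learning-curve analyses described in this caption is given below; the data-frame columns (subject, trial, structure, error) are assumptions about how per-trial responses might be organized, not the actual format of the shared dataset.

```python
# Sketch of a Fig. 4-style reanalysis, assuming a per-trial table with
# hypothetical columns: subject, trial, structure (e.g. "PSP"), error (0/1).
import pandas as pd
import statsmodels.formula.api as smf


def first_encounter_errors(df: pd.DataFrame) -> pd.DataFrame:
    """Error rate per structure on each subject's first post-training trial of it."""
    firsts = (df.sort_values("trial")
                .groupby(["subject", "structure"], as_index=False)
                .first())
    return firsts.groupby("structure", as_index=False)["error"].mean()


def psp_learning_fit(df: pd.DataFrame):
    """Logistic regression of PSP errors on the number of prior PSP encounters."""
    psp = df[df["structure"] == "PSP"].sort_values("trial").copy()
    psp["encounter"] = psp.groupby("subject").cumcount() + 1
    return smf.logit("error ~ encounter", data=psp).fit()
```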
Because the language model is predicting the next word, it is restricted to responding immediately, and cannot respond after the remainder of the sentence. Another important difference between the humans and models is that the humans were evaluated in Italian, while the models were evaluated in English. As Lakretz et al. note, this difference makes exact comparisons challenging because of differences in verb marking and sentence structures between the languages. Nevertheless, I follow the evaluation approach of Lakretz et al. (2022) here, in order to compare directly to their results.
Prompts: To provide the language models with an analog of the human training for the experiment, I augmented the task data with one of several simple prompts, which precede each evaluation question. The prompts consisted of two or eight grammatically-correct sentences with nested structures (“two-shot” or “eight-shot” prompts). Each sentence appeared on a separate line. Following Lakretz et al. (2021), these prompts use a lexicon that does not overlap with the test stimuli.

The two-shot prompt involved two sentences with structures corresponding to the easier “short” conditions from Lakretz et al. (2022):

    The computer that the children build is fancy.
    The chairs that the professor buys are uncomfortable.

The eight-shot prompt included eight sentences of varying length, with only one from the “long” condition (and from the easier PPS case). The full prompt is provided in Appendix A. Plural and singular were roughly balanced within and across sentences. I did not find it necessary to tune the prompts for performance, although doing so would likely further improve results. Further prompt variations are explored in Appendix B—including experiments in which the number of nested examples in the prompt is systematically varied, to evaluate the effect of priming the particular structures in question compared to simply supplying grammatically accurate examples (Appendix B.6).
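As a concrete illustration of how such a prompt is combined with a test item, here is a minimal sketch; the two example sentences are the two-shot prompt quoted above, while the function name and the test prefix (taken from the Fig. 1 example) are illustrative rather than drawn from the released evaluation code.

```python
# Minimal sketch of few-shot prompt assembly for the agreement task: each
# grammatical example sentence appears on its own line, followed by the
# beginning of the target sentence on a new line.

TWO_SHOT_PROMPT = "\n".join([
    "The computer that the children build is fancy.",
    "The chairs that the professor buys are uncomfortable.",
])


def build_context(prompt: str, sentence_prefix: str) -> str:
    """Prepend the few-shot prompt to the (incomplete) test sentence."""
    return prompt + "\n" + sentence_prefix


# Fig. 1 example, truncated just before the inner verb.
context = build_context(TWO_SHOT_PROMPT,
                        "The actors that the mother near the students")
print(context)
```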
To score a language model on a question, I provided the prompt and then (on a new line) the beginning of the target sentence as context. The models are trained to predict the next token, so following Lakretz et al. (2022) I calculated the likelihood of each version of the verb (plural and singular) at the appropriate place in the sentence (Fig. 1b), and marked the model as correct if it assigned higher likelihood to the correct answer. Thus, the grammatically-correct complex sentences in the prompt might instill in the model the expectation that future sentences in this context should be grammatically correct, just as the training the humans receive is intended to help them learn the task.
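Written out as code, the scoring rule looks roughly like the sketch below; `sequence_log_prob` is a hypothetical stand-in for whatever API returns a model's log-probability of a continuation given a context, not Chinchilla's actual interface.

```python
# Sketch of the forced-choice scoring rule: the model is marked correct when
# it assigns higher likelihood to the grammatical verb form than to the
# ungrammatical one, given the prompt plus sentence prefix as context.


def sequence_log_prob(context: str, continuation: str) -> float:
    """Hypothetical model call: log p(continuation | context)."""
    raise NotImplementedError("replace with the scoring call of your LM")


def score_item(context: str, correct_verb: str, incorrect_verb: str) -> bool:
    lp_correct = sequence_log_prob(context, " " + correct_verb)
    lp_incorrect = sequence_log_prob(context, " " + incorrect_verb)
    return lp_correct > lp_incorrect


# Fig. 1b example: the inner noun ("mother") is singular, so "watches" is correct.
# score_item(context, correct_verb="watches", incorrect_verb="watch")
```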
3 Results
In Fig. 2 I show the main results with the two prompts described above. With the two-shot prompt—which only contains short structures—Chinchilla performs better than humans in the most challenging long conditions. With the eight-shot prompt—which contains only one long structure, and from an easier condition—Chinchilla performs better in every long condition than humans in the easiest condition (PPP). Chinchilla makes no mistakes across the four easiest conditions (PPP, PPS, SPP, SPS) with the eight-shot prompt (only one of these conditions is represented in the prompt). In Appendix B I show supplemental analyses: a smaller 7-billion parameter model also performs well, both models also handle the outer dependency, zero-shot instruction prompts are less effective (but show some intriguing results), and finally, performance is only mildly improved by having nested prompt examples, rather than examples from the successive conditions from Lakretz et al. (2021).
Because Chinchilla performs exceptionally well with the eight-shot prompt, I attempted to increase the task difficulty. I tried two manipulations (Fig. 3a): either appending two more plural prepositional phrases to the center distractor, or increasing the center embedding depth by prepending an unresolved plural phrase (see Appendix B.3 for details). I targeted these manipulations to increase the difficulty of the key PSP condition; therefore all additions are plural, on the assumption that they would most increase the difficulty of the challenging singular dependency. The results are shown in Fig. 3. Adding more center distractors does not substantially affect error rates. Increasing the embedding depth does dramatically increase model error rates in the PPSP condition, but the model still performs better than humans do in the easier PSP condition.
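To make the two manipulations concrete, the sketch below builds illustrative modified prefixes from the Fig. 3a fragments; the exact wording of the generated stimuli is an assumption based on the figure, not the released data.

```python
# Illustrative construction of the two difficulty manipulations (Fig. 3a),
# applied to the Fig. 1 example prefix. Wording is inferred from the figure
# fragments and may differ from the actual stimuli.

base = "The actors that the mother near the students"

# (1) More center distractors: add two more plural prepositional phrases
# inside the center embedding, before the inner verb.
more_distractors = base + " in the courtyards outside the buildings"

# (2) Deeper nesting: prepend an additional, unresolved plural subject and
# relative pronoun, adding one more level of center embedding.
deeper_nesting = "The writers whom " + base[0].lower() + base[1:]

print(more_distractors)  # ...then the inner verb ("watches"/"watch") follows
print(deeper_nesting)
```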
4 Reanalyzing the human data
While the above experiments describe an attempt to give the models an experimental context closer to the humans, we can also attempt to analyze the humans more like the models. Lakretz et al. (2021) kindly shared their data. The analyses presented here take place only after the participants have en-