tence. Because the language model is predicting
the next word, it is restricted to responding immedi-
ately, and cannot respond after the remainder of the
sentence. Another important difference between the humans and models is that the humans were evaluated in Italian, while the models were evaluated in English; as Lakretz et al. note, this makes exact comparisons challenging because of differences in verb marking and sentence structures between the languages. Nevertheless, I follow the evaluation approach of Lakretz et al. (2022) here, in order to compare directly to their results.
Prompts:
To provide the language models with
an analog of the human training for the experi-
ment, I augmented the task data with one of several
simple prompts, which precede each evaluation
question. The prompts consisted of two or eight
grammatically-correct sentences with nested struc-
tures ("two-shot" or "eight-shot" prompts). Each
sentence appeared on a separate line. Following
Lakretz et al. (2021), these prompts use a lexicon
that does not overlap with the test stimuli.
The two-shot prompt involved two sentences
with structures corresponding to the easier “short”
conditions from Lakretz et al. (2022):
The computer that the children build is fancy.
The chairs that the professor buys are uncomfortable.
The eight-shot prompt included eight sentences
of varying length, with only one from the “long”
condition (and from the easier PPS case). The
full prompt is provided in Appendix A. Plural and
singular were roughly balanced within and across
sentences. I did not find it necessary to tune the prompts for performance, although doing so would likely further improve results. Further prompt variations are explored in Appendix B, including experiments in which the number of nested examples in the prompt is systematically varied, to evaluate the effect of priming the particular structures in question compared to simply supplying grammatically correct examples (Appendix B.6).
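As a rough sketch, constructing a prompt amounts to concatenating the example sentences, one per line, followed by the beginning of the target sentence. The sentences below are those of the two-shot prompt above; the helper name and the target prefix are illustrative placeholders rather than actual test stimuli:

    TWO_SHOT_EXAMPLES = [
        "The computer that the children build is fancy.",
        "The chairs that the professor buys are uncomfortable.",
    ]

    def build_prompt(examples, target_prefix):
        # Few-shot example sentences precede the evaluation question,
        # each on its own line, followed by the target-sentence prefix.
        return "\n".join(examples + [target_prefix])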
To score a language model on a question, I pro-
vided the prompt and then (on a new line) the be-
ginning of the target sentence as context. The mod-
els are trained to predict the next token, so fol-
lowing Lakretz et al. (2022) I calculated the like-
lihood of each version of the verb (plural and sin-
gular) at the appropriate place in the sentence (Fig.
1b), and marked the model as correct if it assigned
higher likelihood to the correct answer. Thus, the
grammatically-correct complex sentences in the
prompt might instill in the model the expectation
that future sentences in this context should be gram-
matically correct, just as the training the humans
receive is intended to help them learn the task.
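A minimal sketch of this scoring rule, assuming a generic helper that returns the log-likelihood of a continuation given a context (the function and argument names are placeholders, not the actual Chinchilla interface):

    def continuation_logprob(model, context, continuation):
        # Placeholder: return log p(continuation | context) under the
        # language model; substitute the model's actual scoring call.
        raise NotImplementedError

    def score_item(model, prompt, sentence_prefix, correct_verb, incorrect_verb):
        # Context = few-shot prompt, then the beginning of the target
        # sentence on a new line; the model is marked correct if it
        # assigns higher likelihood to the grammatically correct verb form.
        context = prompt + "\n" + sentence_prefix
        lp_correct = continuation_logprob(model, context, " " + correct_verb)
        lp_incorrect = continuation_logprob(model, context, " " + incorrect_verb)
        return lp_correct > lp_incorrect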
3 Results
In Fig. 2 I show the main results with the two
prompts described above. With the two-shot
prompt—which only contains short structures—
Chinchilla performs better than humans in the most
challenging long conditions. With the eight-shot
prompt—which contains only one long structure,
and from an easier condition—Chinchilla performs
better in every long condition than humans in the
easiest condition (PPP). Chinchilla makes no mis-
takes across the four easiest conditions (PPP, PPS,
SPP, SPS) with the eight-shot prompt (only one
of these conditions is represented in the prompt).
In Appendix B I show supplemental analyses: a smaller 7-billion-parameter model also performs well, both models also handle the outer dependency, zero-shot instruction prompts are less effective (but show some intriguing results), and finally, performance is only mildly improved by having nested prompt examples, rather than examples from the successive conditions of Lakretz et al. (2021).
Because Chinchilla performs exceptionally well
with the eight-shot prompt, I attempted to increase
the task difficulty. I tried two manipulations (Fig.
3a): either appending two more plural prepositional
phrases to the center distractor, or increasing the
center embedding depth by prepending an unre-
solved plural phrase (see Appendix B.3 for details).
I targeted these manipulations to increase the diffi-
culty of the key PSP condition; therefore all addi-
tions are plural, on the assumption that they would
most increase the difficulty of the challenging sin-
gular dependency. The results are shown in Fig. 3.
Adding more center distractors does not substan-
tially affect error rates. Increasing the embedding
depth does dramatically increase model error rates
in the PPSP condition, but the model still performs
better than humans do in the easier PSP condition.
4 Reanalyzing the human data
While the above experiments describe an attempt to give the models an experimental context closer to that of the humans, we can also attempt to analyze the humans more like the models. Lakretz et al. (2021)
kindly shared their data. The analyses presented
here take place only after the participants have en-