is crucial to ensure that models behave robustly, reliably, and fairly when making predictions about data
different from the data that they learned from, which is of critical importance when models are employed
in the real world. Others see good generalisation as intrinsically equivalent to good performance and
believe that, without it, a model is not truly able to perform the task we intend it to. Yet others strive for
good generalisation because they believe models should behave in a human-like way, and humans are
known to generalise well. While the importance of generalisation is almost undisputed – in the past five
years, over 1200 papers in the ACL Anthology alone mentioned it in their title or abstract – systematically
testing generalisation is not the status quo in the field of NLP.
At the root of this problem lies the fact that there is little understanding and agreement about what
good generalisation looks like, what types of generalisation exist, and which should be prioritised in
varying scenarios. Broadly speaking, generalisation is evaluated by assessing how well a model performs
on a test dataset, given the relationship of this dataset with the data the model was trained on. For
decades, it was common to impose only one simple constraint on this relationship: that the train and test
data are different. Typically, this was achieved by randomly splitting available data into a training and a
test partition. Generalisation was thus evaluated by training and testing models on different but similarly
sampled data, assumed to be independent and identically distributed (i.i.d.). In the past 20 years, we
have seen great strides on such random train–test splits in a range of different applications. Since the
first release of the Penn Treebank (Marcus et al., 1993), F1 scores for labelled constituency parsing went
from above 80% at the end of the previous century (Collins, 1996; Magerman, 1995) and close to 90% in
the first ten years of the current one (e.g. Petrov and Klein, 2007; Sangati and Zuidema, 2011) to scores
up to 96% in recent years (Mrini et al., 2020; Yang and Deng, 2020). On the same corpus, performance
for language modelling went from per-word perplexity scores well above 100 in the mid-90s (Kneser
and Ney, 1995; Rosenfeld, 1996) to a score of 20.5 in 2020 (Brown et al., 2020). In many areas of
NLP, the rate of progress has become even faster in the recent past. Scores for the popular evaluation
suite GLUE went from values between 60 and 70 at its release in 2018 (Wang et al., 2018) to scores
exceeding 90 less than a year later (Devlin et al., 2019), with performances on a wide range of tasks
reaching and surpassing human-level scores by 2019 (e.g. Devlin et al., 2019; Liu et al., 2019b; Wang
et al., 2019, 2018). In 2022, strongly scaled-up models (e.g. Chowdhery et al., 2022) showed astounding
performance on almost all existing i.i.d. natural language understanding benchmarks.
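To make this conventional setup concrete, the following minimal sketch illustrates the random i.i.d. train–test split described above: the available data is shuffled and partitioned, so that the two sets are different yet drawn from the same distribution. The random_split function and the toy corpus are hypothetical illustrations, not taken from any of the benchmarks cited here.

    import random

    def random_split(examples, test_fraction=0.1, seed=0):
        # Shuffle a copy of the data and carve off the last test_fraction as test set;
        # because both portions come from the same shuffled pool, they are (approximately)
        # identically distributed -- the conventional i.i.d. evaluation setup.
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]  # (train, test)

    # Hypothetical toy corpus of (sentence, label) pairs; placeholder data only.
    corpus = [("sentence %d" % i, i % 2) for i in range(1000)]
    train_data, test_data = random_split(corpus, test_fraction=0.1)
    print(len(train_data), len(test_data))  # 900 100

In the non-i.i.d. settings discussed below, the random shuffle would instead be replaced by a partition along some property of the data (for instance genre, domain, or sequence length), so that the test distribution deliberately differs from the training distribution.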
With this progress, however, came the realisation that, for an NLP model, reaching very high or
human-level scores on an i.i.d. test set does not imply that the model robustly generalises to a wide range
of different scenarios in the way humans do. In the recent past, we have witnessed a wave of studies
pointing out generalisation failures in neural models that have state-of-the-art scores on random
train–test splits (Blodgett et al., 2016; Khishigsuren et al., 2022; Kim and Linzen, 2020; Lake and Baroni,
2018; Marcus, 2018; McCoy et al., 2019; Plank, 2016; Razeghi et al., 2022; Sinha et al., 2021, to give
just a few examples). Some show that when models perform well on i.i.d. test splits, they might rely
on simple heuristics that do not robustly generalise in a wide range of non-i.i.d. scenarios (Gardner
et al., 2020; Kaushik et al., 2019; McCoy et al., 2019), over-rely on stereotypes (Parrish et al., 2022;
Srivastava et al., 2022), or bank on memorisation rather than generalisation (Lewis et al., 2021; Razeghi
et al., 2022). Others instead document cases in which performance drops when the evaluation data differs
from the training data in terms of genre, domain or topic (e.g. Malinin et al., 2021; Michel and Neubig,
2018; Plank, 2016), or when it represents different subpopulations (e.g. Blodgett et al., 2016; Dixon
et al., 2018). Yet other studies focus on models’ inability to generalise compositionally (Dankers et al.,
2022; Kim and Linzen, 2020; Lake and Baroni, 2018; Li et al., 2021b), structurally (Sinha et al., 2021;
Weber et al., 2021; Wei et al., 2021), to longer sequences (Dubois et al., 2020; Raunak et al., 2019), or
to slightly different task formulations of the same problem (Srivastava et al., 2022).
By showing that good performance on traditional train–test splits does not equal good generalisation,
the examples above call into question what kind of model capabilities recent breakthroughs actually
reflect, and they suggest that research on the evaluation of NLP models is catching up with the fast