is crucial to ensure that models behave robustly, reliably, and fairly when making predictions about data
different from the data that they learned from, which is of critical importance when models are employed
in the real world. Others see good generalisation as intrinsically equivalent to good performance and
believe that, without it, a model is not truly able to perform the task we intend it to. Yet others strive for
good generalisation because they believe models should behave in a human-like way, and humans are
known to generalise well. While the importance of generalisation is almost undisputed – in the past five
years, over 1200 papers in the ACL Anthology alone mentioned it in their title or abstract – systematically
testing generalisation is not the status quo in the field of NLP.
At the root of this problem lies the fact that there is little understanding and agreement about what
good generalisation looks like, what types of generalisation exist, and which should be prioritised in
varying scenarios. Broadly speaking, generalisation is evaluated by assessing how well a model performs
on a test dataset, given the relationship of this dataset with the data the model was trained on. For
decades, it was common to impose only one simple constraint on this relationship: that the train and test
data are different. Typically, this was achieved by randomly splitting available data into a training and a
test partition. Generalisation was thus evaluated by training and testing models on different but similarly
sampled data, assumed to be independent and identically distributed (i.i.d.). In the past 20 years, we
have seen great strides on such random train–test splits in a range of different applications. Since the
first release of the Penn Treebank (Marcus et al., 1993), F1 scores for labelled constituency parsing went
from above 80% at the end of the previous century (Collins, 1996; Magerman, 1995) and close to 90% in
the first ten years of the current one (e.g. Petrov and Klein, 2007; Sangati and Zuidema, 2011) to scores
up to 96% in recent years (Mrini et al., 2020; Yang and Deng, 2020). On the same corpus, performance
for language modelling went from per-word perplexity scores well above 100 in the mid-90s (Kneser
and Ney, 1995; Rosenfeld, 1996) to a score of 20.5 in 2020 (Brown et al., 2020). In many areas of
NLP, the rate of progress has become even faster in the recent past. Scores for the popular evaluation
suite GLUE went from values between 60 and 70 at its release in 2018 (Wang et al., 2018) to scores
exceeding 90 less than a year later (Devlin et al., 2019), with performances on a wide range of tasks
reaching and surpassing human-level scores by 2019 (e.g. Devlin et al., 2019; Liu et al., 2019b; Wang
et al., 2019, 2018). In 2022, strongly scaled-up models (e.g. Chowdhery et al., 2022) showed astounding
performance on almost all existing i.i.d. natural language understanding benchmarks.
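To make this conventional setup concrete, the following minimal sketch illustrates the random i.i.d. train–test split described above: the available data is shuffled and partitioned, so that the two sets are different yet drawn from the same distribution. The random_split function and the toy corpus are hypothetical illustrations, not taken from any of the benchmarks cited here.

    import random

    def random_split(examples, test_fraction=0.1, seed=0):
        # Shuffle a copy of the data and carve off the last test_fraction as test set;
        # because both portions come from the same shuffled pool, they are (approximately)
        # identically distributed -- the conventional i.i.d. evaluation setup.
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]  # (train, test)

    # Hypothetical toy corpus of (sentence, label) pairs; placeholder data only.
    corpus = [("sentence %d" % i, i % 2) for i in range(1000)]
    train_data, test_data = random_split(corpus, test_fraction=0.1)
    print(len(train_data), len(test_data))  # 900 100

In the non-i.i.d. settings discussed below, the random shuffle would instead be replaced by a partition along some property of the data (for instance genre, domain, or sequence length), so that the test distribution deliberately differs from the training distribution.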
With this progress, however, came the realisation that, for an NLP model, reaching very high or
human-level scores on an i.i.d. test set does not imply that the model robustly generalises to a wide range
of different scenarios in the way humans do. In the recent past, we have witnessed a wave of studies
pointing out generalisation failures in neural models that have state-of-the-art scores on random
train–test splits (Blodgett et al., 2016; Khishigsuren et al., 2022; Kim and Linzen, 2020; Lake and Baroni,
2018; Marcus, 2018; McCoy et al., 2019; Plank, 2016; Razeghi et al., 2022; Sinha et al., 2021, to give
just a few examples). Some show that when models perform well on i.i.d. test splits, they might rely
on simple heuristics that do not robustly generalise in a wide range of non-i.i.d. scenarios (Gardner
et al., 2020; Kaushik et al., 2019; McCoy et al., 2019), over-rely on stereotypes (Parrish et al., 2022;
Srivastava et al., 2022), or bank on memorisation rather than generalisation (Lewis et al., 2021; Razeghi
et al., 2022). Others instead document cases in which performance drops when the evaluation data differs
from the training data in terms of genre, domain or topic (e.g. Malinin et al., 2021; Michel and Neubig,
2018; Plank, 2016), or when it represents different subpopulations (e.g. Blodgett et al., 2016; Dixon
et al., 2018). Yet other studies focus on models’ inability to generalise compositionally (Dankers et al.,
2022; Kim and Linzen, 2020; Lake and Baroni, 2018; Li et al., 2021b), structurally (Sinha et al., 2021;
Weber et al., 2021; Wei et al., 2021), to longer sequences (Dubois et al., 2020; Raunak et al., 2019), or
to slightly different task formulations of the same problem (Srivastava et al., 2022).
By showing that good performance on traditional train–test splits does not equal good generalisation,
the examples above call into question what kind of model capabilities recent breakthroughs actually
reflect, and they suggest that research on the evaluation of NLP models is catching up with the fast