State-of-the-art generalisation research in NLP:
A taxonomy and review
Dieuwke Hupkes1, Mario Giulianelli2,3, Verna Dankers1,4, Mikel Artetxe5
Yanai Elazar6,7, Tiago Pimentel8, Christos Christodoulopoulos9, Karim Lasri10,11
Naomi Saphra12, Arabella Sinclair13, Dennis Ulmer14, Florian Schottmann3,16
Khuyagbaatar Batsuren17, Kaiser Sun1, Koustuv Sinha1, Leila Khalatbari18
Maria Ryskina19, Rita Frieske18, Ryan Cotterell3, Zhijing Jin3,20
dieuwkehupkes@meta.com mgiulianelli@inf.ethz.ch
vernadankers@gmail.com
1FAIR 2University of Amsterdam 3ETH Zürich 4University of Edinburgh
5Reka AI 6Allen Institute for AI 7University of Washington 8University of Cambridge
9Amazon Alexa AI 10École Normale Supérieure-PSL 11The World Bank 12Harvard University
13University of Aberdeen 14IT University of Copenhagen 15Pioneer Centre for Artificial Intelligence
16Textshuttle 17National University of Mongolia 18Hong Kong University of Science and Technology
19MIT 20Max Planck Institute for Intelligent Systems
Abstract
The ability to generalise well is one of the primary desiderata of natural language processing
(NLP). Yet, what ‘good generalisation’ entails and how it should be evaluated is not well understood,
nor are there any evaluation standards for generalisation. In this paper, we lay the groundwork to
address both of these issues. We present a taxonomy for characterising and understanding generali-
sation research in NLP. Our taxonomy is based on an extensive literature review of generalisation re-
search, and contains five axes along which studies can differ: their main motivation, the type of gen-
eralisation they investigate, the type of data shift they consider, the source of this data shift, and the
locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 papers that
test generalisation, for a total of more than 600 individual experiments. Considering the results of
this review, we present an in-depth analysis that maps out the current state of generalisation research
in NLP, and we make recommendations for which areas might deserve attention in the future. Along
with this paper, we release a webpage where the results of our review can be dynamically explored,
and which we intend to update as new NLP generalisation studies are published. With this work, we
aim to take steps towards making state-of-the-art generalisation testing the new status quo in NLP.
This preprint was published as an Analysis article in Nature Machine Intelligence.
Please refer to the published version when citing this work.
1 Introduction
Good generalisation, roughly defined as the ability to successfully transfer representations, knowledge,
and strategies from past experience to new experiences, is one of the primary desiderata for models of
natural language processing (NLP), as well as for models in the wider field of machine learning (Elan-
govan et al., 2021; Kirk et al., 2021; Lake et al., 2017; Linzen, 2020; Marcus, 2018, 1998; Schmidhuber,
1990; Shen et al., 2021; Wong and Wang, 2007; Yogatama et al., 2019, i.a.). For some, generalisation
is crucial to ensure that models behave robustly, reliably, and fairly when making predictions about data
different from the data that they learned from, which is of critical importance when models are employed
in the real world. Others see good generalisation as intrinsically equivalent to good performance and
believe that without it a model is not truly able to conduct the task we intend it to. Yet others strive for
good generalisation because they believe models should behave in a human-like way, and humans are
known to generalise well. While the importance of generalisation is almost undisputed – in the past five
years, in the ACL Anthology alone over 1200 papers mentioned it in their title or abstract – systematic
generalisation testing is not the status quo in the field of NLP.
At the root of this problem lies the fact that there is little understanding and agreement about what
good generalisation looks like, what types of generalisation exist, and which should be prioritised in
varying scenarios. Broadly speaking, generalisation is evaluated by assessing how well a model performs
on a test dataset, given the relationship of this dataset with the data the model was trained on. For
decades, it was common to exert only one simple constraint on this relationship: that the train and test
data are different. Typically, this was achieved by randomly splitting available data into a training and a
test partition. Generalisation was thus evaluated by training and testing models on different but similarly
sampled data, assumed to be independent and identically distributed (i.i.d.). In the past 20 years, we
have seen great strides on such random train–test splits in a range of different applications. Since the
first release of the Penn Treebank (Marcus et al., 1993), F1 scores for labelled constituency parsing went
from above 80% at the end of the previous century (Collins, 1996; Magerman, 1995) and close to 90% in
the first ten years of the current one (e.g. Petrov and Klein, 2007; Sangati and Zuidema, 2011) to scores
up to 96% in recent years (Mrini et al., 2020; Yang and Deng, 2020). On the same corpus, performance
for language modelling went from per-word perplexity scores well above 100 in the mid-90s (Kneser
and Ney, 1995; Rosenfeld, 1996) to a score of 20.5 in 2020 (Brown et al., 2020). In many areas of
NLP, the rate of progress has become even faster in the recent past. Scores for the popular evaluation
suite GLUE went from values between 60 and 70 at its release in 2018 (Wang et al., 2018) to scores
exceeding 90 less than a year after (Devlin et al., 2019), with performances on a wide range of tasks
reaching and surpassing human-level scores by 2019 (e.g. Devlin et al., 2019; Liu et al., 2019b; Wang
et al., 2019, 2018). In 2022, strongly scaled-up models (e.g. Chowdhery et al., 2022) showed astounding
performances on almost all existing i.i.d. natural language understanding benchmarks.
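To make the conventional evaluation protocol described above concrete, here is a minimal sketch, in Python, of a random i.i.d. train–test split; the toy corpus, the function name random_split and the split ratio are purely illustrative and are not taken from any of the cited benchmarks.

    import random

    def random_split(examples, test_fraction=0.1, seed=0):
        """Conventional i.i.d. evaluation: shuffle the data and split it at random,
        so that train and test are different but similarly sampled."""
        rng = random.Random(seed)
        shuffled = list(examples)   # copy, so the caller's list is untouched
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]   # (train, test)

    # Hypothetical toy corpus of (sentence, label) pairs.
    corpus = [(f"sentence {i}", i % 2) for i in range(1000)]
    train, test = random_split(corpus)
    # The only constraint exerted on the split: train and test do not overlap.
    assert not set(train) & set(test)

Under this protocol, the test set is assumed to follow the same distribution as the training set, which is exactly the assumption that the studies discussed below call into question.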
With this progress, however, came the realisation that, for an NLP model, reaching very high or
human-level scores on an i.i.d. test set does not imply that the model robustly generalises to a wide range
of different scenarios in the way humans do. In the recent past, we witnessed a tide of different studies
pointing out generalisation failures in neural models that have state-of-the-art scores on random train–
test splits (Blodgett et al., 2016; Khishigsuren et al., 2022; Kim and Linzen, 2020; Lake and Baroni,
2018; Marcus, 2018; McCoy et al., 2019; Plank, 2016; Razeghi et al., 2022; Sinha et al., 2021, to give
just a few examples). Some show that when models perform well on i.i.d. test splits, they might rely
on simple heuristics that do not robustly generalise in a wide range of non-i.i.d. scenarios (Gardner
et al., 2020; Kaushik et al., 2019; McCoy et al., 2019), over-rely on stereotypes (Parrish et al., 2022;
Srivastava et al., 2022), or bank on memorisation rather than generalisation (Lewis et al., 2021; Razeghi
et al., 2022). Others instead document cases in which performance drops when the evaluation data differs
from the training data in terms of genre, domain or topic (e.g. Malinin et al., 2021; Michel and Neubig,
2018; Plank, 2016), or when it represents different subpopulations (e.g. Blodgett et al., 2016; Dixon
et al., 2018). Yet other studies focus on models’ inability to generalise compositionally (Dankers et al.,
2022; Kim and Linzen, 2020; Lake and Baroni, 2018; Li et al., 2021b), structurally (Sinha et al., 2021;
Weber et al., 2021; Wei et al., 2021), to longer sequences (Dubois et al., 2020; Raunak et al., 2019), or
to slightly different task formulations of the same problem (Srivastava et al., 2022).
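To contrast with the random splits above, the following sketch illustrates one of the simplest non-i.i.d. protocols mentioned in this paragraph, generalisation to longer sequences: the model is trained only on short inputs and evaluated only on strictly longer ones. The length threshold, the function name length_split and the toy data are hypothetical and only meant to show the shape of such a split.

    def length_split(examples, max_train_length=10):
        """Train only on short inputs and test only on longer ones, so that the
        test data is out-of-distribution with respect to sequence length."""
        train = [(tokens, label) for tokens, label in examples
                 if len(tokens) <= max_train_length]
        test = [(tokens, label) for tokens, label in examples
                if len(tokens) > max_train_length]
        return train, test

    # Hypothetical token-level data: (token list, label) pairs of varying length.
    data = [(["tok"] * n, n % 2) for n in range(1, 31) for _ in range(10)]
    train, test = length_split(data)
    # Every test sequence is longer than any training sequence.
    assert max(len(t) for t, _ in train) < min(len(t) for t, _ in test)

Many of the benchmarks cited above implement more elaborate versions of this idea, partitioning the data along properties other than length.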
By showing that good performance on traditional train–test splits does not equal good generalisation,
the examples above bring into question what kind of model capabilities recent breakthroughs actually
reflect, and they suggest that research on the evaluation of NLP models is catching up with the fast
[Figure 1 content: the five axes and their values.
(1) Motivation: practical, cognitive, intrinsic, fairness & inclusivity.
(2) Type: compositional, structural, across task, cross-lingual, domain, robustness.
(3) Data shift: covariate shift, label shift, full shift, annotated with the factorisation p(y|x)p(x).
(4) Source: naturally occurring shifts, found 'in the wild' (e.g. different domains); natural data splits, i.e. curated splits on natural data (e.g. different lengths); generated shifts, i.e. generated evaluation data for natural training data, e.g. HANS (McCoy et al., 2019), or natural evaluation data for a generated training set; fully generated splits, i.e. generated training and evaluation data, e.g. SCAN (Lake and Baroni, 2018).
(5) Locus: from pre-training to training data, from pre-training to test data, from training to test data, or between all stages (multiple loci).
The figure is annotated: generalisation studies have various motivations (1) and can be categorised into types (2); they involve data shifts (3), where the data can come from natural or synthetic sources (4); these shifts can occur in different stages of the modelling pipeline (5).]
Figure 1: A graphical representation of our proposed taxonomy of generalisation in NLP. The taxonomy
consists of five different (nominal) axes that describe the high-level motivation of the work (§2.1), the
type of generalisation the test is addressing (§2.2), what kind of data shift occurs between training and
testing (§2.3), and what the source and locus of this shift are (§2.4 and §2.5, respectively).
recent advances in architectures and training regimes. Unfortunately, this body of work also reveals
that there is no real agreement on what kind of generalisation is important for NLP models: different
studies encompass a wide range of generalisation-related research questions, and they use a wide range
of different methodologies and experimental setups. As of yet, it is unclear how the results of different
studies relate to each other: how should generalisation be assessed, if not with i.i.d. splits? How do we
determine what types of generalisation are already well addressed and which are neglected, or which
types of generalisation should be prioritised? Ultimately, on a meta-level, how can we provide answers
to these important questions without a systematic way to discuss generalisation in NLP? These missing
answers are standing in the way of better model evaluation and model development: what we cannot
measure, we cannot improve.
The current article introduces a new framework to systematise and understand generalisation re-
search, and it is an attempt to provide answers to the questions above. We present a generalisation
taxonomy, a meta-analysis of existing research on generalisation in NLP, a set of online tools that researchers can use through our website to explore and better understand generalisation studies, and evaluation cards that authors can use to comprehensively summarise the generalisation
experiments conducted in their papers. We believe that state-of-the-art generalisation testing should be
the new status quo in NLP, and with this work, we aim to lay the groundwork for facilitating this change.
In the remainder of this article, we first describe the five axes of our taxonomy (§2.1-2.5); these are the
main axes along which generalisation studies differ. In §3, we present our analysis of the current state
of generalisation research, grounded on a review of 449 papers and a total of 619 generalisation experi-
ments. In §4, we summarise our main findings and make concrete recommendations for more sound and
exhaustive generalisation tests in NLP research.
[Figure 2 content: the card has one row per taxonomy axis, listing the possible values for that axis, and marker symbols (here △ and ⃝) are placed on the rows to indicate where each experiment falls.
Motivation: Practical, Cognitive, Intrinsic, Fairness
Generalisation type: Compositional, Structural, Task, Language, Domain, Robustness
Shift type: Covariate, Label, Full, Assumed
Shift source: Naturally occurring, Partitioned natural, Generated shift, Fully generated
Shift locus: Train–test, Finetune train–test, Pretrain–train, Pretrain–test]
Figure 2: Example of an evaluation card that can be used to summarise all experiments in a paper. Authors
can mark where on the five taxonomy axes their experiments belong, as illustrated with symbols for
three hypothetical experiments in this figure. In Appendix B, we further discuss how to use the
evaluation cards and also provide a single-column version of them. On our website, we provide a tool
to automatically generate LaTeX code for evaluation cards.
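Both figures refer to shift types (covariate, label, full), and Figure 1 annotates them with the expression p(y|x)p(x). As a brief, hedged aside for readers unfamiliar with this notation, the LaTeX sketch below spells out the factorisation it refers to; this is our own summary, and the paper's formal definitions of the individual shift types are given in §2.3, which lies outside this excerpt.

    % LaTeX fragment (assumes a standard article preamble).
    % The joint distribution over inputs x and outputs y factorises as
    \[ p(x, y) = p(y \mid x)\, p(x) . \]
    % A data shift between two stages of the modelling pipeline (for instance,
    % between training and testing) means that the corresponding joint
    % distributions differ,
    \[ p_{\mathrm{train}}(x, y) \neq p_{\mathrm{test}}(x, y) , \]
    % which can arise because p(x) differs between the stages, because p(y|x)
    % differs, or because both do; presumably these are the distinctions that
    % the covariate, label and full shift labels capture (see the paper's 2.3).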
2 The generalisation taxonomy
We now begin a discussion of the five axes of the proposed generalisation taxonomy, which are also
visualised in Figure 1 and summarised in Appendix E. The taxonomy is intended not only to help make sense
of past generalisation research in NLP, but also to serve as an active device for characterising ongoing
and future studies. We facilitate this through evaluation
cards – analogous to the model cards proposed by Mitchell et al. (2019) and the data sheets of Gebru
et al. (2021) – which researchers can fill out for the experiments they conducted in their work and include
in their paper. Doing so helps make generalisation evaluation the status quo and enables
effective monitoring of trends in generalisation research. An example of an evaluation card is provided
in Figure 2; Appendix B elaborates on how to use the cards.
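As an informal illustration of what an evaluation card records, the sketch below encodes the five axes and the values shown in Figure 2 as a small Python data structure; the class and field names are our own shorthand for illustration and do not correspond to an interface provided by the authors' website or tooling.

    from dataclasses import dataclass

    # Axis values as they appear on the evaluation card in Figure 2.
    AXES = {
        "motivation": ["practical", "cognitive", "intrinsic", "fairness"],
        "generalisation_type": ["compositional", "structural", "task",
                                "language", "domain", "robustness"],
        "shift_type": ["covariate", "label", "full", "assumed"],
        "shift_source": ["naturally occurring", "partitioned natural",
                         "generated shift", "fully generated"],
        "shift_locus": ["train-test", "finetune train-test",
                        "pretrain-train", "pretrain-test"],
    }

    @dataclass
    class Experiment:
        """One experiment's position on the five taxonomy axes."""
        motivation: str
        generalisation_type: str
        shift_type: str
        shift_source: str
        shift_locus: str

        def __post_init__(self):
            # Guard against values that do not appear on the card.
            for axis, value in vars(self).items():
                if value not in AXES[axis]:
                    raise ValueError(f"{value!r} is not a valid {axis}")

    # A hypothetical compositional-generalisation experiment on fully generated data.
    exp = Experiment("cognitive", "compositional", "covariate",
                     "fully generated", "train-test")

Filling out a card for a paper then amounts to recording one such entry per experiment and marking the corresponding positions on the five axes.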
2.1 Motivation: what is the high-level motivation for a generalisation test?
The first axis we consider is the high-level motivation of a generalisation study. We identified four
closely intertwined goals of generalisation research in NLP, which we refer to as the practical, the
cognitive, the intrinsic, and the fairness motivation. The motivation of a study determines what type
of generalisation is desirable, it shapes the experimental design, and it affects which conclusions can
be drawn from a model’s display or lack of generalisation. It is therefore crucial for researchers to be
explicitly aware of the motivation underlying their studies to ensure that the experimental setup aligns
with the questions they seek to answer.1
1 As we will see in what follows, the same questions can often be asked with different underlying motivations, which sometimes
makes it difficult to identify the exact motivation of a generalisation study; often, studies may inform conclusions along all four
dimensions. However, given the importance of the motivation for the implications and design of a study, we nevertheless try to
identify the main guiding motive of each study in our review (§3), and we encourage researchers to be explicit about the
motivation of their future studies.
2.1.1 Practical: in what settings can the model be used or improved?
One frequent motivation to study generalisation is of a markedly practical nature. Studies that con-
sider generalisation from a practical perspective seek to assess in what kind of scenarios a model can
be deployed, or which modelling changes can improve performance in various evaluation scenarios. An
example of a research question that is often addressed with a primarily practical motivation is how well
models generalise to different text domains or to data collected in different ways. For instance, Michel
and Neubig (2018) consider how well machine translation models trained on canonical text can gener-
alise to noisy data from an internet platform, and Lazaridou et al. (2021) investigate language model
generalisation to texts written in different time periods. Other questions that are frequently addressed
from a practical perspective concern biases in the training data, and whether models robustly generalise
to datasets that do not share those biases, or whether they have learnt spurious correlations from those biases
(e.g. Behnke et al., 2022; Zhou et al., 2021).
2.1.2 Cognitive: does the model generalise like a human?
A second high-level motivation that drives generalisation research is cognitively oriented and can be sep-
arated into two underlying categories: one focusing on models and one aimed at learning about cognition
and the language faculty in humans through computational models. The first category is related to model
behaviour: human generalisation is a useful reference point for the evaluation of models in NLP because
it is considered to be a hallmark of human intelligence (e.g. Lake et al., 2017; Marcus, 2003) and, per-
haps more importantly, because it is precisely the type of generalisation that is required to successfully
model natural language. Humans learn quickly, from fewer data than existing models, and they easily
(compositionally) recombine concepts they already know to understand concepts they have never before
encountered (Fodor and Pylyshyn, 1988; Linzen, 2020; Marcus, 2018). These feats are thus, arguably,
important desiderata for models.2 In some cases, it might be difficult to distinguish cognitive from prac-
tical motivations: a model that generalises like a human should score well also on practically motivated
tests, which is why the same experiments can be motivated in multiple ways. In our axes-based taxon-
omy, we rely on the motivations provided by the authors. Compositional generalisation experiments,
for instance, can be cognitively motivated – e.g. when the authors suggest machines ought to generalise
the way humans do – but also practically – e.g. when the authors question which machine learning tech-
niques improve performance on benchmarks that happen to be used to test compositional generalisation.
The second, more deeply cognitively inspired category embraces work that evaluates generalisation
in models to learn more about language and cognition (e.g. Baroni, 2021; Hupkes, 2020; Lakretz et al.,
2021b; Marcus, 1999; McClelland and Plaut, 1999). Studies in this category investigate what underlies
generalisation in computational models, not in order to improve the models’ generalisation capabilities
but to derive new hypotheses about the workings of human generalisation.
2.1.3 Intrinsic: does the model solve the task correctly?
A third motivation to evaluate generalisation in NLP models, which cuts through the two previous moti-
vations, appertains to the question of whether models learned the task we intended them to learn, in the
way we intended the task to be learned. The shared presupposition underpinning this type of research is
that if a model has truly learned the task it is trained to do, it should be able to execute this task also in set-
tings that differ from the exact training scenarios. What changes, across studies, is the set of conditions
2 We do not always expect a model to show the same type or level of generalisation as a human. There are cases in which
it is desirable for models to generalise better than humans, for example across languages – something humans typically do not
excel at. In other cases, such as language identification, models already generalise better than humans and would hardly be
useful if they did not.