Benchmarking Long-tail Generalization with Likelihood Splits
Ameya Godbole and Robin Jia
University of Southern California
{ameyagod,robinjia}@usc.edu
Abstract
In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create ‘Likelihood Splits’ where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on SPIDER, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BOOLQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
1 Introduction
Success on in-distribution test data does not necessarily show that a system has solved the underlying task at hand. Systems can achieve artificially high accuracy by exploiting dataset-specific shortcuts, such as spurious feature-label correlations that hold in the data but not in general (Gardner et al., 2021). In many datasets, a large proportion of test examples are similar to training examples, further inflating in-distribution accuracy (Lewis et al., 2021; Czarnowska et al., 2019; Orr et al., 2021). Out-of-distribution (OOD) evaluation paints a clearer picture of a system’s ability to perform the task.
Figure 1: Likelihood Splits: We propose to partition the dataset based on likelihood under a language model. The high-likelihood “head” of the distribution becomes the training set, while we evaluate generalization to the low-likelihood “tail” of the data. Shown here are queries from the SPIDER dataset in different likelihood buckets; one possible tail generalization could be handling uncommon entities with known query types.

Prior work has proposed a variety of methods to test OOD generalization, each with its own strengths and weaknesses. Task-specific behavior tests (Ribeiro et al., 2020; Naik et al., 2018; Gardner et al., 2020) give insights into model behavior but require significant manual (often expert) effort to create. Adversarial data collection, in which annotators try to fool high-performing models (Nie et al., 2020; Potts et al., 2021), also collects challenging examples, but runs the risk of focusing only on a narrow subset of model weaknesses (Bowman and Dahl, 2021; Kaushik et al., 2021). Adversarial filtering removes easy examples from existing datasets (Sakaguchi et al., 2021), but can disproportionately penalize the model used during filtering (Phang et al., 2021). Domain generalization tests transferability to new data domains (Fisch et al., 2019; Miller et al., 2020), but there is no guarantee that generalizing to a given new domain is possible—out-of-domain examples may require skills that are not learnable from the training data (Geiger et al., 2019). Other approaches create dataset splits that test for specific skills, such as length generalization (Lake and Baroni, 2018) and compositional generalization (Shaw et al., 2021),
but they only apply to a narrow subset of tasks.
In this work, we propose Likelihood Splits, a general-purpose method to create challenging OOD splits for existing datasets. The principle behind Likelihood Splits is that any system that claims to reliably process natural language must be able to generalize from more common utterances seen during training to the long tail of rare utterances at test time. Generalization, not merely memorization, is necessary because even a very large training dataset cannot exhaustively cover all possible long-tail examples that may be encountered in the real world. Moreover, standard annotation procedures tend to over-sample examples from the head of the distribution, further ignoring the challenge posed by infrequent examples. We identify tail examples using the likelihood under the GPT-2 language model (Radford et al., 2019). Examples with low likelihood under GPT-2 are placed in the held-out evaluation sets and the high-likelihood examples are used as the training set (see Figure 1).

Likelihood Splits are a novel, widely applicable strategy that can create interesting generalization benchmarks at no additional annotation cost. They are more challenging than a random split across a wide range of tasks: error rates relative to random splits increase by 59% for T5 (Raffel et al., 2020) on SPIDER (Yu et al., 2018), 93% for ELECTRA (Clark et al., 2020) on SNLI (Bowman et al., 2015), and 33% for ROBERTA (Liu et al., 2019) on BOOLQ (Clark et al., 2019). Moreover, the proposed splits do not unfairly penalize the GPT-2 model used to create the splits when it is used as a task model, thus avoiding one of the downsides of adversarial filtering. We identify many independent challenges required by Likelihood Splits, including generalizing to rare words, complex programs, and syntactically complex sentences. We encourage future benchmark creators to release Likelihood Splits as a complementary evaluation to the standard IID evaluation to better test out-of-distribution generalization performance. We will release the splits discussed in this work along with the code to easily create Likelihood Splits of other datasets.1

1 github.com/ameyagodbole/long-tail-likelihood-splits
2 Related Work
Generalizing to the long-tail.
Evaluating sys-
tems on long-tail phenomena is important, espe-
cially because many datasets over-sample the head
of the distribution. For example, some question-
answering (QA) datasets limit their purview to pop-
ular web-pages (Yang et al.,2018) or frequent user
queries (Kwiatkowski et al.,2019). Lewis et al.
(2021); Liu et al. (2021) demonstrate that mod-
els trained on these datasets often fail on exam-
ples that do not match the most frequent training
cases. Similar observations have been made in en-
tity linking to rare entities (Orr et al., 2021; Chen
et al.,2021), information retrieval for open-domain
QA (Sciavolino et al.,2021), relation extraction
for rare relations (Sabo et al.,2021) and lexicon
induction for rare senses in machine translation
(Czarnowska et al.,2019). Zero-shot performance
of large LMs on numerical reasoning and factoid
questions is also correlated with the frequency of
occurrence of the facts in the pre-training corpus
(Razeghi et al.,2022;Kandpal et al.,2022;Elazar
et al., 2022). We do not test whether models can
memorize long-tail knowledge; instead, we test
whether models can process long-tail sentences.
Naik et al. (2022) note that it is challenging to cata-
logue and evaluate generalization along micro-level
dimensions and instead propose benchmarks that
vary along macro-level dimensions (such as the lan-
guage and domain) as a proxy. We hypothesize that
LMs learn which micro-level phenomena are rare,
as this would improve their overall language mod-
eling objective. In this work, we present a recipe
that leverages LMs to evaluate tail generalization
for any language task.
Task-specific test sets.
Ribeiro et al. (2020) use
templated queries to evaluate model performance
under various linguistic perturbations. This method
requires dataset designers to define phenomena of
interest and axes of perturbation along which labels
may be preserved or changed. Naik et al. (2018)
analyze model errors and instantiate tests that ex-
plicitly evaluate models on more examples from
each error class. Gardner et al. (2020) check for
model consistency under local perturbations of test
set examples. All of these approaches require anno-
tators to create new examples, whereas we propose
a method to resplit existing datasets.
Adversarial approaches.
Søgaard et al. (2021)
argue that random splits over-estimate model per-
formance on new in-domain data and recommend
the use of adversarial and heuristically challenging
splits to estimate generalizability. Adversarial data
collection promotes the creation of difficult exam-
ples by encouraging annotators to fool a model-
in-the-loop (Nie et al.,2020;Potts et al.,2021;
Kiela et al.,2021). Similarly, Adversarial Filtering
removes examples that are easy for a given task
model in order to create more challenging bench-
marks (Sakaguchi et al.,2021;Yang et al.,2018).
However, Kaushik et al. (2021) and Bowman and
Dahl (2021) point out that adversarially collected
or filtered examples may focus on a narrow set of
skills that the “in-the-loop” model lacks, instead of
covering all the abilities required for the underlying
task. Additionally, the “in-the-loop” task model is
disproportionately penalized by the adversarial test
sets (Phang et al.,2021). We show in §4.3 that
Likelihood Splits do not suffer from this issue.
Domain shift.
In NLP, domains can be charac-
terized by the changes in vocabulary and distribu-
tion of word use, styles used by authors, and the
intended audience. Fisch et al. (2019) pose the
challenge of developing QA systems that need to
generalize to unseen domains. Miller et al. (2020)
find that QA models trained on SQUAD show a
performance drop on new domains (while human
baseline performance remains unchanged); Miller
et al. (2021); Hendrycks et al. (2020) inter alia per-
form similar analyses of domain shift. SPIDER (Yu
et al.,2018) and GRAILQA (Gu et al.,2021) eval-
uate semantic parsing on unseen table and knowl-
edge base domains respectively. Domain shift is
an orthogonal axis of generalization; we focus on
generalizing to rare utterances in the same domain.
Out-of-distribution detection.
Previous work
in OOD detection has used high generative model
perplexity as a sign of outliers (Arora et al.,2021;
Ren et al.,2019;Lee et al.,2018). Our intuition
is similar: low likelihood (high perplexity) is an
indicator of rare examples. However, unlike prior
work, we use likelihood scores for benchmark creation.
Moreover, in our setting all examples have been
collected under the same data collection protocol,
so none of the examples are truly OOD.
Compositional generalization.
The ability to
“compose” the meaning of a new utterance from the
known meaning of its parts (Fodor and Pylyshyn,
1988) is an important aspect of language under-
standing. The deterministic grammar of program-
ming languages makes semantic parsing, the task
of translating a natural language utterance into a
logical program, a good testbed for evaluating com-
positional generalization (Lake and Baroni,2018;
Kim and Linzen,2020;Hupkes et al.,2020;Key-
sers et al.,2020;Shaw et al.,2021). However, for
tasks where the constituent blocks are not clearly
defined, it is unclear how to create such evaluation
splits of the data. We compare against composi-
tional generalization splits of the semantic parsing
dataset SPIDER (Yu et al.,2018) in §4.
3 Capturing the Tail of the Distribution
In order to find the tail within a dataset, we approximate the likelihood of an utterance in the real distribution with its likelihood under a language model (LM). Our method can be easily modified to create meaningful splits for any language task. We demonstrate this by creating Likelihood Splits for:
SPIDER, a semantic parsing dataset (Yu et al., 2018) consisting of natural language questions and corresponding SQL programs;
SNLI, a natural language inference dataset (Bowman et al., 2015) consisting of premise and hypothesis sentences paired with labels denoting that the hypothesis is entailed by, neutral to, or contradictory to the premise;
BOOLQ, a question-answering dataset (Clark et al., 2019) consisting of passages, associated questions, and binary yes/no labels.
3.1 General Approach
We consider language tasks where models must map an input x to an output y (e.g., a SQL query or a label). The input x may be either a single sentence (e.g., semantic parsing) or a pair of sentences (e.g., natural language inference), in which case we write x = (x1, x2). Given a dataset D of (x, y) pairs and a desired proportion p of evaluation examples, our method partitions D into subsets Dtrain and Deval where |Deval| ≈ p · |D|. More specifically, we will first assign a likelihood score s(x) to each x ∈ D, then choose Deval to be the ⌊p · |D|⌋ examples in D with the lowest value of s(x), and choose Dtrain = D \ Deval. In §3.2, we describe a few different ways to define s. In §3.3, we describe a modification to this procedure that controls for varying length between examples. Finally, we describe task-specific adjustments in §3.4.
3.2 Assigning Likelihood Scores s(x)
We use the total log-likelihood over the query tokens assigned by the GPT-2 language model as the score s(x) for every example. There are two ways to use the LM: (1) prompting a frozen LM or (2) fine-tuning the LM on the dataset.
Task | Prompting | Fine-Tuning
SPIDER | write a database question: {query} | <|endoftext|> {query}
BOOLQ | Passage: {passage} Ask a question about the passage: {question}
SNLI | Premise: {premise} This hypothesis is {entailed/neutral/a contradiction}: {hypothesis}

Table 1: Input formats for single-sentence and sentence-pair tasks in the prompting and fine-tuning settings. Values in curly braces are plugged in from the example. For SNLI, we provide the label in the prompt to prime the LM to the class of hypothesis. The LM is trained (when fine-tuning) and evaluated on generating the final query segment ({query}, {question}, or {hypothesis}).
Past work has shown that prompting, i.e. prepending a task-specific string to the query, helps GPT-2 generalize zero-shot to new tasks (Radford et al., 2019). We use simple prompts that describe the task and prime the LM to the text we expect it to generate (see Table 1). For sentence-pair tasks (such as SNLI and BOOLQ), it is necessary to compare the relation between two pieces of text and not just each piece in isolation. Thus, it is intuitive to describe unlikely examples by the conditional likelihood of x2 given x1. We demonstrate the flexibility of our approach by providing the label in the prompt if it adds additional information about the text to be generated (e.g. in SNLI).2 We will refer to this setting, which uses the prompted LM, with the tag ll_split pt in the rest of the work.

2 We include the label in the prompt for SNLI but not BOOLQ because the resulting prompts seemed most natural for each dataset. This choice was made before assessing downstream behavior.
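The sketch below shows one way to compute s(x) in the prompted (ll_split pt) setting with a frozen GPT-2 from the HuggingFace transformers library, summing log-probabilities over the query tokens only. The SPIDER-style prompt follows Table 1; the helper name, the example question, and the whitespace handling at the prompt/query boundary are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

@torch.no_grad()
def prompted_log_likelihood(prompt: str, query: str) -> float:
    """Total log-likelihood of `query` tokens conditioned on `prompt`.

    Prompt tokens are excluded from the sum so only the query is scored.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, query_ids], dim=1)
    # Shift-by-one scoring: logits at position t predict token t+1.
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only positions whose target is a query token.
    query_ll = token_ll[:, prompt_ids.shape[1] - 1:]
    return query_ll.sum().item()

# Illustrative usage with the Table 1 prompt format for SPIDER.
# The leading space makes the query tokenize as a continuation of the prompt.
s_x = prompted_log_likelihood("write a database question:",
                              " How many singers do we have?")
```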
The dataset curator may also choose to fine-tune the LM to better capture the task distribution. We fine-tune the GPT-2 LM to maximize either the probability of x for single-sentence tasks or the conditional probability of x2 given the prompt for sentence-pair tasks. When fine-tuning the LM on the dataset, we need to ensure that it is not used to assign scores to the examples it is trained on. Given the dataset D, we first randomly partition D into k folds. For each fold, we fine-tune an LM on the remaining folds and use it to assign log-likelihood scores to examples in the held-out fold. We refer the reader to Appendix A.2 for fine-tuning details. We will refer to this setting as ll_split henceforth.
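A minimal sketch of the k-fold scoring protocol is given below. The `fine_tune` and `score_with` callables are hypothetical placeholders for the fine-tuning and scoring routines (Appendix A.2 of the paper gives the actual fine-tuning details); only the cross-fold bookkeeping that keeps scoring out-of-fold is shown.

```python
import random
from typing import Callable, List

def cross_fold_scores(
    dataset: List[dict],
    fine_tune: Callable[[List[dict]], object],    # hypothetical: returns a fine-tuned LM
    score_with: Callable[[object, dict], float],  # hypothetical: log-likelihood under that LM
    k: int = 3,
    seed: int = 0,
) -> List[float]:
    """Score each example with an LM that never saw it during fine-tuning."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal folds
    scores = [0.0] * len(dataset)
    for held_out in folds:
        held_out_set = set(held_out)
        train_examples = [dataset[i] for i in indices if i not in held_out_set]
        lm = fine_tune(train_examples)                 # fine-tune on the other k-1 folds
        for i in held_out:
            scores[i] = score_with(lm, dataset[i])     # score only held-out examples
    return scores
```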
3.3 Controlling for Length
Since the likelihood of an utterance is negatively correlated with its length, we create a split that explicitly controls for the effect of length. After assigning a likelihood score to every utterance, the examples are bucketed based on length (defined by tokenizing the utterance with NLTK (Loper and Bird, 2002)). For single-sentence and sentence-pair tasks, we use the length of the query (x and x2, respectively) over which the log-likelihood was computed. Within each bucket, a fraction p of the examples with the lowest s(x) are put in the evaluation set; aggregating examples from all buckets, |Deval| ≈ p · |D|. We will refer to this control setting with the modifier (-len) henceforth.3

3 We also considered using perplexity, which normalizes for length, but it led to an over-correction where short examples were filtered into the evaluation set.
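A possible implementation of the length-controlled selection is sketched below. It assumes one bucket per exact NLTK token length; the paper does not specify the bucket granularity, so coarser bins may be used instead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

import nltk  # requires nltk.download("punkt") for word_tokenize

def length_controlled_split(
    examples: List[dict],
    scores: List[float],
    eval_fraction: float,
    text_key: str = "query",
) -> Tuple[List[dict], List[dict]]:
    """Select the lowest-scoring fraction within each length bucket (ll_split -len)."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for i, ex in enumerate(examples):
        length = len(nltk.word_tokenize(ex[text_key]))  # length of the scored query
        buckets[length].append(i)

    eval_ids = set()
    for idx_list in buckets.values():
        idx_list.sort(key=lambda i: scores[i])          # rarest first within the bucket
        n_eval = int(eval_fraction * len(idx_list))
        eval_ids.update(idx_list[:n_eval])

    train = [ex for i, ex in enumerate(examples) if i not in eval_ids]
    eval_ = [ex for i, ex in enumerate(examples) if i in eval_ids]
    return train, eval_
```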
3.4 Dataset-specific Choices and Details
SPIDER. We follow Shaw et al. (2021) and swap examples between the train and evaluation sets such that every logical program atom in the evaluation set appears at least once in the train set. This ensures that the model is not required to generalize to unseen function names and declarations.
SNLI and BOOLQ. We ensure label balance in our splits (as in the original data) by splitting the examples for each label separately, then combining the resulting train and evaluation sets.
Development sets. Csordás et al. (2021) show that without development sets that are in-distribution with respect to challenging test sets, models are prone to over-fitting, which under-estimates their ability to generalize. Thus, after dividing the data into train and evaluation sets, we randomly divide the evaluation set into a development set and a test set. Other details are reported in Appendix A.1.
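The label-balancing and development-set choices above can be layered on top of the basic split as follows. The 50/50 development/test ratio is an assumption for illustration; the text only states that the evaluation set is divided randomly.

```python
import random
from collections import defaultdict
from typing import Callable, List, Tuple

def label_balanced_split(
    examples: List[dict],
    split_fn: Callable[[List[dict]], Tuple[List[dict], List[dict]]],
    label_key: str = "label",
    dev_fraction: float = 0.5,   # assumed ratio; not specified in the text
    seed: int = 0,
) -> Tuple[List[dict], List[dict], List[dict]]:
    """Split each label group separately, then carve a dev set out of the eval pool."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    train, eval_pool = [], []
    for group in by_label.values():
        g_train, g_eval = split_fn(group)   # e.g. likelihood_split from the earlier sketch
        train.extend(g_train)
        eval_pool.extend(g_eval)

    rng = random.Random(seed)
    rng.shuffle(eval_pool)
    n_dev = int(dev_fraction * len(eval_pool))
    dev, test = eval_pool[:n_dev], eval_pool[n_dev:]
    return train, dev, test
```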
4 Experiments
Next, we benchmark task models on our Likelihood Splits. Splits created using GPT2-medium are the focus of our analysis; we briefly study the effect of switching the LM to GPT2-large in §4.4. When creating Likelihood Splits, the number of folds k for fine-tuning the LM (§3.2) can be chosen by the dataset curator. For results in §4 and §5, we set k = 3 arbitrarily. We analyse the effect of choosing a different value of k in Appendix A.3. Our results show that the trends and observations discussed here hold true for other values of k.
4.1 Benchmarked Models
One of the goals of this work is to expose long-
tail generalization as a challenge to state-of-the-art
models; SotA models on the considered bench-
marks are all pre-trained models. We make efforts
to show that models with different pre-training data
and objectives are similarly affected by our pro-
posed splits. Hyperparameters and training details
for the reported models are in Appendix A.2.
Semantic parsing.
Following Shaw et al. (2021),
we benchmark the competitive T5-base model (Raf-
fel et al.,2020) on all splits of the SPIDER dataset.
In order to test whether these splits are adversarial
to the data splitting language model, we addition-
ally fine-tune GPT2-medium models for the seman-
tic parsing task. To study the effect of model size,
we fine-tune T5-small and GPT2-small variants.
SNLI and BOOLQ. We fine-tune two competitive models (ROBERTA (Liu et al., 2019) and ELECTRA (Clark et al., 2020)) at two model sizes (base and large). Additionally, following Poliak et al. (2018), we train a ROBERTA-large model to perform the task given just the hypothesis. The performance of a hypothesis-only model estimates the degree of spurious correlations in the dataset that give away the label.
4.2 Alternative Splits for Semantic Parsing
We compare the difficulty of the Likelihood Splits
with heuristic challenge splits from past work.
Length.
Past work has established that text gen-
eration models trained on short inputs struggle to
generalize to longer inputs at test time (Lake and
Baroni,2018;Hupkes et al.,2020;Newman et al.,
2020). We create Length splits by placing exam-
ples with the longest input queries in the evaluation
set and the remaining examples in the training set.
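A minimal sketch of this Length split, under the assumption that a `length_fn` helper returns the tokenized input length:

```python
from typing import Callable, List, Tuple

def length_split(
    examples: List[dict],
    length_fn: Callable[[dict], int],   # hypothetical: e.g. number of NLTK tokens in the input
    eval_fraction: float,
) -> Tuple[List[dict], List[dict]]:
    """Place the longest inputs in the evaluation set; the rest form the training set."""
    ordered = sorted(examples, key=length_fn, reverse=True)  # longest first
    n_eval = int(eval_fraction * len(examples))
    return ordered[n_eval:], ordered[:n_eval]                # (train set, eval set)
```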
TMCD.
Systematicity is the ability to composi-
tionally derive the meaning of an utterance from the
known meaning of its parts. Past work studying sys-
tematicity in semantic parsing has defined “atoms”
as the smallest constituents of the grammar (e.g.
variables and function names) and “compounds”
as complex structures formed by composing atoms
(e.g. multi-argument functions and nested func-
tion calls) (Keysers et al., 2020). Following Shaw et al. (2021), we create TMCD (Target Maximum Compound Divergence) splits of SPIDER by maximizing the divergence between the distributions of compounds in the train and evaluation sets.

Split | T5-base | T5-small | GPT2-medium (∆) | GPT2-small
Random | 78.6 | 75.2 | 69.3 (9.3) | 64.7
Length | 50.0 | 44.5 | 39.9 (10.1) | 34.0
Template | 60.1 | 60.0 | 51.4 (8.7) | 45.1
TMCD | 66.2 | 64.1 | 56.2 (10) | 51.4
Split LM: GPT2-medium
ll_split | 66.0 | 64.2 | 57.2 (8.8) | 51.8
ll_split (-len) | 71.3 | 67.3 | 59.9 (11.4) | 57.3
ll_split pt | 60.6 | 59.7 | 50.9 (9.7) | 45.9
ll_split pt (-len) | 73.5 | 68.4 | 64.5 (9) | 58.3
Split LM: GPT2-large
ll_split | 61.8 | 61.8 | 53.7 (8.1) | 48.3
ll_split (-len) | 69.7 | 66.2 | 59.1 (10.6) | 54.8
ll_split pt | 63.0 | 58.3 | 51.4 (11.6) | 45.7
ll_split pt (-len) | 72.0 | 70.1 | 63.4 (8.6) | 57.5

Table 2: SPIDER: Exact sequence prediction accuracy for Likelihood Splits created by GPT2-medium and GPT2-large, and other challenge splits. Likelihood Splits are more challenging than random splits while not being adversarial to GPT2-medium. (∆) marks the performance drop from T5-base to GPT2-medium.
Template.
These splits test the ability of parsers
to generate unseen program templates (canonical-
ized programs formed by anonymizing all variable
names and standardizing syntax). We group ex-
amples in the SPIDER dataset based on templates
defined by Finegan-Dollak et al. (2018). To cre-
ate the evaluation set, we randomly pick groups of examples until the target set size is reached; the
remaining groups form the training set.
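The Template split procedure can be sketched as follows; `template_fn`, which anonymizes variable names and standardizes syntax to produce the canonical template, is a hypothetical placeholder for the canonicalization of Finegan-Dollak et al. (2018).

```python
import random
from collections import defaultdict
from typing import Callable, List, Tuple

def template_split(
    examples: List[dict],
    template_fn: Callable[[dict], str],   # hypothetical: canonicalize/anonymize the program
    eval_fraction: float,
    seed: int = 0,
) -> Tuple[List[dict], List[dict]]:
    """Hold out whole template groups so eval templates never appear in training."""
    groups = defaultdict(list)
    for ex in examples:
        groups[template_fn(ex)].append(ex)

    group_keys = list(groups.keys())
    random.Random(seed).shuffle(group_keys)

    target = eval_fraction * len(examples)
    eval_set, train_set = [], []
    picked = 0
    for key in group_keys:
        if picked < target:
            eval_set.extend(groups[key])   # pick whole groups until the target size is reached
            picked += len(groups[key])
        else:
            train_set.extend(groups[key])  # remaining groups form the training set
    return train_set, eval_set
```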
4.3 Model Performance on Likelihood Splits
In Table 2, we report exact match accuracy4 on the data splits using the SPIDER evaluation suite. For SNLI and BOOLQ, we report the accuracy of benchmarked models in Table 3. We create 3 random splits and report the mean and standard deviation of accuracy of models trained on each split.

4 This metric accounts for the fact that SQL statements are invariant to certain shuffling and changes in variable names.

Likelihood Splits are more challenging than random splits. On SPIDER, for example, T5-base accuracy on ll_split is 12.6 points lower than the random split accuracy. Likelihood Splits lead to drops in performance that are comparable to the