Benchmarking Long-tail Generalization with Likelihood Splits
Ameya Godbole and Robin Jia
University of Southern California
{ameyagod,robinjia}@usc.edu
Abstract
In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create ‘Likelihood Splits’ where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on SPIDER, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BOOLQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
1 Introduction
Success on in-distribution test data does not necessarily show that a system has solved the underlying task at hand. Systems can achieve artificially high accuracy by exploiting dataset-specific shortcuts, such as spurious feature-label correlations that hold in the data but not in general (Gardner et al., 2021). In many datasets, a large proportion of test examples are similar to training examples, further inflating in-distribution accuracy (Lewis et al., 2021; Czarnowska et al., 2019; Orr et al., 2021). Out-of-distribution (OOD) evaluation paints a clearer picture of a system’s ability to perform the task.
Figure 1: Likelihood Splits: We propose to partition the dataset based on likelihood under a language model. The high-likelihood “head” of the distribution becomes the training set, while we evaluate generalization to the low-likelihood “tail” of the data. Shown here are queries from the SPIDER dataset in different likelihood buckets; one possible tail generalization could be handling uncommon entities with known query types.

Prior work has proposed a variety of methods to test OOD generalization, each with its own strengths and weaknesses. Task-specific behavior tests (Ribeiro et al., 2020; Naik et al., 2018; Gardner et al., 2020) give insights into model behavior but require significant manual (often expert) effort to create. Adversarial data collection, in which annotators try to fool high-performing models (Nie et al., 2020; Potts et al., 2021), also collects challenging examples, but runs the risk of focusing only on a narrow subset of model weaknesses (Bowman and Dahl, 2021; Kaushik et al., 2021). Adversarial filtering removes easy examples from existing datasets (Sakaguchi et al., 2021), but can disproportionately penalize the model used during filtering (Phang et al., 2021). Domain generalization tests transferability to new data domains (Fisch et al., 2019; Miller et al., 2020), but there is no guarantee that generalizing to a given new domain is possible—out-of-domain examples may require skills that are not learnable from the training data (Geiger et al., 2019). Other approaches create dataset splits that test for specific skills, such as length generalization (Lake and Baroni, 2018) and compositional generalization (Shaw et al., 2021),
but they only apply to a narrow subset of tasks.
In this work, we propose Likelihood Splits, a general-purpose method to create challenging OOD splits for existing datasets. The principle behind Likelihood Splits is that any system that claims to reliably process natural language must be able to generalize from more common utterances seen during training to the long tail of rare utterances at test time. Generalization, not merely memorization, is necessary because even a very large training dataset cannot exhaustively cover all possible long-tail examples that may be encountered in the real world. Moreover, standard annotation procedures tend to over-sample examples from the head of the distribution, further ignoring the challenge posed by infrequent examples. We identify tail examples using the likelihood under the GPT-2 language model (Radford et al., 2019). Examples with low likelihood under GPT-2 are placed in the held-out evaluation sets and the high-likelihood examples are used as the training set (see Figure 1).

Likelihood Splits are a novel, widely applicable strategy that can create interesting generalization benchmarks at no additional annotation cost. They are more challenging than a random split across a wide range of tasks: error rates relative to random splits increase by 59% for T5 (Raffel et al., 2020) on SPIDER (Yu et al., 2018), 93% for ELECTRA (Clark et al., 2020) on SNLI (Bowman et al., 2015), and 33% for ROBERTA (Liu et al., 2019) on BOOLQ (Clark et al., 2019). Moreover, the proposed splits do not unfairly penalize the GPT-2 model used to create the splits when it is used as a task model, thus avoiding one of the downsides of adversarial filtering. We identify many independent challenges required by Likelihood Splits, including generalizing to rare words, complex programs, and syntactically complex sentences. We encourage future benchmark creators to release Likelihood Splits as a complementary evaluation to the standard IID evaluation to better test out-of-distribution generalization performance. We will release the splits discussed in this work along with the code to easily create Likelihood Splits of other datasets.1

1 github.com/ameyagodbole/long-tail-likelihood-splits
2 Related Work
Generalizing to the long-tail.
Evaluating sys-
tems on long-tail phenomena is important, espe-
cially because many datasets over-sample the head
of the distribution. For example, some question-
answering (QA) datasets limit their purview to pop-
ular web-pages (Yang et al.,2018) or frequent user
queries (Kwiatkowski et al.,2019). Lewis et al.
(2021); Liu et al. (2021) demonstrate that mod-
els trained on these datasets often fail on exam-
ples that do not match the most frequent training
cases. Similar observations have been made in en-
tity linking to rare entities (Orr et al., 2021; Chen
et al.,2021), information retrieval for open-domain
QA (Sciavolino et al.,2021), relation extraction
for rare relations (Sabo et al.,2021) and lexicon
induction for rare senses in machine translation
(Czarnowska et al.,2019). Zero-shot performance
of large LMs on numerical reasoning and factoid
questions is also correlated with the frequency of
occurrence of the facts in the pre-training corpus
(Razeghi et al.,2022;Kandpal et al.,2022;Elazar
et al., 2022). We do not test whether models can
memorize long-tail knowledge; instead, we test
whether models can process long-tail sentences.
Naik et al. (2022) note that it is challenging to cata-
logue and evaluate generalization along micro-level
dimensions and instead propose benchmarks that
vary along macro-level dimensions (such as the lan-
guage and domain) as a proxy. We hypothesize that
LMs learn which micro-level phenomena are rare,
as this would improve their overall language mod-
eling objective. In this work, we present a recipe
that leverages LMs to evaluate tail generalization
for any language task.
Task-specific test sets.
Ribeiro et al. (2020) use
templated queries to evaluate model performance
under various linguistic perturbations. This method
requires dataset designers to define phenomena of
interest and axes of perturbation along which labels
may be preserved or changed. Naik et al. (2018)
analyze model errors and instantiate tests that ex-
plicitly evaluate models on more examples from
each error class. Gardner et al. (2020) check for
model consistency under local perturbations of test
set examples. All of these approaches require anno-
tators to create new examples, whereas we propose
a method to resplit existing datasets.
Adversarial approaches.
Søgaard et al. (2021)
argue that random splits over-estimate model per-
formance on new in-domain data and recommend
the use of adversarial and heuristically challenging
splits to estimate generalizability. Adversarial data
collection promotes the creation of difficult exam-
ples by encouraging annotators to fool a model-
in-the-loop (Nie et al.,2020;Potts et al.,2021;
Kiela et al.,2021). Similarly, Adversarial Filtering
removes examples that are easy for a given task
model in order to create more challenging bench-
marks (Sakaguchi et al.,2021;Yang et al.,2018).
However, Kaushik et al. (2021) and Bowman and
Dahl (2021) point out that adversarially collected
or filtered examples may focus on a narrow set of
skills that the “in-the-loop” model lacks, instead of
covering all the abilities required for the underlying
task. Additionally, the “in-the-loop” task model is
disproportionately penalized by the adversarial test
sets (Phang et al.,2021). We show in §4.3 that
Likelihood Splits do not suffer from this issue.
Domain shift.
In NLP, domains can be charac-
terized by the changes in vocabulary and distribu-
tion of word use, styles used by authors, and the
intended audience. Fisch et al. (2019) pose the
challenge of developing QA systems that need to
generalize to unseen domains. Miller et al. (2020)
find that QA models trained on SQUAD show a
performance drop on new domains (while human
baseline performance remains unchanged); Miller
et al. (2021); Hendrycks et al. (2020) inter alia per-
form similar analyses of domain shift. SPIDER (Yu
et al.,2018) and GRAILQA (Gu et al.,2021) eval-
uate semantic parsing on unseen table and knowl-
edge base domains respectively. Domain shift is
an orthogonal axis of generalization; we focus on
generalizing to rare utterances in the same domain.
Out-of-distribution detection.
Previous work
in OOD detection has used high generative model
perplexity as a sign of outliers (Arora et al.,2021;
Ren et al.,2019;Lee et al.,2018). Our intuition
is similar: low likelihood (high perplexity) is an
indicator of rare examples. However, unlike prior
work, we use likelihood scores for benchmark creation.
Moreover, in our setting all examples have been
collected under the same data collection protocol,
so none of the examples are truly OOD.
Compositional generalization.
The ability to
“compose” the meaning of a new utterance from the
known meaning of its parts (Fodor and Pylyshyn,
1988) is an important aspect of language under-
standing. The deterministic grammar of program-
ming languages makes semantic parsing, the task
of translating a natural language utterance into a
logical program, a good testbed for evaluating com-
positional generalization (Lake and Baroni,2018;
Kim and Linzen,2020;Hupkes et al.,2020;Key-
sers et al.,2020;Shaw et al.,2021). However, for
tasks where the constituent blocks are not clearly
defined, it is unclear how to create such evaluation
splits of the data. We compare against composi-
tional generalization splits of the semantic parsing
dataset SPIDER (Yu et al.,2018) in §4.
3 Capturing the Tail of the Distribution
In order to find the tail within a dataset, we approximate the likelihood of an utterance in the real distribution with its likelihood under a language model (LM). Our method can be easily modified to create meaningful splits for any language task. We demonstrate this by creating Likelihood Splits for:
SPIDER, a semantic parsing dataset (Yu et al., 2018) consisting of natural language questions and corresponding SQL programs;
SNLI, a natural language inference dataset (Bowman et al., 2015) consisting of premise and hypothesis sentences paired with labels denoting that the hypothesis is entailed by, neutral to, or contradictory to the premise;
BOOLQ, a question-answering dataset (Clark et al., 2019) consisting of passages, associated questions, and binary yes/no labels.
3.1 General Approach
We consider language tasks where models must map an input x to an output y (e.g., a SQL query or a label). The input x may be either a single sentence (e.g., semantic parsing) or a pair of sentences (e.g., natural language inference), in which case we write x = (x1, x2). Given a dataset D of (x, y) pairs and a desired proportion p of evaluation examples, our method partitions D into subsets Dtrain and Deval where |Deval| ≈ p · |D|. More specifically, we will first assign a likelihood score s(x) to each x ∈ D, then choose Deval to be the ⌊p · |D|⌋ examples in D with the lowest value of s(x), and choose Dtrain = D \ Deval. In §3.2, we describe a few different ways to define s. In §3.3, we describe a modification to this procedure that controls for varying length between examples. Finally, we describe task-specific adjustments in §3.4.
3.2 Assigning Likelihood Scores s(x)
We use the total log-likelihood over the query tokens assigned by the GPT-2 language model as the score s(x) for every example. There are two ways to use the LM: (1) prompting a frozen LM or (2) fine-tuning the LM on the dataset.
Task | Prompting | Fine-Tuning
SPIDER | write a database question: {query} | <|endoftext|> {query}
BOOLQ | Passage: {passage} Ask a question about the passage: {question}
SNLI | Premise: {premise} This hypothesis is {entailed/neutral/a contradiction}: {hypothesis}

Table 1: Input formats for single-sentence and sentence-pair tasks in the prompting and fine-tuning settings. Values in curly braces are plugged in from the example. For SNLI, we provide the label in the prompt to prime the LM to the class of hypothesis. The LM is trained (when fine-tuning) and evaluated on generating the final query segment ({query}, {question}, or {hypothesis}).
Past work has shown that prompting, i.e. prepending a task-specific string to the query, helps GPT-2 generalize zero-shot to new tasks (Radford et al., 2019). We use simple prompts that describe the task and prime the LM to the text we expect it to generate (see Table 1). For sentence-pair tasks (such as SNLI and BOOLQ), it is necessary to compare the relation between two pieces of text and not just each piece in isolation. Thus, it is intuitive to describe unlikely examples by the conditional likelihood of x2 given x1. We demonstrate the flexibility of our approach by providing the label in the prompt if it adds additional information about the text to be generated (e.g. in SNLI).2 We will refer to this setting, which uses the prompted LM, with the tag ll_split pt in the rest of the work.

2 We include the label in the prompt for SNLI but not BOOLQ because the resulting prompts seemed most natural for each dataset. This choice was made before assessing downstream behavior.
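The sketch below shows one way to compute s(x) in the prompted (ll_split pt) setting with a frozen GPT-2 from the HuggingFace transformers library, summing log-probabilities over the query tokens only. The SPIDER-style prompt follows Table 1; the helper name, the example question, and the whitespace handling at the prompt/query boundary are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

@torch.no_grad()
def prompted_log_likelihood(prompt: str, query: str) -> float:
    """Total log-likelihood of `query` tokens conditioned on `prompt`.

    Prompt tokens are excluded from the sum so only the query is scored.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, query_ids], dim=1)
    # Shift-by-one scoring: logits at position t predict token t+1.
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only positions whose target is a query token.
    query_ll = token_ll[:, prompt_ids.shape[1] - 1:]
    return query_ll.sum().item()

# Illustrative usage with the Table 1 prompt format for SPIDER.
# The leading space makes the query tokenize as a continuation of the prompt.
s_x = prompted_log_likelihood("write a database question:",
                              " How many singers do we have?")
```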
The dataset curator may also choose to fine-tune the LM to better capture the task distribution. We fine-tune the GPT-2 LM to maximize either the probability of x for single-sentence tasks or the conditional probability of x2 given the prompt for sentence-pair tasks. When fine-tuning the LM on the dataset, we need to ensure that it is not used to assign scores to the examples it is trained on. Given the dataset D, we first randomly partition D into k folds. For each fold, we fine-tune an LM on the remaining folds and use it to assign log-likelihood scores to examples in the held-out fold. We refer the reader to Appendix A.2 for fine-tuning details. We will refer to this setting as ll_split henceforth.
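A minimal sketch of the k-fold scoring protocol is given below. The `fine_tune` and `score_with` callables are hypothetical placeholders for the fine-tuning and scoring routines (Appendix A.2 of the paper gives the actual fine-tuning details); only the cross-fold bookkeeping that keeps scoring out-of-fold is shown.

```python
import random
from typing import Callable, List

def cross_fold_scores(
    dataset: List[dict],
    fine_tune: Callable[[List[dict]], object],    # hypothetical: returns a fine-tuned LM
    score_with: Callable[[object, dict], float],  # hypothetical: log-likelihood under that LM
    k: int = 3,
    seed: int = 0,
) -> List[float]:
    """Score each example with an LM that never saw it during fine-tuning."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]          # k roughly equal folds
    scores = [0.0] * len(dataset)
    for held_out in folds:
        held_out_set = set(held_out)
        train_examples = [dataset[i] for i in indices if i not in held_out_set]
        lm = fine_tune(train_examples)                 # fine-tune on the other k-1 folds
        for i in held_out:
            scores[i] = score_with(lm, dataset[i])     # score only held-out examples
    return scores
```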
3.3 Controlling for Length
Since the likelihood of an utterance is negatively correlated with its length, we create a split that explicitly controls for the effect of length. After assigning a likelihood score to every utterance, the examples are bucketed based on length (defined by tokenizing the utterance with NLTK (Loper and Bird, 2002)). For single-sentence and sentence-pair tasks, we use the length of the query (x and x2, respectively) over which the log-likelihood was computed. Within each bucket, a fraction p of the examples with the lowest s(x) are put in the evaluation set; aggregating examples from all buckets, |Deval| ≈ p · |D|. We will refer to this control setting with the modifier (-len) henceforth.3

3 We also considered using perplexity, which normalizes for length, but it led to an over-correction where short examples were filtered into the evaluation set.
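A possible implementation of the length-controlled selection is sketched below. It assumes one bucket per exact NLTK token length; the paper does not specify the bucket granularity, so coarser bins may be used instead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

import nltk  # requires nltk.download("punkt") for word_tokenize

def length_controlled_split(
    examples: List[dict],
    scores: List[float],
    eval_fraction: float,
    text_key: str = "query",
) -> Tuple[List[dict], List[dict]]:
    """Select the lowest-scoring fraction within each length bucket (ll_split -len)."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for i, ex in enumerate(examples):
        length = len(nltk.word_tokenize(ex[text_key]))  # length of the scored query
        buckets[length].append(i)

    eval_ids = set()
    for idx_list in buckets.values():
        idx_list.sort(key=lambda i: scores[i])          # rarest first within the bucket
        n_eval = int(eval_fraction * len(idx_list))
        eval_ids.update(idx_list[:n_eval])

    train = [ex for i, ex in enumerate(examples) if i not in eval_ids]
    eval_ = [ex for i, ex in enumerate(examples) if i in eval_ids]
    return train, eval_
```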
3.4 Dataset-specific Choices and Details
SPIDER. We follow Shaw et al. (2021) and swap examples between the train and evaluation sets such that every logical program atom in the evaluation set appears at least once in the train set. This ensures that the model is not required to generalize to unseen function names and declarations.
SNLI and BOOLQ. We ensure label balance in our splits (as in the original data) by splitting the examples for each label separately, then combining the resulting train and evaluation sets.
Development sets. Csordás et al. (2021) show that without development sets that are in-distribution with respect to challenging test sets, models are prone to over-fitting, which under-estimates their ability to generalize. Thus, after dividing the data into train and evaluation sets, we randomly divide the evaluation set into a development set and a test set. Other details are reported in Appendix A.1.
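The label-balancing and development-set choices above can be layered on top of the basic split as follows. The 50/50 development/test ratio is an assumption for illustration; the text only states that the evaluation set is divided randomly.

```python
import random
from collections import defaultdict
from typing import Callable, List, Tuple

def label_balanced_split(
    examples: List[dict],
    split_fn: Callable[[List[dict]], Tuple[List[dict], List[dict]]],
    label_key: str = "label",
    dev_fraction: float = 0.5,   # assumed ratio; not specified in the text
    seed: int = 0,
) -> Tuple[List[dict], List[dict], List[dict]]:
    """Split each label group separately, then carve a dev set out of the eval pool."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    train, eval_pool = [], []
    for group in by_label.values():
        g_train, g_eval = split_fn(group)   # e.g. likelihood_split from the earlier sketch
        train.extend(g_train)
        eval_pool.extend(g_eval)

    rng = random.Random(seed)
    rng.shuffle(eval_pool)
    n_dev = int(dev_fraction * len(eval_pool))
    dev, test = eval_pool[:n_dev], eval_pool[n_dev:]
    return train, dev, test
```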
4 Experiments
Next, we benchmark task models on our Likelihood Splits. Splits created using GPT2-medium are the focus of our analysis; we briefly study the effect of switching the LM to GPT2-large in §4.4. When creating Likelihood Splits, the number of folds k for fine-tuning the LM (§3.2) can be chosen by the dataset curator. For results in §4 and §5, we set k = 3 arbitrarily. We analyse the effect of choosing a different value of k in Appendix A.3. Our results show that the trends and observations discussed here hold true for other values of k.
4.1 Benchmarked Models
One of the goals of this work is to expose long-
tail generalization as a challenge to state-of-the-art
models; SotA models on the considered bench-
marks are all pre-trained models. We make efforts
to show that models with different pre-training data
and objectives are similarly affected by our pro-
posed splits. Hyperparameters and training details
for the reported models are in Appendix A.2.
Semantic parsing.
Following Shaw et al. (2021),
we benchmark the competitive T5-base model (Raf-
fel et al.,2020) on all splits of the SPIDER dataset.
In order to test whether these splits are adversarial
to the data splitting language model, we addition-
ally fine-tune GPT2-medium models for the seman-
tic parsing task. To study the effect of model size,
we fine-tune T5-small and GPT2-small variants.
SNLI and BOOLQ. We fine-tune two competitive models (ROBERTA (Liu et al., 2019) and ELECTRA (Clark et al., 2020)) at two model sizes (base and large). Additionally, following Poliak et al. (2018), we train a ROBERTA-large model to perform the task given just the hypothesis. The performance of a hypothesis-only model estimates the degree of spurious correlations in the dataset that give away the label.
4.2 Alternative Splits for Semantic Parsing
We compare the difficulty of the Likelihood Splits
with heuristic challenge splits from past work.
Length.
Past work has established that text gen-
eration models trained on short inputs struggle to
generalize to longer inputs at test time (Lake and
Baroni,2018;Hupkes et al.,2020;Newman et al.,
2020). We create Length splits by placing exam-
ples with the longest input queries in the evaluation
set and the remaining examples in the training set.
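A minimal sketch of this Length split, under the assumption that a `length_fn` helper returns the tokenized input length:

```python
from typing import Callable, List, Tuple

def length_split(
    examples: List[dict],
    length_fn: Callable[[dict], int],   # hypothetical: e.g. number of NLTK tokens in the input
    eval_fraction: float,
) -> Tuple[List[dict], List[dict]]:
    """Place the longest inputs in the evaluation set; the rest form the training set."""
    ordered = sorted(examples, key=length_fn, reverse=True)  # longest first
    n_eval = int(eval_fraction * len(examples))
    return ordered[n_eval:], ordered[:n_eval]                # (train set, eval set)
```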
TMCD.
Systematicity is the ability to composi-
tionally derive the meaning of an utterance from the
known meaning of its parts. Past work studying sys-
tematicity in semantic parsing has defined “atoms”
as the smallest constituents of the grammar (e.g.
variables and function names) and “compounds”
as complex structures formed by composing atoms
(e.g. multi-argument functions and nested func-
tion calls) (Keysers et al., 2020). Following Shaw et al. (2021), we create TMCD (Target Maximum Compound Divergence) splits of SPIDER by maximizing the divergence between the distributions of compounds in the train and evaluation sets.

Split | T5-base | T5-small | GPT2-medium (∆) | GPT2-small
Random | 78.6 | 75.2 | 69.3 (9.3) | 64.7
Length | 50.0 | 44.5 | 39.9 (10.1) | 34.0
Template | 60.1 | 60.0 | 51.4 (8.7) | 45.1
TMCD | 66.2 | 64.1 | 56.2 (10) | 51.4
Split LM: GPT2-medium
ll_split | 66.0 | 64.2 | 57.2 (8.8) | 51.8
ll_split (-len) | 71.3 | 67.3 | 59.9 (11.4) | 57.3
ll_split pt | 60.6 | 59.7 | 50.9 (9.7) | 45.9
ll_split pt (-len) | 73.5 | 68.4 | 64.5 (9) | 58.3
Split LM: GPT2-large
ll_split | 61.8 | 61.8 | 53.7 (8.1) | 48.3
ll_split (-len) | 69.7 | 66.2 | 59.1 (10.6) | 54.8
ll_split pt | 63.0 | 58.3 | 51.4 (11.6) | 45.7
ll_split pt (-len) | 72.0 | 70.1 | 63.4 (8.6) | 57.5

Table 2: SPIDER: Exact sequence prediction accuracy for Likelihood Splits created by GPT2-medium and GPT2-large, and other challenge splits. Likelihood Splits are more challenging than random splits while not being adversarial to GPT2-medium. (∆) marks the performance drop from T5-base to GPT2-medium.
Template.
These splits test the ability of parsers
to generate unseen program templates (canonical-
ized programs formed by anonymizing all variable
names and standardizing syntax). We group ex-
amples in the SPIDER dataset based on templates
defined by Finegan-Dollak et al. (2018). To cre-
ate the evaluation set, we randomly pick groups of examples until the target set size is reached; the
remaining groups form the training set.
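The Template split procedure can be sketched as follows; `template_fn`, which anonymizes variable names and standardizes syntax to produce the canonical template, is a hypothetical placeholder for the canonicalization of Finegan-Dollak et al. (2018).

```python
import random
from collections import defaultdict
from typing import Callable, List, Tuple

def template_split(
    examples: List[dict],
    template_fn: Callable[[dict], str],   # hypothetical: canonicalize/anonymize the program
    eval_fraction: float,
    seed: int = 0,
) -> Tuple[List[dict], List[dict]]:
    """Hold out whole template groups so eval templates never appear in training."""
    groups = defaultdict(list)
    for ex in examples:
        groups[template_fn(ex)].append(ex)

    group_keys = list(groups.keys())
    random.Random(seed).shuffle(group_keys)

    target = eval_fraction * len(examples)
    eval_set, train_set = [], []
    picked = 0
    for key in group_keys:
        if picked < target:
            eval_set.extend(groups[key])   # pick whole groups until the target size is reached
            picked += len(groups[key])
        else:
            train_set.extend(groups[key])  # remaining groups form the training set
    return train_set, eval_set
```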
4.3 Model Performance on Likelihood Splits
In Table 2, we report exact match accuracy4 on the data splits using the SPIDER evaluation suite. For SNLI and BOOLQ, we report the accuracy of benchmarked models in Table 3. We create 3 random splits and report the mean and standard deviation of accuracy of models trained on each split.

4 This metric accounts for the fact that SQL statements are invariant to certain shuffling and changes in variable names.

Likelihood Splits are more challenging than random splits. On SPIDER, for example, T5-base accuracy on ll_split is 12.6 points lower than the random split accuracy. Likelihood Splits lead to drops in performance that are comparable to the