Are Sample-Efficient NLP Models More Robust?
Nelson F. Liu♠  Ananya Kumar♠  Percy Liang♠  Robin Jia♡
♠Computer Science Department, Stanford University, Stanford, CA
♡Department of Computer Science, University of Southern California, Los Angeles, CA
{nfliu,ananya,pliang}@cs.stanford.edu
robinjia@usc.edu
Abstract
Recent results in image classification and extractive question answering have observed that pre-trained models trained on less in-distribution data have better out-of-distribution performance. However, it is unclear how broadly these trends hold. We conduct a large empirical study across three tasks, three broadly-applicable modeling interventions (increasing model size, using a different adaptation method, and pre-training on more data), and 14 diverse datasets to investigate the relationship between sample efficiency (amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation). We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others. On individual datasets, models with lower sample efficiency can even be more robust. These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent. Even in an era of large, multi-purpose pre-trained models, task-specific decisions may often be necessary for OOD generalization.
1 Introduction
NLP models perform well when evaluated on data drawn from their training distribution (in-distribution / ID), but they typically suffer large drops in performance when evaluated on data distributions unseen during training (out-of-distribution / OOD; Blitzer, 2008).
How does exposure to ID training examples affect the ID-OOD gap? If two models have the same ID performance, will models trained on fewer ID examples (higher sample efficiency) also have higher OOD performance (higher robustness)? At one extreme, zero-shot models will not learn ID-specific patterns because they are not exposed to any labeled ID examples. Similarly, few-shot models trained on very few ID examples may also rely less on ID-specific patterns; if a model never sees the token “cat” while training on SNLI, then it will not learn that its presence is spuriously predictive of the contradiction label (Gururangan et al., 2018; Utama et al., 2021). Supporting this intuition, recent work in image classification (Radford et al., 2021) and extractive question answering (Awadalla et al., 2022) shows that zero-shot inference and few-shot fine-tuning improve average robustness across a range of OOD test sets. However, it is unclear how universal these trends are across various tasks and methods for reducing exposure to ID examples, or how predictive they are for any individual test set of interest. Figure 1 illustrates this central question.
We conduct a broad empirical study over 14 datasets across three tasks to investigate the relationship between exposure to ID training examples (sample efficiency) and robustness. We experiment with three modeling interventions that improve sample efficiency: (1) using natural language prompts for zero-shot prediction and during fine-tuning (Brown et al., 2020; Schick and Schütze, 2021; Gao et al., 2021); (2) fine-tuning models of increasing size; (3) fine-tuning models pre-trained on increasing amounts of data.
We find that higher sample efficiency is only sometimes correlated with better robustness, and the effect of specific modeling interventions varies by task. For example, increasing pre-trained model size substantially improves sample efficiency and results in higher average robustness in sentiment experiments, but these sample efficiency gains do not translate to higher average robustness in NLI and extractive QA experiments. On individual datasets, models with better sample efficiency can even be less robust (e.g., increasing model size when training on SST-2 and evaluating OOD on IMDb).
Figure 1: In this example, model B has higher sample efficiency than model A, since model B requires less ID training data to reach a given ID performance threshold (top). In this particular example, model B is also more robust than model A (bottom), since it has higher OOD performance for a given ID performance threshold.

Overall, these results indicate that general-purpose methods for improving sample efficiency are far from guaranteed to yield significant OOD robustness improvements; their success is highly dataset- and task-dependent. Furthermore, even in this era of large, multi-purpose pre-trained language models, task-specific decisions are often necessary to achieve OOD generalization.
2 Measuring Sample Efficiency and Robustness
Consider two data distributions D_ID and D_OOD. Let M be a model trained on examples drawn from D_ID (i.e., the ID training data). We study the relationship between three properties of M: (1) the number of ID examples it was trained on; (2) M's performance on held-out examples from D_ID (i.e., the ID performance); (3) M's performance on examples from D_OOD (i.e., the OOD performance).

Let M_1 and M_2 be two models with equivalent performance on held-out ID data. If M_1 was trained on fewer ID examples than M_2, then it has higher sample efficiency. If M_1 has higher OOD performance than M_2, it has higher effective robustness (henceforth “robustness”; Taori et al., 2020). Comparing models with equivalent ID performance controls for the effect of ID performance on OOD performance, since improving ID performance usually yields commensurate improvements on OOD performance; in this study, we focus on OOD performance improvements beyond what is expected from ID gains.

Satisfying this equivalent-ID constraint is often difficult in practice; given an arbitrary model M_1 and its corresponding ID performance, it is difficult to produce a different model M_2 with identical ID performance. Rather than explicitly training models to identical ID performance, we train models on varying-size subsamples of a given ID dataset and interpolate between the results to estimate (1) the number of labeled ID training examples necessary to achieve a particular ID performance (sample efficiency) and (2) OOD performance, given ID performance (robustness). These interpolated curves approximate the ideal setting of training a model for every possible ID performance value. Figure 1 provides a schematized example, with model B having better sample efficiency and robustness than model A.
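To make this estimation procedure concrete, the sketch below interpolates ID accuracy against (log) training-set size and OOD accuracy against ID accuracy from a handful of fine-tuning runs. This is a minimal illustration, not the authors' code: the run records, the log-linear interpolation choice, and the function names are assumptions.

```python
import numpy as np

# Hypothetical results for one model family: each entry records a fine-tuning
# run on an ID training subsample, with held-out ID and OOD accuracy.
# The numbers are placeholders, not results from the paper.
runs = [
    {"n_train": 128,   "id_acc": 0.62, "ood_acc": 0.55},
    {"n_train": 1024,  "id_acc": 0.74, "ood_acc": 0.63},
    {"n_train": 8192,  "id_acc": 0.83, "ood_acc": 0.70},
    {"n_train": 65536, "id_acc": 0.88, "ood_acc": 0.74},
]

def sample_efficiency(runs, target_id_acc):
    """Estimate how many labeled ID examples are needed to reach target_id_acc,
    by interpolating ID accuracy as a function of log training-set size."""
    runs = sorted(runs, key=lambda r: r["id_acc"])
    id_accs = [r["id_acc"] for r in runs]
    log_sizes = [np.log(r["n_train"]) for r in runs]
    return float(np.exp(np.interp(target_id_acc, id_accs, log_sizes)))

def ood_at_id(runs, target_id_acc):
    """Estimate OOD accuracy at a given ID accuracy (the robustness curve)."""
    runs = sorted(runs, key=lambda r: r["id_acc"])
    id_accs = [r["id_acc"] for r in runs]
    ood_accs = [r["ood_acc"] for r in runs]
    return float(np.interp(target_id_acc, id_accs, ood_accs))

print(sample_efficiency(runs, 0.80), ood_at_id(runs, 0.80))
```

Comparing two interventions at a fixed ID accuracy then reduces to comparing these two quantities: the intervention needing fewer examples is more sample-efficient, and the one with higher OOD accuracy is more effectively robust.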
3 Experimental Setup
We study three modeling interventions (using natural language prompts, increasing pre-trained model size, and pre-training on more data) on 14 total datasets spanning natural language inference (NLI), sentiment analysis, and extractive question answering (QA). See Appendix A for further details about experimental settings.
Tasks and Datasets. In our natural language inference (NLI) experiments, we use MultiNLI (Williams et al., 2018), SNLI (Bowman et al., 2015), and MedNLI (Romanov and Shivade, 2018). For sentiment analysis, we use IMDb reviews (Maas et al., 2011), SST-2 (Socher et al., 2013), and reviews from the “Movies and TV” subsection of the Amazon Reviews corpus (Ni et al., 2019). Lastly, for extractive question answering, we use SQuAD (Rajpurkar et al., 2016), NaturalQuestions (Kwiatkowski et al., 2019), TriviaQA, BioASQ (Tsatsaronis et al., 2015), and the four SQuADShifts test sets (Miller et al., 2020).
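For orientation, the task-to-dataset groupings above can be written down as a small configuration; the sketch below is illustrative only. The Hugging Face `datasets` identifiers are assumptions, and several corpora (MedNLI, BioASQ, the Amazon “Movies and TV” slice, SQuADShifts) require separate download or preprocessing.

```python
from datasets import load_dataset  # assumes the Hugging Face `datasets` library

# Datasets used per task in the study. The string identifiers are assumptions
# about hub names; MedNLI, BioASQ, and the Amazon "Movies and TV" reviews must
# be obtained and preprocessed separately.
DATASETS_BY_TASK = {
    "nli": ["multi_nli", "snli", "mednli"],
    "sentiment": ["imdb", "sst2", "amazon_movies_and_tv"],
    "extractive_qa": ["squad", "natural_questions", "trivia_qa",
                      "bioasq", "squadshifts"],
}

# Example: load one ID training set that is available on the hub.
snli = load_dataset("snli")
print(snli["train"][0])
```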
Modeling Interventions. To understand the effect of a particular modeling intervention on sample efficiency and robustness, we evaluate pre-trained models that differ only along the axis of interest (e.g., model size or fine-tuning method). Since the optimal fine-tuning hyperparameters depend on the ID training dataset size, we separately tune hyperparameters for each model on each training dataset subsample size, taking the models that achieve the best held-out ID performance for each setting. See Appendix B for details about hyperparameter optimization.

Figure 2: Prompt-based fine-tuning improves sample efficiency (orange series above blue series) and average robustness (orange series above blue series) across experimental settings (a, b). However, it can have no effect on robustness in individual OOD settings (e.g., MNLI → SNLI; c).
4 Results and Discussion
Our results show that models with higher sample efficiency may not necessarily have higher average OOD robustness; different tasks and modeling interventions affect robustness in different ways (Figures 2-4). For example, prompt-based fine-tuning consistently improves both sample efficiency and average robustness, but only in low-data settings (Figure 2). In contrast, increasing model size improves sample efficiency across the range of training dataset sizes and tasks, but only improves average robustness on sentiment analysis (Figure 3). On individual datasets, we even observe cases where models with lower sample efficiency have higher robustness (Figure 3d). See Appendix C for full results on every ID-OOD setting.
Natural Language Prompting. We compare BERT-Base models using (1) standard fine-tuning, (2) prompt-based fine-tuning, and (3) zero-shot prompting. We also compare these results with zero-shot prompting of text-davinci-001, a much larger model trained on substantially more data. We run experiments on NLI and sentiment analysis, since extractive QA is not amenable to prompt-based fine-tuning with masked language models.
Figures 2a and 2b plot the average performance on all OOD datasets as a function of ID performance, and the ID performance as a function of the number of labeled training examples, respectively. Sample efficiency improvements from prompt-based fine-tuning also translate to higher average robustness. However, these improvements only apply in the few-shot setting. As the size of the training dataset increases, the improvements in sample efficiency and average robustness steadily diminish. With sufficiently large training datasets (roughly 1K examples for NLI and 130 examples for sentiment), models trained with prompt-based fine-tuning yield essentially the same sample efficiency and robustness results as standard fine-tuning.
However, results on individual OOD test sets can differ significantly from averaged-OOD trends. For example, Figure 2c shows that prompt-based fine-tuning on MNLI and evaluating on SNLI improves sample efficiency in the few-shot setting but does not improve robustness.
Surprisingly, we also find that zero-shot inference does not necessarily improve average robustness over prompt-based fine-tuning: zero-shot performance lies on or below the trend line formed by prompt-based fine-tuning, despite not using any ID-specific data at all. See Appendix C.1 for full results of natural language prompting for every ID-OOD setting.
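To make the prompting setup concrete, the sketch below scores a cloze template with a masked language model and maps verbalizer tokens to labels; prompt-based fine-tuning additionally trains on the few labeled ID examples through the same cloze objective. This is a hedged illustration: the template "It was [MASK]." and the verbalizers great/terrible follow common practice (e.g., Gao et al., 2021) and are not necessarily the ones used in this paper.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Minimal zero-shot cloze scoring with a masked LM; the template and verbalizer
# words below are assumptions, not necessarily those used in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

TEMPLATE = "{review} It was {mask}."
VERBALIZER = {"positive": "great", "negative": "terrible"}

def zero_shot_sentiment(review: str) -> str:
    text = TEMPLATE.format(review=review, mask=tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    # Score each label by the logit of its verbalizer word at the mask position.
    scores = {
        label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
        for label, word in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

print(zero_shot_sentiment("A sharp, funny, and moving film."))
```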
Increasing Pre-Trained Model Size. We run experiments with the checkpoints of Turc et al. (2019), who pre-train BERT models with various numbers of transformer layers (L) and hidden embedding sizes (H). We run experiments on NLI, sentiment analysis, and extractive QA to compare pre-trained models of five sizes: (1) Large (L=24, H=1024), (2) Base (L=12, H=768), (3) Medium (L=8, H=512), (4) Small (L=4, H=512), and (5) Tiny (L=2, H=128). Although increasing the pre-trained model size improves sample efficiency on every task, it does not always improve average robustness (Figure 3). In particular, increasing model size minimally affects average robustness in NLI and extractive QA (Figure 3a, 3c), but substantially improves average robustness on sentiment analysis (Figure 3b).[1]

Figure 3: Although increasing pre-trained model size improves sample efficiency in all settings, these sample efficiency improvements only translate to better average robustness in sentiment analysis experiments (b). In NLI and extractive QA, average robustness is unchanged (a, c). Although increased model size improves averaged OOD performance on IMDb, these conclusions do not extend to every individual ID-OOD pair. For example, increasing pre-trained model size can decrease robustness when training on SST-2 and evaluating on IMDb (d).
However, results on individual ID-OOD pairs can again differ significantly from average OOD performance trends. For example, when training on SST-2 and evaluating on IMDb, larger models actually have lower OOD performance. This occurs because SST-2 examples (single sentences) are significantly shorter than IMDb examples (paragraphs). Models trained on the shorter SST-2 examples struggle when evaluated on IMDb because this particular ID-OOD pair requires length extrapolation, and increasing pre-trained model size does not help models generalize to longer input sequences. As a result, effective robustness decreases because larger models have higher ID (SST-2) performance but unchanged OOD (IMDb) performance. See Appendix C.2 for full results of increasing pre-trained model size for every ID-OOD setting.

[1] Note that moving from BERT-Base to BERT-Large does not improve effective robustness until roughly 92% IMDb ID accuracy. We hypothesize this occurs because these BERT-Large datapoints are fine-tuned on small amounts of data (fewer than 1K examples), potentially leading to instability and reduced effective robustness.
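For reference, a hedged sketch of loading the five BERT sizes follows: the Hugging Face identifiers below follow the naming convention of the publicly released BERT miniatures from Turc et al. (2019) and should be treated as assumptions to verify, not as the exact checkpoints used in the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# (layers L, hidden size H) for the five sizes compared in the paper.
# The hub identifiers follow the released miniatures' naming scheme and are
# assumptions; verify them before use.
BERT_SIZES = {
    "tiny":   "google/bert_uncased_L-2_H-128_A-2",    # L=2,  H=128
    "small":  "google/bert_uncased_L-4_H-512_A-8",    # L=4,  H=512
    "medium": "google/bert_uncased_L-8_H-512_A-8",    # L=8,  H=512
    "base":   "google/bert_uncased_L-12_H-768_A-12",  # L=12, H=768
    "large":  "google/bert_uncased_L-24_H-1024_A-16", # L=24, H=1024
}

def load_classifier(size: str, num_labels: int):
    """Load a tokenizer and a sequence-classification head for one size."""
    name = BERT_SIZES[size]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels)
    return tokenizer, model

tokenizer, model = load_classifier("medium", num_labels=3)  # e.g., 3-way NLI
```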
Pre-Training on More Data. We conduct NLI, sentiment, and QA experiments with RoBERTa models pre-trained on 10M, 100M, and 1B tokens of web text (Zhang et al., 2021).

Pre-training on more data consistently improves sample efficiency, but only yields average robustness improvements in NLI and sentiment analysis (Figure 4a, b). In extractive QA experiments, varying the amount of pre-training data does not significantly change average robustness (Figure 4c). Again, we find that results on average OOD performance are not predictive of results on individual test sets: despite unchanged average OOD robustness when pre-training on more data, OOD performance can be higher on individual extractive QA test sets (e.g., SQuAD → BioASQ; Figure 4d). See Appendix C.3 for full results of pre-training on more data for every ID-OOD setting.
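Analogously to the model-size experiments, the pre-training-data axis only changes which checkpoint is loaded, with the fine-tuning protocol held fixed. The sketch below uses placeholder paths for the Zhang et al. (2021) RoBERTa checkpoints rather than asserting their released names; substitute the actual checkpoint locations.

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Placeholder locations for RoBERTa models pre-trained on 10M, 100M, and 1B
# tokens (Zhang et al., 2021); substitute the actual released checkpoints.
PRETRAINING_BUDGETS = {
    "10M":  "path/to/roberta-pretrained-10M",
    "100M": "path/to/roberta-pretrained-100M",
    "1B":   "path/to/roberta-pretrained-1B",
}

def load_qa_model(budget: str):
    """Load one pre-training budget with a fresh span-prediction head for
    extractive QA fine-tuning; everything downstream stays identical."""
    name = PRETRAINING_BUDGETS[budget]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForQuestionAnswering.from_pretrained(name)
    return tokenizer, model
```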