Are Sample-Efficient NLP Models More Robust?
Nelson F. Liu♠  Ananya Kumar♠  Percy Liang♠  Robin Jia♡
♠Computer Science Department, Stanford University, Stanford, CA
♡Department of Computer Science, University of Southern California, Los Angeles, CA
{nfliu,ananya,pliang}@cs.stanford.edu
robinjia@usc.edu
Abstract
Recent results in image classification and extractive question answering have observed that pre-trained models trained on less in-distribution data have better out-of-distribution performance. However, it is unclear how broadly these trends hold. We conduct a large empirical study across three tasks, three broadly-applicable modeling interventions (increasing model size, using a different adaptation method, and pre-training on more data), and 14 diverse datasets to investigate the relationship between sample efficiency (amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation). We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others. On individual datasets, models with lower sample efficiency can even be more robust. These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent. Even in an era of large, multi-purpose pre-trained models, task-specific decisions may often be necessary for OOD generalization.
1 Introduction
NLP models perform well when evaluated on data drawn from their training distribution (in-distribution / ID), but they typically suffer large drops in performance when evaluated on data distributions unseen during training (out-of-distribution / OOD; Blitzer, 2008).
How does exposure to ID training examples affect the ID-OOD gap? If two models have the same ID performance, will models trained on fewer ID examples (higher sample efficiency) also have higher OOD performance (higher robustness)? At one extreme, zero-shot models will not learn ID-specific patterns because they are not exposed to any labeled ID examples. Similarly, few-shot models trained on very few ID examples may also rely less on ID-specific patterns; if a model never sees the token “cat” while training on SNLI, then it will not learn that its presence is spuriously predictive of the contradiction label (Gururangan et al., 2018; Utama et al., 2021). Supporting this intuition, recent work in image classification (Radford et al., 2021) and extractive question answering (Awadalla et al., 2022) shows that zero-shot inference and few-shot fine-tuning improve average robustness across a range of OOD test sets. However, it is unclear how universal these trends are across various tasks and methods for reducing exposure to ID examples, or how predictive they are for any individual test set of interest. Figure 1 illustrates this central question.
We conduct a broad empirical study over 14 datasets across three tasks to investigate the relationship between exposure to ID training examples (sample efficiency) and robustness. We experiment with three modeling interventions that improve sample efficiency: (1) using natural language prompts for zero-shot prediction and during fine-tuning (Brown et al., 2020; Schick and Schütze, 2021; Gao et al., 2021); (2) fine-tuning models of increasing size; (3) fine-tuning models pre-trained on increasing amounts of data.
We find that higher sample efficiency is only sometimes correlated with better robustness, and the effect of specific modeling interventions varies by task. For example, increasing pre-trained model size substantially improves sample efficiency and results in higher average robustness in sentiment experiments, but these sample efficiency gains do not translate to higher average robustness in NLI and extractive QA experiments. On individual datasets, models with better sample efficiency can even be less robust (e.g., increasing model size when training on SST-2 and evaluating OOD on IMDb).
Figure 1: In this example, model B has higher sample efficiency than model A, since model B requires less ID training data to reach a given ID performance threshold (top). In this particular example, model B is also more robust than model A (bottom), since it has higher OOD performance for a given ID performance threshold.

Overall, these results indicate that general-purpose methods for improving sample efficiency are far from guaranteed to yield significant OOD robustness improvements; their success is highly dataset- and task-dependent. Furthermore, even in this era of large, multi-purpose pre-trained language models, task-specific decisions are often necessary to achieve OOD generalization.
2 Measuring Sample Efficiency and Robustness
Consider two data distributions D_ID and D_OOD. Let M be a model trained on examples drawn from D_ID (i.e., the ID training data). We study the relationship between three properties of M: (1) the number of ID examples it was trained on; (2) M's performance on held-out examples from D_ID (i.e., the ID performance); (3) M's performance on examples from D_OOD (i.e., the OOD performance).

Let M_1 and M_2 be two models with equivalent performance on held-out ID data. If M_1 was trained on fewer ID examples than M_2, then it has higher sample efficiency. If M_1 has higher OOD performance than M_2, it has higher effective robustness (henceforth “robustness”; Taori et al., 2020). Comparing models with equivalent ID performance controls for the effect of ID performance on OOD performance, since improving ID performance usually yields commensurate improvements on OOD performance; in this study, we focus on OOD performance improvements beyond what is expected from ID gains.

Satisfying this equivalent-ID constraint is often difficult in practice; given an arbitrary model M_1 and its corresponding ID performance, it is difficult to produce a different model M_2 with identical ID performance. Rather than explicitly training models to identical ID performance, we train models on varying-size subsamples of a given ID dataset and interpolate between the results to estimate (1) the number of labeled ID training examples necessary to achieve a particular ID performance (sample efficiency) and (2) OOD performance, given ID performance (robustness). These interpolated curves approximate the ideal setting of training a model for every possible ID performance value. Figure 1 provides a schematized example, with model B having better sample efficiency and robustness than model A.
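To make this estimation procedure concrete, the sketch below interpolates ID accuracy against (log) training-set size and OOD accuracy against ID accuracy from a handful of fine-tuning runs. This is a minimal illustration, not the authors' code: the run records, the log-linear interpolation choice, and the function names are assumptions.

```python
import numpy as np

# Hypothetical results for one model family: each entry records a fine-tuning
# run on an ID training subsample, with held-out ID and OOD accuracy.
# The numbers are placeholders, not results from the paper.
runs = [
    {"n_train": 128,   "id_acc": 0.62, "ood_acc": 0.55},
    {"n_train": 1024,  "id_acc": 0.74, "ood_acc": 0.63},
    {"n_train": 8192,  "id_acc": 0.83, "ood_acc": 0.70},
    {"n_train": 65536, "id_acc": 0.88, "ood_acc": 0.74},
]

def sample_efficiency(runs, target_id_acc):
    """Estimate how many labeled ID examples are needed to reach target_id_acc,
    by interpolating ID accuracy as a function of log training-set size."""
    runs = sorted(runs, key=lambda r: r["id_acc"])
    id_accs = [r["id_acc"] for r in runs]
    log_sizes = [np.log(r["n_train"]) for r in runs]
    return float(np.exp(np.interp(target_id_acc, id_accs, log_sizes)))

def ood_at_id(runs, target_id_acc):
    """Estimate OOD accuracy at a given ID accuracy (the robustness curve)."""
    runs = sorted(runs, key=lambda r: r["id_acc"])
    id_accs = [r["id_acc"] for r in runs]
    ood_accs = [r["ood_acc"] for r in runs]
    return float(np.interp(target_id_acc, id_accs, ood_accs))

print(sample_efficiency(runs, 0.80), ood_at_id(runs, 0.80))
```

Comparing two interventions at a fixed ID accuracy then reduces to comparing these two quantities: the intervention needing fewer examples is more sample-efficient, and the one with higher OOD accuracy is more effectively robust.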
3 Experimental Setup
We study three modeling interventions (using natural language prompts, increasing pre-trained model size, and pre-training on more data) on 14 total datasets spanning natural language inference (NLI), sentiment analysis, and extractive question answering (QA). See Appendix A for further details about experimental settings.
Tasks and Datasets. In our natural language inference (NLI) experiments, we use MultiNLI (Williams et al., 2018), SNLI (Bowman et al., 2015), and MedNLI (Romanov and Shivade, 2018). For sentiment analysis, we use IMDb reviews (Maas et al., 2011), SST-2 (Socher et al., 2013), and reviews from the “Movies and TV” subsection of the Amazon Reviews corpus (Ni et al., 2019). Lastly, for extractive question answering, we use SQuAD (Rajpurkar et al., 2016), NaturalQuestions (Kwiatkowski et al., 2019), TriviaQA, BioASQ (Tsatsaronis et al., 2015), and the four SQuADShifts test sets (Miller et al., 2020).
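For orientation, the task-to-dataset groupings above can be written down as a small configuration; the sketch below is illustrative only. The Hugging Face `datasets` identifiers are assumptions, and several corpora (MedNLI, BioASQ, the Amazon “Movies and TV” slice, SQuADShifts) require separate download or preprocessing.

```python
from datasets import load_dataset  # assumes the Hugging Face `datasets` library

# Datasets used per task in the study. The string identifiers are assumptions
# about hub names; MedNLI, BioASQ, and the Amazon "Movies and TV" reviews must
# be obtained and preprocessed separately.
DATASETS_BY_TASK = {
    "nli": ["multi_nli", "snli", "mednli"],
    "sentiment": ["imdb", "sst2", "amazon_movies_and_tv"],
    "extractive_qa": ["squad", "natural_questions", "trivia_qa",
                      "bioasq", "squadshifts"],
}

# Example: load one ID training set that is available on the hub.
snli = load_dataset("snli")
print(snli["train"][0])
```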
Modeling Interventions. To understand the effect of a particular modeling intervention on sample efficiency and robustness, we evaluate pre-trained models that differ only along the axis of interest (e.g., model size or fine-tuning method). Since the optimal fine-tuning hyperparameters depend on the ID training dataset size, we separately tune hyperparameters for each model on each training dataset subsample size, taking the models that achieve the best held-out ID performance for each setting. See Appendix B for details about hyperparameter optimization.

Figure 2: Prompt-based fine-tuning improves sample efficiency (orange series above blue series) and average robustness (orange series above blue series) across experimental settings (a, b). However, it can have no effect on robustness in individual OOD settings (e.g., MNLI → SNLI; c).
4 Results and Discussion
Our results show that models with higher sample efficiency may not necessarily have higher average OOD robustness; different tasks and modeling interventions affect robustness in different ways (Figures 2-4). For example, prompt-based fine-tuning consistently improves both sample efficiency and average robustness, but only in low-data settings (Figure 2). In contrast, increasing model size improves sample efficiency across the range of training dataset sizes and tasks, but only improves average robustness on sentiment analysis (Figure 3). On individual datasets, we even observe cases where models with lower sample efficiency have higher robustness (Figure 3d). See Appendix C for full results on every ID-OOD setting.
Natural Language Prompting. We compare BERT-Base models using (1) standard fine-tuning, (2) prompt-based fine-tuning, and (3) zero-shot prompting. We also compare these results with zero-shot prompting of text-davinci-001, a much larger model trained on substantially more data. We run experiments on NLI and sentiment analysis, since extractive QA is not amenable to prompt-based fine-tuning with masked language models.
Figures 2a and 2b plot the average performance on all OOD datasets as a function of ID performance, and the ID performance as a function of the number of labeled training examples, respectively. Sample efficiency improvements from prompt-based fine-tuning also translate to higher average robustness. However, these improvements only apply in the few-shot setting. As the size of the training dataset increases, the improvements in sample efficiency and average robustness steadily diminish. With sufficiently large training datasets (roughly 1K examples for NLI and 130 examples for sentiment), models trained with prompt-based fine-tuning yield essentially the same sample efficiency and robustness results as standard fine-tuning.
However, results on individual OOD test sets can differ significantly from averaged-OOD trends. For example, Figure 2c shows that prompt-based fine-tuning on MNLI and evaluating on SNLI improves sample efficiency in the few-shot setting but does not improve robustness.
Surprisingly, we also find that zero-shot inference does not necessarily improve average robustness over prompt-based fine-tuning: zero-shot performance lies on or below the trend line formed by prompt-based fine-tuning, despite not using any ID-specific data at all. See Appendix C.1 for full results of natural language prompting for every ID-OOD setting.
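To make the prompting setup concrete, the sketch below scores a cloze template with a masked language model and maps verbalizer tokens to labels; prompt-based fine-tuning additionally trains on the few labeled ID examples through the same cloze objective. This is a hedged illustration: the template "It was [MASK]." and the verbalizers great/terrible follow common practice (e.g., Gao et al., 2021) and are not necessarily the ones used in this paper.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Minimal zero-shot cloze scoring with a masked LM; the template and verbalizer
# words below are assumptions, not necessarily those used in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

TEMPLATE = "{review} It was {mask}."
VERBALIZER = {"positive": "great", "negative": "terrible"}

def zero_shot_sentiment(review: str) -> str:
    text = TEMPLATE.format(review=review, mask=tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    # Score each label by the logit of its verbalizer word at the mask position.
    scores = {
        label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
        for label, word in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

print(zero_shot_sentiment("A sharp, funny, and moving film."))
```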
Increasing Pre-Trained Model Size. We run experiments with the checkpoints of Turc et al. (2019), who pre-train BERT models with various numbers of transformer layers (L) and hidden embedding sizes (H). We run experiments on NLI, sentiment analysis, and extractive QA to compare pre-trained models of five sizes: (1) Large (L=24, H=1024), (2) Base (L=12, H=768), (3) Medium (L=8, H=512), (4) Small (L=4, H=512), and (5) Tiny (L=2, H=128). Although increasing the pre-trained model size improves sample efficiency on every task, it does not always improve average robustness (Figure 3). In particular, increasing model size minimally affects average robustness in NLI and extractive QA (Figure 3a, 3c), but substantially improves average robustness on sentiment analysis (Figure 3b).[1]

Figure 3: Although increasing pre-trained model size improves sample efficiency in all settings, these sample efficiency improvements only translate to better average robustness in sentiment analysis experiments (b). In NLI and extractive QA, average robustness is unchanged (a, c). Although increased model size improves averaged OOD performance on IMDb, these conclusions do not extend to every individual ID-OOD pair. For example, increasing pre-trained model size can decrease robustness when training on SST-2 and evaluating on IMDb (d).
However, results on individual ID-OOD pairs can again differ significantly from average OOD performance trends. For example, when training on SST-2 and evaluating on IMDb, larger models actually have lower OOD performance. This occurs because SST-2 examples (single sentences) are significantly shorter than IMDb examples (paragraphs). Models trained on the shorter SST-2 examples struggle when evaluated on IMDb because this particular ID-OOD pair requires length extrapolation, and increasing pre-trained model size does not help models generalize to longer input sequences. As a result, effective robustness decreases because larger models have higher ID (SST-2) performance but unchanged OOD (IMDb) performance. See Appendix C.2 for full results of increasing pre-trained model size for every ID-OOD setting.

[1] Note that moving from BERT-Base to BERT-Large does not improve effective robustness until roughly 92% IMDb ID accuracy. We hypothesize this occurs because these BERT-Large datapoints are fine-tuned on small amounts of data (fewer than 1K examples), potentially leading to instability and reduced effective robustness.
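For reference, a hedged sketch of loading the five BERT sizes follows: the Hugging Face identifiers below follow the naming convention of the publicly released BERT miniatures from Turc et al. (2019) and should be treated as assumptions to verify, not as the exact checkpoints used in the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# (layers L, hidden size H) for the five sizes compared in the paper.
# The hub identifiers follow the released miniatures' naming scheme and are
# assumptions; verify them before use.
BERT_SIZES = {
    "tiny":   "google/bert_uncased_L-2_H-128_A-2",    # L=2,  H=128
    "small":  "google/bert_uncased_L-4_H-512_A-8",    # L=4,  H=512
    "medium": "google/bert_uncased_L-8_H-512_A-8",    # L=8,  H=512
    "base":   "google/bert_uncased_L-12_H-768_A-12",  # L=12, H=768
    "large":  "google/bert_uncased_L-24_H-1024_A-16", # L=24, H=1024
}

def load_classifier(size: str, num_labels: int):
    """Load a tokenizer and a sequence-classification head for one size."""
    name = BERT_SIZES[size]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels)
    return tokenizer, model

tokenizer, model = load_classifier("medium", num_labels=3)  # e.g., 3-way NLI
```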
Pre-Training on More Data. We conduct NLI, sentiment, and QA experiments with RoBERTa models pre-trained on 10M, 100M, and 1B tokens of web text (Zhang et al., 2021).

Pre-training on more data consistently improves sample efficiency, but only yields average robustness improvements in NLI and sentiment analysis (Figure 4a, b). In extractive QA experiments, varying the amount of pre-training data does not significantly change average robustness (Figure 4c). Again, we find that results on average OOD performance are not predictive of results on individual test sets: despite unchanged average OOD robustness when pre-training on more data, OOD performance can be higher on individual extractive QA test sets (e.g., SQuAD → BioASQ; Figure 4d). See Appendix C.3 for full results of pre-training on more data for every ID-OOD setting.
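Analogously to the model-size experiments, the pre-training-data axis only changes which checkpoint is loaded, with the fine-tuning protocol held fixed. The sketch below uses placeholder paths for the Zhang et al. (2021) RoBERTa checkpoints rather than asserting their released names; substitute the actual checkpoint locations.

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Placeholder locations for RoBERTa models pre-trained on 10M, 100M, and 1B
# tokens (Zhang et al., 2021); substitute the actual released checkpoints.
PRETRAINING_BUDGETS = {
    "10M":  "path/to/roberta-pretrained-10M",
    "100M": "path/to/roberta-pretrained-100M",
    "1B":   "path/to/roberta-pretrained-1B",
}

def load_qa_model(budget: str):
    """Load one pre-training budget with a fresh span-prediction head for
    extractive QA fine-tuning; everything downstream stays identical."""
    name = PRETRAINING_BUDGETS[budget]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForQuestionAnswering.from_pretrained(name)
    return tokenizer, model
```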