Are Sample-Efficient NLP Models More Robust?
Nelson F. Liu♠  Ananya Kumar♠  Percy Liang♠  Robin Jia♡
♠Computer Science Department, Stanford University, Stanford, CA
♡Department of Computer Science, University of Southern California, Los Angeles, CA
{nfliu,ananya,pliang}@cs.stanford.edu
robinjia@usc.edu
Abstract
Recent work in image classification and extractive question answering has observed that pre-trained models trained on less in-distribution (ID) data have better out-of-distribution (OOD) performance. However, it is unclear how broadly these trends hold. We conduct a large empirical study across three tasks, three broadly applicable modeling interventions (increasing model size, using a different adaptation method, and pre-training on more data), and 14 diverse datasets to investigate the relationship between sample efficiency (the amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation). We find that higher sample efficiency is correlated with better average OOD robustness for some modeling interventions and tasks, but not others. On individual datasets, models with lower sample efficiency can even be more robust. These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent. Even in an era of large, multi-purpose pre-trained models, task-specific decisions may often be necessary for OOD generalization.
1 Introduction
NLP models perform well when evaluated on data drawn from their training distribution (in-distribution / ID), but they typically suffer large drops in performance when evaluated on data distributions unseen during training (out-of-distribution / OOD; Blitzer, 2008).
How does exposure to ID training examples affect the ID-OOD gap? If two models reach the same ID performance, will the model trained on fewer ID examples (i.e., the more sample-efficient one) also have higher OOD performance (i.e., be more robust)? At one extreme, zero-shot models will not learn ID-specific patterns because they are not exposed to any labeled ID examples. Similarly, few-shot models trained on very few ID examples may also rely less on ID-specific patterns; if a model never sees the token “cat” while training on SNLI, then it will not learn that its presence is spuriously predictive of the contradiction label (Gururangan et al., 2018; Utama et al., 2021). Supporting this intuition, recent work in image classification (Radford et al., 2021) and extractive question answering (Awadalla et al., 2022) shows that zero-shot inference and few-shot fine-tuning improve average robustness across a range of OOD test sets. However, it is unclear how universal these trends are across various tasks and methods for reducing exposure to ID examples, or how predictive they are for any individual test set of interest. Figure 1 illustrates this central question.
We conduct a broad empirical study over 14 datasets across three tasks to investigate the relationship between exposure to ID training examples (sample efficiency) and robustness. We experiment with three modeling interventions that improve sample efficiency: (1) using natural language prompts for zero-shot prediction and during fine-tuning (Brown et al., 2020; Schick and Schütze, 2021; Gao et al., 2021); (2) fine-tuning models of increasing size; (3) fine-tuning models pre-trained on increasing amounts of data.
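Concretely, one illustrative way to make these notions precise (the notation below is purely illustrative and is not the paper's own formalization): let $\mathrm{acc}_{\mathrm{ID}}(f_n)$ and $\mathrm{acc}_{\mathrm{OOD}}(f_n)$ denote the ID and OOD accuracy of a model $f_n$ fine-tuned on $n$ labeled ID examples. The sample efficiency of an intervention at a target ID accuracy $a$ can then be summarized by
$$n^{\star}(a) \;=\; \min\{\, n : \mathrm{acc}_{\mathrm{ID}}(f_n) \ge a \,\},$$
and its robustness by the OOD accuracy it attains at that matched ID accuracy, $\mathrm{acc}_{\mathrm{OOD}}(f_{n^{\star}(a)})$, or equivalently by the ID-OOD gap $\mathrm{acc}_{\mathrm{ID}}(f_{n^{\star}(a)}) - \mathrm{acc}_{\mathrm{OOD}}(f_{n^{\star}(a)})$. A more sample-efficient intervention reaches the target ID accuracy with a smaller $n^{\star}(a)$; the central question is whether it also attains higher OOD accuracy (a smaller gap) at that matched ID accuracy.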
We find that higher sample efficiency is only sometimes correlated with better robustness, and the effect of specific modeling interventions varies by task. For example, increasing pre-trained model size substantially improves sample efficiency and results in higher average robustness in sentiment experiments, but these sample efficiency gains do not translate to higher average robustness in NLI and extractive QA experiments. On individual datasets, models with better sample efficiency can even be less robust (e.g., increasing model size when training on SST-2 and evaluating OOD on IMDb).
Overall, these results indicate that general-