Exploring The Landscape of Distributional Robustness
for Question Answering Models
Anas Awadalla1  Mitchell Wortsman1  Gabriel Ilharco1  Sewon Min1
Ian Magnusson2  Hannaneh Hajishirzi1,2  Ludwig Schmidt1,2
Abstract
We conduct a large empirical evaluation to
investigate the landscape of distributional ro-
bustness in question answering. Our in-
vestigation spans over 350 models and 16
question answering datasets, including a di-
verse set of architectures, model sizes, and
adaptation methods (e.g., fine-tuning, adapter
tuning, in-context learning, etc.). We find
that, in many cases, model variations do
not affect robustness and in-distribution per-
formance alone determines out-of-distribution
performance. Moreover, our findings indicate
that i) zero-shot and in-context learning meth-
ods are more robust to distribution shifts than
fully fine-tuned models; ii) few-shot prompt
fine-tuned models exhibit better robustness
than few-shot fine-tuned span prediction mod-
els; iii) parameter-efficient and robustness en-
hancing training methods provide no signif-
icant robustness improvements. In addition,
we publicly release all evaluations to encour-
age researchers to further analyze robustness
trends for question answering models.
1 Introduction
Over the past few years, natural language process-
ing has seen substantial progress. In many bench-
marks, large pre-trained models adapted to a target
dataset reach or even surpass human performance
(Devlin et al.,2019;Raffel et al.,2019;Radford
et al.,2019;Brown et al.,2020b;Hoffmann et al.,
2022;Chowdhery et al.,2022, inter alia). At the
same time, current methods still fail to generalize
reliably in a variety of test conditions (Ribeiro et al.,
2020;Gardner et al.,2020;Koh et al.,2021;Luu
et al.,2021;Ribeiro and Lundberg,2022), which
limits their applicability and raises questions about
what exactly the methods learn (Bender and Koller,
2020). One limitation of current benchmarks is that
they often measure performance only on data that comes from the same distribution as the training set (Wang et al., 2018, 2019a).

1 University of Washington. 2 Allen Institute for AI. Correspondence to anasa2@cs.washington.edu.

However, evaluating
models on a single test set provides no information
on whether a method also performs well under dis-
tribution shift. While there is an increasing amount
of research on robustness in NLP (Ribeiro et al.,
2020;Tu et al.,2020;Hendrycks et al.,2020;Gard-
ner et al.,2020;Arora et al.,2021;Veitch et al.,
2021;Goel et al.,2021;Miller et al.,2020, inter
alia), the community has not yet adopted a com-
mon set of best practices for evaluating robustness.
As a result, new methods often do not evaluate on
comparable or even any robustness test sets, which
makes it challenging to understand which meth-
ods generalize more reliably and whether NLP is
making progress on robustness to distribution shift.
To address this challenge and shed light on the
robustness landscape in NLP, we conduct a large
empirical evaluation of distributional robustness
in question answering (QA). Building on recent
research on robustness in computer vision (Taori
et al.,2020;Miller et al.,2021a), we focus on dis-
tribution shifts that arise between two related but
different test sets. These distribution shifts are
sometimes called dataset shift to distinguish them
from other kinds of distribution shift. An exam-
ple of dataset shift is a pair of QA test sets where
one test set is constructed from Wikipedia articles
and the other from Amazon product reviews, pos-
sibly also with a different crowdsourcing process.
In contrast to other notions of robustness such as
adversarial robustness, dataset shifts involve no
synthetic perturbations of existing test examples
and are therefore more representative of generaliza-
tion challenges arising “in the wild” (Taori et al.,
2020).
Within the scope of dataset shifts for QA, our
robustness evaluation includes a wide range of
models and distribution shifts. Specifically, we
assembled a testbed of over 350 QA models and
16 QA datasets, including SQuAD v1.1 (Rajpurkar et al., 2016), SquadShifts (Miller et al., 2020), and MRQA test sets (Fisch et al., 2019).

Figure 1: We evaluate over 350 models on 16 datasets to characterize the landscape of distributional robustness in question answering. Our results span a variety of architectures and adaptation strategies, including zero-shot inference, fine-tuning, and in-context learning (ICL). The x-axis shows performance on SQuAD (in-distribution), while the y-axis shows the average performance on the 15 other QA datasets (out-of-distribution). Almost all models lie under the y = x diagonal, i.e., performance drops under distribution shift. Moreover, within certain groups of models—for instance, ICL models—in-distribution performance accurately predicts out-of-distribution performance. As in Taori et al. (2020), we apply logit axis scaling to clarify that the relationship between in-distribution and out-of-distribution performance is approximately linear in the logit domain. (Annotated points in the plot include RoBERTa Large, T5 Large, OPT 175B with 4 in-context examples, GPT-J, GPT-2 XL with 16 examples, and SpanBERT Base with 512 examples.)

Our testbed
spans different model architectures, model sizes,
and pre-training setups. In addition, we evaluate
a variety of approaches for applying pre-trained
models to question answering including supervised
fine-tuning, in-context learning, parameter-efficient
fine-tuning, zero-shot inference, and more. Finally,
we also include methods specifically designed to
enhance robustness such as RXF (Aghajanyan et al.,
2021) and FreeLB (Zhu et al.,2020).
Our testbed enables us to both identify overarch-
ing trends spanning many models, and to contextu-
alize the robustness behavior of individual models.
Among our findings are the following key results:
• Dataset shift is still an unsolved problem in QA: most models suffer a large performance drop under this kind of distribution shift.
• Despite different architectures and model sizes, many models follow a consistent trend relating in-distribution and out-of-distribution performance. Improving in-distribution performance usually also increases out-of-distribution performance in a predictable way.
• Current robustness interventions follow the same trend as models without such interventions, i.e., the robustness interventions do not increase robustness to dataset shifts.
• The only exceptions to the otherwise universal performance trend are zero-shot, in-context learning, and few-shot prompt fine-tuned models. These models are more robust than the baseline given by the other models in our testbed. However, the robustness of large decoder-only models decreases as the models are fine-tuned on more data from the target task.
Figure 1 summarizes our findings and shows the
average F1 score on all distribution shifts as a func-
tion of the F1 score on SQuAD. Interestingly, our
overall results are analogous to similar large-scale
robustness evaluations in computer vision (Taori
et al.,2020;Miller et al.,2021a;Radford et al.,
2021), which suggests that there may be a shared
underlying mechanism behind these distribution
shifts that warrants further investigation.
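Concretely, the trend lines in Figure 1 are fit in logit-transformed F1 space, following Taori et al. (2020). The sketch below shows one way such a plot can be produced; the F1 arrays, tick positions, and axis labels are illustrative placeholders rather than our actual results.

```python
# Sketch of a logit-scaled ID-vs-OOD scatter plot in the style of Figure 1
# (after Taori et al., 2020). All F1 values are illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit

id_f1 = np.array([0.55, 0.70, 0.82, 0.88, 0.91])    # SQuAD (in-distribution) F1
ood_f1 = np.array([0.40, 0.55, 0.68, 0.75, 0.79])   # mean F1 over the 15 OOD test sets

# Fit the baseline trend in logit space, where the ID/OOD relationship is roughly linear.
slope, intercept = np.polyfit(logit(id_f1), logit(ood_f1), deg=1)

xs = np.linspace(0.35, 0.95, 200)
plt.scatter(logit(id_f1), logit(ood_f1), label="models")
plt.plot(logit(xs), slope * logit(xs) + intercept, label="baseline fit")
plt.plot(logit(xs), logit(xs), linestyle="--", label="y = x")

# Tick positions are in logit space but labeled with the corresponding F1 values.
ticks = np.array([0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
plt.xticks(logit(ticks), [f"{t:.0%}" for t in ticks])
plt.yticks(logit(ticks), [f"{t:.0%}" for t in ticks])
plt.xlabel("SQuAD F1 (in-distribution)")
plt.ylabel("Average F1 on 15 OOD datasets")
plt.legend()
plt.show()
```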
We hope that our work helps clarify the state of
robustness in NLP and provides a starting point
for future work. To simplify measuring robustness
to dataset shift and enable future robustness im-
provements, we will release our testbed including
all 350+ models and evaluation results.
The remainder of the paper is organized as fol-
lows: first, we detail background and experimental
setup (§2). Next, we introduce and answer our spe-
cific research questions (§3,4). Finally, we discuss
the limitations of our approach, overall conclusions,
and directions for future investigation (§6,7).
Figure 2: A schematic illustrating how we measure robustness. Effective robustness scatter plots (Recht et al., 2019; Taori et al., 2020) display performance on the distribution the training data is drawn from (in-distribution) on the x-axis, and out-of-distribution performance on the y-axis. Effective robustness is vertical movement towards the y = x diagonal beyond the baseline trend fit to fully fine-tuned models—a model with higher effective robustness has more consistent performance in- and out-of-distribution.
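As a concrete reading of this definition, the sketch below computes effective robustness as the gap between a model's observed out-of-distribution F1 and the out-of-distribution F1 predicted from its in-distribution F1 by a baseline trend fit, in logit space, to fully fine-tuned models. All F1 values and function names are illustrative placeholders.

```python
# Sketch of effective robustness (Taori et al., 2020): a model's OOD F1 minus the OOD F1
# predicted from its ID F1 by the baseline trend of fully fine-tuned models.
# The baseline fit is done in logit space; all F1 values below are placeholders.
import numpy as np
from scipy.special import logit, expit

def fit_baseline(id_f1, ood_f1):
    """Linear fit of the ID->OOD trend of fully fine-tuned models in logit space."""
    slope, intercept = np.polyfit(logit(id_f1), logit(ood_f1), deg=1)
    return slope, intercept

def effective_robustness(id_f1, ood_f1, slope, intercept):
    """Vertical gap (in F1) between the model and the baseline trend."""
    predicted_ood = expit(slope * logit(id_f1) + intercept)
    return ood_f1 - predicted_ood

# Fully fine-tuned baseline models (placeholder scores on a 0-1 scale).
baseline_id = np.array([0.62, 0.75, 0.84, 0.90])
baseline_ood = np.array([0.47, 0.60, 0.70, 0.77])
slope, intercept = fit_baseline(baseline_id, baseline_ood)

# A hypothetical in-context-learning model: same ID F1 as a baseline model, higher OOD F1.
print(effective_robustness(0.75, 0.68, slope, intercept))  # positive => effectively robust
```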
2 Experimental Setup
Our testbed includes over 350 models, covering
a broad range of model architectures, pre-training
datasets, and adaptation strategies. We use SQuAD
v1.1 (Rajpurkar et al.,2016) as our reference
point for question answering performance because
SQuAD is a popular dataset and the performance
ceiling is comparatively well understood since hu-
mans can achieve an F1 score around 95 (Miller
et al.,2020). For all models except those perform-
ing zero-shot inference, we adapt the models to
question answering with the SQuAD training set.
We evaluate robustness to distribution shift on
the remaining 15 question answering datasets (Ta-
ble 1). We follow Taori et al. (2020) in defining
robustness, i.e., we say a model is robust if it has
consistent performance under a distribution shift
from a reference distribution to another distribution.
We refer to SQuAD as in-distribution (ID) and the
other 15 datasets as out-of-distribution (OOD). In
the remainder of this section, we describe the dif-
ferent models, adaptation strategies, datasets, and
evaluation details.
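All scores in this paper are F1. For reference, the sketch below implements the standard SQuAD-style token-overlap F1 between a single prediction and a single gold answer; the official evaluation scripts additionally take a maximum over multiple gold answers per question, which we omit here for brevity.

```python
# Minimal sketch of SQuAD-style token-level F1 between one prediction and one gold answer.
import collections
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the Norman conquest of England", "Norman conquest"))  # ~0.67
```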
2.1 Models
Our testbed focuses on transformer models rang-
ing from 11 million to 175 billion parame-
ters. We explore several encoder-only models—ALBERT (Lan et al., 2020), BERT (Devlin et al., 2019), SpanBERT (Joshi et al., 2020), RoBERTa (Liu et al., 2019), and Splinter (Ram et al., 2021a)—encoder-decoder models—T5 (Raffel et al., 2019) and BART (Lewis et al., 2020)—and decoder-only models (GPT-2 (Radford et al., 2019), OPT (Zhang et al., 2022), GPT-Neo (Black et al., 2021), and GPT-J (Wang and Komatsuzaki, 2021)).
2.2 Adaptation strategies
We evaluate multiple adaptation strategies—
methods that adapt the pre-trained language model
to perform better on a downstream task using la-
beled, in-distribution training data, e.g., through
gradient-based learning and in-context learning. We also examine models evaluated in a zero-shot setting, which we also refer to as an adaptation method for consistency, even though no data from the in-distribution dataset is observed. For a subset of these models we also explore few-shot adaptation, rather than adaptation on the full training set, to assess the impact of the number of training examples on robustness.
2.2.1 Fine-tuning (baseline)
We include a common fine-tuning method: adding
a span prediction head and updating all the param-
eters in a language model via additional training
on a downstream dataset, as done in Devlin et al.
(2019) and subsequent work.
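To make the baseline concrete, the following sketch adds a start/end span-prediction head on top of a pre-trained encoder and backpropagates through every parameter. The encoder name, the toy question-context pair, and the gold span indices are illustrative placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch of span-prediction fine-tuning (Devlin et al., 2019): a linear start/end
# head on a pre-trained encoder, with all parameters updated. Names and values are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SpanQAModel(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.span_head = nn.Linear(self.encoder.config.hidden_size, 2)  # start/end logits

    def forward(self, input_ids, attention_mask, start_positions=None, end_positions=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        start_logits, end_logits = self.span_head(hidden).split(1, dim=-1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
        if start_positions is None:
            return start_logits, end_logits
        loss_fn = nn.CrossEntropyLoss()
        return (loss_fn(start_logits, start_positions) +
                loss_fn(end_logits, end_positions)) / 2

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SpanQAModel()
batch = tokenizer(["Who wrote Hamlet?"],
                  ["Hamlet is a tragedy written by William Shakespeare."],
                  return_tensors="pt", truncation=True)
loss = model(batch["input_ids"], batch["attention_mask"],
             start_positions=torch.tensor([8]),   # toy gold span indices
             end_positions=torch.tensor([10]))
loss.backward()  # full fine-tuning: gradients flow through every encoder parameter
```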
2.2.2 Prompt fine-tuning
Prompt fine-tuning adds no task-specific layers and instead fine-tunes the existing weights to generate the answer. We use next-token prediction when fine-tuning auto-regressive models like GPT. For T5 and BART models we use two fine-tuning tasks: 1) casting QA as an infilling task and generating the answer by predicting a masked span; 2) conditioning the model on the context and question and fine-tuning it to generate the answer.
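The sketch below illustrates these three input-output formats. The exact prompt wording and the helper function names are our own illustrative choices; only the structure (source text mapped to target text) matters.

```python
# Sketch of the prompt fine-tuning formats described above; wording is illustrative.

def causal_lm_example(context, question, answer):
    """Next-token prediction for decoder-only models (e.g., GPT-2): train on the
    concatenation, with the loss taken over the answer tokens."""
    return f"Background: {context}\nQuestion: {question}\nAnswer: {answer}"

def infilling_example(context, question, answer):
    """Casting QA as span infilling for T5-style models: predict the masked answer span."""
    source = f"Background: {context}\nQuestion: {question}\nAnswer: <extra_id_0>"
    target = f"<extra_id_0> {answer}"
    return source, target

def conditional_generation_example(context, question, answer):
    """Condition on context and question; generate the answer directly."""
    source = f"question: {question} context: {context}"
    return source, answer

print(infilling_example("Hamlet was written by Shakespeare.",
                        "Who wrote Hamlet?", "Shakespeare"))
```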
2.2.3 Parameter-efficient fine-tuning
Parameter-efficient fine-tuning modifies only a
small percentage of existing or auxiliary param-
eters, while freezing all other parameters. We
evaluate Houlsby (Houlsby et al.,2019) and Pfeif-
fer (Pfeiffer et al., 2021) adapters, prefix tuning (Li and Liang, 2021), and LoRA (Hu et al., 2021).

Dataset name | Test set size | Domains
SQuAD v1.1 dev. set (Rajpurkar et al., 2016) | 10,570 | Wikipedia
SquadShifts New-Wiki (Miller et al., 2020) | 7,938 | Wikipedia
SquadShifts Reddit (Miller et al., 2020) | 9,803 | Reddit
SquadShifts NYT (Miller et al., 2020) | 10,065 | New York Times
SquadShifts Amazon (Miller et al., 2020) | 9,885 | Amazon reviews
RACE (Lai et al., 2017) | 674 | English exams from China
DROP (Dua et al., 2019) | 1,503 | Wikipedia
NewsQA (Trischler et al., 2017) | 4,212 | CNN articles
SearchQA (Dunn et al., 2017) | 16,980 | Jeopardy! questions with contexts from Google search
NaturalQuestions (Kwiatkowski et al., 2019) | 12,836 | Google search questions with contexts from Wikipedia
DuoRC (ParaphraseRC) (Saha et al., 2018) | 1,501 | Movie plots from IMDB and Wikipedia
HotpotQA (Yang et al., 2018) | 5,904 | Wikipedia
TextbookQA (Kembhavi et al., 2017) | 1,503 | Middle school science questions from textbooks
TriviaQA (Joshi et al., 2017) | 7,785 | Trivia questions with contexts collected using a Bing search
RelationExtraction (Levy et al., 2017) | 2,948 | Generated samples using a knowledge base
BioASQ (Tsatsaronis et al., 2015) | 1,504 | Medical articles

Table 1: Question answering datasets used to evaluate models in this work. SQuAD is used as the in-distribution reference dataset—we use training data from SQuAD to adapt models. The remaining datasets are used to answer the question of how SQuAD models perform under dataset shift—we use these other datasets for evaluation only.
While these methods modify only a small number
of parameters, they have been shown to be com-
petitive with full fine-tuning when measuring in-
distribution performance. Previous work suggests
freezing a majority of model weights may make
these methods more robust (Lester et al.,2021).
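As one concrete example of a parameter-efficient method, the sketch below implements a minimal LoRA-style layer: a frozen pre-trained linear projection plus a trainable low-rank update. It is a simplified stand-in for the adapter-library implementations we actually evaluate, not the exact code used in our experiments.

```python
# Minimal LoRA-style layer (after Hu et al., 2021): frozen base weights plus a
# trainable low-rank update. Rank, scaling, and sizes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)          # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # trainable
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))         # trainable, init to 0
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen projection plus the low-rank update B @ A applied to the input.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable}/{total}")  # only the low-rank factors are updated
```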
2.2.4 Robustness enhancing fine-tuning
We evaluate methods which have been designed
to improve model robustness. In particular, we
evaluate RXF (Aghajanyan et al.,2021) and
FreeLB (Zhu et al.,2020), which apply adversarial
training strategies to improve generalization. Previous work evaluated robustness by comparing against only a few models and did not run extensive evaluations in question answering; our work conducts evaluations on a large number of distribution shifts.
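The sketch below is a heavily simplified, single-ascent-step version of embedding-space adversarial training in the spirit of FreeLB; the actual method takes multiple projected ascent steps and accumulates gradients across them. It assumes a HuggingFace-style QA model that accepts inputs_embeds, start_positions, and end_positions and returns an output with a .loss field.

```python
# Simplified FreeLB-style training step (single ascent step, no projection or
# gradient accumulation). The model interface is an assumption, as noted above.
import torch

def adversarial_step(model, batch, optimizer, epsilon=0.1):
    """One update on an adversarially perturbed copy of the input embeddings."""
    model.train()
    embeds = model.get_input_embeddings()(batch["input_ids"]).detach().requires_grad_(True)

    # 1) Adversarial direction: gradient of the task loss w.r.t. the input embeddings.
    clean_loss = model(inputs_embeds=embeds,
                       attention_mask=batch["attention_mask"],
                       start_positions=batch["start_positions"],
                       end_positions=batch["end_positions"]).loss
    (embed_grad,) = torch.autograd.grad(clean_loss, embeds)
    delta = epsilon * embed_grad / (embed_grad.norm(dim=-1, keepdim=True) + 1e-8)

    # 2) Update the model parameters on the perturbed embeddings.
    optimizer.zero_grad()
    adv_loss = model(inputs_embeds=(embeds + delta).detach(),
                     attention_mask=batch["attention_mask"],
                     start_positions=batch["start_positions"],
                     end_positions=batch["end_positions"]).loss
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```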
2.2.5 In-context learning
In-context learning is an adaptation method pro-
posed by Brown et al. (2020a) that does not require
any gradient updates. This is particularly useful
for very large language models, where fine-tuning
is expensive. In-context learning refers to the pro-
cess of conditioning a language model on one or
more samples from a training set at inference time,
allowing the model to perform a task without up-
dating any parameters. For our experiments, we
condition the model on triplets of context, question,
and answer, as in Brown et al. (2020a).
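The sketch below shows how such a prompt can be assembled from k demonstrations followed by the test context and question; the template wording is an illustrative assumption rather than the exact format we use. With k = 0 demonstrations, the same prompt reduces to the zero-shot setting of Section 2.2.6.

```python
# Sketch of in-context learning prompt construction from (context, question, answer)
# demonstrations. Template wording is illustrative.
def build_icl_prompt(demonstrations, test_context, test_question):
    parts = []
    for context, question, answer in demonstrations:
        parts.append(f"Background: {context}\nQuestion: {question}\nAnswer: {answer}")
    parts.append(f"Background: {test_context}\nQuestion: {test_question}\nAnswer:")
    return "\n\n".join(parts)

demos = [("Hamlet was written by William Shakespeare.",
          "Who wrote Hamlet?", "William Shakespeare")]
prompt = build_icl_prompt(demos, "SQuAD was introduced in 2016.",
                          "When was SQuAD introduced?")
print(prompt)  # fed to the language model; the generated continuation is the prediction
```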
2.2.6 Zero-shot inference
We evaluate models using prompting or zero-shot
inference (Radford et al.,2019), where a model
is conditioned only on the context and question
of each test example. In other words, the model
generates an answer without conditioning on train-
ing examples. Zero-shot models do not observe
data from the reference distribution and have been
shown to exhibit consistent performance across
many distributions in computer vision (Radford
et al.,2021).
2.3 Distribution shifts
We consider models which are trained on a refer-
ence distribution, which we also refer to as the in-
distribution, with the exception of zero-shot models.
In addition to measuring model performance on
this reference distribution, we also evaluate model
performance on other datasets whose data distribution differs from the reference distribution. We
refer to these other datasets as out-of-distribution,
and we are interested in model behavior under dis-
tribution shift. Concretely, we want to measure
how model performance changes when evaluated
in- and out-of-distribution.
While there is extensive literature studying ad-
versarial distribution shifts (Wu et al.,2021), our
work focuses on natural distribution shifts (Taori
et al.,2020), where the out-of-distribution datasets
are not generated via synthetic perturbations to ex-
isting datasets.
In this work, we use the popular SQuAD (Ra-
jpurkar et al.,2016) dataset as the reference
(in-distribution) dataset. In addition, we evalu-
ate model performance on 15 out-of-distribution
datasets. We choose SQuAD as the reference dis-
tribution as it is one of the largest and the most