Exploring The Landscape of Distributional Robustness
for Question Answering Models
Anas Awadalla1  Mitchell Wortsman1  Gabriel Ilharco1  Sewon Min1
Ian Magnusson2  Hannaneh Hajishirzi1,2  Ludwig Schmidt1,2
Abstract
We conduct a large empirical evaluation to
investigate the landscape of distributional ro-
bustness in question answering. Our in-
vestigation spans over 350 models and 16
question answering datasets, including a di-
verse set of architectures, model sizes, and
adaptation methods (e.g., fine-tuning, adapter
tuning, in-context learning, etc.). We find
that, in many cases, model variations do
not affect robustness and in-distribution per-
formance alone determines out-of-distribution
performance. Moreover, our findings indicate
that i) zero-shot and in-context learning meth-
ods are more robust to distribution shifts than
fully fine-tuned models; ii) few-shot prompt
fine-tuned models exhibit better robustness
than few-shot fine-tuned span prediction mod-
els; iii) parameter-efficient and robustness en-
hancing training methods provide no signif-
icant robustness improvements. In addition,
we publicly release all evaluations to encour-
age researchers to further analyze robustness
trends for question answering models.
1 Introduction
Over the past few years, natural language process-
ing has seen substantial progress. In many bench-
marks, large pre-trained models adapted to a target
dataset reach or even surpass human performance
(Devlin et al.,2019;Raffel et al.,2019;Radford
et al.,2019;Brown et al.,2020b;Hoffmann et al.,
2022;Chowdhery et al.,2022, inter alia). At the
same time, current methods still fail to generalize
reliably in a variety of test conditions (Ribeiro et al.,
2020;Gardner et al.,2020;Koh et al.,2021;Luu
et al.,2021;Ribeiro and Lundberg,2022), which
limits their applicability and raises questions about
what exactly the methods learn (Bender and Koller,
2020). One limitation of current benchmarks is that
they often measure performance only on data that comes from the same distribution as the training set (Wang et al., 2018, 2019a).

1 University of Washington. 2 Allen Institute for AI. Correspondence to anasa2@cs.washington.edu.

However, evaluating
models on a single test set provides no information
on whether a method also performs well under dis-
tribution shift. While there is an increasing amount
of research on robustness in NLP (Ribeiro et al.,
2020;Tu et al.,2020;Hendrycks et al.,2020;Gard-
ner et al.,2020;Arora et al.,2021;Veitch et al.,
2021;Goel et al.,2021;Miller et al.,2020, inter
alia), the community has not yet adopted a com-
mon set of best practices for evaluating robustness.
As a result, new methods often do not evaluate on
comparable or even any robustness test sets, which
makes it challenging to understand which meth-
ods generalize more reliably and whether NLP is
making progress on robustness to distribution shift.
To address this challenge and shed light on the
robustness landscape in NLP, we conduct a large
empirical evaluation of distributional robustness
in question answering (QA). Building on recent
research on robustness in computer vision (Taori
et al.,2020;Miller et al.,2021a), we focus on dis-
tribution shifts that arise between two related but
different test sets. These distribution shifts are
sometimes called dataset shift to distinguish them
from other kinds of distribution shift. An exam-
ple of dataset shift is a pair of QA test sets where
one test set is constructed from Wikipedia articles
and the other from Amazon product reviews, pos-
sibly also with a different crowdsourcing process.
In contrast to other notions of robustness such as
adversarial robustness, dataset shifts involve no
synthetic perturbations of existing test examples
and are therefore more representative of generaliza-
tion challenges arising “in the wild” (Taori et al.,
2020).
Within the scope of dataset shifts for QA, our
robustness evaluation includes a wide range of
models and distribution shifts. Specifically, we
assembled a testbed of over 350 QA models and
16 QA datasets, including SQuAD v1.1 (Rajpurkar et al., 2016), SquadShifts (Miller et al., 2020), and MRQA test sets (Fisch et al., 2019).

Figure 1: We evaluate over 350 models on 16 datasets to characterize the landscape of distributional robustness in question answering. Our results span a variety of architectures and adaptation strategies, including zero-shot inference, fine-tuning, and in-context learning (ICL). The x-axis shows performance on SQuAD (in-distribution), while the y-axis shows the average performance on the 15 other QA datasets (out-of-distribution). Almost all models lie under the y = x diagonal, i.e., performance drops under distribution shift. Moreover, within certain groups of models—for instance, ICL models—in-distribution performance accurately predicts out-of-distribution performance. As in Taori et al. (2020), we apply logit axis scaling to clarify that the relationship between in-distribution and out-of-distribution performance is approximately linear in the logit domain. (Annotated points in the plot include RoBERTa Large, T5 Large, OPT 175B with 4 in-context examples, GPT-J, GPT-2 XL with 16 examples, and SpanBERT Base with 512 examples.)

Our testbed
spans different model architectures, model sizes,
and pre-training setups. In addition, we evaluate
a variety of approaches for applying pre-trained
models to question answering including supervised
fine-tuning, in-context learning, parameter-efficient
fine-tuning, zero-shot inference, and more. Finally,
we also include methods specifically designed to
enhance robustness such as RXF (Aghajanyan et al.,
2021) and FreeLB (Zhu et al.,2020).
Our testbed enables us to both identify overarch-
ing trends spanning many models, and to contextu-
alize the robustness behavior of individual models.
Among our findings are the following key results:
• Dataset shift is still an unsolved problem in QA: most models suffer a large performance drop under this kind of distribution shift.
• Despite different architectures and model sizes, many models follow a consistent trend relating in-distribution and out-of-distribution performance. Improving in-distribution performance usually also increases out-of-distribution performance in a predictable way.
• Current robustness interventions follow the same trend as models without such interventions, i.e., the robustness interventions do not increase robustness to dataset shifts.
• The only exceptions to the otherwise universal performance trend are zero-shot, in-context learning, and few-shot prompt fine-tuned models. These models are more robust than the baseline given by the other models in our testbed. However, the robustness of large decoder-only models decreases as the models are fine-tuned on more data from the target task.
Figure 1 summarizes our findings and shows the
average F1 score on all distribution shifts as a func-
tion of the F1 score on SQuAD. Interestingly, our
overall results are analogous to similar large-scale
robustness evaluations in computer vision (Taori
et al.,2020;Miller et al.,2021a;Radford et al.,
2021), which suggests that there may be a shared
underlying mechanism behind these distribution
shifts that warrants further investigation.
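Concretely, the trend lines in Figure 1 are fit in logit-transformed F1 space, following Taori et al. (2020). The sketch below shows one way such a plot can be produced; the F1 arrays, tick positions, and axis labels are illustrative placeholders rather than our actual results.

```python
# Sketch of a logit-scaled ID-vs-OOD scatter plot in the style of Figure 1
# (after Taori et al., 2020). All F1 values are illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import logit

id_f1 = np.array([0.55, 0.70, 0.82, 0.88, 0.91])    # SQuAD (in-distribution) F1
ood_f1 = np.array([0.40, 0.55, 0.68, 0.75, 0.79])   # mean F1 over the 15 OOD test sets

# Fit the baseline trend in logit space, where the ID/OOD relationship is roughly linear.
slope, intercept = np.polyfit(logit(id_f1), logit(ood_f1), deg=1)

xs = np.linspace(0.35, 0.95, 200)
plt.scatter(logit(id_f1), logit(ood_f1), label="models")
plt.plot(logit(xs), slope * logit(xs) + intercept, label="baseline fit")
plt.plot(logit(xs), logit(xs), linestyle="--", label="y = x")

# Tick positions are in logit space but labeled with the corresponding F1 values.
ticks = np.array([0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
plt.xticks(logit(ticks), [f"{t:.0%}" for t in ticks])
plt.yticks(logit(ticks), [f"{t:.0%}" for t in ticks])
plt.xlabel("SQuAD F1 (in-distribution)")
plt.ylabel("Average F1 on 15 OOD datasets")
plt.legend()
plt.show()
```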
We hope that our work helps clarify the state of
robustness in NLP and provides a starting point
for future work. To simplify measuring robustness
to dataset shift and enable future robustness im-
provements, we will release our testbed including
all 350+ models and evaluation results.
The remainder of the paper is organized as fol-
lows: first, we detail background and experimental
setup (§2). Next, we introduce and answer our spe-
cific research questions (§3,4). Finally, we discuss
the limitations of our approach, overall conclusions,
and directions for future investigation (§6,7).
Figure 2: A schematic illustrating how we measure robustness. Effective robustness scatter plots (Recht et al., 2019; Taori et al., 2020) display performance on the distribution the training data is drawn from (in-distribution) on the x-axis, and out-of-distribution performance on the y-axis. Effective robustness is vertical movement towards the y = x diagonal beyond the baseline trend fit to fully fine-tuned models—a model with higher effective robustness has more consistent performance in- and out-of-distribution.
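As a concrete reading of this definition, the sketch below computes effective robustness as the gap between a model's observed out-of-distribution F1 and the out-of-distribution F1 predicted from its in-distribution F1 by a baseline trend fit, in logit space, to fully fine-tuned models. All F1 values and function names are illustrative placeholders.

```python
# Sketch of effective robustness (Taori et al., 2020): a model's OOD F1 minus the OOD F1
# predicted from its ID F1 by the baseline trend of fully fine-tuned models.
# The baseline fit is done in logit space; all F1 values below are placeholders.
import numpy as np
from scipy.special import logit, expit

def fit_baseline(id_f1, ood_f1):
    """Linear fit of the ID->OOD trend of fully fine-tuned models in logit space."""
    slope, intercept = np.polyfit(logit(id_f1), logit(ood_f1), deg=1)
    return slope, intercept

def effective_robustness(id_f1, ood_f1, slope, intercept):
    """Vertical gap (in F1) between the model and the baseline trend."""
    predicted_ood = expit(slope * logit(id_f1) + intercept)
    return ood_f1 - predicted_ood

# Fully fine-tuned baseline models (placeholder scores on a 0-1 scale).
baseline_id = np.array([0.62, 0.75, 0.84, 0.90])
baseline_ood = np.array([0.47, 0.60, 0.70, 0.77])
slope, intercept = fit_baseline(baseline_id, baseline_ood)

# A hypothetical in-context-learning model: same ID F1 as a baseline model, higher OOD F1.
print(effective_robustness(0.75, 0.68, slope, intercept))  # positive => effectively robust
```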
2 Experimental Setup
Our testbed includes over 350 models, covering
a broad range of model architectures, pre-training
datasets, and adaptation strategies. We use SQuAD
v1.1 (Rajpurkar et al.,2016) as our reference
point for question answering performance because
SQuAD is a popular dataset and the performance
ceiling is comparatively well understood since hu-
mans can achieve an F1 score around 95 (Miller
et al.,2020). For all models except those perform-
ing zero-shot inference, we adapt the models to
question answering with the SQuAD training set.
We evaluate robustness to distribution shift on
the remaining 15 question answering datasets (Ta-
ble 1). We follow Taori et al. (2020) in defining
robustness, i.e., we say a model is robust if it has
consistent performance under a distribution shift
from a reference distribution to another distribution.
We refer to SQuAD as in-distribution (ID) and the
other 15 datasets as out-of-distribution (OOD). In
the remainder of this section, we describe the dif-
ferent models, adaptation strategies, datasets, and
evaluation details.
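All scores in this paper are F1. For reference, the sketch below implements the standard SQuAD-style token-overlap F1 between a single prediction and a single gold answer; the official evaluation scripts additionally take a maximum over multiple gold answers per question, which we omit here for brevity.

```python
# Minimal sketch of SQuAD-style token-level F1 between one prediction and one gold answer.
import collections
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the Norman conquest of England", "Norman conquest"))  # ~0.67
```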
2.1 Models
Our testbed focuses on transformer models rang-
ing from 11 million to 175 billion parame-
ters. We explore several encoder-only models—ALBERT (Lan et al., 2020), BERT (Devlin et al., 2019), SpanBERT (Joshi et al., 2020), RoBERTa (Liu et al., 2019), and Splinter (Ram et al., 2021a)—encoder-decoder models—T5 (Raffel et al., 2019) and BART (Lewis et al., 2020)—and decoder-only models (GPT-2 (Radford et al., 2019), OPT (Zhang et al., 2022), GPT-Neo (Black et al., 2021), and GPT-J (Wang and Komatsuzaki, 2021)).
2.2 Adaptation strategies
We evaluate multiple adaptation strategies—
methods that adapt the pre-trained language model
to perform better on a downstream task using la-
beled, in-distribution training data, e.g., through
gradient-based learning and in-context learning. We also examine models evaluated in a zero-shot setting, which we also refer to as an adaptation method for consistency, even though no data from the in-distribution dataset is observed. For a subset of these models we also explore few-shot adaptation, rather than adaptation on the full training set, to assess the impact of the number of training examples on robustness.
2.2.1 Fine-tuning (baseline)
We include a common fine-tuning method: adding
a span prediction head and updating all the param-
eters in a language model via additional training
on a downstream dataset, as done in Devlin et al.
(2019) and subsequent work.
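To make the baseline concrete, the following sketch adds a start/end span-prediction head on top of a pre-trained encoder and backpropagates through every parameter. The encoder name, the toy question-context pair, and the gold span indices are illustrative placeholders, not the exact configuration used in our experiments.

```python
# Minimal sketch of span-prediction fine-tuning (Devlin et al., 2019): a linear start/end
# head on a pre-trained encoder, with all parameters updated. Names and values are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SpanQAModel(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.span_head = nn.Linear(self.encoder.config.hidden_size, 2)  # start/end logits

    def forward(self, input_ids, attention_mask, start_positions=None, end_positions=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        start_logits, end_logits = self.span_head(hidden).split(1, dim=-1)
        start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)
        if start_positions is None:
            return start_logits, end_logits
        loss_fn = nn.CrossEntropyLoss()
        return (loss_fn(start_logits, start_positions) +
                loss_fn(end_logits, end_positions)) / 2

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SpanQAModel()
batch = tokenizer(["Who wrote Hamlet?"],
                  ["Hamlet is a tragedy written by William Shakespeare."],
                  return_tensors="pt", truncation=True)
loss = model(batch["input_ids"], batch["attention_mask"],
             start_positions=torch.tensor([8]),   # toy gold span indices
             end_positions=torch.tensor([10]))
loss.backward()  # full fine-tuning: gradients flow through every encoder parameter
```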
2.2.2 Prompt fine-tuning
Prompt fine-tuning adds no task-specific layers and instead fine-tunes the existing weights to generate the answer. We use next-token prediction when fine-tuning auto-regressive models like GPT. For T5 and BART models we use two fine-tuning tasks: 1) casting QA as an infilling task and generating the answer by predicting a masked span; 2) conditioning the model on the context and question and fine-tuning it to generate the answer.
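The sketch below illustrates these three input-output formats. The exact prompt wording and the helper function names are our own illustrative choices; only the structure (source text mapped to target text) matters.

```python
# Sketch of the prompt fine-tuning formats described above; wording is illustrative.

def causal_lm_example(context, question, answer):
    """Next-token prediction for decoder-only models (e.g., GPT-2): train on the
    concatenation, with the loss taken over the answer tokens."""
    return f"Background: {context}\nQuestion: {question}\nAnswer: {answer}"

def infilling_example(context, question, answer):
    """Casting QA as span infilling for T5-style models: predict the masked answer span."""
    source = f"Background: {context}\nQuestion: {question}\nAnswer: <extra_id_0>"
    target = f"<extra_id_0> {answer}"
    return source, target

def conditional_generation_example(context, question, answer):
    """Condition on context and question; generate the answer directly."""
    source = f"question: {question} context: {context}"
    return source, answer

print(infilling_example("Hamlet was written by Shakespeare.",
                        "Who wrote Hamlet?", "Shakespeare"))
```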
2.2.3 Parameter-efficient fine-tuning
Parameter-efficient fine-tuning modifies only a
small percentage of existing or auxiliary param-
eters, while freezing all other parameters. We
evaluate Houlsby (Houlsby et al.,2019) and Pfeif-
fer (Pfeiffer et al., 2021) adapters, prefix tuning (Li and Liang, 2021), and LoRA (Hu et al., 2021).

Dataset name | Test set size | Domains
SQuAD v1.1 dev. set (Rajpurkar et al., 2016) | 10,570 | Wikipedia
SquadShifts New-Wiki (Miller et al., 2020) | 7,938 | Wikipedia
SquadShifts Reddit (Miller et al., 2020) | 9,803 | Reddit
SquadShifts NYT (Miller et al., 2020) | 10,065 | New York Times
SquadShifts Amazon (Miller et al., 2020) | 9,885 | Amazon reviews
RACE (Lai et al., 2017) | 674 | English exams from China
DROP (Dua et al., 2019) | 1,503 | Wikipedia
NewsQA (Trischler et al., 2017) | 4,212 | CNN articles
SearchQA (Dunn et al., 2017) | 16,980 | Jeopardy! questions with contexts from Google search
NaturalQuestions (Kwiatkowski et al., 2019) | 12,836 | Google search questions with contexts from Wikipedia
DuoRC (ParaphraseRC) (Saha et al., 2018) | 1,501 | Movie plots from IMDB and Wikipedia
HotpotQA (Yang et al., 2018) | 5,904 | Wikipedia
TextbookQA (Kembhavi et al., 2017) | 1,503 | Middle school science questions from textbooks
TriviaQA (Joshi et al., 2017) | 7,785 | Trivia questions with contexts collected using a Bing search
RelationExtraction (Levy et al., 2017) | 2,948 | Generated samples using a knowledge base
BioASQ (Tsatsaronis et al., 2015) | 1,504 | Medical articles

Table 1: Question answering datasets used to evaluate models in this work. SQuAD is used as the in-distribution reference dataset—we use training data from SQuAD to adapt models. The remaining datasets are used to answer the question of how SQuAD models perform under dataset shift—we use these other datasets for evaluation only.
While these methods modify only a small number
of parameters, they have been shown to be com-
petitive with full fine-tuning when measuring in-
distribution performance. Previous work suggests
freezing a majority of model weights may make
these methods more robust (Lester et al.,2021).
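As one concrete example of a parameter-efficient method, the sketch below implements a minimal LoRA-style layer: a frozen pre-trained linear projection plus a trainable low-rank update. It is a simplified stand-in for the adapter-library implementations we actually evaluate, not the exact code used in our experiments.

```python
# Minimal LoRA-style layer (after Hu et al., 2021): frozen base weights plus a
# trainable low-rank update. Rank, scaling, and sizes are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)          # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # trainable
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))         # trainable, init to 0
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen projection plus the low-rank update B @ A applied to the input.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable}/{total}")  # only the low-rank factors are updated
```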
2.2.4 Robustness enhancing fine-tuning
We evaluate methods which have been designed
to improve model robustness. In particular, we
evaluate RXF (Aghajanyan et al.,2021) and
FreeLB (Zhu et al.,2020), which apply adversarial
training strategies to improve generalization. Previous work evaluated robustness by comparing against only a few models and did not run extensive evaluations in question answering; our work conducts evaluations on a large number of distribution shifts.
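The sketch below is a heavily simplified, single-ascent-step version of embedding-space adversarial training in the spirit of FreeLB; the actual method takes multiple projected ascent steps and accumulates gradients across them. It assumes a HuggingFace-style QA model that accepts inputs_embeds, start_positions, and end_positions and returns an output with a .loss field.

```python
# Simplified FreeLB-style training step (single ascent step, no projection or
# gradient accumulation). The model interface is an assumption, as noted above.
import torch

def adversarial_step(model, batch, optimizer, epsilon=0.1):
    """One update on an adversarially perturbed copy of the input embeddings."""
    model.train()
    embeds = model.get_input_embeddings()(batch["input_ids"]).detach().requires_grad_(True)

    # 1) Adversarial direction: gradient of the task loss w.r.t. the input embeddings.
    clean_loss = model(inputs_embeds=embeds,
                       attention_mask=batch["attention_mask"],
                       start_positions=batch["start_positions"],
                       end_positions=batch["end_positions"]).loss
    (embed_grad,) = torch.autograd.grad(clean_loss, embeds)
    delta = epsilon * embed_grad / (embed_grad.norm(dim=-1, keepdim=True) + 1e-8)

    # 2) Update the model parameters on the perturbed embeddings.
    optimizer.zero_grad()
    adv_loss = model(inputs_embeds=(embeds + delta).detach(),
                     attention_mask=batch["attention_mask"],
                     start_positions=batch["start_positions"],
                     end_positions=batch["end_positions"]).loss
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```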
2.2.5 In-context learning
In-context learning is an adaptation method pro-
posed by Brown et al. (2020a) that does not require
any gradient updates. This is particularly useful
for very large language models, where fine-tuning
is expensive. In-context learning refers to the pro-
cess of conditioning a language model on one or
more samples from a training set at inference time,
allowing the model to perform a task without up-
dating any parameters. For our experiments, we
condition the model on triplets of context, question,
and answer, as in Brown et al. (2020a).
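The sketch below shows how such a prompt can be assembled from k demonstrations followed by the test context and question; the template wording is an illustrative assumption rather than the exact format we use. With k = 0 demonstrations, the same prompt reduces to the zero-shot setting of Section 2.2.6.

```python
# Sketch of in-context learning prompt construction from (context, question, answer)
# demonstrations. Template wording is illustrative.
def build_icl_prompt(demonstrations, test_context, test_question):
    parts = []
    for context, question, answer in demonstrations:
        parts.append(f"Background: {context}\nQuestion: {question}\nAnswer: {answer}")
    parts.append(f"Background: {test_context}\nQuestion: {test_question}\nAnswer:")
    return "\n\n".join(parts)

demos = [("Hamlet was written by William Shakespeare.",
          "Who wrote Hamlet?", "William Shakespeare")]
prompt = build_icl_prompt(demos, "SQuAD was introduced in 2016.",
                          "When was SQuAD introduced?")
print(prompt)  # fed to the language model; the generated continuation is the prediction
```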
2.2.6 Zero-shot inference
We evaluate models using prompting or zero-shot
inference (Radford et al.,2019), where a model
is conditioned only on the context and question
of each test example. In other words, the model
generates an answer without conditioning on train-
ing examples. Zero-shot models do not observe
data from the reference distribution and have been
shown to exhibit consistent performance across
many distributions in computer vision (Radford
et al.,2021).
2.3 Distribution shifts
We consider models which are trained on a refer-
ence distribution, which we also refer to as the in-
distribution, with the exception of zero-shot models.
In addition to measuring model performance on
this reference distribution, we also evaluate model
performance on other datasets whose data distribution differs from the reference distribution. We
refer to these other datasets as out-of-distribution,
and we are interested in model behavior under dis-
tribution shift. Concretely, we want to measure
how model performance changes when evaluated
in- and out-of-distribution.
While there is extensive literature studying ad-
versarial distribution shifts (Wu et al.,2021), our
work focuses on natural distribution shifts (Taori
et al.,2020), where the out-of-distribution datasets
are not generated via synthetic perturbations to ex-
isting datasets.
In this work, we use the popular SQuAD (Ra-
jpurkar et al.,2016) dataset as the reference
(in-distribution) dataset. In addition, we evalu-
ate model performance on 15 out-of-distribution
datasets. We choose SQuAD as the reference dis-
tribution as it is one of the largest and the most