Understanding Prior Bias and Choice
Paralysis in Transformer-based Language
Representation Models through Four
Experimental Probes
Ke Shen
Information Sciences Institute,
University of Southern California, 4676
Admiralty Way, Suite 1001, Marina del
Rey, California 90292
Mayank Kejriwal∗∗
Information Sciences Institute,
University of Southern California, 4676
Admiralty Way, Suite 1001, Marina del
Rey, California 90292
Recent work on transformer-based neural networks has led to impressive advances on multiple-
choice natural language understanding (NLU) problems, such as Question Answering (QA)
and abductive reasoning. Despite these advances, there is still limited work on understanding
whether these models respond to perturbed multiple-choice instances in a sufficiently robust
manner that would allow them to be trusted in real-world situations. We present four confusion
probes, inspired by similar phenomena first identified in the behavioral science community, to
test for problems such as prior bias and choice paralysis. Experimentally, we probe a widely used
transformer-based multiple-choice NLU system using four established benchmark datasets. Here
we show that the model exhibits significant prior bias and, to a lesser but still highly significant degree, choice paralysis, in addition to other problems. Our results suggest that stronger testing protocols and additional benchmarks may be necessary before such language models are used in front-facing systems or for decision-making with real-world consequences.
1. Background
Question Answering (QA) (Hirschman and Gaizauskas 2001) and inference are im-
portant problems in natural language processing (NLP) and applied AI, including
development of conversational ‘chatbot’ agents (Siblini et al. 2019). Developments over
the last five years in deep neural transformer-based models have led to significant
improvements in QA performance, especially in the multiple-choice setting. Bidirec-
tional Encoder Representations from Transformers (BERT) (Devlin et al. 2018) is a
neural transformer-based model that was pre-trained by Google and that consequently
achieved state-of-the-art performance in a range of NLP tasks, including QA and Web
search. BERT is designed to help computers understand the meaning of ambiguous
language in the text by using the surrounding text to establish context, and depends
on capabilities such as bidirectional encoding capability, masked language modeling
(MLM) and next sentence prediction.
E-mail: keshen@isi.edu.
∗∗ Corresponding Author: kejriwal@isi.edu
BERT, and other models based on BERT, such as Patentbert (Lee and Hsiang 2019),
Docbert (Adhikari et al. 2019), SciBERT (Beltagy, Lo, and Cohan 2019), DistilBERT (Sanh
et al. 2019) and K-bert (Liu et al. 2020), have achieved groundbreaking results in diverse
language understanding tasks, including QA (Reddy, Chen, and Manning 2019; Fan
et al. 2019; Lewis et al. 2019), text summarization (Liu and Lapata 2019; Zhang, Wei, and
Zhou 2019), sentence prediction (Shin, Lee, and Jung 2019; Lan et al. 2019), dialogue
response generation (Zhang et al. 2019; Wang et al. 2019), natural language inference
(McCoy, Pavlick, and Linzen 2019; Richardson et al. 2020), and sentiment classification
(Gao et al. 2019; Thongtan and Phienthrakul 2019; Munikar, Shakya, and Shrestha 2019).
The model studied in this paper, RoBERTa (Liu et al. 2019b), is a highly optimized
version of the original BERT architecture that was first published in 2019 and improved
over BERT on various benchmarks by margins ranging from 0.9 percent [on the Quora Question Pairs dataset (Iyer, Dandekar, and Csernai 2016)] to 16.2 percent [on the Recognizing Textual Entailment dataset (Dagan, Glickman, and Magnini 2005; Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009)].
Specifically, RoBERTa is trained with larger mini-batches and learning rates, re-
moves the next-sentence pre-training objective, and focuses on improving the MLM
objective to deliver improved performance, compared to BERT, on problems such
as Multi-Genre Natural Language Inference (Williams, Nangia, and Bowman 2017),
and Question-Based Natural Language Inference (Rajpurkar et al. 2016). RoBERTa-
based models have approached near-human performance on various (subsequently
described) commonsense natural language understanding (NLU) benchmarks.
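To make the multiple-choice interface exposed by such models concrete, the sketch below shows how per-choice confidences can be obtained from a RoBERTa-style model. It is a minimal illustration, assuming the Hugging Face transformers library and the generic roberta-large checkpoint (whose multiple-choice head is not fine-tuned), and with a made-up example instance; it stands in for, but is not, the fine-tuned RoBERTa-based system actually evaluated in this paper.

```python
# Minimal sketch of scoring a multiple-choice instance with a RoBERTa-style model.
# Assumes the Hugging Face `transformers` library and the generic `roberta-large`
# checkpoint; the multiple-choice head here is untrained, so the confidences only
# illustrate the interface, not the fine-tuned system studied in this paper.
import torch
from transformers import AutoTokenizer, RobertaForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = RobertaForMultipleChoice.from_pretrained("roberta-large")

prompt = "Carl went to the store looking for flour tortillas."   # hypothetical example
choices = ["Carl left the store empty-handed.",
           "The store had closed for renovations."]

# Pair the prompt with each candidate choice; the model expects inputs of
# shape (batch_size, num_choices, sequence_length).
encoded = tokenizer([prompt] * len(choices), choices,
                    return_tensors="pt", padding=True)
inputs = {name: tensor.unsqueeze(0) for name, tensor in encoded.items()}

with torch.no_grad():
    logits = model(**inputs).logits             # shape: (1, num_choices)

confidences = torch.softmax(logits, dim=-1)[0]  # per-choice confidences
predicted_choice = int(confidences.argmax())    # the model's predicted choice
```

In the formalism introduced later in this section, the softmax scores play the role of the confidence set C and the argmax that of the predicted choice.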
BERT’s original success on these NLU tasks has also motivated researchers to adapt
it for multi-modal language representation (Lu et al. 2019; Sun et al. 2019), cross-
lingual language models (Lample and Conneau 2019), and domain-specific language
models, including in the medicine- (Alsentzer et al. 2019; Wang et al. 2020) and biology-
related domains (Lee et al. 2020). Due to this widespread use, and the fact that even
recent, more advanced models with billions of parameters are based on similar
technology (deep transformers), it has become important to systematically study the
linguistic properties of BERT using a battery of tests inspired by work first conducted
in the behavioral sciences. In prior work, for example, several approaches have been proposed to study the knowledge encoded within BERT, including fill-in-the-gap probes
of MLM (Rogers, Kovaleva, and Rumshisky 2020; Wu et al. 2019), analysis of self-
attention weights (Kobayashi et al. 2020; Ettinger 2020), the probing of classifiers with
different BERT representations as inputs (Liu et al. 2019a; Warstadt and Bowman 2020),
and a ‘CheckList’ style approach to systematically evaluate the linguistic capability of a
BERT-based model (Ribeiro et al. 2020). Evidence from this line of research suggests that
BERT encodes a hierarchy of linguistic information, with surface features at the bottom,
syntactic features in the middle and semantic features at the top (Jawahar, Sagot, and
Seddah 2019). It ‘naturally’ learns syntactic information from pre-training text.
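As an illustration of the fill-in-the-gap style of MLM probing cited above, the following minimal sketch (assuming the Hugging Face transformers library and the public roberta-base checkpoint, neither of which is claimed to be used by the cited studies) queries a masked position and prints the model's top completions with their scores.

```python
# Minimal sketch of a fill-in-the-gap (MLM) probe. Assumes the Hugging Face
# `transformers` library and the public `roberta-base` checkpoint; it is only
# meant to illustrate the style of probing discussed above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is "<mask>". Inspect the top-3 completions and scores.
for candidate in fill_mask("A robin is a <mask>.")[:3]:
    print(f"{candidate['token_str'].strip():<12s} score={candidate['score']:.3f}")
```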
However, it has been found that while information can be recovered from its token
representation (Wu et al. 2020), it does not fully ‘understand’ naturalistic concepts like
negation, and is insensitive to malformed input (Rogers, Kovaleva, and Rumshisky
2020). The latter finding is related to adversarial experiments (not unlike those conducted in the computer vision community) that researchers have used to test BERT's robustness. Some of these experiments have shown that, even though BERT encodes information about entity types, relations, semantic roles, and proto-roles well, it struggles with the representation of numbers (Wallace et al. 2019b) and is also brittle to named-entity replacements (Balasubramanian et al. 2020). In addition, prior work (?) found
clear evidence that fine-tuned BERT-based language representation models still do not
generalize well, and may, in fact, be susceptible to dataset bias.
We describe a novel set of systematic confusion probes to test linguistically relevant
properties of a standard, and currently widely-used, multiple-choice NLU system based
on RoBERTa.1 Our probes, described below, are not only replicable, but can be extended
to other benchmarks and even newer language representation models as we show
through preliminary additional experiments (described in Discussion) involving the
recent T5-11B model. Unlike much of the prior work on this subject, we are not seeking
to understand the layers of a specific network or how it encodes knowledge, but rather,
to understand the commonsense properties of these models. A clear understanding of
such properties allows us to test whether such language models, which are continuing
to be rolled out in commercial products, are truly answering questions in a robust
manner, or are disproportionately impacted by problems such as prior bias and choice
paralysis. We subsequently define these notions more precisely, but intuitively, prior bias
occurs when a language representation model has a consistent and statistically significant
preference for selecting one incorrect candidate choice over another. Such a bias is usually
undesirable, as it indicates the model may be amenable to being ‘tricked’ e.g., by
introducing perturbations of the kind explored in this paper.
Choice paralysis, on the other hand, occurs when the preference of the model for the
correct candidate choice significantly and consistently diminishes as more (incorrect)
choices are offered to a model in response to a prompt. Choice paralysis is inspired
by a similarly named phenomenon in the behavioral and decision sciences.2 A related
(although somewhat broader) problem in decision sciences is analysis paralysis (Lenz
and Lyles 1985), wherein it was found that giving people too many options can make it
more difficult for them to choose between them. We experimentally test whether an
analogous problem is observed in multiple-choice NLI QA systems (Schwartz 2004;
Kahneman 2011).
Before proceeding with working definitions of choice paralysis and prior bias, we
introduce some basic formalism for placing the remainder of the paper in context.
First, let us define an instance $I = (p, A)$ as a pair composed of a prompt $p$ and a set $A = \{a_1, a_2, \ldots, a_n\}$ of $n$ candidate choices. We clarify the reasons for this terminology in Methods, but intuitively, we use prompt (rather than question) because the input may not be a proper question.3 An example of such a case is provided in the center of Fig 1.
Given an instance $I = (p, A)$, we assume that exactly one of the choices $\hat{a} \in A$ is correct. Given a language representation model (such as one based on RoBERTa) $f$ that is designed to handle multi-choice NLI instances, we assume the output of $f$, given $I$, to be $(a', C)$, where $a' \in A$ is the model's predicted choice, and $C = \{c_1, c_2, \ldots, c_n\}$ is a confidence set that includes the model's confidence $c_i$ per candidate choice $a_i \in A$. We denote the variance of $C$ (calculated per instance) as $\sigma_C \geq 0$. We say that $I$ is perturbed either if $p$ is changed in some manner (including being assigned the 'empty string') or if $A$ is modified through addition, deletion, substitution, or any other modification of candidate choices.
1. Further detailed, with links to the publicly available code, in Methods.
2. Other common names include overchoice and choice overload.
3. In general, commonsense benchmarks are NLU benchmarks, which may involve QA, but do not have to. In some cases, the task is abductive reasoning, while in others, the task is NLI or even goal satisfaction, as in the case of the subsequently described Physical IQA benchmark.
4. We discuss four specific types of perturbations or 'confusion probes' used in our experimental study in the next section, but for present purposes, we note that these definitions and formalism apply regardless of the form of the perturbation itself.
Figure 1
An example (from the real-world abductive NLI benchmark) of the four confusion probes used
in this paper as perturbation-based interventions and detailed on the next page. Prompt-based
interventions are shown at the top, and choice-based interventions are shown at the bottom. In
aNLI, the prompt in an instance comprises two observations, and the candidate choices are the
two hypotheses, of which exactly one is considered correct in the original unperturbed instance.
Finally, if a perturbation applied on $I$ results in the perturbed instance $I_p$ not having any theoretically correct choice in response to the prompt,4 but $\hat{a}$ is still a candidate choice, we refer to $\hat{a}$ as the pseudo-correct choice. An obvious example of when this occurs is a perturbation that 'deleted' the prompt by assigning it the empty string. Since there is no prompt, none of the candidate choices are theoretically correct or incorrect. Assuming that $A$ was not modified, the pseudo-correct choice would be $\hat{a}$.
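For readers who prefer code to notation, the following minimal sketch (with hypothetical field and method names; it is not the data structure used in our released code) captures the formalism introduced above: an instance, a model prediction with its confidence set $C$, and the per-instance variance $\sigma_C$.

```python
# Minimal sketch of the formalism above. Field and method names are hypothetical
# and chosen for illustration only.
from dataclasses import dataclass
from statistics import pvariance
from typing import List

@dataclass
class Instance:
    prompt: str               # the prompt p (may be the empty string after a perturbation)
    choices: List[str]        # the candidate choice-set A = {a_1, ..., a_n}
    correct_index: int        # index of the correct choice \hat{a} in the original instance

@dataclass
class Prediction:
    predicted_index: int      # the model's predicted choice a'
    confidences: List[float]  # the confidence set C = {c_1, ..., c_n}

    def confidence_variance(self) -> float:
        """Per-instance variance of C, denoted sigma_C in the text."""
        return pvariance(self.confidences)
```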
With these basic preliminaries in place, we define the prior bias of a multiple-choice
NLI model with respect to a perturbed instance below:
Definition 1 (Prior Bias). Given an original instance $I = (p, A)$ (with correct choice $\hat{a} \in A$), a perturbation of that instance $I_p$ that does not have any correct choice, and a multiple-choice NLI model $f$ that respectively outputs $(a', C)$ and $(a'_p, C_p)$ given $I$ and $I_p$, we define $\sigma'_C$ (the variance of $C_p$) as the prior bias of $f$ with respect to $I_p$.

Note that Definition 1 above only applies to perturbed instances where there is no correct choice (although there may potentially be a pseudo-correct choice $\hat{a} \in A$, depending on the specific type of perturbation applied). Definition 1 can also be generalized to quantify the prior bias of $f$ on an NLI benchmark (with respect to a specific perturbation) by aggregating $\sigma'_C$ across all perturbed instances in the benchmark. Only if the null hypothesis that $\sigma'_C = 0$ cannot be rejected at a given level of confidence can we say with statistical certitude that $f$ does not have prior bias on that benchmark, with respect to the applied perturbation.
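Under Definition 1, and reading $\sigma'_C$ as the variance of the confidence set $C_p$ on the perturbed instance, prior bias can be computed as in the minimal sketch below (hypothetical helper names; the statistical test itself is described in Methods).

```python
# Minimal sketch of Definition 1 with hypothetical helper names. Since the
# perturbed instance has no correct choice, an unbiased model would spread its
# confidence uniformly, giving zero variance; systematic departure from zero
# is prior bias.
from statistics import mean, pvariance
from typing import List

def prior_bias(perturbed_confidences: List[float]) -> float:
    """sigma'_C for a single perturbed instance I_p."""
    return pvariance(perturbed_confidences)

def benchmark_prior_bias(per_instance_confidences: List[List[float]]) -> float:
    """Aggregate sigma'_C over all perturbed instances in a benchmark (here by
    averaging); the paper then tests the null hypothesis sigma'_C = 0."""
    return mean(prior_bias(c) for c in per_instance_confidences)
```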
Next, using the same formalism, we can define choice paralysis in the context of
multiple-choice NLI models:
Definition 2 (Choice Paralysis). Given an original instance $I = (p, A)$ (with correct choice $\hat{a} \in A$), a perturbation of that instance $I_p = (p_p, A_p)$ (with $|A_p| > |A|$, $\hat{a} \in A_p$, and $\hat{a}$ being the correct choice in response to prompt $p_p$), and a multiple-choice NLI model $f$, $f$ is said to have choice paralysis with respect to $I_p$ if the confidence of $f(I_p)$ in $\hat{a}$ is significantly lower than the confidence of $f(I)$ in $\hat{a}$. Denoting these two confidences respectively as $\hat{c}_p$ and $\hat{c}$, the magnitude of choice paralysis is given by $\hat{c} - \hat{c}_p$.
Note that the direction of the subtraction matters, i.e., $f$ can theoretically have negative choice paralysis whereby its confidence in the correct answer actually increases when a specific perturbation introduces a choice-set that is larger than the original choice-set. However, one important aspect that we note about Definition 2 is that $A_p$ does not have to be a super-set of $A$, although it is required to be larger, and at minimum, must contain the correct choice $\hat{a}$, similar to $A$. Furthermore, while there is no restriction on also perturbing the prompt $p$ (to a new prompt $p_p$), the perturbation must not be such that the theoretically correct answer changes. In practice, as we subsequently describe, our perturbation functions operate either at the level of the prompt, or choice-set, but not both.
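The magnitude in Definition 2 reduces to a single subtraction, sketched below with hypothetical argument names; both inputs are the model's confidence in the (pseudo-)correct choice $\hat{a}$, before and after the choice-set is enlarged.

```python
# Minimal sketch of Definition 2 with hypothetical argument names. A positive
# value indicates choice paralysis (confidence in \hat{a} dropped after the
# choice-set was enlarged); a negative value indicates that the confidence increased.
def choice_paralysis(conf_correct_original: float,
                     conf_correct_perturbed: float) -> float:
    """Magnitude of choice paralysis, \hat{c} - \hat{c}_p."""
    return conf_correct_original - conf_correct_perturbed
```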
Finally, we note that, although both Definitions 1 and 2 impose some constraints on
the types of perturbations that can be applied as interventions on the original instances
in a benchmark, they can work with any perturbation functions that adhere to these
constraints. For instance, as noted earlier, Definition 1 can be used to measure prior
bias as long as there is no theoretically correct choice in the perturbed instance. While
it may be possible to modify the definition to also measure prior bias if this were not
the case, we leave such an expanded definition and its empirical validation to future
research. Conversely, Definition 2 assumes that, even after the perturbation, the instance
continues to contain a single theoretically correct choice in its (expanded) candidate
choice-set $A$ in response to the prompt.
Since the definitions do not dictate specific perturbation functions, in order to
conduct experiments, we need to devise one or more perturbation functions that enable
us to quantify these phenomena for real-world multi-choice NLI benchmarks, and a suf-
ficiently powerful language representation model that can handle not only the original
benchmarks, but also their perturbed versions. Next, we describe the four perturbation
functions used in this paper for studying these phenomena.
1.1 Perturbation Methodology: Prompt-Based and Choice-Based Confusion Probes
We designed a set of four perturbation functions, also called confusion probes, that
operate by systematically transforming multiple-choice NLI instances in four publicly
available benchmarks, which have been widely used in the literature for assessing ma-
chine commonsense performance. These four benchmarks test the ability of a language
representation model to select the best possible explanation for a given set of obser-
vations [aNLI (Bhagavatula et al. 2020, 2019)], do grounded commonsense inference
[HellaSwag (Zellers et al. 2019a,b)], reason about both the prototypical use of objects and
non-prototypical, but practically plausible, use of objects [PIQA (Bisk et al. 2020a,b)],
and answer social commonsense questions [SocialIQA (Sap et al. 2019a,b)].
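Before detailing the individual probes, the sketch below illustrates the two families of interventions they instantiate, one prompt-based and one choice-based. It reuses the hypothetical Instance dataclass from the earlier sketch and is not the implementation used in our experiments.

```python
# Minimal sketch of the two intervention families (prompt-based and choice-based),
# reusing the hypothetical Instance dataclass defined earlier; the concrete probes
# used in the experiments are described in the text and in Methods.
import random
from dataclasses import replace
from typing import List

def delete_prompt(instance: Instance) -> Instance:
    """Prompt-based intervention: assign the empty string to the prompt, so that
    no choice is theoretically correct and \hat{a} becomes pseudo-correct."""
    return replace(instance, prompt="")

def expand_choices(instance: Instance, distractors: List[str], k: int) -> Instance:
    """Choice-based intervention: add k incorrect choices (e.g., sampled from other
    instances), enlarging the choice-set while keeping the correct choice \hat{a}."""
    return replace(instance, choices=instance.choices + random.sample(distractors, k))
```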
Each of the four confusion probes intervenes either at the level of the prompt or at the level of the candidate choices, but not both. As noted earlier, an instance comprises both a prompt
and a candidate choice-set. The prompt may or may not be an actual ‘question’, as
understood grammatically. For example, as shown at the center of Fig 1, a single QA
instance in the aNLI benchmark consists of two observations (the prompt) and two
hypotheses (the choices or answers), of which one must be selected as being the best
possible explanation for the given observations (abductive reasoning). In each of the
four benchmarks, the structure of the instance is fixed, including the way in which the