Understanding Prior Bias and Choice
Paralysis in Transformer-based Language
Representation Models through Four
Experimental Probes
Ke Shen
Information Sciences Institute,
University of Southern California, 4676
Admiralty Way, Suite 1001, Marina del
Rey, California 90292
Mayank Kejriwal∗∗
Information Sciences Institute,
University of Southern California, 4676
Admiralty Way, Suite 1001, Marina del
Rey, California 90292
Recent work on transformer-based neural networks has led to impressive advances on multiple-
choice natural language understanding (NLU) problems, such as Question Answering (QA)
and abductive reasoning. Despite these advances, there is still limited work on understanding
whether these models respond to perturbed multiple-choice instances in a sufficiently robust
manner that would allow them to be trusted in real-world situations. We present four confusion
probes, inspired by similar phenomena first identified in the behavioral science community, to
test for problems such as prior bias and choice paralysis. Experimentally, we probe a widely used
transformer-based multiple-choice NLU system using four established benchmark datasets. Here
we show that the model exhibits significant prior bias and, to a lesser but still highly significant degree, choice paralysis, in addition to other problems. Our results suggest that stronger testing protocols and additional benchmarks may be necessary before such language models are used in front-facing systems or for decision-making with real-world consequences.
1. Background
Question Answering (QA) (Hirschman and Gaizauskas 2001) and inference are im-
portant problems in natural language processing (NLP) and applied AI, including
development of conversational ‘chatbot’ agents (Siblini et al. 2019). Developments over
the last five years in deep neural transformer-based models have led to significant
improvements in QA performance, especially in the multiple-choice setting. Bidirec-
tional Encoder Representations from Transformers (BERT) (Devlin et al. 2018) is a
neural transformer-based model that was pre-trained by Google and that consequently
achieved state-of-the-art performance in a range of NLP tasks, including QA and Web
search. BERT is designed to help computers understand the meaning of ambiguous
language in the text by using the surrounding text to establish context, and depends
on capabilities such as bidirectional encoding capability, masked language modeling
(MLM) and next sentence prediction.
E-mail: keshen@isi.edu.
∗∗ Corresponding Author: kejriwal@isi.edu
BERT, and other models based on BERT, such as Patentbert (Lee and Hsiang 2019),
Docbert (Adhikari et al. 2019), SciBERT (Beltagy, Lo, and Cohan 2019), DistilBERT (Sanh
et al. 2019) and K-bert (Liu et al. 2020), have achieved groundbreaking results in diverse
language understanding tasks, including QA (Reddy, Chen, and Manning 2019; Fan
et al. 2019; Lewis et al. 2019), text summarization (Liu and Lapata 2019; Zhang, Wei, and
Zhou 2019), sentence prediction (Shin, Lee, and Jung 2019; Lan et al. 2019), dialogue
response generation (Zhang et al. 2019; Wang et al. 2019), natural language inference
(McCoy, Pavlick, and Linzen 2019; Richardson et al. 2020), and sentiment classification
(Gao et al. 2019; Thongtan and Phienthrakul 2019; Munikar, Shakya, and Shrestha 2019).
The model studied in this paper, RoBERTa (Liu et al. 2019b), is a highly optimized
version of the original BERT architecture that was first published in 2019 and improved
over BERT on various benchmarks by margins ranging from 0.9 percent [on the Quora Question Pairs dataset (Iyer, Dandekar, and Csernai 2016)] to 16.2 percent [on the Recognizing Textual Entailment dataset (Dagan, Glickman, and Magnini 2005; Haim et al. 2006; Giampiccolo et al. 2007; Bentivogli et al. 2009)].
Specifically, RoBERTa is trained with larger mini-batches and learning rates, re-
moves the next-sentence pre-training objective, and focuses on improving the MLM
objective to deliver improved performance, compared to BERT, on problems such
as Multi-Genre Natural Language Inference (Williams, Nangia, and Bowman 2017),
and Question-Based Natural Language Inference (Rajpurkar et al. 2016). RoBERTa-
based models have approached near-human performance on various (subsequently
described) commonsense natural language understanding (NLU) benchmarks.
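To make the multiple-choice interface exposed by such models concrete, the sketch below shows how per-choice confidences can be obtained from a RoBERTa-style model. It is a minimal illustration, assuming the Hugging Face transformers library and the generic roberta-large checkpoint (whose multiple-choice head is not fine-tuned), and with a made-up example instance; it stands in for, but is not, the fine-tuned RoBERTa-based system actually evaluated in this paper.

```python
# Minimal sketch of scoring a multiple-choice instance with a RoBERTa-style model.
# Assumes the Hugging Face `transformers` library and the generic `roberta-large`
# checkpoint; the multiple-choice head here is untrained, so the confidences only
# illustrate the interface, not the fine-tuned system studied in this paper.
import torch
from transformers import AutoTokenizer, RobertaForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = RobertaForMultipleChoice.from_pretrained("roberta-large")

prompt = "Carl went to the store looking for flour tortillas."   # hypothetical example
choices = ["Carl left the store empty-handed.",
           "The store had closed for renovations."]

# Pair the prompt with each candidate choice; the model expects inputs of
# shape (batch_size, num_choices, sequence_length).
encoded = tokenizer([prompt] * len(choices), choices,
                    return_tensors="pt", padding=True)
inputs = {name: tensor.unsqueeze(0) for name, tensor in encoded.items()}

with torch.no_grad():
    logits = model(**inputs).logits             # shape: (1, num_choices)

confidences = torch.softmax(logits, dim=-1)[0]  # per-choice confidences
predicted_choice = int(confidences.argmax())    # the model's predicted choice
```

In the formalism introduced later in this section, the softmax scores play the role of the confidence set C and the argmax that of the predicted choice.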
BERT’s original success on these NLU tasks has also motivated researchers to adapt
it for multi-modal language representation (Lu et al. 2019; Sun et al. 2019), cross-
lingual language models (Lample and Conneau 2019), and domain-specific language
models, including in the medicine- (Alsentzer et al. 2019; Wang et al. 2020) and biology-
related domains (Lee et al. 2020). Due to this widespread use, and the fact that even
recent, more advanced models with billions of parameters are based on similar
technology (deep transformers), it has become important to systematically study the
linguistic properties of BERT using a battery of tests inspired by work first conducted
in the behavioral sciences. In prior work, for example, several approaches have been proposed to study the knowledge encoded within BERT, including fill-in-the-gap probes
of MLM (Rogers, Kovaleva, and Rumshisky 2020; Wu et al. 2019), analysis of self-
attention weights (Kobayashi et al. 2020; Ettinger 2020), the probing of classifiers with
different BERT representations as inputs (Liu et al. 2019a; Warstadt and Bowman 2020),
and a ‘CheckList’ style approach to systematically evaluate the linguistic capability of a
BERT-based model (Ribeiro et al. 2020). Evidence from this line of research suggests that
BERT encodes a hierarchy of linguistic information, with surface features at the bottom,
syntactic features in the middle and semantic features at the top (Jawahar, Sagot, and
Seddah 2019). It ‘naturally’ learns syntactic information from pre-training text.
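As an illustration of the fill-in-the-gap style of MLM probing cited above, the following minimal sketch (assuming the Hugging Face transformers library and the public roberta-base checkpoint, neither of which is claimed to be used by the cited studies) queries a masked position and prints the model's top completions with their scores.

```python
# Minimal sketch of a fill-in-the-gap (MLM) probe. Assumes the Hugging Face
# `transformers` library and the public `roberta-base` checkpoint; it is only
# meant to illustrate the style of probing discussed above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is "<mask>". Inspect the top-3 completions and scores.
for candidate in fill_mask("A robin is a <mask>.")[:3]:
    print(f"{candidate['token_str'].strip():<12s} score={candidate['score']:.3f}")
```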
However, it has been found that while information can be recovered from its token
representation (Wu et al. 2020), it does not fully ‘understand’ naturalistic concepts like
negation, and is insensitive to malformed input (Rogers, Kovaleva, and Rumshisky
2020). The latter finding is related to adversarial experiments (not unlike those conducted in the computer vision community) that researchers have used to test BERT's robustness. Some of these experiments have shown that, even though BERT encodes information about entity types, relations, semantic roles, and proto-roles well, it struggles with the representation of numbers (Wallace et al. 2019b) and is also brittle to named-entity replacements (Balasubramanian et al. 2020). In addition, prior work (?) found
clear evidence that fine-tuned BERT-based language representation models still do not
generalize well, and may, in fact, be susceptible to dataset bias.
We describe a novel set of systematic confusion probes to test linguistically relevant
properties of a standard, and currently widely-used, multiple-choice NLU system based
on RoBERTa.1 Our probes, described below, are not only replicable, but can be extended
to other benchmarks and even newer language representation models as we show
through preliminary additional experiments (described in Discussion) involving the
recent T5-11B model. Unlike much of the prior work on this subject, we are not seeking
to understand the layers of a specific network or how it encodes knowledge, but rather,
to understand the commonsense properties of these models. A clear understanding of
such properties allows us to test whether such language models, which are continuing
to be rolled out in commercial products, are truly answering questions in a robust
manner, or are disproportionately impacted by problems such as prior bias and choice
paralysis. We subsequently define these notions more precisely, but intuitively, prior bias
occurs when a language representation model has a consistent and statistically significant
preference for selecting one incorrect candidate choice over another. Such a bias is usually
undesirable, as it indicates the model may be amenable to being ‘tricked’ e.g., by
introducing perturbations of the kind explored in this paper.
Choice paralysis, on the other hand, occurs when the preference of the model for the
correct candidate choice significantly and consistently diminishes as more (incorrect)
choices are offered to a model in response to a prompt. Choice paralysis is inspired
by a similarly named phenomenon in the behavioral and decision sciences.2 A related
(although somewhat broader) problem in decision sciences is analysis paralysis (Lenz
and Lyles 1985), wherein it was found that giving people too many options can make it
more difficult for them to choose between them. We experimentally test whether an
analogous problem is observed in multiple-choice NLI QA systems (Schwartz 2004;
Kahneman 2011).
Before proceeding with working definitions of choice paralysis and prior bias, we
introduce some basic formalism for placing the remainder of the paper in context.
First, let us define an instance $I = (p, A)$ as a pair composed of a prompt $p$ and a set $A = \{a_1, a_2, \ldots, a_n\}$ of $n$ candidate choices. We clarify the reasons for this terminology in Methods, but intuitively, we use prompt (rather than question) because the input may not be a proper question.3 An example of such a case is provided in the center of Fig 1.
Given an instance $I = (p, A)$, we assume that exactly one of the choices $\hat{a} \in A$ is correct. Given a language representation model (such as one based on RoBERTa) $f$ that is designed to handle multi-choice NLI instances, we assume the output of $f$, given $I$, to be $(a', C)$, where $a' \in A$ is the model's predicted choice, and $C = \{c_1, c_2, \ldots, c_n\}$ is a confidence set that includes the model's confidence $c_i$ per candidate choice $a_i \in A$. We denote the variance of $C$ (calculated per instance) as $\sigma_C \geq 0$. We say that $I$ is perturbed either if $p$ is changed in some manner (including being assigned the 'empty string') or if $A$ is modified through addition, deletion, substitution, or any other modification of candidate choices.
1. Further detailed, with links to the publicly available code, in Methods.
2. Other common names include overchoice and choice overload.
3. In general, commonsense benchmarks are NLU benchmarks, which may involve QA, but do not have to. In some cases, the task is abductive reasoning, while in others, the task is NLI or even goal satisfaction, as in the case of the subsequently described Physical IQA benchmark.
4. We discuss four specific types of perturbations or 'confusion probes' used in our experimental study in the next section, but for present purposes, we note that these definitions and formalism apply regardless of the form of the perturbation itself.
Figure 1
An example (from the real-world abductive NLI benchmark) of the four confusion probes used
in this paper as perturbation-based interventions and detailed on the next page. Prompt-based
interventions are shown at the top, and choice-based interventions are shown at the bottom. In
aNLI, the prompt in an instance comprises two observations, and the candidate choices are the
two hypotheses, of which exactly one is considered correct in the original unperturbed instance.
Finally, if a perturbation applied on $I$ results in the perturbed instance $I_p$ not having any theoretically correct choice in response to the prompt,4 but $\hat{a}$ is still a candidate choice, we refer to $\hat{a}$ as the pseudo-correct choice. An obvious example of when this occurs is a perturbation that 'deleted' the prompt by assigning it the empty string. Since there is no prompt, none of the candidate choices are theoretically correct or incorrect. Assuming that $A$ was not modified, the pseudo-correct choice would be $\hat{a}$.
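For readers who prefer code to notation, the following minimal sketch (with hypothetical field and method names; it is not the data structure used in our released code) captures the formalism introduced above: an instance, a model prediction with its confidence set $C$, and the per-instance variance $\sigma_C$.

```python
# Minimal sketch of the formalism above. Field and method names are hypothetical
# and chosen for illustration only.
from dataclasses import dataclass
from statistics import pvariance
from typing import List

@dataclass
class Instance:
    prompt: str               # the prompt p (may be the empty string after a perturbation)
    choices: List[str]        # the candidate choice-set A = {a_1, ..., a_n}
    correct_index: int        # index of the correct choice \hat{a} in the original instance

@dataclass
class Prediction:
    predicted_index: int      # the model's predicted choice a'
    confidences: List[float]  # the confidence set C = {c_1, ..., c_n}

    def confidence_variance(self) -> float:
        """Per-instance variance of C, denoted sigma_C in the text."""
        return pvariance(self.confidences)
```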
With these basic preliminaries in place, we define the prior bias of a multiple-choice
NLI model with respect to a perturbed instance below:
Definition 1 (Prior Bias). Given an original instance $I = (p, A)$ (with correct choice $\hat{a} \in A$), a perturbation of that instance $I_p$ that does not have any correct choice, and a multiple-choice NLI model $f$ that respectively outputs $(a', C)$ and $(a'_p, C_p)$ given $I$ and $I_p$, we define $\sigma'_C$ (the variance of $C_p$) as the prior bias of $f$ with respect to $I_p$.

Note that Definition 1 above only applies to perturbed instances where there is no correct choice (although there may potentially be a pseudo-correct choice $\hat{a} \in A$, depending on the specific type of perturbation applied). Definition 1 can also be generalized to quantify the prior bias of $f$ on an NLI benchmark (with respect to a specific perturbation) by aggregating $\sigma'_C$ across all perturbed instances in the benchmark. Only if the null hypothesis that $\sigma'_C = 0$ cannot be rejected at a given level of confidence can we say with statistical certitude that $f$ does not have prior bias on that benchmark, with respect to the applied perturbation.
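Under Definition 1, and reading $\sigma'_C$ as the variance of the confidence set $C_p$ on the perturbed instance, prior bias can be computed as in the minimal sketch below (hypothetical helper names; the statistical test itself is described in Methods).

```python
# Minimal sketch of Definition 1 with hypothetical helper names. Since the
# perturbed instance has no correct choice, an unbiased model would spread its
# confidence uniformly, giving zero variance; systematic departure from zero
# is prior bias.
from statistics import mean, pvariance
from typing import List

def prior_bias(perturbed_confidences: List[float]) -> float:
    """sigma'_C for a single perturbed instance I_p."""
    return pvariance(perturbed_confidences)

def benchmark_prior_bias(per_instance_confidences: List[List[float]]) -> float:
    """Aggregate sigma'_C over all perturbed instances in a benchmark (here by
    averaging); the paper then tests the null hypothesis sigma'_C = 0."""
    return mean(prior_bias(c) for c in per_instance_confidences)
```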
Next, using the same formalism, we can define choice paralysis in the context of
multiple-choice NLI models:
Definition 2 (Choice Paralysis). Given an original instance $I = (p, A)$ (with correct choice $\hat{a} \in A$), a perturbation of that instance $I_p = (p_p, A_p)$ (with $|A_p| > |A|$, $\hat{a} \in A_p$, and $\hat{a}$ being the correct choice in response to prompt $p_p$), and a multiple-choice NLI model $f$, $f$ is said to have choice paralysis with respect to $I_p$ if the confidence of $f(I_p)$ in $\hat{a}$ is significantly lower than the confidence of $f(I)$ in $\hat{a}$. Denoting these two confidences respectively as $\hat{c}_p$ and $\hat{c}$, the magnitude of choice paralysis is given by $\hat{c} - \hat{c}_p$.
Note that the direction of the subtraction matters, i.e., $f$ can theoretically have negative choice paralysis whereby its confidence in the correct answer actually increases when a specific perturbation introduces a choice-set that is larger than the original choice-set. However, one important aspect that we note about Definition 2 is that $A_p$ does not have to be a super-set of $A$, although it is required to be larger, and at minimum, must contain the correct choice $\hat{a}$, similar to $A$. Furthermore, while there is no restriction on also perturbing the prompt $p$ (to a new prompt $p_p$), the perturbation must not be such that the theoretically correct answer changes. In practice, as we subsequently describe, our perturbation functions operate either at the level of the prompt, or choice-set, but not both.
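The magnitude in Definition 2 reduces to a single subtraction, sketched below with hypothetical argument names; both inputs are the model's confidence in the (pseudo-)correct choice $\hat{a}$, before and after the choice-set is enlarged.

```python
# Minimal sketch of Definition 2 with hypothetical argument names. A positive
# value indicates choice paralysis (confidence in \hat{a} dropped after the
# choice-set was enlarged); a negative value indicates that the confidence increased.
def choice_paralysis(conf_correct_original: float,
                     conf_correct_perturbed: float) -> float:
    """Magnitude of choice paralysis, \hat{c} - \hat{c}_p."""
    return conf_correct_original - conf_correct_perturbed
```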
Finally, we note that, although both Definitions 1 and 2 impose some constraints on
the types of perturbations that can be applied as interventions on the original instances
in a benchmark, they can work with any perturbation functions that adhere to these
constraints. For instance, as noted earlier, Definition 1 can be used to measure prior
bias as long as there is no theoretically correct choice in the perturbed instance. While
it may be possible to modify the definition to also measure prior bias if this were not
the case, we leave such an expanded definition and its empirical validation to future
research. Conversely, Definition 2 assumes that, even after the perturbation, the instance
continues to contain a single theoretically correct choice in its (expanded) candidate
choice-set $A$ in response to the prompt.
Since the definitions do not dictate specific perturbation functions, in order to
conduct experiments, we need to devise one or more perturbation functions that enable
us to quantify these phenomena for real-world multi-choice NLI benchmarks, and a suf-
ficiently powerful language representation model that can handle not only the original
benchmarks, but also their perturbed versions. Next, we describe the four perturbation
functions used in this paper for studying these phenomena.
1.1 Perturbation Methodology: Prompt-Based and Choice-Based Confusion Probes
We designed a set of four perturbation functions, also called confusion probes, that
operate by systematically transforming multiple-choice NLI instances in four publicly
available benchmarks, which have been widely used in the literature for assessing ma-
chine commonsense performance. These four benchmarks test the ability of a language
representation model to select the best possible explanation for a given set of obser-
vations [aNLI (Bhagavatula et al. 2020, 2019)], do grounded commonsense inference
[HellaSwag (Zellers et al. 2019a,b)], reason about both the prototypical use of objects and
non-prototypical, but practically plausible, use of objects [PIQA (Bisk et al. 2020a,b)],
and answer social commonsense questions [SocialIQA (Sap et al. 2019a,b)].
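Before detailing the individual probes, the sketch below illustrates the two families of interventions they instantiate, one prompt-based and one choice-based. It reuses the hypothetical Instance dataclass from the earlier sketch and is not the implementation used in our experiments.

```python
# Minimal sketch of the two intervention families (prompt-based and choice-based),
# reusing the hypothetical Instance dataclass defined earlier; the concrete probes
# used in the experiments are described in the text and in Methods.
import random
from dataclasses import replace
from typing import List

def delete_prompt(instance: Instance) -> Instance:
    """Prompt-based intervention: assign the empty string to the prompt, so that
    no choice is theoretically correct and \hat{a} becomes pseudo-correct."""
    return replace(instance, prompt="")

def expand_choices(instance: Instance, distractors: List[str], k: int) -> Instance:
    """Choice-based intervention: add k incorrect choices (e.g., sampled from other
    instances), enlarging the choice-set while keeping the correct choice \hat{a}."""
    return replace(instance, choices=instance.choices + random.sample(distractors, k))
```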
Each of the four confusion probes intervenes either at the level of the prompt or at the level of the candidate choices, but not both. As noted earlier, an instance comprises both a prompt
and a candidate choice-set. The prompt may or may not be an actual ‘question’, as
understood grammatically. For example, as shown at the center of Fig 1, a single QA
instance in the aNLI benchmark consists of two observations (the prompt) and two
hypotheses (the choices or answers), of which one must be selected as being the best
possible explanation for the given observations (abductive reasoning). In each of the
four benchmarks, the structure of the instance is fixed, including the way in which the