Uncertainty Quantification with Pre-trained Language Models:
A Large-Scale Empirical Analysis
Yuxin Xiao1, Paul Pu Liang2, Umang Bhatt3,
Willie Neiswanger4, Ruslan Salakhutdinov2, Louis-Philippe Morency2
1Massachusetts Institute of Technology, 2Carnegie Mellon University,
3University of Cambridge, 4Stanford University
1yuxin102@mit.edu 2{pliang,rsalakhu,morency}@cs.cmu.edu
3usb20@cam.ac.uk 4neiswanger@cs.stanford.edu
Abstract
Pre-trained language models (PLMs) have gained increasing popularity due to their compelling prediction performance in diverse natural language processing (NLP) tasks. When formulating a PLM-based prediction pipeline for NLP tasks, it is also crucial for the pipeline to minimize the calibration error, especially in safety-critical applications. That is, the pipeline should reliably indicate when we can trust its predictions. In particular, there are various considerations behind the pipeline: (1) the choice and (2) the size of the PLM, (3) the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and many more. Although prior work has looked into some of these considerations, it usually draws conclusions from a limited scope of empirical studies, and a holistic analysis of how to compose a well-calibrated PLM-based prediction pipeline is still lacking. To fill this void, we compare a wide range of popular options for each consideration on three prevalent NLP classification tasks and under domain shift. Based on this comparison, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning.
1 Introduction
PLMs (Qiu et al., 2020; Min et al., 2021) have achieved state-of-the-art performance on a broad spectrum of NLP benchmarks (Rajpurkar et al., 2016, 2018; Wang et al., 2019a,b) and are increasingly popular in various downstream applications such as question answering (Yoon et al., 2019; Garg et al., 2020), text classification (Arslan et al., 2021; Limsopatham, 2021), and relation extraction (Zhou et al., 2021; Xiao et al., 2022). Consequently, it is paramount for PLMs to faithfully communicate when to (or not to) rely on their predictions for decision-making, especially in high-stakes scenarios. In these cases, we need PLMs to quantify their uncertainty accurately and calibrate well (Abdar et al., 2021), meaning that their predictive confidence should be a valid estimate of how likely they are to make a correct prediction. Consider an example of medical question answering (Yoon et al., 2019; Zhang et al., 2021) where a PLM is asked to assist doctors when diagnosing diseases. If the PLM is 90% sure that a patient is healthy, the predicted outcome should occur 90% of the time in practice. Otherwise, it may adversely affect doctors' judgment and lead to catastrophic consequences. Hence, since PLMs have become the de facto paradigm for many NLP tasks, it is necessary to assess their calibration quality.

When constructing a well-calibrated PLM-based prediction pipeline for NLP tasks, various considerations are involved. To name a few:
1. Due to the use of diverse pre-training datasets and strategies, different PLMs may behave differently regarding calibration.
2. The model size of PLMs may also affect their capability in calibration.
3. Leveraging uncertainty quantifiers (e.g., Temp Scaling (Guo et al., 2017) and MC Dropout (Gal and Ghahramani, 2016)) alongside PLMs in the pipeline may reduce calibration error.
4. Some losses (e.g., Focal Loss (Mukhoti et al., 2020) and Label Smoothing (Müller et al., 2019)) may fine-tune PLMs to calibrate better; a minimal sketch of such a loss is given after this list.
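As a concrete reference for consideration (4), the following is a minimal PyTorch sketch of the focal loss in its standard form, $\ell_{\text{FL}} = -(1 - p_y)^{\gamma}\log p_y$; the function name, the default $\gamma = 2$, and the mean reduction are illustrative choices of ours, not details prescribed by Mukhoti et al. (2020) or by our experiments.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss for multi-class classification.

    logits:  (batch, num_classes) raw classifier outputs.
    targets: (batch,) integer class labels.
    gamma:   focusing parameter; gamma = 0 recovers plain cross entropy.
    """
    # log p_y and p_y for the true class of each example.
    log_probs = F.log_softmax(logits, dim=-1)
    log_p_y = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    p_y = log_p_y.exp()
    # Down-weight well-classified examples by (1 - p_y)^gamma.
    return (-((1.0 - p_y) ** gamma) * log_p_y).mean()
```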
Although some of these considerations have been studied before, the ideal choice for each consideration remains obscure. On the one hand, Desai and Durrett (2020) report unconventional calibration behavior for PLMs, which casts doubt on the prior beliefs drawn from traditional neural networks by Guo et al. (2017). On the other hand, existing work (Desai and Durrett, 2020; Dan and Roth, 2021) on PLMs' empirical calibration performance often looks at a single consideration and concludes by comparing only one or two types of PLMs.
Therefore, in this paper, we present a comprehensive analysis of the four pivotal considerations introduced above via large-scale empirical evaluations. To ensure that our analysis is applicable to various NLP tasks and resilient to domain shift, we set up three NLP tasks (i.e., Sentiment Analysis, Natural Language Inference, and Commonsense Reasoning) and prepare both in-domain and out-of-domain testing sets for each task. In addition to the explicit metrics of prediction and calibration error, we also utilize two evaluation tasks to examine calibration quality implicitly: selective prediction lowers prediction error by abstaining on uncertain testing points, and out-of-domain detection checks whether a pipeline is less confident on unseen domains. By comparing four to five options for each consideration, we recommend the following:
1. Use ELECTRA (Clark et al., 2020) as the PLM to encode input text sequences.
2. Use the larger version of a PLM if possible.
3. Use Temp Scaling (Guo et al., 2017) for post hoc uncertainty recalibration.
4. Use Focal Loss (Mukhoti et al., 2020) during the fine-tuning stage.
Compared to prior work, our extensive empirical evaluations also reveal the following novel observations that are unique to PLM-based pipelines:
- The calibration quality of PLMs is relatively consistent across tasks and domains, except that XLNet (Yang et al., 2019) is the most vulnerable to domain shift.
- In contrast to the other NLP tasks, larger PLMs are better calibrated in-domain in Commonsense Reasoning.
- Uncertainty quantifiers (e.g., Temp Scaling) are generally more effective at improving calibration out-of-domain.
- Ensemble (Lakshminarayanan et al., 2017) is less effective in PLM-based pipelines.
To encourage future work towards better uncertainty quantification in NLP, we release our code and large-scale evaluation benchmarks containing 120 PLM-based pipelines based on four metrics (prediction and calibration error, selective prediction, and out-of-domain detection). These pipelines consist of distinct choices concerning the four considerations and are tested on all three NLP tasks under both in- and out-of-domain settings.[1]

[1] Our data and code are available at https://github.com/xiaoyuxin1002/UQ-PLM.git.
2 Background
2.1 Problem Formulation
Datasets. In this work, we focus on utilizing PLMs for NLP classification tasks. More specifically, consider such a task where the training set $\mathcal{D}_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{N_{\text{train}}}$ consists of pairs of a text sequence $x_i \in \mathcal{X}_{\text{in}}$ and an associated label $y_i \in \mathcal{Y}$. Similarly, the validation set $\mathcal{D}_{\text{val}}$ and the in-domain testing set $\mathcal{D}_{\text{in}}$ come from the same domain $\mathcal{X}_{\text{in}}$ and share the same label space $\mathcal{Y}$. We also prepare an out-of-domain testing set $\mathcal{D}_{\text{out}}$, which differs from the others by coming from a distinct domain $\mathcal{X}_{\text{out}}$.
PLM-based Pipeline. We apply a PLM $\mathcal{M}$ to encode an input text sequence $x_i$ and feed the encoding vector to a classifier $\mathcal{F}$, which outputs a predictive distribution $u_i$ over the label space $\mathcal{Y}$ via the softmax operation. Here, the parameters of $\mathcal{M}$ and $\mathcal{F}$ are fine-tuned by minimizing a loss function $\ell$ on $\mathcal{D}_{\text{train}}$. It is optional to modify the distribution $u_i$ post hoc with an uncertainty quantifier $\mathcal{Q}$ to reduce calibration error. We define the predicted label as $\hat{y}_i = \arg\max_{j \in \{1, \dots, |\mathcal{Y}|\}} u_{ij}$, with the corresponding confidence $\hat{c}_i = u_{i\hat{y}_i}$.
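To make this pipeline concrete, the sketch below wires a Hugging Face checkpoint and classification head into the $\mathcal{M} \rightarrow \mathcal{F} \rightarrow \text{softmax}$ flow and reads off $\hat{y}_i$ and $\hat{c}_i$; the checkpoint name, the three-way label space, and the use of simple temperature scaling of the logits as the post hoc quantifier $\mathcal{Q}$ are placeholder assumptions for illustration, not the exact configuration of our experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder choices: any PLM checkpoint and label-space size would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

def predict(text: str, temperature: float = 1.0):
    """Return (predicted label, confidence) for one input sequence.

    temperature > 1 plays the role of a post hoc quantifier Q that softens
    the predictive distribution u_i; temperature = 1 leaves it unchanged.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits              # shape: (1, |Y|)
    u = torch.softmax(logits / temperature, dim=-1)  # predictive distribution u_i
    confidence, label = u.max(dim=-1)                # c_i_hat = u_i at y_i_hat
    return label.item(), confidence.item()

# Example usage (the classification head is not fine-tuned here, so outputs are arbitrary):
print(predict("The premise entails the hypothesis."))
```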
Calibration. One crucial goal of uncertainty quantification is to improve calibration. That is, the predicted confidence should match the empirical likelihood: $P(y_i = \hat{y}_i \mid \hat{c}_i) = \hat{c}_i$. We follow Guo et al. (2017) in using the expected calibration error (ECE) to assess calibration performance; the calculation of ECE is described in Section 3.1. To reduce ECE, our main experimental evaluation examines four considerations involved in a PLM-based pipeline: (1) the choice of PLM $\mathcal{M}$ (Section 3), (2) the size of PLM $\mathcal{M}$ (Section 4), (3) the choice of uncertainty quantifier $\mathcal{Q}$ (Section 5), and (4) the choice of loss function $\ell$ (Section 6).
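For reference, the sketch below computes ECE with the usual equal-width binning of Guo et al. (2017), $\text{ECE} = \sum_b \frac{|B_b|}{N}\,\lvert \text{acc}(B_b) - \text{conf}(B_b) \rvert$; the choice of 10 bins is an assumption for illustration rather than the exact setting of our experiments.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins: int = 10) -> float:
    """Equal-width-binned ECE.

    confidences: predicted confidences c_i_hat in [0, 1].
    correct:     booleans, True where y_i_hat == y_i.
    num_bins:    number of equal-width confidence bins (10 is an assumption).
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy - average confidence| in this bin, weighted by bin mass.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap
    return ece

# Example: 0.8-confidence predictions that are right 80% of the time give ECE = 0.
print(expected_calibration_error([0.8] * 10, [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))
```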
2.2 Related Work
Uncertainty quantification has drawn long-lasting attention from various domains (Bhatt et al., 2021), such as weather forecasting (Brier et al., 1950; Raftery et al., 2005), medical practice (Yang and Thompson, 2010; Jiang et al., 2012), and machine translation (Ott et al., 2018; Zhou et al., 2020; Wei et al., 2020). Researchers have approached this question from both Bayesian (Kendall and Gal, 2017; Depeweg et al., 2018) and frequentist perspectives (Alaa and Van Der Schaar, 2020a,b). They have also proposed different techniques to improve uncertainty calibration for classification (Kong et al., 2020; Krishnan and Tickoo, 2020) and regression (Kuleshov et al., 2018; Cui et al., 2020; Chung et al., 2021) tasks. Recent work has investigated connections between uncertainty and other properties, such as model interpretability (Antoran et al., 2021; Ley et al., 2022), selective prediction (Xin et al., 2021; Varshney et al., 2022a,b), and out-of-domain generalization (Wald et al., 2021; Qin et al., 2021).
PLMs (Qiu et al., 2020; Min et al., 2021) have achieved state-of-the-art prediction performance on diverse NLP benchmarks (Rajpurkar et al., 2016, 2018; Wang et al., 2019a,b) and demonstrated many desired properties, such as stronger out-of-domain robustness (Hendrycks et al., 2020) and better uncertainty calibration (Desai and Durrett, 2020). They typically leverage a Transformer architecture (Vaswani et al., 2017) and are pre-trained by self-supervised learning (Jaiswal et al., 2021). Although Guo et al. (2017) report that larger models tend to calibrate worse, PLMs have been shown to produce well-calibrated uncertainty in practice (Desai and Durrett, 2020), despite their giant model sizes. This unusual calibration behavior puts in doubt the observations drawn from traditional neural networks (Ovadia et al., 2019; Mukhoti et al., 2020) or pre-trained vision models (Minderer et al., 2021). Prior work (Desai and Durrett, 2020; Dan and Roth, 2021) on the calibration of PLMs often explores only one or two types of PLMs and ignores uncertainty quantifiers and fine-tuning losses beyond Temp Scaling and Cross Entropy, respectively. As a result, a holistic analysis that explores the full set of these considerations in a PLM-based pipeline is still lacking. Our paper therefore aspires to fill this void via extensive empirical studies.
3 Which Pre-trained Language Model?
3.1 Experiment Setup
To evaluate the calibration performance of PLMs, we consider a series of NLP classification tasks:
1. Sentiment Analysis identifies the binary sentiment of a text sequence. We treat the IMDb movie review dataset (Maas et al., 2011) as in-domain and the Yelp restaurant review dataset (Zhang et al., 2015) as out-of-domain.
2. Natural Language Inference predicts the relationship between a hypothesis and a premise. We regard the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which covers a range of genres of spoken and written text, as in-domain and the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), which is derived from image captions only, as out-of-domain.
3. Commonsense Reasoning determines the most reasonable continuation of a sentence among four candidates. We view the Situations With Adversarial Generations (SWAG) dataset (Zellers et al., 2018) as in-domain and its adversarial variant (HellaSWAG) (Zellers et al., 2019) as out-of-domain.
For each task, we construct $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{val}}$, and $\mathcal{D}_{\text{in}}$ from the corresponding in-domain dataset, and $\mathcal{D}_{\text{out}}$ from the corresponding out-of-domain dataset. The original validation set of each dataset is split in half randomly to form a held-out, non-blind testing set (i.e., $\mathcal{D}_{\text{in}}$ or $\mathcal{D}_{\text{out}}$). Table 1 describes the task details; a minimal construction sketch for the Sentiment Analysis splits is given after the tables below.

              Sentiment Analysis   Natural Language Inference   Commonsense Reasoning
X_in          IMDb                 MNLI                         SWAG
X_out         Yelp                 SNLI                         HellaSWAG
|Y|           2                    3                            4
|D_train|     25,000               392,702                      73,546
|D_val|       12,500               4,907                        10,003
|D_in|        12,500               4,908                        10,003
|D_out|       19,000               4,923                        5,021

Table 1: In- and out-of-domain datasets, label space size, and data split sizes of the three NLP tasks.

Hugging Face Model Name        Size   Pre-training Corpus Size   Pre-training Task
bert-base-cased                109M   16G                        Masked LM, NSP
xlnet-base-cased               110M   161G                       Permuted LM
electra-base-discriminator     110M   161G                       Replacement Detection
roberta-base                   125M   161G                       Dynamic Masked LM
deberta-base                   140M   85G                        Dynamic Masked LM
bert-large-cased               335M   16G                        Masked LM, NSP
xlnet-large-cased              340M   161G                       Permuted LM
electra-large-discriminator    335M   161G                       Replacement Detection
roberta-large                  335M   161G                       Dynamic Masked LM
deberta-large                  350M   85G                        Dynamic Masked LM

Table 2: Model size, pre-training corpus size, and pre-training task of the five PLMs, separated into the base (upper) and the large (lower) versions.
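As referenced above, the following is a minimal sketch of the split construction for the Sentiment Analysis task with the Hugging Face datasets library; the dataset identifiers (imdb, yelp_polarity), the use of each corpus's held-out test split as the "original validation set", and the fixed seed are assumptions for illustration rather than the exact recipe behind Table 1.

```python
from datasets import load_dataset

SEED = 0  # assumed seed; any fixed seed gives a reproducible half split

# In-domain: IMDb. Its training split serves as D_train; its held-out split is
# halved randomly into D_val and D_in.
imdb = load_dataset("imdb")
d_train = imdb["train"]
halves = imdb["test"].train_test_split(test_size=0.5, seed=SEED)
d_val, d_in = halves["train"], halves["test"]

# Out-of-domain: Yelp. Half of its held-out split forms D_out.
yelp = load_dataset("yelp_polarity")
d_out = yelp["test"].train_test_split(test_size=0.5, seed=SEED)["test"]

print(len(d_train), len(d_val), len(d_in), len(d_out))  # 25000 12500 12500 19000
```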
To understand which PLM delivers the lowest calibration error, we examine five popular options:
1. BERT (Devlin et al., 2019) utilizes a bidirectional Transformer architecture pre-trained by masked language modeling (LM) and next sentence prediction (NSP).
2. XLNet (Yang et al., 2019) proposes a two-stream self-attention mechanism and a pre-training objective of permuted LM.
3. ELECTRA (Clark et al., 2020) pre-trains a discriminator to detect replaced tokens.