
regression (Kuleshov et al., 2018; Cui et al., 2020; Chung et al., 2021) tasks. Recent work has investigated connections between uncertainty and other properties, such as model interpretability (Antoran et al., 2021; Ley et al., 2022), selective prediction (Xin et al., 2021; Varshney et al., 2022a,b), and out-of-domain generalization (Wald et al., 2021; Qin et al., 2021).
PLMs (Qiu et al., 2020; Min et al., 2021) have achieved state-of-the-art prediction performance on diverse NLP benchmarks (Rajpurkar et al., 2016, 2018; Wang et al., 2019a,b) and demonstrated many desirable properties, such as stronger out-of-domain robustness (Hendrycks et al., 2020) and better uncertainty calibration (Desai and Durrett, 2020). They typically leverage a Transformer architecture (Vaswani et al., 2017) and are pre-trained via self-supervised learning (Jaiswal et al., 2021).
Although Guo et al. (2017) report that larger models tend to be worse calibrated, PLMs have been shown to produce well-calibrated uncertainty in practice (Desai and Durrett, 2020), despite their large model sizes. This unusual calibration behavior casts doubt on observations drawn from traditional neural networks (Ovadia et al., 2019; Mukhoti et al., 2020) or pre-trained vision models (Minderer et al., 2021). Prior work on the calibration of PLMs (Desai and Durrett, 2020; Dan and Roth, 2021) often explores only one or two types of PLMs and ignores uncertainty quantifiers and fine-tuning losses beyond temperature scaling and cross entropy, respectively. As a result, a holistic analysis that explores the full set of these considerations in a PLM-based pipeline is still lacking. Our paper aspires to fill this void via extensive empirical studies.
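For concreteness, temperature scaling, the uncertainty quantifier that prior work focuses on, simply divides a model's logits by a scalar temperature fitted on validation data before the softmax. A minimal NumPy sketch (an illustration, not any particular paper's implementation):

```python
import numpy as np

def temperature_scale(logits, T=1.0):
    """Rescale logits by a scalar temperature T before the softmax.

    T > 1 softens (flattens) the predicted distribution, reducing
    overconfidence; T < 1 sharpens it. T is typically fitted on a
    validation set by minimizing negative log-likelihood.
    """
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)
```

Because dividing by a single scalar preserves the argmax, temperature scaling changes a model's confidence without changing its predictions.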
3 Which Pre-trained Language Model?
3.1 Experiment Setup
To evaluate the calibration performance of PLMs,
we consider a series of NLP classification tasks:
1. Sentiment Analysis identifies the binary sentiment of a text sequence. We treat the IMDb movie review dataset (Maas et al., 2011) as in-domain and the Yelp restaurant review dataset (Zhang et al., 2015) as out-of-domain.
2. Natural Language Inference predicts the relationship between a hypothesis and a premise. We regard the Multi-Genre Natural Language Inference (MNLI) dataset (Williams et al., 2018), which covers a range of genres of spoken and written text, as in-domain and the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), derived from image captions only, as out-of-domain.
3. Commonsense Reasoning determines the most reasonable continuation of a sentence among four candidates. We view the Situations With Adversarial Generations (SWAG) dataset (Zellers et al., 2018) as in-domain and its adversarial variant HellaSWAG (Zellers et al., 2019) as out-of-domain.

           Sentiment   Natural Language   Commonsense
           Analysis    Inference          Reasoning
Xin        IMDb        MNLI               SWAG
Xout       Yelp        SNLI               HellaSWAG
|Y|        2           3                  4
|Dtrain|   25,000      392,702            73,546
|Dval|     12,500      4,907              10,003
|Din|      12,500      4,908              10,003
|Dout|     19,000      4,923              5,021

Table 1: In- and out-of-domain datasets, label space size, and size of each data split for the three NLP tasks.

Hugging Face Name             Model Size   Pre-training Corpus Size   Pre-training Task
bert-base-cased               109M         16G                        Masked LM, NSP
xlnet-base-cased              110M         161G                       Permuted LM
electra-base-discriminator    110M         161G                       Replacement Detection
roberta-base                  125M         161G                       Dynamic Masked LM
deberta-base                  140M         85G                        Dynamic Masked LM
bert-large-cased              335M         16G                        Masked LM, NSP
xlnet-large-cased             340M         161G                       Permuted LM
electra-large-discriminator   335M         161G                       Replacement Detection
roberta-large                 335M         161G                       Dynamic Masked LM
deberta-large                 350M         85G                        Dynamic Masked LM

Table 2: Model size, pre-training corpus size, and pre-training task of the five PLMs, separated into the base (upper) and large (lower) versions.
For each task, we construct Dtrain, Dval, and Din from the corresponding in-domain dataset, and Dout from the corresponding out-of-domain dataset. The original validation set of each dataset is split in half randomly to form a held-out non-blind testing set (i.e., Din or Dout). Table 1 describes the task details.
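As a concrete reading of "calibration error" in what follows, the standard metric is expected calibration error (ECE): predictions are grouped into equal-width confidence bins, and the gaps between per-bin accuracy and per-bin confidence are averaged, weighted by bin size. A minimal NumPy sketch under that standard formulation (not necessarily the exact implementation used in the experiments):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over (0, 1].

    confidences: max softmax probability per prediction.
    correct: 1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # accuracy within the bin
            conf = confidences[mask].mean()  # average confidence within the bin
            ece += mask.mean() * abs(acc - conf)  # weight by bin frequency
    return ece
```

A perfectly calibrated model (bin accuracy equals bin confidence everywhere) scores 0; an always-certain model that is right half the time scores 0.5.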
To understand which PLM delivers the lowest
calibration error, we examine five popular options:
1. BERT (Devlin et al., 2019) utilizes a bidirectional Transformer architecture pre-trained by masked language modeling (LM) and next sentence prediction (NSP).
2. XLNet (Yang et al., 2019) proposes a two-stream self-attention mechanism and a pre-training objective of permuted LM.
3. ELECTRA (Clark et al., 2020) pre-trains a