
Gales (2021), and we leave the extension and exploration of such methods for different uncertainty metrics, models, and tasks to future work.
3.3 Dataset Selection & Creation
In-distribution training sets
We choose three different languages: English (Clinc Plus; Larson et al., 2019), Danish in the form of the Dan+ dataset (Plank et al., 2020), based on news texts from PAROLE-DK (Bilgram and Keson, 1998), and Finnish (UD Treebank; Haverinen et al., 2014; Pyysalo et al., 2015; Kanerva and Ginter, 2022), corresponding to the NLP tasks of sequence classification, named entity recognition, and part-of-speech tagging, respectively. An overview of the data used is given in Table 1. We use standardized low-resource languages in the case of Finnish and Danish, and simulate a low-resource setting using English data.[4]
Starting with a sufficiently sized training set and then sub-sampling allows us to create training sets of arbitrary sizes. By using languages from different families, we hope to be able to draw conclusions that generalize beyond a single language. We employ a specific sampling scheme that tries to maintain the sequence length and class distribution of the original corpus, which we explain and verify in Appendix A.2.
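As a rough illustration (not the exact scheme, which is specified and verified in Appendix A.2), the sketch below shows one way to sub-sample a corpus while approximately preserving its class distribution by drawing from per-class strata. The function name and data format are our own illustrative choices, and the additional matching of the sequence length distribution is omitted here.

```python
import random
from collections import defaultdict

def stratified_subsample(corpus, target_size, seed=42):
    """Sub-sample (text, label) pairs while roughly keeping the label distribution.

    Minimal sketch only: the scheme described in Appendix A.2 additionally
    matches the sequence length distribution of the original corpus.
    """
    random.seed(seed)
    by_label = defaultdict(list)
    for text, label in corpus:
        by_label[label].append((text, label))

    sample = []
    for label, instances in by_label.items():
        # Keep each label's share of the corpus approximately constant.
        n = max(1, round(target_size * len(instances) / len(corpus)))
        sample.extend(random.sample(instances, min(n, len(instances))))

    random.shuffle(sample)
    return sample[:target_size]

# Example: create a simulated low-resource training set of 1,000 instances.
# small_train = stratified_subsample(full_train, target_size=1000)
```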
Out-of-distribution Test Sets
While it is possible to create OOD text by, for instance, withholding classes from the training set or appending text from a different source (Arora et al., 2021), we choose to pick entirely new OOD test sets that are qualitatively different: out-of-scope voice commands by users in Larson et al. (2019),[5] the Twitter split of the Dan+ dataset (Plank et al., 2020), and the Finnish OOD treebank (Kanerva and Ginter, 2022).
In similar works for the image domain, OOD test
sets are often chosen to be convincingly different
from the training distribution, for instance MNIST
versus Fashion-MNIST (Nalisnick et al., 2019; van Amersfoort et al., 2021). While there exist a variety of formalizations of types of distributional shift (Moreno-Torres et al., 2012; Wald et al., 2021; Arora et al., 2021; Federici et al., 2021), it is often hard to determine whether a shift is taking place and, if so, what kind. Winkens et al. (2020) define near OOD as a scenario in which the inlier and outlier distributions are meaningfully related, and far OOD as a case in which they are unrelated. Unfortunately, this distinction is somewhat arbitrary and hard to apply in a language context, where OOD could be defined as anything ranging from a different language or dialect, to a different author or speaker demographic, to a new genre. Therefore, we use a methodology similar to the one used to validate the sub-sampled training sets in order to argue that the selected OOD splits are sufficiently different in nature from the training splits. The exact procedure, along with more detailed results, is described in Appendix A.3.

[4] The definition of low-resource differs greatly between works. One definition by Bird (2022) advocates its use for (would-be) standardized languages with a large number of speakers and a written tradition, but a lack of resources for language technologies. Another option is a task-dependent definition: for dependency parsing, Müller-Eberstein et al. (2021) define low-resource as providing fewer than 5000 annotated sentences in the Universal Dependencies Treebank. Hedderich et al. (2021) and Lignos et al. (2022) lay out a task-dependent spectrum, ranging from several hundred to thousands of instances.

[5] Since all instances in this test set correspond to out-of-scope inputs and not to classes the model was trained on, we cannot evaluate certain metrics in Table 2.
3.4 Model Training
Unfortunately, our datasets do not contain enough data to train transformer-based models from scratch. Therefore, we only fully train LSTM-based models, while using pre-trained transformers, namely BERT (English; Devlin et al., 2019), Danish BERT (Hvingelby et al., 2020), and FinBERT (Finnish; Virtanen et al., 2019), for the other approaches. The whole procedure is depicted in Figure 1. The way we optimize models is described in Appendix C.3. We list training hardware and hyperparameter information in Appendix C.2, with the environmental impact described in Appendix C.5.
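For concreteness, the snippet below sketches how such pre-trained encoders could be loaded with the Hugging Face transformers library. The hub identifiers for the Danish and Finnish checkpoints are illustrative assumptions, not a specification of the exact checkpoints used in our experiments.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative Hugging Face hub identifiers: the English entry is the standard
# BERT base checkpoint; the Danish and Finnish entries are assumed IDs and may
# differ from the checkpoints actually used here.
CHECKPOINTS = {
    "english": "bert-base-uncased",                    # BERT (Devlin et al., 2019)
    "danish": "Maltehb/danish-bert-botxo",             # Danish BERT (assumed ID)
    "finnish": "TurkuNLP/bert-base-finnish-cased-v1",  # FinBERT (assumed ID)
}

def load_pretrained(language: str, num_labels: int):
    """Load a pre-trained encoder with a freshly initialized classification head.

    For the token-level tasks (NER, POS tagging), AutoModelForTokenClassification
    would be used instead of the sequence classification head shown here.
    """
    name = CHECKPOINTS[language]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=num_labels)
    return tokenizer, model

# Example usage:
# tokenizer, model = load_pretrained("english", num_labels=150)
```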
3.5 Evaluation
Apart from evaluating models on task performance, we also evaluate the following calibration and uncertainty metrics, painting a multi-faceted picture of the reliability of models. In all cases, we use the Almost Stochastic Order test (ASO; del Barrio et al., 2018; Dror et al., 2019) for significance testing, which is elaborated on in Appendix C.1.
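As a usage illustration, the snippet below shows how such a comparison could be run, assuming the deepsig package and its aso function, which implements the ASO test. The score values and the decision threshold of 0.5 are illustrative assumptions; the exact testing setup we use is described in Appendix C.1.

```python
import numpy as np
# Assumes the deepsig package (deep-significance), whose aso() function
# implements the Almost Stochastic Order test.
from deepsig import aso

# Illustrative scores only: a task metric for two models across five seeds.
scores_a = np.array([0.81, 0.83, 0.79, 0.82, 0.84])
scores_b = np.array([0.78, 0.80, 0.77, 0.79, 0.81])

# aso() returns eps_min; values close to 0 indicate that model A is almost
# stochastically dominant over model B. The threshold of 0.5 below is an
# illustrative convention, not a statement about our exact setup.
eps_min = aso(scores_a, scores_b)
print(f"eps_min = {eps_min:.3f} -> model A better: {eps_min < 0.5}")
```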
Evaluation of Calibration
First, we measure the calibration of models using the adaptive calibration error (ACE; Nixon et al., 2019), which is an extension of the expected calibration error (ECE; Naeini et al., 2015; Guo et al., 2017).[6]
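To make the difference between the two measures concrete, the sketch below contrasts ECE, which uses equal-width confidence bins, with a simplified ACE-style variant that uses equal-mass (adaptive) bins over the top-label confidence. This is our own minimal illustration and omits parts of the full definition by Nixon et al. (2019), such as per-class binning; see Appendix B for the actual differences.

```python
import numpy as np

def ece(confidences, correct, num_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    error, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            error += mask.sum() / n * abs(correct[mask].mean() - confidences[mask].mean())
    return error

def ace_top_label(confidences, correct, num_bins=10):
    """Simplified ACE-style error: equal-mass (adaptive) bins over the top-label
    confidence, so every bin holds roughly the same number of predictions."""
    order = np.argsort(confidences)
    bins = [b for b in np.array_split(order, num_bins) if len(b) > 0]
    return sum(abs(correct[b].mean() - confidences[b].mean()) for b in bins) / len(bins)

# Illustrative usage with toy predictions:
conf = np.array([0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4])
corr = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0], dtype=float)
print(ece(conf, corr), ace_top_label(conf, corr))
```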
Furthermore, we use the frequentist measure of coverage (Larry, 2004; Kompa et al., 2021). Coverage is based on
[6] See Appendix B for a short overview of the differences.