Exploring Predictive Uncertainty and Calibration in NLP:
A Study on the Impact of Method & Data Scarcity
Dennis Ulmer¹  Jes Frellsen²  Christian Hardmeier¹
¹Department of Computer Science, IT University of Copenhagen
²Department of Applied Mathematics & Computer Science, Technical University of Denmark
dennis.ulmer@mailbox.org
arXiv:2210.15452v1 [cs.CL] 20 Oct 2022
Abstract
We investigate the problem of determining the
predictive confidence (or, conversely, uncer-
tainty) of a neural classifier through the lens
of low-resource languages. By training mod-
els on sub-sampled datasets in three differ-
ent languages, we assess the quality of esti-
mates from a wide array of approaches and
their dependence on the amount of available
data. We find that while approaches based on
pre-trained models and ensembles achieve the
best results overall, the quality of uncertainty
estimates can surprisingly suffer with more
data. We also perform a qualitative analysis
of uncertainties on sequences, discovering that
a model’s total uncertainty seems to be influ-
enced to a large degree by its data uncertainty,
not model uncertainty. All model implementa-
tions are open-sourced in a software package.
1 Introduction
In 1877, Italian astronomer Giovanni Schiaparelli
described the existence of “canals” on the surface
of Mars, a finding that was described by a contem-
porary as a “very important and perplexing [prob-
lem]” (Young, 1895, p. 355). It later turned out that
the structures, originally termed canali in Italian,
were simply mistranslated, since the word can also
refer to (natural) channels of water. By that point
however, the possibility of irrigation on the red
planet had already seeped into popular culture, and is
still being referenced to this day. In the meantime,
translation has become a task that is increasingly
performed by neural networks, which — in the face
of a word such as canali — might simply fall back
on the most likely translation given the training
data. And while the error above seems fairly in-
nocuous, there are more safety-critical scenarios in
which such ambiguities matter and can potentially
have negative real-world consequences.

Figure 1: Schematic of our experiments. Training sets are sub-sampled and used to train LSTM-based models and fine-tune transformer-based ones, which are evaluated on in- and out-of-distribution test data.

Besides translation, there also exist other language-based problems in which the uncertainty surrounding a model prediction can convey critical information, such as medical analyses (Esteva et al., 2019), legal case data (Frankenreiter and Livermore, 2020) or analyzing job applications (Zimmermann et al., 2016). Determining model confidence, or, conversely, uncertainty, is consequently an important means to instill trust in end users and avert harm (Bhatt et al., 2021; Jacovi et al., 2021). While there exist many works on images (Lakshminarayanan et al., 2017; Snoek et al., 2019) and tabular data (Ruhe et al., 2019; Ulmer et al., 2020; Malinin et al., 2021), the quality of uncertainty estimates provided by neural networks remains underexplored in Natural Language Processing (NLP). In addition, as model underspecification due to insufficient data presents a risk (D’Amour et al., 2020), the increasing interest in less-researched languages with limited resources raises the question of how reliably uncertain predictions can be identified. This lets us pose the following research questions:
RQ1: What are the best approaches in terms of uncertainty quality and calibration?
RQ2: How are models impacted by the amount of available training data?
RQ3: What are the differences in how the different approaches estimate uncertainty?
Contributions
(1) We address these questions by conducting a comprehensive empirical study of eight different models for uncertainty estimation for classification and evaluate their effectiveness on three languages spanning distinct NLP tasks, involving sequence labeling and classification. (2) We show that while approaches based on pre-trained models and ensembles achieve the best results overall, the quality of uncertainty estimates on OOD data can become worse using more data. (3) In a qualitative analysis, we also discover that a model’s total uncertainty seems to mostly consist of its data uncertainty. (4) We make our experimental code and model implementations available open-source in separate repositories, aiding future research in this direction.[1]

[1] The model zoo is available under https://github.com/Kaleidophon/nlp-uncertainty-zoo, with the code for the experiments available under https://github.com/Kaleidophon/nlp-low-resource-uncertainty.
2 Related Work
Notions of Uncertainty
In the absence of additional information, the introductory example canali has two valid translations — canals and channels. This is an instance of data or aleatoric uncertainty, describing the irreducible ambiguity and noise in the data-generating process. The other notion is model or epistemic uncertainty: when fitting parameters, there remains a degree of incertitude about the optimal values due to finite data. We can usually reduce this uncertainty by amassing more data,[2] for instance by supplying a translation system with other meanings of canali. These two concepts form the basis for uncertainty estimation in Machine Learning (Der Kiureghian and Ditlevsen, 2009; Hüllermeier and Waegeman, 2021).

[2] That is, unless the model class we chose is too restrictive.
Uncertainty in NLP
Since the uncertainty estimation literature is manifold on image data, we dedicate this part to related works in the realm of Natural Language Processing. There are several examples of trying to incorporate uncertainty into models to either increase trustworthiness or performance, for instance in Machine Translation (Glushkova et al., 2021; Wei et al., 2020; Xiao et al., 2020), Summarization (Gidiotis and Tsoumakas, 2021), Information Retrieval (Penha and Hauff, 2021) and Active Learning (Siddhant and Lipton, 2018). To obtain uncertainties, Gan et al. (2017) use Stochastic-Gradient Langevin Dynamics (Welling and Teh, 2011) to obtain posterior weight samples for an LSTM. Shelmanov et al. (2021) apply MC Dropout with determinantal point processes to transformers for Natural Language Understanding. Several authors have also highlighted connections of multi-head attention to Bayesian inference (An et al., 2020; Hron et al., 2020). Shen et al. (2020) attempt to transfer the idea of prior networks (Malinin and Gales, 2018; Joo et al., 2020) onto recurrent neural networks. Another line of work investigates uncertainty properties themselves; for instance, Chen and Ji (2022) try to explain uncertainty estimates for BERT and RoBERTa. Another example is given by Xiao and Wang (2021), who use predictive uncertainty to explain hallucination in Language Generation. Xu et al. (2020) similarly use uncertainty as a tool to investigate challenges of neural summarization approaches. Lastly, due to the way that uncertainty estimates are evaluated, investigating distributional shift in NLP is also of interest, for instance through the work of Arora et al. (2021) and Kamath et al. (2020), who focus on question answering, and Tan et al. (2019) for text classification. The most similar work to ours is the text classification uncertainty benchmark by Van Landeghem et al. (2022); however, they do not consider the impact of data or language, and test a different selection of models.
Calibration
Calibration denotes the property of a model’s output to accurately reflect the true chance of a correct prediction — i.e. predicting a class with a confidence of 90% should yield the correct prediction for 90% of similar inputs, when repeated. There have been several studies testing this property in modern neural networks (Guo et al., 2017; Nixon et al., 2019; Minderer et al., 2021; Wang et al., 2021b) and proposing ways to improve it (Thulasidasan et al., 2019; Mukhoti et al., 2020; Karandikar et al., 2021; Zhao et al., 2021a; Tian et al., 2021). In NLP, calibration has been explored for pre-trained models (Desai and Durrett, 2020), including on out-of-distribution data (Dan and Roth, 2021), for neural machine translation (Wang et al., 2020) and for question answering (Jiang et al., 2021). Likewise, authors have proposed several calibration schemes, for instance by focusing on classes of interest (Jagannatha and Yu, 2020), generating synthetic examples for regularization (Kong et al., 2020), using richer input representations (Zhang et al., 2021) and adapting prompts in a zero-shot setting (Zhao et al., 2021b).
3 Methodology
3.1 Models
We choose a variety of models that cover a range of different approaches, based on the two most prominently used architectures in NLP: Long Short-Term Memory networks (LSTMs; Hochreiter and Schmidhuber, 1997) and transformers (Vaswani et al., 2017). Within the first family, we use the Variational LSTM (Gal and Ghahramani, 2016b) based on MC Dropout (Gal and Ghahramani, 2016a), the Bayesian LSTM (Fortunato et al., 2017) implementing Bayes-by-backprop (Blundell et al., 2015) and the ST-τ LSTM (Wang et al., 2021a), modelling transitions in a finite-state automaton, as well as an ensemble (Lakshminarayanan et al., 2017). In the second family, we count the Variational Transformer (Xiao et al., 2020), also using MC Dropout, the SNGP Transformer (Liu et al., 2022), using a Gaussian Process output layer, and the Deep Deterministic Uncertainty transformer (DDU; Mukhoti et al., 2021), fitting a Gaussian mixture model on extracted features. We elaborate on implementation details in Appendix C.1.
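For the MC Dropout-based variants above, the core idea at inference time is to keep dropout active and average over several stochastic forward passes. The following sketch is a generic illustration of that idea, not the released implementation; function and argument names are ours.

import torch
import torch.nn.functional as F

def mc_dropout_predict(model, inputs, num_samples=10):
    # Keep dropout layers active at test time (note: train() also affects batch norm,
    # which a real implementation would keep in eval mode).
    model.train()
    with torch.no_grad():
        # One softmax distribution per stochastic forward pass,
        # shape: (num_samples, batch_size, num_classes).
        samples = torch.stack(
            [F.softmax(model(inputs), dim=-1) for _ in range(num_samples)]
        )
    return samples.mean(dim=0), samples  # mean prediction and the full sample stack

# Toy usage with a small classifier containing dropout.
toy_model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                                torch.nn.Dropout(p=0.5), torch.nn.Linear(16, 3))
mean_probs, all_samples = mc_dropout_predict(toy_model, torch.randn(4, 8))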
3.2 Uncertainty Metrics
We employ the following metrics to quantify confi-
dence or uncertainty — in all cases, lower values in-
dicate lower confidence / certainty and conversely,
higher values mean higher confidence / certainty.
The following metrics were either chosen due to
their frequent use in the literature, or because they
are trying to capture uncertainty in a novel way.
Single prediction metrics
We distinguish between metrics suitable for models using only a single prediction (or using the mean of multiple predictions, e.g. for an ensemble). The most straightforward of them is the maximum softmax probability by Hendrycks and Gimpel (2017). A variant of this is the softmax-gap, measuring the difference between the two largest predicted probabilities (Tagasovska and Lopez-Paz, 2019). Another common metric, predictive entropy, involves measuring the Shannon entropy of the output distribution, which is maximized for a uniform prediction:

$$-\sum_{k=1}^{K} p_\theta(y = k \mid x) \log p_\theta(y = k \mid x)$$
Lastly, we consider the Dempster-Shafer metric (Sensoy et al., 2018), defined as $K / (K + \sum_{k=1}^{K} \exp(z_k))$, where $z_k$ denotes the logit corresponding to class $k$. It has been shown that probabilities for (ReLU) networks tend to saturate in the limit (Hein et al., 2019; Ulmer and Cinà, 2021), and since this metric considers logits, it might provide more informative estimates on OOD data.
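To make these metrics concrete, here is a small sketch (ours; names are illustrative) that computes all four from the logit vector of a single input:

import numpy as np
from scipy.special import softmax

def single_prediction_metrics(logits):
    # `logits` is a 1-D array with one entry per class.
    probs = softmax(logits)
    sorted_probs = np.sort(probs)[::-1]
    num_classes = len(logits)
    return {
        "max_prob": sorted_probs[0],                       # maximum softmax probability
        "softmax_gap": sorted_probs[0] - sorted_probs[1],  # gap between the two largest probabilities
        "predictive_entropy": -np.sum(probs * np.log(probs + 1e-12)),
        # Dempster-Shafer metric: computed on raw logits, higher means more uncertain.
        "dempster_shafer": num_classes / (num_classes + np.sum(np.exp(logits))),
    }

print(single_prediction_metrics(np.array([2.0, 0.5, -1.0])))

Note that the first two entries are confidence scores (higher means more certain), while the last two are uncertainty scores.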
Multiple prediction metrics
For some of the in-
cluded models, we can express uncertainty as some
score based on a number of predicted distributions,
e.g. from different ensemble members or forward
passes for MC Dropout. Here we use the expecta-
tion with respect to the weight posterior to express
the aggregation of multiple predictions, which will
simply be evaluated using the mean of a number of
Monte Carlo samples in practice. A simple uncer-
tainty metric on this basis is the predictive variance
between predictions for a class:
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{q(\theta)}\Big[\big(p_\theta(y = k \mid x) - \mathbb{E}_{q(\theta)}\big[p_\theta(y = k \mid x)\big]\big)^2\Big],$$
where the expectation is evaluated over multiple
sets of parameters, e.g. stemming from different
dropout masks. Another possibility lies in using the
mutual information between the label and model
parameters given the data and input sample, which
was introduced by Smith and Gal (2018):
$$\mathrm{H}\Big[\mathbb{E}_{q(\theta)}\big[p_\theta(y \mid x)\big]\Big] - \mathbb{E}_{q(\theta)}\Big[\mathrm{H}\big[p_\theta(y \mid x)\big]\Big] \qquad (1)$$
where H denotes the Shannon entropy as used
for predictive entropy. The two terms of this equa-
tion can be identified as the total entropy and the
aleatoric uncertainty, respectively. In theory, the
remaining epistemic uncertainty of the model — in
the form of the mutual information — should be
particularly high on OOD inputs.
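The same quantities can be estimated from a stack of sampled output distributions, e.g. the sample stack of the MC Dropout sketch above or the members of an ensemble; again, this is our illustration rather than the released code.

import numpy as np

def multi_prediction_metrics(prob_samples):
    # `prob_samples` has shape (num_samples, num_classes): one softmax output per
    # ensemble member or stochastic forward pass.
    mean_probs = prob_samples.mean(axis=0)
    # Predictive variance per class, averaged over classes.
    predictive_variance = ((prob_samples - mean_probs) ** 2).mean(axis=0).mean()
    # Mutual information (Eq. 1): total entropy minus expected (aleatoric) entropy.
    total_entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    aleatoric = -np.sum(prob_samples * np.log(prob_samples + 1e-12), axis=1).mean()
    return predictive_variance, total_entropy - aleatoric

samples = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.25, 0.15]])
print(multi_prediction_metrics(samples))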
Model-specific metrics
Lastly, DDU by
Mukhoti et al. (2021) uses the log-probability of
the last layer network activation under a Gaussian
Mixture Model fitted on the training set as an
additional metric. Since all other models are
trained or fine-tuned as classifiers, they are not
able to assign log-probabilities to sequences.
Uncertainty for sequences
Since some tasks require predictions for every time step of a sequence, we determine the uncertainty of a whole sequence in these cases by taking the mean over all step-wise uncertainties.[3] A more principled approach for sequences is for instance provided by Malinin and Gales (2021), and we leave the extension and exploration of such methods for different uncertainty metrics, models and tasks to future work.

[3] We also just considered the maximum uncertainty over a sequence, with similar results.
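As a minimal illustration of this aggregation (naming is ours):

import numpy as np

def sequence_uncertainty(step_uncertainties, mode="mean"):
    # Aggregate per-token uncertainty scores into a single score for the sequence.
    scores = np.asarray(step_uncertainties)
    return scores.mean() if mode == "mean" else scores.max()

print(sequence_uncertainty([0.2, 1.3, 0.4]))          # mean aggregation, used here
print(sequence_uncertainty([0.2, 1.3, 0.4], "max"))   # max aggregation, see footnote 3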
3.3 Dataset Selection & Creation
In-distribution training sets
We choose three different languages, namely English (Clinc Plus; Larson et al., 2019), Danish in the form of the Dan+ dataset (Plank et al., 2020), based on news texts from PAROLE-DK (Bilgram and Keson, 1998), and Finnish (UD Treebank; Haverinen et al., 2014; Pyysalo et al., 2015; Kanerva and Ginter, 2022), corresponding to NLP tasks such as sequence classification, named entity recognition and part-of-speech tagging. An overview of the data used is given in Table 1. We do use standardized low-resource languages in the case of Finnish and Danish, and simulate a low-resource setting using English data.[4] Starting with a sufficiently-sized training set and then sub-sampling allows us to create training sets of arbitrary sizes. By using languages from different families, we hope to be able to draw conclusions that generalize beyond a single language. We employ a specific sampling scheme that tries to maintain the sequence length and class distribution of the original corpus, which we explain and verify in Appendix A.2.

[4] The definition of low-resource actually differs greatly between works. One definition by Bird (2022) advocates the usage for (would-be) standardized languages with a large number of speakers and a written tradition, but a lack of resources for language technologies. Another way is a task-dependent definition: for dependency parsing, Müller-Eberstein et al. (2021) define low-resource as providing less than 5000 annotated sentences in the Universal Dependencies Treebank. Hedderich et al. (2021) and Lignos et al. (2022) lay out a task-dependent spectrum, from several hundred to thousands of instances.
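The exact sampling scheme is only described in Appendix A.2; as a rough, hypothetical sketch of the general idea for a sequence classification corpus, one could stratify by label and length bin and sample proportionally from each stratum:

import random
from collections import defaultdict

def subsample(corpus, target_size, num_length_bins=5, seed=42):
    # `corpus` is a list of (tokens, label) pairs; this is an illustrative stand-in
    # for the paper's procedure, not a reproduction of it.
    random.seed(seed)
    lengths = sorted(len(tokens) for tokens, _ in corpus)
    edges = [lengths[i * len(lengths) // num_length_bins] for i in range(1, num_length_bins)]
    bin_of = lambda n: sum(n >= edge for edge in edges)  # length bin of a sequence
    strata = defaultdict(list)
    for tokens, label in corpus:
        strata[(label, bin_of(len(tokens)))].append((tokens, label))
    keep_ratio = target_size / len(corpus)
    sampled = []
    for examples in strata.values():  # sample proportionally from every stratum
        random.shuffle(examples)
        sampled.extend(examples[: max(1, round(keep_ratio * len(examples)))])
    random.shuffle(sampled)
    return sampled[:target_size]

toy_corpus = [(["token"] * (i % 7 + 1), i % 3) for i in range(100)]
print(len(subsample(toy_corpus, target_size=40)))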
Out-of-distribution Test Sets
While it is possible to create OOD text by, for instance, withholding classes from the training set or appending text from a different source (Arora et al., 2021), we choose to pick entirely new OOD test sets that are qualitatively different: out-of-scope voice commands by users in Larson et al. (2019),[5] the Twitter split of the Dan+ dataset (Plank et al., 2020), and the Finnish OOD treebank (Kanerva and Ginter, 2022). In similar works for the image domain, OOD test sets are often chosen to be convincingly different from the training distribution, for instance MNIST versus Fashion-MNIST (Nalisnick et al., 2019; van Amersfoort et al., 2021). While there exist a variety of formalizations of types of distributional shift (Moreno-Torres et al., 2012; Wald et al., 2021; Arora et al., 2021; Federici et al., 2021), it is often hard to determine if and what kind of shift is taking place. Winkens et al. (2020) define near OOD as a scenario in which the inlier and outlier distribution are meaningfully related, and far OOD as a case in which they are unrelated. Unfortunately, this distinction is somewhat arbitrary and hard to apply in a language context, where OOD could be defined as anything ranging from a different language or dialect to a different demographic of author or speaker or a new genre. Therefore, we use a similar methodology to the validation of the sub-sampled training sets to make an argument that the selected OOD splits are sufficiently different in nature from the training splits. The exact procedure along with some more detailed results is described in Appendix A.3.

[5] Since all instances in this test set correspond to out-of-scope inputs and not to classes the model was trained on, we cannot evaluate certain metrics in Table 2.
3.4 Model Training
Unfortunately, our datasets do not contain enough data to train transformer-based models from scratch. Therefore, we only fully train the LSTM-based models, while using pre-trained transformers, namely BERT (English; Devlin et al., 2019), Danish BERT (Hvingelby et al., 2020), and FinBERT (Finnish; Virtanen et al., 2019), for the other approaches. The whole procedure is depicted in Figure 1. The way we optimize models is provided in Appendix C.3. We list training hardware and hyperparameter information in Appendix C.2, with the environmental impact described in Appendix C.5.
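As a hedged sketch of the transformer side of this setup (our code, not the released pipeline; the exact checkpoints and fine-tuning settings are those listed in the paper's appendices, and sequence labeling tasks would use a token-classification head instead):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

def load_pretrained_classifier(checkpoint, num_labels):
    # Load a pre-trained encoder and attach a fresh classification head for fine-tuning.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_labels)
    return tokenizer, model

# English intent classification on Clinc Plus (150 in-scope intents); the Danish BERT
# (Hvingelby et al., 2020) and FinBERT (Virtanen et al., 2019) checkpoints would be
# plugged in analogously with their respective identifiers.
tokenizer, model = load_pretrained_classifier("bert-base-uncased", num_labels=150)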
3.5 Evaluation
Apart from evaluating models on task performance, we also evaluate the following calibration and uncertainty properties, painting a multi-faceted picture of the reliability of models. In all cases, we use the Almost Stochastic Order test (ASO; del Barrio et al., 2018; Dror et al., 2019) for significance testing, which is elaborated on in Appendix C.1.
Language | Task | Dataset | OOD Test Set | # ID / OOD | Sub-sampled Training Set Sizes
EN | Intent Classification | Clinc Plus (Larson et al., 2019) | Out-of-scope voice commands | 15k / 1k | 15k / 12.5k / 10k
DA | Named Entity Recognition | Dan+ News (Plank et al., 2020) | Tweets | 4382 / 109 | 4k / 2k / 1k
FI | PoS Tagging | Finnish UD Treebank (Haverinen et al., 2014; Pyysalo et al., 2015; Kanerva and Ginter, 2022) | Hospital records, online forums, tweets, poetry | 12217 / 2122 | 10k / 7.5k / 5k

Table 1: Datasets. The original and sub-sampled number of sequences for experiments are given on the right.

Evaluation of Calibration
First, we measure the calibration of models using the adaptive calibration error (ACE; Nixon et al., 2019), which is an extension of the expected calibration error (ECE; Naeini et al., 2015; Guo et al., 2017).[6] Furthermore, we use the frequentist measure of coverage (Larry, 2004; Kompa et al., 2021). Coverage is based on the prediction set P̂(x) of a classifier given an input, which includes the most likely classes adding up to or surpassing 1 − α probability mass. A well-tuned classifier should contain the correct class in this very set, and minimize its width. The extent to which this property holds can be determined by the coverage percentage, i.e. the number of times the correct class is indeed contained within the prediction set, and its cardinality, denoted simply as width.

[6] See Appendix B for a short overview of the differences.
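A sketch of how coverage and width can be computed from predicted class distributions (our illustration; names are ours):

import numpy as np

def coverage_and_width(probs, labels, alpha=0.05):
    # For each input, add classes in order of decreasing probability until at least
    # 1 - alpha probability mass is covered; report how often the true label is in
    # that prediction set (coverage) and the average set size (width).
    covered, widths = [], []
    for p, y in zip(probs, labels):
        order = np.argsort(p)[::-1]
        cumulative = np.cumsum(p[order])
        set_size = min(int(np.searchsorted(cumulative, 1 - alpha)) + 1, len(p))
        covered.append(y in order[:set_size])
        widths.append(set_size)
    return float(np.mean(covered)), float(np.mean(widths))

probs = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
print(coverage_and_width(probs, labels=[0, 2]))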
Evaluation of Uncertainty
We compare uncertainty scores on the ID and OOD test set and measure the area under the receiver operating characteristic curve (AUROC) and under the precision-recall curve (AUPR), assuming that uncertainty will generally be higher on samples from the OOD test set.[7] An ideal model should create very different distributions of confidence scores on ID and OOD data, thus maximizing AUROC and AUPR. However, we also want to find out to what extent uncertainty can give an indication of the correctness of the model, which is why we propose a new way to evaluate the discrimination property posed by Alaa and Van Der Schaar (2020) based on Leonard et al. (1992): a good model should be less certain for inputs that incur a higher loss. To measure this both on a token and sequence level, we utilize Kendall’s τ (Kendall, 1938), which, given two lists of measurements, determines the degree to which they are concordant — that is, to what extent the rankings of elements according to their measured values agree. This is expressed by a value between −1 and 1, with the latter expressing complete concordance. In our case, these measurements correspond to the uncertainty estimate and the actual model loss, either for tokens (Token τ) or sequences (Sequence τ).
[7] We thus formulate a pseudo-binary classification task as common in the literature, using the model’s uncertainty score to try to distinguish the two test sets. Note that we do not advocate for actually using uncertainty for OOD detection, but only use it for evaluation purposes, since uncertainty on OOD examples should be high due to model uncertainty.
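Both evaluations can be sketched with standard library calls (our illustration; the released experiment code implements the full pipeline):

import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import average_precision_score, roc_auc_score

def ood_detection_scores(id_uncertainties, ood_uncertainties):
    # Pseudo-binary task: ID examples get label 0, OOD examples label 1, and the
    # uncertainty score is used to try to separate the two test sets.
    labels = np.concatenate([np.zeros(len(id_uncertainties)), np.ones(len(ood_uncertainties))])
    scores = np.concatenate([id_uncertainties, ood_uncertainties])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

def discrimination_tau(uncertainties, losses):
    # Kendall's tau between uncertainty and per-example loss: high values mean the
    # model is less certain exactly where it incurs a higher loss.
    return kendalltau(uncertainties, losses)[0]

print(ood_detection_scores([0.1, 0.2, 0.3], [0.4, 0.8, 0.9]))
print(discrimination_tau([0.1, 0.5, 0.9], [0.2, 0.4, 1.1]))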
4 Experiments
4.1 RQ1: Uncertainty & Calibration
We present the results from our experiments using the largest training set sizes per dataset in Table 2.[8]
Task Performance
Across datasets and models, we can identify several trends: some of the BERT-based models unsurprisingly perform better than LSTM-based models, which can be explained by their pre-training procedure. We observe worse performance for some LSTM and BERT variants, in particular the Variational, Bayesian and ST-τ LSTM, as well as the SNGP BERT. In accordance with the ML literature (see e.g. Lakshminarayanan et al., 2017; Ovadia et al., 2019), LSTM ensembles actually perform very strongly, on par with or sometimes better than fine-tuned BERTs.
Calibration
We also see that BERT models generally achieve lower calibration errors across all metrics measured, which is in line with previous works (Desai and Durrett, 2020; Dan and Roth, 2021). It is interesting to see that the correct prediction is almost always contained in the 0.95 confidence set across all models; however, these numbers have to be interpreted in the context of the set’s width: it becomes apparent that, for instance, LSTMs achieve this coverage by spreading probability mass over many classes, while only BERT-based models, LSTM ensembles, as well as the Bayesian LSTM (on Danish) and the Variational LSTM (on Finnish) are confidently correct.
Uncertainty Quality
LSTM-based models seem to struggle to distinguish in- from out-of-distribution data based on predictive uncertainty. For Danish, only BERTs perform visibly above chance level. For Finnish, the AUPR results suggest that although some OOD instances are quickly

[8] For English, some models were omitted due to convergence issues, which are discussed in Appendix C.4.