Revisiting Softmax for Uncertainty Approximation in Text Classification
Andreas Nugaard Holm and Dustin Wright and Isabelle Augenstein
Department of Computer Science, University of Copenhagen
{aholm, dw, augenstein}@di.ku.dk
Abstract
Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability. One of the most widely used uncertainty approximation methods is Monte Carlo (MC) Dropout, which is computationally expensive as it requires multiple forward passes through the model. A cheaper alternative is to simply use the softmax based on a single forward pass without dropout to estimate model uncertainty. However, prior work has indicated that these predictions tend to be overconfident. In this paper, we perform a thorough empirical analysis of these methods on five datasets with two base neural architectures in order to identify the trade-offs between the two. We compare both softmax and an efficient version of MC Dropout on their uncertainty approximations and downstream text classification performance, while weighing their runtime (cost) against performance (benefit). We find that, while MC dropout produces the best uncertainty approximations, using a simple softmax leads to competitive, and in some cases better, uncertainty estimation for text classification at a much lower computational cost, suggesting that softmax can in fact be a sufficient uncertainty estimate when computational resources are a concern.
1 Introduction
The pursuit of pushing state-of-the-art performance on machine learning benchmarks often comes with an added cost of computational complexity. On top of already complex base models, such as Transformer models (Vaswani et al., 2017; Lin et al., 2021), successful methods often employ additional techniques to improve the uncertainty estimation of these models, as they tend to be over-confident in their predictions. Though these techniques can be effective, the overall benefit in relation to the added computational cost is under-studied.

More complexity does not always imply better performance. For example, Transformers can be outperformed by much simpler convolutional neural nets (CNNs) when the latter are pre-trained as well (Tay et al., 2021). Here, we turn our attention to neural network uncertainty estimation methods in text classification, which have applications in domain adaptation and decision making, and can help make models more transparent and explainable. In particular, we focus on a setting where efficiency is of concern, which can help improve the sustainability and democratisation of machine learning, as well as enable use in resource-constrained environments.
Quantifying predictive uncertainty in neural nets has been explored using various techniques (Gawlikowski et al., 2021), with the methods being divided into three main categories: Bayesian methods, single deterministic networks, and ensemble methods. Bayesian methods include Monte Carlo (MC) dropout (Gal and Ghahramani, 2016b) and Bayes by backprop (Blundell et al., 2015). Single deterministic networks can approximate the predictive uncertainty with a single forward pass through the model, with softmax being the prototypical method. Lastly, ensemble methods utilise a collection of models to calculate the predictive uncertainty. However, while uncertainty estimation can improve when using more complex Bayesian and ensembling techniques, efficiency takes a hit.
In this paper, we perform an empirical investigation of the trade-off between choosing cheap vs. expensive uncertainty approximation methods for text classification, with the goal of highlighting the efficacy of these methods in an efficient setting. We focus on one single deterministic and one Bayesian method. For the single deterministic method, we study the softmax, which is calculated from a single forward pass and is computationally very efficient. While softmax is a widely used method, prior work has posited that the softmax output, when taken as a single deterministic operation, is not the most dependable uncertainty approximation method (Gal and Ghahramani, 2016b; Hendrycks and Gimpel, 2017). As such, it has been superseded by newer methods such as MC dropout, which leverages the dropout function in neural nets to approximate a random sample of multiple networks and aggregates the softmax outputs of this sample. MC dropout is favoured due to its close approximation of uncertainty, and because it can be used without any modification to the applied model. It has also been widely applied in text classification tasks (Zhang et al., 2019; He et al., 2020).
To understand the cost vs. benefit of softmax vs. MC dropout, we perform experiments on five datasets using two different neural network architectures, applying them to three different downstream text classification tasks. We measure both the added computational complexity in the form of runtime (cost) and the downstream performance on multiple uncertainty metrics (benefit). We show that by using a single deterministic method like softmax instead of MC dropout, we can improve the runtime by 10 times while still providing reasonable uncertainty estimates on the studied tasks. As such, given the already high computational cost of deep neural network based methods and recent pushes for more sustainable ML (Strubell et al., 2019; Patterson et al., 2021), we recommend not discarding efficient uncertainty approximation methods such as softmax in resource-constrained settings, as they can still potentially provide reasonable estimations of uncertainty.
Contribution In summary, our contributions are: 1) an empirical study of an efficient version of MC dropout and softmax for text classification tasks, using two different neural architectures and five datasets; 2) a comparison of uncertainty estimation between MC dropout and softmax using expected calibration error; 3) a comparison of the cost vs. benefit of MC dropout and softmax in a setting where efficiency is of concern.
2 Related Work
2.1 Uncertainty Quantification
Quantifying the uncertainty of a prediction can be done using various techniques (Ovadia et al., 2019; Gawlikowski et al., 2021; Henne et al., 2020), such as single deterministic methods (Mozejko et al., 2018; van Amersfoort et al., 2020), which calculate the uncertainty from a single forward pass of the model. These can be further classified as internal or external methods, depending on whether the uncertainty is calculated internally in the model or by post-processing the output. Another family of techniques are Bayesian methods, which combine NNs and Bayesian learning. Bayesian Neural Networks (BNNs) can also be split into subcategories, namely Variational Inference (Hinton and van Camp, 1993), Sampling (Neal, 1993), and Laplace Approximation (MacKay, 1992). Some of the more notable methods are Bayes by backprop (Blundell et al., 2015) and Monte Carlo Dropout (Gal and Ghahramani, 2016b). One can also approximate uncertainty using ensemble methods, which use multiple models to better measure predictive uncertainty compared to the predictive uncertainty given by a single model (Lakshminarayanan et al., 2017; He et al., 2020; Durasov et al., 2021). Recently, uncertainty methods have been used to develop methods for new tasks (Zhang et al., 2019; He et al., 2020), where mainly Bayesian methods have been employed. We present a thorough empirical study of how uncertainty quantification behaves for text classification tasks. Unlike prior work, we do not only evaluate based on the performance of the methods, but perform an in-depth comparison to much simpler deterministic methods based on multiple metrics.
2.2 Uncertainty Metrics
Measuring the performance of uncertainty approximation methods can be done in multiple ways, each offering benefits and downsides. Niculescu-Mizil and Caruana (2005) explore obtaining confidence values from the predictions of supervised learning models. One of the more widespread and accepted methods is expected calibration error (ECE; Guo et al., 2017). While ECE measures the underlying confidence of the uncertainty approximation, we have also seen the use of human intervention for text classification (Zhang et al., 2019; He et al., 2020). There, the uncertainty estimates are used to identify uncertain predictions from the model, and humans are asked to classify these predictions. The human-classified data is assumed to have 100% accuracy and to be suitable for measuring how well the model scores after removing a proportion of the most uncertain data points. Metrics such as ECE reveal the calibration of models, and this calibration can be improved using scaling techniques (Guo et al., 2017; Naeini et al., 2015). We use uncertainty approximation metrics like expected calibration error and human intervention (which we refer to as holdout experiments) to measure the difference in performance between MC dropout and softmax on text classification tasks.
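As a concrete reference point, expected calibration error can be computed over equal-width confidence bins, following the standard formulation of Guo et al. (2017). The sketch below assumes NumPy arrays of per-example confidences, predicted labels, and gold labels; it is an illustration of the metric, not the evaluation code used in this paper.

import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    # Weighted average over bins of |bin accuracy - mean bin confidence|.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_accuracy = (predictions[in_bin] == labels[in_bin]).mean()
        bin_confidence = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(bin_accuracy - bin_confidence)
    return ece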
3 Uncertainty Approximation for Text Classification
We focus on one deterministic method and one Bayesian method of uncertainty approximation. Both methods assume the existence of an already trained base model, and are applied at test time to obtain uncertainty estimates from the model's predictions. In the following sections, we formally introduce the two methods we study, namely MC dropout and softmax. MC dropout is a Bayesian method which utilises the dropout layers of the model to measure the predictive uncertainty, while softmax is a deterministic method that uses the classification output. In Figure 1, we visualise the differences between the two methods and how they are connected to base text classification models.
3.1 Bayesian Learning
Before introducing the MC dropout method, we quickly introduce the concept of Bayesian learning. We start by comparing Bayesian learning to a traditional NN. A traditional NN assumes that the network weights $\omega \in \mathbb{R}^n$ are real but of an unknown value and can be found through maximum-likelihood estimation, while the input data $(x, y) \in \mathcal{D}$ are treated as random variables. Bayesian learning instead views the weights as random variables, and infers a posterior distribution $p(\omega \mid \mathcal{D})$ over $\omega$ after observing $\mathcal{D}$. The posterior distribution is defined as follows:

$$p(\omega \mid \mathcal{D}) = \frac{p(\omega)\, p(\mathcal{D} \mid \omega)}{p(\mathcal{D})} = \frac{p(\omega)\, p(\mathcal{D} \mid \omega)}{\int p(\omega)\, p(\mathcal{D} \mid \omega)\, d\omega} \qquad (1)$$
Using the posterior distribution, we can find the prediction for an unseen data point $x$ with label $y$ as follows:

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, d\omega. \qquad (2)$$
However, the posterior distribution is infeasible to compute due to the marginal likelihood in the denominator, so we cannot find a solution analytically. We therefore resort to approximating the posterior distribution. For this approximation we rely on methods such as Bayes by Backpropagation (Blundell et al., 2015) and Monte Carlo Dropout (Gal and Ghahramani, 2016b).
3.2 Monte Carlo Dropout
At a high level, MC dropout approximates the posterior distribution $p(\omega \mid \mathcal{D})$ by leveraging the dropout layers in a model (Gal and Ghahramani, 2016b,a). Mathematically, it is derived by introducing a distribution $q(\omega)$, representing a distribution over weight matrices whose columns are randomly set to $0$, to approximate the posterior distribution $p(\omega \mid \mathcal{D})$, which results in the following predictive distribution:

$$q(y \mid x, \mathcal{D}) = \int p(y \mid x, \omega)\, q(\omega)\, d\omega. \qquad (3)$$
As this integral is still intractable, it is approximated by taking $K$ samples from $q(\omega)$ using the dropout layers of a learned network $f$ which approximates $p(y \mid x, \omega)$. As such, calculating $p(y \mid x, \omega)\, q(\omega)$ amounts to leaving the dropout layers active during testing, and approximating the integral amounts to aggregating predictions across multiple dropout samples. For the proofs, see Gal and Ghahramani (2016b).
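In other words, the intractable integral in Eq. (3) is replaced by a Monte Carlo estimate over $K$ stochastic forward passes. The following form is the standard MC dropout estimator, spelled out here for clarity rather than quoted from the original derivation:

$$q(y \mid x, \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} p\left(y \mid x, \hat{\omega}_k\right), \qquad \hat{\omega}_k \sim q(\omega).$$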
MC dropout requires multiple forward passes, so its computational cost is a multiple of the cost of performing a forward pass through the entire network. As this is obviously more computationally expensive than the single forward pass required for deterministic methods, we provide a fairer comparison between softmax and MC dropout by using an efficient version of MC dropout which caches an intermediate representation and only activates the dropout layers of the latter part of the network. As such, we obtain a representation $z$ by passing an input through the first several layers of the model and pass only this representation through the latter part of the model multiple times, reducing the computational cost while approximating the sampling of multiple networks.
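To make the cached-representation variant concrete, the following is a minimal PyTorch-style sketch. It assumes the base model can be split into an encoder (the expensive first layers), a dropout layer, and a classification head; the module names, the split, and the framework choice are illustrative assumptions, not the original implementation.

import torch

def efficient_mc_dropout(encoder, dropout, head, x, k=10):
    # The heavy part of the network runs once; only the cheap head is resampled.
    encoder.eval()
    head.eval()
    dropout.train()  # keep dropout stochastic at test time
    with torch.no_grad():
        z = encoder(x)                       # cached representation z
        samples = [torch.softmax(head(dropout(z)), dim=-1) for _ in range(k)]
    return torch.stack(samples)              # shape: (K, batch_size, num_classes)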
3.2.1 Combining Sample Predictions
With multiple samples of the same data point, we have to determine how to combine them to quantify the predictive uncertainty. We test two methods that can be calculated using the logits of the model, requiring no model changes. The first approach, which we refer to as Mean MC, averages the output of the softmax layer over all forward passes:

$$u_i = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Softmax}\left(f(z_i^k)\right), \qquad (4)$$

where $z_i^k$ is the representation of the $i$'th data point on the $k$'th forward pass and $f$ is a fully-connected layer.

Figure 1: MC Dropout (left) and softmax (right). In the version of MC dropout tested in this paper, a test input $x$ is passed through model $f$ to obtain a representation $z$, which is subsequently passed through a dropout layer multiple times, and then through the final part of the network to obtain prediction $y$. For softmax, dropout is disabled and a single prediction is obtained.

The second method we use to quantify the predictive uncertainty is Dropout Entropy (DE) (Zhang et al., 2019), which uses a combination of binning and entropy:
$$b_i = \frac{1}{K}\, \mathrm{BinCount}\left(\mathrm{argmax}\left(f(z_i)\right)\right) \qquad (5)$$

$$u_i = -\sum_{j=1}^{C} b_i(j) \log b_i(j) \qquad (6)$$
$\mathrm{BinCount}$ is the number of predictions of each class across the $K$ forward passes, and $b_i$ is a vector of the probabilities of each class's occurrence based on the bin count. We show the performance of the two methods in Section 4.3.2.
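As an illustration of how the two aggregation schemes operate on the stacked softmax outputs (shape (K, batch, num_classes), e.g. as produced by the sketch in Section 3.2), consider the following sketch; the treatment of empty bins ($0 \log 0 = 0$) is an implementation convention we assume, not something specified in the original formulation.

import torch

def mean_mc(probs):
    # Eq. (4): average the K softmax outputs per example.
    return probs.mean(dim=0)                          # (batch, num_classes)

def dropout_entropy(probs, num_classes):
    # Eqs. (5)-(6): bin the K argmax predictions per example and take the
    # entropy of the resulting class frequencies (0 log 0 treated as 0).
    k = probs.shape[0]
    preds = probs.argmax(dim=-1)                      # (K, batch)
    scores = []
    for i in range(preds.shape[1]):
        b = torch.bincount(preds[:, i], minlength=num_classes).float() / k
        nonzero = b[b > 0]
        scores.append(-(nonzero * nonzero.log()).sum())
    return torch.stack(scores)                        # (batch,)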
3.3 Softmax
Softmax, a common normalising function for producing a probability distribution from neural network logits, is defined as follows:

$$u_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_i(j)}}, \qquad (7)$$

where $z_i$ are the logits of the $i$'th data point. The softmax yields a probability distribution over the predicted classes. However, the predicted probability distribution is often overconfident toward the predicted class (Gal and Ghahramani, 2016b; Hendrycks and Gimpel, 2017). The issue of softmax's overconfidence can also be exploited (Gal and Ghahramani, 2016b; Joo et al., 2020) – in the worst case, this leads to the softmax producing imprecise uncertainties. However, model calibration methods like temperature scaling have been found to lessen the overconfidence to some extent (Guo et al., 2017). As temperature scaling also incurs a cost in terms of runtime in order to find an optimal temperature, we choose to compare raw softmax probabilities to the efficient MC dropout method described previously, though uncertainty estimation could potentially be improved by scaling the logits appropriately.
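For completeness, the single-pass baseline can be sketched as follows. The optional temperature argument reflects the temperature-scaling remark above, with a value of 1.0 (raw softmax) corresponding to the setting we compare against; the function name and interface are ours, not from the original code.

import torch

def softmax_uncertainty(model, x, temperature=1.0):
    # Single deterministic forward pass with dropout disabled (eval mode);
    # the maximum class probability acts as the confidence estimate.
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x) / temperature, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    return prediction, 1.0 - confidence               # higher value = more uncertain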
4 Experiments and Results
We consider five different datasets and two different base models in our experiments. Additionally, we conduct experiments to determine the optimal hyperparameters for the MC dropout method, particularly the optimal number of samples, which affects both the efficiency and the performance of MC dropout. In this paper we focus on the results for the 20 Newsgroups dataset; the results for the other four datasets are shown in Appendices B and C. We further determine the optimal dropout percentage in Appendix A.3.
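The runtime side of the trade-off can be measured with simple wall-clock timing. The sketch below is not the measurement harness used in the paper; it assumes a callable predict_fn wrapping either the single softmax pass or the K-sample efficient MC dropout pass.

import time

def average_runtime(predict_fn, batch, repeats=5):
    # Warm-up call so one-off setup costs (e.g. CUDA kernel compilation)
    # do not distort the measurement; for GPU models the device should
    # additionally be synchronised around the timer.
    predict_fn(batch)
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    return (time.perf_counter() - start) / repeats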
4.1 Data
To test the predictive uncertainty of the two methods, we use five datasets covering diverse text classification tasks. The 20 Newsgroups dataset (Lang, 1995) is a text classification task consisting of a collection of 20,000 news articles, classified into 20 different classes. The Amazon dataset (McAuley and Leskovec, 2013) is a sentiment classification task; we use the ‘sports and outdoors’ category, which consists of 272,630 reviews with ratings ranging from 1 to 5. The IMDb dataset (Maas et al., 2011) is also a sentiment classification task; however, compared to the Amazon dataset, this is a binary problem. The dataset consists of 50,000 reviews. The SST-2 dataset (Socher et al., 2013) is also a binary sentiment classification dataset, consisting of 70,042 sentences. Lastly, we use the WIKI dataset (Redi et al., 2019), which is a citation needed task, i.e. we predict whether a citation is needed. The dataset consists of 19,998 texts. For the 20 Newsgroups, Amazon, IMDb and Wiki datasets, we use a 60/20/20 split for the training, validation and test data; the data in the splits have been selected randomly. We used the provided splits for the SST-2 dataset, but due to the test labels being