Revisiting Softmax for Uncertainty Approximation in Text Classification
Andreas Nugaard Holm and Dustin Wright and Isabelle Augenstein
Department of Computer Science, University of Copenhagen
{aholm, dw, augenstein}@di.ku.dk
Abstract
Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability. One of the most widely used uncertainty approximation methods is Monte Carlo (MC) Dropout, which is computationally expensive as it requires multiple forward passes through the model. A cheaper alternative is to simply use the softmax based on a single forward pass without dropout to estimate model uncertainty. However, prior work has indicated that these predictions tend to be overconfident. In this paper, we perform a thorough empirical analysis of these methods on five datasets with two base neural architectures in order to identify the trade-offs between the two. We compare both softmax and an efficient version of MC Dropout on their uncertainty approximations and downstream text classification performance, while weighing their runtime (cost) against performance (benefit). We find that, while MC dropout produces the best uncertainty approximations, using a simple softmax leads to competitive, and in some cases better, uncertainty estimation for text classification at a much lower computational cost, suggesting that softmax can in fact be a sufficient uncertainty estimate when computational resources are a concern.
1 Introduction
The pursuit of pushing state-of-the-art performance on machine learning benchmarks often comes with an added cost of computational complexity. On top of already complex base models, such as Transformer models (Vaswani et al., 2017; Lin et al., 2021), successful methods often employ additional techniques to improve the uncertainty estimation of these models, as they tend to be over-confident in their predictions. Though these techniques can be effective, the overall benefit in relation to the added computational cost is under-studied.

More complexity does not always imply better performance. For example, Transformers can be outperformed by much simpler convolutional neural nets (CNNs) when the latter are pre-trained as well (Tay et al., 2021). Here, we turn our attention to neural network uncertainty estimation methods in text classification, which have applications in domain adaptation and decision making, and can help make models more transparent and explainable. In particular, we focus on a setting where efficiency is of concern, which can help improve the sustainability and democratisation of machine learning, as well as enable use in resource-constrained environments.
Quantifying predictive uncertainty in neural nets has been explored using various techniques (Gawlikowski et al., 2021), with the methods being divided into three main categories: Bayesian methods, single deterministic networks, and ensemble methods. Bayesian methods include Monte Carlo (MC) dropout (Gal and Ghahramani, 2016b) and Bayes by backprop (Blundell et al., 2015). Single deterministic networks can approximate the predictive uncertainty with a single forward pass through the model, with softmax being the prototypical method. Lastly, ensemble methods utilise a collection of models to calculate the predictive uncertainty. However, while uncertainty estimation can improve when using more complex Bayesian and ensembling techniques, efficiency takes a hit.
In this paper, we perform an empirical investigation of the trade-off between choosing cheap vs. expensive uncertainty approximation methods for text classification, with the goal of highlighting the efficacy of these methods in an efficient setting. We focus on one single deterministic and one Bayesian method. For the single deterministic method, we study the softmax, which is calculated from a single forward pass and is computationally very efficient. While softmax is a widely used method, prior work has posited that the softmax output, when taken as a single deterministic operation, is not the most dependable uncertainty approximation method (Gal and Ghahramani, 2016b; Hendrycks and Gimpel, 2017). As such, it has been superseded by newer methods such as MC dropout, which leverages the dropout function in neural nets to approximate a random sample of multiple networks and aggregates the softmax outputs of this sample. MC dropout is favoured due to its close approximation of uncertainty, and because it can be used without any modification to the applied model. It has also been widely applied in text classification tasks (Zhang et al., 2019; He et al., 2020).
To understand the cost vs. benefit of softmax vs. MC dropout, we perform experiments on five datasets using two different neural network architectures, applying them to three different downstream text classification tasks. We measure both the added computational complexity in the form of runtime (cost) and the downstream performance on multiple uncertainty metrics (benefit). We show that by using a single deterministic method like softmax instead of MC dropout, we can improve the runtime by 10 times while still providing reasonable uncertainty estimates on the studied tasks. As such, given the already high computational cost of deep neural network based methods and recent pushes for more sustainable ML (Strubell et al., 2019; Patterson et al., 2021), we recommend not discarding efficient uncertainty approximation methods such as softmax in resource-constrained settings, as they can still potentially provide reasonable estimations of uncertainty.
Contribution In summary, our contributions are: 1) an empirical study of an efficient version of MC dropout and softmax for text classification tasks, using two different neural architectures and five datasets; 2) a comparison of uncertainty estimation between MC dropout and softmax using expected calibration error; 3) a comparison of the cost vs. benefit of MC dropout and softmax in a setting where efficiency is of concern.
2 Related Work
2.1 Uncertainty Quantification
Quantifying the uncertainty of a prediction can be done using various techniques (Ovadia et al., 2019; Gawlikowski et al., 2021; Henne et al., 2020), such as single deterministic methods (Mozejko et al., 2018; van Amersfoort et al., 2020), which calculate the uncertainty from a single forward pass of the model. These can be further classified as internal or external methods, depending on whether the uncertainty is calculated internally in the model or by post-processing the output. Another family of techniques are Bayesian methods, which combine NNs and Bayesian learning. Bayesian Neural Networks (BNNs) can also be split into subcategories, namely Variational Inference (Hinton and van Camp, 1993), Sampling (Neal, 1993), and Laplace Approximation (MacKay, 1992). Some of the more notable methods are Bayes by backprop (Blundell et al., 2015) and Monte Carlo Dropout (Gal and Ghahramani, 2016b). One can also approximate uncertainty using ensemble methods, which use multiple models to better measure predictive uncertainty compared to the predictive uncertainty given by a single model (Lakshminarayanan et al., 2017; He et al., 2020; Durasov et al., 2021). Recently, uncertainty methods have been used to develop methods for new tasks (Zhang et al., 2019; He et al., 2020), where mainly Bayesian methods have been employed. We present a thorough empirical study of how uncertainty quantification behaves for text classification tasks. Unlike prior work, we do not only evaluate based on the performance of the methods, but perform an in-depth comparison to much simpler deterministic methods based on multiple metrics.
2.2 Uncertainty Metrics
Measuring the performance of uncertainty approximation methods can be done in multiple ways, each offering benefits and downsides. Niculescu-Mizil and Caruana (2005) explore obtaining confidence values from the predictions of supervised learning models. One of the more widespread and accepted methods is expected calibration error (ECE; Guo et al., 2017). While ECE measures the underlying confidence of the uncertainty approximation, we have also seen the use of human intervention for text classification (Zhang et al., 2019; He et al., 2020). There, the uncertainty estimates are used to identify uncertain predictions from the model, and humans are asked to classify these predictions. The human-classified data is assumed to have 100% accuracy and to be suitable for measuring how well the model scores after removing a proportion of the most uncertain data points. Metrics such as ECE reveal the calibration of models, and this calibration can be improved using scaling techniques (Guo et al., 2017; Naeini et al., 2015). We use uncertainty approximation metrics like expected calibration error and human intervention (which we refer to as holdout experiments) to measure the difference in performance between MC dropout and softmax on text classification tasks.
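As a concrete reference point, expected calibration error can be computed over equal-width confidence bins, following the standard formulation of Guo et al. (2017). The sketch below assumes NumPy arrays of per-example confidences, predicted labels, and gold labels; it is an illustration of the metric, not the evaluation code used in this paper.

import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    # Weighted average over bins of |bin accuracy - mean bin confidence|.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        bin_accuracy = (predictions[in_bin] == labels[in_bin]).mean()
        bin_confidence = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(bin_accuracy - bin_confidence)
    return ece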
3 Uncertainty Approximation for Text Classification
We focus on one deterministic method and one Bayesian method of uncertainty approximation. Both methods assume the existence of an already trained base model, and are applied at test time to obtain uncertainty estimates from the model's predictions. In the following sections, we formally introduce the two methods we study, namely MC dropout and softmax. MC dropout is a Bayesian method which utilises the dropout layers of the model to measure the predictive uncertainty, while softmax is a deterministic method that uses the classification output. In Figure 1, we visualise the differences between the two methods and how they are connected to base text classification models.
3.1 Bayesian Learning
Before introducing the MC dropout method, we quickly introduce the concept of Bayesian learning. We start by comparing Bayesian learning to a traditional NN. A traditional NN assumes that the network weights $\omega \in \mathbb{R}^n$ are real but of an unknown value and can be found through maximum-likelihood estimation, while the input data $(x, y) \in \mathcal{D}$ are treated as random variables. Bayesian learning instead views the weights as random variables, and infers a posterior distribution $p(\omega \mid \mathcal{D})$ over $\omega$ after observing $\mathcal{D}$. The posterior distribution is defined as follows:

$$p(\omega \mid \mathcal{D}) = \frac{p(\omega)\, p(\mathcal{D} \mid \omega)}{p(\mathcal{D})} = \frac{p(\omega)\, p(\mathcal{D} \mid \omega)}{\int p(\omega)\, p(\mathcal{D} \mid \omega)\, d\omega} \qquad (1)$$
Using the posterior distribution, we can find the prediction for an unseen data point $x$ with label $y$ as follows:

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, d\omega. \qquad (2)$$
However, the posterior distribution is infeasible to compute due to the marginal likelihood in the denominator, so we cannot find a solution analytically. We therefore resort to approximating the posterior distribution. For this approximation we rely on methods such as Bayes by Backpropagation (Blundell et al., 2015) and Monte Carlo Dropout (Gal and Ghahramani, 2016b).
3.2 Monte Carlo Dropout
At a high level, MC dropout approximates the posterior distribution $p(\omega \mid \mathcal{D})$ by leveraging the dropout layers in a model (Gal and Ghahramani, 2016b,a). Mathematically, it is derived by introducing a distribution $q(\omega)$, representing a distribution over weight matrices whose columns are randomly set to $0$, to approximate the posterior distribution $p(\omega \mid \mathcal{D})$, which results in the following predictive distribution:

$$q(y \mid x, \mathcal{D}) = \int p(y \mid x, \omega)\, q(\omega)\, d\omega. \qquad (3)$$
As this integral is still intractable, it is approximated by taking $K$ samples from $q(\omega)$ using the dropout layers of a learned network $f$ which approximates $p(y \mid x, \omega)$. As such, calculating $p(y \mid x, \omega)\, q(\omega)$ amounts to leaving the dropout layers active during testing, and approximating the integral amounts to aggregating predictions across multiple dropout samples. For the proofs, see Gal and Ghahramani (2016b).
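In other words, the intractable integral in Eq. (3) is replaced by a Monte Carlo estimate over $K$ stochastic forward passes. The following form is the standard MC dropout estimator, spelled out here for clarity rather than quoted from the original derivation:

$$q(y \mid x, \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^{K} p\left(y \mid x, \hat{\omega}_k\right), \qquad \hat{\omega}_k \sim q(\omega).$$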
MC dropout requires multiple forward passes, so its computational cost is a multiple of the cost of performing a forward pass through the entire network. As this is obviously more computationally expensive than the single forward pass required for deterministic methods, we provide a fairer comparison between softmax and MC dropout by using an efficient version of MC dropout which caches an intermediate representation and only activates the dropout layers of the latter part of the network. As such, we obtain a representation $z$ by passing an input through the first several layers of the model and pass only this representation through the latter part of the model multiple times, reducing the computational cost while approximating the sampling of multiple networks.
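To make the cached-representation variant concrete, the following is a minimal PyTorch-style sketch. It assumes the base model can be split into an encoder (the expensive first layers), a dropout layer, and a classification head; the module names, the split, and the framework choice are illustrative assumptions, not the original implementation.

import torch

def efficient_mc_dropout(encoder, dropout, head, x, k=10):
    # The heavy part of the network runs once; only the cheap head is resampled.
    encoder.eval()
    head.eval()
    dropout.train()  # keep dropout stochastic at test time
    with torch.no_grad():
        z = encoder(x)                       # cached representation z
        samples = [torch.softmax(head(dropout(z)), dim=-1) for _ in range(k)]
    return torch.stack(samples)              # shape: (K, batch_size, num_classes)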
3.2.1 Combining Sample Predictions
With multiple samples of the same data point, we have to determine how to combine them to quantify the predictive uncertainty. We test two methods that can be calculated using the logits of the model, requiring no model changes. The first approach, which we refer to as Mean MC, averages the output of the softmax layer over all forward passes:

$$u_i = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Softmax}\left(f(z_i^k)\right), \qquad (4)$$

where $z_i^k$ is the representation of the $i$'th data point on the $k$'th forward pass and $f$ is a fully-connected layer.

Figure 1: MC Dropout (left) and softmax (right). In the version of MC dropout tested in this paper, a test input $x$ is passed through model $f$ to obtain a representation $z$, which is subsequently passed through a dropout layer multiple times, and then through the final part of the network to obtain prediction $y$. For softmax, dropout is disabled and a single prediction is obtained.

The second method we use to quantify the predictive uncertainty is Dropout Entropy (DE) (Zhang et al., 2019), which uses a combination of binning and entropy:
$$b_i = \frac{1}{K}\, \mathrm{BinCount}\left(\mathrm{argmax}\left(f(z_i)\right)\right) \qquad (5)$$

$$u_i = -\sum_{j=1}^{C} b_i(j) \log b_i(j) \qquad (6)$$
$\mathrm{BinCount}$ is the number of predictions of each class across the $K$ forward passes, and $b_i$ is a vector of the probabilities of each class's occurrence based on the bin count. We show the performance of the two methods in Section 4.3.2.
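As an illustration of how the two aggregation schemes operate on the stacked softmax outputs (shape (K, batch, num_classes), e.g. as produced by the sketch in Section 3.2), consider the following sketch; the treatment of empty bins ($0 \log 0 = 0$) is an implementation convention we assume, not something specified in the original formulation.

import torch

def mean_mc(probs):
    # Eq. (4): average the K softmax outputs per example.
    return probs.mean(dim=0)                          # (batch, num_classes)

def dropout_entropy(probs, num_classes):
    # Eqs. (5)-(6): bin the K argmax predictions per example and take the
    # entropy of the resulting class frequencies (0 log 0 treated as 0).
    k = probs.shape[0]
    preds = probs.argmax(dim=-1)                      # (K, batch)
    scores = []
    for i in range(preds.shape[1]):
        b = torch.bincount(preds[:, i], minlength=num_classes).float() / k
        nonzero = b[b > 0]
        scores.append(-(nonzero * nonzero.log()).sum())
    return torch.stack(scores)                        # (batch,)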
3.3 Softmax
Softmax, a common normalising function for producing a probability distribution from neural network logits, is defined as follows:

$$u_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_i(j)}}, \qquad (7)$$

where $z_i$ are the logits of the $i$'th data point. The softmax yields a probability distribution over the predicted classes. However, the predicted probability distribution is often overconfident toward the predicted class (Gal and Ghahramani, 2016b; Hendrycks and Gimpel, 2017). The issue of softmax's overconfidence can also be exploited (Gal and Ghahramani, 2016b; Joo et al., 2020) – in the worst case, this leads to the softmax producing imprecise uncertainties. However, model calibration methods like temperature scaling have been found to lessen the overconfidence to some extent (Guo et al., 2017). As temperature scaling also incurs a cost in terms of runtime in order to find an optimal temperature, we choose to compare raw softmax probabilities to the efficient MC dropout method described previously, though uncertainty estimation could potentially be improved by scaling the logits appropriately.
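For completeness, the single-pass baseline can be sketched as follows. The optional temperature argument reflects the temperature-scaling remark above, with a value of 1.0 (raw softmax) corresponding to the setting we compare against; the function name and interface are ours, not from the original code.

import torch

def softmax_uncertainty(model, x, temperature=1.0):
    # Single deterministic forward pass with dropout disabled (eval mode);
    # the maximum class probability acts as the confidence estimate.
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x) / temperature, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    return prediction, 1.0 - confidence               # higher value = more uncertain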
4 Experiments and Results
We consider five different datasets and two different base models in our experiments. Additionally, we conduct experiments to determine the optimal hyperparameters for the MC dropout method, particularly the optimal number of samples, which affects both the efficiency and the performance of MC dropout. In this paper we focus on the results for the 20 Newsgroups dataset; the results for the other four datasets are shown in Appendices B and C. We further determine the optimal dropout percentage in Appendix A.3.
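The runtime side of the trade-off can be measured with simple wall-clock timing. The sketch below is not the measurement harness used in the paper; it assumes a callable predict_fn wrapping either the single softmax pass or the K-sample efficient MC dropout pass.

import time

def average_runtime(predict_fn, batch, repeats=5):
    # Warm-up call so one-off setup costs (e.g. CUDA kernel compilation)
    # do not distort the measurement; for GPU models the device should
    # additionally be synchronised around the timer.
    predict_fn(batch)
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(batch)
    return (time.perf_counter() - start) / repeats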
4.1 Data
To test the predictive uncertainty of the two methods, we use five datasets covering diverse text classification tasks. The 20 Newsgroups dataset (Lang, 1995) is a text classification task consisting of a collection of 20,000 news articles, classified into 20 different classes. The Amazon dataset (McAuley and Leskovec, 2013) is a sentiment classification task; we use the ‘sports and outdoors’ category, which consists of 272,630 reviews with ratings ranging from 1 to 5. The IMDb dataset (Maas et al., 2011) is also a sentiment classification task; however, compared to the Amazon dataset, this is a binary problem. The dataset consists of 50,000 reviews. The SST-2 dataset (Socher et al., 2013) is also a binary sentiment classification dataset, consisting of 70,042 sentences. Lastly, we use the WIKI dataset (Redi et al., 2019), which is a citation needed task, i.e. we predict whether a citation is needed. The dataset consists of 19,998 texts. For the 20 Newsgroups, Amazon, IMDb and Wiki datasets, we use a 60/20/20 split for the training, validation and test data; the data in the splits have been selected randomly. We used the provided splits for the SST-2 dataset, but due to the test labels being