proximation method (Gal and Ghahramani, 2016b; Hendrycks and Gimpel, 2017). As such, it has been
superseded by newer methods such as MC dropout,
which leverages the dropout function in neural nets
to approximate a random sample of multiple net-
works and aggregates the softmax outputs of this
sample. MC dropout is favoured because it closely approximates predictive uncertainty and because it can be used without any modification to the underlying model. It has also been widely applied in text classification tasks (Zhang et al., 2019; He et al., 2020).
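To make the two approaches concrete, the following is a minimal PyTorch-style sketch (illustrative only, not our exact experimental setup) of a single deterministic softmax pass and of MC dropout, which keeps dropout active at test time, draws several stochastic forward passes, and averages their softmax outputs; the toy classifier, inputs, and number of samples T are assumptions made for the example.

import torch
import torch.nn as nn

# Illustrative classifier; any network containing dropout layers would do.
model = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.3), nn.Linear(128, 4)
)
x = torch.randn(8, 256)  # hypothetical batch of 8 feature vectors

# Single deterministic pass: dropout disabled, confidence = max softmax probability.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(x), dim=-1)
softmax_confidence = probs.max(dim=-1).values

# MC dropout: T stochastic passes with dropout kept active, softmax outputs averaged.
T = 10
model.train()  # keeps dropout sampling active at inference time
with torch.no_grad():
    mc_probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(T)]
    ).mean(dim=0)
mc_confidence = mc_probs.max(dim=-1).values

In practice, only the dropout modules are switched to training mode so that layers such as batch normalization stay deterministic; the T extra forward passes are what make MC dropout roughly T times more expensive than a single softmax pass.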
To understand the cost vs. benefit of softmax
vs. MC dropout, we perform experiments on five
datasets using two different neural network archi-
tectures, applying them to three different down-
stream text classification tasks. We measure both
the added computational complexity in the form
of runtime (cost) and the downstream performance
on multiple uncertainty metrics (benefit). We show
that by using a single deterministic method like
softmax instead of MC dropout, we can improve
the runtime by a factor of 10 while still providing reasonable uncertainty estimates on the studied tasks.
As such, given the already high computational
cost of deep neural network-based methods and recent pushes for more sustainable ML (Strubell et al., 2019; Patterson et al., 2021), we recommend not discarding efficient uncertainty approximation methods such as softmax in resource-constrained settings, as they can still provide reasonable estimates of uncertainty.
Contribution In summary, our contributions are:
1) an empirical study of an efficient version of MC
dropout and softmax for text classification tasks,
using two different neural architectures and five datasets; 2) a comparison of uncertainty estimation between MC dropout and softmax using expected calibration error (sketched below); 3) a comparison of the cost vs.
benefit of MC dropout and softmax in a setting
where efficiency is of concern.
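To make the evaluation metric of contribution 2 concrete, the sketch below shows one common way of computing expected calibration error with equal-width confidence bins; the bin count and the toy inputs are illustrative assumptions rather than our exact implementation.

import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    # Weighted average of |accuracy - confidence| over equal-width confidence bins.
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # bin weight = fraction of samples in the bin
    return ece

# Hypothetical confidences (e.g. max softmax or MC dropout probabilities).
print(expected_calibration_error(
    confidences=[0.9, 0.75, 0.6, 0.95],
    predictions=[1, 0, 2, 1],
    labels=[1, 0, 1, 1],
))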
2 Related Work
2.1 Uncertainty Quantification
Quantifying the uncertainty of a prediction can be
done using various techniques (Ovadia et al., 2019; Gawlikowski et al., 2021; Henne et al., 2020), such as single deterministic methods (Mozejko et al., 2018; van Amersfoort et al., 2020), which calculate the uncertainty from a single forward pass of the model. These can further be classified as internal or external methods, depending on whether the uncertainty is computed inside the model itself or by post-processing its output. Another family of techniques comprises Bayesian methods, which combine NNs and Bayesian learning. Bayesian Neural Networks
(BNNs) can also be split into subcategories, namely
Variational Inference (Hinton and van Camp, 1993), Sampling (Neal, 1993), and Laplace Approximation (MacKay, 1992). Some of the more notable methods are Bayes by backprop (Blundell et al., 2015) and Monte Carlo Dropout (Gal and Ghahramani, 2016b). One can also approximate uncertainty using ensemble methods, which aggregate multiple models to obtain a better measure of predictive uncertainty than a single model provides (Lakshminarayanan et al., 2017; He et al., 2020; Durasov et al., 2021), as sketched at the end of this subsection. Recently, uncertainty quantification has been used to develop methods for new tasks (Zhang et al., 2019; He et al., 2020), where mainly Bayesian methods have been employed. We present a thorough empirical
study of how uncertainty quantification behaves
for text classification tasks. Unlike prior work, we do not evaluate solely on the predictive performance of the methods, but also perform an in-depth comparison with much simpler deterministic methods across multiple metrics.
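As a rough illustration of the ensemble approach referenced above, the following sketch averages the softmax outputs of several independently trained copies of the same architecture and uses the predictive entropy of the averaged distribution as an uncertainty score; the architecture, ensemble size, and untrained toy members are assumptions made for the example.

import torch
import torch.nn as nn

def make_member():
    # Illustrative architecture; each member would normally be trained
    # independently from a different random initialization.
    return nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 4))

ensemble = [make_member() for _ in range(5)]
x = torch.randn(8, 256)  # hypothetical batch of feature vectors

with torch.no_grad():
    member_probs = torch.stack(
        [torch.softmax(m(x), dim=-1) for m in ensemble]
    )                                   # shape: (members, batch, classes)
probs = member_probs.mean(dim=0)        # averaged predictive distribution
# Predictive entropy of the averaged distribution as the uncertainty score.
uncertainty = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)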
2.2 Uncertainty Metrics
Measuring the performance of uncertainty approxi-
mation methods can be done in multiple ways, each
offering benefits and downsides. Niculescu-Mizil and Caruana (2005) examine how well confidence values obtained from the predictions of supervised learning models are calibrated. One of the more widespread and accepted metrics is the expected calibration error (ECE, Guo et al., 2017). While ECE
measures the underlying confidence of the uncer-
tainty approximation, we have also seen the use of
human intervention for text classification (Zhang
et al., 2019; He et al., 2020). There, the uncertainty
estimates are used to identify uncertain predictions
from the model and ask humans to classify these
predictions. The human-classified data is assumed to have 100% accuracy and to be suitable for measuring how well the model scores after removing a
proportion of the most uncertain data points. Metrics such as ECE reveal how well calibrated a model is, and this calibration can be improved with scaling techniques (Guo et al., 2017; Naeini et al.,
2015). We use uncertainty approximation metrics
like expected calibration error, and human interven-