
To Softmax, or not to Softmax A PREPRINT
A query strategy f: U → Q then subsequently chooses a batch of b unlabeled samples Q, which will be labeled
by the oracle (human expert) and added to the set of labeled data L. This AL cycle repeats τ times until a stopping
criterion is met.
In its most basic form, a sampling criterion focuses on either informativeness/uncertainty or
representativeness/diversity. Informativeness in this context prefers samples that reduce the error of the underlying
classification model by minimizing the model's uncertainty when predicting unknown samples. Representativeness aims
at sampling evenly across the vector space of the samples. As we are interested in improving the core measure
of informativeness, the confidence of the ML model in its own predictions, this paper concentrates on
evaluating AL strategies that rely solely on the informativeness metric.
Commonly used AL query strategies that rely on informativeness use the confidence of the learner model θ
to select the AL query. The confidence is defined by the probability Pθ(y|x) that the learner assigns to the label y
for a sample x. The simplest informativeness-based AL strategy is Uncertainty Least Confidence (LC) [LG94], which
selects those samples about which the learner model is most uncertain, i.e. where the probability Pθ(ŷ|x) of the most
probable label ŷ is the lowest:
$$f_{LC}(U) = \operatorname*{argmax}_{x \in U} \left(1 - P_\theta(\hat{y} \mid x)\right) \tag{1}$$
As we are interested in the effectiveness of alternative methods for computing the confidence probability Pθ, we
use each method in combination with Uncertainty Least Confidence (LC), which relies directly and solely on the
confidence without further processing.
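
The selection rule in Eq. (1) can be illustrated with a minimal NumPy sketch. The function name `least_confidence_query`, the batch-size argument, and the toy probability matrix are assumptions made for illustration and are not taken from the paper; any of the confidence quantification methods could supply the matrix `probs`.

```python
import numpy as np

def least_confidence_query(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Return the indices of the `batch_size` most uncertain samples.

    probs has shape (n_unlabeled, n_classes); each row holds the
    confidence probabilities P_theta(y | x) for one unlabeled sample.
    """
    # Confidence in the most probable label for each sample.
    top_confidence = probs.max(axis=1)
    # LC score from Eq. (1): large when the model is unsure.
    lc_scores = 1.0 - top_confidence
    # Sort by descending score and keep the first `batch_size` indices.
    order = np.argsort(-lc_scores, kind="stable")
    return order[:batch_size]

# Hypothetical confidences for four unlabeled samples and three classes.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],
])
selected = least_confidence_query(probs, batch_size=2)
print(selected)
```

Because the sketch only reads the highest probability per sample, the quality of the query depends entirely on how well that probability reflects the true confidence.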
3 Confidence Probability Quantification Methods
The confidence probability of an ML model should represent how probable it is that the prediction is correct. For
example, a confidence of 70% should mean a correct prediction in 70 out of 100 cases. We use Neural Networks
(NNs) as the ML model. Because its components always sum to one, the softmax function is often
used as a makeshift probability measure for the confidence of NNs:
$$\sigma(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}, \quad \text{for } i = 1, \dots, K \tag{2}$$
The output of a neuron i in the last layer before entering the activation function is called its logit and is denoted
z_i; K denotes the number of neurons in the last layer. But as other researchers have noted [DD22; GI22;
LPB17; SWF22; WT22], the objective in training NNs is purely to maximize the value of the correct output
neuron, not to create a true confidence probability. An inherent limitation of the softmax function is its inability to
express, in the theoretical case, zero confidence in its prediction, as the sum over all possible outcomes always equals
1. Previous work has indicated that softmax-based confidence is often overconfident [GG16]. In particular, NNs using
the popular ReLU activation function [Ag18] for the inner layers can easily be tricked into being overly confident
in any prediction by simply scaling the input x with an arbitrarily large value α > 1 to x̃ = αx [HAB19; PBZ21].
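
The scaling effect can be reproduced with a small NumPy sketch of Eq. (2). The two-layer network, its hand-picked weights, and the absence of bias terms are assumptions chosen purely for illustration; without biases the logits of a ReLU network scale linearly with α, which makes the effect easy to see.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical toy weights: 2 inputs -> 3 ReLU units -> K = 3 logits, no biases.
W1 = np.array([[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]])
W2 = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, -1.0], [0.5, 0.5, 0.0]])

def forward(x: np.ndarray) -> np.ndarray:
    hidden = np.maximum(W1 @ x, 0.0)  # ReLU inner layer
    return W2 @ hidden                # logits z_1, ..., z_K

x = np.array([1.0, 0.5])
alphas = [1.0, 10.0, 100.0]
# Softmax confidence and predicted class for the scaled inputs alpha * x.
confidences = [float(softmax(forward(a * x)).max()) for a in alphas]
predictions = [int(softmax(forward(a * x)).argmax()) for a in alphas]

for a, c, p in zip(alphas, confidences, predictions):
    print(f"alpha={a:6.1f}  predicted class={p}  softmax confidence={c:.4f}")
```

The predicted class stays the same for every α, while the softmax confidence grows towards 1, although scaling the input adds no evidence for that class.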
We selected seven methods from the literature that are suitable for quantifying the confidence of Deep NNs such as
Transformer models. Such methods can be divided into four categories [Ga21]: a) single network deterministic methods,
which deterministically produce the same result for each NN forward pass (Inhibited Softmax (IS) [MSK18], TrustScore
(TrSc) [Ji18] and Evidential Neural Networks (Evi) [SKK18]), b) Bayesian methods, which sample from a distribution
and therefore yield non-deterministic results (Monte-Carlo Dropout (MC) [GG16]), c) ensemble methods, which combine
multiple deterministic models into a single decision (Softmax Ensemble [SOS92]), and d) test-time augmentation
methods, which, similarly to the ensemble methods, augment the input samples and return the combined prediction
for the augmented samples. We leave the last category to future research, as we could not find a subset of data
augmentation techniques that worked reliably for our use case across different datasets.
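
Categories b) to d) share the step of combining several predictions into one. A minimal sketch of that step follows, assuming the combination is a plain average of class probabilities; this is a common choice, but the cited methods may combine predictions differently. The function name `combine_predictions` and the toy values are hypothetical.

```python
import numpy as np

def combine_predictions(member_probs: np.ndarray) -> np.ndarray:
    """Average class probabilities over several predictions.

    member_probs has shape (n_members, n_samples, n_classes). A "member"
    may be one model of an ensemble, one stochastic forward pass with
    dropout enabled, or the prediction for one augmented input.
    """
    return member_probs.mean(axis=0)

# Three hypothetical members disagreeing on a single two-class sample.
member_probs = np.array([
    [[0.9, 0.1]],
    [[0.6, 0.4]],
    [[0.3, 0.7]],
])
combined = combine_predictions(member_probs)
print(combined)
```

Disagreement between members lowers the combined confidence, which a single softmax output cannot express.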
In addition to the aforementioned categories, the existing softmax function can be calibrated to produce meaningful
confidence probabilities. For calibration we selected two techniques: Label Smoothing (LS) [Sz16] and Temperature
Scaling (TeSc) [ZKH20].
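
Both calibration techniques can be sketched in a few lines, using their commonly stated formulations, which may differ in detail from the cited papers. Temperature scaling divides the logits by a temperature T before the softmax; label smoothing replaces one-hot training targets with softened ones. The values of `temperature` and `epsilon` below are arbitrary illustrative assumptions; in practice T is typically fitted on held-out data.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

def temperature_scaled_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    # T > 1 flattens the distribution, T < 1 sharpens it, T = 1 is plain softmax.
    return softmax(logits / temperature)

def smooth_labels(labels: np.ndarray, n_classes: int, epsilon: float) -> np.ndarray:
    # Move a share epsilon of the target mass uniformly onto all classes.
    one_hot = np.eye(n_classes)[labels]
    return (1.0 - epsilon) * one_hot + epsilon / n_classes

logits = np.array([2.0, 1.0, 0.0])
plain = softmax(logits)
scaled = temperature_scaled_softmax(logits, temperature=2.0)
targets = smooth_labels(np.array([0]), n_classes=3, epsilon=0.1)

print(plain.max(), scaled.max())
print(targets)
```

Temperature scaling leaves the predicted class unchanged and only alters the confidence attached to it, whereas label smoothing acts during training through the targets.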
More elaborate AL strategies like BALD [KAG19] or QUIRE [HJZ10] not only focus on the confidence probability
measure, but also make use of the vector space to label a diverse training set, including regions far away from the
classification boundary. As the focus of this paper is purely on evaluating the influence of the confidence prediction
methods, we deliberately use only the most basic AL strategy, Uncertainty Least Confidence.
In the following, the core ideas of the individual methods are briefly explained; more details, reasoning, and the exact
formulas can be found in the original papers. Finally, we explain our meta-strategy Uncertainty Clipping,
which significantly enhances several confidence probability quantification methods.