
1.1 Related work
There are (at least) two scenarios where a model has a high predictive uncertainty: when it encounters an unknown sample (out-of-distribution, OOD) or a known (in-distribution, ID) but ambiguous
sample. In either case, the model accuracy is likely low and samples should be deferred to a human
expert to avoid erroneous predictions. By setting an upper threshold on the uncertainty of samples,
uncertainty estimation methods can be used for selective prediction.
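To make the selective-prediction setup concrete, a minimal NumPy sketch is given below; the function and variable names are ours for illustration and do not come from any of the cited works.

```python
import numpy as np

def selective_prediction(predictions, uncertainties, threshold):
    """Accept predictions whose uncertainty lies below the threshold and
    return the indices of deferred samples (e.g. sent to a human expert).

    The threshold is a free parameter; how (and on which data) it is chosen
    is exactly the issue discussed in this section.
    """
    accept = uncertainties < threshold      # boolean mask of accepted samples
    deferred = np.flatnonzero(~accept)      # indices handed to the expert
    return predictions[accept], deferred
```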
Uncertainty estimation methods The maximum softmax probability (MSP) [Hendrycks and Gimpel, 2016] is a common baseline estimate of predictive uncertainty, but is in general not well calibrated [Guo et al., 2017]. Bayesian neural networks [MacKay, 1992, Buntine and Weigend, 1991, MacKay, 1995] provide a principled approach to quantifying model uncertainty, but require dedicated architectures, are difficult and expensive to train, hard to scale to large models and datasets, and their uncertainty estimates may not be robust to dataset shift [Ovadia et al., 2019, Gustafsson et al., 2020]. There are various approximations that reduce the computational complexity, such as low-rank approximations [Dusenberry et al., 2020] and Markov chain Monte Carlo methods [Welling and Teh, 2011], or that can be transferred to standard network architectures, such as Laplace approximations [MacKay, 1992, Daxberger et al., 2021].
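As an illustration of the MSP baseline, a NumPy sketch with our own naming (not code from Hendrycks and Gimpel [2016]) is:

```python
import numpy as np

def msp_uncertainty(logits):
    """MSP baseline: uncertainty = 1 - max_c p(y = c | x).

    logits: array of shape (N, C) with unnormalized class scores.
    """
    z = logits - logits.max(axis=-1, keepdims=True)            # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return 1.0 - probs.max(axis=-1)                            # shape (N,)
```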
Monte-Carlo (MC) dropout [Gal and Ghahramani, 2016] is a widely used method, as it is easily implemented in architectures with dropout layers [Hinton et al., 2012]. Currently, state-of-the-art uncertainty estimates are obtained from the entropy of the predictions of a deep ensemble [Lakshminarayanan et al., 2017], or an efficient approximation thereof [Wen et al., 2020]. Similarly competitive are a number of recently proposed methods that require only a single forward pass and are "distance-aware" [Tagasovska and Lopez-Paz, 2019, Liu et al., 2020, Mukhoti et al., 2021,
Van Amersfoort et al., 2020, Jain et al., 2021]. They use feature space distances between training
and test samples to quantify uncertainty. This allows them to accurately estimate uncertainty far
away from the decision boundary.
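For concreteness, the ensemble- and MC Dropout-based uncertainty referred to above is typically the entropy of the averaged member predictions; the following is a minimal NumPy sketch with our own naming, assuming the member softmax outputs are already computed.

```python
import numpy as np

def predictive_entropy(member_probs):
    """Entropy of the mean prediction over M ensemble members
    (or M stochastic MC Dropout forward passes).

    member_probs: array of shape (M, N, C) with softmax outputs for
    N samples and C classes.
    """
    mean_probs = member_probs.mean(axis=0)                            # (N, C)
    # Small constant avoids log(0) for classes with zero mass.
    return -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=-1)    # (N,)
```

A similarly simplified sketch of the distance-aware idea uses the mean distance to the k nearest training features as the uncertainty score; the cited methods differ in how this distance is modelled, so this is only illustrative.

```python
import numpy as np

def knn_feature_distance(train_feats, test_feats, k=5):
    """Uncertainty proxy: mean Euclidean distance from each test feature
    to its k nearest training features (k is a hypothetical choice).

    train_feats: (N_train, D), test_feats: (N_test, D).
    """
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)                     # (N_test,)
```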
Uncertainty estimation and OOD detection in histopathology Several studies in histopathology use deep ensembles [Lakshminarayanan et al., 2017, Pocevičiūtė et al., 2021, Thagaard et al., 2020], while Senousy et al. [2021b] use MSP to select models for an ensemble. Linmans et al. [2020] use an ensembling approach on multiple prediction heads for open set recognition (OSR). Rączkowski et al. [2019] show that MC Dropout-based uncertainty is high for ambiguous or mislabelled patches, but did not test on OOD data; Syrykh et al. [2020] use MC Dropout for both OSR and OOD detection but do not report OOD detection metrics. Note that MC Dropout can be problematic in network architectures commonly used in computational histopathology², and has been shown to negatively affect task performance [Linmans et al., 2020].
Unfortunately, the OOD detection reported in Syrykh et al. [2020], Pocevičiūtė et al. [2021], Senousy et al. [2021a] is of limited insight, as uncertainty thresholds for selective prediction were set on the same OOD data on which performance was evaluated. Dolezal et al. [2022] avoid such data leakage by setting the uncertainty threshold on validation data using cross-validation. However, it is unclear whether a threshold chosen to distinguish between correct and incorrect ID samples is suitable to separate ID and OOD data. For example, on a dataset for which there is no correct diagnosis, more than 20% of slides are still rated as high-confidence by the uncertainty-aware classifier of Dolezal et al. [2022].
1.2 Contributions
The uncertainty estimation methods used in previous work either do not achieve state-of-the-art performance on standard ML datasets (MC Dropout) or require substantial additional compute (deep ensembles). They also estimate uncertainty around the decision boundary, i.e. they are most suitable for detecting ambiguous samples, but may give high-confidence estimates for OOD samples far away from the decision boundary. Recently proposed "distance-aware" uncertainty estimation
²Li et al. [2019] demonstrated that applying MC Dropout with dropout rates ≥0.1 in networks that use Layer Normalization [Ba et al., 2016] – as is the case in the related work cited above – can be problematic: the combination causes unstable numerical behavior during inference on a range of architectures (DenseNet, ResNet, ResNeXt [Xie et al., 2017] and Wide ResNet [Zagoruyko and Komodakis, 2016]), and requires additional implementation strategies.