experts 1,2. Despite the successes of these models, we rarely see them implemented in
clinical practice. This phenomenon is multifactorial, but one concern is the validity of
a particular prediction. Can clinicians trust a prediction, and how can this be measured?
Can a model self-identify cases where it does not "know" whether its prediction is reliable, i.e. can it say "I do not know"? Uncertainty estimation has been proposed as a way to quantify the reliability of a prediction and might provide a means to reassure physicians regarding a model's output. Alternatively, if the model can indicate that it is "unsure," the physician can ignore the prediction and rely entirely on their clinical judgment.
There are two primary sources of uncertainty: aleatoric and epistemic uncertainties
3. At a basic level, aleatoric uncertainty is associated with inherent noise within the
data; clinically, a contributor to aleatoric uncertainty may be artifact present on a CT
image. Epistemic uncertainty represents a model's lack of knowledge. For example, a
model trained to predict outcomes associated with a diverse array of head and neck
cancers may be "uncertain" when expected to make a prediction on a poorly represented
subset. Concretely, say a model was trained on 100 oropharyngeal cancer cases, 80
laryngeal cancer cases, and three nasopharyngeal cancer cases. When asked to make
another prediction on a new nasopharyngeal cancer case, it might be more "uncertain"
compared to a new oropharyngeal cancer case. Alternatively, epistemic uncertainty might help identify model misuse: the model above would likely be even more "uncertain" if exposed to a thyroid cancer case. While still limited, uncertainty estimation has become increasingly prevalent in the medical machine learning literature, and several methods have been proposed for estimating both aleatoric and epistemic uncertainty 3–9.
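To make the distinction concrete, the sketch below (illustrative only, not the study's code) shows one common way to separate the two sources from Monte Carlo samples of a classifier's softmax output, for example from dropout variational inference or an ensemble: the total predictive entropy splits into the expected entropy (an aleatoric surrogate) and the mutual information (an epistemic surrogate). The array shapes and variable names are assumptions made for the example.

```python
import numpy as np

def decompose_uncertainty(probs: np.ndarray, eps: float = 1e-12):
    """Split predictive uncertainty from Monte Carlo samples.

    probs: array of shape (T, C) holding T sampled softmax vectors over
           C classes for a single case (e.g., T stochastic forward
           passes with dropout kept active at test time).
    Returns (total, aleatoric, epistemic) entropy values in nats.
    """
    mean_p = probs.mean(axis=0)                                 # averaged predictive distribution
    total = -np.sum(mean_p * np.log(mean_p + eps))              # entropy of the mean prediction
    aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))  # expected per-sample entropy
    epistemic = total - aleatoric                               # mutual information (disagreement between samples)
    return total, aleatoric, epistemic

# Toy example: 5 sampled predictions for a binary outcome
samples = np.array([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.6, 0.4], [0.85, 0.15]])
print(decompose_uncertainty(samples))
```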
Most comparison studies on uncertainty estimation methods describe the pros and
cons associated with various methods 10. For example, dropout variational inference and test-time augmentation are more computationally expensive when making predictions on test data, whereas ensemble methods instead require significant computational effort during training. Single deterministic approaches are less computationally expensive than either of these but require the model to be retrained to predict the uncertainty distribution 5,6,8,9,11.
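As a rough illustration of where that cost falls, the toy sketch below (an assumption, not any of the cited implementations) contrasts the T stochastic forward passes that dropout variational inference or test-time augmentation needs for every new test case with a deep ensemble, whose M members are costly to train but each predict only once at test time. The stand-in model is a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_stochastic_model(x):
    """Stand-in for one stochastic forward pass of a binary classifier
    (e.g., a network with dropout left active at test time)."""
    logits = np.array([1.0, -1.0]) + rng.normal(scale=0.3, size=2)
    e = np.exp(logits - logits.max())
    return e / e.sum()

T = 30  # stochastic forward passes per test case (dropout VI / test-time augmentation)
M = 5   # ensemble members, each trained separately before deployment

x = None  # placeholder input for the toy model

# Dropout VI / TTA: the per-case test-time cost grows with T.
mc_samples = np.stack([toy_stochastic_model(x) for _ in range(T)])

# Deep ensemble: the training cost grows with M, but each member
# contributes only one forward pass per test case (often M << T).
ensemble_samples = np.stack([toy_stochastic_model(x) for _ in range(M)])

print("MC dropout mean/spread:", mc_samples.mean(axis=0), mc_samples.std(axis=0))
print("Ensemble mean/spread:  ", ensemble_samples.mean(axis=0), ensemble_samples.std(axis=0))
```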
Excitingly, Berger et al. recently compared out-of-distribution detection methods (i.e. epistemic uncertainty surrogates) on a large publicly available medical dataset (CheXpert) and made the critical observation that performance on traditional computer vision datasets does not always translate when applied to medical datasets 4,12. They also briefly explored threshold selection for out-of-distribution identification using temperature scaling: high confidence/low uncertainty was associated with higher accuracy. However, accuracy is only one metric that matters to a clinician; sensitivity and specificity are key. Moreover, threshold selection is critical, as too strict a threshold might limit a model's utility (i.e. decrease the number of patients for whom the model is "certain") or negatively affect clinically used metrics such as sensitivity or specificity 13.
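The trade-off described above can be made explicit with a short sketch (illustrative only; the data, cutoffs, and names are assumptions rather than the study's code): for each candidate uncertainty cutoff, predictions on "uncertain" cases are deferred, and sensitivity, specificity, and the fraction of patients retained are computed only on the cases the model keeps.

```python
import numpy as np

def metrics_at_cutoffs(y_true, y_pred, uncertainty, cutoffs):
    """For each cutoff, defer cases whose uncertainty exceeds it and
    report sensitivity/specificity on the retained ("certain") cases."""
    rows = []
    for c in cutoffs:
        keep = uncertainty <= c
        yt, yp = y_true[keep], y_pred[keep]
        tp = np.sum((yt == 1) & (yp == 1))
        tn = np.sum((yt == 0) & (yp == 0))
        fn = np.sum((yt == 1) & (yp == 0))
        fp = np.sum((yt == 0) & (yp == 1))
        sens = tp / (tp + fn) if (tp + fn) else np.nan
        spec = tn / (tn + fp) if (tn + fp) else np.nan
        rows.append((c, keep.mean(), sens, spec))  # cutoff, retained fraction, sensitivity, specificity
    return rows

# Toy data: a binary outcome, imperfect predicted labels, and a per-case uncertainty score
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 200)
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)  # ~80% accurate predictions
uncertainty = rng.random(200)
for row in metrics_at_cutoffs(y_true, y_pred, uncertainty, cutoffs=[0.25, 0.5, 0.75, 1.0]):
    print(row)
```

A stricter cutoff in this sketch shrinks the retained fraction of patients, which is exactly the utility cost weighed against any gain in sensitivity or specificity.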
This study employs several epistemic and aleatoric uncertainty estimation methods
and compares performance at various cutoffs using AUC, sensitivity, and specificity
for a model trained to predict a clinically significant event—feeding tube placement—
in head and neck cancer patients treated with definitive radiation therapy. Feeding tube