
of training data. This is broadly classified as aleatoric and
epistemic uncertainty.
• Aleatoric or indirect uncertainty - This type of uncertainty
arises from unaccounted-for factors, such as environment
settings, noise in the input data, or poor input feature
selection. It is also known as irreducible error and
cannot be remediated with more data. One way to
overcome such uncertainty is to make sure that the
data collection strategy is carefully designed and the
measurement environment is constrained, so that the effect
of the environment or any external factors is minimized.
Careful selection of features that represent the phenomenon
or application is also of utmost importance for avoiding such
uncertainty.
• Epistemic or direct uncertainty - This type of uncertainty
usually arises from a lack of knowledge about the
model or data. One example is over-generalization,
where the ML model is very complex compared to
the amount of data it is trained on and thereby over-
generalizes on a test dataset. This type of uncertainty can
be overcome by collecting more data, experimenting
with different ML model architectures, or by chang-
ing/tweaking model parameters. Since this uncertainty
is caused inherently by the model or data, it can
be reduced with more data or a different model
architecture.
Epistemic uncertainty is also used to detect dataset shift
(when the test data has a different distribution than the training
data) or adversarial inputs. Modeling epistemic uncertainty is
more challenging than modeling the aleatoric one: the latter is
incorporated in the model loss function, while the epistemic is
highly dependent on the model itself and may vary from one
model architecture to another.
A. Modeling Aleatoric and Epistemic Uncertainty
Given a dataset, $D = \{X_i, y_i\}$, $i \in \{1, \ldots, n\}$, where $X_i$
is the $i$th input, $y_i$ is the $i$th output, and $n$ is the total number
of input examples in the dataset, the ML model can then be
described as a function $\hat{f}: X_i \mapsto \hat{y}_i$, or
$$\hat{y}_i = \hat{f}(X_i) \tag{1}$$
and let us assume the original data-generating process can be
described by a function $f: X_i \mapsto y_i$, such that $y_i = f(X_i) + \epsilon_i$,
where $\epsilon_i$ represents the irreducible error caused by measure-
ment errors during data collection, by wrong labeling in the
training data, or by poor input feature selection. Thus, the mean
squared error (MSE) between the actual labels and the labels
predicted by the model is given as
$$E(y_i - \hat{y}_i)^2 = E\big(f(X_i) + \epsilon_i - \hat{f}(X_i)\big)^2 = \underbrace{[f(X_i) - \hat{f}(X_i)]^2}_{\text{reducible error}} + \underbrace{\mathrm{Var}(\epsilon_i)}_{\text{irreducible error}}. \tag{2}$$
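As a quick numerical sanity check of (2), the sketch below (with an arbitrary choice of $f$, noise level, and a deliberately biased $\hat{f}$, all our own, not from the paper) confirms that the empirical MSE splits into the reducible and irreducible parts:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
f = np.sin(np.pi * x)                # assumed true function f
eps = rng.normal(0.0, 0.1, x.shape)  # irreducible noise, Var = 0.01
y = f + eps                          # data-generating process
y_hat = 0.9 * f                      # a deliberately biased model f-hat

mse = np.mean((y - y_hat) ** 2)
reducible = np.mean((f - y_hat) ** 2)
print(mse, reducible + 0.01)         # the two agree up to sampling noise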
The first term in (2) is model dependent and therefore rep-
resents the epistemic uncertainty, while the second term (the variance
of $\epsilon_i$) is the irreducible or aleatoric uncertainty. This variance
of $\epsilon_i$ is also known as the Bayes error, which is the
lowest possible prediction error that can be achieved with
any model. In the literature, $\epsilon$ is generally modeled as an inde-
pendent and identically distributed (i.i.d.) Normal distribution,
$\epsilon_i \sim \mathcal{N}(\mu_i, \sigma_i)$. To incorporate the aleatoric uncertainty in
an AI/ML model, the final layer can therefore be replaced
with a probabilistic layer, usually a normally distributed one
with a mean of $\mu$ and a standard deviation of $\sigma$. During the
training/testing phase, samples are drawn from this layer for
prediction and also for aleatoric uncertainty quantification¹.
The problem with this approach is figuring out how to learn
the parameters of this Normal distribution. This can be solved
by defining a new cost function, the negative log-likelihood (NLL),
which represents the loss between a distribution and the true
output label.
Minimizing the NLL is equivalent to maximizing the likelihood of
observing the data given a distribution with its parameters.
In NLL, the logarithmic probabilities associated with each
class are summed over the dataset. This closely resembles the
cross-entropy loss function, except that in cross entropy the last
classification activation is implicitly applied before taking the
logarithmic transformation, while in NLL this is not the case.
The NLL is given as
$$\mathrm{NLL} = -\log P(y_i \mid X_i; \mu, \sigma). \tag{3}$$
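Concretely, (3) can be implemented as a loss that consumes the distribution object emitted by the probabilistic output layer; a minimal sketch (the function name nll is our own):

def nll(y_true, y_pred_dist):
    # y_pred_dist is the distribution produced by the probabilistic
    # output layer; log_prob evaluates log P(y_i | X_i; mu, sigma).
    return -y_pred_dist.log_prob(y_true)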
With NLL as the cost function and the last layer as a proba-
bilistic layer, the aleatoric uncertainty can thus be modeled as
described in Algorithm 1. The independent normal layer can
be implemented in any modern ML package; in our case,
we used the TensorFlow Probability package [8] to model such
a layer.
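As a minimal sketch (the layer sizes and output dimensionality are our own placeholder choices, not taken from the paper), such a model can be written in TensorFlow Probability as:

import tensorflow as tf
import tensorflow_probability as tfp

# Regression network whose final layer is an independent Normal
# distribution; the preceding Dense layer emits exactly the number of
# parameters (means and scales) the probabilistic layer needs.
event_size = 1  # dimensionality of y_i (an assumption)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(tfp.layers.IndependentNormal.params_size(event_size)),
    tfp.layers.IndependentNormal(event_size),
])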
Algorithm 1 Method to measure aleatoric uncertainty
Input: $D(X_i, y_i)$; replace the output layer with a probabilistic
node $\mathcal{N}(\mu, \sigma)$; define the optimizer as RMSprop and set its
learning rate; set num_epochs for training.
Output: $\hat{y}_i$
1: for epoch = 1 to num_epochs do
2:   i) Calculate the loss and gradients using NLL (3) and the
     defined optimizer
3:   ii) Apply the gradients and update the weights
4:   iii) Monitor the loss and accuracy
5: end for
6: Determine the parameters $\mu$ and $\sigma$ from the output layer
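For illustration, Algorithm 1 maps onto a standard Keras training loop roughly as below, reusing the model and the nll loss sketched earlier; X_train, y_train, X_test, the learning rate, and num_epochs are placeholders of our own:

# Train with the NLL loss (3) and an RMSprop optimizer; fit() computes
# the loss and gradients, applies them, and reports the monitored loss
# and metrics each epoch, as in steps i)-iii) of Algorithm 1.
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-2),
              loss=nll)
model.fit(X_train, y_train, epochs=num_epochs)

# The learned parameters are read off the predicted output distribution.
dist = model(X_test)
mu, sigma = dist.mean(), dist.stddev()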
Once the mean and standard deviation are determined, we
can then easily compute the 95% confidence interval for
the training data or even for the test data.
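For a Normal output layer, this interval follows directly from the learned parameters, e.g., continuing the sketch above:

# Approximate 95% confidence interval per input, using mu +/- 1.96*sigma.
lower = mu - 1.96 * sigma
upper = mu + 1.96 * sigma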
For classification problems, the last layer can be modeled as a categorical
distribution, where there is a learned distribution for each class,
and based on the learned parameters for these distributions,
¹This assumes the chosen ML architecture is able to give high accuracy
before replacing the output layer.