2 Background
2.1 Problem Statement
We formulate the problem in the paper as follows. A calibration technique $C$ is performed during training at each epoch, and a sample prioritization function $a$ is then used to select the most informative samples for training each subsequent epoch. We measure model calibration with Expected Calibration Error (ECE) [10], the bin-weighted average of the absolute difference between the model's accuracy and its average confidence.
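Concretely, following [10], predictions are grouped into $M$ equal-width confidence bins $B_m$ and $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)|$. A minimal NumPy sketch of this estimator (the function name and the 15-bin default are illustrative choices, not taken from the paper):

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=15):
        """Binned ECE: bin-weighted average of |accuracy - confidence|.

        confidences: (N,) max softmax probability per sample
        correct:     (N,) 1.0 if the prediction was correct, else 0.0
        """
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                acc = correct[in_bin].mean()        # accuracy within the bin
                conf = confidences[in_bin].mean()   # mean confidence within the bin
                ece += in_bin.mean() * abs(acc - conf)
        return ece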
The paper discusses how a calibration technique $C$, when coupled with a sample prioritization function $a$, affects the performance (accuracy and calibration error (ECE)) of the model. In addition, we examine whether this coupling can aid faster and more efficient training. We hypothesize a close relationship between calibration and sample prioritization during training, wherein the calibrated model probabilities at each epoch are used by a sample prioritization criterion to select the most informative samples for training each subsequent epoch.
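This coupling can be summarized by the following schematic loop; a minimal sketch in which calibrated_epoch stands for $C$, prioritize for $a$, and k for the per-epoch sample budget (all names are hypothetical placeholders, not the paper's interface):

    def train_with_prioritization(model, pool, calibrated_epoch, prioritize,
                                  k, num_epochs):
        """Couple calibration C with sample prioritization a across epochs."""
        subset = pool                                # the first epoch sees all samples
        for _ in range(num_epochs):
            model = calibrated_epoch(model, subset)  # C: one epoch of calibrated training
            probs = model.predict_proba(pool)        # calibrated softmax probabilities
            subset = prioritize(pool, probs, k)      # a: pick the k most informative samples
        return model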
2.2 Calibration
Calibration is a technique that curbs overconfident predictions in deep neural networks, so that the predicted (softmax) probabilities reflect true probabilities of correctness, i.e., better confidence estimates [1]. In this paper, we consider several prominently used calibration techniques that are performed during training.
Label Smoothing implicitly calibrates a model by discouraging overconfident prediction probabilities during training [9]. The one-hot encoded ground-truth labels $y_k$ are smoothed using a parameter $\alpha$, that is, $y_k^{LS} = y_k(1 - \alpha) + \alpha/K$, where $K$ is the number of classes. These smoothed targets $y_k^{LS}$ and the predicted outputs $p_k$ are then used to minimize the cross-entropy loss.
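As an illustration, a minimal NumPy sketch of the smoothing step and the resulting loss (function names are ours); for $K = 10$ and $\alpha = 0.1$, the correct class receives target $0.91$ and every other class $0.01$:

    import numpy as np

    def smooth_labels(y_onehot, alpha):
        """y_LS = y * (1 - alpha) + alpha / K for one-hot targets y."""
        K = y_onehot.shape[-1]                       # number of classes
        return y_onehot * (1.0 - alpha) + alpha / K

    def cross_entropy(p, targets, eps=1e-12):
        """Cross-entropy of predicted probabilities p against (smoothed) targets."""
        return -(targets * np.log(p + eps)).sum(axis=-1).mean()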
Mixup is a data augmentation method [14] which has been shown to yield well-calibrated predictive scores [13], and is likewise performed during training:

$\bar{x} = \lambda x_i + (1 - \lambda) x_j$
$\bar{y} = \lambda y_i + (1 - \lambda) y_j$

where $x_i$ and $x_j$ are two randomly sampled input data points, and $y_i$ and $y_j$ are their respective one-hot encoded labels. Here, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\lambda \in [0, 1]$.
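A minimal NumPy sketch; pairing samples by mixing the batch with a shuffled copy of itself is a common implementation choice assumed here, not prescribed by the paper:

    import numpy as np

    def mixup_batch(x, y_onehot, alpha, rng=None):
        """Return a convex combination of a batch with a shuffled copy of itself."""
        rng = rng or np.random.default_rng()
        lam = rng.beta(alpha, alpha)                 # lambda ~ Beta(alpha, alpha), in [0, 1]
        perm = rng.permutation(len(x))               # random pairing (x_i, x_j)
        x_mix = lam * x + (1.0 - lam) * x[perm]
        y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
        return x_mix, y_mix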
Focal Loss is an alternative loss function to cross-entropy which yields calibrated probabilities by minimizing a regularized KL divergence between the predicted and target distributions [8]:

$\mathcal{L}_{\mathrm{Focal}} = -(1 - p)^{\gamma} \log p$

where $p$ is the probability assigned by the model to the ground-truth correct class, and $\gamma$ is a hyperparameter. Compared with cross-entropy, Focal Loss has the added modulating factor $(1 - p)^{\gamma}$, which down-weights samples that are already predicted correctly with high probability, so the model is not pushed to drive correct-class probabilities toward 1. The predicted distribution therefore has higher entropy, which helps avoid overconfident predictions.
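A minimal NumPy sketch of the loss ($\gamma = 2$ is a commonly used value, assumed here); note that as $p \to 1$ the factor $(1 - p)^{\gamma}$ vanishes, so confidently correct samples contribute little to the loss:

    import numpy as np

    def focal_loss(p_true, gamma=2.0, eps=1e-12):
        """L_Focal = -(1 - p)^gamma * log(p), p = prob. of the ground-truth class."""
        p = np.clip(p_true, eps, 1.0)
        return ((1.0 - p) ** gamma * -np.log(p)).mean()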
2.3 Sample Prioritization
Sample prioritization is the process of selecting important samples during different stages of training to accelerate the training of a deep neural network without compromising performance. In this paper, we perform sample prioritization during training using Max Entropy, a de facto uncertainty sampling technique, to select the most informative samples at each epoch.
Max Entropy selects the most informative (top-$k$) samples, those that maximize the predictive entropy [12]:

$\mathbb{H}[y \mid x, \mathcal{D}_{\mathrm{train}}] := -\sum_{c} p(y = c \mid x, \mathcal{D}_{\mathrm{train}}) \log p(y = c \mid x, \mathcal{D}_{\mathrm{train}})$
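A minimal NumPy sketch of the top-$k$ selection (names are ours):

    import numpy as np

    def max_entropy_topk(probs, k):
        """Indices of the k samples with the highest predictive entropy.

        probs: (N, C) per-sample class probabilities p(y = c | x, D_train)
        """
        entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        return np.argsort(entropy)[-k:]              # top-k most uncertain samples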
2.4 Pre-trained Calibrated Target models
Pre-trained models have been widely used in the literature to obtain comprehensive sample representations before training on a downstream task [5]. We use a pre-trained calibrated model with larger capacity