
time predicted calibrated probabilities. For over a decade, calibration received only sporadic attention, until [4] empirically showed that modern features of neural networks (large capacity, batch normalization, etc.) have a detrimental effect on calibration. This work also introduced temperature scaling (sketched below), a uni-parametric model for multi-class problems, and showed that it is a maximal-entropy solution which efficiently recalibrates in many situations where one is interested in top-class calibration. Immediate extensions to affine maps via vector and matrix scaling can work better when calibration for multiple classes is required, but they tend to overfit. [8] showed that a generative model of class conditionals $X \mid Y$ following Dirichlet distributions is equivalent to matrix scaling, but provides a probabilistic interpretation.
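For concreteness, temperature scaling fits a single parameter $T > 0$ on a held-out set by minimizing the negative log-likelihood of temperature-scaled softmax outputs. The sketch below is only a minimal illustration, not the reference implementation of [4]; the function names and the bounded 1-D optimizer are our choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fit_temperature(val_logits, val_labels):
    """Fit T > 0 by minimizing the held-out negative log-likelihood."""
    def nll(log_t):
        p = softmax(val_logits, temperature=np.exp(log_t))
        return -np.mean(np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12))

    result = minimize_scalar(nll, bounds=(-3.0, 3.0), method="bounded")
    return float(np.exp(result.x))
```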
[9] empirically show that a variant of the histogram estimator of ECE which attempts to debias it, and which we use in our experiments, has better convergence than the standard one. They also introduce a hybrid parametric / non-parametric method with improved error guarantees with respect to scaling and histogram methods.
In the binary setting, [10] use beta distributions and show them to outperform logistic models in many situations, whereas [11] evaluate multiple methods for the scoring of loan applications and find non-parametric methods to outperform parametric ones. Finally, [12] introduce the calibration lens, which we extend here, highlight the pitfalls of empirical estimates of calibration error, and suggest hypothesis testing to overcome some of them.
III. MEASURING MISCALIBRATION
The best possible probabilistic classifier $c^*$ exactly reproduces the distribution of $Y \mid X$. With the vector notation introduced above: $P(Y \mid X) = c^*(X)$ a.s., and $c^*$ is maximally accurate as well as strongly calibrated.² For a fixed classifier $c$ and $C := c(X)$, the best one can do is to find a post-processing function $r_{\mathrm{id}} : \Delta^{K-1} \to \Delta^{K-1}$ which fulfills³

$$r_{\mathrm{id}}(\xi) = P(Y \mid C = \xi) \quad \text{a.s.} \tag{2}$$

The composition $r_{\mathrm{id}} \circ c$ is strongly calibrated (although not necessarily accurate), and $r_{\mathrm{id}} = \mathrm{id}$ for any $c$ which is already strongly calibrated. This optimal post-processing function is called the canonical calibration function, and it gives the best possible post-processing of a probabilistic classifier's outputs. The goal of any a posteriori recalibration algorithm is to approximate $r_{\mathrm{id}}$.
2. To see this, use the tower property: $P(Y = k \mid c^*(X)) = E[\mathbf{1}_{\{k\}}(Y) \mid c^*(X)] = E\big[E[\mathbf{1}_{\{k\}}(Y) \mid X] \mid c^*(X)\big] = E[P(Y = k \mid X) \mid P(Y \mid X)] = P(Y = k \mid X) = c^*_k(X)$.
3. Here, $P(Y \mid C = \xi)$ is a regular conditional probability of $Y$ given $C$, which exists e.g. for discrete $Y$ and continuous $C$. The notation $r_{\mathrm{id}}$ is for consistency with the notation introduced in Section IV for calibration lenses.
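As an illustration of what approximating $r_{\mathrm{id}}$ can look like in practice, the sketch below fits an affine map on log-probabilities on a held-out set, in the spirit of the matrix scaling discussed above. This is only a rough, assumed setup and not the method of any cited work: the function names are ours, and scikit-learn's default L2 regularization makes it differ from plain matrix scaling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_recalibration_map(val_probs, val_labels):
    """Fit an affine map on log-probabilities (matrix-scaling style)
    on held-out data, as a crude parametric stand-in for r_id."""
    log_p = np.log(np.clip(val_probs, 1e-12, None))
    model = LogisticRegression(max_iter=1000)
    return model.fit(log_p, val_labels)


def apply_recalibration_map(model, probs):
    """Map raw probability vectors C to recalibrated ones, approximating r_id(C)."""
    log_p = np.log(np.clip(probs, 1e-12, None))
    return model.predict_proba(log_p)
```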
A natural measure of miscalibration is the expected value of the distance between $r_{\mathrm{id}}$ and $\mathrm{id}$. Given any norm $\lVert\cdot\rVert$ over $\Delta^{K-1}$, one defines the expected strong calibration error (also canonical calibration error) as

$$\mathrm{ESCE}(c) := E_C\big[\lVert P(Y \mid C) - C\rVert\big] = E\big[\lVert r_{\mathrm{id}}(C) - C\rVert\big].$$
Unfortunately, computing ESCE requires an estimate of $r_{\mathrm{id}}$, and because the latter can be used to recalibrate the classifier, computing ESCE is as hard as recalibrating. Because of this difficulty, practical calibration metrics have to resort to some form of reduction. A common method is to condition on a 1-dimensional projection of $C$, thereby replacing the complicated estimation of a high-dimensional distribution with the much simpler estimation of a 1-dimensional one. The latter can be done e.g. with binning. A general framework for constructing such reductions was introduced by [12] with the concept of calibration lens; see Section IV. Two common examples are
expected (confidence) calibration error:

$$\mathrm{ECE}(c) := E_C\big[\big|P(Y = \operatorname{argmax} C \mid \max C) - \max C\big|\big], \tag{3}$$

which focuses on the top prediction, and class-wise ECE:⁴

$$\mathrm{cwECE}(c) := \frac{1}{K}\sum_{k=1}^{K} E_C\big[\big|P(Y = k \mid C_k) - C_k\big|\big], \tag{4}$$

which focuses on single classes. For each $k$ we also define $\mathrm{cwECE}_k(c) := E_C\big[\big|P(Y = k \mid C_k) - C_k\big|\big]$. A strongly calibrated classifier has vanishing ECE and cwECE (as well as all other reductions),⁵ but the converse is not true; see [12] for an example. Note that there exist alternative definitions of ECE and cwECE in the literature which condition on $C$ instead of $\max C$ or $C_k$; these are not the same as (3) and (4).
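In practice, (3) and (4) are estimated by binning the conditioning variable, as described above. The following is a minimal sketch of such plug-in histogram estimators, assuming equal-width bins, an array `probs` of predicted probability vectors, and integer `labels`; the bin count and function names are our choices, and the debiasing of [9] is omitted.

```python
import numpy as np


def binned_ece(probs, labels, n_bins=15):
    """Plug-in estimate of (3): bin the top confidences and compare
    per-bin accuracy of the top prediction with mean confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece


def binned_cwece(probs, labels, n_bins=15):
    """Plug-in estimate of (4): average the analogous binned error
    over classes, conditioning on C_k for each class k."""
    K = probs.shape[1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for k in range(K):
        ck = probs[:, k]
        yk = (labels == k).astype(float)
        idx = np.clip(np.digitize(ck, edges[1:-1]), 0, n_bins - 1)
        for b in range(n_bins):
            mask = idx == b
            if mask.any():
                total += mask.mean() * abs(yk[mask].mean() - ck[mask].mean())
    return total / K
```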
In practice we often encounter two important classes of classifiers:

Definition 1. Let $c : \mathcal{X} \to \Delta^{K-1}$ be a classifier, and set $\tilde{C} := \max C$ and $\tilde{Y} := \mathbf{1}_{\{\operatorname{argmax} C\}}(Y)$. We say that $c$ is almost always over- (resp. under-) confident if the set $U := \{P(\tilde{Y} = 1 \mid \tilde{C}) \le \tilde{C}\}$, resp. $U := \{P(\tilde{Y} = 1 \mid \tilde{C}) > \tilde{C}\}$, has $P(U) > 1 - \delta$ for some $0 < \delta \ll 1/2$.
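Definition 1 can be probed empirically with the same binning machinery: the sketch below approximates $P(\tilde{Y} = 1 \mid \tilde{C})$ by per-bin accuracies and adds up the probability mass of the bins on which the classifier appears overconfident. The binned approximation and all names are ours, so this is only an illustration of the definition, not a formal test.

```python
import numpy as np


def overconfident_mass(probs, labels, n_bins=15):
    """Estimate the mass of U = {P(Y~ = 1 | C~) <= C~} by replacing
    the conditional probability with per-bin accuracies."""
    conf = probs.max(axis=1)                                   # C~ = max C
    correct = (probs.argmax(axis=1) == labels).astype(float)   # Y~
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    mass = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any() and correct[mask].mean() <= conf[mask].mean():
            mass += mask.mean()                                # empirical P(bin)
    return mass  # "almost always overconfident" if this exceeds 1 - delta
```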
Empirically, neural networks are known to be overconfident,
making the following bounds of practical significance. They
show that to minimize ECE, it is usually enough to achieve
high accuracy. Intuitively, an accurate classifier simply does
not have much room to be overconfident, and if it is perfectly
accurate, it cannot be overconfident at all:
4. Following [8], we use a constant factor $1/K$, although it would seem more natural to use weights $1/P(Y = k)$ instead. Note also that cwECE is an example which is not induced by any calibration lens as introduced in Section IV.
5. To see this, use the tower property as in Footnote 2.