Class-wise and reduced calibration methods
Michael Panchenko
appliedAI Institute gGmbH
m.panchenko@appliedai-institute.de
Anes Benmerzoug
appliedAI Initiative GmbH
a.benmerzoug@appliedai.de
Miguel de Benito Delgado†
appliedAI Institute gGmbH
m.debenito@appliedai-institute.de
Abstract—For many applications of probabilistic classifiers it is
important that the predicted confidence vectors reflect true proba-
bilities (one says that the classifier is calibrated). It has been shown
that common models fail to satisfy this property, making reliable
methods for measuring and improving calibration important tools.
Unfortunately, obtaining these is far from trivial for problems
with many classes. We propose two techniques that can be used in
tandem. First, a reduced calibration method transforms the orig-
inal problem into a simpler one. We prove for several notions of
calibration that solving the reduced problem minimizes the corre-
sponding notion of miscalibration in the full problem, allowing the
use of non-parametric recalibration methods that fail in higher
dimensions. Second, we propose class-wise calibration methods,
based on intuition building on a phenomenon called neural collapse
and the observation that most of the accurate classifiers found in
practice can be thought of as a union of K different functions
which can be recalibrated separately, one for each class. These
typically outperform their non-class-wise counterparts, especially
for classifiers trained on imbalanced data sets. Applying the two
methods together results in class-wise reduced calibration algo-
rithms, which are powerful tools for reducing the prediction and
per-class calibration errors. We demonstrate our methods on real
and synthetic datasets and release all code as open source in [2,3].
I. INTRODUCTION
Probabilistic classifiers predict confidence vectors from inputs.
Their performance is often evaluated only on the top predic-
tion(s), i.e. on the argmax of the confidences. However, for
many decision-making processes, the actual confidence vectors
can be relevant. In such cases it is important that confidences
are meaningful quantities which, ideally, approximate observed
probabilities.
Let $(X, Y)$ be random variables with $X \in \mathcal{X}$, and labels $Y \in \mathcal{Y} := \{1, \ldots, K\}$. For our purposes, a (trained) probabilistic classifier is a deterministic function $c: \mathcal{X} \to \Delta^{K-1}$, where $\Delta^{K-1} := \{x \in [0,1]^K : \sum_i x_i = 1\}$ is the $(K-1)$-dimensional simplex. The confidences $C := c(X)$ are a random variable with distribution induced by that of $X$. In what follows we will be mostly concerned with the distribution of $(Y, C)$ and will therefore often omit the dependency on $X$.
Submitted to the 21st IEEE International Conference on Machine Learning and Applications, ICMLA 2022. This article has been written using GNU TeXmacs [1].
†. Corresponding author.
Loosely speaking, we call a classifier calibrated if at test time the confidence vectors represent true probabilities. More precisely, a classifier $c$ is called strongly calibrated iff for every $k \in \{1, \ldots, K\}$, $P(Y = k \mid C) = C_k$ a.s. For brevity, we write instead^1

$$P(Y \mid C) = C \quad \text{a.s.} \qquad (1)$$
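To make (1) concrete, the following minimal sketch (Python/NumPy; an illustration of ours, not code from the paper) simulates a classifier that is strongly calibrated by construction: if labels are drawn as $Y \sim \mathrm{Cat}(C)$, then $P(Y = k \mid C) = C_k$ holds exactly, and the empirical frequency of $Y = k$ within any narrow confidence bin matches the average confidence in that bin.

import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 50_000
C = rng.dirichlet(np.ones(K), size=n)          # confidence vectors on the simplex
Y = np.array([rng.choice(K, p=c) for c in C])  # labels sampled from Cat(C) => calibrated

# Among samples whose class-0 confidence lies in a narrow bin, the frequency of
# Y = 0 should match the average class-0 confidence in that bin.
mask = (C[:, 0] > 0.65) & (C[:, 0] < 0.70)
print((Y[mask] == 0).mean(), C[mask, 0].mean())  # both close to 0.67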
Unfortunately, as shown in [4], many modern probabilistic
classifiers, despite being highly accurate, fail to be strongly cal-
ibrated. Given the importance of calibration for practical appli-
cations, it is desirable to accurately measure miscalibration as
well as to correct for it in some form, e.g. by recalibrating the
classifier during or after training (“post-processing”).
Contributions. Section III proves some elementary but useful
bounds for Expected Calibration Error (ECE), which, to our
knowledge and despite their simplicity and significance for prac-
titioners, have never been explicitly stated in the literature. Sec-
tion IV shows how recalibration of K-class problems can take
place in a much more sample efficient and also computationally
cheaper reduced setting while maintaining performance guar-
antees. Section VI connects calibration to a recently described
phenomenon called neural collapse [5], motivating the introduction
of a novel algorithm, which we call class-wise calibration and
which can extend any calibration algorithm.
Section VII benchmarks our proposed methods. We release a
library of calibration metrics and recalibration methods as open
source in [2]. Code to reproduce the experiments can be found
in [3].
II. PREVIOUS RELATED WORK
Calibration has a long history in forecasting, dating at least back
to the 1960s [6], and typically revolving around binary predic-
tions in meteorology, using methods like logistic regression to
recalibrate probabilities.
In the machine learning community, calibration has tradition-
ally played a lesser role. In the early 2000s, [7] observed that,
contrary to margin-based methods, the neural networks of the
1. For each $k \in \{1, \ldots, K\}$ the left hand side is the conditional probability of $Y = k$ given $C$, and it is the r.v. defined as $P(Y = k \mid C) := E[1_{\{k\}}(Y) \mid \sigma(C)]$.
time predicted calibrated probabilities. For over a decade, cal-
ibration received only sporadic attention, until [4] empirically
showed that modern features of neural networks (large capacity,
batch normalization, etc.) have a detrimental effect on calibra-
tion. This work also introduced temperature scaling, a uni-
parametric model for multi-class problems, and showed that it
is a maximal entropy solution which efficiently recalibrates in
many situations when one is interested in top-class calibration.
Immediate extensions to affine maps using vector and matrix
scaling can work better when calibration for multiple classes is
required but tend to overfit. [8] showed that a generative model of class-conditionals $X \mid Y$ following Dirichlet distributions is equivalent to matrix scaling, but provides a probabilistic interpretation.
[9] empirically show that a debiased variant of the histogram estimator of ECE, which we use in our experiments, has better convergence than the standard one. They also
introduce a hybrid parametric / non-parametric method with
improved error guarantees with respect to scaling and histogram
methods.
In the binary setting, [10] use beta distributions and show them
to outperform logistic models in many situations, whereas [11]
evaluate multiple methods for scoring of loan applications and
find non-parametric methods to outperform parametric ones.
Finally, [12] introduce the calibration lens upon which we
extend here, highlight the pitfalls of empirical estimates of cal-
ibration error and suggest hypothesis testing to overcome some
of them.
III. MEASURING MISCALIBRATION
The best possible probabilistic classifier $c^\star$ exactly reproduces the distribution of $Y \mid X$. With the vector notation introduced above: $P(Y \mid X) = c^\star(X)$ a.s., and $c^\star$ is maximally accurate as well as strongly calibrated.^2 For a fixed classifier $c$, and $C := c(X)$, the best one can do is to find a post-processing function $r_{\mathrm{id}}: \Delta^{K-1} \to \Delta^{K-1}$ which fulfills^3

$$r_{\mathrm{id}}(\xi) = P(Y \mid C = \xi) \quad \text{a.s.} \qquad (2)$$

The composition $r_{\mathrm{id}} \circ c$ is strongly calibrated (although not necessarily accurate) and $r_{\mathrm{id}} = \mathrm{id}$ for any $c$ which is already strongly calibrated. This optimal post-processing function is called the canonical calibration function and it gives the best possible post-processing of a probabilistic classifier's outputs. The goal of any a posteriori recalibration algorithm is to approximate $r_{\mathrm{id}}$.
2. To see this, use the tower property: $P(Y = k \mid c^\star(X)) = E[1_{\{k\}}(Y) \mid c^\star(X)] = E[E[1_{\{k\}}(Y) \mid X] \mid c^\star(X)] = E[P(Y = k \mid X) \mid P(Y \mid X)] = P(Y = k \mid X) = c^\star_k(X)$.
3. Here, $P(Y \mid C = \xi)$ is a regular conditional probability of $Y$ given $C$, which exists e.g. for discrete $Y$ and continuous $C$. The notation $r_{\mathrm{id}}$ is for consistency with the notation introduced in Section IV for calibration lenses.
A natural measure of miscalibration is the expected value of the distance between $r_{\mathrm{id}}$ and $\mathrm{id}$. Given any norm $\|\cdot\|$ over $\Delta^{K-1}$, one defines the expected strong calibration error (also canonical calibration error) as:

$$\mathrm{ESCE}(c) := E_C[\|P(Y \mid C) - C\|] = E[\|r_{\mathrm{id}}(C) - C\|].$$
Unfortunately, computing ESCE requires an estimate of $r_{\mathrm{id}}$,
and because the latter can be used to recalibrate the classifier,
computing ESCE is as hard as recalibrating. Because of this
difficulty, practical calibration metrics have to resort to some
form of reduction. A common method is to condition on a 1-
dimensional projection of C, thereby replacing the complicated
estimation of a high-dimensional distribution with the much
simpler estimation of a 1-dimensional one. The latter can be
done e.g. with binning. A general framework for constructing
such reductions was introduced by [12] with the concept of
calibration lens, see Section IV. Two common examples are
expected (confidence) calibration error:

$$\mathrm{ECE}(c) := E_C[\,|P(Y = \operatorname{argmax}(C) \mid \max C) - \max C|\,], \qquad (3)$$

which focuses on the top prediction, and class-wise ECE:^4

$$\mathrm{cwECE}(c) := \frac{1}{K} \sum_{k=1}^{K} E_C[\,|P(Y = k \mid C_k) - C_k|\,], \qquad (4)$$

which focuses on single classes. For each $k$ we also define $\mathrm{cwECE}_k(c) := E_C[\,|P(Y = k \mid C_k) - C_k|\,]$. A strongly calibrated classifier has vanishing ECE and cwECE (as well as all other reductions),^5 but the converse is not true, see [12] for an example. Note that there exist alternative definitions of ECE and cwECE in the literature which condition on $C$ instead of $\max C$ or $C_k$, which are not the same as (3) and (4).
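For reference, (3) and (4) can be estimated with simple equal-width binning as in the following sketch (illustrative helpers of ours; the experiments in Section VII use the debiased estimator of [9], which differs from this naive version).

import numpy as np

def ece_binned(confidences, labels, n_bins=15):
    # Estimate ECE as in (3): condition on max C via equal-width binning.
    top = confidences.max(axis=1)
    correct = (confidences.argmax(axis=1) == labels).astype(float)
    bins = np.minimum((top * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            # bin mass times |empirical accuracy - mean confidence| within the bin
            err += m.mean() * abs(correct[m].mean() - top[m].mean())
    return err

def cw_ece_binned(confidences, labels, n_bins=15):
    # Estimate cwECE as in (4): condition on C_k via binning, average over classes.
    K = confidences.shape[1]
    total = 0.0
    for k in range(K):
        ck = confidences[:, k]
        is_k = (labels == k).astype(float)
        bins = np.minimum((ck * n_bins).astype(int), n_bins - 1)
        for b in range(n_bins):
            m = bins == b
            if m.any():
                total += m.mean() * abs(is_k[m].mean() - ck[m].mean())
    return total / K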
In practice we often encounter two important classes of classi-
fiers:
Definition 1. Let $c: \mathcal{X} \to \Delta^{K-1}$ be a classifier and $\tilde{C} := \max C$, $\tilde{Y} := 1_{\{\operatorname{argmax} C\}}(Y)$. We say that $c$ is almost always over- (resp. under-) confident if the set $U := \{P(\tilde{Y} = 1 \mid \tilde{C}) \le \tilde{C}\}$, resp. $U := \{P(\tilde{Y} = 1 \mid \tilde{C}) > \tilde{C}\}$, has $P(U) \ge 1 - \delta$ for some $0 < \delta \ll 1/2$.
Empirically, neural networks are known to be overconfident,
making the following bounds of practical significance. They
show that to minimize ECE, it is usually enough to achieve
high accuracy. Intuitively, an accurate classifier simply does
not have much room to be overconfident, and if it is perfectly
accurate, it cannot be overconfident at all:
4. Following [8], we use a constant factor $1/K$, although it would seem more natural to use weights $1/P(Y = k)$ instead. Note also that cwECE is an example which is not induced by any calibration lens as introduced in Section IV.
5. To see this, use the tower property as in Footnote 2.
Lemma 2. Let $\operatorname{acc}_U(c)$ be the accuracy of $c$ over the set $U$ from Definition 1:

1. If $c$ is almost always overconfident, then $\mathrm{ECE}(c) \le 1 - \operatorname{acc}_U(c)$.
2. If $c$ is almost always overconfident for a fixed class $k$, then $\mathrm{cwECE}_k(c) \le 1 - P(Y = k \mid U)$.
3. If $c$ is almost always under-confident for class $k$, then $\mathrm{ECE}(c) \le \operatorname{acc}_U(c)$ and $\mathrm{cwECE}_k(c) \le P(Y = k \mid U)$.

Proof. 1. On the set $U$ we can use the linearity of the expectation, and on $U^c$ we can bound the integrand by 1 to obtain: $\mathrm{ECE}(c) \le E[(\tilde{C} - P(\tilde{Y} = 1 \mid \tilde{C}))\, 1_U] + \delta = E[\tilde{C}\, 1_U] - P(\tilde{Y} = 1 \mid U) + \delta \le 1 - \delta - \operatorname{acc}_U(c) + \delta$.
2. Analogously: $\mathrm{cwECE}_k(c) \le E[C_k\, 1_U] - E[P(Y = k \mid C_k)\, 1_U] + \delta \le 1 - P(Y = k \mid U)$.
3. Swap the terms in the previous computation.
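As a quick numerical illustration of Lemma 2.1 (a sketch of ours, not an experiment from the paper), one can sharpen the confidences of a synthetically calibrated classifier: the result is overconfident on essentially all of the input space, so $U$ is (almost) everything, $\operatorname{acc}_U(c)$ is the plain accuracy, and the binned ECE stays below $1 - \operatorname{acc}_U(c)$.

import numpy as np

rng = np.random.default_rng(1)
K, n, n_bins = 5, 50_000, 15
C_cal = rng.dirichlet(np.ones(K), size=n)
Y = np.array([rng.choice(K, p=c) for c in C_cal])         # labels from Cat(C) => calibrated
C = C_cal ** 3 / (C_cal ** 3).sum(axis=1, keepdims=True)  # sharpened => overconfident

top, correct = C.max(axis=1), C.argmax(axis=1) == Y
bins = np.minimum((top * n_bins).astype(int), n_bins - 1)
ece = sum((bins == b).mean() * abs(correct[bins == b].mean() - top[bins == b].mean())
          for b in range(n_bins) if (bins == b).any())
print(f"ECE = {ece:.3f} <= 1 - acc = {1 - correct.mean():.3f}")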
IV. CALIBRATION IN A REDUCED SETTING
As a formalization of the process of focusing on specific aspects of calibration, [12] introduce the calibration lens. For our purposes, this is a map $\varphi: \mathcal{Y} \times \Delta^{K-1} \to [m] \times \Delta^{m-1}$, with $[m] := \{0, \ldots, m\}$, generating a reduced problem such that $\varphi: (y, c) \mapsto (\tilde{y}, \tilde{c})$ with $\tilde{y} = \varphi_y(y, c)$ and $\tilde{c} = \varphi_c(c)$ fulfills $\tilde{c}_i = P_{y \sim \mathrm{Cat}(c)}(\tilde{y} = i)$. One of the strongest meaningful reductions one can make is the confidence lens:

$$\varphi_{\mathrm{conf}}(y, c) := (1_{\{\operatorname{argmax} c\}}(y), \max c) \in \{0, 1\} \times [0, 1],$$

which reduces a $K$-class problem into a binary one.^6 The (reduced) canonical calibration function for this new problem is $r_{\varphi_{\mathrm{conf}}}(\xi) = P(\tilde{Y} = 1 \mid \tilde{C} = \xi) = P(Y = \operatorname{argmax} C \mid \max C = \xi)$, and the strong calibration error for the induced problem equals the ECE of the original problem. The induced strong calibration error vanishes when $r_{\varphi_{\mathrm{conf}}} = \mathrm{id}$. See Appendix B for more examples of lenses and their properties.
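In code, the confidence lens is a one-liner per component (a hypothetical helper of ours, not the API of the released library [2]): each sample is mapped to a binary label indicating whether the top prediction was correct, together with the top confidence.

import numpy as np

def confidence_lens(labels, confidences):
    # phi_conf: (y, c) -> (1_{argmax c}(y), max c), turning a K-class problem into a binary one.
    y_red = (labels == confidences.argmax(axis=1)).astype(int)  # \tilde{y} in {0, 1}
    c_red = confidences.max(axis=1)                             # \tilde{c} in [0, 1]
    return y_red, c_red

Recalibrating the pair (y_red, c_red) with any binary method then targets exactly the ECE of the original problem, as noted above.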
More generally, for any calibration lens $\varphi = (\varphi_y, \varphi_c)$, one has an associated canonical calibration function $r_\varphi(\xi) := P(\varphi_y \mid \varphi_c = \xi)$ and an error based on the distance from $r_\varphi$ to the identity:

$$\mathrm{E}\varphi\mathrm{E}_p(c) := E_C[\|P(\varphi_y(Y, C) \mid \varphi_c(C)) - \varphi_c(C)\|_p],$$

for any $p$-norm, with $1 \le p \le \infty$ (in the sequel we fix some value of $p$ and omit the subindex). If $\varphi = \mathrm{id}$ then $\mathrm{E}\varphi\mathrm{E} = \mathrm{ESCE}$.
6. Several past works make implicit use of the reduced construction $(\tilde{Y}, \tilde{C})$, e.g. [9] by calibrating one-vs-all models.
The main result of this section shows that if, starting from a recalibration function $\tilde{r}$ for the reduced problem, one can construct another recalibration function $r$ fulfilling some mild conditions, then strong calibration error guarantees for the reduced problem translate to error guarantees for the original problem in terms of $\mathrm{E}\varphi\mathrm{E}$. Because the reduced problems are designed to be of lower dimension, calibration methods typically perform better, improving calibration for the original problem with higher sample efficiency and reduced computational cost. Especially non-parametric methods, which can easily underperform in higher dimensions, benefit from reduced calibration.^7
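As a concrete example of such a method, the following sketch fits a histogram-binning recalibrator $\tilde{r}$ on the reduced binary problem produced by the confidence lens (an illustration with hypothetical helper names, not the implementation released in [2]); in one dimension this non-parametric estimator is cheap and sample-efficient.

import numpy as np

def fit_histogram_binning(c_red, y_red, n_bins=15):
    # Fit \tilde{r}: [0,1] -> [0,1] mapping each confidence bin to its empirical accuracy.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(c_red, edges) - 1, 0, n_bins - 1)
    bin_acc = np.array([
        y_red[bins == b].mean() if (bins == b).any() else (edges[b] + edges[b + 1]) / 2
        for b in range(n_bins)
    ])

    def r_tilde(c):
        b = np.clip(np.digitize(c, edges) - 1, 0, n_bins - 1)
        return bin_acc[b]

    return r_tilde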
[Figure 1: commutative square with horizontal maps $\varphi_c: \Delta^{K-1} \to \Delta^{m-1}$ on top and bottom, and vertical maps $r$ on the left and $\tilde{r}$ on the right.]
Figure 1. Reduced calibration $r$ from $\varphi_c$ and $\tilde{r}$: if the diagram is commutative with high probability, then ESCE for the reduced problem transfers to $\mathrm{E}\varphi\mathrm{E}$ for the original one.
The key intuition about the construction $\tilde{r} \mapsto r$ is that it provides a right inverse for $\varphi_c$ on most of the predictions, in the sense that $\varphi_c \circ r = \tilde{r} \circ \varphi_c$ holds with high probability, see Figure 1. The extent to which this happens determines how successfully one can lift calibration from a reduced problem to the original one:
Lemma 3. Let $\varphi$ be a calibration lens with $\varphi(Y, C) \in [m] \times \Delta^{m-1}$. Fix $(\tilde{Y}, \tilde{C}) = \varphi(Y, C)$ and assume one has calibration functions $\tilde{r}: \Delta^{m-1} \to \Delta^{m-1}$ and $r: \Delta^{K-1} \to \Delta^{K-1}$ fulfilling: (1) there exists $\delta > 0$ such that the set $U := \{\varphi_c \circ r = \tilde{r} \circ \varphi_c\}$ has $P(U) \ge 1 - \delta$; and (2) $\tilde{r}$ is "almost calibrated" in the sense that $\|P(\tilde{Y} \mid \tilde{r}(\tilde{C})) - \tilde{r}(\tilde{C})\|_U \le \varepsilon$ for some $\varepsilon > 0$. Then:

$$\mathrm{E}\varphi\mathrm{E}(r \circ c) \le \varepsilon + \delta.$$

If, in particular, $P(U) = 1$, then $\mathrm{E}\varphi\mathrm{E}(r \circ c) = \mathrm{ESCE}(\tilde{r} \circ \tilde{c})$.
The final observation is of practical relevance for parametric $\tilde{r}$, for which it is often possible to compute $U$ exactly. E.g. for temperature scaling $\tilde{r}_T(c) = \sigma(\log(c)/T)$, if $T \ge 1$ then $\tilde{r}_T([0, 1]) \subseteq [1/K, 1] \Rightarrow P(U) = 1$ (see Corollary 4).
Proof. (of Lemma 3) Define $Z := P(\tilde{Y} \mid \varphi_c(r(C))) - \varphi_c(r(C))$. By the construction of $U$, one has $\|Z 1_U\| = \|P(\tilde{Y} \mid \tilde{r}(\tilde{C})) - \tilde{r}(\tilde{C})\|_U \le \varepsilon$, and over $U^c$, $\|Z 1_{U^c}\| \le \|Z\|\, P(U^c) \le \delta$. Consequently:

$$\mathrm{E}\varphi\mathrm{E}(r \circ c) = E[\|P(\tilde{Y} \mid \varphi_c(r(C))) - \varphi_c(r(C))\|] = E[\|Z 1_U + Z 1_{U^c}\|] \le \varepsilon + \delta,$$

as desired. If $\delta = 0$, then $\mathrm{E}\varphi\mathrm{E}(r \circ c) = E[\|Z\|] = E[\|P(\tilde{Y} \mid \tilde{r}(\tilde{C})) - \tilde{r}(\tilde{C})\|] = \mathrm{ESCE}(\tilde{r} \circ \tilde{c})$.
7. A notable exception where reducing the problem may go wrong is temperature scaling. As we will show experimentally in Section VII, for certain 1-dim. calibration problems temperature scaling fails to give a good approximation.