
time predicted calibrated probabilities. For over a decade, calibration received only sporadic attention, until [4] empirically showed that modern features of neural networks (large capacity, batch normalization, etc.) have a detrimental effect on calibration. This work also introduced temperature scaling (sketched below), a uni-parametric model for multi-class problems, and showed that it is a maximal-entropy solution which efficiently recalibrates in many situations where one is interested in top-class calibration. Immediate extensions to affine maps via vector and matrix scaling can work better when calibration for multiple classes is required, but they tend to overfit. [8] showed that a generative model of class conditionals $X \mid Y$ following Dirichlet distributions is equivalent to matrix scaling, but provides a probabilistic interpretation.
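For concreteness, temperature scaling fits a single parameter $T > 0$ on a held-out set by minimizing the negative log-likelihood of temperature-scaled softmax outputs. The sketch below is only a minimal illustration, not the reference implementation of [4]; the function names and the bounded 1-D optimizer are our choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fit_temperature(val_logits, val_labels):
    """Fit T > 0 by minimizing the held-out negative log-likelihood."""
    def nll(log_t):
        p = softmax(val_logits, temperature=np.exp(log_t))
        return -np.mean(np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12))

    result = minimize_scalar(nll, bounds=(-3.0, 3.0), method="bounded")
    return float(np.exp(result.x))
```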
[9] empirically show that a variant of the histogram estimator of ECE which attempts to debias it, and which we use in our experiments, has better convergence than the standard one. They also introduce a hybrid parametric / non-parametric method with improved error guarantees with respect to scaling and histogram methods.
In the binary setting, [10] use beta distributions and show them to outperform logistic models in many situations, whereas [11] evaluate multiple methods for the scoring of loan applications and find non-parametric methods to outperform parametric ones. Finally, [12] introduce the calibration lens, which we extend here, highlight the pitfalls of empirical estimates of calibration error, and suggest hypothesis testing to overcome some of them.
III. MEASURING MISCALIBRATION
The best possible probabilistic classifier $c^*$ exactly reproduces the distribution of $Y \mid X$. With the vector notation introduced above: $P(Y \mid X) = c^*(X)$ a.s., and $c^*$ is maximally accurate as well as strongly calibrated.² For a fixed classifier $c$ and $C := c(X)$, the best one can do is to find a post-processing function $r_{\mathrm{id}} : \Delta^{K-1} \to \Delta^{K-1}$ which fulfills³

$$r_{\mathrm{id}}(\xi) = P(Y \mid C = \xi) \quad \text{a.s.} \tag{2}$$

The composition $r_{\mathrm{id}} \circ c$ is strongly calibrated (although not necessarily accurate), and $r_{\mathrm{id}} = \mathrm{id}$ for any $c$ which is already strongly calibrated. This optimal post-processing function is called the canonical calibration function, and it gives the best possible post-processing of a probabilistic classifier's outputs. The goal of any a posteriori recalibration algorithm is to approximate $r_{\mathrm{id}}$.
2. To see this, use the tower property: $P(Y = k \mid c^*(X)) = E[\mathbf{1}_{\{k\}}(Y) \mid c^*(X)] = E\big[E[\mathbf{1}_{\{k\}}(Y) \mid X] \mid c^*(X)\big] = E[P(Y = k \mid X) \mid P(Y \mid X)] = P(Y = k \mid X) = c^*_k(X)$.
3. Here, $P(Y \mid C = \xi)$ is a regular conditional probability of $Y$ given $C$, which exists e.g. for discrete $Y$ and continuous $C$. The notation $r_{\mathrm{id}}$ is for consistency with the notation introduced in Section IV for calibration lenses.
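As an illustration of what approximating $r_{\mathrm{id}}$ can look like in practice, the sketch below fits an affine map on log-probabilities on a held-out set, in the spirit of the matrix scaling discussed above. This is only a rough, assumed setup and not the method of any cited work: the function names are ours, and scikit-learn's default L2 regularization makes it differ from plain matrix scaling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_recalibration_map(val_probs, val_labels):
    """Fit an affine map on log-probabilities (matrix-scaling style)
    on held-out data, as a crude parametric stand-in for r_id."""
    log_p = np.log(np.clip(val_probs, 1e-12, None))
    model = LogisticRegression(max_iter=1000)
    return model.fit(log_p, val_labels)


def apply_recalibration_map(model, probs):
    """Map raw probability vectors C to recalibrated ones, approximating r_id(C)."""
    log_p = np.log(np.clip(probs, 1e-12, None))
    return model.predict_proba(log_p)
```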
A natural measure of miscalibration is the expected value of the distance between $r_{\mathrm{id}}$ and $\mathrm{id}$. Given any norm $\lVert\cdot\rVert$ over $\Delta^{K-1}$, one defines the expected strong calibration error (also canonical calibration error) as

$$\mathrm{ESCE}(c) := E_C\big[\lVert P(Y \mid C) - C\rVert\big] = E\big[\lVert r_{\mathrm{id}}(C) - C\rVert\big].$$
Unfortunately, computing ESCE requires an estimate of $r_{\mathrm{id}}$, and because the latter can be used to recalibrate the classifier, computing ESCE is as hard as recalibrating. Because of this difficulty, practical calibration metrics have to resort to some form of reduction. A common method is to condition on a 1-dimensional projection of $C$, thereby replacing the complicated estimation of a high-dimensional distribution with the much simpler estimation of a 1-dimensional one. The latter can be done e.g. with binning. A general framework for constructing such reductions was introduced by [12] with the concept of calibration lens; see Section IV. Two common examples are
expected (confidence) calibration error:

$$\mathrm{ECE}(c) := E_C\big[\big|P(Y = \operatorname{argmax} C \mid \max C) - \max C\big|\big], \tag{3}$$

which focuses on the top prediction, and class-wise ECE:⁴

$$\mathrm{cwECE}(c) := \frac{1}{K}\sum_{k=1}^{K} E_C\big[\big|P(Y = k \mid C_k) - C_k\big|\big], \tag{4}$$

which focuses on single classes. For each $k$ we also define $\mathrm{cwECE}_k(c) := E_C\big[\big|P(Y = k \mid C_k) - C_k\big|\big]$. A strongly calibrated classifier has vanishing ECE and cwECE (as well as all other reductions),⁵ but the converse is not true; see [12] for an example. Note that there exist alternative definitions of ECE and cwECE in the literature which condition on $C$ instead of $\max C$ or $C_k$; these are not the same as (3) and (4).
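In practice, (3) and (4) are estimated by binning the conditioning variable, as described above. The following is a minimal sketch of such plug-in histogram estimators, assuming equal-width bins, an array `probs` of predicted probability vectors, and integer `labels`; the bin count and function names are our choices, and the debiasing of [9] is omitted.

```python
import numpy as np


def binned_ece(probs, labels, n_bins=15):
    """Plug-in estimate of (3): bin the top confidences and compare
    per-bin accuracy of the top prediction with mean confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece


def binned_cwece(probs, labels, n_bins=15):
    """Plug-in estimate of (4): average the analogous binned error
    over classes, conditioning on C_k for each class k."""
    K = probs.shape[1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for k in range(K):
        ck = probs[:, k]
        yk = (labels == k).astype(float)
        idx = np.clip(np.digitize(ck, edges[1:-1]), 0, n_bins - 1)
        for b in range(n_bins):
            mask = idx == b
            if mask.any():
                total += mask.mean() * abs(yk[mask].mean() - ck[mask].mean())
    return total / K
```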
In practice we often encounter two important classes of classifiers:

Definition 1. Let $c : \mathcal{X} \to \Delta^{K-1}$ be a classifier, and set $\tilde{C} := \max C$ and $\tilde{Y} := \mathbf{1}_{\{\operatorname{argmax} C\}}(Y)$. We say that $c$ is almost always over- (resp. under-) confident if the set $U := \{P(\tilde{Y} = 1 \mid \tilde{C}) \le \tilde{C}\}$, resp. $U := \{P(\tilde{Y} = 1 \mid \tilde{C}) > \tilde{C}\}$, has $P(U) > 1 - \delta$ for some $0 < \delta \ll 1/2$.
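Definition 1 can be probed empirically with the same binning machinery: the sketch below approximates $P(\tilde{Y} = 1 \mid \tilde{C})$ by per-bin accuracies and adds up the probability mass of the bins on which the classifier appears overconfident. The binned approximation and all names are ours, so this is only an illustration of the definition, not a formal test.

```python
import numpy as np


def overconfident_mass(probs, labels, n_bins=15):
    """Estimate the mass of U = {P(Y~ = 1 | C~) <= C~} by replacing
    the conditional probability with per-bin accuracies."""
    conf = probs.max(axis=1)                                   # C~ = max C
    correct = (probs.argmax(axis=1) == labels).astype(float)   # Y~
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    mass = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any() and correct[mask].mean() <= conf[mask].mean():
            mass += mask.mean()                                # empirical P(bin)
    return mass  # "almost always overconfident" if this exceeds 1 - delta
```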
Empirically, neural networks are known to be overconfident,
making the following bounds of practical significance. They
show that to minimize ECE, it is usually enough to achieve
high accuracy. Intuitively, an accurate classifier simply does
not have much room to be overconfident, and if it is perfectly
accurate, it cannot be overconfident at all:
4. Following [8], we use a constant factor $1/K$, although it would seem more natural to use weights $1/P(Y = k)$ instead. Note also that cwECE is an example which is not induced by any calibration lens as introduced in Section IV.
5. To see this, use the tower property as in Footnote 2.