
well-calibrated “out of the box”, especially not in the presence of distribution shifts. While it is
common to post-process predictions to improve their calibration, this approach is not always feasible
as it requires additional data, has been observed to have limited success in settings of distribution shifts
(Desai and Durrett, 2020; Dan and Roth, 2021), and may come at the cost of sharpness (Kumar et al.,
2018).
The above discussion suggests that in the presence of model miscalibration, using the tail of the predictions to inform the confidence score could be beneficial.² Indeed, the literature on active learning (AL) has long considered uncertainty scores that employ the rest of the probability vector (Settles, 2009). In the AL context, these are used to greedily select examples, from a large pool of unlabeled examples, for which labels will be requested. E.g., it is common to use the gap between the first and second largest entries of $f(x)$, and in some cases even the entropy of $f(x)$, to inform this selection.
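For concreteness, the following is a minimal NumPy sketch (ours, not the paper's code) of the three scores discussed here: the max probability, the margin between the two largest entries of $f(x)$, and an entropy-based score. The function names and the normalization of the entropy into a confidence value are illustrative assumptions.

```python
import numpy as np

def max_confidence(p):
    """Confidence = the largest predicted probability, max_i f(x)_i."""
    return float(np.max(p))

def margin_confidence(p):
    """Confidence = gap between the largest and second-largest entries of f(x)."""
    top_two = np.sort(p)[-2:]          # [second largest, largest]
    return float(top_two[1] - top_two[0])

def entropy_confidence(p, eps=1e-12):
    """Entropy-based confidence: 1 - H(f(x)) / log(k), so higher means more confident.

    The normalization by log(k) is an illustrative choice, not the paper's definition.
    """
    p = np.clip(p, eps, 1.0)           # avoid log(0)
    entropy = -np.sum(p * np.log(p))
    return float(1.0 - entropy / np.log(len(p)))

# Example on a peaked 4-class prediction.
p = np.array([0.70, 0.20, 0.07, 0.03])
print(max_confidence(p), margin_confidence(p), entropy_confidence(p))
```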
Our contributions.
In this work, we consider such uncertainty scores in the context of confidence
measures, and perform a systematic evaluation of these measures in the presence of distribution shifts.
We focus on large pre-trained Transformer-based language models like BERT (Devlin et al., 2018)
for multi-class NLP tasks, which have been observed to be well-calibrated on in-distribution data (Desai
and Durrett, 2020). We use the Amazon reviews dataset from the WILDS benchmark (Koh et al.,
2021), in which the out-of-distribution (OOD) test set consists of a set of reviewers that is disjoint
from the training set and in-distribution (ID) validation set. We consider models trained on this task
with different objectives (regular risk minimization, but also approaches that are designed to handle
distribution shifts), and evaluate the different confidence measures on the OOD test data. Our key
findings are:
1. When the confidence measures are evaluated “out of the box” (with no further tuning based on a validation set), using $\max_i f(x)_i$ is highly sub-optimal. Margin-based confidence measures perform better for most of the models considered (and by a significant gap), and the entropy-based confidence measure is consistently better.
2. We derive a variant of temperature scaling (TS), a popular post-processing technique for improving calibration, and show that it can be used to consistently improve the calibration for all the confidence measures we consider (a sketch of standard TS is given after this list).
3. In the post-processing regime (namely, after applying TS), the entropy-based confidence measure Pareto dominates the max-based measure for most of the models (and is otherwise incomparable: it has a marginally larger calibration error but is sharper).
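For reference, the sketch below shows only standard temperature scaling, not the variant derived in this work: a single temperature $T$ is fit on held-out validation logits by minimizing the negative log-likelihood (here with a simple grid search, which is our assumption), and then applied to test-time logits before the softmax.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits divided by temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature T minimizing negative log-likelihood on validation data."""
    n = len(val_labels)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.mean(np.log(probs[np.arange(n), val_labels] + 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# Usage sketch (val_logits: (n, k) array, val_labels: (n,) int array):
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = softmax(test_logits, T)
```

A grid search keeps the sketch dependency-free; in practice $T$ is often fit with a gradient-based optimizer on the validation NLL.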
Additional related work. Desai and Durrett (2020) evaluate the calibration of pre-trained Transformer models in both ID and OOD settings. Their results demonstrate that Transformer-based models tend to be well-calibrated ID but that the calibration error can increase significantly OOD. Dan and Roth (2021) empirically evaluate the relationship between scale and calibration, showing that OOD, smaller Transformer models tend to have worse calibration than larger models, even after applying TS. These works, together with earlier works (Guo et al., 2017), all conflate a model’s confidence with the probability of the predicted label. One recent exception is Taha et al. (2022), who consider confidence measures based on the margin and kurtosis of the logits. Their work differs significantly from ours, as they do not directly evaluate the calibration of the proposed measures and do not consider distribution shifts.
2 Preliminaries
Setup. We consider a multi-class classification problem with feature space $\mathcal{X}$ and label space $\mathcal{Y}$, where $|\mathcal{Y}| = k \geq 2$, and $\Delta(\mathcal{Y})$ denotes the simplex over $\mathbb{R}^k$. Let $P$ denote a joint distribution over $\mathcal{X} \times \mathcal{Y}$, and let $X, Y$ be random variables w.r.t. $P$. A classifier is a mapping $f : \mathcal{X} \to \Delta(\mathcal{Y})$, and its predicted label is $\arg\max_i f(x)_i$. We use $\mathrm{err}$ to denote the 0-1 error of $f$: $\mathrm{err}(x, y) = \mathbb{1}[\arg\max_i f(x)_i \neq y]$.
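These definitions translate directly into code; the following is a minimal NumPy sketch (ours) of the predicted label and the 0-1 error for a batch of probability vectors.

```python
import numpy as np

def predicted_label(probs):
    """Predicted label: arg max_i f(x)_i, computed row-wise for a batch."""
    return np.argmax(probs, axis=1)

def zero_one_error(probs, labels):
    """Average of err(x, y) = 1[arg max_i f(x)_i != y] over the batch."""
    return float(np.mean(predicted_label(probs) != labels))

# Two instances with k = 3 classes: the first prediction is correct, the second is not.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
labels = np.array([0, 2])
print(zero_one_error(probs, labels))  # 0.5
```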
² As one illustrative example, consider a 10-class classification task and the predictions on two instances: $f(x_1) = [0.9, 0.1, 0.0, \ldots, 0.0]$ and $f(x_2) = [0.9, 0.1/9, \ldots, 0.1/9]$. We might expect that the confidence on $x_1$ should be higher than on $x_2$: intuitively, for $x_1$ the model is “deliberating” between two concrete options (the first and second classes), whereas for $x_2$ there is no clear alternative to the first class. However, by definition, $\max_i f(x)_i$ can make no such distinctions.
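A quick numeric check of this example (our code; the values follow directly from the two vectors above): the max-based score is 0.9 for both instances, while the entropy of $f(x_2)$ is larger, so an entropy-based score does distinguish them.

```python
import numpy as np

f_x1 = np.array([0.9, 0.1] + [0.0] * 8)   # mass concentrated on two classes
f_x2 = np.array([0.9] + [0.1 / 9] * 9)    # remaining mass spread uniformly

for name, p in [("x1", f_x1), ("x2", f_x2)]:
    p_safe = np.clip(p, 1e-12, 1.0)        # avoid log(0)
    max_score = p.max()                    # max_i f(x)_i
    entropy = -np.sum(p_safe * np.log(p_safe))
    print(name, max_score, round(entropy, 3))
# x1: max 0.9, entropy ~0.33; x2: max 0.9, entropy ~0.54.
# The max-based score is identical for both, while entropy separates the two cases.
```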