Useful Confidence Measures: Beyond the Max Score
Gal Yona
Weizmann Institute
gal.yona@gmail.com
Amir Feder
Columbia University
amirfeder@gmail.com
Itay Laish
Google
itaylaish@google.com

Work done while an intern at Google.
Workshop on Distribution Shifts, 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Abstract
An important component in deploying machine learning (ML) in safety-critical applications is having a reliable measure of confidence in the ML model’s predictions. For a classifier $f$ producing a probability vector $f(x)$ over the candidate classes, the confidence is typically taken to be $\max_i f(x)_i$. This approach is potentially limited, as it disregards the rest of the probability vector. In this work, we derive several confidence measures that depend on information beyond the maximum score, such as margin-based and entropy-based measures, and empirically evaluate their usefulness, focusing on NLP tasks with distribution shifts and Transformer-based models. We show that when models are evaluated on the out-of-distribution data “out of the box”, using only the maximum score to inform the confidence measure is highly suboptimal. In the post-processing regime (where the scores of $f$ can be improved using additional in-distribution held-out data), this remains true, albeit to a lesser extent. Overall, our results suggest that entropy-based confidence is a surprisingly useful measure.
1 Introduction
As machine learning (ML) is increasingly deployed in high-stakes decision-making applications, it becomes increasingly important that practitioners have access to a reliable measure of how confident the ML model is in its various predictions. This becomes especially crucial in settings where the predictions are made in conditions significantly different from the ones present during development.
In these cases, accuracy may unavoidably degrade, but a useful confidence measure can at least
ensure practitioners “know” when the ML model “doesn’t know”.
In this work we assume classifiers $f$ output a probability vector $f(x)$ over the candidate classes $\mathcal{Y}$ and treat confidence as a scalar quantity $c(f(x)) \in [0,1]$ that represents how confident $f$ is in its prediction, with scores near $1$ representing highly confident predictions. Intuitively, a good confidence measure should give rise to scores that correlate well with the accuracy of $f$. In Section 3 we show that this objective can be decomposed into two familiar terms from the literature on forecasting (Murphy, 1973; Dawid, 1982): a calibration error term (encouraging that whenever we output a confidence value of, e.g., 0.7, then on average the model makes a correct prediction 70% of the time) and a sharpness term (encouraging the confidence values to also be varied).
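To make these two terms concrete, here is a minimal sketch of one standard way to estimate them from a held-out set of (confidence, correctness) pairs, using equal-width confidence bins. The paper’s precise Section 3 definitions are not part of this excerpt, so the bin count, the ECE-style calibration term, and the resolution-style sharpness term below are illustrative assumptions rather than the authors’ exact formulas.

```python
import numpy as np

def calibration_and_sharpness(confidence, correct, n_bins=10):
    """Estimate a binned calibration error and a resolution-style sharpness term.

    confidence: array of c(f(x)) values in [0, 1].
    correct:    array of 0/1 indicators of whether f predicted correctly.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Assign each example to one of n_bins equal-width confidence bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)

    overall_acc = correct.mean()
    calib_err, sharpness = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        weight = mask.mean()                 # fraction of examples in bin b
        bin_conf = confidence[mask].mean()   # average reported confidence in the bin
        bin_acc = correct[mask].mean()       # empirical accuracy in the bin
        calib_err += weight * abs(bin_acc - bin_conf)        # ECE-style calibration term
        sharpness += weight * (bin_acc - overall_acc) ** 2   # resolution-style sharpness term
    return calib_err, sharpness
```

Under this sketch, a good confidence measure is one with a small calibration term and a large sharpness term.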
Given a classifier $f$, what should we choose as our confidence measure $c$? One natural choice is to use the maximum class probability, $c(f(x)) = \max_i f(x)_i$. When $f$ is itself calibrated, this will give rise to a calibrated confidence measure.
However, we don’t necessarily expect models to be well-calibrated “out of the box”, especially not in the presence of distribution shifts. While it is common to post-process predictions to improve their calibration, this approach is not always feasible as it requires additional data, has been observed to have limited success in settings of distribution shift (Desai and Durrett, 2020; Dan and Roth, 2021), and may come at the cost of sharpness (Kumar et al., 2018).
The above discussion suggests that in the presence of model miscalibration, using the tail of the predictions to inform the confidence score could be beneficial.² Indeed, the literature on active learning (AL) has long considered uncertainty scores that employ the rest of the probability vector (Settles, 2009). In the AL context, these are used to greedily select examples from a large pool of unlabeled examples for which labels will be requested. For example, it is common to use the gap between the first and second largest entries of $f(x)$, and in some cases even the entropy of $f(x)$, to inform this selection.
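For reference, the sketch below spells out the three families of measures discussed so far, each mapping a single probability vector to a scalar in $[0, 1]$. The specific normalization of the entropy-based score (dividing by $\log k$ and subtracting from 1) is an illustrative choice for this sketch and not necessarily the definition used in the paper.

```python
import numpy as np

def max_confidence(p):
    """Maximum class probability: c(f(x)) = max_i f(x)_i."""
    return float(np.max(p))

def margin_confidence(p):
    """Gap between the largest and second-largest entries of f(x)."""
    top2 = np.sort(p)[-2:]          # two largest entries, in ascending order
    return float(top2[1] - top2[0])

def entropy_confidence(p):
    """Illustrative normalization: 1 - H(f(x)) / log k, so uniform -> 0 and one-hot -> 1."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                   # skip zero entries to avoid log(0)
    h = -np.sum(nz * np.log(nz))
    return float(1.0 - h / np.log(len(p)))
```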
Our contributions. In this work, we consider such uncertainty scores in the context of confidence measures, and perform a systematic evaluation of these measures in the presence of distribution shifts. We focus on large pre-trained Transformer-based language models like BERT (Devlin et al., 2018) for multi-class NLP tasks, which have been observed to be well-calibrated on in-distribution data (Desai and Durrett, 2020). We use the Amazon reviews dataset from the WILDS benchmark (Koh et al., 2021), in which the out-of-distribution (OOD) test set consists of a set of reviewers that is disjoint from the training set and the in-distribution (ID) validation set. We consider models trained on this task with different objectives (regular risk minimization, but also approaches that are designed to handle distribution shifts), and evaluate the different confidence measures on the OOD test data. Our key findings are:
1. When the confidence measures are evaluated “out of the box” (with no further tuning based on a validation set), using $\max_i f(x)_i$ is highly sub-optimal. Margin-based confidence measures perform better for most of the models considered (and by a significant gap), and the entropy-based confidence measure is consistently better.
2. We derive a variant of temperature scaling (TS), a popular post-processing technique for improving calibration, and show that it can be used to consistently improve the calibration of all the confidence measures we consider (for background, a sketch of standard TS follows this list).
3. In the post-processing regime (namely, after applying TS), the entropy-based confidence measure Pareto-dominates the max-based measure for most of the models (and is otherwise incomparable: it has a marginally larger calibration error but is sharper).
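The TS variant referenced in finding 2 is derived later in the paper; as background, the following is a minimal sketch of standard temperature scaling, which fits a single temperature on held-out in-distribution logits by minimizing the negative log-likelihood and then rescales the probabilities, from which any of the confidence measures above can be computed. The bounded scalar optimizer and its search range are implementation choices for this sketch, not the authors’ method.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T > 0 on held-out in-distribution data
    by minimizing the negative log-likelihood of softmax(logits / T)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)

    def nll(t):
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)                       # stable log-softmax
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded")
    return result.x

def scaled_probabilities(logits, temperature):
    """Apply the fitted temperature and return the rescaled probability vectors."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)
```

In this sketch, the temperature would be fit on ID validation logits, and `scaled_probabilities` applied to OOD test logits before computing max-, margin-, or entropy-based confidence.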
Additional related work. Desai and Durrett (2020) evaluate the calibration of pre-trained Transformer models in both ID and OOD settings. Their results demonstrate that Transformer-based models tend to be well-calibrated ID but that calibration can degrade significantly OOD. Dan and Roth (2021) empirically evaluate the relationship between scale and calibration, showing that OOD, smaller Transformer models tend to have worse calibration than larger models, even after applying TS. These works, together with earlier works (Guo et al., 2017), all conflate a model’s confidence with the probability of the predicted label. One recent exception is Taha et al. (2022), who consider confidence measures based on the margin and kurtosis of the logits. Their work is significantly different from ours, as they do not directly evaluate the calibration of these proposed measures and also do not consider distribution shifts.
2 Preliminaries
Setup. We consider a multi-class classification problem with feature space $\mathcal{X}$ and label space $\mathcal{Y}$, where $|\mathcal{Y}| = k \geq 2$ and $\Delta(\mathcal{Y})$ denotes the simplex over $\mathbb{R}^k$. Let $P$ denote a joint distribution over $\mathcal{X} \times \mathcal{Y}$ and $X, Y$ random variables w.r.t. $P$. A classifier is a mapping $f : \mathcal{X} \to \Delta(\mathcal{Y})$ and its predicted label is $\arg\max_i f(x)_i$. We use $\mathrm{err}$ to denote the 0-1 error of $f$: $\mathrm{err}(x, y) = \mathbb{1}[\arg\max_i f(x)_i \neq y]$.
² As one illustrative example, consider a 10-class classification task and the predictions on two instances: $f(x_1) = [0.9, 0.1, 0.0, \ldots, 0.0]$ and $f(x_2) = [0.9, 0.1/9, \ldots, 0.1/9]$. We might expect that the confidence on $x_1$ should be higher than on $x_2$: intuitively, for $x_1$ the model is “deliberating” between two concrete options (the first and second classes) whereas for $x_2$ there is no clear alternative to the first class. However, by definition, $\max_i f(x)_i$ can make no such distinctions.
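To make the footnote’s point concrete, the short check below (a sketch; the $1 - H/\log k$ normalization is again an illustrative choice rather than the paper’s definition) evaluates the max score and an entropy-based confidence on the two vectors: the max score is 0.9 for both, while the entropy-based score is roughly 0.86 for $f(x_1)$ versus 0.76 for $f(x_2)$.

```python
import numpy as np

def entropy_confidence(p):
    # Illustrative entropy-based confidence: 1 - H(p) / log k.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return 1.0 - (-np.sum(nz * np.log(nz))) / np.log(len(p))

f_x1 = np.array([0.9, 0.1] + [0.0] * 8)   # residual mass on a single alternative class
f_x2 = np.array([0.9] + [0.1 / 9] * 9)    # residual mass spread over the other nine classes

print(f_x1.max(), f_x2.max())                              # 0.9 0.9  (indistinguishable)
print(entropy_confidence(f_x1), entropy_confidence(f_x2))  # ~0.86 vs ~0.76
```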