
well-calibrated “out of the box”, especially not in the presence of distribution shifts. While it is
common to post-process predictions to improve their calibration, this approach is not always feasible
as it requires additional data, has been observed to have limited success in settings of distribution shifts
(Desai and Durrett, 2020; Dan and Roth, 2021), and may come at the cost of sharpness (Kumar et al.,
2018).
The above discussion suggests that in the presence of model miscalibration, using the tail of the predictions to inform the confidence score could be beneficial.² Indeed, the literature on active learning (AL) has long considered uncertainty scores that employ the rest of the probability vector (Settles, 2009). In the AL context, these are used to greedily select examples, from a large pool of unlabeled examples, for which labels will be requested. E.g., it is common to use the gap between the first and second largest entries of $f(x)$, and in some cases even the entropy of $f(x)$, to inform this selection.
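For concreteness, the following is a minimal NumPy sketch (ours, not the paper's code) of the three scores discussed here: the max probability, the margin between the two largest entries of $f(x)$, and an entropy-based score. The function names and the normalization of the entropy into a confidence value are illustrative assumptions.

```python
import numpy as np

def max_confidence(p):
    """Confidence = the largest predicted probability, max_i f(x)_i."""
    return float(np.max(p))

def margin_confidence(p):
    """Confidence = gap between the largest and second-largest entries of f(x)."""
    top_two = np.sort(p)[-2:]          # [second largest, largest]
    return float(top_two[1] - top_two[0])

def entropy_confidence(p, eps=1e-12):
    """Entropy-based confidence: 1 - H(f(x)) / log(k), so higher means more confident.

    The normalization by log(k) is an illustrative choice, not the paper's definition.
    """
    p = np.clip(p, eps, 1.0)           # avoid log(0)
    entropy = -np.sum(p * np.log(p))
    return float(1.0 - entropy / np.log(len(p)))

# Example on a peaked 4-class prediction.
p = np.array([0.70, 0.20, 0.07, 0.03])
print(max_confidence(p), margin_confidence(p), entropy_confidence(p))
```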
Our contributions.
In this work, we consider such uncertainty scores in the context of confidence
measures, and perform a systematic evaluation of these measures in the presence of distribution shifts.
We focus on large pre-trained Transformer-based language models like BERT (Devlin et al., 2018)
for multi-class NLP tasks, which have been observed to be well-calibrated on in-distribution data (Desai
and Durrett, 2020). We use the Amazon reviews dataset from the WILDS benchmark (Koh et al.,
2021), in which the out-of-distribution (OOD) test set consists of a set of reviewers that is disjoint
from the training set and in-distribution (ID) validation set. We consider models trained on this task
with different objectives (regular risk minimization, but also approaches that are designed to handle
distribution shifts), and evaluate the different confidence measures on the OOD test data. Our key
findings are:
1. When the confidence measures are evaluated “out of the box” (with no further tuning based on a validation set), using $\max_i f(x)_i$ is highly sub-optimal. Margin-based confidence measures perform better for most of the models considered (and by a significant gap), and the entropy-based confidence measure is consistently better.
2. We derive a variant of temperature scaling (TS), a popular post-processing technique for improving calibration, and show that it can be used to consistently improve the calibration for all the confidence measures we consider (a sketch of standard TS is given after this list).
3. In the post-processing regime (namely, after applying TS), the entropy-based confidence measure Pareto dominates the max-based measure for most of the models (and is otherwise incomparable: it has a marginally larger calibration error but is sharper).
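For reference, the sketch below shows only standard temperature scaling, not the variant derived in this work: a single temperature $T$ is fit on held-out validation logits by minimizing the negative log-likelihood (here with a simple grid search, which is our assumption), and then applied to test-time logits before the softmax.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits divided by temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature T minimizing negative log-likelihood on validation data."""
    n = len(val_labels)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits, T)
        nll = -np.mean(np.log(probs[np.arange(n), val_labels] + 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# Usage sketch (val_logits: (n, k) array, val_labels: (n,) int array):
# T = fit_temperature(val_logits, val_labels)
# calibrated_probs = softmax(test_logits, T)
```

A grid search keeps the sketch dependency-free; in practice $T$ is often fit with a gradient-based optimizer on the validation NLL.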
Additional related work. Desai and Durrett (2020) evaluate the calibration of pre-trained Transformer models in both ID and OOD settings. Their results demonstrate that Transformer-based models tend to be well-calibrated ID but that the calibration error can increase significantly OOD. Dan and Roth (2021) empirically evaluate the relationship between scale and calibration, showing that OOD, smaller Transformer models tend to have worse calibration than larger models, even after applying TS. These works, together with earlier works (Guo et al., 2017), all conflate a model’s confidence with the probability of the predicted label. One recent exception is Taha et al. (2022), who consider confidence measures based on the margin and kurtosis of the logits. Their work differs significantly from ours, as they do not directly evaluate the calibration of the proposed measures and do not consider distribution shifts.
2 Preliminaries
Setup. We consider a multi-class classification problem with feature space $\mathcal{X}$ and label space $\mathcal{Y}$, where $|\mathcal{Y}| = k \geq 2$, and $\Delta(\mathcal{Y})$ denotes the simplex over $\mathbb{R}^k$. Let $P$ denote a joint distribution over $\mathcal{X} \times \mathcal{Y}$, and let $X, Y$ be random variables w.r.t. $P$. A classifier is a mapping $f : \mathcal{X} \to \Delta(\mathcal{Y})$, and its predicted label is $\arg\max_i f(x)_i$. We use $\mathrm{err}$ to denote the 0-1 error of $f$: $\mathrm{err}(x, y) = \mathbb{1}[\arg\max_i f(x)_i \neq y]$.
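These definitions translate directly into code; the following is a minimal NumPy sketch (ours) of the predicted label and the 0-1 error for a batch of probability vectors.

```python
import numpy as np

def predicted_label(probs):
    """Predicted label: arg max_i f(x)_i, computed row-wise for a batch."""
    return np.argmax(probs, axis=1)

def zero_one_error(probs, labels):
    """Average of err(x, y) = 1[arg max_i f(x)_i != y] over the batch."""
    return float(np.mean(predicted_label(probs) != labels))

# Two instances with k = 3 classes: the first prediction is correct, the second is not.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
labels = np.array([0, 2])
print(zero_one_error(probs, labels))  # 0.5
```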
² As one illustrative example, consider a 10-class classification task and the predictions on two instances: $f(x_1) = [0.9, 0.1, 0.0, \ldots, 0.0]$ and $f(x_2) = [0.9, 0.1/9, \ldots, 0.1/9]$. We might expect that the confidence on $x_1$ should be higher than on $x_2$: intuitively, for $x_1$ the model is “deliberating” between two concrete options (the first and second classes), whereas for $x_2$ there is no clear alternative to the first class. However, by definition, $\max_i f(x)_i$ can make no such distinctions.
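A quick numeric check of this example (our code; the values follow directly from the two vectors above): the max-based score is 0.9 for both instances, while the entropy of $f(x_2)$ is larger, so an entropy-based score does distinguish them.

```python
import numpy as np

f_x1 = np.array([0.9, 0.1] + [0.0] * 8)   # mass concentrated on two classes
f_x2 = np.array([0.9] + [0.1 / 9] * 9)    # remaining mass spread uniformly

for name, p in [("x1", f_x1), ("x2", f_x2)]:
    p_safe = np.clip(p, 1e-12, 1.0)        # avoid log(0)
    max_score = p.max()                    # max_i f(x)_i
    entropy = -np.sum(p_safe * np.log(p_safe))
    print(name, max_score, round(entropy, 3))
# x1: max 0.9, entropy ~0.33; x2: max 0.9, entropy ~0.54.
# The max-based score is identical for both, while entropy separates the two cases.
```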