Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition

Sebastian G. Gruber
German Cancer Research Center (DKFZ)
German Cancer Consortium (DKTK)
Goethe University Frankfurt, Germany
sebastian.gruber@dkfz.de

Florian Buettner
German Cancer Research Center (DKFZ)
German Cancer Consortium (DKTK)
Frankfurt Cancer Institute, Germany
Goethe University Frankfurt, Germany
florian.buettner@dkfz.de
Abstract
Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift and restricted to classification. Alternatively, proper scores can be used for most predictive tasks, but a bias-variance decomposition for model uncertainty does not exist in the current literature. In this work we introduce a general bias-variance decomposition for strictly proper scores, giving rise to the Bregman Information as the variance term. We discover how exponential families and the classification log-likelihood are special cases and provide novel formulations. Surprisingly, we can express the classification case purely in the logit space. We showcase the practical relevance of this decomposition on several downstream tasks, including model ensembles and confidence regions. Further, we demonstrate how different approximations of the instance-level Bregman Information allow out-of-distribution detection for all degrees of domain drift.
1 INTRODUCTION
A core principle behind the success of modern Machine and Deep Learning approaches is the use of loss functions, which are applied to optimize and compare the goodness-of-fit of predictive models. Typical loss functions, such as the Brier score or the negative log-likelihood, capture not only predictive power (in the sense of accuracy) but also predictive uncertainty. The latter is particularly relevant in sensitive forecasting domains, such as cancer diagnostics (Haggenmüller et al., 2021), genotype-based disease prediction (Katsaouni et al., 2021), or climate prediction (Yen et al., 2019).
Proper scores are commonly used as loss functions for probabilistic modelling, since their defining criterion is to assign the best value to the target distribution as a prediction. Consequently, they are widely applicable, from quantile regression (Gneiting & Raftery, 2007) to generative models (Song et al., 2021). They are a generalization of the log-likelihood and also cover exponential families (Grünwald & Dawid, 2004). However, for such loss functions, it is not clear how we can decompose them such that a component capturing predictive uncertainty arises.

Consequently, predictive uncertainty is typically only considered as the variance of predictions or, in classification, via the confidence score associated with the top-label prediction. Such confidence scores capture the predictive uncertainty well if they are calibrated, namely if the confidence of a prediction matches its true likelihood (Guo et al., 2017). However, the calibration error of these confidence scores typically increases under domain drift, making them an unreliable measure for predictive uncertainty in many real-world applications (Ovadia et al., 2019; Tomani & Buettner, 2021).
In this work, we discover the Bregman Information as a natural replacement of model variance via a bias-variance decomposition for strictly proper scores. The Bregman Information generalizes the variance of a random variable via a closed-form definition based on a generating function (Banerjee et al., 2005). In the case of our decomposition, this generating function is a convex conjugate directly associated with the respective proper score. The source code for the experiments is openly accessible at https://github.com/MLO-lab/Uncertainty_Estimates_via_BVD.

Figure 1: Accuracy after discarding test instances with high levels of uncertainty. We can discard fewer samples to reach better accuracy when using Bregman Information as the uncertainty measure. For example, to achieve the validation set accuracy (dotted line) for severely corrupted data, we only have to discard 7% of the most uncertain in-domain samples, contrary to 14% when using the confidence score. The standard deviation bounds stem from different types of corruption.

We summarize our contributions in the following:
• In Section 3, we generalize relevant properties to functional Bregman divergences, which allows for deriving a bias-variance decomposition for strictly proper scores. Via Bregman Information, we give novel formulations for decompositions of exponential families and the classification log-likelihood in the logit space.

• We generalize the law of total variance to Bregman Information and show how ensemble predictions marginalize out a specific source of uncertainty in Section 3.5. We also propose a general way to give confidence regions for predictions in Section 3.6.

• We showcase experiments on how typical classifiers differ in their Bregman Information in Section 4. There, we demonstrate that the Bregman Information can be a more meaningful measure of out-of-domain uncertainty compared to the confidence score in the case of corrupted CIFAR-10 and ImageNet (Figure 1 and Algorithm 1).
2 BACKGROUND
In this section, we first give a basic introduction to Bregman divergences and Bregman Information. We specifically mention recent developments for functional Bregman divergences, since we will require and provide generalizations on this topic. This is followed by an introduction to the basic concepts of proper scores and exponential families, which are related to Bregman divergences. Finally, we discuss other proposed bias-variance decompositions in the literature to put our contribution into perspective.
2.1 Bregman Divergences and Bregman Information
Bregman divergences are a class of divergences occurring
in a wide range of applications (Bregman, 1967; Banerjee
et al., 2005; Frigyik et al., 2006; Si et al., 2009; Gupta et al.,
2022). We use the following definition.
Definition 2.1 (Bregman (1967)). Let $\phi\colon U \to \mathbb{R}$ be a differentiable, convex function with $U \subseteq \mathbb{R}^d$. The Bregman divergence generated by $\phi$ of $x, y \in U$ is defined as
$$d_\phi(x, y) = \phi(y) - \phi(x) - \langle \nabla\phi(x), y - x \rangle.$$
It can be interpreted geometrically as the difference between $\phi$ and the supporting hyperplane of $\phi$ at $x$, evaluated at $y$. We have $d_\phi(x, y) \geq 0$ with $d_\phi(x, y) = 0$ if $x = y$.
By definition, we can use Bregman divergences for scalar and vector inputs. But, in the infinite-dimensional case, for example when dealing with a continuous distribution space $\mathcal{P}$, the gradient vector and the inner product are not defined anymore. As a solution to this, Frigyik et al. (2006) introduce functional Bregman divergences by replacing the inner product term with the Fréchet derivative. The authors showed that the functional case generalizes the standard case. Since some relevant functions are not Fréchet differentiable, Ovcharov (2018) offers an alternative approach to define the functional case based on subgradients. In the context of dual vector spaces with pairing $\langle \cdot, \cdot \rangle$, a subgradient $x^*$ at point $x \in U$ of a convex function $\phi\colon U \to \mathbb{R}$ fulfills the property $\phi(y) \geq \phi(x) + \langle x^*, y - x \rangle$ for all $y \in U$. A function $\phi'(x)$ which maps to a subgradient of $\phi$ at $x$ for all $x \in U$ is called a selection of subgradients, or, if it is unambiguous in the context, just subgradient of $\phi$. In general, subgradients are not unique, unlike gradients. Ovcharov (2018) proposes to use the vector space $L(\mathcal{P})$ of $\mathcal{P}$-integrable functions and the vector space $\operatorname{span}\mathcal{P}$ of finite linear combinations of elements from $\mathcal{P}$. These spaces are dual with the pairing $\langle \cdot, \cdot \rangle$ defined as $\langle f, P \rangle = \int f \,\mathrm{d}P$ for $f \in L(\mathcal{P})$ and $P \in \operatorname{span}\mathcal{P}$. They proceed to define the functional Bregman divergence generated by $\phi, \phi'$ as $d_{\phi,\phi'}(x, y) = \phi(y) - \phi(x) - \phi'(x)(y - x)$. We will also use this definition for general vector spaces as long as a subgradient is defined. In Section 3, we encounter the case when a subgradient $\phi'_r$ of $\phi$ is only defined on a smaller domain $V \subseteq U$. Then, we refer to $d_{\phi,\phi'_r}\colon V \times U \to \mathbb{R}$ as a restricted functional Bregman divergence.
Figure 2: Illustration of the Bregman Information generated by the softplus function $\sigma_+(x) = \ln(1 + e^x)$ of a binary random variable.

The convex conjugate $\phi^*$ of a function $\phi$ is defined as $\phi^*(x^*) = \sup_{y}\{\langle x^*, y\rangle - \phi(y)\}$ in the context of dual vector spaces (Zalinescu, 2002). If $\phi$ is differentiable and strictly convex, then $(\nabla\phi)^{-1} = \nabla\phi^*$ and $\phi^{**} = \phi$. For this case, Banerjee et al. (2005) give the important fact that $d_\phi(x, y) = d_{\phi^*}(\nabla\phi(y), \nabla\phi(x))$. That is, by using the convex conjugate, we can flip the arguments in a Bregman divergence. To derive our main contribution, we will state a generalization of this property to functional Bregman divergences in Lemma 3.1.
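As a concrete finite-dimensional illustration of this argument flip (a sketch of our own, not code from the paper), the following snippet checks $d_\phi(x, y) = d_{\phi^*}(\nabla\phi(y), \nabla\phi(x))$ for the softplus function, whose convex conjugate is the negative binary entropy:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):  # gradient of softplus
    return 1.0 / (1.0 + np.exp(-x))

def neg_binary_entropy(p):  # convex conjugate of softplus on (0, 1)
    return p * np.log(p) + (1.0 - p) * np.log(1.0 - p)

def logit(p):  # gradient of the conjugate, i.e. the inverse of sigmoid
    return np.log(p / (1.0 - p))

def bregman(phi, grad_phi, x, y):
    # d_phi(x, y) = phi(y) - phi(x) - <grad phi(x), y - x>  (Definition 2.1)
    return phi(y) - phi(x) - grad_phi(x) * (y - x)

x, y = 0.3, -1.2
lhs = bregman(softplus, sigmoid, x, y)
# same divergence with flipped arguments, generated by the convex conjugate
rhs = bregman(neg_binary_entropy, logit, sigmoid(y), sigmoid(x))
print(np.isclose(lhs, rhs))  # True
```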
We can also use Bregman divergences to quantify the variability or deviation of a random variable. Throughout this work, the following definition is a central concept.

Definition 2.2 (Banerjee et al. (2005)). Let $\phi\colon U \to \mathbb{R}$ be a differentiable, convex function. The Bregman Information (generated by $\phi$) of a random variable $X$ with realizations in $U$ is defined as
$$\mathbb{B}_\phi[X] = \mathbb{E}\big[d_\phi(\mathbb{E}[X], X)\big].$$

The Bregman Information generalizes the variance of a random variable since both are equal if we set $U = \mathbb{R}$ and $\phi(x) = x^2$. Thus, one interpretation of the Bregman Information is that it measures the divergence of a random variable from its mean. Another representation, which does not depend on $d_\phi$, is $\mathbb{B}_\phi[X] = \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$. Banerjee et al. (2005) show that this follows from the original definition. Recall that Jensen's inequality gives $\mathbb{E}[\phi(X)] \geq \phi(\mathbb{E}[X])$. Consequently, a second interpretation of the Bregman Information is that it measures the gap between both sides of Jensen's inequality for the convex function $\phi$ and random variable $X$ (Banerjee et al., 2005). It also shows that we do not require a subgradient for a generalization to the functional case. Thus, we define the functional Bregman Information generated by a non-differentiable convex $\phi$ as $\mathbb{B}_\phi[X] := \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$. The Bregman Information generated by the softplus function is depicted in Figure 2 for a binary random variable. The softplus finds use as an activation function in neural networks (Glorot et al., 2011; Murphy, 2022). Its generalization is the so-called LogSumExp function (cf. Section 3.4).

In Section 3, the Bregman Information will play a critical role in our bias-variance decompositions since it represents the variance term. Further, the LogSumExp-generated version covers the variance term for classification. It reduces to the softplus version for the binary case.
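To make these interpretations concrete, the following sketch (our own illustration, not taken from the paper) computes the Bregman Information via the Jensen-gap form $\mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$ for a binary random variable; it recovers the variance for $\phi(x) = x^2$ and the softplus-generated quantity from Figure 2:

```python
import numpy as np

def bregman_information(phi, values, probs):
    # Jensen-gap form: B_phi[X] = E[phi(X)] - phi(E[X])
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    return probs @ phi(values) - phi(probs @ values)

square = lambda x: x ** 2
softplus = lambda x: np.log1p(np.exp(x))

values, probs = [-2.0, 3.0], [0.4, 0.6]      # a binary random variable X

mean = np.dot(probs, values)
variance = np.dot(probs, (np.asarray(values) - mean) ** 2)
print(np.isclose(bregman_information(square, values, probs), variance))  # True
print(bregman_information(softplus, values, probs))  # Jensen gap of the softplus, cf. Figure 2
```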
2.2 Proper Scores and Exponential Families
Gneiting & Raftery (2007) give an extensive and approachable overview of proper scores. In short, proper scores put a negative loss on a distribution prediction $P \in \mathcal{P}$ for a target random variable $Y \sim Q \in \mathcal{P}$ and reach their maximum if $P = Q$. For a concise statement of our main result, we require a more technical definition, provided in the following similar to Hendrickson & Buehler (1971), Ovcharov (2015), and Ovcharov (2018). We call a function $S\colon \mathcal{P} \to L(\mathcal{P})$ a scoring rule or just score. Note that for a given $P$, $S(P)$ maps into a function space and can again be evaluated on an observation $y$, like $S(P)(y)$. To assess the goodness-of-fit between distributions $P$ and $Q$, we use the expected score $\langle S(P), Q \rangle = \mathbb{E}_{Y \sim Q}[S(P)(Y)]$. A score is defined to be proper on $\mathcal{P}$ if and only if $\langle S(P), Q \rangle \leq \langle S(Q), Q \rangle$ holds for all $P, Q \in \mathcal{P}$, and strictly proper if and only if equality implies $P = Q$. In other words, a score is proper if predicting the target distribution gives the best expectation, and strictly proper if no other prediction can achieve this value. Note that the choice of $\mathcal{P}$ is relevant: the negative squared error of a mean prediction is strictly proper for normal distributions with fixed variance but only proper if the variance varies. Given a proper score, the associated negative entropy $G\colon \mathcal{P} \to \mathbb{R}$ is defined as $G(Q) = \langle S(Q), Q \rangle$. It represents the highest reachable value for a given target. If $\mathcal{P}$ is convex, the negative entropy has $S$ as a subgradient and is (strictly) convex if and only if $S$ is (strictly) proper. For this case, Ovcharov (2018) proved that a proper score is closely related to a functional Bregman divergence generated by the associated negative entropy via $G(Q) - \langle S(P), Q \rangle = d_{G,S}(P, Q)$. An example of such a relation is the Kullback-Leibler divergence and the Shannon entropy associated with the log score (log-likelihood).
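As a small numerical illustration (our own sketch), the snippet below checks the properness inequality and the relation $G(Q) - \langle S(P), Q \rangle = d_{G,S}(P, Q)$ for the log score on a finite outcome space, where the right-hand side becomes the Kullback-Leibler divergence $\mathrm{KL}(Q \,\|\, P)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_score(p, q):
    # <S(P), Q> = E_{Y~Q}[ln p(Y)] for the log score on a finite outcome space
    return np.dot(q, np.log(p))

def kl(q, p):
    return np.dot(q, np.log(q / p))

q = rng.dirichlet(np.ones(5))   # target distribution Q
p = rng.dirichlet(np.ones(5))   # prediction P

neg_entropy = expected_score(q, q)              # G(Q) = <S(Q), Q>
assert expected_score(p, q) <= neg_entropy      # properness: <S(P), Q> <= <S(Q), Q>
gap = neg_entropy - expected_score(p, q)        # G(Q) - <S(P), Q>
print(np.isclose(gap, kl(q, p)))                # True: equals d_{G,S}(P, Q) = KL(Q || P)
```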
Next, we summarize relevant aspects of exponential families. Banerjee et al. (2005) provide a more extensive introduction. For a support set $\mathcal{T}$, the probability density/mass function $p_\theta$ at a point $x \in \mathcal{T}$ of an exponential family is given by $p_\theta(x) = \exp(\langle \theta, T(x) \rangle - A(\theta))\, h(x)$. Here, we call $\theta \in \Theta$ the natural parameter of the convex parameter space $\Theta \subseteq \mathbb{R}^d$, $T$ is the sufficient statistic, and $A$ is the log-partition. Table 1 gives two relevant examples and shows the mapping between typical and natural parameters. Further examples are the Dirichlet, exponential, and Poisson distributions. There are two relevant properties which we will require for our results. One is that $A$ is a strictly convex function. The other is $\mathbb{E}[T(X)] = \nabla A(\theta)$ for $X \sim p_\theta$. Banerjee et al. (2005) proved under mild conditions that an exponential family relates to a Bregman divergence and vice versa via the negative log-likelihood. Grünwald & Dawid (2004) proved a similar link between proper scores and exponential families.

Table 1: Examples of exponential families. The mapping defines the relation between natural parameters and typical parameters. We denote the dummy-encoding of $x \in \{1, \dots, k\}$ with $d_x$, class probabilities with $p_j$, mean with $\mu$, and standard deviation with $\sigma$. The Bernoulli distribution is a special case of the categorical distribution for $k = 2$.

Distribution | $\mathcal{T}$ | $\Theta$ | $T(x)$ | $A(\theta)$ | $h(x)$ | Mapping
Categorical ($k$ classes) | $\{1, \dots, k\}$ | $\mathbb{R}^{k-1}$ | $d_x$ | $\ln\big(1 + \sum_{i=1}^{k-1} \exp\theta_i\big)$ | $1$ | $p_j = \exp\theta_j \big/ \big(1 + \sum_{i=1}^{k-1}\exp\theta_i\big)$
Normal (known $\sigma$) | $\mathbb{R}$ | $\mathbb{R}$ | $x/\sigma$ | $\theta^2/2$ | $\exp\big(-\tfrac{x^2}{2\sigma^2}\big)\big/\big(\sqrt{2\pi}\,\sigma\big)$ | $\mu = \theta\sigma$

As we can see, exponential families, proper scores, and Bregman divergences have strong relationships to one another. By generalizing some properties to the functional case, this relationship will allow us to state every variance term as a (functional) Bregman Information. In the case of the log-likelihood of an exponential family, the functional Bregman Information reduces to a vector-based Bregman Information.
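To make the categorical row of Table 1 concrete, the following sketch (our own, with hypothetical helper names) maps natural parameters, i.e. logits for classes $1, \dots, k-1$ with class $k$ as reference, to class probabilities and numerically checks the property $\mathbb{E}[T(X)] = \nabla A(\theta)$:

```python
import numpy as np

def log_partition(theta):
    # A(theta) = ln(1 + sum_i exp(theta_i)) for the categorical family (Table 1)
    return np.log1p(np.exp(theta).sum())

def probabilities(theta):
    # mapping p_j = exp(theta_j) / (1 + sum_i exp(theta_i)); class k is the reference class
    p_rest = np.exp(theta) / (1.0 + np.exp(theta).sum())
    return np.append(p_rest, 1.0 - p_rest.sum())

theta = np.array([0.5, -1.0, 2.0])   # natural parameters for k = 4 classes
p = probabilities(theta)             # typical parameters (class probabilities)

# E[T(X)] for the dummy encoding d_x is the vector of the first k-1 class
# probabilities, and it coincides with the gradient of the log-partition A.
eps = 1e-6
grad_A = np.array([(log_partition(theta + eps * e) - log_partition(theta - eps * e)) / (2 * eps)
                   for e in np.eye(len(theta))])
print(np.allclose(grad_A, p[:-1]))   # True: E[T(X)] = grad A(theta)
```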
2.3 Other Bias-Variance Decompositions
All general decompositions in the current literature are either for categorical, real-valued, or parametric predictions, and it is not clear if a decomposition for proper scores of non-parametric distributions is possible.

James & Hastie (1997) formulate a decomposition for any loss function of categorical or real-valued predictions but do not provide a closed-form solution for a given case. Domingos (2000) introduces how a general bias-variance decomposition should look, though it is stated to be unclear when or if this decomposition holds for a loss function. James (2003) provides a bias-variance decomposition for symmetric loss functions. Heskes (1998) uses the bias-variance decomposition for the Kullback-Leibler divergence, which allows one to derive a decomposition for exponential families. Hansen & Heskes (2000) prove that a bias-variance decomposition of a parametric prediction is only possible if the prediction belongs to an exponential family. Importantly, they only introduce the specific decomposition for a given exponential family. The decomposition is not formulated for the natural parameters and relies on the canonical link function. Consequently, a relation to Bregman divergences and Bregman Information is missing, which we will provide.

A Pythagorean-like theorem for vector-based Bregman divergences is a known fact in the literature (Jones & Byrne, 1990; Csiszar, 1991; Della Pietra et al., 2002; Dawid, 2007; Telgarsky & Dasgupta, 2012). An equality in this theorem implies a decomposition of the form $\mathbb{E}[d_\phi(X, y)] = \mathbb{E}[d_\phi(X, x^*)] + d_\phi(x^*, y)$ with $x^* = \arg\min_z \mathbb{E}[d_\phi(X, z)]$ (Pfau, 2013). Brofos et al. (2019), Brinda et al. (2019), and Yang et al. (2020) relate the classification log-likelihood to the Kullback-Leibler divergence and provide a bias-variance decomposition, where $X$ takes the form of a predictive probability vector. They set $x^* \propto \exp(\mathbb{E}[\log X])$. Note that predictions in the logit space require normalization to the log space, which will not be the case in our formulation. Gupta et al. (2022) build upon the Bregman divergence decomposition and use the notion of primal and dual space of the variance. Even though the definitions are similar, the authors did not state the relation between Bregman Information and dual variance, for which they introduce a general law of total variance.
Due to the restriction of Bregman divergences to vector
inputs, it is not clear if a decomposition for proper scores of
non-parametric distributions is possible. In other words, we
require an extension of the current literature to functional
Bregman divergences for a positive result. In the following
section, we provide the required generalization and unify
the variance term in previous literature via the Bregman
Information.
3 A GENERAL BIAS-VARIANCE DECOMPOSITION

In this section, we offer a general bias-variance decomposition for strictly proper scores. The only assumptions are that the distribution set $\mathcal{P}$ is convex, the associated negative entropy is lower semicontinuous, and each respective expectation exists. Further, we will discover that the variance term is the Bregman Information generated by the convex conjugate of the associated negative entropy. This discovery generalizes and unifies the decompositions in the current literature for which a concrete form exists (Hansen & Heskes, 2000), and also provides a closed formulation, contrary to other general bias-variance decompositions (James & Hastie, 1997). All technical details and proofs are presented in Appendix B.
3.1 Functional Bregman Divergences of Convex Conjugates
The essential part for deriving our main result is the exchange of arguments in a functional Bregman divergence. Note that a subgradient of a strictly convex function is injective. Thus, its inverse exists on an appropriate domain, and this inverse is again a subgradient of the convex conjugate. With that in mind, we can state the following.
Lemma 3.1. Assume a strictly convex, lower semicontinuous function $G\colon \mathcal{P} \to \mathbb{R}$ has a subgradient $S$. Then, $d_{G^*,S^{-1}}$ is a restricted functional Bregman divergence with the properties
$$d_{G,S}(p, q) = d_{G^*,S^{-1}}(S(q), S(p)), \quad \text{and} \quad d_{G^*,S^{-1}}(p^*, q^*) = d_{G,S}\big(S^{-1}(q^*), S^{-1}(p^*)\big).$$
For an appropriate random variable $Q^*$ we have
$$\mathbb{E}\big[d_{G^*,S^{-1}}(p^*, Q^*)\big] = \mathbb{B}_{G^*}[Q^*] + d_{G^*,S^{-1}}\big(p^*, \mathbb{E}[Q^*]\big).$$
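The last of these properties can be sanity-checked in the familiar finite-dimensional setting (a sketch of our own, with the softplus function in place of a general negative entropy $G$):

```python
import numpy as np

phi = lambda x: np.log1p(np.exp(x))            # softplus as the generating convex function
grad_phi = lambda x: 1.0 / (1.0 + np.exp(-x))  # its gradient (sigmoid)

def d_phi(x, y):
    # d_phi(x, y) = phi(y) - phi(x) - grad_phi(x) * (y - x)
    return phi(y) - phi(x) - grad_phi(x) * (y - x)

q_vals = np.array([-1.0, 0.5, 2.0])   # realizations of a random variable Q
w = np.array([0.2, 0.5, 0.3])         # their probabilities
p = 0.7                               # a fixed first argument

lhs = w @ d_phi(p, q_vals)                            # E[d_phi(p, Q)]
breg_info = w @ phi(q_vals) - phi(w @ q_vals)         # B_phi[Q]
rhs = breg_info + d_phi(p, w @ q_vals)                # B_phi[Q] + d_phi(p, E[Q])
print(np.isclose(lhs, rhs))                           # True
```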
The last property is a generalization of the decomposition
of Bregman divergences combined with the Definition of
Bregman Information. Lemma 3.1 confirms that a critical
property of Bregman divergences is also well-defined for
functional Bregman divergences. Namely, we can exchange
the arguments by changing the generating convex function
to the convex conjugate in a dual space. This insight leads
us now to the main theoretical contribution of this work.
3.2 A Decomposition for Proper Scores
In the following, we present a bias-variance decomposition for strictly proper scores. Note that we require no assumptions about the score or its entropy being differentiable.
Theorem 3.2. For a strictly proper score $S$ with associated lower semicontinuous negative entropy $G$, an estimated prediction $\hat f$, and the target $Y \sim Q$, we have
$$\underbrace{-\mathbb{E}\big[S(\hat f)(Y)\big]}_{\text{Error}} = \underbrace{-G(Q)}_{\text{Noise}} + \underbrace{\mathbb{B}_{G^*}\big[S(\hat f)\big]}_{\text{"Variance"}} + \underbrace{d_{G^*,S^{-1}}\big(S(Q), \mathbb{E}[S(\hat f)]\big)}_{\text{Bias}}.$$
As we can see, the variance term is always represented by the Bregman Information generated by the convex conjugate of the associated negative entropy. The theorem directly follows from the relationship between proper scores and functional Bregman divergences provided by Ovcharov (2018) in combination with Lemma 3.1 (cf. Appendix B). We provide an example in the form of the prominent log score. For conciseness, we denote all distributions with their respective densities.
Example 3.3. The log score $S = \ln$ is strictly proper on any set of densities. Its negative entropy is the negative Shannon entropy $-H(p) = \int p(x) \ln p(x)\,\mathrm{d}x$. The convex conjugate is the log-partition function $(-H)^*(p^*) = \ln \int e^{p^*(x)}\,\mathrm{d}x$. Since $\mathbb{E}\big[(-H)^*(\ln \hat f)\big] = 0$, we receive the Bregman Information in the form $\mathbb{B}_{(-H)^*}\big[S(\hat f)\big] = -(-H)^*\big(\mathbb{E}[\ln \hat f]\big)$.
Dealing with non-parametric distributions can be challenging both in theory and in practice. Consequently, we provide the following restriction to exponential families and then neatly express the decomposition through the natural parameters. Specifically, $\mathbb{B}_{(-H)^*}$ will be reduced to a vector-based Bregman Information.
3.3 Exponential Families as a Special Case
When we deal with densities, it is often more practical to consider parametric densities, since they can be represented by a parameter vector instead of a function. Particularly relevant classes are exponential families, for which one uses the log-likelihood to assess the goodness of fit. In the last section, we derived the decomposition for the log-likelihood expressed in the function space in Example 3.3. Thus, we provide the following special case as a novel formulation of the result of Hansen & Heskes (2000).
Proposition 3.4. For an exponential family density/mass function $p_{\hat\theta}$ with estimated natural parameter vector $\hat\theta$, log-partition $A$, and reference function $h$, we have the log-likelihood decomposition
$$\underbrace{-\mathbb{E}\big[\ln p_{\hat\theta}(Y)\big]}_{\text{NLL}} = \underbrace{n(\theta)}_{\text{Noise}} + \underbrace{\mathbb{B}_{A}\big[\hat\theta\big]}_{\text{"Variance"}} + \underbrace{d_{A}\big(\theta, \mathbb{E}[\hat\theta]\big)}_{\text{Bias}}$$
where $n(\theta) = -A^*(\nabla A(\theta)) - \mathbb{E}[\ln h(Y)]$ and $Y \sim p_\theta$.
The proof in Appendix B uses $\theta = \nabla A^*(\mathbb{E}[T(Y)])$ in the last step, where $T$ is the sufficient statistic. This is a fundamental property of exponential families, but it only holds if the distributional assumption holds for the data. Since this is usually not the case in practical scenarios, it is important to note that the decomposition still holds if we replace every $\theta$ with $\nabla A^*(\mathbb{E}[T(Y)])$.
Example 3.5. We can recover the textbook decomposition of the MSE by using $Y \sim \mathcal{N}(\mu, \sigma^2)$ with unknown mean $\mu$ and known variance $\sigma^2$. The necessary information is provided in Table 1. For the LHS in Proposition 3.4 we have
$$-\mathbb{E}\big[\ln p_{\hat\theta}(Y)\big] = \frac{\mathbb{E}\big[(Y - \hat\mu)^2\big]}{2\sigma^2} + \ln\big(\sqrt{2\pi}\,\sigma\big).$$
And for the RHS, we get $n(\theta) = \tfrac{1}{2} + \ln(\sqrt{2\pi}\,\sigma)$, $\mathbb{B}_A[\hat\theta] = \tfrac{1}{2\sigma^2}\mathbb{V}[\hat\mu]$, and $d_A(\theta, \mathbb{E}[\hat\theta]) = \tfrac{(\mu - \mathbb{E}[\hat\mu])^2}{2\sigma^2}$. Finally, we subtract $\ln(\sqrt{2\pi}\,\sigma)$ on both sides and multiply by $2\sigma^2$, which yields the textbook decomposition $\mathbb{E}[(Y - \hat\mu)^2] = \sigma^2 + \mathbb{V}[\hat\mu] + (\mu - \mathbb{E}[\hat\mu])^2$.
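The following numerical check (our own sketch, using an arbitrary two-point distribution for the estimator $\hat\mu$) verifies the terms of Example 3.5 for the normal family with known $\sigma$:

```python
import numpy as np

mu, sigma = 1.5, 0.8                 # true mean and known standard deviation
mu_hats = np.array([1.2, 2.1])       # the estimator mu_hat takes two values ...
w = np.array([0.3, 0.7])             # ... with these probabilities

mean_hat = w @ mu_hats
var_hat = w @ (mu_hats - mean_hat) ** 2

# LHS: -E[ln p_theta_hat(Y)] = E[(Y - mu_hat)^2] / (2 sigma^2) + ln(sqrt(2 pi) sigma),
# where E[(Y - mu_hat)^2] = sigma^2 + V[mu_hat] + (mu - E[mu_hat])^2 for Y independent of mu_hat
e_sq = sigma**2 + var_hat + (mu - mean_hat)**2
lhs = e_sq / (2 * sigma**2) + np.log(np.sqrt(2 * np.pi) * sigma)

# RHS terms from Example 3.5
noise = 0.5 + np.log(np.sqrt(2 * np.pi) * sigma)   # n(theta)
variance = var_hat / (2 * sigma**2)                # B_A[theta_hat]
bias = (mu - mean_hat)**2 / (2 * sigma**2)         # d_A(theta, E[theta_hat])

print(np.isclose(lhs, noise + variance + bias))    # True
```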