Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition

Sebastian G. Gruber
German Cancer Research Center (DKFZ)
German Cancer Consortium (DKTK)
Goethe University Frankfurt, Germany
sebastian.gruber@dkfz.de

Florian Buettner
German Cancer Research Center (DKFZ)
German Cancer Consortium (DKTK)
Frankfurt Cancer Institute, Germany
Goethe University Frankfurt, Germany
florian.buettner@dkfz.de
Abstract
Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift and restricted to classification. Alternatively, proper scores can be used for most predictive tasks, but a bias-variance decomposition for model uncertainty does not exist in the current literature. In this work we introduce a general bias-variance decomposition for strictly proper scores, giving rise to the Bregman Information as the variance term. We discover how exponential families and the classification log-likelihood are special cases and provide novel formulations. Surprisingly, we can express the classification case purely in the logit space. We showcase the practical relevance of this decomposition on several downstream tasks, including model ensembles and confidence regions. Further, we demonstrate how different approximations of the instance-level Bregman Information allow out-of-distribution detection for all degrees of domain drift.
1 INTRODUCTION
A core principle behind the success of modern Machine and Deep Learning approaches is the use of loss functions, which are applied to optimize and compare the goodness-of-fit of predictive models. Typical loss functions, such as the Brier score or the negative log-likelihood, capture not only predictive power (in the sense of accuracy) but also predictive uncertainty. The latter is particularly relevant in sensitive forecasting domains, such as cancer diagnostics (Haggenmüller et al., 2021), genotype-based disease prediction (Katsaouni et al., 2021), or climate prediction (Yen et al., 2019).
Proper scores are commonly used as loss functions for probabilistic modelling, since their defining criterion is to assign the best value to the target distribution as a prediction. Consequently, they are widely applicable, from quantile regression (Gneiting & Raftery, 2007) to generative models (Song et al., 2021). They are a generalization of the log-likelihood and also cover exponential families (Grünwald & Dawid, 2004). However, for such loss functions, it is not clear how we can decompose them such that a component capturing predictive uncertainty arises.

Consequently, predictive uncertainty is typically only considered as the variance of predictions or, in classification, via the confidence score associated with the top-label prediction. Such confidence scores capture the predictive uncertainty well if they are calibrated, namely if the confidence of a prediction matches its true likelihood (Guo et al., 2017). However, the calibration error of these confidence scores typically increases under domain drift, making them an unreliable measure for predictive uncertainty in many real-world applications (Ovadia et al., 2019; Tomani & Buettner, 2021).
In this work, we discover the Bregman Information as a natural replacement of model variance via a bias-variance decomposition for strictly proper scores. The Bregman Information generalizes the variance of a random variable via a closed-form definition based on a generating function (Banerjee et al., 2005). In the case of our decomposition, this generating function is a convex conjugate directly associated with the respective proper score. The source code for the experiments is openly accessible at https://github.com/MLO-lab/Uncertainty_Estimates_via_BVD.

Figure 1: Accuracy after discarding test instances with high levels of uncertainty. We can discard fewer samples to reach better accuracy when using Bregman Information as the uncertainty measure. For example, to achieve the validation set accuracy (dotted line) for severely corrupted data, we only have to discard 7% of the most uncertain in-domain samples, contrary to 14% when using the confidence score. The standard deviation bounds stem from different types of corruption.

We summarize our contributions in the following:
• In Section 3, we generalize relevant properties to functional Bregman divergences, which allows for deriving a bias-variance decomposition for strictly proper scores. Via Bregman Information, we give novel formulations for decompositions of exponential families and the classification log-likelihood in the logit space.

• We generalize the law of total variance to Bregman Information and show how ensemble predictions marginalize out a specific source of uncertainty in Section 3.5. We also propose a general way to give confidence regions for predictions in Section 3.6.

• We showcase experiments on how typical classifiers differ in their Bregman Information in Section 4. There, we demonstrate that the Bregman Information can be a more meaningful measure of out-of-domain uncertainty compared to the confidence score in the case of corrupted CIFAR-10 and ImageNet (Figure 1 and Algorithm 1).
2 BACKGROUND
In this section, we first give a basic introduction to Bregman divergences and Bregman Information. We specifically mention recent developments for functional Bregman divergences, since we will require and provide generalizations on this topic. This is followed by an introduction to the basic concepts of proper scores and exponential families, which are related to Bregman divergences. Finally, we discuss other proposed bias-variance decompositions in the literature to put our contribution into perspective.
2.1 Bregman Divergences and Bregman Information
Bregman divergences are a class of divergences occurring
in a wide range of applications (Bregman, 1967; Banerjee
et al., 2005; Frigyik et al., 2006; Si et al., 2009; Gupta et al.,
2022). We use the following definition.
Definition 2.1 (Bregman (1967)). Let $\phi\colon U \to \mathbb{R}$ be a differentiable, convex function with $U \subseteq \mathbb{R}^d$. The Bregman divergence generated by $\phi$ of $x, y \in U$ is defined as
$$d_\phi(x, y) = \phi(y) - \phi(x) - \langle \nabla\phi(x), y - x \rangle.$$
It can be interpreted geometrically as the difference between $\phi$ and the supporting hyperplane of $\phi$ at $x$, evaluated at $y$. We have $d_\phi(x, y) \geq 0$ with $d_\phi(x, y) = 0$ if $x = y$.
By definition, we can use Bregman divergences for scalar and vector inputs. But, in the infinite-dimensional case, for example when dealing with a continuous distribution space $\mathcal{P}$, the gradient vector and the inner product are not defined anymore. As a solution to this, Frigyik et al. (2006) introduce functional Bregman divergences by replacing the inner product term with the Fréchet derivative. The authors showed that the functional case generalizes the standard case. Since some relevant functions are not Fréchet differentiable, Ovcharov (2018) offers an alternative approach to define the functional case based on subgradients. In the context of dual vector spaces with pairing $\langle \cdot, \cdot \rangle$, a subgradient $x^*$ at point $x \in U$ of a convex function $\phi\colon U \to \mathbb{R}$ fulfills the property $\phi(y) \geq \phi(x) + \langle x^*, y - x \rangle$ for all $y \in U$. A function $\phi'(x)$ which maps to a subgradient of $\phi$ at $x$ for all $x \in U$ is called a selection of subgradients, or, if it is unambiguous in the context, just subgradient of $\phi$. In general, subgradients are not unique, unlike gradients. Ovcharov (2018) proposes to use the vector space $L(\mathcal{P})$ of $\mathcal{P}$-integrable functions and the vector space $\operatorname{span}\mathcal{P}$ of finite linear combinations of elements from $\mathcal{P}$. These spaces are dual with the pairing $\langle \cdot, \cdot \rangle$ defined as $\langle f, P \rangle = \int f \,\mathrm{d}P$ for $f \in L(\mathcal{P})$ and $P \in \operatorname{span}\mathcal{P}$. They proceed to define the functional Bregman divergence generated by $\phi, \phi'$ as $d_{\phi,\phi'}(x, y) = \phi(y) - \phi(x) - \phi'(x)(y - x)$. We will also use this definition for general vector spaces as long as a subgradient is defined. In Section 3, we encounter the case when a subgradient $\phi'_r$ of $\phi$ is only defined on a smaller domain $V \subseteq U$. Then, we refer to $d_{\phi,\phi'_r}\colon V \times U \to \mathbb{R}$ as a restricted functional Bregman divergence.
Figure 2: Illustration of the Bregman Information generated by the softplus function $\sigma_+(x) = \ln(1 + e^x)$ of a binary random variable.

The convex conjugate $\phi^*$ of a function $\phi$ is defined as $\phi^*(x^*) = \sup_{y}\{\langle x^*, y\rangle - \phi(y)\}$ in the context of dual vector spaces (Zalinescu, 2002). If $\phi$ is differentiable and strictly convex, then $(\nabla\phi)^{-1} = \nabla\phi^*$ and $\phi^{**} = \phi$. For this case, Banerjee et al. (2005) give the important fact that $d_\phi(x, y) = d_{\phi^*}(\nabla\phi(y), \nabla\phi(x))$. That is, by using the convex conjugate, we can flip the arguments in a Bregman divergence. To derive our main contribution, we will state a generalization of this property to functional Bregman divergences in Lemma 3.1.
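As a concrete finite-dimensional illustration of this argument flip (a sketch of our own, not code from the paper), the following snippet checks $d_\phi(x, y) = d_{\phi^*}(\nabla\phi(y), \nabla\phi(x))$ for the softplus function, whose convex conjugate is the negative binary entropy:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def sigmoid(x):  # gradient of softplus
    return 1.0 / (1.0 + np.exp(-x))

def neg_binary_entropy(p):  # convex conjugate of softplus on (0, 1)
    return p * np.log(p) + (1.0 - p) * np.log(1.0 - p)

def logit(p):  # gradient of the conjugate, i.e. the inverse of sigmoid
    return np.log(p / (1.0 - p))

def bregman(phi, grad_phi, x, y):
    # d_phi(x, y) = phi(y) - phi(x) - <grad phi(x), y - x>  (Definition 2.1)
    return phi(y) - phi(x) - grad_phi(x) * (y - x)

x, y = 0.3, -1.2
lhs = bregman(softplus, sigmoid, x, y)
# same divergence with flipped arguments, generated by the convex conjugate
rhs = bregman(neg_binary_entropy, logit, sigmoid(y), sigmoid(x))
print(np.isclose(lhs, rhs))  # True
```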
We can also use Bregman divergences to quantify the variability or deviation of a random variable. Throughout this work, the following definition is a central concept.

Definition 2.2 (Banerjee et al. (2005)). Let $\phi\colon U \to \mathbb{R}$ be a differentiable, convex function. The Bregman Information (generated by $\phi$) of a random variable $X$ with realizations in $U$ is defined as
$$\mathbb{B}_\phi[X] = \mathbb{E}\big[d_\phi(\mathbb{E}[X], X)\big].$$

The Bregman Information generalizes the variance of a random variable since both are equal if we set $U = \mathbb{R}$ and $\phi(x) = x^2$. Thus, one interpretation of the Bregman Information is that it measures the divergence of a random variable from its mean. Another representation, which does not depend on $d_\phi$, is $\mathbb{B}_\phi[X] = \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$. Banerjee et al. (2005) show that this follows from the original definition. Recall that Jensen's inequality gives $\mathbb{E}[\phi(X)] \geq \phi(\mathbb{E}[X])$. Consequently, a second interpretation of the Bregman Information is that it measures the gap between both sides of Jensen's inequality for the convex function $\phi$ and random variable $X$ (Banerjee et al., 2005). It also shows that we do not require a subgradient for a generalization to the functional case. Thus, we define the functional Bregman Information generated by a non-differentiable convex $\phi$ as $\mathbb{B}_\phi[X] := \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$. The Bregman Information generated by the softplus function is depicted in Figure 2 for a binary random variable. The softplus finds use as an activation function in neural networks (Glorot et al., 2011; Murphy, 2022). Its generalization is the so-called LogSumExp function (cf. Section 3.4).

In Section 3, the Bregman Information will play a critical role in our bias-variance decompositions since it represents the variance term. Further, the LogSumExp-generated version covers the variance term for classification. It reduces to the softplus version for the binary case.
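To make these interpretations concrete, the following sketch (our own illustration, not taken from the paper) computes the Bregman Information via the Jensen-gap form $\mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X])$ for a binary random variable; it recovers the variance for $\phi(x) = x^2$ and the softplus-generated quantity from Figure 2:

```python
import numpy as np

def bregman_information(phi, values, probs):
    # Jensen-gap form: B_phi[X] = E[phi(X)] - phi(E[X])
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    return probs @ phi(values) - phi(probs @ values)

square = lambda x: x ** 2
softplus = lambda x: np.log1p(np.exp(x))

values, probs = [-2.0, 3.0], [0.4, 0.6]      # a binary random variable X

mean = np.dot(probs, values)
variance = np.dot(probs, (np.asarray(values) - mean) ** 2)
print(np.isclose(bregman_information(square, values, probs), variance))  # True
print(bregman_information(softplus, values, probs))  # Jensen gap of the softplus, cf. Figure 2
```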
2.2 Proper Scores and Exponential Families
Gneiting & Raftery (2007) give an extensive and approachable overview of proper scores. In short, proper scores put a negative loss on a distribution prediction $P \in \mathcal{P}$ for a target random variable $Y \sim Q \in \mathcal{P}$ and reach their maximum if $P = Q$. For a concise statement of our main result, we require a more technical definition, provided in the following similar to Hendrickson & Buehler (1971), Ovcharov (2015), and Ovcharov (2018). We call a function $S\colon \mathcal{P} \to L(\mathcal{P})$ a scoring rule or just score. Note that for a given $P$, $S(P)$ maps into a function space and can again be evaluated on an observation $y$, like $S(P)(y)$. To assess the goodness-of-fit between distributions $P$ and $Q$, we use the expected score $\langle S(P), Q \rangle = \mathbb{E}_{Y \sim Q}[S(P)(Y)]$. A score is defined to be proper on $\mathcal{P}$ if and only if $\langle S(P), Q \rangle \leq \langle S(Q), Q \rangle$ holds for all $P, Q \in \mathcal{P}$, and strictly proper if and only if equality implies $P = Q$. In other words, a score is proper if predicting the target distribution gives the best expectation, and strictly proper if no other prediction can achieve this value. Note that the choice of $\mathcal{P}$ is relevant: the negative squared error of a mean prediction is strictly proper for normal distributions with fixed variance but only proper if the variance varies. Given a proper score, the associated negative entropy $G\colon \mathcal{P} \to \mathbb{R}$ is defined as $G(Q) = \langle S(Q), Q \rangle$. It represents the highest reachable value for a given target. If $\mathcal{P}$ is convex, the negative entropy has $S$ as a subgradient and is (strictly) convex if and only if $S$ is (strictly) proper. For this case, Ovcharov (2018) proved that a proper score is closely related to a functional Bregman divergence generated by the associated negative entropy via $G(Q) - \langle S(P), Q \rangle = d_{G,S}(P, Q)$. An example of such a relation is the Kullback-Leibler divergence and the Shannon entropy associated with the log score (log-likelihood).
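As a small numerical illustration (our own sketch), the snippet below checks the properness inequality and the relation $G(Q) - \langle S(P), Q \rangle = d_{G,S}(P, Q)$ for the log score on a finite outcome space, where the right-hand side becomes the Kullback-Leibler divergence $\mathrm{KL}(Q \,\|\, P)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_score(p, q):
    # <S(P), Q> = E_{Y~Q}[ln p(Y)] for the log score on a finite outcome space
    return np.dot(q, np.log(p))

def kl(q, p):
    return np.dot(q, np.log(q / p))

q = rng.dirichlet(np.ones(5))   # target distribution Q
p = rng.dirichlet(np.ones(5))   # prediction P

neg_entropy = expected_score(q, q)              # G(Q) = <S(Q), Q>
assert expected_score(p, q) <= neg_entropy      # properness: <S(P), Q> <= <S(Q), Q>
gap = neg_entropy - expected_score(p, q)        # G(Q) - <S(P), Q>
print(np.isclose(gap, kl(q, p)))                # True: equals d_{G,S}(P, Q) = KL(Q || P)
```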
Next, we summarize relevant aspects of exponential families. Banerjee et al. (2005) provide a more extensive introduction. For a support set $\mathcal{T}$, the probability density/mass function $p_\theta$ at a point $x \in \mathcal{T}$ of an exponential family is given by $p_\theta(x) = \exp(\langle \theta, T(x) \rangle - A(\theta))\, h(x)$. Here, we call $\theta \in \Theta$ the natural parameter of the convex parameter space $\Theta \subseteq \mathbb{R}^d$, $T$ is the sufficient statistic, and $A$ is the log-partition. Table 1 gives two relevant examples and shows the mapping between typical and natural parameters. Further examples are the Dirichlet, exponential, and Poisson distributions. There are two relevant properties which we will require for our results. One is that $A$ is a strictly convex function. The other is $\mathbb{E}[T(X)] = \nabla A(\theta)$ for $X \sim p_\theta$. Banerjee et al. (2005) proved under mild conditions that an exponential family relates to a Bregman divergence and vice versa via the negative log-likelihood. Grünwald & Dawid (2004) proved a similar link between proper scores and exponential families.

Table 1: Examples of exponential families. The mapping defines the relation between natural parameters and typical parameters. We denote the dummy-encoding of $x \in \{1, \dots, k\}$ with $d_x$, class probabilities with $p_j$, mean with $\mu$, and standard deviation with $\sigma$. The Bernoulli distribution is a special case of the categorical distribution for $k = 2$.

Distribution | $\mathcal{T}$ | $\Theta$ | $T(x)$ | $A(\theta)$ | $h(x)$ | Mapping
Categorical ($k$ classes) | $\{1, \dots, k\}$ | $\mathbb{R}^{k-1}$ | $d_x$ | $\ln\big(1 + \sum_{i=1}^{k-1} \exp\theta_i\big)$ | $1$ | $p_j = \exp\theta_j \big/ \big(1 + \sum_{i=1}^{k-1}\exp\theta_i\big)$
Normal (known $\sigma$) | $\mathbb{R}$ | $\mathbb{R}$ | $x/\sigma$ | $\theta^2/2$ | $\exp\big(-\tfrac{x^2}{2\sigma^2}\big)\big/\big(\sqrt{2\pi}\,\sigma\big)$ | $\mu = \theta\sigma$

As we can see, exponential families, proper scores, and Bregman divergences have strong relationships to one another. By generalizing some properties to the functional case, this relationship will allow us to state every variance term as a (functional) Bregman Information. In the case of the log-likelihood of an exponential family, the functional Bregman Information reduces to a vector-based Bregman Information.
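To make the categorical row of Table 1 concrete, the following sketch (our own, with hypothetical helper names) maps natural parameters, i.e. logits for classes $1, \dots, k-1$ with class $k$ as reference, to class probabilities and numerically checks the property $\mathbb{E}[T(X)] = \nabla A(\theta)$:

```python
import numpy as np

def log_partition(theta):
    # A(theta) = ln(1 + sum_i exp(theta_i)) for the categorical family (Table 1)
    return np.log1p(np.exp(theta).sum())

def probabilities(theta):
    # mapping p_j = exp(theta_j) / (1 + sum_i exp(theta_i)); class k is the reference class
    p_rest = np.exp(theta) / (1.0 + np.exp(theta).sum())
    return np.append(p_rest, 1.0 - p_rest.sum())

theta = np.array([0.5, -1.0, 2.0])   # natural parameters for k = 4 classes
p = probabilities(theta)             # typical parameters (class probabilities)

# E[T(X)] for the dummy encoding d_x is the vector of the first k-1 class
# probabilities, and it coincides with the gradient of the log-partition A.
eps = 1e-6
grad_A = np.array([(log_partition(theta + eps * e) - log_partition(theta - eps * e)) / (2 * eps)
                   for e in np.eye(len(theta))])
print(np.allclose(grad_A, p[:-1]))   # True: E[T(X)] = grad A(theta)
```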
2.3 Other Bias-Variance Decompositions
All general decompositions in the current literature are either for categorical, real-valued, or parametric predictions, and it is not clear if a decomposition for proper scores of non-parametric distributions is possible.

James & Hastie (1997) formulate a decomposition for any loss function of categorical or real-valued predictions but do not provide a closed-form solution for a given case. Domingos (2000) introduces how a general bias-variance decomposition should look, though it is stated to be unclear when or if this decomposition holds for a loss function. James (2003) provides a bias-variance decomposition for symmetric loss functions. Heskes (1998) uses the bias-variance decomposition for the Kullback-Leibler divergence, which allows one to derive a decomposition for exponential families. Hansen & Heskes (2000) prove that a bias-variance decomposition of a parametric prediction is only possible if the prediction belongs to an exponential family. Importantly, they only introduce the specific decomposition for a given exponential family. The decomposition is not formulated for the natural parameters and relies on the canonical link function. Consequently, a relation to Bregman divergences and Bregman Information is missing, which we will provide.

A Pythagorean-like theorem for vector-based Bregman divergences is a known fact in the literature (Jones & Byrne, 1990; Csiszar, 1991; Della Pietra et al., 2002; Dawid, 2007; Telgarsky & Dasgupta, 2012). An equality in this theorem implies a decomposition of the form $\mathbb{E}[d_\phi(X, y)] = \mathbb{E}[d_\phi(X, x^*)] + d_\phi(x^*, y)$ with $x^* = \arg\min_z \mathbb{E}[d_\phi(X, z)]$ (Pfau, 2013). Brofos et al. (2019), Brinda et al. (2019), and Yang et al. (2020) relate the classification log-likelihood to the Kullback-Leibler divergence and provide a bias-variance decomposition, where $X$ takes the form of a predictive probability vector. They set $x^* \propto \exp(\mathbb{E}[\log X])$. Note that predictions in the logit space require normalization to the log space, which will not be the case in our formulation. Gupta et al. (2022) build upon the Bregman divergence decomposition and use the notion of primal and dual space of the variance. Even though the definitions are similar, the authors did not state the relation between Bregman Information and dual variance, for which they introduce a general law of total variance.
Due to the restriction of Bregman divergences to vector
inputs, it is not clear if a decomposition for proper scores of
non-parametric distributions is possible. In other words, we
require an extension of the current literature to functional
Bregman divergences for a positive result. In the following
section, we provide the required generalization and unify
the variance term in previous literature via the Bregman
Information.
3 A GENERAL BIAS-VARIANCE DECOMPOSITION

In this section, we offer a general bias-variance decomposition for strictly proper scores. The only assumptions are that the distribution set $\mathcal{P}$ is convex, the associated negative entropy is lower semicontinuous, and each respective expectation exists. Further, we will discover that the variance term is the Bregman Information generated by the convex conjugate of the associated negative entropy. This discovery generalizes and unifies the decompositions in the current literature for which a concrete form exists (Hansen & Heskes, 2000), and also provides a closed formulation, contrary to other general bias-variance decompositions (James & Hastie, 1997). All technical details and proofs are presented in Appendix B.
3.1 Functional Bregman Divergences of Convex Conjugates
The essential part for deriving our main result is the exchange of arguments in a functional Bregman divergence. Note that a subgradient of a strictly convex function is injective. Thus, its inverse exists on an appropriate domain, and this inverse is again a subgradient of the convex conjugate. With that in mind, we can state the following.
Lemma 3.1. Assume a strictly convex, lower semicontinuous function $G\colon \mathcal{P} \to \mathbb{R}$ has a subgradient $S$. Then, $d_{G^*,S^{-1}}$ is a restricted functional Bregman divergence with the properties
$$d_{G,S}(p, q) = d_{G^*,S^{-1}}(S(q), S(p)), \quad \text{and} \quad d_{G^*,S^{-1}}(p^*, q^*) = d_{G,S}\big(S^{-1}(q^*), S^{-1}(p^*)\big).$$
For an appropriate random variable $Q^*$ we have
$$\mathbb{E}\big[d_{G^*,S^{-1}}(p^*, Q^*)\big] = \mathbb{B}_{G^*}[Q^*] + d_{G^*,S^{-1}}\big(p^*, \mathbb{E}[Q^*]\big).$$
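The last of these properties can be sanity-checked in the familiar finite-dimensional setting (a sketch of our own, with the softplus function in place of a general negative entropy $G$):

```python
import numpy as np

phi = lambda x: np.log1p(np.exp(x))            # softplus as the generating convex function
grad_phi = lambda x: 1.0 / (1.0 + np.exp(-x))  # its gradient (sigmoid)

def d_phi(x, y):
    # d_phi(x, y) = phi(y) - phi(x) - grad_phi(x) * (y - x)
    return phi(y) - phi(x) - grad_phi(x) * (y - x)

q_vals = np.array([-1.0, 0.5, 2.0])   # realizations of a random variable Q
w = np.array([0.2, 0.5, 0.3])         # their probabilities
p = 0.7                               # a fixed first argument

lhs = w @ d_phi(p, q_vals)                            # E[d_phi(p, Q)]
breg_info = w @ phi(q_vals) - phi(w @ q_vals)         # B_phi[Q]
rhs = breg_info + d_phi(p, w @ q_vals)                # B_phi[Q] + d_phi(p, E[Q])
print(np.isclose(lhs, rhs))                           # True
```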
The last property is a generalization of the decomposition
of Bregman divergences combined with the Definition of
Bregman Information. Lemma 3.1 confirms that a critical
property of Bregman divergences is also well-defined for
functional Bregman divergences. Namely, we can exchange
the arguments by changing the generating convex function
to the convex conjugate in a dual space. This insight leads
us now to the main theoretical contribution of this work.
3.2 A Decomposition for Proper Scores
In the following, we present a bias-variance decomposition for strictly proper scores. Note that we require no assumptions about the score or its entropy being differentiable.
Theorem 3.2. For a strictly proper score $S$ with associated lower semicontinuous negative entropy $G$, an estimated prediction $\hat f$, and the target $Y \sim Q$, we have
$$\underbrace{-\mathbb{E}\big[S(\hat f)(Y)\big]}_{\text{Error}} = \underbrace{-G(Q)}_{\text{Noise}} + \underbrace{\mathbb{B}_{G^*}\big[S(\hat f)\big]}_{\text{"Variance"}} + \underbrace{d_{G^*,S^{-1}}\big(S(Q), \mathbb{E}[S(\hat f)]\big)}_{\text{Bias}}.$$
As we can see, the variance term is always represented by the Bregman Information generated by the convex conjugate of the associated negative entropy. The theorem directly follows from the relationship between proper scores and functional Bregman divergences provided by Ovcharov (2018) in combination with Lemma 3.1 (cf. Appendix B). We provide an example in the form of the prominent log score. For conciseness, we denote all distributions with their respective densities.
Example 3.3. The log score $S = \ln$ is strictly proper on any set of densities. Its negative entropy is the negative Shannon entropy $-H(p) = \int p(x) \ln p(x)\,\mathrm{d}x$. The convex conjugate is the log-partition function $(-H)^*(p^*) = \ln \int e^{p^*(x)}\,\mathrm{d}x$. Since $\mathbb{E}\big[(-H)^*(\ln \hat f)\big] = 0$, we receive the Bregman Information in the form $\mathbb{B}_{(-H)^*}\big[S(\hat f)\big] = -(-H)^*\big(\mathbb{E}[\ln \hat f]\big)$.
Dealing with non-parametric distributions can be challenging both in theory and in practice. Consequently, we provide the following restriction to exponential families and then neatly express the decomposition through the natural parameters. Specifically, $\mathbb{B}_{(-H)^*}$ will be reduced to a vector-based Bregman Information.
3.3 Exponential Families as a Special Case
When we deal with densities, it is often more practical to consider parametric densities, since they can be represented by a parameter vector instead of a function. Particularly relevant classes are exponential families, for which one uses the log-likelihood to assess the goodness of fit. In the last section, we derived the decomposition for the log-likelihood expressed in the function space in Example 3.3. Thus, we provide the following special case as a novel formulation of the result of Hansen & Heskes (2000).
Proposition 3.4. For an exponential family density/mass function $p_{\hat\theta}$ with estimated natural parameter vector $\hat\theta$, log-partition $A$, and reference function $h$, we have the log-likelihood decomposition
$$\underbrace{-\mathbb{E}\big[\ln p_{\hat\theta}(Y)\big]}_{\text{NLL}} = \underbrace{n(\theta)}_{\text{Noise}} + \underbrace{\mathbb{B}_{A}\big[\hat\theta\big]}_{\text{"Variance"}} + \underbrace{d_{A}\big(\theta, \mathbb{E}[\hat\theta]\big)}_{\text{Bias}}$$
where $n(\theta) = -A^*(\nabla A(\theta)) - \mathbb{E}[\ln h(Y)]$ and $Y \sim p_\theta$.
The proof in Appendix B uses $\theta = \nabla A^*(\mathbb{E}[T(Y)])$ in the last step, where $T$ is the sufficient statistic. This is a fundamental property of exponential families, but it only holds if the distributional assumption holds for the data. Since this is usually not the case in practical scenarios, it is important to note that the decomposition still holds if we replace every $\theta$ with $\nabla A^*(\mathbb{E}[T(Y)])$.
Example 3.5. We can recover the textbook decomposition of the MSE by using $Y \sim \mathcal{N}(\mu, \sigma^2)$ with unknown mean $\mu$ and known variance $\sigma^2$. The necessary information is provided in Table 1. For the LHS in Proposition 3.4 we have
$$-\mathbb{E}\big[\ln p_{\hat\theta}(Y)\big] = \frac{\mathbb{E}\big[(Y - \hat\mu)^2\big]}{2\sigma^2} + \ln\big(\sqrt{2\pi}\,\sigma\big).$$
And for the RHS, we get $n(\theta) = \tfrac{1}{2} + \ln(\sqrt{2\pi}\,\sigma)$, $\mathbb{B}_A[\hat\theta] = \tfrac{1}{2\sigma^2}\mathbb{V}[\hat\mu]$, and $d_A(\theta, \mathbb{E}[\hat\theta]) = \tfrac{(\mu - \mathbb{E}[\hat\mu])^2}{2\sigma^2}$. Finally, we subtract $\ln(\sqrt{2\pi}\,\sigma)$ on both sides and multiply by $2\sigma^2$, which yields the textbook decomposition $\mathbb{E}[(Y - \hat\mu)^2] = \sigma^2 + \mathbb{V}[\hat\mu] + (\mu - \mathbb{E}[\hat\mu])^2$.
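The following numerical check (our own sketch, using an arbitrary two-point distribution for the estimator $\hat\mu$) verifies the terms of Example 3.5 for the normal family with known $\sigma$:

```python
import numpy as np

mu, sigma = 1.5, 0.8                 # true mean and known standard deviation
mu_hats = np.array([1.2, 2.1])       # the estimator mu_hat takes two values ...
w = np.array([0.3, 0.7])             # ... with these probabilities

mean_hat = w @ mu_hats
var_hat = w @ (mu_hats - mean_hat) ** 2

# LHS: -E[ln p_theta_hat(Y)] = E[(Y - mu_hat)^2] / (2 sigma^2) + ln(sqrt(2 pi) sigma),
# where E[(Y - mu_hat)^2] = sigma^2 + V[mu_hat] + (mu - E[mu_hat])^2 for Y independent of mu_hat
e_sq = sigma**2 + var_hat + (mu - mean_hat)**2
lhs = e_sq / (2 * sigma**2) + np.log(np.sqrt(2 * np.pi) * sigma)

# RHS terms from Example 3.5
noise = 0.5 + np.log(np.sqrt(2 * np.pi) * sigma)   # n(theta)
variance = var_hat / (2 * sigma**2)                # B_A[theta_hat]
bias = (mu - mean_hat)**2 / (2 * sigma**2)         # d_A(theta, E[theta_hat])

print(np.isclose(lhs, noise + variance + bias))    # True
```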