
Table 1: Examples of exponential families. The mapping defines the relation between natural parameters and typical parameters. We denote the dummy-encoding of $x \in \{1, \dots, k\}$ with $d_x$, class probabilities with $p_j$, the mean with $\mu$, and the standard deviation with $\sigma$. The Bernoulli distribution is a special case of the categorical distribution for $k = 2$.

Distribution | $\mathcal{T}$ | $\Theta$ | $T(x)$ | $A(\theta)$ | $h(x)$ | Mapping
Categorical ($k$ classes) | $\{1, \dots, k\}$ | $\mathbb{R}^{k-1}$ | $d_x$ | $\ln\bigl(1 + \sum_{i=1}^{k-1} \exp\theta_i\bigr)$ | $1$ | $p_j = \exp\theta_j \,/\, \bigl(1 + \sum_{i=1}^{k-1} \exp\theta_i\bigr)$
Normal (known $\sigma$) | $\mathbb{R}$ | $\mathbb{R}$ | $x/\sigma$ | $\theta^2/2$ | $\exp\!\bigl(-x^2/(2\sigma^2)\bigr)/\bigl(\sqrt{2\pi}\,\sigma\bigr)$ | $\mu = \theta\sigma$
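As a quick consistency check of the Normal row (our own verification, not part of the original table), substituting its entries into the exponential-family form $p_\theta(x) = h(x)\exp(\theta\,T(x) - A(\theta))$ recovers the familiar Gaussian density:
$$
h(x)\exp\bigl(\theta\,T(x) - A(\theta)\bigr)
= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Bigl(-\frac{x^2}{2\sigma^2} + \frac{\theta x}{\sigma} - \frac{\theta^2}{2}\Bigr)
= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\Bigl(-\frac{(x-\theta\sigma)^2}{2\sigma^2}\Bigr),
$$
i.e., the density of a normal distribution with mean $\mu = \theta\sigma$ and standard deviation $\sigma$, matching the stated mapping.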
vex function. The other is $\mathbb{E}[T(X)] = \nabla A(\theta)$ for $X \sim p_\theta$.
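As a minimal numerical sketch of this mean property for the categorical row of Table 1 (the helper names below are ours, not from the paper), one can check that the gradient of the log-partition function coincides with the class probabilities, i.e. with $\mathbb{E}[T(X)]$ for the dummy-encoded $T$:

```python
import numpy as np

def log_partition(theta):
    """A(theta) = ln(1 + sum_i exp(theta_i)) for a categorical family with k-1 natural parameters."""
    return np.log1p(np.exp(theta).sum())

def natural_to_probs(theta):
    """Mapping from natural parameters to the first k-1 class probabilities (Table 1, last column)."""
    return np.exp(theta) / (1.0 + np.exp(theta).sum())

theta = np.array([0.5, -1.0, 2.0])   # k = 4 classes, so k-1 = 3 natural parameters
probs = natural_to_probs(theta)      # equals E[T(X)], since E[d_X]_j = p_j

# Central finite differences of A(theta); should match probs, verifying E[T(X)] = grad A(theta).
eps = 1e-6
grad_A = np.array([
    (log_partition(theta + eps * e) - log_partition(theta - eps * e)) / (2 * eps)
    for e in np.eye(theta.size)
])
print(probs, grad_A)  # the two vectors agree up to ~1e-9
```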
Banerjee et al. (2005) proved under mild conditions that
an exponential family relates to a Bregman divergence and
vice versa via the negative log-likelihood. Grünwald & Dawid (2004) proved a similar link between proper scores
and exponential families.
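Concretely, this link can be sketched as follows (a standard identity restated in our notation, with $A^*$ the convex conjugate of $A$ and $\mu = \nabla A(\theta)$): for $p_\theta(x) = h(x)\exp(\langle\theta, T(x)\rangle - A(\theta))$,
$$
-\log p_\theta(x) \;=\; d_{A^*}\bigl(T(x), \mu\bigr) - A^*\bigl(T(x)\bigr) - \log h(x),
$$
so, viewed as a function of the parameters, the negative log-likelihood equals the Bregman divergence generated by $A^*$ between the sufficient statistic and the mean parameter, up to terms independent of $\theta$.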
As we can see, exponential families, proper scores, and
Bregman divergences have strong relationships to one an-
other. By generalizing some properties to the functional
case, this relationship will allow us to state every variance
term as (functional) Bregman Information. In the case of
the log-likelihood of an exponential family, the functional
Bregman Information reduces to a vector-based Bregman
Information.
2.3 Other Bias-Variance Decompositions
All general decompositions in the current literature target either categorical, real-valued, or parametric predictions, and it is not clear whether a decomposition for proper scores of non-parametric distributions is possible.
James & Hastie (1997) formulate a decomposition for any
loss function of categorical or real-valued predictions but do
not provide a closed-form solution for a given case. Domingos (2000) introduce how a general bias-variance decomposition should look, but state that it is unclear when, or if, this decomposition holds for a given loss function. James (2003) provide a bias-variance decomposition for symmetric loss functions. Heskes (1998) use the bias-variance decomposition of the Kullback-Leibler divergence, which allows one to derive a decomposition for exponential families. Hansen & Heskes (2000) prove that a bias-variance decomposition of a parametric prediction is only possible if the prediction belongs to an exponential family. Importantly, they only
introduce the specific decomposition for a given exponential
family. The decomposition is not formulated for the natural
parameters and relies on the canonical link function. Con-
sequently, a relation to Bregman divergences and Bregman
Information is missing, which we will provide.
A Pythagorean-like theorem for vector-based Bregman di-
vergences is a known fact in literature (Jones & Byrne,
1990; Csiszar, 1991; Della Pietra et al., 2002; Dawid,
2007; Telgarsky & Dasgupta, 2012). An equality in this theorem implies a decomposition of the form $\mathbb{E}[d_\phi(X, y)] = \mathbb{E}[d_\phi(X, x^*)] + d_\phi(x^*, y)$ with $x^* = \arg\min_z \mathbb{E}[d_\phi(X, z)]$ (Pfau, 2013). Brofos et al. (2019),
Brinda et al. (2019), and Yang et al. (2020) relate the
classification log-likelihood to the Kullback-Leibler diver-
gence and provide a bias-variance decomposition, where $X$ takes the form of a predictive probability vector. They set $x^* \propto \exp(\mathbb{E}[\log X])$. Note that predictions in the logit space require normalization to the log space, which will not be the case in our formulation. Gupta et al. (2022) build upon
the Bregman divergence decomposition and use the notion
of primal and dual space of the variance. Even though the
definitions are similar, the authors did not state the relation
between Bregman Information and dual variance, for which
they introduce a general law of total variance.
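To illustrate the decomposition above numerically, the following is a small self-contained sketch (our own illustration; $\phi(x) = \|x\|^2$ is chosen purely for simplicity). It checks that the expected divergence to an arbitrary target splits into the Bregman Information term $\mathbb{E}[d_\phi(X, x^*)]$ around $x^* = \mathbb{E}[X]$ plus the remaining bias term $d_\phi(x^*, y)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Generator of the Bregman divergence; squared Euclidean norm, chosen purely for illustration."""
    return np.sum(x * x, axis=-1)

def bregman(x, z):
    """d_phi(x, z) = phi(x) - phi(z) - <grad phi(z), x - z>; grad phi(z) = 2z for this phi."""
    return phi(x) - phi(z) - np.sum(2 * z * (x - z), axis=-1)

# Samples of a random prediction X and an arbitrary fixed target y.
X = rng.normal(size=(10_000, 3))
y = np.array([1.0, -2.0, 0.5])

x_star = X.mean(axis=0)                           # arg min_z E[d_phi(X, z)] is attained at the mean
lhs = bregman(X, y).mean()                        # E[d_phi(X, y)]
bregman_information = bregman(X, x_star).mean()   # "variance" term: E[d_phi(X, x*)]
rhs = bregman_information + bregman(x_star, y)    # plus "bias" term d_phi(x*, y)

print(lhs, rhs)  # the two values agree up to floating-point error
```

For this particular generator the identity reduces to the classical bias-variance decomposition of the squared error.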
Due to the restriction of Bregman divergences to vector
inputs, it is not clear if a decomposition for proper scores of
non-parametric distributions is possible. In other words, we
require an extension of the current literature to functional
Bregman divergences for a positive result. In the following
section, we provide the required generalization and unify
the variance term in previous literature via the Bregman
Information.
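For reference, the vector-based notion that the next section generalizes is the Bregman Information of Banerjee et al. (2005) (the notation here is ours): for a random vector $X$ and convex generator $\phi$,
$$
\mathbb{B}_\phi(X) \;=\; \min_{z}\, \mathbb{E}\bigl[d_\phi(X, z)\bigr] \;=\; \mathbb{E}\bigl[d_\phi(X, \mathbb{E}[X])\bigr] \;=\; \mathbb{E}[\phi(X)] - \phi(\mathbb{E}[X]),
$$
which recovers the ordinary variance for $\phi(x) = x^2$.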
3 A GENERAL BIAS-VARIANCE DECOMPOSITION
In this section, we offer a general bias-variance decompo-
sition for strictly proper scores. The only assumptions are
that the distribution set P is convex, the associated negative
entropy is lower semicontinuous, and each respective ex-
pectation exists. Further, we will discover that the variance
term is the Bregman Information generated by the convex
conjugate of the associated negative entropy. This discovery
generalizes and unifies decompositions in the current literature for which a concrete form exists (Hansen & Heskes, 2000), and also provides a closed-form expression, in contrast to other general bias-variance decompositions (James & Hastie, 1997).
All technical details and proofs are presented in Appendix
B.