
prior. We disentangle each explanation into two components, $H(X_n)$ and $\eta_n$, which represent two separate sources of uncertainty: 1) a decision boundary-aware uncertainty, which we capture using the kernel similarity, and 2) a function approximation uncertainty from the explainer. After specifying $H(X_n)$ and $\eta_n$, we can combine the two sources by calculating the predictive distribution for $x^*$. We take the variance of this distribution as the GPEC uncertainty estimate:

$$V_d[x^*] = k(x^*, x^*) - k(X, x^*)^\top \left[ K + \sigma_d^2 I_N \right]^{-1} k(X, x^*) \tag{2}$$

where $K \in \mathbb{R}^{N \times N}$ is the kernel matrix s.t. $K_{ij} = k(X_i, X_j)$ $\forall i, j \in \{1 \dots N\}$, $k(X, x^*) \in \mathbb{R}^{N \times 1}$ has elements $k(X, x^*)_i = k(X_i, x^*)$, $i \in \{1 \dots N\}$, $\sigma_d^2 \in \mathbb{R}^N_+$ is the variance parameter for the explanation noise, and $I_N$ is the identity matrix. From Eq. (2) we see that the predictive variance captures DB-aware uncertainty through the kernel function $k(\cdot, \cdot)$, and the function approximation uncertainty through the $\sigma_d^2 I_N$ term.
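As a concrete illustration, Eq. (2) reduces to a few lines of linear algebra once the kernel and reference samples are fixed. The sketch below is ours rather than the authors' implementation; `kernel` stands in for the kernel of Section 4, `X` for the $N$ explained samples, and `sigma2_d` for the explanation-noise variances, all of which are assumed names.

```python
# Minimal sketch of Eq. (2): GP predictive variance at a query point x*.
# `kernel`, `X`, and `sigma2_d` are illustrative placeholders, not names
# taken from the paper's code.
import numpy as np

def gpec_variance(x_star, X, kernel, sigma2_d):
    """V_d[x*] = k(x*,x*) - k(X,x*)^T [K + diag(sigma_d^2)]^{-1} k(X,x*)."""
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])  # K_ij = k(X_i, X_j)
    k_star = np.array([kernel(X[i], x_star) for i in range(N)])               # k(X, x*)_i = k(X_i, x*)
    noise = np.diag(np.broadcast_to(sigma2_d, (N,)))                          # explanation-noise term sigma_d^2 I_N
    alpha = np.linalg.solve(K + noise, k_star)                                # [K + sigma_d^2 I_N]^{-1} k(X, x*)
    return kernel(x_star, x_star) - k_star @ alpha

# Example with an RBF kernel and homoscedastic noise:
# rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
# v = gpec_variance(x_star, X, rbf, sigma2_d=0.1)
```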
Function Approximation Uncertainty. The $\eta_n$ component of Eq. (1) represents the uncertainty stemming from explainer estimation. For example, $\eta_n$ can represent the variance due to sampling (e.g. perturbation-based explainers) or explainer training (e.g. surrogate-based explainers). Explainers that include some estimate of uncertainty (e.g. BayesLIME, BayesSHAP, CXPlain) can be directly used to estimate $\sigma_n^2$. For other stochastic explanation methods, we can estimate $\sigma_n^2$ empirically by resampling $J$ explanations for the same sample $X_n$:

$$\hat{\sigma}_n^2 = \frac{1}{J} \sum_{i=1}^{J} \left( H_i(X_n) - \frac{1}{J} \sum_{j=1}^{J} H_j(X_n) \right)^2 \tag{3}$$

where each $H_i(X_n)$ is a sampled explanation. Alternatively, for deterministic explanation methods we can omit the $\eta_n$ term and assume noiseless explanations.
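For stochastic explainers without a built-in uncertainty estimate, the resampling estimator in Eq. (3) is straightforward to implement. The following is a minimal sketch under our own naming (`explainer`, `J` are illustrative); it computes one variance per attribution dimension, which is one natural reading of Eq. (3).

```python
# Sketch of Eq. (3): empirical explanation variance via resampling.
# `explainer` is any stochastic attribution method H(.) that returns a
# vector of feature attributions for an input; the name and default J
# are illustrative.
import numpy as np

def empirical_explanation_variance(explainer, x_n, J=20):
    samples = np.stack([explainer(x_n) for _ in range(J)])  # H_1(X_n), ..., H_J(X_n)
    mean = samples.mean(axis=0)                             # (1/J) * sum_j H_j(X_n)
    return ((samples - mean) ** 2).mean(axis=0)             # hat{sigma}_n^2 per feature
```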
Decision Boundary-Aware Uncertainty. In contrast, the $H(X_n)$ component of Eq. (1) represents the distribution of functions that could have generated the observed explanations. The choice of kernel $k(\cdot, \cdot)$ encodes our a priori assumption regarding the similarity between explanations based on the similarity of their corresponding inputs. In other words, given two samples $x, x' \in \mathcal{X}$, how much information do we expect a given explanation $H(x)$ to provide about a nearby explanation $H(x')$? As the DB between $x$ and $x'$ becomes more complex, we would expect this information to decrease. In Section 4, we consider a novel kernel formulation that reflects the complexity of the DB in a local neighborhood of the samples.
4 WEG KERNEL
Intuitively, the GP kernel encodes the assumption that each explanation provides some information about other nearby explanations, which is defined through kernel similarity. To capture boundary-aware uncertainty, we want to define a similarity $k(x, x')$ that decreases with the complexity of the DB between $x, x' \in \mathcal{X}$ (equivalently, increases with its smoothness).
4.1 Geometry of the Decision Boundary
We represent the DB as a hypersurface embedded in $\mathbb{R}^D$ with co-dimension one. Given the classifier $F$, we define the DB$^2$ as $\mathcal{M}_F = \{ m \in \mathbb{R}^D : F(m) = \frac{1}{2} \}$. For any two points $m, m' \in \mathcal{M}_F$, let $\gamma : [0,1] \to \mathcal{M}_F$ be a differentiable map such that $\gamma(0) = m$ and $\gamma(1) = m'$, representing a 1-dimensional curve on $\mathcal{M}_F$. We can then define distances along the DB as geodesic distances in $\mathcal{M}_F$ (Fig. 3A):

$$d_{geo}(m, m') = \min_{\gamma} \int_0^1 \| \dot{\gamma}(t) \| \, dt \quad \forall m, m' \in \mathcal{M}_F \tag{4}$$
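In practice, $d_{geo}$ is not available in closed form for a black-box DB. One way to approximate it, sketched here only as an illustration and not necessarily the paper's exact procedure, is to sample points on $\mathcal{M}_F$, connect them in a k-nearest-neighbor graph with Euclidean edge weights, and take graph shortest-path distances as a geodesic proxy; the neighborhood size is an assumed tuning parameter.

```python
# Illustrative approximation of d_geo (Eq. 4) from points sampled on the DB:
# build a k-NN graph over the samples and use shortest-path (Dijkstra)
# distances as a proxy for geodesic distance. This is a standard
# manifold-learning heuristic, not necessarily the paper's procedure.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def approx_geodesic_distances(boundary_points, n_neighbors=10):
    # Edge weights are Euclidean distances between neighboring boundary samples.
    graph = kneighbors_graph(boundary_points, n_neighbors, mode="distance")
    # Dijkstra over the undirected graph approximates on-manifold path length.
    return shortest_path(graph, method="D", directed=False)
```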
The relative complexity of the DB can be characterized by the geodesic distances between points on the DB. For example, the simplest form that the DB can take is a linear boundary. Consider a black-box model with linear DB $\mathcal{M}_1$. For two points $z, z' \in \mathcal{M}_1$, $d_{geo}(z, z') = \|z - z'\|_2$, which corresponds to the minimum geodesic distance in the ambient space. For any nonlinear DB $\mathcal{M}_2$ that also contains $z, z'$, it follows that $d_{geo}(z, z') > \|z - z'\|_2$. As the complexity of the DB increases, there is a general corresponding increase in geodesic distances between fixed points on the DB. We can adapt geodesic distance in our kernel selection through the exponential geodesic (EG) kernel (Feragen et al., 2015):

$$k_{EG}(m, m') = \exp \left[ -\lambda \, d_{geo}(m, m') \right] \tag{5}$$
The EG kernel has been previously investigated in the context of Riemannian manifolds (Feragen et al., 2015; Feragen and Hauberg, 2016). In particular, while prior work shows that the EG kernel fails to be positive definite for all values of $\lambda$ in non-Euclidean space, there exist large intervals of $\lambda > 0$ for which the EG kernel is positive definite. Appropriate values can be selected through grid search and cross-validation; we assume that a valid value of $\lambda$ has been selected.
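A minimal sketch of this selection step, assuming a matrix of (approximate) geodesic distances is already available: compute the EG kernel of Eq. (5) for each candidate $\lambda$ and keep the values whose kernel matrix has no numerically negative eigenvalues. The candidate grid and tolerance below are illustrative choices, not prescribed by the paper.

```python
# Sketch of Eq. (5) plus a positive-definiteness check over candidate lambdas.
import numpy as np

def eg_kernel_matrix(geo_dist, lam):
    return np.exp(-lam * geo_dist)                  # k_EG(m, m') = exp(-lambda * d_geo(m, m'))

def positive_definite_lambdas(geo_dist, candidates, tol=1e-8):
    valid = []
    for lam in candidates:
        eigvals = np.linalg.eigvalsh(eg_kernel_matrix(geo_dist, lam))
        if eigvals.min() > -tol:                    # all eigenvalues (numerically) nonnegative
            valid.append(lam)
    return valid

# Example: positive_definite_lambdas(D_geo, np.logspace(-2, 2, 20))
```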
Therefore, by sampling $\mathcal{M}_F$, we can use the EG kernel matrix to capture DB complexity. However, a challenge remains in relating points $x, x' \in \mathcal{X} \setminus \mathcal{M}_F$ to the nearby DB. In Section 4.2 we consider a continuous weighting over $\mathcal{M}_F$ based on the distance to $x, x'$.
$^2$Without loss of generality, we assume that the classifier decision threshold is $\frac{1}{2}$.