Uncertainty Estimation for Multi-view Data:
The Power of Seeing the Whole Picture
Myong Chol Jung
Monash University
david.jung@monash.edu
He Zhao
CSIRO’s Data61
he.zhao@ieee.org
Joanna Dipnall
Monash University
jo.dipnall@monash.edu
Belinda Gabbe
Monash University
belinda.gabbe@monash.edu
Lan Du
Monash University
lan.du@monash.edu
Abstract
Uncertainty estimation is essential to make neural networks trustworthy in real-
world applications. Extensive research efforts have been made to quantify and
reduce predictive uncertainty. However, most existing works are designed for
unimodal data, whereas multi-view uncertainty estimation has not been sufficiently
investigated. Therefore, we propose a new multi-view classification framework
for better uncertainty estimation and out-of-domain sample detection, where we
associate each view with an uncertainty-aware classifier and combine the predic-
tions of all the views in a principled way. The experimental results with real-world
datasets demonstrate that our proposed approach is an accurate, reliable, and well-
calibrated classifier, which predominantly outperforms the multi-view baselines
tested in terms of expected calibration error, robustness to noise, and accuracy
for the in-domain sample classification and the out-of-domain sample detection
tasks.
1 Introduction
Reliable uncertainty estimation is critical for deploying deep learning models in a number of domains such as medical imaging diagnosis [39] or autonomous driving [9]. Even with accurate predictions, domain experts still raise questions of how trustworthy the models are [43]. For example, when a model's prediction contradicts a domain expert's opinion, quantifying the uncertainty of the model's predictions can help determine the model's reliability and justify its use.
Recently, uncertainty estimation for neural networks has been an active research area, and many methods for quantifying predictive uncertainty have been proposed [48, 10, 44, 54, 35, 27]. The majority of existing work focuses on uncertainty estimation for unimodal data. However, in many practical problems, data arrive as multiple views or modalities. For example, LiDAR, radar, and RGB cameras can simultaneously capture complementary information about a scene [53], and computed tomography (CT) scans and x-ray images can be analyzed together to diagnose a disease [3]. Trustworthy uncertainty estimation with multi-view or multi-modal data is important because the challenges it faces may differ from the unimodal setting (e.g., maintaining accurate predictions when the domain of one of the input views has shifted). Despite the success of existing work on unimodal data, modelling and estimating uncertainty for multi-view data remains a less explored question [13].
Corresponding author.
We provide our code at https://github.com/davidmcjung/multiview_uncertainty_estimation.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.02676v1 [cs.LG] 6 Oct 2022
[Figure 1 image omitted; panels: (a) Input view 1, (b) Input view 2, (c) SNGP, (d) MGP (Ours), (e) Noisy view (NV), (f) SNGP with NV, (g) MGP with NV.]
Figure 1: Visualization on a synthetic multi-view moon dataset. Top row: the dataset has two views and two classes (e.g., the blue upper circles in (a) and (b) are two views of Class 1) and an OOD class (grey); (c) and (d) are the predictive probability surfaces of SNGP and our MGP. Bottom row: a new noisy view (e) is added to the data; (f) and (g) are the predictive probability surfaces of SNGP and our MGP with the noisy view. The darker the region (i.e., dark blue), the lower the probability of being Class 1. Since SNGP is a unimodal model, the input views are fused into a unimodal dataset. The difference between (c) and (f) shows that SNGP cannot correctly capture the input shape in the presence of noise. MGP, however, is robust to noise, as shown by the minimal difference between (d) and (g).
One way of solving this problem is to fuse the multiple modalities into one modality and directly apply existing unimodal methods. However, even a state-of-the-art unimodal model (e.g., SNGP [32]) can be prone to noise if one of the views in a multi-view dataset is noisy, as shown in Figure 1. Without the noisy view, unimodal models can produce accurate and confident predictions near the training domain. However, with the noisy view, the predictions become uncertain even for samples close to the training domain (see Figure 3). We also show in our experiments that existing multi-view classifiers (e.g., TMC [13]) have limited capacity to detect out-of-domain (OOD) samples (see Table 4).
To this end, we propose the Multi-view Gaussian Process (MGP), a tailored framework providing intrinsic uncertainty estimation for the classification of multi-view/modal data. Specifically, MGP consists of a dedicated Gaussian process (GP) expert for each view, whose predictions are aggregated by a product of experts (PoE). In our proposed method, there is a natural way of capturing uncertainty by measuring the distance between the training set and test samples in the reproducing kernel Hilbert space (RKHS). The contributions of our method can be summarized as follows:
1. We propose a new uncertainty estimation framework with GPs for multi-view data, which is an under-explored yet increasingly important problem in safety-critical applications.
2. The framework provides better uncertainty estimation through a product-of-experts model, providing more robustness in dealing with noise and better capacity for detecting OOD data.
3. We develop an effective variational inference algorithm to approximate multi-view posterior distributions in a principled way.
4. We conduct comprehensive and extensive experiments on both synthetic and real-world data, which show that our method achieves state-of-the-art performance for uncertainty estimation on multi-view/modal data.
2 Multi-view Gaussian Process
Given training data $\mathcal{X}=\{X_1, X_2, \ldots, X_V\}$, where $V$ is the number of views, each view consists of a training set of $N$ samples $X_v=\{\mathbf{x}_{v,i}\}_{i=1}^{N}$ with labels $\mathbf{y}=\{y_i\}_{i=1}^{N}$. In other words, the $i$th data sample consists of $V$ views $\{\mathbf{x}_{v,i}\}_{v=1}^{V}$ (e.g., the CUB dataset consists of images as the first view and captions as the second view), and $y_i$ is the data sample's ground-truth label shared across the views. Without loss of generality, a multi-view classification or regression problem can be formulated as predicting $\mathbf{y}_*$ given testing samples $\{X_{*,v}\}_{v=1}^{V}$. In this paper, we propose the Multi-view Gaussian Process (MGP), a novel framework for multi-view/modal data, where in a nutshell we first apply a GP to each view of the data and then combine, in a principled way, the predictions from all the GPs into a unified prediction by using a product of experts (PoE) [31].
2.1 GP for an Individual View
For each view, we consider a multiclass classification problem with $C$ classes. We place $C$ independent Gaussian priors over the latent function $f_v(\cdot)$ with zero mean and $N \times N$ covariance matrix $K_{NN}$, whose elements are $K_{ij}=k(\mathbf{x}_{v,i},\mathbf{x}_{v,j})$, where $k(\cdot,\cdot)$ is a kernel function. The radial basis function (RBF), commonly used in the GP literature [55, 37], is selected in this paper as the covariance function. It is defined as $k(\mathbf{x},\mathbf{x}') = \sigma_v^2 \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{2 l_v^2}\right)$, where $\sigma_v^2$ is the signal variance and $l_v$ is the length-scale of each GP, both of which are parameters to be optimized.
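As an illustration only, the per-view RBF covariance matrices can be computed as in the following minimal NumPy sketch; the data, the function name rbf_kernel, and the hyperparameter values are made up for this example:

import numpy as np

def rbf_kernel(X1, X2, signal_var, length_scale):
    """RBF covariance k(x, x') = sigma_v^2 * exp(-||x - x'||^2 / (2 * l_v^2))."""
    # Pairwise squared Euclidean distances between rows of X1 and X2.
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_var * np.exp(-sq_dists / (2.0 * length_scale**2))

# One covariance matrix per view; each view has its own (sigma_v^2, l_v).
rng = np.random.default_rng(0)
views = [rng.normal(size=(100, 2)), rng.normal(size=(100, 5))]  # toy two-view data
K_per_view = [rbf_kernel(Xv, Xv, signal_var=1.0, length_scale=1.0) for Xv in views]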
To bypass the limitations of standard GPs [31], namely the high computational cost of $\mathcal{O}(N^3)$ and the inconvenience of applying stochastic gradient descent (SGD), we propose to leverage the sparse variational GP (SVGP) [31, 16, 17, 47], which is detailed as follows. With SVGP, we introduce $M$ ($M < N$) inducing points $Z_v$ representing the training samples of view $v$ with a smaller number of points, and the inducing variable $\mathbf{u}_v$ is the latent function evaluated at the inducing points (i.e., $\mathbf{u}_v = f_v(Z_v)$), where both $Z_v$ and $\mathbf{u}_v$ are random variables to be optimized. Similar to the Gaussian prior placed on the latent function, a joint prior can be set as:
$$\begin{bmatrix} \mathbf{f}_v \\ \mathbf{u}_v \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K_{NN} & K_{NM} \\ K_{NM}^{T} & K_{MM} \end{bmatrix}\right) \tag{1}$$
where we use $\mathbf{f}_v$ to denote $f_v(X_v)$ for notational convenience. The use of inducing points reduces the computational cost to $\mathcal{O}(M^3)$ [31]. We outline the likelihood of the GPs in Section 2.2 and the posterior in Section 2.3.
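The following sketch (NumPy/SciPy, with synthetic data and illustrative hyperparameters) assembles the covariance blocks of the joint prior in Equation (1) for one view; initializing $Z_v$ from a random data subset is one common choice, not one prescribed by this paper:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
N, M = 100, 10
Xv = rng.normal(size=(N, 2))                    # one view's training inputs
Zv = Xv[rng.choice(N, size=M, replace=False)]   # initialize Z_v from a data subset

def rbf(A, B, sig2=1.0, ls=1.0):
    # RBF kernel matrix between the rows of A and B.
    return sig2 * np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * ls**2))

K_NN, K_NM, K_MM = rbf(Xv, Xv), rbf(Xv, Zv), rbf(Zv, Zv)

# Joint prior covariance of [f_v; u_v] from Equation (1).
joint_cov = np.block([[K_NN, K_NM],
                      [K_NM.T, K_MM]])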
2.2 GPs for Multi-view Data with PoE
Product of Experts (PoE). With one GP expert per view, we propose to combine the GP experts into a unified prediction by using the PoE mechanism [19, 31, 7, 6]. Specifically, we aggregate the posterior distributions of the individual views by:
$$p(\mathbf{f}|\mathcal{X},\mathbf{y}) \propto \prod_v p(\mathbf{f}_v|\mathbf{y}) \tag{2}$$
For Gaussian posteriors with means $\boldsymbol{\mu}_v$ and covariances $\Sigma_v$, the aggregation using Equation (2) yields a unified posterior whose mean and covariance are:
$$\boldsymbol{\mu} = \left(\sum_v \boldsymbol{\mu}_v \Sigma_v^{-1}\right)\Sigma, \qquad \Sigma = \left(\sum_v \Sigma_v^{-1}\right)^{-1} \tag{3}$$
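For concreteness, Equation (3) can be implemented directly; the following NumPy sketch fuses two hypothetical Gaussian experts and illustrates that the fused posterior is pulled toward the more confident expert:

import numpy as np

def poe_fuse(means, covs):
    """Fuse per-view Gaussian posteriors via the PoE rule in Equation (3)."""
    precisions = [np.linalg.inv(S) for S in covs]
    fused_cov = np.linalg.inv(sum(precisions))            # Sigma = (sum_v Sigma_v^-1)^-1
    fused_mean = fused_cov @ sum(P @ m for P, m in zip(precisions, means))
    return fused_mean, fused_cov

# Two hypothetical experts with different confidence levels:
m1, S1 = np.array([1.0, 0.0]), np.eye(2) * 0.5   # confident expert
m2, S2 = np.array([0.0, 1.0]), np.eye(2) * 2.0   # uncertain expert
mu, Sigma = poe_fuse([m1, m2], [S1, S2])
# The fused mean lies closer to the confident expert's mean, and the
# fused covariance is tighter than either individual covariance.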
Dirichlet-based Likelihood. In order to apply Equation (3) to a multi-view problem, the latent function $\mathbf{f}_v$ in each view should refer to the same observable variable (i.e., $\mathcal{N}(a|b,c)$ cannot be combined with $\mathcal{N}(c|d,e)$ for $a \neq c$). However, in GP classification, the latent function is a non-observable nuisance function that is squashed through a sigmoid or softmax function to estimate labels [55], and it is not necessarily the same for every independent view. We alleviate this problem by reparameterizing the class labels as regression labels:
$$\tilde{y}_i = f_v(\mathbf{x}_{v,i}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \tilde{\sigma}_i^2)$$
where $\tilde{y}_i$ is the transformed label and $\tilde{\sigma}_i^2$ is the noise parameter fixed for all views. Since $\tilde{y}_i$ and $\tilde{\sigma}_i^2$ are shared across the views, we ensure that $f_v(\mathbf{x}_{v,i})$ refers to the same variable. By using the log-normal distribution, the Gaussian likelihood can be used in the log space as $p(\tilde{y}_i|\mathbf{f}_v) = \mathcal{N}(\mathbf{f}_v, \tilde{\sigma}_i^2)$.
To transform the class labels into regression labels, we propose to represent the class probability $\boldsymbol{\pi}_i = [\pi_{i,1}, \pi_{i,2}, \ldots, \pi_{i,C}]$ with a Dirichlet distribution under the categorical likelihood [37]:
$$p(y_i|\boldsymbol{\alpha}_i) = \mathrm{Cat}(\boldsymbol{\pi}_i), \ \text{where } \boldsymbol{\pi}_i \sim \mathrm{Dir}(\boldsymbol{\alpha}_i); \qquad \pi_{i,c} = \frac{g_{i,c}}{\sum_{j=1}^{C} g_{i,j}}, \ \text{where } g_{i,c} \sim \mathrm{Gamma}(\alpha_{i,c}, 1) \tag{4}$$
where $\boldsymbol{\alpha}_i = [\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,C}]$ are the concentration parameters, the shape parameter of the Gamma distribution is $\alpha_{i,c}$, and the scale parameter of the Gamma distribution is $\theta = 1$. We approximate the Gamma distribution in (4) with $\tilde{g}_{i,c} \sim \mathrm{Lognormal}(\tilde{y}_{i,c}, \tilde{\sigma}_{i,c}^2)$ by moment matching:
$$\alpha_{i,c} = \exp(\tilde{y}_{i,c} + \tilde{\sigma}_{i,c}^2/2), \qquad \alpha_{i,c} = \left(\exp(\tilde{\sigma}_{i,c}^2) - 1\right)\exp(2\tilde{y}_{i,c} + \tilde{\sigma}_{i,c}^2)$$
Thus, the transformed labels and the noise parameter can be expressed in terms of the concentration parameters:
$$\tilde{\sigma}_{i,c}^2 = \log(1/\alpha_{i,c} + 1), \qquad \tilde{y}_{i,c} = \log \alpha_{i,c} - \tilde{\sigma}_{i,c}^2/2 \tag{5}$$
where $\alpha_{i,c} = 1 + \alpha_\epsilon$ if $y_{i,c} = 1$ and $\alpha_{i,c} = \alpha_\epsilon$ if $y_{i,c} = 0$, with the one-hot label $y_{i,c}$. Here $\alpha_\epsilon$ is a parameter that prevents the noise parameter from diverging to infinity. See the Appendix for the impact of $\alpha_\epsilon$ on model performance. Compared with other transformation methods such as Platt scaling [42], the Dirichlet likelihood compromises classification accuracy less and requires no post-hoc calibration after training.
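The label transformation of Equation (5) reduces to a few lines; below is a minimal NumPy sketch in which alpha_eps (our name for $\alpha_\epsilon$) is set to an arbitrary illustrative value:

import numpy as np

def dirichlet_label_transform(y_onehot, alpha_eps=0.01):
    """Map one-hot labels to regression targets y_tilde and noise
    variances sigma2_tilde via Equation (5)."""
    alpha = y_onehot + alpha_eps              # alpha_ic = 1 + alpha_eps or alpha_eps
    sigma2_tilde = np.log(1.0 / alpha + 1.0)  # sigma2_ic = log(1/alpha_ic + 1)
    y_tilde = np.log(alpha) - sigma2_tilde / 2.0
    return y_tilde, sigma2_tilde

y = np.eye(3)[[0, 2, 1]]                      # three samples, three classes (one-hot)
y_tilde, sigma2 = dirichlet_label_transform(y)
# Positive-class entries receive a larger target mean and a smaller noise
# variance than negative-class entries, so the likelihood stays Gaussian per class.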
2.3 Training of the Proposed Framework
Given the priors from Section 2.1 and the Gaussian likelihood from Section 2.2, the goal of training our framework is to estimate a posterior distribution via variational inference (VI) [16, 17, 4]. By using Equation (2), we propose an aggregated variational distribution over all the views:
$$q_{PoE}(\mathbf{f}) \propto \prod_v q(\mathbf{f}_v) \tag{6}$$
where $q(\mathbf{f}_v)$ is the variational distribution for each view that approximates the true posterior. We define $q(\mathbf{f}_v)$ as:
$$q(\mathbf{f}_v) := \int p(\mathbf{f}_v|\mathbf{u}_v)\, q(\mathbf{u}_v)\, d\mathbf{u}_v \tag{7}$$
where $p(\mathbf{f}_v|\mathbf{u}_v)$ is the conditional prior from Equation (1), and $q(\mathbf{u}_v)$ is the marginal variational distribution $\mathcal{N}(\mathbf{m}_v, S_v)$ with optimizable model parameters $\mathbf{m}_v$ and $S_v$. The analytical solution of (7) is provided in the Appendix. VI seeks to minimize the following Kullback–Leibler (KL) divergence between the variational and true posterior distributions:
$$\mathrm{KL}\left[q_{PoE}(\mathbf{f})\,\|\,p(\mathbf{f}|\mathcal{X},\tilde{\mathbf{y}}_c)\right] \tag{8}$$
where $\tilde{\mathbf{y}}_c = \{\tilde{y}_{i,c}\}_{i=1}^{N}$.
Lemma 1 (Additive Property of KL Divergence). If $\mathbf{x} = [x_1, \cdots, x_n] \in \mathcal{X}$, $p(\mathbf{x}) = \prod_i^n p(x_i)$, and $q(\mathbf{x}) = \prod_i^n q(x_i)$, we have:
$$\mathrm{KL}\left[p(\mathbf{x})\,\|\,q(\mathbf{x})\right] = \sum_i^n \mathrm{KL}\left[p(x_i)\,\|\,q(x_i)\right] \tag{9}$$
Theorem 2 (KL Divergence with PoE). With Equations (2) and (6), we have:
$$\mathrm{KL}\left[q_{PoE}(\mathbf{f})\,\|\,p(\mathbf{f}|\mathcal{X},\tilde{\mathbf{y}}_c)\right] = \sum_v \mathrm{KL}\left[q(\mathbf{f}_v)\,\|\,p(\mathbf{f}_v|\tilde{\mathbf{y}}_c)\right] \tag{10}$$
According to Theorem 2, the VI for the PoE splits into the VI of each expert/view. For the $v$th view, the VI minimizes $\mathrm{KL}[q(\mathbf{f}_v)\,\|\,p(\mathbf{f}_v|\tilde{\mathbf{y}}_c)]$, which can be turned into the maximization of the evidence lower bound (ELBO):
$$\mathrm{ELBO}_v = \sum_{i=1}^{N} \mathbb{E}_{q(f_{v,i})}\left[\log p(\tilde{y}_{i,c}|f_{v,i})\right] - \beta \cdot \mathrm{KL}\left[q(\mathbf{u}_v)\,\|\,p(\mathbf{u}_v)\right] \tag{11}$$
where $\beta$ is a parameter that controls the KL term, similar to [18], and can be interpreted as a regularization term. Proofs of Equations (9)-(11) are provided in the Appendix.
In order to apply SGD, we match the expectation of the stochastic gradient of the expected log-likelihood term to the full gradient by multiplying the log-likelihood term in Equation (11) by the number of batches [17]. The overall loss for all experts is:
$$\mathcal{L} = \sum_{v=1}^{V} \mathrm{ELBO}_v \tag{12}$$
The training steps are summarized in Algorithm 1.
Algorithm 1: Learning MGP
Input: $V$ views of training data $\mathcal{X}=\{X_v\}_{v=1}^{V}$, where each view has $N$ samples $X_v=\{\mathbf{x}_{v,i}\}_{i=1}^{N}$, and labels $\mathbf{y}=\{y_i\}_{i=1}^{N}$.
Transform: Reparameterize $\tilde{\mathbf{y}}_c$ by (5)
1: for each minibatch do
2:   for $v = 1$ to $V$ do
3:     Compute $q(\mathbf{f}_v)$ by (7)
4:     Calculate $\mathrm{ELBO}_v$ by (11)
5:   end for
6:   Sum the ELBOs by (12)
7:   SGD update of $\{l_v, \sigma_v^2, Z_v, \mathbf{m}_v, S_v\}_{v=1}^{V}$
8: end for

Algorithm 2: Inference of MGP
Input: $V$ views of testing data $\mathcal{X}_* = \{X_{*,v}\}_{v=1}^{V}$
1: for $v = 1$ to $V$ do
2:   Compute $q(\mathbf{f}_{*,v})$ by (13)
3:   Calculate $\gamma(X_{*,v})$ by (17)
4: end for
5: Aggregate $q_{PoE}(\mathbf{f}_*)$ by (16)
Output: $\mathbb{E}[\pi_{i,c}]$ and $\mathbb{V}[\pi_{i,c}]$ of the class probability by (14)
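To make Algorithm 1 concrete, the following is a minimal single-output sketch using the GPyTorch library (our choice for illustration; the released code may be organized differently). It trains one SVGP expert per view on synthetic data, using the transformed targets of Equation (5) for a single class, and sums the per-view negative ELBOs of Equations (11)-(12). A standard GaussianLikelihood stands in for the fixed per-sample noise $\tilde{\sigma}_{i,c}^2$:

import torch
import gpytorch

class ViewGP(gpytorch.models.ApproximateGP):
    """A sparse variational GP expert for one view (Section 2.1)."""
    def __init__(self, inducing_points):
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))                       # q(u_v) = N(m_v, S_v)
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist,
            learn_inducing_locations=True)                 # Z_v is optimized
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ZeroMean()       # zero-mean prior
        self.covar_module = gpytorch.kernels.ScaleKernel(  # sigma_v^2 * RBF(l_v)
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

N, M = 100, 10
views = [torch.randn(N, 2), torch.randn(N, 5)]   # toy two-view data
y_tilde = torch.randn(N)   # placeholder transformed targets for one class, Eq. (5)

models = [ViewGP(Xv[:M].clone()) for Xv in views]
liks = [gpytorch.likelihoods.GaussianLikelihood() for _ in views]
mlls = [gpytorch.mlls.VariationalELBO(lik, m, num_data=N, beta=1.0)  # beta of Eq. (11)
        for lik, m in zip(liks, models)]
params = [p for m in models for p in m.parameters()]
params += [p for lik in liks for p in lik.parameters()]
opt = torch.optim.Adam(params, lr=0.01)

for step in range(200):                      # minibatch loop of Algorithm 1 (full batch here)
    opt.zero_grad()
    loss = sum(-mll(model(Xv), y_tilde)      # negative ELBO_v, summed as in Eq. (12)
               for mll, model, Xv in zip(mlls, models, views))
    loss.backward()
    opt.step()   # updates {l_v, sigma_v^2, Z_v, m_v, S_v} for every view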
2.4 Inference on Test Samples
Given test samples $\mathcal{X}_* = \{X_{*,v}\}_{v=1}^{V}$, the predictive distribution $p(\mathbf{f}_{*,v}|\tilde{\mathbf{y}}_c)$ is estimated by the variational distribution as:
$$p(\mathbf{f}_{*,v}|\tilde{\mathbf{y}}_c) \approx q(\mathbf{f}_{*,v}) = \int p(\mathbf{f}_{*,v}|\mathbf{u}_v)\, q(\mathbf{u}_v)\, d\mathbf{u}_v \tag{13}$$
where $p(\mathbf{f}_{*,v}|\mathbf{u}_v)$ can be formed from the joint prior distribution, similar to Equation (1) (see the Appendix for a full derivation). Similar to Equation (6), we aggregate the predictive distributions to form $q_{PoE}(\mathbf{f}_*)$, which is sampled to approximate Gamma-distributed samples that in the end form the posterior of the Dirichlet distribution as follows:
$$\mathbb{E}[\pi_{i,c}] = \int \frac{\exp(f_{i,c,*})}{\sum_j \exp(f_{i,j,*})}\, q_{PoE}(\mathbf{f}_{i,*})\, d\mathbf{f}_*, \qquad \mathbb{V}[\pi_{i,c}] = \int \left(\frac{\exp(f_{i,c,*})}{\sum_j \exp(f_{i,j,*})} - \mathbb{E}[\pi_{i,c}]\right)^2 q_{PoE}(\mathbf{f}_{i,*})\, d\mathbf{f}_* \tag{14}$$
Equation (14) can be approximated with the Monte Carlo method. See the Appendix for the impact of the number of Monte Carlo samples on classification performance and inference time.
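A minimal NumPy sketch of this Monte Carlo estimator for a single test point is given below; the fused mean and covariance are placeholders standing in for the output of Equation (16):

import numpy as np

def softmax(f):
    e = np.exp(f - f.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Placeholder fused posterior over C = 3 latent function values at one test input.
mu_star = np.array([2.0, 0.0, -1.0])
Sigma_star = np.eye(3) * 0.3

samples = rng.multivariate_normal(mu_star, Sigma_star, size=1000)  # draws from q_PoE(f_*)
probs = softmax(samples)              # squash each latent draw through the softmax
mean_pi = probs.mean(axis=0)          # Monte Carlo estimate of E[pi_{i,c}], Eq. (14)
var_pi = probs.var(axis=0)            # Monte Carlo estimate of V[pi_{i,c}], Eq. (14)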
The aggregated predictive distribution can also be weighted by each expert's predictive distribution:
$$q_{PoE}(\mathbf{f}_*) \propto \prod_v \left(q(\mathbf{f}_{*,v})\right)^{\gamma(X_{*,v})} \tag{15}$$
where $\gamma(X_{*,v})$ is a weight controlling the influence of each expert on the aggregated prediction. The mean and covariance of $q_{PoE}(\mathbf{f}_*)$ with $\gamma(X_{*,v})$ are:
$$\boldsymbol{\mu}_W = \left(\sum_v \boldsymbol{\mu}_v\, \gamma(X_{*,v})\, \Sigma_v^{-1}\right)\Sigma_W, \qquad \Sigma_W = \left(\sum_v \gamma(X_{*,v})\, \Sigma_v^{-1}\right)^{-1} \tag{16}$$
In our experiments, we use the negative entropy of the predictive distribution:
$$\gamma(X_{*,v}) = -H\left(q(\mathbf{f}_{*,v})\right) \tag{17}$$
Note that the original PoE [19] in Equation (6) is recovered if $\gamma(X_{*,v}) = 1$. The intuition behind choosing negative entropy is that experts with lower posterior entropy, i.e., lower uncertainty, contribute more to the aggregated predictions. Please note that other choices of $\gamma(X_{*,v})$ can also be applied, such as the difference in entropy from prior to posterior [6] or the negative predictive variance [7]. We obtain better empirical results with negative entropy, but the choice of function is flexible. The inference steps are summarized in Algorithm 2.
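A hedged NumPy sketch of the entropy-weighted aggregation of Equations (15)-(17) for a single test point follows; the per-view means and covariances are placeholders, chosen to be low-entropy (confident) so that the weights $-H$ come out positive in this example:

import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian N(mu, cov)."""
    d = cov.shape[0]
    return 0.5 * (d * (1.0 + np.log(2.0 * np.pi)) + np.linalg.slogdet(cov)[1])

def weighted_poe(means, covs):
    """Entropy-weighted PoE fusion, Equations (15)-(17)."""
    gammas = [-gaussian_entropy(S) for S in covs]              # Eq. (17)
    weighted_prec = [g * np.linalg.inv(S) for g, S in zip(gammas, covs)]
    Sigma_W = np.linalg.inv(sum(weighted_prec))                # Eq. (16), covariance
    mu_W = Sigma_W @ sum(P @ m for P, m in zip(weighted_prec, means))
    return mu_W, Sigma_W

# Two confident experts over C = 2 latent function values at one test input:
means = [np.array([1.0, -1.0]), np.array([0.8, -0.5])]
covs = [np.eye(2) * 0.01, np.eye(2) * 0.05]   # low entropy, so -H > 0 here
mu_W, Sigma_W = weighted_poe(means, covs)
# The expert with the lower predictive entropy (tighter covariance) receives
# the larger weight and therefore dominates the fused prediction.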