Uncertainty Estimation for Multi-view Data:
The Power of Seeing the Whole Picture
Myong Chol Jung
Monash University
david.jung@monash.edu
He Zhao
CSIRO’s Data61
he.zhao@ieee.org
Joanna Dipnall
Monash University
jo.dipnall@monash.edu
Belinda Gabbe
Monash University
belinda.gabbe@monash.edu
Lan Du
Monash University
lan.du@monash.edu
Abstract
Uncertainty estimation is essential to make neural networks trustworthy in real-
world applications. Extensive research efforts have been made to quantify and
reduce predictive uncertainty. However, most existing works are designed for
unimodal data, whereas multi-view uncertainty estimation has not been sufficiently
investigated. Therefore, we propose a new multi-view classification framework
for better uncertainty estimation and out-of-domain sample detection, where we
associate each view with an uncertainty-aware classifier and combine the predic-
tions of all the views in a principled way. The experimental results with real-world
datasets demonstrate that our proposed approach is an accurate, reliable, and well-
calibrated classifier, which predominantly outperforms the multi-view baselines
tested in terms of expected calibration error, robustness to noise, and accuracy
for the in-domain sample classification and the out-of-domain sample detection
tasks.
1 Introduction
Reliable uncertainty estimation is critical for deploying deep learning models in a number of domains such as medical imaging diagnosis [39] or autonomous driving [9]. Even with accurate predictions, domain experts still raise questions of how trustworthy the models are [43]. For example, when a model's prediction contradicts a domain expert's opinion, quantifying the uncertainty of the model's predictions can help determine the model's reliability and justify its use.
Recently, uncertainty estimation for neural networks has been an active research area, and many methods for quantifying predictive uncertainty have been proposed [48, 10, 44, 54, 35, 27]. The majority of existing work focuses on uncertainty estimation for unimodal data. However, in many practical problems, data arrive as multiple views or modalities. For example, LiDAR, radar, and RGB cameras can simultaneously capture complementary information about a scene [53], and computed tomography (CT) scans and x-ray images can be analyzed together to diagnose a disease [3]. Trustworthy uncertainty estimation with multi-view or multi-modal data is important because the challenges it faces may differ from the unimodal setting (e.g., maintaining accurate predictions when the domain of one of the input views has shifted). Despite the success of existing work on unimodal data, modelling and estimating uncertainty for multi-view data remains a less explored question [13].
Corresponding author.
We provide our code at https://github.com/davidmcjung/multiview_uncertainty_estimation.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.02676v1 [cs.LG] 6 Oct 2022
[Figure 1 image omitted; panels: (a) Input view 1, (b) Input view 2, (c) SNGP, (d) MGP (Ours), (e) Noisy view (NV), (f) SNGP with NV, (g) MGP with NV.]
Figure 1: Visualization on a synthetic multi-view moon dataset. Top row: the dataset has two views and two classes (e.g., the blue upper circles in (a) and (b) are two views of Class 1) and an OOD class (grey); (c) and (d) are the predictive probability surfaces of SNGP and our MGP. Bottom row: a new noisy view (e) is added to the data; (f) and (g) are the predictive probability surfaces of SNGP and our MGP with the noisy view. The darker the region (i.e., dark blue), the lower the probability of being Class 1. Since SNGP is a unimodal model, the input views are fused into a unimodal dataset. The difference between (c) and (f) shows that SNGP cannot correctly capture the input shape in the presence of noise. MGP, however, is robust to noise, as shown by the minimal difference between (d) and (g).
One way of solving this problem is to fuse the multiple modalities into one modality and directly apply existing unimodal methods. However, even a state-of-the-art unimodal model (e.g., SNGP [32]) can be prone to noise if one of the views in a multi-view dataset is noisy, as shown in Figure 1. Without the noisy view, unimodal models can produce accurate and confident predictions near the training domain. However, with the noisy view, the predictions become uncertain even for samples close to the training domain (see Figure 3). We also show in our experiments that existing multi-view classifiers (e.g., TMC [13]) have limited capacity to detect out-of-domain (OOD) samples (see Table 4).
To this end, we propose the Multi-view Gaussian Process (MGP), a tailored framework providing intrinsic uncertainty estimation for the classification of multi-view/modal data. Specifically, MGP consists of a dedicated Gaussian process (GP) expert for each view, whose predictions are aggregated by a product of experts (PoE). In our proposed method, there is a natural way of capturing uncertainty by measuring the distance between the training set and test samples in the reproducing kernel Hilbert space (RKHS). The contributions of our method can be summarized as follows:
1. We propose a new uncertainty estimation framework with GPs for multi-view data, which is an under-explored yet increasingly important problem in safety-critical applications.
2. The framework provides better uncertainty estimation through a product-of-experts model, providing more robustness in dealing with noise and better capacity for detecting OOD data.
3. We develop an effective variational inference algorithm to approximate multi-view posterior distributions in a principled way.
4. We conduct comprehensive and extensive experiments on both synthetic and real-world data, which show that our method achieves state-of-the-art performance for uncertainty estimation on multi-view/modal data.
2 Multi-view Gaussian Process
Given training data $\mathcal{X}=\{X_1, X_2, \ldots, X_V\}$, where $V$ is the number of views, each view consists of a training set of $N$ samples $X_v=\{\mathbf{x}_{v,i}\}_{i=1}^{N}$ with labels $\mathbf{y}=\{y_i\}_{i=1}^{N}$. In other words, the $i$th data sample consists of $V$ views $\{\mathbf{x}_{v,i}\}_{v=1}^{V}$ (e.g., the CUB dataset consists of images as the first view and captions as the second view), and $y_i$ is the data sample's ground-truth label shared across the views. Without loss of generality, a multi-view classification or regression problem can be formulated as predicting $\mathbf{y}_*$ given testing samples $\{X_{*,v}\}_{v=1}^{V}$. In this paper, we propose the Multi-view Gaussian Process (MGP), a novel framework for multi-view/modal data, where in a nutshell we first apply a GP to each view of the data and then combine, in a principled way, the predictions from all the GPs into a unified prediction by using a product of experts (PoE) [31].
2.1 GP for an Individual View
For each view, we consider a multiclass classification problem with $C$ classes. We place $C$ independent Gaussian priors over the latent function $f_v(\cdot)$ with zero mean and $N \times N$ covariance matrix $K_{NN}$, whose elements are $K_{ij}=k(\mathbf{x}_{v,i},\mathbf{x}_{v,j})$, where $k(\cdot,\cdot)$ is a kernel function. The radial basis function (RBF), commonly used in the GP literature [55, 37], is selected in this paper as the covariance function. It is defined as $k(\mathbf{x},\mathbf{x}') = \sigma_v^2 \exp\left(-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{2 l_v^2}\right)$, where $\sigma_v^2$ is the signal variance and $l_v$ is the length-scale of each GP, both of which are parameters to be optimized.
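As an illustration only, the per-view RBF covariance matrices can be computed as in the following minimal NumPy sketch; the data, the function name rbf_kernel, and the hyperparameter values are made up for this example:

import numpy as np

def rbf_kernel(X1, X2, signal_var, length_scale):
    """RBF covariance k(x, x') = sigma_v^2 * exp(-||x - x'||^2 / (2 * l_v^2))."""
    # Pairwise squared Euclidean distances between rows of X1 and X2.
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return signal_var * np.exp(-sq_dists / (2.0 * length_scale**2))

# One covariance matrix per view; each view has its own (sigma_v^2, l_v).
rng = np.random.default_rng(0)
views = [rng.normal(size=(100, 2)), rng.normal(size=(100, 5))]  # toy two-view data
K_per_view = [rbf_kernel(Xv, Xv, signal_var=1.0, length_scale=1.0) for Xv in views]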
To bypass the limitations of standard GPs [31], namely the high computational cost of $\mathcal{O}(N^3)$ and the inconvenience of applying stochastic gradient descent (SGD), we propose to leverage the sparse variational GP (SVGP) [31, 16, 17, 47], which is detailed as follows. With SVGP, we introduce $M$ ($M < N$) inducing points $Z_v$ representing the training samples of view $v$ with a smaller number of points, and the inducing variable $\mathbf{u}_v$ is the latent function evaluated at the inducing points (i.e., $\mathbf{u}_v = f_v(Z_v)$), where both $Z_v$ and $\mathbf{u}_v$ are random variables to be optimized. Similar to the Gaussian prior placed on the latent function, a joint prior can be set as:
$$\begin{bmatrix} \mathbf{f}_v \\ \mathbf{u}_v \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K_{NN} & K_{NM} \\ K_{NM}^{T} & K_{MM} \end{bmatrix}\right) \tag{1}$$
where we use $\mathbf{f}_v$ to denote $f_v(X_v)$ for notational convenience. The use of inducing points reduces the computational cost to $\mathcal{O}(M^3)$ [31]. We outline the likelihood of the GPs in Section 2.2 and the posterior in Section 2.3.
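The following sketch (NumPy/SciPy, with synthetic data and illustrative hyperparameters) assembles the covariance blocks of the joint prior in Equation (1) for one view; initializing $Z_v$ from a random data subset is one common choice, not one prescribed by this paper:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
N, M = 100, 10
Xv = rng.normal(size=(N, 2))                    # one view's training inputs
Zv = Xv[rng.choice(N, size=M, replace=False)]   # initialize Z_v from a data subset

def rbf(A, B, sig2=1.0, ls=1.0):
    # RBF kernel matrix between the rows of A and B.
    return sig2 * np.exp(-cdist(A, B, "sqeuclidean") / (2.0 * ls**2))

K_NN, K_NM, K_MM = rbf(Xv, Xv), rbf(Xv, Zv), rbf(Zv, Zv)

# Joint prior covariance of [f_v; u_v] from Equation (1).
joint_cov = np.block([[K_NN, K_NM],
                      [K_NM.T, K_MM]])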
2.2 GPs for Multi-view Data with PoE
Product of Experts (PoE). With one GP expert per view, we propose to combine the GP experts into a unified prediction by using the PoE mechanism [19, 31, 7, 6]. Specifically, we aggregate the posterior distributions of the individual views by:
$$p(\mathbf{f}|\mathcal{X},\mathbf{y}) \propto \prod_v p(\mathbf{f}_v|\mathbf{y}) \tag{2}$$
For Gaussian posteriors with means $\boldsymbol{\mu}_v$ and covariances $\Sigma_v$, the aggregation using Equation (2) yields a unified posterior whose mean and covariance are:
$$\boldsymbol{\mu} = \left(\sum_v \boldsymbol{\mu}_v \Sigma_v^{-1}\right)\Sigma, \qquad \Sigma = \left(\sum_v \Sigma_v^{-1}\right)^{-1} \tag{3}$$
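For concreteness, Equation (3) can be implemented directly; the following NumPy sketch fuses two hypothetical Gaussian experts and illustrates that the fused posterior is pulled toward the more confident expert:

import numpy as np

def poe_fuse(means, covs):
    """Fuse per-view Gaussian posteriors via the PoE rule in Equation (3)."""
    precisions = [np.linalg.inv(S) for S in covs]
    fused_cov = np.linalg.inv(sum(precisions))            # Sigma = (sum_v Sigma_v^-1)^-1
    fused_mean = fused_cov @ sum(P @ m for P, m in zip(precisions, means))
    return fused_mean, fused_cov

# Two hypothetical experts with different confidence levels:
m1, S1 = np.array([1.0, 0.0]), np.eye(2) * 0.5   # confident expert
m2, S2 = np.array([0.0, 1.0]), np.eye(2) * 2.0   # uncertain expert
mu, Sigma = poe_fuse([m1, m2], [S1, S2])
# The fused mean lies closer to the confident expert's mean, and the
# fused covariance is tighter than either individual covariance.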
Dirichlet-based Likelihood. In order to apply Equation (3) to a multi-view problem, the latent function $\mathbf{f}_v$ in each view should refer to the same observable variable (i.e., $\mathcal{N}(a|b,c)$ cannot be combined with $\mathcal{N}(c|d,e)$ for $a \neq c$). However, in GP classification, the latent function is a non-observable nuisance function that is squashed through a sigmoid or softmax function to estimate labels [55], and it is not necessarily the same for every independent view. We alleviate this problem by reparameterizing the class labels as regression labels:
$$\tilde{y}_i = f_v(\mathbf{x}_{v,i}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \tilde{\sigma}_i^2)$$
where $\tilde{y}_i$ is the transformed label and $\tilde{\sigma}_i^2$ is the noise parameter fixed for all views. Since $\tilde{y}_i$ and $\tilde{\sigma}_i^2$ are shared across the views, we ensure that $f_v(\mathbf{x}_{v,i})$ refers to the same variable. By using the log-normal distribution, the Gaussian likelihood can be used in the log space as $p(\tilde{y}_i|\mathbf{f}_v) = \mathcal{N}(\mathbf{f}_v, \tilde{\sigma}_i^2)$.
To transform the class labels into regression labels, we propose to represent the class probability $\boldsymbol{\pi}_i = [\pi_{i,1}, \pi_{i,2}, \ldots, \pi_{i,C}]$ with a Dirichlet distribution under the categorical likelihood [37]:
$$p(y_i|\boldsymbol{\alpha}_i) = \mathrm{Cat}(\boldsymbol{\pi}_i), \ \text{where } \boldsymbol{\pi}_i \sim \mathrm{Dir}(\boldsymbol{\alpha}_i); \qquad \pi_{i,c} = \frac{g_{i,c}}{\sum_{j=1}^{C} g_{i,j}}, \ \text{where } g_{i,c} \sim \mathrm{Gamma}(\alpha_{i,c}, 1) \tag{4}$$
where $\boldsymbol{\alpha}_i = [\alpha_{i,1}, \alpha_{i,2}, \ldots, \alpha_{i,C}]$ are the concentration parameters, the shape parameter of the Gamma distribution is $\alpha_{i,c}$, and the scale parameter of the Gamma distribution is $\theta = 1$. We approximate the Gamma distribution in (4) with $\tilde{g}_{i,c} \sim \mathrm{Lognormal}(\tilde{y}_{i,c}, \tilde{\sigma}_{i,c}^2)$ by moment matching:
$$\alpha_{i,c} = \exp(\tilde{y}_{i,c} + \tilde{\sigma}_{i,c}^2/2), \qquad \alpha_{i,c} = \left(\exp(\tilde{\sigma}_{i,c}^2) - 1\right)\exp(2\tilde{y}_{i,c} + \tilde{\sigma}_{i,c}^2)$$
Thus, the transformed labels and the noise parameter can be expressed in terms of the concentration parameters:
$$\tilde{\sigma}_{i,c}^2 = \log(1/\alpha_{i,c} + 1), \qquad \tilde{y}_{i,c} = \log \alpha_{i,c} - \tilde{\sigma}_{i,c}^2/2 \tag{5}$$
where $\alpha_{i,c} = 1 + \alpha_\epsilon$ if $y_{i,c} = 1$ and $\alpha_{i,c} = \alpha_\epsilon$ if $y_{i,c} = 0$, with the one-hot label $y_{i,c}$. Here $\alpha_\epsilon$ is a parameter that prevents the noise parameter from diverging to infinity. See the Appendix for the impact of $\alpha_\epsilon$ on model performance. Compared with other transformation methods such as Platt scaling [42], the Dirichlet likelihood compromises classification accuracy less and requires no post-hoc calibration after training.
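The label transformation of Equation (5) reduces to a few lines; below is a minimal NumPy sketch in which alpha_eps (our name for $\alpha_\epsilon$) is set to an arbitrary illustrative value:

import numpy as np

def dirichlet_label_transform(y_onehot, alpha_eps=0.01):
    """Map one-hot labels to regression targets y_tilde and noise
    variances sigma2_tilde via Equation (5)."""
    alpha = y_onehot + alpha_eps              # alpha_ic = 1 + alpha_eps or alpha_eps
    sigma2_tilde = np.log(1.0 / alpha + 1.0)  # sigma2_ic = log(1/alpha_ic + 1)
    y_tilde = np.log(alpha) - sigma2_tilde / 2.0
    return y_tilde, sigma2_tilde

y = np.eye(3)[[0, 2, 1]]                      # three samples, three classes (one-hot)
y_tilde, sigma2 = dirichlet_label_transform(y)
# Positive-class entries receive a larger target mean and a smaller noise
# variance than negative-class entries, so the likelihood stays Gaussian per class.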
2.3 Training of the Proposed Framework
Given the priors from Section 2.1 and the Gaussian likelihood from Section 2.2, the goal of training our framework is to estimate a posterior distribution via variational inference (VI) [16, 17, 4]. By using Equation (2), we propose an aggregated variational distribution over all the views:
$$q_{PoE}(\mathbf{f}) \propto \prod_v q(\mathbf{f}_v) \tag{6}$$
where $q(\mathbf{f}_v)$ is the variational distribution for each view that approximates the true posterior. We define $q(\mathbf{f}_v)$ as:
$$q(\mathbf{f}_v) := \int p(\mathbf{f}_v|\mathbf{u}_v)\, q(\mathbf{u}_v)\, d\mathbf{u}_v \tag{7}$$
where $p(\mathbf{f}_v|\mathbf{u}_v)$ is the conditional prior from Equation (1), and $q(\mathbf{u}_v)$ is the marginal variational distribution $\mathcal{N}(\mathbf{m}_v, S_v)$ with optimizable model parameters $\mathbf{m}_v$ and $S_v$. The analytical solution of (7) is provided in the Appendix. VI seeks to minimize the following Kullback–Leibler (KL) divergence between the variational and true posterior distributions:
$$\mathrm{KL}\left[q_{PoE}(\mathbf{f})\,\|\,p(\mathbf{f}|\mathcal{X},\tilde{\mathbf{y}}_c)\right] \tag{8}$$
where $\tilde{\mathbf{y}}_c = \{\tilde{y}_{i,c}\}_{i=1}^{N}$.
Lemma 1 (Additive Property of KL Divergence). If $\mathbf{x} = [x_1, \cdots, x_n] \in \mathcal{X}$, $p(\mathbf{x}) = \prod_i^n p(x_i)$, and $q(\mathbf{x}) = \prod_i^n q(x_i)$, we have:
$$\mathrm{KL}\left[p(\mathbf{x})\,\|\,q(\mathbf{x})\right] = \sum_i^n \mathrm{KL}\left[p(x_i)\,\|\,q(x_i)\right] \tag{9}$$
Theorem 2 (KL Divergence with PoE). With Equations (2) and (6), we have:
$$\mathrm{KL}\left[q_{PoE}(\mathbf{f})\,\|\,p(\mathbf{f}|\mathcal{X},\tilde{\mathbf{y}}_c)\right] = \sum_v \mathrm{KL}\left[q(\mathbf{f}_v)\,\|\,p(\mathbf{f}_v|\tilde{\mathbf{y}}_c)\right] \tag{10}$$
According to Theorem 2, the VI for the PoE splits into the VI of each expert/view. For the $v$th view, the VI minimizes $\mathrm{KL}[q(\mathbf{f}_v)\,\|\,p(\mathbf{f}_v|\tilde{\mathbf{y}}_c)]$, which can be turned into the maximization of the evidence lower bound (ELBO):
$$\mathrm{ELBO}_v = \sum_{i=1}^{N} \mathbb{E}_{q(f_{v,i})}\left[\log p(\tilde{y}_{i,c}|f_{v,i})\right] - \beta \cdot \mathrm{KL}\left[q(\mathbf{u}_v)\,\|\,p(\mathbf{u}_v)\right] \tag{11}$$
where $\beta$ is a parameter that controls the KL term, similar to [18], and can be interpreted as a regularization term. Proofs of Equations (9)-(11) are provided in the Appendix.
In order to apply SGD, we match the expectation of the stochastic gradient of the expected log-likelihood term to the full gradient by multiplying the log-likelihood term in Equation (11) by the number of batches [17]. The overall loss for all experts is:
$$\mathcal{L} = \sum_{v=1}^{V} \mathrm{ELBO}_v \tag{12}$$
The training steps are summarized in Algorithm 1.
Algorithm 1: Learning MGP
Input: $V$ views of training data $\mathcal{X}=\{X_v\}_{v=1}^{V}$, where each view has $N$ samples $X_v=\{\mathbf{x}_{v,i}\}_{i=1}^{N}$, and labels $\mathbf{y}=\{y_i\}_{i=1}^{N}$.
Transform: Reparameterize $\tilde{\mathbf{y}}_c$ by (5)
1: for each minibatch do
2:   for $v = 1$ to $V$ do
3:     Compute $q(\mathbf{f}_v)$ by (7)
4:     Calculate $\mathrm{ELBO}_v$ by (11)
5:   end for
6:   Sum the ELBOs by (12)
7:   SGD update of $\{l_v, \sigma_v^2, Z_v, \mathbf{m}_v, S_v\}_{v=1}^{V}$
8: end for

Algorithm 2: Inference of MGP
Input: $V$ views of testing data $\mathcal{X}_* = \{X_{*,v}\}_{v=1}^{V}$
1: for $v = 1$ to $V$ do
2:   Compute $q(\mathbf{f}_{*,v})$ by (13)
3:   Calculate $\gamma(X_{*,v})$ by (17)
4: end for
5: Aggregate $q_{PoE}(\mathbf{f}_*)$ by (16)
Output: $\mathbb{E}[\pi_{i,c}]$ and $\mathbb{V}[\pi_{i,c}]$ of the class probability by (14)
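To make Algorithm 1 concrete, the following is a minimal single-output sketch using the GPyTorch library (our choice for illustration; the released code may be organized differently). It trains one SVGP expert per view on synthetic data, using the transformed targets of Equation (5) for a single class, and sums the per-view negative ELBOs of Equations (11)-(12). A standard GaussianLikelihood stands in for the fixed per-sample noise $\tilde{\sigma}_{i,c}^2$:

import torch
import gpytorch

class ViewGP(gpytorch.models.ApproximateGP):
    """A sparse variational GP expert for one view (Section 2.1)."""
    def __init__(self, inducing_points):
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))                       # q(u_v) = N(m_v, S_v)
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist,
            learn_inducing_locations=True)                 # Z_v is optimized
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ZeroMean()       # zero-mean prior
        self.covar_module = gpytorch.kernels.ScaleKernel(  # sigma_v^2 * RBF(l_v)
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

N, M = 100, 10
views = [torch.randn(N, 2), torch.randn(N, 5)]   # toy two-view data
y_tilde = torch.randn(N)   # placeholder transformed targets for one class, Eq. (5)

models = [ViewGP(Xv[:M].clone()) for Xv in views]
liks = [gpytorch.likelihoods.GaussianLikelihood() for _ in views]
mlls = [gpytorch.mlls.VariationalELBO(lik, m, num_data=N, beta=1.0)  # beta of Eq. (11)
        for lik, m in zip(liks, models)]
params = [p for m in models for p in m.parameters()]
params += [p for lik in liks for p in lik.parameters()]
opt = torch.optim.Adam(params, lr=0.01)

for step in range(200):                      # minibatch loop of Algorithm 1 (full batch here)
    opt.zero_grad()
    loss = sum(-mll(model(Xv), y_tilde)      # negative ELBO_v, summed as in Eq. (12)
               for mll, model, Xv in zip(mlls, models, views))
    loss.backward()
    opt.step()   # updates {l_v, sigma_v^2, Z_v, m_v, S_v} for every view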
2.4 Inference on Test Samples
Given test samples $\mathcal{X}_* = \{X_{*,v}\}_{v=1}^{V}$, the predictive distribution $p(\mathbf{f}_{*,v}|\tilde{\mathbf{y}}_c)$ is estimated by the variational distribution as:
$$p(\mathbf{f}_{*,v}|\tilde{\mathbf{y}}_c) \approx q(\mathbf{f}_{*,v}) = \int p(\mathbf{f}_{*,v}|\mathbf{u}_v)\, q(\mathbf{u}_v)\, d\mathbf{u}_v \tag{13}$$
where $p(\mathbf{f}_{*,v}|\mathbf{u}_v)$ can be formed from the joint prior distribution, similar to Equation (1) (see the Appendix for a full derivation). Similar to Equation (6), we aggregate the predictive distributions to form $q_{PoE}(\mathbf{f}_*)$, which is sampled to approximate Gamma-distributed samples that in the end form the posterior of the Dirichlet distribution as follows:
$$\mathbb{E}[\pi_{i,c}] = \int \frac{\exp(f_{i,c,*})}{\sum_j \exp(f_{i,j,*})}\, q_{PoE}(\mathbf{f}_{i,*})\, d\mathbf{f}_*, \qquad \mathbb{V}[\pi_{i,c}] = \int \left(\frac{\exp(f_{i,c,*})}{\sum_j \exp(f_{i,j,*})} - \mathbb{E}[\pi_{i,c}]\right)^2 q_{PoE}(\mathbf{f}_{i,*})\, d\mathbf{f}_* \tag{14}$$
Equation (14) can be approximated with the Monte Carlo method. See the Appendix for the impact of the number of Monte Carlo samples on classification performance and inference time.
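A minimal NumPy sketch of this Monte Carlo estimator for a single test point is given below; the fused mean and covariance are placeholders standing in for the output of Equation (16):

import numpy as np

def softmax(f):
    e = np.exp(f - f.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Placeholder fused posterior over C = 3 latent function values at one test input.
mu_star = np.array([2.0, 0.0, -1.0])
Sigma_star = np.eye(3) * 0.3

samples = rng.multivariate_normal(mu_star, Sigma_star, size=1000)  # draws from q_PoE(f_*)
probs = softmax(samples)              # squash each latent draw through the softmax
mean_pi = probs.mean(axis=0)          # Monte Carlo estimate of E[pi_{i,c}], Eq. (14)
var_pi = probs.var(axis=0)            # Monte Carlo estimate of V[pi_{i,c}], Eq. (14)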
The aggregated predictive distribution can also be weighted by each expert's predictive distribution:
$$q_{PoE}(\mathbf{f}_*) \propto \prod_v \left(q(\mathbf{f}_{*,v})\right)^{\gamma(X_{*,v})} \tag{15}$$
where $\gamma(X_{*,v})$ is a weight controlling the influence of each expert on the aggregated prediction. The mean and covariance of $q_{PoE}(\mathbf{f}_*)$ with $\gamma(X_{*,v})$ are:
$$\boldsymbol{\mu}_W = \left(\sum_v \boldsymbol{\mu}_v\, \gamma(X_{*,v})\, \Sigma_v^{-1}\right)\Sigma_W, \qquad \Sigma_W = \left(\sum_v \gamma(X_{*,v})\, \Sigma_v^{-1}\right)^{-1} \tag{16}$$
In our experiments, we use the negative entropy of the predictive distribution:
$$\gamma(X_{*,v}) = -H\left(q(\mathbf{f}_{*,v})\right) \tag{17}$$
Note that the original PoE [19] in Equation (6) is recovered if $\gamma(X_{*,v}) = 1$. The intuition behind choosing negative entropy is that experts with lower posterior entropy, i.e., lower uncertainty, contribute more to the aggregated predictions. Please note that other choices of $\gamma(X_{*,v})$ can also be applied, such as the difference in entropy from prior to posterior [6] or the negative predictive variance [7]. We obtain better empirical results with negative entropy, but the choice of function is flexible. The inference steps are summarized in Algorithm 2.
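A hedged NumPy sketch of the entropy-weighted aggregation of Equations (15)-(17) for a single test point follows; the per-view means and covariances are placeholders, chosen to be low-entropy (confident) so that the weights $-H$ come out positive in this example:

import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian N(mu, cov)."""
    d = cov.shape[0]
    return 0.5 * (d * (1.0 + np.log(2.0 * np.pi)) + np.linalg.slogdet(cov)[1])

def weighted_poe(means, covs):
    """Entropy-weighted PoE fusion, Equations (15)-(17)."""
    gammas = [-gaussian_entropy(S) for S in covs]              # Eq. (17)
    weighted_prec = [g * np.linalg.inv(S) for g, S in zip(gammas, covs)]
    Sigma_W = np.linalg.inv(sum(weighted_prec))                # Eq. (16), covariance
    mu_W = Sigma_W @ sum(P @ m for P, m in zip(weighted_prec, means))
    return mu_W, Sigma_W

# Two confident experts over C = 2 latent function values at one test input:
means = [np.array([1.0, -1.0]), np.array([0.8, -0.5])]
covs = [np.eye(2) * 0.01, np.eye(2) * 0.05]   # low entropy, so -H > 0 here
mu_W, Sigma_W = weighted_poe(means, covs)
# The expert with the lower predictive entropy (tighter covariance) receives
# the larger weight and therefore dominates the fused prediction.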