Multi-View Independent Component Analysis
with Shared and Individual Sources
Teodora Pandeva¹,²    Patrick Forré¹
¹AI4Science, AMLab, University of Amsterdam
²Swammerdam Institute for Life Sciences, University of Amsterdam
Abstract
Independent component analysis (ICA) is a blind source separation method for linear disentanglement of independent latent sources from observed data. We investigate the special setting of noisy linear ICA where the observations are split among different views, each receiving a mixture of shared and individual sources. We prove that the corresponding linear structure is identifiable, and that the source distributions can be recovered. To computationally estimate the sources, we optimize a constrained form of the joint log-likelihood of the observed data among all views. We also show empirically that our objective recovers the sources even when the measurements are corrupted by noise. Furthermore, we propose a model selection procedure for recovering the number of shared sources, which we verify empirically. Finally, we apply the proposed model to a challenging real-life application, where the shared sources estimated from two large transcriptome datasets (observed data) provided by two different labs (two different views) yield a plausible representation of the underlying graph structure.
1 INTRODUCTION
Independent component analysis (ICA) is a method for solving blind source separation (BSS) problems [Comon, 1994], where the goal is to separate independent latent sources from mixed observed signals and, thus, uncover essential structures in various data types. Historically, linear ICA has proven to be a successful approach for recovering spatially independent sources representing brain activity regions from magnetoencephalography (MEG) data [Vigário et al., 1997] or functional MRI (fMRI) data [McKeown and Sejnowski, 1998]. The utility of ICA is not limited to neuroscience; it has a wide range of applications in omics data analysis, e.g. [Zheng et al., 2008, Nazarov et al., 2019, Zhou and Altman, 2018, Tan et al., 2020, Urzúa-Traslaviña et al., 2021, Rusan et al., 2020, Cary et al., 2020, Dubois et al., 2019, Aynaud et al., 2020]. In these works, the interpretation of the latent sources relies on the assumption that each experimental outcome is a linear mixture of independent biological processes (the sources). For example, the latent sources could represent gene profiles that are used to predict gene regulation [Sastry et al., 2021, 2019] or cell-type specific expressions from tumor samples [Avila Cobos et al., 2018] for studying cell-type decompositions in cancer research.

The fast advancement of technology in the biomedical domain has provided a unique opportunity to find valuable insights from large-scale data integration studies. Many of these applications can be transformed into multi-view BSS problems. A significant body of research has been devoted to developing multi-view ICA methods focused on unraveling group-level (shared) brain activity patterns in multi-subject fMRI and EEG datasets [Salman et al., 2019, Huster et al., 2015, Congedo et al., 2010, Durieux et al., 2019, Calhoun et al., 2001]. However, these methods cannot be applied directly to problems where one is interested in retrieving both shared and view-specific signals, e.g. investigating individual-specific brain functions (view-specific) and shared phenotype patterns in individuals' brain activity in a natural stimuli experiment [Dubois et al., 2016, Bartolomeo et al., 2017]. Another application, where the estimation of both shared and view-specific sources is essential, is omics data integration. A typical example is combining heterogeneous gene expression datasets for achieving better gene regulation discovery. In this scenario, the observed samples are realizations of diverse and complex experiments. The shared information between the datasets refers to genes with stable expression across almost all conditions, and the individual signals represent experiment-specific gene activities such as measurements of gene knock-outs, stress conditions, etc.
Summary. To address these and similar scientific applications, we formalize the described multi-view BSS problem as a linear noisy generative model for a multi-view data regime, assuming that the mixing matrix and the number of individual sources are view-specific. We call the resulting model ShIndICA. By requiring that the sources are non-Gaussian and mutually independent and that the linear mixing matrices have full column rank, we provide identifiability guarantees for the mixing matrices and for the latent sources in distribution. We maximize the joint log-likelihood of the observed views to estimate the mixing matrices. Furthermore, we provide a model selection criterion for selecting the correct number of shared sources. Finally, we apply ShIndICA to a data integration problem of two large transcriptome datasets. We show empirically that our method works well compared to the baselines when the estimated components are used for a graph inference task.
Contributions. Our contributions can be summarized as follows:

1. We propose a new multi-view generative BSS model with shared and individual sources, called ShIndICA.
2. We provide theoretical guarantees for the identifiability of the recovered linear structure and of the source and noise distributions.
3. We derive the closed-form joint likelihood of ShIndICA, which is used for estimating the mixing matrices.
4. We propose a selection criterion for inferring the correct number of shared sources, derived from the generative model assumptions.
2 PROBLEM FORMALIZATION
Consider the following $D$-view multivariate linear BSS model where for $d \in \{1, \ldots, D\}$
$$x_d = A_d(\tilde{s}_d + \epsilon_d) = A_{d,0}\, s_0 + A_{d,1}\, s_d + A_d\, \epsilon_d, \qquad (1)$$
and it holds that

1. $x_d \in \mathbb{R}^{k_d}$ is a random vector with $\mathbb{E}[x_d] = 0$,
2. $\tilde{s}_d = (s_0^\top, s_d^\top)^\top$ are latent non-Gaussian random sources with $s_0 \in \mathbb{R}^c$ and $s_d \in \mathbb{R}^{k_d - c}$ being the shared and individual sources, and $\mathbb{E}[\tilde{s}_d] = 0$ and $\mathrm{Var}[\tilde{s}_d] = I_{k_d}$,
3. $A_d \in \mathbb{R}^{k_d \times k_d}$ is a mixing matrix with full column rank, where $A_{d,0}$ and $A_{d,1}$ are the columns corresponding to the shared and individual sources,
4. $\epsilon_d \sim \mathcal{N}(0, \sigma^2 I_{k_d})$ is Gaussian noise,
5. all latent source components and noise variables are mutually independent.
Figure 1: A graphical representation of Equation 1, where $x_d$ is the observed variable, $s_0$ denotes the shared sources, $s_d$ the view-specific ones, and $\epsilon_d$ is the Gaussian noise (plate over $d = 1, \ldots, D$).
Note that for $D = 1$ the model becomes a standard linear ICA model, which is solved by Comon [1994], Hyvärinen and Oja [2000], Bell and Sejnowski [1995] for independent non-Gaussian latent sources $z := \tilde{s}_1 + \epsilon_1$. The Gaussian noise in Equation 1 can be interpreted as a measurement error on the device with variance $\sigma^2 A_d A_d^\top$ (similarly to [Richard et al., 2020, 2021]). We choose this setting over $A_d \tilde{s}_d + \epsilon_d$ because we can derive a joint data likelihood in closed form (see Section 4), which is not available in the latter representation. Moreover, assumption 5 implies that the noise is not expected to influence the true signal and vice versa, which is a common assumption in measurement error models, known as classical errors. See Figure 1 for a graphical representation of Equation 1.
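As a concrete illustration of the generative process in Equation 1, the following NumPy sketch samples $D$ views with shared and view-specific sources. The Laplace source distribution, the dimensions and the noise level are illustrative choices (the model only requires non-Gaussian, zero-mean, unit-variance sources), and all variable names are ours.

```python
# Minimal sketch of sampling from Equation (1); distributions and sizes are
# illustrative, not prescribed by the model.
import numpy as np

rng = np.random.default_rng(0)
D, c, N = 3, 5, 1000                 # number of views, shared sources, samples
k = [20, 25, 30]                     # total number of sources per view (k_d)
sigma = 0.1                          # noise standard deviation

s0 = rng.laplace(scale=1 / np.sqrt(2), size=(c, N))   # shared sources, unit variance

views, mixings = [], []
for d in range(D):
    sd = rng.laplace(scale=1 / np.sqrt(2), size=(k[d] - c, N))  # individual sources
    s_tilde = np.vstack([s0, sd])                    # tilde s_d = (s_0^T, s_d^T)^T
    eps = sigma * rng.standard_normal((k[d], N))     # eps_d ~ N(0, sigma^2 I)
    A = rng.standard_normal((k[d], k[d]))            # full-column-rank mixing (a.s.)
    views.append(A @ (s_tilde + eps))                # x_d = A_d (tilde s_d + eps_d)
    mixings.append(A)
```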
3 IDENTIFIABILITY RESULTS
In unsupervised machine learning methods, the reliability of the algorithm cannot be directly verified outside of simulations due to the non-existence of labels. For this reason, theoretical guarantees are necessary to trust that the algorithm estimates the quantities of interest. For a BSS problem solution, such as ICA, we want the sources and mixing matrices to be (up to certain equivalence relations) unambiguously determined (or identifiable) by the data, at least in the large sample limit.

Identifiability results for noiseless single-view ICA are proved by Comon [1994]. It turns out that if at most one of the latent sources is normal and the mixing matrix is invertible, then both the mixing matrix and the sources can be recovered almost surely up to permutation, sign and scaling. However, this result does not hold in the general additive noise setting. Davies [2004] shows that if the mixing matrix has full column rank, then the structure is identifiable, but not the latent sources.
By employing the multi-view ($D \geq 2$) noisy setting inspired by our model (see Equation 1), we extend the results of Comon [1994], Davies [2004], Kagan et al. [1973], Richard et al. [2020]. Compared to previous work, we provide identifiability guarantees not only for the mixing matrices up to sign and permutation, but also for the source and noise distributions (up to the same sign and permutation), and for the dimensions of the latent (both shared and individual) sources.¹ Moreover, our identifiability results hold for a more general case than Equation 1, since the noise distribution can be view-specific and the mixing matrices can be non-square. This is stated in the following Theorem 3.1, proved in Appendix A:
Theorem 3.1. Let $x_1, \ldots, x_D$ for $D \geq 2$ be random vectors with the following two representations:
$$A^{(1)}_d\left(\begin{bmatrix} s^{(1)}_0 \\ s^{(1)}_d \end{bmatrix} + \epsilon^{(1)}_d\right) = x_d = A^{(2)}_d\left(\begin{bmatrix} s^{(2)}_0 \\ s^{(2)}_d \end{bmatrix} + \epsilon^{(2)}_d\right),$$
where $d \in \{1, \ldots, D\}$, with the following properties for $i = 1, 2$:

1. $A^{(i)}_d \in \mathbb{R}^{p_d \times k^{(i)}_d}$ is a (non-random) matrix with full column rank, i.e. $\mathrm{rank}(A^{(i)}_d) = k^{(i)}_d$,
2. $\epsilon^{(i)}_d \in \mathbb{R}^{k^{(i)}_d}$ and $\epsilon^{(i)}_d \sim \mathcal{N}(0, \sigma^{(i)2}_d I_{k^{(i)}_d})$ is a $k^{(i)}_d$-variate normal random variable,
3. $\tilde{s}^{(i)}_d = (s^{(i)\top}_0, s^{(i)\top}_d)^\top$ with $s^{(i)}_0 \in \mathbb{R}^{c^{(i)}}$ and $s^{(i)}_d \in \mathbb{R}^{k^{(i)}_d - c^{(i)}}$ is a random vector such that:
   (a) the components of $\tilde{s}^{(i)}_d$ are mutually independent and each of them is a.s. a non-constant random variable,
   (b) $\tilde{s}^{(i)}_d$ is non-normal with $0$ mean and unit variance,
4. $\epsilon^{(i)}_d$ is independent from $s^{(i)}_0$ and $s^{(i)}_d$: $\epsilon^{(i)}_d \perp\!\!\!\perp s^{(i)}_0$ and $\epsilon^{(i)}_d \perp\!\!\!\perp s^{(i)}_d$.

Then, the number of shared sources is identifiable, i.e. $c^{(1)} = c^{(2)} =: c$, and for all $d = 1, \ldots, D$ we get that $k^{(1)}_d = k^{(2)}_d =: k_d$, and there exist a sign matrix $\Gamma_d$ and a permutation matrix $P_d \in \mathbb{R}^{k_d \times k_d}$ such that
$$A^{(2)}_d = A^{(1)}_d P_d \Gamma_d,$$
and furthermore the source and noise distributions are identifiable, i.e.
$$\begin{bmatrix} s^{(2)}_0 \\ s^{(2)}_d \end{bmatrix} \sim \Gamma^{-1}_d P^{-1}_d \begin{bmatrix} s^{(1)}_0 \\ s^{(1)}_d \end{bmatrix}, \qquad \sigma^{(2)}_d = \sigma^{(1)}_d.$$
Note that the requirement $D \geq 2$ is essential for the identifiability of the non-Gaussian latent source and noise distributions. In contrast, in the single-view case, Kagan et al. [1973] show that we cannot identify an arbitrary non-Gaussian source distribution unless we impose an additional constraint on the latent sources to have non-normal components (e.g., see Theorem A.2).²

¹Note that the identifiability of the source distributions is a weaker notion of identifiability than the almost sure one (i.e. recovering the exact sources) in the noiseless case [Comon, 1994].

²A random variable $x$ is said to have non-normal components if for every representation $x \sim v + w$ with $v \perp\!\!\!\perp w$, both $v$ and $w$ are non-normal.
Moreover, a necessary assumption for the identifiability of the linear structure is the non-normality of the latent sources, which is a standard assumption in the ICA literature [Comon, 1994], as stated above. In the more restrictive multi-view shared ICA case, Richard et al. [2021] show that the sources can be Gaussian if additional assumptions are imposed on the diversity of the noise distributions. However, this is not applicable in our case, since we do not make these assumptions for our model.
4 JOINT DATA LOG-LIKELIHOOD
Here, we derive the joint log-likelihood of the observed views, which we use for estimating the mixing matrices. Following standard ICA approaches [Bell and Sejnowski, 1995, Hyvärinen and Oja, 2000], instead of optimizing directly for the mixing matrices $A_d$, we estimate their inverses $W_d = A_d^{-1}$, called unmixing matrices.

Let $z_d := W_d x_d = \tilde{s}_d + \epsilon_d$, and $z_{d,0} := s_0 + \epsilon_{d,0} \in \mathbb{R}^c$ and $z_{d,1} := s_d + \epsilon_{d,1} \in \mathbb{R}^{k_d - c}$, i.e. $z_d = (z_{d,0}^\top, z_{d,1}^\top)^\top$ are the estimated noisy sources of the $d$-th view. Furthermore, let $p_{Z_{d,1}}$ be the probability distribution of $z_{d,1}$ and $|W_d| = |\det W_d|$. Then we can derive the data log-likelihood of Equation 1 for $N$ observed samples per view (proved in Appendix B), which is given by
$$\begin{aligned}
\mathcal{L}(W_1, \ldots, W_D) = \sum_{i=1}^{N} &\Big( \log f(\bar{s}^i_0) + \sum_{d=1}^{D} \log p_{Z_{d,1}}(z^i_{d,1}) \Big) \\
&- \frac{1}{2\sigma^2} \sum_{d=1}^{D} \Big( \mathrm{trace}(Z_{d,0} Z_{d,0}^\top) - \frac{1}{D} \sum_{l=1}^{D} \mathrm{trace}(Z_{d,0} Z_{l,0}^\top) \Big) \\
&+ N \sum_{d=1}^{D} \log |W_d| + C, \qquad (2)
\end{aligned}$$
where $Z_{d,0} \in \mathbb{R}^{c \times N}$ for $d = 1, \ldots, D$ is the data matrix that stores the $N$ observations of $z_{d,0}$. We estimate the shared sources via $\bar{s}^i_0 = \sum_{d=1}^{D} z^i_{d,0} / D$, with probability distribution $f(\bar{s}_0) = \int \exp\!\big(-\tfrac{D \|s_0 - \bar{s}_0\|^2}{2\sigma^2}\big)\, p_{S_0}(s_0)\, ds_0$.
We further simplify the loss function by assuming that the data matrices $X_1 \in \mathbb{R}^{k_1 \times N}, \ldots, X_D \in \mathbb{R}^{k_D \times N}$ are whitened. This consists of linearly transforming the realizations of the random variables $x_d$ such that the resulting variable $\tilde{x}_d = K_d x_d$ has uncorrelated components with unit variance, i.e. $\mathbb{E}[\tilde{x}_d \tilde{x}_d^\top] = I_{k_d}$, where $K_d$ is the whitening matrix. This step transforms the mixing matrix into an orthogonal one, $\tilde{A}_d$.
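For concreteness, a minimal sketch of such a whitening step (here via the eigendecomposition of the sample covariance, i.e. PCA whitening; the function name is illustrative, not from the paper's code):

```python
# Sketch of a whitening matrix K_d built from the eigendecomposition of the
# sample covariance, so that the whitened data have (approximately) identity covariance.
import numpy as np

def whiten(X):
    """X: (k_d, N) centered data matrix. Returns whitened data and K_d."""
    cov = X @ X.T / X.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    K = np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T   # whitening matrix K_d (assumes eigval > 0)
    return K @ X, K

# Sanity check (for a sample X of shape (k, N)):
# Xw, K = whiten(X); assert np.allclose(Xw @ Xw.T / X.shape[1], np.eye(X.shape[0]), atol=1e-6)
```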
In the new optimization problem after whitening, we aim to find orthogonal unmixing matrices $\tilde{W}_d = \tilde{A}_d^\top$ that maximize the transformed data log-likelihood:
$$\begin{aligned}
\mathcal{L}(\tilde{W}_1, \ldots, \tilde{W}_D) \propto \sum_{i=1}^{N} &\Big( \log f_\sigma(\bar{\tilde{s}}^i_0) + \sum_{d=1}^{D} \log p_{\tilde{Z}_{d,1}}(\tilde{z}^i_{d,1}) \Big) \\
&+ \frac{1 + \sigma^2}{2D\sigma^2} \sum_{d=1}^{D} \sum_{l=1}^{D} \mathrm{trace}(\tilde{Z}_{d,0} \tilde{Z}_{l,0}^\top), \qquad (3)
\end{aligned}$$
where, analogously to Equation 2, $\tilde{z}_d = (\tilde{z}_{d,0}^\top, \tilde{z}_{d,1}^\top)^\top = \tilde{W}_d \tilde{x}_d$ and $\bar{\tilde{s}}^i_0 = \sum_{d=1}^{D} \tilde{z}^i_{d,0} / D$. Note that after whitening, $\mathrm{trace}(\tilde{Z}_{d,0} \tilde{Z}_{d,0}^\top)$ and $|\tilde{W}_d| = 1$ are constants and thus vanish from Equation 3 (see Appendix B for detailed derivations). The first line of Equation 3 represents the source log-likelihoods, and the second line plays the role of a regularization term for finding the shared information between the views. In our work, Equation 3 is used for parameter estimation, where both the density of the shared sources $f_\sigma(\cdot)$ and the densities of the individual sources $p_{\tilde{Z}_{d,1}}$ are approximated by a nonlinear function $g(s)$, e.g. $g(s) = \log\cosh(s)$ for super-Gaussian or $g(s) = -e^{-s^2/2}$ for sub-Gaussian sources.
Moreover, we treat the noise variance $\sigma^2$ as a Lagrange multiplier via the relation $\lambda = \frac{1 + \sigma^2}{\sigma^2}$. Finally, after training we compute the mixing matrices $\hat{A}_d$ by setting $\hat{A}_d = K_d^{-1} \tilde{W}_d$. Thus, we recover the true mixing matrices $A_d$ up to scaling with $(1 + \sigma^2)^{\frac{1}{2}}$, sign and column permutation.
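The following PyTorch sketch shows how an objective of the form of Equation 3 can be evaluated. It uses $-\log\cosh$ as the super-Gaussian log-density surrogate (a standard maximum-likelihood ICA choice, our sign convention) and a fixed multiplier $\lambda$; all function and variable names are ours, not taken from the paper's released code.

```python
# Hedged sketch of the whitened objective in Equation (3): source densities are
# replaced by a -log cosh contrast; the coupling term carries lambda = (1+sigma^2)/sigma^2.
import torch

def neg_objective(W_list, X_list, c, lam=1.0):
    """W_list: orthogonal unmixing matrices (k_d, k_d); X_list: whitened views (k_d, N)."""
    Z = [W @ X for W, X in zip(W_list, X_list)]
    Z0 = [z[:c] for z in Z]                    # estimated (noisy) shared sources
    Z1 = [z[c:] for z in Z]                    # estimated individual sources
    s0_bar = torch.stack(Z0).mean(dim=0)       # averaged shared sources across views

    # log cosh surrogate (unsafeguarded; may overflow for very large inputs)
    loglik = -torch.log(torch.cosh(s0_bar)).sum()
    loglik += sum(-torch.log(torch.cosh(z)).sum() for z in Z1)

    D = len(Z0)
    # sum_{d,l} trace(Z_{d,0} Z_{l,0}^T) computed as elementwise products
    coupling = sum((z_d * z_l).sum() for z_d in Z0 for z_l in Z0)
    return -(loglik + lam / (2 * D) * coupling)   # negative objective for minimization
```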
5 MODEL SELECTION
By leveraging the data generation model assumptions, we can select the number of shared sources $c$ in a completely unsupervised way. More precisely, let $k < k_d$ for all $d = 1, \ldots, D$ be a candidate for $c$, which is unknown. Under the assumption that $k$ is a correct guess (i.e. $k = c$), our generative model yields that $\hat{z}_d = z_{d,0} - \frac{1}{D}\sum_{l=1}^{D} z_{l,0}$ is normally distributed with $0$ mean and variance $\frac{D-1}{D}\sigma^2 I_k$ for each $d = 1, \ldots, D$, where $z_{d,0}$ is defined as in Equation 2. We propose an evaluation metric, called the normalized reconstruction error (NRE), defined by the following relation to the log-likelihood of $\hat{z}_d$:
$$\mathrm{NRE}(k) := \sum_{d=1}^{D} \frac{\|\hat{z}_d\|^2}{k} \;\propto\; -\sum_{d=1}^{D} \frac{\log p(\hat{z}_d)}{k} = \sum_{d=1}^{D} \left( \frac{D\|\hat{z}_d\|^2}{2(D-1)\sigma^2 k} + \frac{k \log(2\pi\sigma^2)}{2k} \right).$$
The two quantities differ by translation and multiplication with constants that do not depend on the parameter of interest $k$. Thus, by minimizing $\mathrm{NRE}(k)$ we maximize the sum of the (normalized) log-likelihoods of the normal variables $\hat{z}_d$. Intuitively, due to the normalization, $\mathrm{NRE}(k)$ can be interpreted as the average reconstruction error over the $k$ shared sources (summed over all views). This allows for a fair comparison of the NRE scores for different $k$.
We select an optimal parameter by employing the following procedure. First, we split the data (for each view) into two disjoint sets $\mathcal{D}_0$ and $\mathcal{D}_1$, with not necessarily the same sample sizes. We estimate the unmixing matrices for a fixed $k$ on $\mathcal{D}_0$ (the train set) and estimate the shared sources on the test data $\mathcal{D}_1$. Then we compute the mean $\mathrm{NRE}(k)$ on the recovered test shared sources (not on the train set, due to possible overfitting; see Section 7). We repeat this for various $k$ and choose the maximum of all $k$'s that minimize the NRE, i.e. $k = \max\{\arg\min_k \overline{\mathrm{NRE}}(k)\}$, where
$$\overline{\mathrm{NRE}}(k) = \frac{1}{|\mathcal{D}_1|} \sum_{i \leq |\mathcal{D}_1|} \mathrm{NRE}(k)^i = \frac{1}{|\mathcal{D}_1|} \sum_{i \leq |\mathcal{D}_1|} \sum_{d=1}^{D} \frac{\|\hat{z}^i_d\|^2}{k}$$
is the average NRE score over all observed test samples in $\mathcal{D}_1$.
The NRE score serves as a goodness-of-fit measure and indicates how well the true shared sources are reconstructed from the test data. Due to the model fitting, we can obtain high-quality shared sources even when $k \ll c$, as we demonstrate empirically. Thus, we prefer to select the highest possible $k$ for which the average shared-source reconstruction error is minimal.
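A minimal sketch of this selection procedure is given below. It assumes a hypothetical `fit(train_views, k)` routine returning per-view unmixing matrices; none of these names come from the paper's code.

```python
# Sketch of NRE-based selection of the number of shared sources k.
import numpy as np

def nre(test_views, W_list, k):
    """Average NRE over the test samples for a candidate number k of shared sources."""
    Z0 = [(W @ X)[:k] for W, X in zip(W_list, test_views)]   # noisy shared sources, (k, N_test)
    z_mean = np.mean(Z0, axis=0)                              # average over views
    # hat z_d = z_{d,0} - mean_l z_{l,0}; squared norm per sample, normalized by k, summed over views
    return sum(np.mean(np.sum((z0 - z_mean) ** 2, axis=0)) / k for z0 in Z0)

# Candidate grid; pick the largest k among the (near-)minimizers of the NRE curve:
# candidates = range(2, 50, 2)
# scores = {k: nre(test_views, fit(train_views, k), k) for k in candidates}
# k_hat = max(k for k in candidates if np.isclose(scores[k], min(scores.values())))
```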
6 RELATED WORK
The existing body of work on linear multi-view BSS, inspired by the ICA literature, considers mostly shared response model applications (i.e., no individual sources), some adopting a maximum likelihood approach [Guo and Pagnoni, 2008, Richard et al., 2020, 2021] to model the noisy views. Other methods, such as independent vector analysis (IVA), relax the assumption about the shared sources by assuming that they have the same first or higher order moments across views [Lee et al., 2008, Anderson et al., 2011, 2014, Engberg et al., 2016, Vía et al., 2011]. Many of these approaches, such as Group ICA [Calhoun et al., 2001], shared response ICA (SR-ICA) [Zhang et al., 2016], MultiViewICA [Richard et al., 2020], and ShICA [Richard et al., 2021], incorporate a dimensionality reduction step for every view (CCA [Varoquaux et al., 2009, Richard et al., 2021] or PCA) to extract the mutual signal between the multiple objects before applying an ICA procedure on the reduced data. However, there are no guarantees that the pre-processing procedure will entirely remove the influence of the object-specific sources on the transformed data. In the ICA literature, there exist three methods for extracting shared and individual sources from data. Maneshi et al. [2016] propose a heuristic way of using FastICA for the given task without discussing the identifiability of the results; Long et al. [2020] suggest applying ICA on each view separately, followed by statistical analysis to separate the individual from the shared sources; and Lukic et al. [2002] exploit temporal correlations rather than the non-Gaussianity of the sources, so their method is not applicable in the context we are considering.

A common tool for analyzing multi-view data is canonical correlation analysis (CCA), initially proposed by Hotelling [1936]. It finds projections of two datasets that maximize the correlation between the projected variables. Gaussian-CCA [Bach et al., 2005], its kernelized version [Bach et al., 2002] and deep learning [Andrew et al., 2013] formulations of the classical CCA problem aim to recover shared latent sources of variation from the multiple views. There are extensions of CCA that model the observed variables as a linear combination of group-specific and dataset-specific latent variables, estimated with Bayesian inference methods [Klami et al., 2013] or exponential families with MCMC inference [Virtanen, 2010]. However, most of them assume that the latent sources are Gaussian or non-linearly related to the observed data [Wang et al., 2016] and thus lack identifiability results. Existing non-linear multi-view versions such as [Tian et al., 2020, Federici et al., 2020] cannot recover both shared and individual signals across multiple measurements and do not ensure the identifiability of the proposed generative models. There are identifiable deep non-linear versions of ICA (e.g. [Hyvärinen et al., 2019]) which could be employed for this task. However, their assumptions for achieving identifiability are often hard to satisfy in real-life applications, especially in biomedical domains with low-data regimes.
7 EXPERIMENTS
Model Implementation and Training. We used the Python library pytorch [Paszke et al., 2017] to implement our method. We model each view with a separate unmixing matrix. To impose orthogonality constraints on the unmixing matrices, we made use of the geotorch library, an extension of pytorch [Lezcano-Casado, 2019]. The gradient-based method applied for training is L-BFGS. Before running any of the ICA-based methods (ours or the baselines), we whiten every single view by performing PCA to speed up computation. We estimate the mixing matrix up to scale (due to the whitening) and permutation (see Sections 3 and 4). To force the algorithm to output the shared sources in the same order across all views, we initialize the unmixing matrices by means of CCA. This works because the CCA weights are orthogonal matrices, and the transformed views' components are paired and ordered across views. For all conducted experiments, we fixed the parameter $\lambda$ from Equation 3 to 1.
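A minimal sketch of this training setup is given below, assuming the `neg_objective` loss from the Section 4 sketch is in scope and that the views are already whitened torch tensors; the CCA initialization is omitted, and all names are illustrative rather than taken from the released code.

```python
# Sketch: one orthogonal unmixing matrix per view (via geotorch) trained with L-BFGS.
import torch
import geotorch

def make_unmixing(k_list):
    layers = torch.nn.ModuleList(torch.nn.Linear(k, k, bias=False) for k in k_list)
    for layer in layers:
        geotorch.orthogonal(layer, "weight")   # constrain each W_d to be orthogonal
    return layers

def train(X_list, c, lam=1.0, max_iter=200):
    """X_list: whitened views as torch tensors of shape (k_d, N)."""
    layers = make_unmixing([X.shape[0] for X in X_list])
    opt = torch.optim.LBFGS(layers.parameters(), max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = neg_objective([l.weight for l in layers], X_list, c, lam)
        loss.backward()
        return loss

    opt.step(closure)
    return [l.weight.detach() for l in layers]   # estimated orthogonal unmixing matrices
```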
Figure 2: Comparison of ShIndICA (this paper) to ShICA, Infomax, GroupICA, MultiViewICA and ShICA-ML. The datasets come from two different views with a total number of 100 sources and sample size 1000. We vary the true number of shared sources from 10 to 100 (x-axis), which is assumed known to the user before training. We compute the Amari distance (y-axis) between the estimated unmixing matrices and the ground truth (the lower the better) in each case. ShIndICA consistently outperforms all baselines.
Baselines Implementation. We compare ShIndICA to the standard single-view ICA method Infomax [Ablin et al., 2018]. To adapt it to the multi-view setting, we run Infomax on each view separately and then apply the Hungarian algorithm [Kuhn and Yaw, 1955] to match components from different views based on their cross-correlation. For the shared response model settings, ShIndICA is compared to related methods such as MultiViewICA [Richard et al., 2020], ShICA, ShICA-ML [Richard et al., 2021], and GroupICA as proposed by Richard et al. [2020]. The latter involves a two-step pre-processing procedure: first whitening the data in the single views and then performing dimensionality reduction on the joint views. For the data integration experiment we use a method based on partial least squares estimation, closely related to CCA, that extracts between-view correlated components and view-specific ones. This method is provided by the OmicsPLS R package [Bouhaddani et al., 2018] and is specifically developed for the integration of omics data. We refer to this method as PLS.
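A sketch of the matching step used for the Infomax baseline, for two views, using SciPy's assignment solver; the cost is the negative absolute cross-correlation between estimated components (function and variable names are ours).

```python
# Sketch: pair independently estimated components across two views by maximizing
# absolute cross-correlation via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(S_ref, S_other):
    """S_ref, S_other: (k, N) estimated sources from two views.
    Returns the rows of S_other permuted to align with S_ref."""
    k = S_ref.shape[0]
    corr = np.corrcoef(S_ref, S_other)[:k, k:]          # cross-correlation block
    row, col = linear_sum_assignment(-np.abs(corr))     # maximize total |correlation|
    return S_other[col]
```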
7.1 SYNTHETIC EXPERIMENTS
Data Simulation. We simulated the data using the Laplace distribution $\exp(-\frac{1}{2}|x|)$, and the mixing matrices were sampled with normally distributed entries with mean $1$ and standard deviation $0.1$. The realizations of the observed views were obtained according to the proposed model. In the different scenarios described below we vary the noise distribution. We conducted each experiment $50$ times and report error bars in all figures where applicable. Additional experiments are provided in Appendix D.2.
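The Amari distance used to score Figure 2 can be computed as in the sketch below; this is one common normalization of the metric (it is zero exactly when $WA$ is a scaled permutation matrix), written as an illustration rather than the exact implementation used in the paper.

```python
# Sketch of the Amari distance between an estimated unmixing matrix W and the
# true mixing matrix A; lower is better, 0 iff W @ A is a scaled permutation.
import numpy as np

def amari_distance(W, A):
    P = np.abs(W @ A)
    row = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    col = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (row.sum() + col.sum()) / (2 * P.shape[0])
```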