Multi-View Independent Component Analysis
with Shared and Individual Sources
Teodora Pandeva¹,²    Patrick Forré¹
¹AI4Science, AMLab, University of Amsterdam
²Swammerdam Institute for Life Sciences, University of Amsterdam
Abstract
Independent component analysis (ICA) is a blind source separation method for linear disentanglement of independent latent sources from observed data. We investigate the special setting of noisy linear ICA where the observations are split among different views, each receiving a mixture of shared and individual sources. We prove that the corresponding linear structure is identifiable, and that the source distributions can be recovered. To computationally estimate the sources, we optimize a constrained form of the joint log-likelihood of the observed data among all views. We also show empirically that our objective recovers the sources even when the measurements are corrupted by noise. Furthermore, we propose a model selection procedure for recovering the number of shared sources, which we verify empirically. Finally, we apply the proposed model to a challenging real-life application, where the shared sources estimated from two large transcriptome datasets (observed data) provided by two different labs (two different views) yield a plausible representation of the underlying graph structure.
1 INTRODUCTION
Independent component analysis (ICA) is a method for solving blind source separation (BSS) problems [Comon, 1994], where the goal is to separate independent latent sources from mixed observed signals and, thus, uncover essential structures in various data types. Historically, linear ICA has proven to be a successful approach for recovering spatially independent sources representing brain activity regions from magnetoencephalography (MEG) data [Vigário et al., 1997] or functional MRI (fMRI) data [McKeown and Sejnowski, 1998]. The utility of ICA is not limited to neuroscience; it has a wide range of applications in omics data analysis, e.g. [Zheng et al., 2008, Nazarov et al., 2019, Zhou and Altman, 2018, Tan et al., 2020, Urzúa-Traslaviña et al., 2021, Rusan et al., 2020, Cary et al., 2020, Dubois et al., 2019, Aynaud et al., 2020]. In these works, the interpretation of the latent sources relies on the assumption that each experimental outcome is a linear mixture of independent biological processes (the sources). For example, the latent sources could represent gene profiles that are used to predict gene regulation [Sastry et al., 2021, 2019] or cell-type specific expressions from tumor samples [Avila Cobos et al., 2018] for studying cell-type decompositions in cancer research.

The fast advancement of technology in the biomedical domain has provided a unique opportunity to find valuable insights from large-scale data integration studies. Many of these applications can be transformed into multi-view BSS problems. A significant body of research has been devoted to developing multi-view ICA methods focused on unraveling group-level (shared) brain activity patterns in multi-subject fMRI and EEG datasets [Salman et al., 2019, Huster et al., 2015, Congedo et al., 2010, Durieux et al., 2019, Calhoun et al., 2001]. However, these methods cannot be applied directly to problems where one is interested in retrieving both shared and view-specific signals, e.g. investigating individual-specific brain functions (view-specific) and shared phenotype patterns in individuals' brain activity in a natural stimuli experiment [Dubois et al., 2016, Bartolomeo et al., 2017]. Another application, where the estimation of both shared and view-specific sources is essential, is omics data integration. A typical example is combining heterogeneous gene expression datasets for achieving better gene regulation discovery. In this scenario, the observed samples are realizations of diverse and complex experiments. The shared information between the datasets refers to genes with stable expression across almost all conditions, and the individual signals represent experiment-specific gene activities such as measurements of gene knock-outs, stress conditions, etc.
Summary. To address these and similar scientific applications, we formalize the described multi-view BSS problem as a linear noisy generative model for a multi-view data regime, assuming that the mixing matrix and the number of individual sources are view-specific. We call the resulting model ShIndICA. By requiring that the sources are non-Gaussian and mutually independent and that the linear mixing matrices have full column rank, we provide identifiability guarantees for the mixing matrices and for the latent sources in distribution. We maximize the joint log-likelihood of the observed views to estimate the mixing matrices. Furthermore, we provide a model selection criterion for selecting the correct number of shared sources. Finally, we apply ShIndICA to a data integration problem of two large transcriptome datasets. We show empirically that our method works well compared to the baselines when the estimated components are used for a graph inference task.
Contributions. Our contributions can be summarized as follows:

1. We propose a new multi-view generative BSS model with shared and individual sources, called ShIndICA.
2. We provide theoretical guarantees for the identifiability of the recovered linear structure and of the source and noise distributions.
3. We derive the closed-form joint likelihood of ShIndICA, which is used for estimating the mixing matrices.
4. We propose a selection criterion for inferring the correct number of shared sources, derived from the generative model assumptions.
2 PROBLEM FORMALIZATION
Consider the following $D$-view multivariate linear BSS model where for $d \in \{1, \ldots, D\}$
$$x_d = A_d(\tilde{s}_d + \epsilon_d) = A_{d,0}\, s_0 + A_{d,1}\, s_d + A_d\, \epsilon_d, \qquad (1)$$
and it holds that

1. $x_d \in \mathbb{R}^{k_d}$ is a random vector with $\mathbb{E}[x_d] = 0$,
2. $\tilde{s}_d = (s_0^\top, s_d^\top)^\top$ are latent non-Gaussian random sources with $s_0 \in \mathbb{R}^c$ and $s_d \in \mathbb{R}^{k_d - c}$ being the shared and individual sources, and $\mathbb{E}[\tilde{s}_d] = 0$ and $\mathrm{Var}[\tilde{s}_d] = I_{k_d}$,
3. $A_d \in \mathbb{R}^{k_d \times k_d}$ is a mixing matrix with full column rank, where $A_{d,0}$ and $A_{d,1}$ are the columns corresponding to the shared and individual sources,
4. $\epsilon_d \sim \mathcal{N}(0, \sigma^2 I_{k_d})$ is Gaussian noise,
5. all latent source components and noise variables are mutually independent.
Figure 1: A graphical representation of Equation 1, where $x_d$ is the observed variable, $s_0$ denotes the shared sources, $s_d$ the view-specific ones, and $\epsilon_d$ is the Gaussian noise (plate over $d = 1, \ldots, D$).
Note that for $D = 1$ the model becomes a standard linear ICA model, which is solved by Comon [1994], Hyvärinen and Oja [2000], Bell and Sejnowski [1995] for independent non-Gaussian latent sources $z := \tilde{s}_1 + \epsilon_1$. The Gaussian noise in Equation 1 can be interpreted as a measurement error on the device with variance $\sigma^2 A_d A_d^\top$ (similarly to [Richard et al., 2020, 2021]). We choose this setting over $A_d \tilde{s}_d + \epsilon_d$ because we can derive a joint data likelihood in closed form (see Section 4), which is not available in the latter representation. Moreover, assumption 5 implies that the noise is not expected to influence the true signal and vice versa, which is a common assumption in measurement error models, known as classical errors. See Figure 1 for a graphical representation of Equation 1.
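As a concrete illustration of the generative process in Equation 1, the following NumPy sketch samples $D$ views with shared and view-specific sources. The Laplace source distribution, the dimensions and the noise level are illustrative choices (the model only requires non-Gaussian, zero-mean, unit-variance sources), and all variable names are ours.

```python
# Minimal sketch of sampling from Equation (1); distributions and sizes are
# illustrative, not prescribed by the model.
import numpy as np

rng = np.random.default_rng(0)
D, c, N = 3, 5, 1000                 # number of views, shared sources, samples
k = [20, 25, 30]                     # total number of sources per view (k_d)
sigma = 0.1                          # noise standard deviation

s0 = rng.laplace(scale=1 / np.sqrt(2), size=(c, N))   # shared sources, unit variance

views, mixings = [], []
for d in range(D):
    sd = rng.laplace(scale=1 / np.sqrt(2), size=(k[d] - c, N))  # individual sources
    s_tilde = np.vstack([s0, sd])                    # tilde s_d = (s_0^T, s_d^T)^T
    eps = sigma * rng.standard_normal((k[d], N))     # eps_d ~ N(0, sigma^2 I)
    A = rng.standard_normal((k[d], k[d]))            # full-column-rank mixing (a.s.)
    views.append(A @ (s_tilde + eps))                # x_d = A_d (tilde s_d + eps_d)
    mixings.append(A)
```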
3 IDENTIFIABILITY RESULTS
In unsupervised machine learning methods, the reliability of the algorithm cannot be directly verified outside of simulations due to the non-existence of labels. For this reason, theoretical guarantees are necessary to trust that the algorithm estimates the quantities of interest. For a BSS problem solution, such as ICA, we want the sources and mixing matrices to be (up to certain equivalence relations) unambiguously determined (or identifiable) by the data, at least in the large sample limit.

Identifiability results for noiseless single-view ICA are proved by Comon [1994]. It turns out that if at most one of the latent sources is normal and the mixing matrix is invertible, then both the mixing matrix and the sources can be recovered almost surely up to permutation, sign and scaling. However, this result does not hold in the general additive noise setting. Davies [2004] shows that if the mixing matrix has full column rank, then the structure is identifiable, but not the latent sources.
By employing the multi-view ($D \geq 2$) noisy setting inspired by our model (see Equation 1), we extend the results of Comon [1994], Davies [2004], Kagan et al. [1973], Richard et al. [2020]. Compared to previous work, we provide identifiability guarantees not only for the mixing matrices up to sign and permutation, but also for the source and noise distributions (up to the same sign and permutation), and for the dimensions of the latent (both shared and individual) sources.¹ Moreover, our identifiability results hold for a more general case than Equation 1, since the noise distribution can be view-specific and the mixing matrices can be non-square. This is stated in the following Theorem 3.1, proved in Appendix A:
Theorem 3.1. Let $x_1, \ldots, x_D$ for $D \geq 2$ be random vectors with the following two representations:
$$A^{(1)}_d\left(\begin{bmatrix} s^{(1)}_0 \\ s^{(1)}_d \end{bmatrix} + \epsilon^{(1)}_d\right) = x_d = A^{(2)}_d\left(\begin{bmatrix} s^{(2)}_0 \\ s^{(2)}_d \end{bmatrix} + \epsilon^{(2)}_d\right),$$
where $d \in \{1, \ldots, D\}$, with the following properties for $i = 1, 2$:

1. $A^{(i)}_d \in \mathbb{R}^{p_d \times k^{(i)}_d}$ is a (non-random) matrix with full column rank, i.e. $\mathrm{rank}(A^{(i)}_d) = k^{(i)}_d$,
2. $\epsilon^{(i)}_d \in \mathbb{R}^{k^{(i)}_d}$ and $\epsilon^{(i)}_d \sim \mathcal{N}(0, \sigma^{(i)2}_d I_{k^{(i)}_d})$ is a $k^{(i)}_d$-variate normal random variable,
3. $\tilde{s}^{(i)}_d = (s^{(i)\top}_0, s^{(i)\top}_d)^\top$ with $s^{(i)}_0 \in \mathbb{R}^{c^{(i)}}$ and $s^{(i)}_d \in \mathbb{R}^{k^{(i)}_d - c^{(i)}}$ is a random vector such that:
   (a) the components of $\tilde{s}^{(i)}_d$ are mutually independent and each of them is a.s. a non-constant random variable,
   (b) $\tilde{s}^{(i)}_d$ is non-normal with $0$ mean and unit variance,
4. $\epsilon^{(i)}_d$ is independent from $s^{(i)}_0$ and $s^{(i)}_d$: $\epsilon^{(i)}_d \perp\!\!\!\perp s^{(i)}_0$ and $\epsilon^{(i)}_d \perp\!\!\!\perp s^{(i)}_d$.

Then, the number of shared sources is identifiable, i.e. $c^{(1)} = c^{(2)} =: c$, and for all $d = 1, \ldots, D$ we get that $k^{(1)}_d = k^{(2)}_d =: k_d$, and there exist a sign matrix $\Gamma_d$ and a permutation matrix $P_d \in \mathbb{R}^{k_d \times k_d}$ such that
$$A^{(2)}_d = A^{(1)}_d P_d \Gamma_d,$$
and furthermore the source and noise distributions are identifiable, i.e.
$$\begin{bmatrix} s^{(2)}_0 \\ s^{(2)}_d \end{bmatrix} \sim \Gamma^{-1}_d P^{-1}_d \begin{bmatrix} s^{(1)}_0 \\ s^{(1)}_d \end{bmatrix}, \qquad \sigma^{(2)}_d = \sigma^{(1)}_d.$$
Note that the requirement $D \geq 2$ is essential for the identifiability of the non-Gaussian latent source and noise distributions. In contrast, in the single-view case, Kagan et al. [1973] show that we cannot identify an arbitrary non-Gaussian source distribution unless we impose an additional constraint on the latent sources to have non-normal components (e.g., see Theorem A.2).²

¹Note that the identifiability of the source distributions is a weaker notion of identifiability than the almost sure one (i.e. recovering the exact sources) in the noiseless case [Comon, 1994].

²A random variable $x$ is said to have non-normal components if for every representation $x \sim v + w$ with $v \perp\!\!\!\perp w$, both $v$ and $w$ are non-normal.
Moreover, a necessary assumption for the identifiability of the linear structure is the non-normality of the latent sources, which is a standard assumption in the ICA literature [Comon, 1994], as stated above. In the more restrictive multi-view shared ICA case, Richard et al. [2021] show that the sources can be Gaussian if additional assumptions are imposed on the diversity of the noise distributions. However, this is not applicable in our case, since we do not make these assumptions for our model.
4 JOINT DATA LOG-LIKELIHOOD
Here, we derive the joint log-likelihood of the observed views, which we use for estimating the mixing matrices. Following standard ICA approaches [Bell and Sejnowski, 1995, Hyvärinen and Oja, 2000], instead of optimizing directly for the mixing matrices $A_d$, we estimate their inverses $W_d = A_d^{-1}$, called unmixing matrices.

Let $z_d := W_d x_d = \tilde{s}_d + \epsilon_d$, and $z_{d,0} := s_0 + \epsilon_{d,0} \in \mathbb{R}^c$ and $z_{d,1} := s_d + \epsilon_{d,1} \in \mathbb{R}^{k_d - c}$, i.e. $z_d = (z_{d,0}^\top, z_{d,1}^\top)^\top$ are the estimated noisy sources of the $d$-th view. Furthermore, let $p_{Z_{d,1}}$ be the probability distribution of $z_{d,1}$ and $|W_d| = |\det W_d|$. Then we can derive the data log-likelihood of Equation 1 for $N$ observed samples per view (proved in Appendix B), which is given by
$$\begin{aligned}
\mathcal{L}(W_1, \ldots, W_D) = \sum_{i=1}^{N} &\Big( \log f(\bar{s}^i_0) + \sum_{d=1}^{D} \log p_{Z_{d,1}}(z^i_{d,1}) \Big) \\
&- \frac{1}{2\sigma^2} \sum_{d=1}^{D} \Big( \mathrm{trace}(Z_{d,0} Z_{d,0}^\top) - \frac{1}{D} \sum_{l=1}^{D} \mathrm{trace}(Z_{d,0} Z_{l,0}^\top) \Big) \\
&+ N \sum_{d=1}^{D} \log |W_d| + C, \qquad (2)
\end{aligned}$$
where $Z_{d,0} \in \mathbb{R}^{c \times N}$ for $d = 1, \ldots, D$ is the data matrix that stores the $N$ observations of $z_{d,0}$. We estimate the shared sources via $\bar{s}^i_0 = \sum_{d=1}^{D} z^i_{d,0} / D$, with probability distribution $f(\bar{s}_0) = \int \exp\!\big(-\tfrac{D \|s_0 - \bar{s}_0\|^2}{2\sigma^2}\big)\, p_{S_0}(s_0)\, ds_0$.
We further simplify the loss function by assuming that the data matrices $X_1 \in \mathbb{R}^{k_1 \times N}, \ldots, X_D \in \mathbb{R}^{k_D \times N}$ are whitened. This consists of linearly transforming the realizations of the random variables $x_d$ such that the resulting variable $\tilde{x}_d = K_d x_d$ has uncorrelated components with unit variance, i.e. $\mathbb{E}[\tilde{x}_d \tilde{x}_d^\top] = I_{k_d}$, where $K_d$ is the whitening matrix. This step transforms the mixing matrix into an orthogonal one, $\tilde{A}_d$.
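For concreteness, a minimal sketch of such a whitening step (here via the eigendecomposition of the sample covariance, i.e. PCA whitening; the function name is illustrative, not from the paper's code):

```python
# Sketch of a whitening matrix K_d built from the eigendecomposition of the
# sample covariance, so that the whitened data have (approximately) identity covariance.
import numpy as np

def whiten(X):
    """X: (k_d, N) centered data matrix. Returns whitened data and K_d."""
    cov = X @ X.T / X.shape[1]
    eigval, eigvec = np.linalg.eigh(cov)
    K = np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T   # whitening matrix K_d (assumes eigval > 0)
    return K @ X, K

# Sanity check (for a sample X of shape (k, N)):
# Xw, K = whiten(X); assert np.allclose(Xw @ Xw.T / X.shape[1], np.eye(X.shape[0]), atol=1e-6)
```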
In the new optimization problem after whitening, we aim to find orthogonal unmixing matrices $\tilde{W}_d = \tilde{A}_d^\top$ that maximize the transformed data log-likelihood:
$$\begin{aligned}
\mathcal{L}(\tilde{W}_1, \ldots, \tilde{W}_D) \propto \sum_{i=1}^{N} &\Big( \log f_\sigma(\bar{\tilde{s}}^i_0) + \sum_{d=1}^{D} \log p_{\tilde{Z}_{d,1}}(\tilde{z}^i_{d,1}) \Big) \\
&+ \frac{1 + \sigma^2}{2D\sigma^2} \sum_{d=1}^{D} \sum_{l=1}^{D} \mathrm{trace}(\tilde{Z}_{d,0} \tilde{Z}_{l,0}^\top), \qquad (3)
\end{aligned}$$
where, analogously to Equation 2, $\tilde{z}_d = (\tilde{z}_{d,0}^\top, \tilde{z}_{d,1}^\top)^\top = \tilde{W}_d \tilde{x}_d$ and $\bar{\tilde{s}}^i_0 = \sum_{d=1}^{D} \tilde{z}^i_{d,0} / D$. Note that after whitening, $\mathrm{trace}(\tilde{Z}_{d,0} \tilde{Z}_{d,0}^\top)$ and $|\tilde{W}_d| = 1$ are constants and thus vanish from Equation 3 (see Appendix B for detailed derivations). The first line of Equation 3 represents the source log-likelihoods, and the second line plays the role of a regularization term for finding the shared information between the views. In our work, Equation 3 is used for parameter estimation, where both the density of the shared sources $f_\sigma(\cdot)$ and the densities of the individual sources $p_{\tilde{Z}_{d,1}}$ are approximated by a nonlinear function $g(s)$, e.g. $g(s) = \log\cosh(s)$ for super-Gaussian or $g(s) = -e^{-s^2/2}$ for sub-Gaussian sources.
Moreover, we treat the noise variance $\sigma^2$ as a Lagrange multiplier via the relation $\lambda = \frac{1 + \sigma^2}{\sigma^2}$. Finally, after training we compute the mixing matrices $\hat{A}_d$ by setting $\hat{A}_d = K_d^{-1} \tilde{W}_d$. Thus, we recover the true mixing matrices $A_d$ up to scaling with $(1 + \sigma^2)^{\frac{1}{2}}$, sign and column permutation.
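The following PyTorch sketch shows how an objective of the form of Equation 3 can be evaluated. It uses $-\log\cosh$ as the super-Gaussian log-density surrogate (a standard maximum-likelihood ICA choice, our sign convention) and a fixed multiplier $\lambda$; all function and variable names are ours, not taken from the paper's released code.

```python
# Hedged sketch of the whitened objective in Equation (3): source densities are
# replaced by a -log cosh contrast; the coupling term carries lambda = (1+sigma^2)/sigma^2.
import torch

def neg_objective(W_list, X_list, c, lam=1.0):
    """W_list: orthogonal unmixing matrices (k_d, k_d); X_list: whitened views (k_d, N)."""
    Z = [W @ X for W, X in zip(W_list, X_list)]
    Z0 = [z[:c] for z in Z]                    # estimated (noisy) shared sources
    Z1 = [z[c:] for z in Z]                    # estimated individual sources
    s0_bar = torch.stack(Z0).mean(dim=0)       # averaged shared sources across views

    # log cosh surrogate (unsafeguarded; may overflow for very large inputs)
    loglik = -torch.log(torch.cosh(s0_bar)).sum()
    loglik += sum(-torch.log(torch.cosh(z)).sum() for z in Z1)

    D = len(Z0)
    # sum_{d,l} trace(Z_{d,0} Z_{l,0}^T) computed as elementwise products
    coupling = sum((z_d * z_l).sum() for z_d in Z0 for z_l in Z0)
    return -(loglik + lam / (2 * D) * coupling)   # negative objective for minimization
```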
5 MODEL SELECTION
By leveraging the data generation model assumptions, we can select the number of shared sources $c$ in a completely unsupervised way. More precisely, let $k < k_d$ for all $d = 1, \ldots, D$ be a candidate for $c$, which is unknown. Under the assumption that $k$ is a correct guess (i.e. $k = c$), our generative model yields that $\hat{z}_d = z_{d,0} - \frac{1}{D}\sum_{l=1}^{D} z_{l,0}$ is normally distributed with $0$ mean and variance $\frac{D-1}{D}\sigma^2 I_k$ for each $d = 1, \ldots, D$, where $z_{d,0}$ is defined as in Equation 2. We propose an evaluation metric, called the normalized reconstruction error (NRE), defined by the following relation to the log-likelihood of $\hat{z}_d$:
$$\mathrm{NRE}(k) := \sum_{d=1}^{D} \frac{\|\hat{z}_d\|^2}{k} \;\propto\; -\sum_{d=1}^{D} \frac{\log p(\hat{z}_d)}{k} = \sum_{d=1}^{D} \left( \frac{D\|\hat{z}_d\|^2}{2(D-1)\sigma^2 k} + \frac{k \log(2\pi\sigma^2)}{2k} \right).$$
The two quantities differ by translation and multiplication with constants that do not depend on the parameter of interest $k$. Thus, by minimizing $\mathrm{NRE}(k)$ we maximize the sum of the (normalized) log-likelihoods of the normal variables $\hat{z}_d$. Intuitively, due to the normalization, $\mathrm{NRE}(k)$ can be interpreted as the average reconstruction error over the $k$ shared sources (summed over all views). This allows for a fair comparison of the NRE scores for different $k$.
We select an optimal parameter by employing the following procedure. First, we split the data (for each view) into two disjoint sets $\mathcal{D}_0$ and $\mathcal{D}_1$, with not necessarily the same sample sizes. We estimate the unmixing matrices for a fixed $k$ on $\mathcal{D}_0$ (the train set) and estimate the shared sources on the test data $\mathcal{D}_1$. Then we compute the mean $\mathrm{NRE}(k)$ on the recovered test shared sources (not on the train set, due to possible overfitting; see Section 7). We repeat this for various $k$ and choose the maximum of all $k$'s that minimize the NRE, i.e. $k = \max\{\arg\min_k \overline{\mathrm{NRE}}(k)\}$, where
$$\overline{\mathrm{NRE}}(k) = \frac{1}{|\mathcal{D}_1|} \sum_{i \leq |\mathcal{D}_1|} \mathrm{NRE}(k)^i = \frac{1}{|\mathcal{D}_1|} \sum_{i \leq |\mathcal{D}_1|} \sum_{d=1}^{D} \frac{\|\hat{z}^i_d\|^2}{k}$$
is the average NRE score over all observed test samples in $\mathcal{D}_1$.
The NRE score serves as a goodness-of-fit measure and indicates how well the true shared sources are reconstructed from the test data. Due to the model fitting, we can obtain high-quality shared sources even when $k \ll c$, as we demonstrate empirically. Thus, we prefer to select the highest possible $k$ for which the average shared-source reconstruction error is minimal.
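A minimal sketch of this selection procedure is given below. It assumes a hypothetical `fit(train_views, k)` routine returning per-view unmixing matrices; none of these names come from the paper's code.

```python
# Sketch of NRE-based selection of the number of shared sources k.
import numpy as np

def nre(test_views, W_list, k):
    """Average NRE over the test samples for a candidate number k of shared sources."""
    Z0 = [(W @ X)[:k] for W, X in zip(W_list, test_views)]   # noisy shared sources, (k, N_test)
    z_mean = np.mean(Z0, axis=0)                              # average over views
    # hat z_d = z_{d,0} - mean_l z_{l,0}; squared norm per sample, normalized by k, summed over views
    return sum(np.mean(np.sum((z0 - z_mean) ** 2, axis=0)) / k for z0 in Z0)

# Candidate grid; pick the largest k among the (near-)minimizers of the NRE curve:
# candidates = range(2, 50, 2)
# scores = {k: nre(test_views, fit(train_views, k), k) for k in candidates}
# k_hat = max(k for k in candidates if np.isclose(scores[k], min(scores.values())))
```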
6 RELATED WORK
The existing body of work on linear multi-view BSS, inspired by the ICA literature, considers mostly shared response model applications (i.e., no individual sources), some adopting a maximum likelihood approach [Guo and Pagnoni, 2008, Richard et al., 2020, 2021] to model the noisy views. Other methods, such as independent vector analysis (IVA), relax the assumption about the shared sources by assuming that they have the same first or higher order moments across views [Lee et al., 2008, Anderson et al., 2011, 2014, Engberg et al., 2016, Vía et al., 2011]. Many of these approaches, such as Group ICA [Calhoun et al., 2001], shared response ICA (SR-ICA) [Zhang et al., 2016], MultiViewICA [Richard et al., 2020], and ShICA [Richard et al., 2021], incorporate a dimensionality reduction step for every view (CCA [Varoquaux et al., 2009, Richard et al., 2021] or PCA) to extract the mutual signal between the multiple objects before applying an ICA procedure on the reduced data. However, there are no guarantees that the pre-processing procedure will entirely remove the influence of the object-specific sources on the transformed data. In the ICA literature, there exist three methods for extracting shared and individual sources from data. Maneshi et al. [2016] propose a heuristic way of using FastICA for the given task without discussing the identifiability of the results; Long et al. [2020] suggest applying ICA on each view separately, followed by statistical analysis to separate the individual from the shared sources; and Lukic et al. [2002] exploit temporal correlations rather than the non-Gaussianity of the sources, so their method is not applicable in the context we are considering.

A common tool for analyzing multi-view data is canonical correlation analysis (CCA), initially proposed by Hotelling [1936]. It finds projections of two datasets that maximize the correlation between the projected variables. Gaussian-CCA [Bach et al., 2005], its kernelized version [Bach et al., 2002] and deep learning [Andrew et al., 2013] formulations of the classical CCA problem aim to recover shared latent sources of variation from the multiple views. There are extensions of CCA that model the observed variables as a linear combination of group-specific and dataset-specific latent variables, estimated with Bayesian inference methods [Klami et al., 2013] or exponential families with MCMC inference [Virtanen, 2010]. However, most of them assume that the latent sources are Gaussian or non-linearly related to the observed data [Wang et al., 2016] and thus lack identifiability results. Existing non-linear multi-view versions such as [Tian et al., 2020, Federici et al., 2020] cannot recover both shared and individual signals across multiple measurements and do not ensure the identifiability of the proposed generative models. There are identifiable deep non-linear versions of ICA (e.g. [Hyvärinen et al., 2019]) which could be employed for this task. However, their assumptions for achieving identifiability are often hard to satisfy in real-life applications, especially in biomedical domains with low-data regimes.
7 EXPERIMENTS
Model Implementation and Training. We used the Python library pytorch [Paszke et al., 2017] to implement our method. We model each view with a separate unmixing matrix. To impose orthogonality constraints on the unmixing matrices, we made use of the geotorch library, an extension of pytorch [Lezcano-Casado, 2019]. The gradient-based method applied for training is L-BFGS. Before running any of the ICA-based methods (ours or the baselines), we whiten every single view by performing PCA to speed up computation. We estimate the mixing matrix up to scale (due to the whitening) and permutation (see Sections 3 and 4). To force the algorithm to output the shared sources in the same order across all views, we initialize the unmixing matrices by means of CCA. This works because the CCA weights are orthogonal matrices, and the transformed views' components are paired and ordered across views. For all conducted experiments, we fixed the parameter $\lambda$ from Equation 3 to 1.
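A minimal sketch of this training setup is given below, assuming the `neg_objective` loss from the Section 4 sketch is in scope and that the views are already whitened torch tensors; the CCA initialization is omitted, and all names are illustrative rather than taken from the released code.

```python
# Sketch: one orthogonal unmixing matrix per view (via geotorch) trained with L-BFGS.
import torch
import geotorch

def make_unmixing(k_list):
    layers = torch.nn.ModuleList(torch.nn.Linear(k, k, bias=False) for k in k_list)
    for layer in layers:
        geotorch.orthogonal(layer, "weight")   # constrain each W_d to be orthogonal
    return layers

def train(X_list, c, lam=1.0, max_iter=200):
    """X_list: whitened views as torch tensors of shape (k_d, N)."""
    layers = make_unmixing([X.shape[0] for X in X_list])
    opt = torch.optim.LBFGS(layers.parameters(), max_iter=max_iter)

    def closure():
        opt.zero_grad()
        loss = neg_objective([l.weight for l in layers], X_list, c, lam)
        loss.backward()
        return loss

    opt.step(closure)
    return [l.weight.detach() for l in layers]   # estimated orthogonal unmixing matrices
```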
Figure 2: Comparison of ShIndICA (this paper) to ShICA, Infomax, GroupICA, MultiViewICA and ShICA-ML. The datasets come from two different views with a total number of 100 sources and sample size 1000. We vary the true number of shared sources from 10 to 100 (x-axis), which is assumed known to the user before training. We compute the Amari distance (y-axis) between the estimated unmixing matrices and the ground truth (the lower the better) in each case. ShIndICA consistently outperforms all baselines.
Baselines Implementation. We compare ShIndICA to the standard single-view ICA method Infomax [Ablin et al., 2018]. To adapt it to the multi-view setting, we run Infomax on each view separately and then apply the Hungarian algorithm [Kuhn and Yaw, 1955] to match components from different views based on their cross-correlation. For the shared response model settings, ShIndICA is compared to related methods such as MultiViewICA [Richard et al., 2020], ShICA, ShICA-ML [Richard et al., 2021], and GroupICA as proposed by Richard et al. [2020]. The latter involves a two-step pre-processing procedure: first whitening the data in the single views and then performing dimensionality reduction on the joint views. For the data integration experiment we use a method based on partial least squares estimation, closely related to CCA, that extracts between-view correlated components and view-specific ones. This method is provided by the OmicsPLS R package [Bouhaddani et al., 2018] and is specifically developed for the integration of omics data. We refer to this method as PLS.
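A sketch of the matching step used for the Infomax baseline, for two views, using SciPy's assignment solver; the cost is the negative absolute cross-correlation between estimated components (function and variable names are ours).

```python
# Sketch: pair independently estimated components across two views by maximizing
# absolute cross-correlation via the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(S_ref, S_other):
    """S_ref, S_other: (k, N) estimated sources from two views.
    Returns the rows of S_other permuted to align with S_ref."""
    k = S_ref.shape[0]
    corr = np.corrcoef(S_ref, S_other)[:k, k:]          # cross-correlation block
    row, col = linear_sum_assignment(-np.abs(corr))     # maximize total |correlation|
    return S_other[col]
```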
7.1 SYNTHETIC EXPERIMENTS
Data Simulation. We simulated the data using the Laplace distribution $\exp(-\frac{1}{2}|x|)$, and the mixing matrices were sampled with normally distributed entries with mean $1$ and standard deviation $0.1$. The realizations of the observed views were obtained according to the proposed model. In the different scenarios described below we vary the noise distribution. We conducted each experiment $50$ times and report error bars in all figures where applicable. Additional experiments are provided in Appendix D.2.
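The Amari distance used to score Figure 2 can be computed as in the sketch below; this is one common normalization of the metric (it is zero exactly when $WA$ is a scaled permutation matrix), written as an illustration rather than the exact implementation used in the paper.

```python
# Sketch of the Amari distance between an estimated unmixing matrix W and the
# true mixing matrix A; lower is better, 0 iff W @ A is a scaled permutation.
import numpy as np

def amari_distance(W, A):
    P = np.abs(W @ A)
    row = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1
    col = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1
    return (row.sum() + col.sum()) / (2 * P.shape[0])
```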