LEARNING LATENT STRUCTURAL CAUSAL MODELS
Jithendaraa Subramanian
Mila, McGill University
Yashas Annadani
KTH Stockholm
Ivaxi Sheth
Mila, ÉTS Montréal
Nan Rosemary Ke
Mila, DeepMind
Tristan Deleu
Mila, Université de Montréal
Stefan Bauer
KTH Stockholm
Derek Nowrouzezahrai
Mila, McGill University
Samira Ebrahimi Kahou
Mila, ÉTS Montréal, CIFAR AI Chair
ABSTRACT
Causal learning has long concerned itself with the accurate recovery of underlying causal mechanisms. Such causal modelling enables better explanations of out-of-distribution data. Prior works on causal learning assume that the high-level causal variables are given. However, in machine learning tasks, one often operates on low-level data like image pixels or high-dimensional vectors. In such settings, the entire Structural Causal Model (SCM) – structure, parameters, and high-level causal variables – is unobserved and needs to be learnt from low-level data. We treat this problem as Bayesian inference of the latent SCM, given low-level data. For linear Gaussian additive noise SCMs, we present a tractable approximate inference method which performs joint inference over the causal variables, structure and parameters of the latent SCM from random, known interventions. Experiments are performed on synthetic datasets and a causally generated image dataset to demonstrate the efficacy of our approach. We also perform image generation from unseen interventions, thereby verifying out-of-distribution generalization for the proposed causal model.
1 INTRODUCTION
Learning variables of interest and uncovering causal dependencies is crucial for intelligent systems
to reason and predict in scenarios that differ from the training distribution. In the causality literature,
causal variables and mechanisms are often assumed to be known. This knowledge enables reasoning
and prediction under unseen interventions. In machine learning, however, one does not have direct
access to the underlying variables of interest nor the causal structure and mechanisms corresponding
to them. Rather, these have to be learned from observed low-level data like pixels of an image which
are usually high-dimensional. Having a learned causal model can then be useful for generalizing to
out-of-distribution data (Scherrer et al., 2022; Ke et al., 2021), estimating the effect of interventions (Pearl, 2009; Schölkopf et al., 2021), disentangling underlying factors of variation (Bengio et al., 2012; Wang and Jordan, 2021), and transfer learning (Schoelkopf et al., 2012; Bengio et al., 2019).
Structure learning (Spirtes et al.,2000;Zheng et al.,2018) learns the structure and parameters of
the Structural Causal Model (SCM) (Pearl,2009) that best explains some observed high-level causal
variables. In causal machine learning and representation learning, however, these causal variables
may no longer be observable. This serves as the motivation for our work. We address the problem of
learning the entire SCM – consisting of its causal variables, structure and parameters – which is latent,
by learning to generate observed low-level data. Since one often operates in low-data regimes or
non-identifiable settings, we adopt a Bayesian formulation so as to quantify epistemic uncertainty
over the learned latent SCM. Given a dataset, we use variational inference to learn a joint posterior
over the causal variables, structure and parameters of the latent SCM. To the best of our knowledge,
ours is the first work to address the problem of causal discovery in linear Gaussian latent SCMs from
low-level data, where causal variables are unobserved. Our contributions are as follows:
Correspondence to jithen.subra@gmail.com
Figure 1: Model architecture of the proposed generative model for the Bayesian latent causal discovery task to learn the latent SCM from low-level data.
• We propose a general algorithm for Bayesian causal discovery in the latent space of a generative model, learning a distribution over causal variables, structure and parameters in linear Gaussian latent SCMs with random, known interventions. Figure 1 illustrates an overview of the proposed method.
• By learning the structure and parameters of a latent SCM, we implicitly induce a joint distribution over the causal variables. Hence, sampling from this distribution is equivalent to ancestral sampling through the latent SCM. As such, we address a challenging, simultaneous optimization problem that is often encountered during causal discovery in latent space: one cannot find the right graph without the right causal variables, and vice versa.
• On a synthetically generated dataset and an image dataset used to benchmark causal model performance (Ke et al., 2021), we evaluate our method along three axes – uncovering causal variables, structure, and parameters – consistently outperforming baselines. We demonstrate its ability to perform image generation from unseen interventional distributions.
2 PRELIMINARIES
2.1 STRUCTURAL CAUSAL MODELS
Figure 2: BN for prior works in causal discovery and structure learning.
A Structural Causal Model (SCM) is defined by a set of equations which represent the mechanisms by which each endogenous variable $z_i$ depends on its direct causes $z^{\mathcal{G}}_{pa(i)}$ and a corresponding exogenous noise variable $\epsilon_i$. The direct causes are subsets of other endogenous variables. If the causal parent assignment is assumed to be acyclic, then an SCM is associated with a Directed Acyclic Graph (DAG) $\mathcal{G} = (V, E)$, where $V$ corresponds to the endogenous variables and $E$ encodes direct cause-effect relationships. The exact value taken on by a causal variable $z_i$ is given by a local causal mechanism $f_i$ conditional on $z^{\mathcal{G}}_{pa(i)}$, the parameters $\Theta_i$, and the node's noise variable $\epsilon_i$, as given in equation 1. For linear Gaussian additive noise SCMs with equal noise variance, i.e., the setting that we focus on in this work, all $f_i$ are linear functions, and $\Theta$ denotes the weighted adjacency matrix $W$, where each $W_{ji}$ is the edge weight from node $j$ to node $i$. The linear Gaussian additive noise SCM thus reduces to equation 2,

$$z_i = f_i\big(z^{\mathcal{G}}_{pa(i)}, \Theta_i, \epsilon_i\big), \quad (1) \qquad\qquad z_i = \sum_{j \in pa_{\mathcal{G}}(i)} W_{ji} \cdot z_j + \epsilon_i. \quad (2)$$
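To make equation 2 concrete, below is a minimal NumPy sketch of ancestral sampling from a linear Gaussian additive noise SCM with equal noise variance. It is illustrative rather than the paper's implementation: the function name, the assumption of a topological node ordering, and the example weights are not from the source.

```python
import numpy as np

def ancestral_sample(W, sigma=0.1, n_samples=1000, rng=None):
    """Sample z from a linear Gaussian additive-noise SCM (equation 2).

    W[j, i] is the edge weight from node j to node i. Nodes are assumed to be
    indexed in a topological order of the DAG, so W is strictly upper
    triangular, and all nodes share the same noise variance sigma**2.
    """
    rng = np.random.default_rng(rng)
    d = W.shape[0]
    eps = rng.normal(0.0, sigma, size=(n_samples, d))  # exogenous noise
    z = np.zeros((n_samples, d))
    for i in range(d):                       # visit nodes in topological order
        z[:, i] = z @ W[:, i] + eps[:, i]    # z_i = sum_j W_ji * z_j + eps_i
    return z

# Example: a 3-node chain z0 -> z1 -> z2 with edge weights 2.0 and -1.5.
W = np.array([[0.0, 2.0,  0.0],
              [0.0, 0.0, -1.5],
              [0.0, 0.0,  0.0]])
Z = ancestral_sample(W, sigma=0.1, n_samples=5000)
```

In this toy chain, the sampled columns of `Z` follow the mechanisms $z_1 = 2\,z_0 + \epsilon_1$ and $z_2 = -1.5\,z_1 + \epsilon_2$.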
2.2 CAUSAL DISCOVERY
Structure learning in prior work refers to learning a DAG according to some optimization criterion
with or without the notion of causality (e.g., He et al. (2019)). The task of causal discovery on
the other hand, is more specific in that it refers to learning the structure (also parameters, in some
cases) of SCMs, and subscribes to causality and interventions like that of Pearl (2009). That is, the
methods aim to estimate (G,Θ). These approaches often resort to modular likelihood scores over
causal variables – like the BGe score (Geiger and Heckerman,1994;Kuipers et al.,2022) and BDe
score (Heckerman et al.,1995) – to learn the right structure. However, these methods all assume a
dataset of observed causal variables. These approaches either obtain a maximum likelihood estimate,

$$\mathcal{G} = \arg\max_{\mathcal{G}} \; p(Z \mid \mathcal{G}) \quad \text{or} \quad (\mathcal{G}, \Theta) = \arg\max_{\mathcal{G}, \Theta} \; p(Z \mid \mathcal{G}, \Theta), \quad (3)$$
or, in the case of Bayesian causal discovery (Heckerman et al., 1997), variational inference is typically used to approximate the true posterior $p(\mathcal{G}, \Theta \mid Z)$ with a joint posterior distribution $q_\phi(\mathcal{G}, \Theta)$ by minimizing the KL divergence between the two,

$$D_{\mathrm{KL}}\big(q_\phi(\mathcal{G}, \Theta) \,\|\, p(\mathcal{G}, \Theta \mid Z)\big) = \log p(Z) - \mathbb{E}_{(\mathcal{G}, \Theta) \sim q_\phi}\left[\log p(Z \mid \mathcal{G}, \Theta) - \log \frac{q_\phi(\mathcal{G}, \Theta)}{p(\mathcal{G}, \Theta)}\right], \quad (4)$$

where $p(\mathcal{G}, \Theta)$ is a prior over the structure and parameters of the SCM – possibly encoding DAGness, sparse connections, or low-magnitude edge weights. Figure 2 shows the Bayesian Network (BN) over which inference is performed for causal discovery tasks.
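As a concrete toy instance of the maximum likelihood formulation in equation 3 (observed causal variables $Z$, linear Gaussian SCM, known shared noise scale), the sketch below scores a small hand-enumerated set of candidate DAGs over two nodes and picks the best-scoring $(\mathcal{G}, \Theta)$. The candidate set, the fixed noise scale, and all names are illustrative assumptions, not the method proposed in this paper.

```python
import numpy as np

def fit_weights_and_loglik(Z, parents, sigma=0.1):
    """Given a parent set per node, fit edge weights by least squares and
    return the fitted W together with the Gaussian log-likelihood
    log p(Z | G, Theta) under a fixed, shared noise scale sigma."""
    n, d = Z.shape
    W = np.zeros((d, d))
    loglik = 0.0
    for i in range(d):
        pa = parents[i]
        if pa:
            # theta maximizing p(Z_i | Z_pa, theta): ordinary least squares
            theta, *_ = np.linalg.lstsq(Z[:, pa], Z[:, i], rcond=None)
            W[pa, i] = theta
            resid = Z[:, i] - Z[:, pa] @ theta
        else:
            resid = Z[:, i]                 # root node: z_i = eps_i
        loglik += -0.5 * np.sum(resid**2) / sigma**2 \
                  - n * np.log(sigma * np.sqrt(2 * np.pi))
    return W, loglik

# Candidate structures over 2 nodes: empty graph, 0 -> 1, and 1 -> 0.
candidates = {"empty": [[], []], "0->1": [[], [0]], "1->0": [[1], []]}

# Observed causal variables generated as z1 = 2 * z0 + noise.
rng = np.random.default_rng(0)
z0 = rng.normal(0, 0.1, 1000)
z1 = 2.0 * z0 + rng.normal(0, 0.1, 1000)
Z = np.stack([z0, z1], axis=1)

scores = {name: fit_weights_and_loglik(Z, pa)[1] for name, pa in candidates.items()}
best = max(scores, key=scores.get)  # argmax_{G, Theta} p(Z | G, Theta)
```

With this data, the candidate `0->1` attains the highest log-likelihood, consistent with the equal-noise-variance identifiability result of Peters and Bühlmann (2014) cited in section 3.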
2.3 LATENT CAUSAL DISCOVERY
Figure 3: BN for the latent causal discovery task that generalizes standard causal discovery setups.
In more realistic scenarios, the learner does not directly observe causal variables and they must be learned from low-level data. The causal variables, structure, and parameters are part of a latent SCM. The goal of causal representation learning models is to perform inference of, and generation from, the true latent SCM. Yang et al. (2021) propose a CausalVAE, but in a supervised setup where one has labels on causal variables and the focus is on disentanglement. Kocaoglu et al. (2017) present causal generative models trained in an adversarial manner, but assume observations of causal variables. Given the right causal structure as a prior, the work focuses on generation from conditional and interventional distributions. In both the causal representation learning and causal generative model scenarios mentioned above, the Ground Truth (GT) causal graph and parameters of the latent SCM are arbitrarily defined on real datasets and the setting is supervised. Contrary to this, our setting is unsupervised and we are interested in recovering the GT underlying SCM and causal variables that generate the low-level observed data – we define this as the problem of latent causal discovery, and the BN over which we want to perform inference is given in figure 3. In the upcoming sections, we discuss related work, formulate our problem setup and propose an algorithm for Bayesian latent causal discovery, evaluate it with experiments on causally created vector data and image data, and perform sampling from unseen interventional image distributions to showcase the generalization of learned latent SCMs.
3 RELATED WORK
Prior work can be classified into Bayesian (Koivisto and Sood, 2004; Heckerman et al., 2006; Friedman and Koller, 2013) or maximum likelihood (Brouillard et al., 2020; Wei et al., 2020; Ng et al., 2022) methods that learn the structure and parameters of SCMs using either score-based (Kass and Raftery, 1995; Barron et al., 1998; Heckerman et al., 1995) or constraint-based (Cheng et al., 2002; Lehmann and Romano, 2005) approaches.
Causal discovery and structure learning: Works in this category assume causal variables are observed and do not operate on low-level data (Spirtes et al., 2000; Viinikka et al., 2020; Yu et al., 2021; Zhang et al., 2022). Peters and Bühlmann (2014) prove identifiability of linear Gaussian SCMs with equal noise variances. Bengio et al. (2019) use the speed of adaptation as a signal to learn the causal direction. Ke et al. (2019) explore learning causal models from unknown interventions, while Scherrer et al. (2021); Tigas et al. (2022); Agrawal et al. (2019); Toth et al. (2022) focus on active learning and experimental design setups on how to perform interventions to efficiently learn causal models. A Transformer-based (Vaswani et al., 2017) approach learns structure from synthetic datasets and generalizes to naturalistic graphs (Ke et al., 2022). Zheng et al. (2018) introduce an acyclicity constraint that penalizes cyclic graphs, thereby restricting the search close to the DAG space. Lachapelle et al. (2019) leverage this constraint to learn DAGs in nonlinear SCMs. Pamfil et al. (2020); Lippe et al. (2022) perform structure learning on temporal data.
Latent variable models with predefined structure: Examples include the VAE (Kingma and Welling, 2013; Rezende et al., 2014), which assumes independence between latent variables. To overcome this, Sønderby et al. (2016) and Zhao et al. (2017) define latent variables with a chain structure in VAEs. Kingma et al. (2016) use inverse autoregressive flows to improve upon the diagonal covariance of latent variables in VAEs.
Latent variable models with learned structure: GraphVAE (He et al.,2019) learns the edges
between latent variables without incorporating notions of causality. Brehmer et al. (2022) present
identifiability theory for learning causal representations and propose a practical algorithm for this
under the assumption of having access to pairs of observational and interventional data.
Supervised causal representation learning: Kocaoglu et al. (2017); Shen et al. (2020); Moraffah et al. (2020) introduce generative models that use an SCM-based prior in latent space. In Shen et al. (2020), the goal is to learn causally disentangled variables. Yang et al. (2021) learn a DAG on CelebA and a causally generated pendulum image dataset but assume complete access to the causal variables. Lopez-Paz et al. (2016) establish observable causal footprints in images.
4 LEARNING LATENT SCMS FROM LOW-LEVEL DATA
4.1 PROBLEM SCENARIO
We are given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, where each $x^{(i)}$ is a high-dimensional observed data point – for simplicity, we assume $x^{(i)}$ is a vector in $\mathbb{R}^D$, but the setup extends to other inputs as well (like images, as we will see in the next section). We assume that there exist latent variables $Z = \{z^{(i)} \in \mathbb{R}^d\}_{i=1}^{N}$ with $d \ll D$ that explain the data, and that these latent variables have an SCM with structure $\mathcal{G}_{GT}$ and parameters $\Theta_{GT}$ associated with them. We wish to invert the data generation process $g: (Z, \mathcal{G}, \Theta) \rightarrow \mathcal{D}$, where the causal variables $Z$ are in the latent space. In this setting, we also have access to the intervention targets $\mathcal{I} = \{I^{(i)}\}_{i=1}^{N}$, where each $I^{(i)} \in \{0,1\}^d$. The $j$-th dimension of $I^{(i)}$ takes a value of 1 if node $j$ was intervened on in data sample $i$, and 0 otherwise.
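To make the setup concrete, here is a hedged sketch of a data-generating process of this form: ancestral sampling through a linear Gaussian latent SCM with random, known intervention targets, followed by a fixed map to low-level observations. The choice of a random linear projection as a stand-in for $g$, the intervention-value distribution, and all names are illustrative assumptions, not the paper's generative process.

```python
import numpy as np

def generate_dataset(W, n_samples, D, sigma=0.1, p_intervene=0.2, rng=None):
    """Generate low-level data x = g(z) from a linear Gaussian latent SCM,
    together with binary intervention targets I (1 = node was intervened on).

    Assumptions (illustrative): nodes are topologically ordered so W is
    strictly upper triangular, g is a fixed random linear projection from
    R^d to R^D, and a perfect intervention sets the node's value to a draw
    from N(0, 2^2)."""
    rng = np.random.default_rng(rng)
    d = W.shape[0]
    proj = rng.normal(size=(d, D))            # stand-in for the decoder g
    X = np.zeros((n_samples, D))
    Zs = np.zeros((n_samples, d))
    I = (rng.uniform(size=(n_samples, d)) < p_intervene).astype(int)
    for n in range(n_samples):
        z = np.zeros(d)
        for i in range(d):                    # ancestral sampling, equation 2
            if I[n, i] == 1:                  # perfect intervention on node i
                z[i] = rng.normal(0.0, 2.0)
            else:
                z[i] = z @ W[:, i] + rng.normal(0.0, sigma)
        Zs[n] = z
        X[n] = z @ proj                       # low-level observation x = g(z)
    return X, I, Zs
```

The returned `X` and `I` play the roles of the observed dataset $\mathcal{D}$ and the known intervention targets $\mathcal{I}$; `Zs` corresponds to the GT causal variables that remain unobserved by the learner.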
4.2 GENERAL METHOD
We aim to obtain a posterior estimate over the entire latent SCM, $p(Z, \mathcal{G}, \Theta \mid \mathcal{D})$. Computing the true posterior analytically requires calculating the marginal likelihood $p(\mathcal{D})$, which quickly becomes intractable because the number of possible DAGs grows super-exponentially with the number of nodes. Thus, we resort to variational inference (Blei et al., 2017), which provides a tractable way to learn an approximate posterior $q_\phi(Z, \mathcal{G}, \Theta)$ with variational parameters $\phi$, close to the true posterior $p(Z, \mathcal{G}, \Theta \mid \mathcal{D})$, by maximizing the Evidence Lower Bound (ELBO),

$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{q_\phi(Z, \mathcal{G}, \Theta)}\left[\log p_\psi(\mathcal{D} \mid Z, \mathcal{G}, \Theta) - \log \frac{q_\phi(Z, \mathcal{G}, \Theta)}{p(Z, \mathcal{G}, \Theta)}\right], \quad (5)$$

where $p(Z, \mathcal{G}, \Theta)$ is the prior and $p_\psi(\mathcal{D} \mid Z, \mathcal{G}, \Theta)$ is the likelihood model with parameters $\psi$; the likelihood model maps the latent variables $Z$ to high-dimensional vectors $X$. An approach to learn this posterior could be to factorize it as

$$q_\phi(Z, \mathcal{G}, \Theta) = q_\phi(Z) \cdot q_\phi(\mathcal{G}, \Theta \mid Z). \quad (6)$$
Given a way to obtain $q_\phi(Z)$, the conditional $q_\phi(\mathcal{G}, \Theta \mid Z)$ can be obtained using existing Bayesian structure learning methods. Otherwise, one has to perform a hard simultaneous optimization, which would require alternating optimization over $Z$ and over $(\mathcal{G}, \Theta)$ given an estimate of $Z$. The difficulty of such an alternating optimization is discussed in Brehmer et al. (2022).
Alternate factorization of the posterior: Rather than factorizing as in equation 6, we propose to factorize according to the BN in figure 3. This is given by $q_\phi(Z, \mathcal{G}, \Theta) = q_\phi(Z \mid \mathcal{G}, \Theta) \cdot q_\phi(\mathcal{G}, \Theta)$. The advantage of this factorization is that the distribution over $Z$ is completely determined from the SCM given $(\mathcal{G}, \Theta)$ and the exogenous noise variables (assumed to be Gaussian). Thus, the prior $p(Z \mid \mathcal{G}, \Theta)$ and the posterior $p(Z \mid \mathcal{G}, \Theta, \mathcal{D}) = q_\phi(Z \mid \mathcal{G}, \Theta)$ are identical. This conveniently avoids the hard simultaneous optimization problem mentioned above, since optimizing for $q_\phi(Z)$ is not necessary. Equation 5 can then be simplified as

$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{q_\phi(Z, \mathcal{G}, \Theta)}\Bigg[\log p_\psi(\mathcal{D} \mid Z) - \log \frac{q_\phi(\mathcal{G}, \Theta)}{p(\mathcal{G}, \Theta)} - \underbrace{\log \frac{q_\phi(Z \mid \mathcal{G}, \Theta)}{p(Z \mid \mathcal{G}, \Theta)}}_{=\,0}\Bigg]. \quad (7)$$
Such a posterior can be used to obtain an SCM by sampling $\mathcal{G}$ and $\Theta$ from the approximated posterior. As long as the samples $\mathcal{G}$ are always acyclic, one can perform ancestral sampling through the SCM to obtain samples corresponding to the causal variables $\hat{z}^{(i)}$. For additive noise models like in equation 2, these samples are already reparameterized and differentiable with respect to their parameters. The samples of causal variables are then fed to the likelihood model to predict $\hat{x}^{(i)}$ and reconstruct the observed data $x^{(i)}$.
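The sketch below illustrates one such reparameterized forward pass under simplifying assumptions: a diagonal-Gaussian posterior over the free entries of a lower-triangular weight matrix with a fixed node ordering (so every sample is a DAG), a shared noise scale, and a Gaussian likelihood model. The permutation posterior of section 4.3 is omitted, and all names (`forward_pass`, `mu_L`, `decoder`, and so on) are illustrative rather than the paper's implementation.

```python
import numpy as np

def forward_pass(mu_L, log_std_L, log_sigma, decoder, x, rng=None):
    """One reparameterized sample of (Theta, Z) followed by reconstruction.

    mu_L, log_std_L : parameters of a diagonal-Gaussian posterior over the
                      d(d-1)/2 free entries of a strictly lower-triangular
                      weight matrix (fixed node ordering, hence always a DAG).
    log_sigma       : log of the shared noise scale of the SCM.
    decoder         : callable mapping z in R^d to a reconstruction of x.
    """
    rng = np.random.default_rng(rng)
    d = int(round((1 + np.sqrt(1 + 8 * mu_L.size)) / 2))  # recover d from d(d-1)/2
    # Reparameterized sample of the edge weights (Theta).
    L_flat = mu_L + np.exp(log_std_L) * rng.normal(size=mu_L.shape)
    W = np.zeros((d, d))
    W[np.tril_indices(d, k=-1)] = L_flat      # W[j, i] nonzero only for j > i
    # Ancestral sampling of z through the sampled SCM (also reparameterized).
    # With a lower-triangular W, parents have larger indices, so we visit
    # nodes in reverse index order.
    eps = np.exp(log_sigma) * rng.normal(size=d)
    z = np.zeros(d)
    for i in reversed(range(d)):
        z[i] = W[:, i] @ z + eps[i]
    x_hat = decoder(z)                        # predict x_hat from z
    recon_loglik = -0.5 * np.sum((x - x_hat) ** 2)  # Gaussian log-lik., up to a constant
    return z, x_hat, recon_loglik

# Illustrative usage with a random linear decoder.
rng = np.random.default_rng(0)
proj = rng.normal(size=(3, 10))
x = rng.normal(size=10)
z, x_hat, ll = forward_pass(np.zeros(3), np.zeros(3), np.log(0.1),
                            lambda z: z @ proj, x)
```

In practice, the reconstruction term would be averaged over data points and combined with the KL terms of equation 7 (or equation 9 below) to form the ELBO being maximized.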
4.3 POSTERIOR PARAMETERIZATIONS AND PRIORS
For linear Gaussian latent SCMs, which are the focus of this work, learning a posterior over $(\mathcal{G}, \Theta)$ is equivalent to learning $q_\phi(W, \Sigma)$ – a posterior over weighted adjacency matrices $W$ and noise covariances $\Sigma$. We follow an approach similar to Cundy et al. (2021). We express $W$ via a permutation matrix $P$ (see footnote 1) and a lower triangular edge weight matrix $L$, according to $W = P^T L^T P$. Here, $L$ is defined in the space of all weighted adjacency matrices with a fixed node ordering, where node $j$ can be a parent of node $i$ only if $j > i$. Search over permutations corresponds to search over different node orderings, and thus $W$ and $\Sigma$ parameterize the space of SCMs. Further, we factorize the approximate posterior $q_\phi(P, L, \Sigma)$ as

$$q_\phi(\mathcal{G}, \Theta) \equiv q_\phi(W, \Sigma) \equiv q_\phi(P, L, \Sigma) = q_\phi(P \mid L, \Sigma) \cdot q_\phi(L, \Sigma). \quad (8)$$
Combining equations 7 and 8 leads to the following ELBO which has to be maximized (derived in A.1), and the overall algorithm for Bayesian latent causal discovery is summarized in algorithm 1,

$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{q_\phi(L, \Sigma)}\Bigg[\mathbb{E}_{q_\phi(P \mid L, \Sigma)}\bigg[\mathbb{E}_{q_\phi(Z \mid P, L, \Sigma)}\big[\log p_\psi(\mathcal{D} \mid Z)\big] - \log \frac{q_\phi(P \mid L, \Sigma)}{p(P)}\bigg] - \log \frac{q_\phi(L, \Sigma)}{p(L)\, p(\Sigma)}\Bigg]. \quad (9)$$
Distribution over $(L, \Sigma)$: The posterior distribution $q_\phi(L, \Sigma)$ has $\left(\frac{d(d-1)}{2} + 1\right)$ elements to be learnt in the equal noise variance setting. This is parameterized as a diagonal covariance normal distribution. For the prior $p(L)$ over the edge weights, we promote sparse DAGs by using a horseshoe prior (Carvalho et al., 2009), similar to Cundy et al. (2021). A Gaussian prior is defined over $\log \Sigma$.
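The horseshoe prior has no closed-form marginal density, but it can be written as a scale mixture of Gaussians with half-Cauchy local scales. Below is a hedged sketch of the joint log-density of that hierarchical form; treating the local scales as explicit latent variables and the particular global scale value are illustrative choices, not necessarily how the prior is implemented in this work.

```python
import numpy as np

def horseshoe_logprior(L_flat, lam, tau=0.1):
    """Joint log-density of the hierarchical horseshoe prior:
        lam_k ~ HalfCauchy(0, 1),   L_k | lam_k ~ Normal(0, (tau * lam_k)^2).

    L_flat : free (strictly lower-triangular) edge weights of L.
    lam    : positive local scales, same shape as L_flat.
    tau    : global scale encouraging sparsity (illustrative value)."""
    # log HalfCauchy(lam; 0, 1) = log(2/pi) - log(1 + lam^2), for lam > 0
    log_half_cauchy = np.log(2.0 / np.pi) - np.log1p(lam ** 2)
    # log Normal(L; 0, (tau * lam)^2)
    scale = tau * lam
    log_normal = -0.5 * (L_flat / scale) ** 2 - np.log(scale) - 0.5 * np.log(2 * np.pi)
    return np.sum(log_half_cauchy + log_normal)
```

Small values of `tau` concentrate the prior mass near zero-weight (sparse) graphs while the heavy-tailed local scales still allow a few large edge weights.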
Distribution over $P$: Since the values of $P$ are discrete, performing a discrete optimization is combinatorial and quickly becomes intractable with increasing $d$. This can be handled by relaxing the discrete permutation learning problem to a continuous optimization problem. This is commonly done by introducing a Gumbel-Sinkhorn (Mena et al., 2018) distribution, where one has to calculate $S((T + \gamma)/\tau)$, where $T$ is the parameter of the Gumbel-Sinkhorn distribution, $\gamma$ is a matrix of standard Gumbel noise, and $\tau$ is a temperature parameter. The logits $T$ are predicted by passing the predicted $(L, \Sigma)$ through an MLP. In the limit of infinite iterations and as $\tau \to 0$, sampling from the distribution returns a doubly stochastic matrix. During the forward pass, a hard permutation $P$ is obtained by using the Hungarian algorithm (Kuhn, 1955), which allows $\tau \to 0$. During the backward pass, a soft permutation is used to calculate gradients, similar to Cundy et al. (2021); Charpentier et al. (2022). We use a uniform prior $p(P)$ over permutations.
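Below is a hedged sketch of the relaxed permutation sampling described above: the Sinkhorn operator applied to $(T + \gamma)/\tau$ gives a soft, approximately doubly stochastic matrix, and the Hungarian algorithm (via `scipy.optimize.linear_sum_assignment`) extracts a hard permutation for the forward pass. The number of Sinkhorn iterations, the straight-through use of the soft matrix for gradients, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn(log_alpha, n_iters=20):
    """Sinkhorn operator S(.): alternately normalize rows and columns in
    log-space, returning a (nearly) doubly stochastic matrix."""
    for _ in range(n_iters):
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=1, keepdims=True)
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=0, keepdims=True)
    return np.exp(log_alpha)

def sample_permutation(T, tau=1.0, rng=None):
    """Gumbel-Sinkhorn sample: a soft doubly stochastic matrix S((T + gamma)/tau)
    plus a hard permutation from the Hungarian algorithm. The hard matrix is
    used in the forward pass; the soft matrix would carry gradients in the
    backward pass (straight-through style)."""
    rng = np.random.default_rng(rng)
    gamma = rng.gumbel(size=T.shape)            # standard Gumbel noise
    soft = sinkhorn((T + gamma) / tau)          # relaxed, doubly stochastic
    row, col = linear_sum_assignment(-soft)     # maximize total assignment weight
    hard = np.zeros_like(soft)
    hard[row, col] = 1.0                        # hard permutation matrix P
    return hard, soft

# Example: logits T (e.g., predicted by an MLP from (L, Sigma)) for d = 4 nodes.
T = np.random.default_rng(0).normal(size=(4, 4))
P_hard, P_soft = sample_permutation(T, tau=0.5)
```

The sampled hard permutation would then be combined with the lower-triangular $L$ as $W = P^T L^T P$ to recover a weighted DAG over an arbitrary node ordering.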
¹A permutation matrix $P \in \{0,1\}^{d \times d}$ is a bistochastic matrix with $\sum_i p_{ij} = 1 \;\forall j$ and $\sum_j p_{ij} = 1 \;\forall i$.