LEARNING LATENT STRUCTURAL CAUSAL MODELS
Jithendaraa Subramanian
Mila, McGill University
Yashas Annadani
KTH Stockholm
Ivaxi Sheth
Mila, ÉTS Montréal
Nan Rosemary Ke
Mila, DeepMind
Tristan Deleu
Mila, Université de Montréal
Stefan Bauer
KTH Stockholm
Derek Nowrouzezahrai
Mila, McGill University
Samira Ebrahimi Kahou
Mila, ÉTS Montréal, CIFAR AI Chair
ABSTRACT
Causal learning has long concerned itself with the accurate recovery of underlying causal mechanisms. Such causal modelling enables better explanations of out-of-distribution data. Prior works on causal learning assume that the high-level causal variables are given. However, in machine learning tasks, one often operates on low-level data like image pixels or high-dimensional vectors. In such settings, the entire Structural Causal Model (SCM) – structure, parameters, and high-level causal variables – is unobserved and needs to be learnt from low-level data. We treat this problem as Bayesian inference of the latent SCM, given low-level data. For linear Gaussian additive noise SCMs, we present a tractable approximate inference method which performs joint inference over the causal variables, structure and parameters of the latent SCM from random, known interventions. Experiments are performed on synthetic datasets and a causally generated image dataset to demonstrate the efficacy of our approach. We also perform image generation from unseen interventions, thereby verifying out-of-distribution generalization for the proposed causal model.
1 INTRODUCTION
Learning variables of interest and uncovering causal dependencies is crucial for intelligent systems
to reason and predict in scenarios that differ from the training distribution. In the causality literature,
causal variables and mechanisms are often assumed to be known. This knowledge enables reasoning
and prediction under unseen interventions. In machine learning, however, one does not have direct
access to the underlying variables of interest nor the causal structure and mechanisms corresponding
to them. Rather, these have to be learned from observed low-level data like pixels of an image which
are usually high-dimensional. Having a learned causal model can then be useful for generalizing to
out-of-distribution data (Scherrer et al., 2022; Ke et al., 2021), estimating the effect of interventions (Pearl, 2009; Schölkopf et al., 2021), disentangling underlying factors of variation (Bengio et al., 2012; Wang and Jordan, 2021), and transfer learning (Schoelkopf et al., 2012; Bengio et al., 2019).
Structure learning (Spirtes et al.,2000;Zheng et al.,2018) learns the structure and parameters of
the Structural Causal Model (SCM) (Pearl,2009) that best explains some observed high-level causal
variables. In causal machine learning and representation learning, however, these causal variables
may no longer be observable. This serves as the motivation for our work. We address the problem of
learning the entire SCM – consisting of its causal variables, structure and parameters – which is latent,
by learning to generate observed low-level data. Since one often operates in low-data regimes or
non-identifiable settings, we adopt a Bayesian formulation so as to quantify epistemic uncertainty
over the learned latent SCM. Given a dataset, we use variational inference to learn a joint posterior
over the causal variables, structure and parameters of the latent SCM. To the best of our knowledge,
ours is the first work to address the problem of causal discovery in linear Gaussian latent SCMs from
low-level data, where causal variables are unobserved. Our contributions are as follows:
Correspondence to jithen.subra@gmail.com
Figure 1: Model architecture of the proposed generative model for the Bayesian latent causal discovery task to learn the latent SCM from low-level data.
• We propose a general algorithm for Bayesian causal discovery in the latent space of a generative model, learning a distribution over causal variables, structure and parameters in linear Gaussian latent SCMs with random, known interventions. Figure 1 illustrates an overview of the proposed method.
• By learning the structure and parameters of a latent SCM, we implicitly induce a joint distribution over the causal variables. Hence, sampling from this distribution is equivalent to ancestral sampling through the latent SCM. As such, we address a challenging, simultaneous optimization problem that is often encountered during causal discovery in latent space: one cannot find the right graph without the right causal variables, and vice versa.
• On a synthetically generated dataset and an image dataset used to benchmark causal model performance (Ke et al., 2021), we evaluate our method along three axes – uncovering causal variables, structure, and parameters – consistently outperforming baselines. We demonstrate its ability to perform image generation from unseen interventional distributions.
2 PRELIMINARIES
2.1 STRUCTURAL CAUSAL MODELS
Figure 2: BN for prior works in causal discovery and structure learning.
A Structural Causal Model (SCM) is defined by a set of equations which represent the mechanisms by which each endogenous variable $z_i$ depends on its direct causes $z^{\mathcal{G}}_{pa(i)}$ and a corresponding exogenous noise variable $\epsilon_i$. The direct causes are subsets of other endogenous variables. If the causal parent assignment is assumed to be acyclic, then an SCM is associated with a Directed Acyclic Graph (DAG) $\mathcal{G} = (V, E)$, where $V$ corresponds to the endogenous variables and $E$ encodes direct cause-effect relationships. The exact value taken on by a causal variable $z_i$ is given by a local causal mechanism $f_i$ conditional on $z^{\mathcal{G}}_{pa(i)}$, the parameters $\Theta_i$, and the node's noise variable $\epsilon_i$, as given in equation 1. For linear Gaussian additive noise SCMs with equal noise variance, i.e., the setting that we focus on in this work, all $f_i$ are linear functions, and $\Theta$ denotes the weighted adjacency matrix $W$, where each $W_{ji}$ is the edge weight from node $j$ to node $i$. The linear Gaussian additive noise SCM thus reduces to equation 2,

$$z_i = f_i\big(z^{\mathcal{G}}_{pa(i)}, \Theta_i, \epsilon_i\big), \quad (1) \qquad\qquad z_i = \sum_{j \in pa_{\mathcal{G}}(i)} W_{ji} \cdot z_j + \epsilon_i. \quad (2)$$
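To make equation 2 concrete, below is a minimal NumPy sketch of ancestral sampling from a linear Gaussian additive noise SCM with equal noise variance. It is illustrative rather than the paper's implementation: the function name, the assumption of a topological node ordering, and the example weights are not from the source.

```python
import numpy as np

def ancestral_sample(W, sigma=0.1, n_samples=1000, rng=None):
    """Sample z from a linear Gaussian additive-noise SCM (equation 2).

    W[j, i] is the edge weight from node j to node i. Nodes are assumed to be
    indexed in a topological order of the DAG, so W is strictly upper
    triangular, and all nodes share the same noise variance sigma**2.
    """
    rng = np.random.default_rng(rng)
    d = W.shape[0]
    eps = rng.normal(0.0, sigma, size=(n_samples, d))  # exogenous noise
    z = np.zeros((n_samples, d))
    for i in range(d):                       # visit nodes in topological order
        z[:, i] = z @ W[:, i] + eps[:, i]    # z_i = sum_j W_ji * z_j + eps_i
    return z

# Example: a 3-node chain z0 -> z1 -> z2 with edge weights 2.0 and -1.5.
W = np.array([[0.0, 2.0,  0.0],
              [0.0, 0.0, -1.5],
              [0.0, 0.0,  0.0]])
Z = ancestral_sample(W, sigma=0.1, n_samples=5000)
```

In this toy chain, the sampled columns of `Z` follow the mechanisms $z_1 = 2\,z_0 + \epsilon_1$ and $z_2 = -1.5\,z_1 + \epsilon_2$.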
2.2 CAUSAL DISCOVERY
Structure learning in prior work refers to learning a DAG according to some optimization criterion
with or without the notion of causality (e.g., He et al. (2019)). The task of causal discovery on
the other hand, is more specific in that it refers to learning the structure (also parameters, in some
cases) of SCMs, and subscribes to causality and interventions like that of Pearl (2009). That is, the
methods aim to estimate (G,Θ). These approaches often resort to modular likelihood scores over
causal variables – like the BGe score (Geiger and Heckerman,1994;Kuipers et al.,2022) and BDe
score (Heckerman et al.,1995) – to learn the right structure. However, these methods all assume a
dataset of observed causal variables. These approaches either obtain a maximum likelihood estimate,

$$\mathcal{G} = \arg\max_{\mathcal{G}} \; p(Z \mid \mathcal{G}) \quad \text{or} \quad (\mathcal{G}, \Theta) = \arg\max_{\mathcal{G}, \Theta} \; p(Z \mid \mathcal{G}, \Theta), \quad (3)$$
or, in the case of Bayesian causal discovery (Heckerman et al., 1997), variational inference is typically used to approximate the true posterior $p(\mathcal{G}, \Theta \mid Z)$ with a joint posterior distribution $q_\phi(\mathcal{G}, \Theta)$ by minimizing the KL divergence between the two,

$$D_{\mathrm{KL}}\big(q_\phi(\mathcal{G}, \Theta) \,\|\, p(\mathcal{G}, \Theta \mid Z)\big) = \log p(Z) - \mathbb{E}_{(\mathcal{G}, \Theta) \sim q_\phi}\left[\log p(Z \mid \mathcal{G}, \Theta) - \log \frac{q_\phi(\mathcal{G}, \Theta)}{p(\mathcal{G}, \Theta)}\right], \quad (4)$$

where $p(\mathcal{G}, \Theta)$ is a prior over the structure and parameters of the SCM – possibly encoding DAGness, sparse connections, or low-magnitude edge weights. Figure 2 shows the Bayesian Network (BN) over which inference is performed for causal discovery tasks.
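As a concrete toy instance of the maximum likelihood formulation in equation 3 (observed causal variables $Z$, linear Gaussian SCM, known shared noise scale), the sketch below scores a small hand-enumerated set of candidate DAGs over two nodes and picks the best-scoring $(\mathcal{G}, \Theta)$. The candidate set, the fixed noise scale, and all names are illustrative assumptions, not the method proposed in this paper.

```python
import numpy as np

def fit_weights_and_loglik(Z, parents, sigma=0.1):
    """Given a parent set per node, fit edge weights by least squares and
    return the fitted W together with the Gaussian log-likelihood
    log p(Z | G, Theta) under a fixed, shared noise scale sigma."""
    n, d = Z.shape
    W = np.zeros((d, d))
    loglik = 0.0
    for i in range(d):
        pa = parents[i]
        if pa:
            # theta maximizing p(Z_i | Z_pa, theta): ordinary least squares
            theta, *_ = np.linalg.lstsq(Z[:, pa], Z[:, i], rcond=None)
            W[pa, i] = theta
            resid = Z[:, i] - Z[:, pa] @ theta
        else:
            resid = Z[:, i]                 # root node: z_i = eps_i
        loglik += -0.5 * np.sum(resid**2) / sigma**2 \
                  - n * np.log(sigma * np.sqrt(2 * np.pi))
    return W, loglik

# Candidate structures over 2 nodes: empty graph, 0 -> 1, and 1 -> 0.
candidates = {"empty": [[], []], "0->1": [[], [0]], "1->0": [[1], []]}

# Observed causal variables generated as z1 = 2 * z0 + noise.
rng = np.random.default_rng(0)
z0 = rng.normal(0, 0.1, 1000)
z1 = 2.0 * z0 + rng.normal(0, 0.1, 1000)
Z = np.stack([z0, z1], axis=1)

scores = {name: fit_weights_and_loglik(Z, pa)[1] for name, pa in candidates.items()}
best = max(scores, key=scores.get)  # argmax_{G, Theta} p(Z | G, Theta)
```

With this data, the candidate `0->1` attains the highest log-likelihood, consistent with the equal-noise-variance identifiability result of Peters and Bühlmann (2014) cited in section 3.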
2.3 LATENT CAUSAL DISCOVERY
Figure 3: BN for the latent causal discovery task that generalizes standard causal discovery setups.
In more realistic scenarios, the learner does not directly observe causal variables and they must be learned from low-level data. The causal variables, structure, and parameters are part of a latent SCM. The goal of causal representation learning models is to perform inference of, and generation from, the true latent SCM. Yang et al. (2021) propose a CausalVAE, but in a supervised setup where one has labels on causal variables and the focus is on disentanglement. Kocaoglu et al. (2017) present causal generative models trained in an adversarial manner, but assume observations of causal variables. Given the right causal structure as a prior, the work focuses on generation from conditional and interventional distributions. In both the causal representation learning and causal generative model scenarios mentioned above, the Ground Truth (GT) causal graph and parameters of the latent SCM are arbitrarily defined on real datasets and the setting is supervised. Contrary to this, our setting is unsupervised and we are interested in recovering the GT underlying SCM and causal variables that generate the low-level observed data – we define this as the problem of latent causal discovery, and the BN over which we want to perform inference is given in figure 3. In the upcoming sections, we discuss related work, formulate our problem setup and propose an algorithm for Bayesian latent causal discovery, evaluate it with experiments on causally created vector data and image data, and perform sampling from unseen interventional image distributions to showcase the generalization of learned latent SCMs.
3 RELATED WORK
Prior work can be classified into Bayesian (Koivisto and Sood, 2004; Heckerman et al., 2006; Friedman and Koller, 2013) or maximum likelihood (Brouillard et al., 2020; Wei et al., 2020; Ng et al., 2022) methods that learn the structure and parameters of SCMs using either score-based (Kass and Raftery, 1995; Barron et al., 1998; Heckerman et al., 1995) or constraint-based (Cheng et al., 2002; Lehmann and Romano, 2005) approaches.
Causal discovery and structure learning: Works in this category assume causal variables are observed and do not operate on low-level data (Spirtes et al., 2000; Viinikka et al., 2020; Yu et al., 2021; Zhang et al., 2022). Peters and Bühlmann (2014) prove identifiability of linear Gaussian SCMs with equal noise variances. Bengio et al. (2019) use the speed of adaptation as a signal to learn the causal direction. Ke et al. (2019) explore learning causal models from unknown interventions, while Scherrer et al. (2021); Tigas et al. (2022); Agrawal et al. (2019); Toth et al. (2022) focus on active learning and experimental design setups on how to perform interventions to efficiently learn causal models. A Transformer-based (Vaswani et al., 2017) approach learns structure from synthetic datasets and generalizes to naturalistic graphs (Ke et al., 2022). Zheng et al. (2018) introduce an acyclicity constraint that penalizes cyclic graphs, thereby restricting the search close to the DAG space. Lachapelle et al. (2019) leverage this constraint to learn DAGs in nonlinear SCMs. Pamfil et al. (2020); Lippe et al. (2022) perform structure learning on temporal data.
Latent variable models with predefined structure: Examples include the VAE (Kingma and Welling, 2013; Rezende et al., 2014), which assumes independence between latent variables. To overcome this, Sønderby et al. (2016) and Zhao et al. (2017) define latent variables with a chain structure in VAEs. Kingma et al. (2016) use inverse autoregressive flows to improve upon the diagonal covariance of latent variables in VAEs.
Latent variable models with learned structure: GraphVAE (He et al.,2019) learns the edges
between latent variables without incorporating notions of causality. Brehmer et al. (2022) present
identifiability theory for learning causal representations and propose a practical algorithm for this
under the assumption of having access to pairs of observational and interventional data.
Supervised causal representation learning: Kocaoglu et al. (2017); Shen et al. (2020); Moraffah et al. (2020) introduce generative models that use an SCM-based prior in latent space. In Shen et al. (2020), the goal is to learn causally disentangled variables. Yang et al. (2021) learn a DAG on CelebA and a causally generated pendulum image dataset but assume complete access to the causal variables. Lopez-Paz et al. (2016) establish observable causal footprints in images.
4 LEARNING LATENT SCMS FROM LOW-LEVEL DATA
4.1 PROBLEM SCENARIO
We are given a dataset $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$, where each $x^{(i)}$ is a high-dimensional observed data point – for simplicity, we assume $x^{(i)}$ is a vector in $\mathbb{R}^D$, but the setup extends to other inputs as well (like images, as we will see in the next section). We assume that there exist latent variables $Z = \{z^{(i)} \in \mathbb{R}^d\}_{i=1}^{N}$ with $d \ll D$ that explain the data, and that these latent variables have an SCM with structure $\mathcal{G}_{GT}$ and parameters $\Theta_{GT}$ associated with them. We wish to invert the data generation process $g: (Z, \mathcal{G}, \Theta) \rightarrow \mathcal{D}$, where the causal variables $Z$ are in the latent space. In this setting, we also have access to the intervention targets $\mathcal{I} = \{I^{(i)}\}_{i=1}^{N}$, where each $I^{(i)} \in \{0,1\}^d$. The $j$-th dimension of $I^{(i)}$ takes a value of 1 if node $j$ was intervened on in data sample $i$, and 0 otherwise.
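To make the setup concrete, here is a hedged sketch of a data-generating process of this form: ancestral sampling through a linear Gaussian latent SCM with random, known intervention targets, followed by a fixed map to low-level observations. The choice of a random linear projection as a stand-in for $g$, the intervention-value distribution, and all names are illustrative assumptions, not the paper's generative process.

```python
import numpy as np

def generate_dataset(W, n_samples, D, sigma=0.1, p_intervene=0.2, rng=None):
    """Generate low-level data x = g(z) from a linear Gaussian latent SCM,
    together with binary intervention targets I (1 = node was intervened on).

    Assumptions (illustrative): nodes are topologically ordered so W is
    strictly upper triangular, g is a fixed random linear projection from
    R^d to R^D, and a perfect intervention sets the node's value to a draw
    from N(0, 2^2)."""
    rng = np.random.default_rng(rng)
    d = W.shape[0]
    proj = rng.normal(size=(d, D))            # stand-in for the decoder g
    X = np.zeros((n_samples, D))
    Zs = np.zeros((n_samples, d))
    I = (rng.uniform(size=(n_samples, d)) < p_intervene).astype(int)
    for n in range(n_samples):
        z = np.zeros(d)
        for i in range(d):                    # ancestral sampling, equation 2
            if I[n, i] == 1:                  # perfect intervention on node i
                z[i] = rng.normal(0.0, 2.0)
            else:
                z[i] = z @ W[:, i] + rng.normal(0.0, sigma)
        Zs[n] = z
        X[n] = z @ proj                       # low-level observation x = g(z)
    return X, I, Zs
```

The returned `X` and `I` play the roles of the observed dataset $\mathcal{D}$ and the known intervention targets $\mathcal{I}$; `Zs` corresponds to the GT causal variables that remain unobserved by the learner.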
4.2 GENERAL METHOD
We aim to obtain a posterior estimate over the entire latent SCM, $p(Z, \mathcal{G}, \Theta \mid \mathcal{D})$. Computing the true posterior analytically requires calculating the marginal likelihood $p(\mathcal{D})$, which quickly becomes intractable because the number of possible DAGs grows super-exponentially with the number of nodes. Thus, we resort to variational inference (Blei et al., 2017), which provides a tractable way to learn an approximate posterior $q_\phi(Z, \mathcal{G}, \Theta)$ with variational parameters $\phi$, close to the true posterior $p(Z, \mathcal{G}, \Theta \mid \mathcal{D})$, by maximizing the Evidence Lower Bound (ELBO),

$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{q_\phi(Z, \mathcal{G}, \Theta)}\left[\log p_\psi(\mathcal{D} \mid Z, \mathcal{G}, \Theta) - \log \frac{q_\phi(Z, \mathcal{G}, \Theta)}{p(Z, \mathcal{G}, \Theta)}\right], \quad (5)$$

where $p(Z, \mathcal{G}, \Theta)$ is the prior and $p_\psi(\mathcal{D} \mid Z, \mathcal{G}, \Theta)$ is the likelihood model with parameters $\psi$; the likelihood model maps the latent variables $Z$ to high-dimensional vectors $X$. An approach to learn this posterior could be to factorize it as

$$q_\phi(Z, \mathcal{G}, \Theta) = q_\phi(Z) \cdot q_\phi(\mathcal{G}, \Theta \mid Z). \quad (6)$$
Given a way to obtain $q_\phi(Z)$, the conditional $q_\phi(\mathcal{G}, \Theta \mid Z)$ can be obtained using existing Bayesian structure learning methods. Otherwise, one has to perform a hard simultaneous optimization, which would require alternating optimization over $Z$ and over $(\mathcal{G}, \Theta)$ given an estimate of $Z$. The difficulty of such an alternating optimization is discussed in Brehmer et al. (2022).
Alternate factorization of the posterior: Rather than factorizing as in equation 6, we propose to factorize according to the BN in figure 3. This is given by $q_\phi(Z, \mathcal{G}, \Theta) = q_\phi(Z \mid \mathcal{G}, \Theta) \cdot q_\phi(\mathcal{G}, \Theta)$. The advantage of this factorization is that the distribution over $Z$ is completely determined from the SCM given $(\mathcal{G}, \Theta)$ and the exogenous noise variables (assumed to be Gaussian). Thus, the prior $p(Z \mid \mathcal{G}, \Theta)$ and the posterior $p(Z \mid \mathcal{G}, \Theta, \mathcal{D}) = q_\phi(Z \mid \mathcal{G}, \Theta)$ are identical. This conveniently avoids the hard simultaneous optimization problem mentioned above, since optimizing for $q_\phi(Z)$ is not necessary. Equation 5 can then be simplified as

$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{q_\phi(Z, \mathcal{G}, \Theta)}\Bigg[\log p_\psi(\mathcal{D} \mid Z) - \log \frac{q_\phi(\mathcal{G}, \Theta)}{p(\mathcal{G}, \Theta)} - \underbrace{\log \frac{q_\phi(Z \mid \mathcal{G}, \Theta)}{p(Z \mid \mathcal{G}, \Theta)}}_{=\,0}\Bigg]. \quad (7)$$
Such a posterior can be used to obtain an SCM by sampling $\mathcal{G}$ and $\Theta$ from the approximated posterior. As long as the samples $\mathcal{G}$ are always acyclic, one can perform ancestral sampling through the SCM to obtain samples corresponding to the causal variables $\hat{z}^{(i)}$. For additive noise models like in equation 2, these samples are already reparameterized and differentiable with respect to their parameters. The samples of causal variables are then fed to the likelihood model to predict $\hat{x}^{(i)}$ and reconstruct the observed data $x^{(i)}$.
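The sketch below illustrates one such reparameterized forward pass under simplifying assumptions: a diagonal-Gaussian posterior over the free entries of a lower-triangular weight matrix with a fixed node ordering (so every sample is a DAG), a shared noise scale, and a Gaussian likelihood model. The permutation posterior of section 4.3 is omitted, and all names (`forward_pass`, `mu_L`, `decoder`, and so on) are illustrative rather than the paper's implementation.

```python
import numpy as np

def forward_pass(mu_L, log_std_L, log_sigma, decoder, x, rng=None):
    """One reparameterized sample of (Theta, Z) followed by reconstruction.

    mu_L, log_std_L : parameters of a diagonal-Gaussian posterior over the
                      d(d-1)/2 free entries of a strictly lower-triangular
                      weight matrix (fixed node ordering, hence always a DAG).
    log_sigma       : log of the shared noise scale of the SCM.
    decoder         : callable mapping z in R^d to a reconstruction of x.
    """
    rng = np.random.default_rng(rng)
    d = int(round((1 + np.sqrt(1 + 8 * mu_L.size)) / 2))  # recover d from d(d-1)/2
    # Reparameterized sample of the edge weights (Theta).
    L_flat = mu_L + np.exp(log_std_L) * rng.normal(size=mu_L.shape)
    W = np.zeros((d, d))
    W[np.tril_indices(d, k=-1)] = L_flat      # W[j, i] nonzero only for j > i
    # Ancestral sampling of z through the sampled SCM (also reparameterized).
    # With a lower-triangular W, parents have larger indices, so we visit
    # nodes in reverse index order.
    eps = np.exp(log_sigma) * rng.normal(size=d)
    z = np.zeros(d)
    for i in reversed(range(d)):
        z[i] = W[:, i] @ z + eps[i]
    x_hat = decoder(z)                        # predict x_hat from z
    recon_loglik = -0.5 * np.sum((x - x_hat) ** 2)  # Gaussian log-lik., up to a constant
    return z, x_hat, recon_loglik

# Illustrative usage with a random linear decoder.
rng = np.random.default_rng(0)
proj = rng.normal(size=(3, 10))
x = rng.normal(size=10)
z, x_hat, ll = forward_pass(np.zeros(3), np.zeros(3), np.log(0.1),
                            lambda z: z @ proj, x)
```

In practice, the reconstruction term would be averaged over data points and combined with the KL terms of equation 7 (or equation 9 below) to form the ELBO being maximized.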
4.3 POSTERIOR PARAMETERIZATIONS AND PRIORS
For linear Gaussian latent SCMs, which are the focus of this work, learning a posterior over $(\mathcal{G}, \Theta)$ is equivalent to learning $q_\phi(W, \Sigma)$ – a posterior over weighted adjacency matrices $W$ and noise covariances $\Sigma$. We follow an approach similar to Cundy et al. (2021). We express $W$ via a permutation matrix $P$ (see footnote 1) and a lower triangular edge weight matrix $L$, according to $W = P^T L^T P$. Here, $L$ is defined in the space of all weighted adjacency matrices with a fixed node ordering, where node $j$ can be a parent of node $i$ only if $j > i$. Search over permutations corresponds to search over different node orderings, and thus $W$ and $\Sigma$ parameterize the space of SCMs. Further, we factorize the approximate posterior $q_\phi(P, L, \Sigma)$ as

$$q_\phi(\mathcal{G}, \Theta) \equiv q_\phi(W, \Sigma) \equiv q_\phi(P, L, \Sigma) = q_\phi(P \mid L, \Sigma) \cdot q_\phi(L, \Sigma). \quad (8)$$
Combining equations 7 and 8 leads to the following ELBO which has to be maximized (derived in A.1), and the overall algorithm for Bayesian latent causal discovery is summarized in algorithm 1,

$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{q_\phi(L, \Sigma)}\Bigg[\mathbb{E}_{q_\phi(P \mid L, \Sigma)}\bigg[\mathbb{E}_{q_\phi(Z \mid P, L, \Sigma)}\big[\log p_\psi(\mathcal{D} \mid Z)\big] - \log \frac{q_\phi(P \mid L, \Sigma)}{p(P)}\bigg] - \log \frac{q_\phi(L, \Sigma)}{p(L)\, p(\Sigma)}\Bigg]. \quad (9)$$
Distribution over $(L, \Sigma)$: The posterior distribution $q_\phi(L, \Sigma)$ has $\left(\frac{d(d-1)}{2} + 1\right)$ elements to be learnt in the equal noise variance setting. This is parameterized as a diagonal covariance normal distribution. For the prior $p(L)$ over the edge weights, we promote sparse DAGs by using a horseshoe prior (Carvalho et al., 2009), similar to Cundy et al. (2021). A Gaussian prior is defined over $\log \Sigma$.
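The horseshoe prior has no closed-form marginal density, but it can be written as a scale mixture of Gaussians with half-Cauchy local scales. Below is a hedged sketch of the joint log-density of that hierarchical form; treating the local scales as explicit latent variables and the particular global scale value are illustrative choices, not necessarily how the prior is implemented in this work.

```python
import numpy as np

def horseshoe_logprior(L_flat, lam, tau=0.1):
    """Joint log-density of the hierarchical horseshoe prior:
        lam_k ~ HalfCauchy(0, 1),   L_k | lam_k ~ Normal(0, (tau * lam_k)^2).

    L_flat : free (strictly lower-triangular) edge weights of L.
    lam    : positive local scales, same shape as L_flat.
    tau    : global scale encouraging sparsity (illustrative value)."""
    # log HalfCauchy(lam; 0, 1) = log(2/pi) - log(1 + lam^2), for lam > 0
    log_half_cauchy = np.log(2.0 / np.pi) - np.log1p(lam ** 2)
    # log Normal(L; 0, (tau * lam)^2)
    scale = tau * lam
    log_normal = -0.5 * (L_flat / scale) ** 2 - np.log(scale) - 0.5 * np.log(2 * np.pi)
    return np.sum(log_half_cauchy + log_normal)
```

Small values of `tau` concentrate the prior mass near zero-weight (sparse) graphs while the heavy-tailed local scales still allow a few large edge weights.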
Distribution over $P$: Since the values of $P$ are discrete, performing a discrete optimization is combinatorial and quickly becomes intractable with increasing $d$. This can be handled by relaxing the discrete permutation learning problem to a continuous optimization problem. This is commonly done by introducing a Gumbel-Sinkhorn (Mena et al., 2018) distribution, where one has to calculate $S((T + \gamma)/\tau)$, where $T$ is the parameter of the Gumbel-Sinkhorn distribution, $\gamma$ is a matrix of standard Gumbel noise, and $\tau$ is a temperature parameter. The logits $T$ are predicted by passing the predicted $(L, \Sigma)$ through an MLP. In the limit of infinite iterations and as $\tau \to 0$, sampling from the distribution returns a doubly stochastic matrix. During the forward pass, a hard permutation $P$ is obtained by using the Hungarian algorithm (Kuhn, 1955), which allows $\tau \to 0$. During the backward pass, a soft permutation is used to calculate gradients, similar to Cundy et al. (2021); Charpentier et al. (2022). We use a uniform prior $p(P)$ over permutations.
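Below is a hedged sketch of the relaxed permutation sampling described above: the Sinkhorn operator applied to $(T + \gamma)/\tau$ gives a soft, approximately doubly stochastic matrix, and the Hungarian algorithm (via `scipy.optimize.linear_sum_assignment`) extracts a hard permutation for the forward pass. The number of Sinkhorn iterations, the straight-through use of the soft matrix for gradients, and all names are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sinkhorn(log_alpha, n_iters=20):
    """Sinkhorn operator S(.): alternately normalize rows and columns in
    log-space, returning a (nearly) doubly stochastic matrix."""
    for _ in range(n_iters):
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=1, keepdims=True)
        log_alpha = log_alpha - np.logaddexp.reduce(log_alpha, axis=0, keepdims=True)
    return np.exp(log_alpha)

def sample_permutation(T, tau=1.0, rng=None):
    """Gumbel-Sinkhorn sample: a soft doubly stochastic matrix S((T + gamma)/tau)
    plus a hard permutation from the Hungarian algorithm. The hard matrix is
    used in the forward pass; the soft matrix would carry gradients in the
    backward pass (straight-through style)."""
    rng = np.random.default_rng(rng)
    gamma = rng.gumbel(size=T.shape)            # standard Gumbel noise
    soft = sinkhorn((T + gamma) / tau)          # relaxed, doubly stochastic
    row, col = linear_sum_assignment(-soft)     # maximize total assignment weight
    hard = np.zeros_like(soft)
    hard[row, col] = 1.0                        # hard permutation matrix P
    return hard, soft

# Example: logits T (e.g., predicted by an MLP from (L, Sigma)) for d = 4 nodes.
T = np.random.default_rng(0).normal(size=(4, 4))
P_hard, P_soft = sample_permutation(T, tau=0.5)
```

The sampled hard permutation would then be combined with the lower-triangular $L$ as $W = P^T L^T P$ to recover a weighted DAG over an arbitrary node ordering.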
¹A permutation matrix $P \in \{0,1\}^{d \times d}$ is a bistochastic matrix with $\sum_i p_{ij} = 1 \;\forall j$ and $\sum_j p_{ij} = 1 \;\forall i$.