NEURAL CAUSAL MODELS FOR COUNTERFACTUAL
IDENTIFICATION AND ESTIMATION
Kevin Xia and Yushu Pan and Elias Bareinboim
Causal Artificial Intelligence Laboratory
Columbia University, USA
{kevinmxia,yushupan,eb}@cs.columbia.edu
ABSTRACT
Evaluating hypothetical statements about how the world would be had a different course of action been taken is arguably one key capability expected from modern AI systems. Counterfactual reasoning underpins discussions in fairness, the determination of blame and responsibility, credit assignment, and regret. In this paper, we study the evaluation of counterfactual statements through neural models. Specifically, we tackle two causal problems required to make such evaluations, i.e., counterfactual identification and estimation from an arbitrary combination of observational and experimental data. First, we show that neural causal models (NCMs) are expressive enough and encode the structural constraints necessary for performing counterfactual reasoning. Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions. We show that this algorithm is sound and complete for deciding counterfactual identification in general settings. Third, considering the practical implications of these results, we introduce a new strategy for modeling NCMs using generative adversarial networks. Simulations corroborate the proposed methodology.
1 INTRODUCTION
Counterfactual reasoning is one of humans' high-level cognitive capabilities, used across a wide range of affairs, including determining how objects interact, assigning responsibility, credit, and blame, and articulating explanations. Counterfactual statements underpin prototypical questions of the form "what if" and "why", which inquire about hypothetical worlds that have not necessarily been realized (Pearl & Mackenzie, 2018). If a patient, Alice, had taken a drug and died, one may wonder: "why did Alice die?"; "was it the drug that killed her?"; "would she be alive had she not taken the drug?". In the context of fairness, why did an applicant, Joe, not get the job offer? Would the outcome have changed had Joe had a Ph.D.? Or perhaps been of a different race? These are examples of fundamental questions about attribution and explanation, which evoke hypothetical scenarios that disagree with the current reality and which not even experimental studies can reconstruct.
We build on the semantics of counterfactuals based on a generative process called the structural causal model (SCM) (Pearl, 2000). A fully instantiated SCM M describes a collection of causal mechanisms and a distribution over exogenous conditions. Each M induces families of qualitatively different distributions related to the activities of seeing (called observational), doing (interventional), and imagining (counterfactual), which together are known as the ladder of causation (Pearl & Mackenzie, 2018; Bareinboim et al., 2022), also called the Pearl Causal Hierarchy (PCH). The PCH is a containment hierarchy in which distributions can be put in increasingly refined layers: observational content goes into layer 1 (L1); experimental to layer 2 (L2); counterfactual to layer 3 (L3). It is understood that there are questions about layers 2 and 3 that cannot be answered (i.e., are underdetermined) even given all information in the world about layer 1; further, layer 3 questions are still underdetermined given data from layers 1 and 2 (Bareinboim et al., 2022; Ibeling & Icard, 2020).
Counterfactuals represent the most detailed and finest type of knowledge encoded in the PCH, so naturally, having the ability to evaluate counterfactual distributions is an attractive proposition. In practice, a fully specified model M is almost never observable, which leads to the question: how can a counterfactual statement, from L3, be evaluated using a combination of observational and experimental data (from L1 and L2)? This question embodies the challenge of cross-layer inferences, which entail solving two challenging causal problems in tandem, identification and estimation.
[Figure 1 schematic: (a) the unobserved nature/truth, i.e., the SCM $M = \langle \mathcal{F}, P(\mathbf{U}) \rangle$ together with the PCH layers L1, L2, L3 it induces; (b) the learned/hypothesized NCM $\widehat{M} = \langle \widehat{\mathcal{F}}, \widehat{P}(\widehat{\mathbf{U}}) \rangle$, linked to (a) through the G-constraint (the causal diagram G) and through training that matches the NCM's L1 and L2 to those of M.]
Figure 1: The l.h.s. contains the true SCM M that induces the PCH's three layers. The r.h.s. contains a neural model $\widehat{M}$ constrained by inductive bias G (entailed by M) and matching M on L1 and L2 through training.
In the more traditional literature of causal inference, there are different symbolic methods for solving these problems in various settings and under different assumptions. In the context of identification, there exists an arsenal of results that includes celebrated methods such as Pearl's do-calculus (Pearl, 1995), going through different algorithmic methods when considering inferences for L2-distributions (Tian & Pearl, 2002; Shpitser & Pearl, 2006; Huang & Valtorta, 2006; Bareinboim & Pearl, 2012; Lee et al., 2019; Lee & Bareinboim, 2020; 2021) and L3-distributions (Heckman, 1992; Pearl, 2001; Avin et al., 2005; Shpitser & Pearl, 2009; Shpitser & Sherman, 2018; Zhang & Bareinboim, 2018; Correa et al., 2021). On the estimation side, there are various methods, including the celebrated Propensity Score/IPW estimators for the backdoor case (Rubin, 1978; Horvitz & Thompson, 1952; Kennedy, 2019; Kallus & Uehara, 2020) and methods for some more relaxed settings (Fulcher et al., 2019; Jung et al., 2020; 2021), but the literature is somewhat scarcer and less developed. In fact, there is a lack of estimation methods for L3-quantities in most settings.
On another thread in the literature, deep learning methods have achieved outstanding empirical success in solving a wide range of tasks in fields such as computer vision (Krizhevsky et al., 2012), speech recognition (Graves & Jaitly, 2014), and game playing (Mnih et al., 2013). One key feature of deep learning is its ability to allow inferences to scale with the data to high-dimensional settings. We study here the suitability of the neural approach to tackle the problems of causal identification and estimation while trying to leverage the benefits of these new advances experienced in non-causal settings.[1]
The idea behind the approach pursued here is illustrated in Fig. 1. Specifically, we will search for a neural model $\widehat{M}$ (r.h.s.) that has the same generative capability as the true, unobserved SCM M (l.h.s.); in other words, $\widehat{M}$ should be able to generate the same observed/inputted data, i.e., $L_1(\widehat{M}) = L_1(M)$ and $L_2(\widehat{M}) = L_2(M)$.[2] To tackle this task in practice, we use an inductive bias for the neural model in the form of a causal diagram (Pearl, 2000; Spirtes et al., 2000; Bareinboim & Pearl, 2016), which is a parsimonious description of the mechanisms (F) and exogenous conditions (P(U)) of the generating SCM.[3] The question then becomes: under what conditions can a model trained using this combination of qualitative inductive bias and the available data be suitable to answer questions about hypothetical counterfactual worlds, as if we had access to the true M?
There exists a growing literature that leverages modern neural methods to solve causal inference tasks. Our approach, based on proxy causal models, will answer causal queries by direct evaluation through a parameterized neural model $\widehat{M}$ fitted on the data generated by M.[4] For instance, some recent work solves the estimation of interventional (L2) or counterfactual (L3) distributions from observational (L1) data in Markovian settings, implemented through architectures such as GANs, flows, GNNs, and VGAEs (Kocaoglu et al., 2018; Pawlowski et al., 2020; Zecevic et al., 2021; Sanchez-Martin et al., 2021). In some real-world settings, Markovianity is too stringent a condition (see discussion in App. D.4) and may be violated, which leads to the separation between layers 1 and 2 and, in turn, to issues of causal identification.[5]
The proxy approach discussed above was pursued in Xia et al. (2021) to solve the identification and estimation of interventional distributions (L2) from observational data (L1) in non-Markovian settings.[6] This work introduced an object we leverage throughout this paper called the Neural Causal Model (NCM, for short), which is a class of SCMs constrained to neural network functions and fixed distributions over the exogenous variables. While NCMs have been shown to be able to solve the identification and estimation tasks for L2-queries, their potential for counterfactual inferences is still largely unexplored, and existing implementations have been constrained to low-dimensional settings.

[1] One of our motivations is that these methods showed great promise at estimating effects from observational data under backdoor/ignorability conditions (Shalit et al., 2017; Louizos et al., 2017; Li & Fu, 2017; Johansson et al., 2016; Yao et al., 2018; Yoon et al., 2018; Kallus, 2020; Shi et al., 2019; Du et al., 2020; Guo et al., 2020).
[2] This represents an extreme case where all L1- and L2-distributions are provided as data. In practice, this may be unrealistic, and our method takes as input any arbitrary subset of distributions from L1 and L2.
[3] When imposed on neural models, they enforce equality constraints connecting layer 1 and layer 2 quantities, defined formally through the causal Bayesian network (CBN) data structure (Bareinboim et al., 2022, Def. 16).
[4] In general, $\widehat{M}$ does not need to be, and will not be, equal to the true SCM M.
[5] Layer 3 differs from lower layers even in Markovian models; see Bareinboim et al. (2022, Ex. 7).
Despite all the progress achieved so far, no practical methods exist for estimating counterfactual (L3) distributions in the general setting where an arbitrary combination of observational (L1) and experimental (L2) distributions is available and unobserved confounders exist (i.e., Markovianity does not hold). Hence, in addition to providing the first neural method of counterfactual identification, this paper establishes the first general counterfactual estimation technique even among non-neural methods, leveraging the neural toolkit for scalable inferences. Specifically, our contributions are:
1. We prove that when fitted with a graphical inductive bias, NCMs encode the L3-constraints necessary for performing counterfactual inference (Thm. 1), and that they are still expressive enough to model the underlying data-generating model, which is not necessarily a neural network (Thm. 2).
2. We show that counterfactual identification within a neural proxy model setting is equivalent to established symbolic approaches (Thm. 3). We leverage this duality to develop an optimization procedure (Alg. 1) for counterfactual identification and estimation that is both sound and complete (Corol. 2). The approach is general in that it accepts any combination of inputs from L1 and L2, it works in any causal diagram setting, and it does not require the Markovianity assumption to hold.
3. We develop a new approach to modeling the NCM using generative adversarial networks (GANs) (Goodfellow et al., 2014), capable of robustly scaling inferences to high dimensions (Alg. 3). We then show how GAN-NCMs can solve the challenging optimization problems involved in identifying and estimating counterfactuals in practice. Experiments are provided in Sec. 5 and proofs in Appendix A.
Preliminaries. We now introduce the notation and definitions used throughout the paper. We use uppercase letters (X) to denote random variables and lowercase letters (x) to denote corresponding values. Similarly, bold uppercase ($\mathbf{X}$) and lowercase ($\mathbf{x}$) letters are used to denote sets of random variables and values, respectively. We use $\mathcal{D}_X$ to denote the domain of X and $\mathcal{D}_{\mathbf{X}} = \mathcal{D}_{X_1} \times \cdots \times \mathcal{D}_{X_k}$ for the domain of $\mathbf{X} = \{X_1, \dots, X_k\}$. We denote by $P(\mathbf{X} = \mathbf{x})$ (which we will often shorten to $P(\mathbf{x})$) the probability of $\mathbf{X}$ taking the values $\mathbf{x}$ under the probability distribution $P(\mathbf{X})$.
We utilize the basic semantic framework of structural causal models (SCMs), as defined in (Pearl, 2000, Ch. 7). An SCM M consists of endogenous variables $\mathbf{V}$, exogenous variables $\mathbf{U}$ with distribution $P(\mathbf{U})$, and mechanisms $\mathcal{F}$. $\mathcal{F}$ contains a function $f_{V_i}$ for each variable $V_i$ that maps the endogenous parents $\mathrm{Pa}_{V_i}$ and exogenous parents $\mathbf{U}_{V_i}$ to $V_i$. Each M induces a causal diagram G, where every $V_i \in \mathbf{V}$ is a vertex, there is a directed arrow $(V_j \to V_i)$ for every $V_i \in \mathbf{V}$ and $V_j \in \mathrm{Pa}_{V_i}$, and there is a dashed bidirected arrow $(V_j \dashleftrightarrow V_i)$ for every pair $V_i, V_j \in \mathbf{V}$ such that $\mathbf{U}_{V_i}$ and $\mathbf{U}_{V_j}$ are not independent. For further details, see (Bareinboim et al., 2022, Def. 13/16, Thm. 4). The exogenous $\mathbf{U}_{V_i}$'s are not assumed independent (i.e., Markovianity is not required). Our treatment is constrained to recursive SCMs, which implies acyclic causal diagrams, with finite domains over $\mathbf{V}$.
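To make the construction of the induced diagram concrete, the following minimal Python sketch (our own illustration, not code from the paper; the SCM signature and names are made up) derives the directed and bidirected edges from a toy SCM's parent sets. Here, sharing an exogenous parent is used as a simple sufficient condition for the dependence between $\mathbf{U}_{V_i}$ and $\mathbf{U}_{V_j}$ required by the definition.

```python
# Toy illustration: derive the causal diagram G induced by an SCM signature.
# Directed edges come from endogenous parents; a (dashed) bidirected edge is
# placed between two variables whenever they share an exogenous parent, a
# simple sufficient condition for U_{V_i} and U_{V_j} being dependent.
from itertools import combinations

# Hypothetical SCM signature: variable -> (endogenous parents, exogenous parents)
parents = {
    "X": ([], ["U_XY"]),           # X <- f_X(U_XY)
    "Z": (["X"], ["U_Z"]),         # Z <- f_Z(X, U_Z)
    "Y": (["Z"], ["U_XY", "U_Y"]), # Y <- f_Y(Z, U_XY, U_Y); U_XY confounds X and Y
}

directed = [(pa, v) for v, (endo, _) in parents.items() for pa in endo]
bidirected = [(a, b) for a, b in combinations(parents, 2)
              if set(parents[a][1]) & set(parents[b][1])]

print("directed:  ", directed)    # [('X', 'Z'), ('Z', 'Y')]
print("bidirected:", bidirected)  # [('X', 'Y')]
```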
Each SCM M assigns values to each counterfactual distribution as follows:

Definition 1 (Layer 3 Valuation). An SCM M induces layer $L_3(M)$, a set of distributions over $\mathbf{V}$, each with the form $P(\mathbf{Y}) = P(\mathbf{Y}_{1[\mathbf{x}_1]}, \mathbf{Y}_{2[\mathbf{x}_2]}, \dots)$ such that

$$P^{M}(\mathbf{y}_{1[\mathbf{x}_1]}, \mathbf{y}_{2[\mathbf{x}_2]}, \dots) = \int_{\mathcal{D}_{\mathbf{U}}} \mathbb{1}\big[\mathbf{Y}_{1[\mathbf{x}_1]}(\mathbf{u}) = \mathbf{y}_1, \ \mathbf{Y}_{2[\mathbf{x}_2]}(\mathbf{u}) = \mathbf{y}_2, \dots\big]\, dP(\mathbf{u}), \qquad (1)$$

where $\mathbf{Y}_{i[\mathbf{x}_i]}(\mathbf{u})$ is evaluated under $\mathcal{F}_{\mathbf{x}_i} := \{f_{V_j} : V_j \in \mathbf{V} \setminus \mathbf{X}_i\} \cup \{f_X \leftarrow x : X \in \mathbf{X}_i\}$.
Each $\mathbf{Y}_i$ corresponds to a set of variables in a world where the original mechanisms $f_X$ are replaced with constants $\mathbf{x}_i$ for each $X \in \mathbf{X}_i$; this is also known as the mutilation procedure. This procedure corresponds to interventions, and we use subscripts to denote the intervening variables (e.g., $\mathbf{Y}_{\mathbf{x}}$) or subscripts with brackets when the variables are indexed (e.g., $\mathbf{Y}_{1[\mathbf{x}_1]}$). For instance, $P(y_x, y'_{x'})$ is the probability of the joint counterfactual event $Y = y$ had $X$ been $x$ and $Y = y'$ had $X$ been $x'$.
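As a concrete illustration of Eq. (1) and the mutilation procedure (our own sketch, with a made-up two-variable model rather than anything from the paper), the joint counterfactual below is approximated by sampling $\mathbf{u} \sim P(\mathbf{U})$ once and evaluating the two mutilated submodels on the same $\mathbf{u}$:

```python
# A minimal sketch of the Layer-3 valuation in Eq. (1): draw u ~ P(U) once,
# evaluate each mutilated submodel F_{x_i} on the SAME u, and average the
# indicator of the joint counterfactual event.
import numpy as np

rng = np.random.default_rng(0)

def sample_u(n):
    # Exogenous distribution P(U) of a toy confounded model.
    return {"u_xy": rng.integers(0, 2, n), "u_y": rng.integers(0, 2, n)}

def f_x(u):
    return u["u_xy"]

def f_y(x, u):
    return (x ^ u["u_xy"]) | u["u_y"]

def evaluate(u, do_x=None):
    # Mutilation: replace f_X by the constant do_x when intervening.
    x = f_x(u) if do_x is None else np.full_like(u["u_xy"], do_x)
    return x, f_y(x, u)

n = 200_000
u = sample_u(n)
_, y_x1 = evaluate(u, do_x=1)   # world where X is held at 1
_, y_x0 = evaluate(u, do_x=0)   # world where X is held at 0 (same u!)

# Joint counterfactual P(Y_{X=1} = 1, Y_{X=0} = 0); about 0.25 in this toy model.
print(np.mean((y_x1 == 1) & (y_x0 == 0)))
```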
[6] Witty et al. (2021) shows a related approach taking the Bayesian route; for further details, see Appendix C.
SCM $M_2$ is said to be $P(L_i)$-consistent (for short, $L_i$-consistent) with SCM $M_1$ if $L_i(M_1) = L_i(M_2)$. We will use Z to denote a set of quantities from Layer 2 (i.e., $Z = \{P(\mathbf{V}_{\mathbf{z}_k})\}_{k=1}^{\ell}$), and we use $Z(M)$ to denote those same quantities induced by SCM M (i.e., $Z(M) = \{P^{M}(\mathbf{V}_{\mathbf{z}_k})\}_{k=1}^{\ell}$).
We use neural causal models (NCMs) as a substitute (proxy) model for the true SCM, as follows:
Definition 2 (G-Constrained Neural Causal Model (G-NCM) (Xia et al., 2021, Def. 7)). Given a causal diagram G, a G-constrained Neural Causal Model (for short, G-NCM) $\widehat{M}(\boldsymbol{\theta})$ over variables $\mathbf{V}$ with parameters $\boldsymbol{\theta} = \{\theta_{V_i} : V_i \in \mathbf{V}\}$ is an SCM $\langle \widehat{\mathbf{U}}, \mathbf{V}, \widehat{\mathcal{F}}, \widehat{P}(\widehat{\mathbf{U}}) \rangle$ such that (i) $\widehat{\mathbf{U}} = \{\widehat{U}_{\mathbf{C}} : \mathbf{C} \in \mathcal{C}(G)\}$, where $\mathcal{C}(G)$ is the set of all maximal cliques over bidirected edges of G, and $\mathcal{D}_{\widehat{U}} = [0, 1]$ for all $\widehat{U} \in \widehat{\mathbf{U}}$; (ii) $\widehat{\mathcal{F}} = \{\hat{f}_{V_i} : V_i \in \mathbf{V}\}$, where each $\hat{f}_{V_i}$ is a feedforward neural network parameterized by $\theta_{V_i} \in \boldsymbol{\theta}$ mapping values of $\mathbf{U}_{V_i} \cup \mathrm{Pa}_{V_i}$ to values of $V_i$, for $\mathbf{U}_{V_i} = \{\widehat{U}_{\mathbf{C}} : \widehat{U}_{\mathbf{C}} \in \widehat{\mathbf{U}} \text{ s.t. } V_i \in \mathbf{C}\}$ and $\mathrm{Pa}_{V_i} = \mathrm{Pa}_G(V_i)$; and (iii) $\widehat{P}(\widehat{\mathbf{U}})$ is defined s.t. $\widehat{U} \sim \mathrm{Unif}(0, 1)$ for each $\widehat{U} \in \widehat{\mathbf{U}}$.
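For concreteness, the following PyTorch sketch (our own illustration; the class, layer sizes, and thresholds are assumptions, not the paper's implementation) instantiates Def. 2 for the two-variable diagram with $X \to Y$ and $X \dashleftrightarrow Y$: the single maximal bidirected clique $\mathbf{C} = \{X, Y\}$ yields one shared noise $\widehat{U}_{\mathbf{C}} \sim \mathrm{Unif}(0,1)$, and each variable gets its own feedforward network.

```python
# A minimal sketch (PyTorch; toy diagram G with X -> Y and X <--> Y) of the
# G-NCM in Def. 2. The only maximal clique over bidirected edges is C = {X, Y},
# so a single exogenous noise U_C ~ Unif(0, 1) is shared by f_X and f_Y.
import torch
import torch.nn as nn

class GNCM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.f_x = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.f_y = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def sample_u(self, n):
        return {"u_c": torch.rand(n, 1)}             # \hat{P}(\hat{U}): Unif(0, 1)

    def forward(self, u, do_x=None):
        x = (torch.sigmoid(self.f_x(u["u_c"])) > 0.5).float()
        if do_x is not None:                         # mutilation: replace f_X by a constant
            x = torch.full_like(x, float(do_x))
        y = (torch.sigmoid(self.f_y(torch.cat([x, u["u_c"]], dim=1))) > 0.5).float()
        return x, y

ncm = GNCM()
u = ncm.sample_u(10_000)
x_obs, y_obs = ncm(u)              # samples from the NCM's L1 (observational) distribution
_, y_do = ncm(u, do_x=1)           # same u, world do(X = 1): enables counterfactual queries
# Actual training (e.g., the GAN-based approach of Alg. 3) would replace the hard
# thresholds with differentiable relaxations and fit theta so the induced
# L1/L2 distributions match the available data.
```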
2 NEURAL CAUSAL MODELS FOR COUNTERFACTUAL INFERENCE
We first recall that inferences about higher layers of the PCH generated by the true SCM M cannot in general be made through an NCM $\widehat{M}$ trained only on lower-layer data (Bareinboim et al., 2022; Xia et al., 2021). This impossibility motivated the use of the inductive bias in the form of a causal diagram G in the construction of the NCM in Def. 2, which ascertains that the G-consistency property holds. (See App. D.1 for further discussion.) We next define consistency w.r.t. each layer, which will be key for a more fine-grained discussion later on.
Definition 3 (G(Li)-Consistency). Let G be the causal diagram induced by the SCM $M^*$. For any SCM M, M is said to be G(Li)-consistent (w.r.t. $M^*$) if $L_i(M)$ satisfies all layer-i equality constraints implied by G.
This generalization is subtle since, regardless of which Li is used with the definition, the causal diagram G generated by M is the same. The difference lies in the implied constraints. For instance, if an SCM M is G(L1)-consistent, that means that G is a Bayesian network for the observational distribution of M, implying independences readable through d-separation (Pearl, 1988). If M is G(L2)-consistent, that means that G is a causal Bayesian network (CBN) (Bareinboim et al., 2022, Def. 16) for the interventional distributions of M. While several SCMs could share the same d-separation constraints as M, there are fewer that share all L2-constraints encoded by the CBN. G-consistency at higher layers imposes a stricter set of constraints, narrowing down the set of compatible SCMs. There also exist constraints of layer 3 that are important for counterfactual inferences.
To motivate the use of such constraints, consider an example inspired by the multi-armed bandit problem. A casino has 3 slot machines, labeled "0", "1", and "2". Every day, the casino assigns one machine a good payout, one a bad payout, and one an average payout, with chances of winning represented by the exogenous variables $U_+$, $U_-$, and $U_=$, respectively. A customer comes every day and plays a slot machine. X represents their choice of machine, and Y is a binary variable representing whether they win. Suppose a data scientist creates a model of the situation, and she hypothesizes that the casino predicts the customer's choice based on their mood ($U_M$) and will always assign the predicted machine the average payout to maintain profits. Her model is described by the SCM $M_0$:
$$
M_0 = \left\langle
\begin{aligned}
&\mathbf{U} = \{U_M, U_+, U_=, U_-\}, \quad U_M \in \{0, 1, 2\}, \ U_+, U_=, U_- \in \{0, 1\} \\
&\mathbf{V} = \{X, Y\}, \quad X \in \{0, 1, 2\}, \ Y \in \{0, 1\} \\
&\mathcal{F} = \left\{
\begin{aligned}
&f_X(u_M) = u_M \\
&f_Y(x, u_M, u_+, u_=, u_-) =
\begin{cases}
u_= & x = u_M \\
u_- & x = (u_M - 1) \bmod 3 \\
u_+ & x = (u_M + 1) \bmod 3
\end{cases}
\end{aligned}
\right. \\
&P(\mathbf{U}): \ P(U_M = i) = \tfrac{1}{3}, \ P(U_+ = 1) = 0.6, \ P(U_= = 1) = 0.4, \ P(U_- = 1) = 0.2
\end{aligned}
\right\rangle
\qquad (2)
$$
It turns out that in this model $P(y_x) = P(y \mid x)$. For example, $P(Y = 1 \mid X = 0) = P(U_= = 1) = 0.4$, and $P(Y_{X=0} = 1) = P(U_M = 0)P(U_= = 1) + P(U_M = 1)P(U_- = 1) + P(U_M = 2)P(U_+ = 1) = \tfrac{1}{3}(0.4) + \tfrac{1}{3}(0.2) + \tfrac{1}{3}(0.6) = 0.4$.
Suppose the true model M employed by the casino (and unknown to the customers and the data scientist) induces graph $G = \{X \to Y\}$. Interestingly enough, $M_0$ would be G(L2)-consistent with M since $M_0$ is compatible with all L2-constraints, including $P(y_x) = P(y \mid x)$ and $P(x_y) = P(x)$. However, and perhaps surprisingly, it would fail to be G(L3)-consistent. A further constraint implied by G on the third layer is that $P(y_x \mid x') = P(y_x)$, which is not true of $M_0$. To witness, note that $P(Y_{X=0} = 1 \mid X = 2) = P(U_+ = 1) = 0.6$ in $M_0$, which means that if the customer chose machine 2, they would have had a higher payout had they chosen machine 0. This does not match $P(Y_{X=0} = 1) = 0.4$, computed earlier, so $M_0$ fails to encode the L3-constraints implied by G.
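The numbers above can be checked mechanically. The short script below (our own verification code, not the authors') enumerates $P(\mathbf{U})$ of $M_0$ and reproduces $P(Y = 1 \mid X = 0)$, $P(Y_{X=0} = 1)$, and $P(Y_{X=0} = 1 \mid X = 2)$:

```python
# Exact enumeration over P(U) of the SCM M0 defined in Eq. (2).
from itertools import product

P_UM = {0: 1/3, 1: 1/3, 2: 1/3}
P_UP, P_UE, P_UN = {1: 0.6, 0: 0.4}, {1: 0.4, 0: 0.6}, {1: 0.2, 0: 0.8}

def f_y(x, um, up, ue, un):
    if x == um:           return ue   # predicted machine gets the average payout
    if x == (um - 1) % 3: return un   # bad payout
    if x == (um + 1) % 3: return up   # good payout

def prob(event, do_x=None):
    # Sum P(u) over exogenous configurations where the event holds. The event
    # sees the *natural* X (the customer's actual choice) and the (possibly
    # counterfactual) Y obtained under the intervened mechanism.
    total = 0.0
    for um, up, ue, un in product(P_UM, P_UP, P_UE, P_UN):
        x_nat = um                                  # f_X(u_M) = u_M
        x = x_nat if do_x is None else do_x
        if event(x_nat, f_y(x, um, up, ue, un)):
            total += P_UM[um] * P_UP[up] * P_UE[ue] * P_UN[un]
    return total

# P(Y = 1 | X = 0)
print(prob(lambda x, y: x == 0 and y == 1) / prob(lambda x, y: x == 0))          # 0.4
# P(Y_{X=0} = 1)
print(prob(lambda x, y: y == 1, do_x=0))                                          # 0.4
# P(Y_{X=0} = 1 | X = 2)
print(prob(lambda x, y: x == 2 and y == 1, do_x=0) / prob(lambda x, y: x == 2))   # 0.6
```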
Figure 2: Model-theoretic visualization of Thms. 1 and 2.
In general, the causal diagram encodes a family of L3-constraints which we leverage to make cross-layer inferences. A more detailed discussion can be found in Appendix D. We show next that NCMs encode all of the equality constraints related to L3, in addition to the known L2-constraints.
Theorem 1 (NCM G(L3)-Consistency). Any G-NCM $\widehat{M}(\boldsymbol{\theta})$ is G(L3)-consistent.
This will be a key result for performing inferences at the counterfactual level. Similar to how constraints about layer 2 distributions help bridge the gap between layers 1 and 2, layer 3 constraints allow us to extend our inference capabilities into layer 3. (In fact, most of L3's distributions are not obtainable through experimentation.) While this graphical inductive bias is powerful, the set of NCMs constrained by G is no less expressive than the set of SCMs constrained by G, as shown next.
Theorem 2 (L3-G Expressiveness). For any SCM M that induces causal diagram G, there exists a G-NCM $\widehat{M}(\boldsymbol{\theta}) = \langle \widehat{\mathbf{U}}, \mathbf{V}, \widehat{\mathcal{F}}, \widehat{P}(\widehat{\mathbf{U}}) \rangle$ s.t. $\widehat{M}$ is L3-consistent w.r.t. M.
This result ascertains that the NCM class is as expressive as, and therefore contains the same generative capabilities as, the original generating model. More interestingly, even if the original SCM M does not belong to the NCM class but comes from the larger space of SCMs, there still exists an NCM $\widehat{M}(\boldsymbol{\theta})$ capable of expressing the collection of distributions from all layers of the PCH induced by M.
A visual representation of these two results is shown in Fig. 2. The space of all SCMs is called $\Omega$, and the subspace that contains all SCMs G(Li)-consistent w.r.t. the true SCM M (black dot) is called $\Omega(G(L_i))$. Note that the $\Omega(G(L_i))$ space shrinks with higher layers, indicating a more constrained space with fewer SCMs. Thm. 1 states that all G-NCMs ($\Omega(G)$) are within $\Omega(G(L_3))$, and Thm. 2 states that all SCMs in $\Omega(G(L_3))$ can be represented by a corresponding G-NCM on all three layers.
It may seem intuitive that the G-NCM has these two properties by construction, but these properties are nontrivial and, in fact, not enjoyed by many model classes. Examples can be found in Appendix D. Together, these two theorems ensure that the NCM has both the constraints and the expressiveness necessary for counterfactual inference, elaborated further in the next section.
3 NEURAL COUNTERFACTUAL IDENTIFICATION
The problem of identification is concerned with determining whether a certain quantity is computable from a combination of assumptions, usually encoded in the form of a causal diagram, and a collection of distributions (Pearl, 2000, p. 77). This challenge stems from the fact that, even though the space of SCMs (or NCMs) is constrained upon assuming a certain causal diagram, the quantity of interest may still be underdetermined. In words, there are many SCMs compatible with the same diagram G that generate different answers for the target distribution. In this section, we investigate the problem of identification and decide whether counterfactual quantities (from L3) can be inferred from a combination of a subset of L2 and L1 datasets together with G, as formally defined next.
Definition 4 (Neural Counterfactual Identification). Consider an SCM M and the corresponding causal diagram G. Let $Z = \{P(\mathbf{V}_{\mathbf{z}_k})\}_{k=1}^{\ell}$ be a collection of available interventional (or observational, if $\mathbf{Z}_k = \emptyset$) distributions from M. The counterfactual query $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$ is said to be neural identifiable (identifiable, for short) from the set of G-constrained NCMs $\Omega(G)$ and Z if and only if $P^{\widehat{M}_1}(\mathbf{y} \mid \mathbf{x}) = P^{\widehat{M}_2}(\mathbf{y} \mid \mathbf{x})$ for every pair of models $\widehat{M}_1, \widehat{M}_2 \in \Omega(G)$ s.t. they match M on all distributions in Z (i.e., $Z(M) = Z(\widehat{M}_1) = Z(\widehat{M}_2) > 0$).
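Operationally, Def. 4 suggests a search over $\Omega(G)$: the query is identifiable exactly when no two G-NCMs that match Z can disagree on it. The scaffold below is a conceptual sketch in the spirit of the paper's optimization procedure (Alg. 1), not the algorithm itself; the names, the penalty-style loss, and the hyperparameters are illustrative assumptions. Here, `data_loss` would penalize mismatch with the distributions in Z, and `query` would evaluate the target counterfactual on the NCM, e.g., via the Monte Carlo counterpart of Eq. (1).

```python
# Conceptual min/max identification test implied by Def. 4: fit two G-NCMs that
# both match the available distributions Z while one maximizes and the other
# minimizes the target query. A near-zero achievable gap indicates
# identifiability, in which case either model's value serves as the estimate.
import torch

def fit_extremum(ncm, query, data_loss, sign, steps=1000, lam=10.0, lr=1e-3):
    """sign = +1 to (approximately) maximize the query, -1 to minimize it."""
    opt = torch.optim.Adam(ncm.parameters(), lr=lr)
    for _ in range(steps):
        # Penalize mismatch with Z; push the query toward its extremum.
        loss = lam * data_loss(ncm) - sign * query(ncm)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return query(ncm).item()

def identification_gap(make_ncm, query, data_loss):
    q_max = fit_extremum(make_ncm(), query, data_loss, sign=+1)
    q_min = fit_extremum(make_ncm(), query, data_loss, sign=-1)
    return q_max - q_min   # ~0 => identifiable; clearly > 0 => underdetermined
```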