NEURAL CAUSAL MODELS FOR COUNTERFACTUAL
IDENTIFICATION AND ESTIMATION
Kevin Xia and Yushu Pan and Elias Bareinboim
Causal Artificial Intelligence Laboratory
Columbia University, USA
{kevinmxia,yushupan,eb}@cs.columbia.edu
ABSTRACT
Evaluating hypothetical statements about how the world would be had a different course of action been taken is arguably one key capability expected from modern AI systems. Counterfactual reasoning underpins discussions in fairness, the determination of blame and responsibility, credit assignment, and regret. In this paper, we study the evaluation of counterfactual statements through neural models. Specifically, we tackle two causal problems required to make such evaluations, i.e., counterfactual identification and estimation from an arbitrary combination of observational and experimental data. First, we show that neural causal models (NCMs) are expressive enough and encode the structural constraints necessary for performing counterfactual reasoning. Second, we develop an algorithm for simultaneously identifying and estimating counterfactual distributions. We show that this algorithm is sound and complete for deciding counterfactual identification in general settings. Third, considering the practical implications of these results, we introduce a new strategy for modeling NCMs using generative adversarial networks. Simulations corroborate the proposed methodology.
1 INTRODUCTION
Counterfactual reasoning is one of humans' high-level cognitive capabilities, used across a wide range of affairs, including determining how objects interact, assigning responsibility, credit, and blame, and articulating explanations. Counterfactual statements underpin prototypical questions of the form "what if" and "why", which inquire about hypothetical worlds that have not necessarily been realized (Pearl & Mackenzie, 2018). If a patient, Alice, had taken a drug and died, one may wonder: "why did Alice die?"; "was it the drug that killed her?"; "would she be alive had she not taken the drug?". In the context of fairness, why did an applicant, Joe, not get the job offer? Would the outcome have changed had Joe had a Ph.D.? Or perhaps been of a different race? These are examples of fundamental questions about attribution and explanation, which evoke hypothetical scenarios that disagree with the current reality and which not even experimental studies can reconstruct.
We build on the semantics of counterfactuals based on a generative process called the structural causal model (SCM) (Pearl, 2000). A fully instantiated SCM M describes a collection of causal mechanisms and a distribution over exogenous conditions. Each M induces families of qualitatively different distributions related to the activities of seeing (called observational), doing (interventional), and imagining (counterfactual), which together are known as the ladder of causation (Pearl & Mackenzie, 2018; Bareinboim et al., 2022), also called the Pearl Causal Hierarchy (PCH). The PCH is a containment hierarchy in which distributions can be put in increasingly refined layers: observational content goes into layer 1 (L1); experimental to layer 2 (L2); counterfactual to layer 3 (L3). It is understood that there are questions about layers 2 and 3 that cannot be answered (i.e., are underdetermined) even given all information in the world about layer 1; further, layer 3 questions are still underdetermined given data from layers 1 and 2 (Bareinboim et al., 2022; Ibeling & Icard, 2020).
Counterfactuals represent the most detailed and finest type of knowledge encoded in the PCH, so naturally, having the ability to evaluate counterfactual distributions is an attractive proposition. In practice, a fully specified model M is almost never observable, which leads to the question: how can a counterfactual statement, from L3, be evaluated using a combination of observational and experimental data (from L1 and L2)? This question embodies the challenge of cross-layer inferences, which entail solving two challenging causal problems in tandem, identification and estimation.
[Figure 1 schematic: (a) the unobserved nature/truth, i.e., the SCM $M = \langle \mathcal{F}, P(\mathbf{U}) \rangle$ together with the PCH layers L1, L2, L3 it induces; (b) the learned/hypothesized NCM $\widehat{M} = \langle \widehat{\mathcal{F}}, \widehat{P}(\widehat{\mathbf{U}}) \rangle$, linked to (a) through the G-constraint (the causal diagram G) and through training that matches the NCM's L1 and L2 to those of M.]
Figure 1: The l.h.s. contains the true SCM M that induces the PCH's three layers. The r.h.s. contains a neural model $\widehat{M}$ constrained by inductive bias G (entailed by M) and matching M on L1 and L2 through training.
In the more traditional literature of causal inference, there are different symbolic methods for solving these problems in various settings and under different assumptions. In the context of identification, there exists an arsenal of results that includes celebrated methods such as Pearl's do-calculus (Pearl, 1995), going through different algorithmic methods when considering inferences for L2-distributions (Tian & Pearl, 2002; Shpitser & Pearl, 2006; Huang & Valtorta, 2006; Bareinboim & Pearl, 2012; Lee et al., 2019; Lee & Bareinboim, 2020; 2021) and L3-distributions (Heckman, 1992; Pearl, 2001; Avin et al., 2005; Shpitser & Pearl, 2009; Shpitser & Sherman, 2018; Zhang & Bareinboim, 2018; Correa et al., 2021). On the estimation side, there are various methods, including the celebrated Propensity Score/IPW estimators for the backdoor case (Rubin, 1978; Horvitz & Thompson, 1952; Kennedy, 2019; Kallus & Uehara, 2020) and methods for some more relaxed settings (Fulcher et al., 2019; Jung et al., 2020; 2021), but the literature is somewhat scarcer and less developed. In fact, there is a lack of estimation methods for L3-quantities in most settings.
On another thread in the literature, deep learning methods have achieved outstanding empirical success in solving a wide range of tasks in fields such as computer vision (Krizhevsky et al., 2012), speech recognition (Graves & Jaitly, 2014), and game playing (Mnih et al., 2013). One key feature of deep learning is its ability to allow inferences to scale with the data to high-dimensional settings. We study here the suitability of the neural approach to tackle the problems of causal identification and estimation while trying to leverage the benefits of these new advances experienced in non-causal settings.[1]
The idea behind the approach pursued here is illustrated in Fig. 1. Specifically, we will search for a neural model $\widehat{M}$ (r.h.s.) that has the same generative capability as the true, unobserved SCM M (l.h.s.); in other words, $\widehat{M}$ should be able to generate the same observed/inputted data, i.e., $L_1(\widehat{M}) = L_1(M)$ and $L_2(\widehat{M}) = L_2(M)$.[2] To tackle this task in practice, we use an inductive bias for the neural model in the form of a causal diagram (Pearl, 2000; Spirtes et al., 2000; Bareinboim & Pearl, 2016), which is a parsimonious description of the mechanisms (F) and exogenous conditions (P(U)) of the generating SCM.[3] The question then becomes: under what conditions can a model trained using this combination of qualitative inductive bias and the available data be suitable to answer questions about hypothetical counterfactual worlds, as if we had access to the true M?
There exists a growing literature that leverages modern neural methods to solve causal inference tasks. Our approach, based on proxy causal models, will answer causal queries by direct evaluation through a parameterized neural model $\widehat{M}$ fitted on the data generated by M.[4] For instance, some recent work solves the estimation of interventional (L2) or counterfactual (L3) distributions from observational (L1) data in Markovian settings, implemented through architectures such as GANs, flows, GNNs, and VGAEs (Kocaoglu et al., 2018; Pawlowski et al., 2020; Zecevic et al., 2021; Sanchez-Martin et al., 2021). In some real-world settings, Markovianity is too stringent a condition (see discussion in App. D.4) and may be violated, which leads to the separation between layers 1 and 2 and, in turn, to issues of causal identification.[5]
The proxy approach discussed above was pursued in Xia et al. (2021) to solve the identification and estimation of interventional distributions (L2) from observational data (L1) in non-Markovian settings.[6] This work introduced an object we leverage throughout this paper called the Neural Causal Model (NCM, for short), which is a class of SCMs constrained to neural network functions and fixed distributions over the exogenous variables. While NCMs have been shown to be able to solve the identification and estimation tasks for L2-queries, their potential for counterfactual inferences is still largely unexplored, and existing implementations have been constrained to low-dimensional settings.

[1] One of our motivations is that these methods showed great promise at estimating effects from observational data under backdoor/ignorability conditions (Shalit et al., 2017; Louizos et al., 2017; Li & Fu, 2017; Johansson et al., 2016; Yao et al., 2018; Yoon et al., 2018; Kallus, 2020; Shi et al., 2019; Du et al., 2020; Guo et al., 2020).
[2] This represents an extreme case where all L1- and L2-distributions are provided as data. In practice, this may be unrealistic, and our method takes as input any arbitrary subset of distributions from L1 and L2.
[3] When imposed on neural models, they enforce equality constraints connecting layer 1 and layer 2 quantities, defined formally through the causal Bayesian network (CBN) data structure (Bareinboim et al., 2022, Def. 16).
[4] In general, $\widehat{M}$ does not need to be, and will not be, equal to the true SCM M.
[5] Layer 3 differs from lower layers even in Markovian models; see Bareinboim et al. (2022, Ex. 7).
Despite all the progress achieved so far, no practical methods exist for estimating counterfactual (L3) distributions in the general setting where an arbitrary combination of observational (L1) and experimental (L2) distributions is available and unobserved confounders exist (i.e., Markovianity does not hold). Hence, in addition to providing the first neural method of counterfactual identification, this paper establishes the first general counterfactual estimation technique even among non-neural methods, leveraging the neural toolkit for scalable inferences. Specifically, our contributions are:
1. We prove that when fitted with a graphical inductive bias, NCMs encode the L3-constraints necessary for performing counterfactual inference (Thm. 1), and that they are still expressive enough to model the underlying data-generating model, which is not necessarily a neural network (Thm. 2).
2. We show that counterfactual identification within a neural proxy model setting is equivalent to established symbolic approaches (Thm. 3). We leverage this duality to develop an optimization procedure (Alg. 1) for counterfactual identification and estimation that is both sound and complete (Corol. 2). The approach is general in that it accepts any combination of inputs from L1 and L2, it works in any causal diagram setting, and it does not require the Markovianity assumption to hold.
3. We develop a new approach to modeling the NCM using generative adversarial networks (GANs) (Goodfellow et al., 2014), capable of robustly scaling inferences to high dimensions (Alg. 3). We then show how GAN-NCMs can solve the challenging optimization problems involved in identifying and estimating counterfactuals in practice. Experiments are provided in Sec. 5 and proofs in Appendix A.
Preliminaries. We now introduce the notation and definitions used throughout the paper. We use uppercase letters (X) to denote random variables and lowercase letters (x) to denote corresponding values. Similarly, bold uppercase ($\mathbf{X}$) and lowercase ($\mathbf{x}$) letters are used to denote sets of random variables and values, respectively. We use $\mathcal{D}_X$ to denote the domain of X and $\mathcal{D}_{\mathbf{X}} = \mathcal{D}_{X_1} \times \cdots \times \mathcal{D}_{X_k}$ for the domain of $\mathbf{X} = \{X_1, \dots, X_k\}$. We denote by $P(\mathbf{X} = \mathbf{x})$ (which we will often shorten to $P(\mathbf{x})$) the probability of $\mathbf{X}$ taking the values $\mathbf{x}$ under the probability distribution $P(\mathbf{X})$.
We utilize the basic semantic framework of structural causal models (SCMs), as defined in (Pearl, 2000, Ch. 7). An SCM M consists of endogenous variables $\mathbf{V}$, exogenous variables $\mathbf{U}$ with distribution $P(\mathbf{U})$, and mechanisms $\mathcal{F}$. $\mathcal{F}$ contains a function $f_{V_i}$ for each variable $V_i$ that maps the endogenous parents $\mathrm{Pa}_{V_i}$ and exogenous parents $\mathbf{U}_{V_i}$ to $V_i$. Each M induces a causal diagram G, where every $V_i \in \mathbf{V}$ is a vertex, there is a directed arrow $(V_j \to V_i)$ for every $V_i \in \mathbf{V}$ and $V_j \in \mathrm{Pa}_{V_i}$, and there is a dashed bidirected arrow $(V_j \dashleftrightarrow V_i)$ for every pair $V_i, V_j \in \mathbf{V}$ such that $\mathbf{U}_{V_i}$ and $\mathbf{U}_{V_j}$ are not independent. For further details, see (Bareinboim et al., 2022, Def. 13/16, Thm. 4). The exogenous $\mathbf{U}_{V_i}$'s are not assumed independent (i.e., Markovianity is not required). Our treatment is constrained to recursive SCMs, which implies acyclic causal diagrams, with finite domains over $\mathbf{V}$.
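To make the construction of the induced diagram concrete, the following minimal Python sketch (our own illustration, not code from the paper; the SCM signature and names are made up) derives the directed and bidirected edges from a toy SCM's parent sets. Here, sharing an exogenous parent is used as a simple sufficient condition for the dependence between $\mathbf{U}_{V_i}$ and $\mathbf{U}_{V_j}$ required by the definition.

```python
# Toy illustration: derive the causal diagram G induced by an SCM signature.
# Directed edges come from endogenous parents; a (dashed) bidirected edge is
# placed between two variables whenever they share an exogenous parent, a
# simple sufficient condition for U_{V_i} and U_{V_j} being dependent.
from itertools import combinations

# Hypothetical SCM signature: variable -> (endogenous parents, exogenous parents)
parents = {
    "X": ([], ["U_XY"]),           # X <- f_X(U_XY)
    "Z": (["X"], ["U_Z"]),         # Z <- f_Z(X, U_Z)
    "Y": (["Z"], ["U_XY", "U_Y"]), # Y <- f_Y(Z, U_XY, U_Y); U_XY confounds X and Y
}

directed = [(pa, v) for v, (endo, _) in parents.items() for pa in endo]
bidirected = [(a, b) for a, b in combinations(parents, 2)
              if set(parents[a][1]) & set(parents[b][1])]

print("directed:  ", directed)    # [('X', 'Z'), ('Z', 'Y')]
print("bidirected:", bidirected)  # [('X', 'Y')]
```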
Each SCM M assigns values to each counterfactual distribution as follows:

Definition 1 (Layer 3 Valuation). An SCM M induces layer $L_3(M)$, a set of distributions over $\mathbf{V}$, each with the form $P(\mathbf{Y}) = P(\mathbf{Y}_{1[\mathbf{x}_1]}, \mathbf{Y}_{2[\mathbf{x}_2]}, \dots)$ such that

$$P^{M}(\mathbf{y}_{1[\mathbf{x}_1]}, \mathbf{y}_{2[\mathbf{x}_2]}, \dots) = \int_{\mathcal{D}_{\mathbf{U}}} \mathbb{1}\big[\mathbf{Y}_{1[\mathbf{x}_1]}(\mathbf{u}) = \mathbf{y}_1, \ \mathbf{Y}_{2[\mathbf{x}_2]}(\mathbf{u}) = \mathbf{y}_2, \dots\big]\, dP(\mathbf{u}), \qquad (1)$$

where $\mathbf{Y}_{i[\mathbf{x}_i]}(\mathbf{u})$ is evaluated under $\mathcal{F}_{\mathbf{x}_i} := \{f_{V_j} : V_j \in \mathbf{V} \setminus \mathbf{X}_i\} \cup \{f_X \leftarrow x : X \in \mathbf{X}_i\}$.
Each $\mathbf{Y}_i$ corresponds to a set of variables in a world where the original mechanisms $f_X$ are replaced with constants $\mathbf{x}_i$ for each $X \in \mathbf{X}_i$; this is also known as the mutilation procedure. This procedure corresponds to interventions, and we use subscripts to denote the intervening variables (e.g., $\mathbf{Y}_{\mathbf{x}}$) or subscripts with brackets when the variables are indexed (e.g., $\mathbf{Y}_{1[\mathbf{x}_1]}$). For instance, $P(y_x, y'_{x'})$ is the probability of the joint counterfactual event $Y = y$ had $X$ been $x$ and $Y = y'$ had $X$ been $x'$.
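As a concrete illustration of Eq. (1) and the mutilation procedure (our own sketch, with a made-up two-variable model rather than anything from the paper), the joint counterfactual below is approximated by sampling $\mathbf{u} \sim P(\mathbf{U})$ once and evaluating the two mutilated submodels on the same $\mathbf{u}$:

```python
# A minimal sketch of the Layer-3 valuation in Eq. (1): draw u ~ P(U) once,
# evaluate each mutilated submodel F_{x_i} on the SAME u, and average the
# indicator of the joint counterfactual event.
import numpy as np

rng = np.random.default_rng(0)

def sample_u(n):
    # Exogenous distribution P(U) of a toy confounded model.
    return {"u_xy": rng.integers(0, 2, n), "u_y": rng.integers(0, 2, n)}

def f_x(u):
    return u["u_xy"]

def f_y(x, u):
    return (x ^ u["u_xy"]) | u["u_y"]

def evaluate(u, do_x=None):
    # Mutilation: replace f_X by the constant do_x when intervening.
    x = f_x(u) if do_x is None else np.full_like(u["u_xy"], do_x)
    return x, f_y(x, u)

n = 200_000
u = sample_u(n)
_, y_x1 = evaluate(u, do_x=1)   # world where X is held at 1
_, y_x0 = evaluate(u, do_x=0)   # world where X is held at 0 (same u!)

# Joint counterfactual P(Y_{X=1} = 1, Y_{X=0} = 0); about 0.25 in this toy model.
print(np.mean((y_x1 == 1) & (y_x0 == 0)))
```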
[6] Witty et al. (2021) shows a related approach taking the Bayesian route; for further details, see Appendix C.
SCM $M_2$ is said to be $P(L_i)$-consistent (for short, $L_i$-consistent) with SCM $M_1$ if $L_i(M_1) = L_i(M_2)$. We will use Z to denote a set of quantities from Layer 2 (i.e., $Z = \{P(\mathbf{V}_{\mathbf{z}_k})\}_{k=1}^{\ell}$), and we use $Z(M)$ to denote those same quantities induced by SCM M (i.e., $Z(M) = \{P^{M}(\mathbf{V}_{\mathbf{z}_k})\}_{k=1}^{\ell}$).
We use neural causal models (NCMs) as a substitute (proxy) model for the true SCM, as follows:
Definition 2 (G-Constrained Neural Causal Model (G-NCM) (Xia et al., 2021, Def. 7)). Given a causal diagram G, a G-constrained Neural Causal Model (for short, G-NCM) $\widehat{M}(\boldsymbol{\theta})$ over variables $\mathbf{V}$ with parameters $\boldsymbol{\theta} = \{\theta_{V_i} : V_i \in \mathbf{V}\}$ is an SCM $\langle \widehat{\mathbf{U}}, \mathbf{V}, \widehat{\mathcal{F}}, \widehat{P}(\widehat{\mathbf{U}}) \rangle$ such that (i) $\widehat{\mathbf{U}} = \{\widehat{U}_{\mathbf{C}} : \mathbf{C} \in \mathcal{C}(G)\}$, where $\mathcal{C}(G)$ is the set of all maximal cliques over bidirected edges of G, and $\mathcal{D}_{\widehat{U}} = [0, 1]$ for all $\widehat{U} \in \widehat{\mathbf{U}}$; (ii) $\widehat{\mathcal{F}} = \{\hat{f}_{V_i} : V_i \in \mathbf{V}\}$, where each $\hat{f}_{V_i}$ is a feedforward neural network parameterized by $\theta_{V_i} \in \boldsymbol{\theta}$ mapping values of $\mathbf{U}_{V_i} \cup \mathrm{Pa}_{V_i}$ to values of $V_i$, for $\mathbf{U}_{V_i} = \{\widehat{U}_{\mathbf{C}} : \widehat{U}_{\mathbf{C}} \in \widehat{\mathbf{U}} \text{ s.t. } V_i \in \mathbf{C}\}$ and $\mathrm{Pa}_{V_i} = \mathrm{Pa}_G(V_i)$; and (iii) $\widehat{P}(\widehat{\mathbf{U}})$ is defined s.t. $\widehat{U} \sim \mathrm{Unif}(0, 1)$ for each $\widehat{U} \in \widehat{\mathbf{U}}$.
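For concreteness, the following PyTorch sketch (our own illustration; the class, layer sizes, and thresholds are assumptions, not the paper's implementation) instantiates Def. 2 for the two-variable diagram with $X \to Y$ and $X \dashleftrightarrow Y$: the single maximal bidirected clique $\mathbf{C} = \{X, Y\}$ yields one shared noise $\widehat{U}_{\mathbf{C}} \sim \mathrm{Unif}(0,1)$, and each variable gets its own feedforward network.

```python
# A minimal sketch (PyTorch; toy diagram G with X -> Y and X <--> Y) of the
# G-NCM in Def. 2. The only maximal clique over bidirected edges is C = {X, Y},
# so a single exogenous noise U_C ~ Unif(0, 1) is shared by f_X and f_Y.
import torch
import torch.nn as nn

class GNCM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.f_x = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.f_y = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def sample_u(self, n):
        return {"u_c": torch.rand(n, 1)}             # \hat{P}(\hat{U}): Unif(0, 1)

    def forward(self, u, do_x=None):
        x = (torch.sigmoid(self.f_x(u["u_c"])) > 0.5).float()
        if do_x is not None:                         # mutilation: replace f_X by a constant
            x = torch.full_like(x, float(do_x))
        y = (torch.sigmoid(self.f_y(torch.cat([x, u["u_c"]], dim=1))) > 0.5).float()
        return x, y

ncm = GNCM()
u = ncm.sample_u(10_000)
x_obs, y_obs = ncm(u)              # samples from the NCM's L1 (observational) distribution
_, y_do = ncm(u, do_x=1)           # same u, world do(X = 1): enables counterfactual queries
# Actual training (e.g., the GAN-based approach of Alg. 3) would replace the hard
# thresholds with differentiable relaxations and fit theta so the induced
# L1/L2 distributions match the available data.
```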
2 NEURAL CAUSAL MODELS FOR COUNTERFACTUAL INFERENCE
We first recall that inferences about higher layers of the PCH generated by the true SCM M cannot in general be made through an NCM $\widehat{M}$ trained only on lower-layer data (Bareinboim et al., 2022; Xia et al., 2021). This impossibility motivated the use of the inductive bias in the form of a causal diagram G in the construction of the NCM in Def. 2, which ascertains that the G-consistency property holds. (See App. D.1 for further discussion.) We next define consistency w.r.t. each layer, which will be key for a more fine-grained discussion later on.
Definition 3 (G(Li)-Consistency). Let G be the causal diagram induced by the SCM $M^*$. For any SCM M, M is said to be G(Li)-consistent (w.r.t. $M^*$) if $L_i(M)$ satisfies all layer-i equality constraints implied by G.
This generalization is subtle since, regardless of which Li is used with the definition, the causal diagram G generated by M is the same. The difference lies in the implied constraints. For instance, if an SCM M is G(L1)-consistent, that means that G is a Bayesian network for the observational distribution of M, implying independences readable through d-separation (Pearl, 1988). If M is G(L2)-consistent, that means that G is a causal Bayesian network (CBN) (Bareinboim et al., 2022, Def. 16) for the interventional distributions of M. While several SCMs could share the same d-separation constraints as M, there are fewer that share all L2-constraints encoded by the CBN. G-consistency at higher layers imposes a stricter set of constraints, narrowing down the set of compatible SCMs. There also exist constraints of layer 3 that are important for counterfactual inferences.
To motivate the use of such constraints, consider an example inspired by the multi-armed bandit problem. A casino has 3 slot machines, labeled "0", "1", and "2". Every day, the casino assigns one machine a good payout, one a bad payout, and one an average payout, with chances of winning represented by the exogenous variables $U_+$, $U_-$, and $U_=$, respectively. A customer comes every day and plays a slot machine. X represents their choice of machine, and Y is a binary variable representing whether they win. Suppose a data scientist creates a model of the situation, and she hypothesizes that the casino predicts the customer's choice based on their mood ($U_M$) and will always assign the predicted machine the average payout to maintain profits. Her model is described by the SCM $M_0$:
$$
M_0 = \left\langle
\begin{aligned}
&\mathbf{U} = \{U_M, U_+, U_=, U_-\}, \quad U_M \in \{0, 1, 2\}, \ U_+, U_=, U_- \in \{0, 1\} \\
&\mathbf{V} = \{X, Y\}, \quad X \in \{0, 1, 2\}, \ Y \in \{0, 1\} \\
&\mathcal{F} = \left\{
\begin{aligned}
&f_X(u_M) = u_M \\
&f_Y(x, u_M, u_+, u_=, u_-) =
\begin{cases}
u_= & x = u_M \\
u_- & x = (u_M - 1) \bmod 3 \\
u_+ & x = (u_M + 1) \bmod 3
\end{cases}
\end{aligned}
\right. \\
&P(\mathbf{U}): \ P(U_M = i) = \tfrac{1}{3}, \ P(U_+ = 1) = 0.6, \ P(U_= = 1) = 0.4, \ P(U_- = 1) = 0.2
\end{aligned}
\right\rangle
\qquad (2)
$$
It turns out that in this model $P(y_x) = P(y \mid x)$. For example, $P(Y = 1 \mid X = 0) = P(U_= = 1) = 0.4$, and $P(Y_{X=0} = 1) = P(U_M = 0)P(U_= = 1) + P(U_M = 1)P(U_- = 1) + P(U_M = 2)P(U_+ = 1) = \tfrac{1}{3}(0.4) + \tfrac{1}{3}(0.2) + \tfrac{1}{3}(0.6) = 0.4$.
Suppose the true model M employed by the casino (and unknown to the customers and the data scientist) induces graph $G = \{X \to Y\}$. Interestingly enough, $M_0$ would be G(L2)-consistent with M since $M_0$ is compatible with all L2-constraints, including $P(y_x) = P(y \mid x)$ and $P(x_y) = P(x)$. However, and perhaps surprisingly, it would fail to be G(L3)-consistent. A further constraint implied by G on the third layer is that $P(y_x \mid x') = P(y_x)$, which is not true of $M_0$. To witness, note that $P(Y_{X=0} = 1 \mid X = 2) = P(U_+ = 1) = 0.6$ in $M_0$, which means that if the customer chose machine 2, they would have had a higher payout had they chosen machine 0. This does not match $P(Y_{X=0} = 1) = 0.4$, computed earlier, so $M_0$ fails to encode the L3-constraints implied by G.
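The numbers above can be checked mechanically. The short script below (our own verification code, not the authors') enumerates $P(\mathbf{U})$ of $M_0$ and reproduces $P(Y = 1 \mid X = 0)$, $P(Y_{X=0} = 1)$, and $P(Y_{X=0} = 1 \mid X = 2)$:

```python
# Exact enumeration over P(U) of the SCM M0 defined in Eq. (2).
from itertools import product

P_UM = {0: 1/3, 1: 1/3, 2: 1/3}
P_UP, P_UE, P_UN = {1: 0.6, 0: 0.4}, {1: 0.4, 0: 0.6}, {1: 0.2, 0: 0.8}

def f_y(x, um, up, ue, un):
    if x == um:           return ue   # predicted machine gets the average payout
    if x == (um - 1) % 3: return un   # bad payout
    if x == (um + 1) % 3: return up   # good payout

def prob(event, do_x=None):
    # Sum P(u) over exogenous configurations where the event holds. The event
    # sees the *natural* X (the customer's actual choice) and the (possibly
    # counterfactual) Y obtained under the intervened mechanism.
    total = 0.0
    for um, up, ue, un in product(P_UM, P_UP, P_UE, P_UN):
        x_nat = um                                  # f_X(u_M) = u_M
        x = x_nat if do_x is None else do_x
        if event(x_nat, f_y(x, um, up, ue, un)):
            total += P_UM[um] * P_UP[up] * P_UE[ue] * P_UN[un]
    return total

# P(Y = 1 | X = 0)
print(prob(lambda x, y: x == 0 and y == 1) / prob(lambda x, y: x == 0))          # 0.4
# P(Y_{X=0} = 1)
print(prob(lambda x, y: y == 1, do_x=0))                                          # 0.4
# P(Y_{X=0} = 1 | X = 2)
print(prob(lambda x, y: x == 2 and y == 1, do_x=0) / prob(lambda x, y: x == 2))   # 0.6
```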
Figure 2: Model-theoretic visualization of Thms. 1 and 2.
In general, the causal diagram encodes a family of L3-constraints which we leverage to make cross-layer inferences. A more detailed discussion can be found in Appendix D. We show next that NCMs encode all of the equality constraints related to L3, in addition to the known L2-constraints.
Theorem 1 (NCM G(L3)-Consistency). Any G-NCM $\widehat{M}(\boldsymbol{\theta})$ is G(L3)-consistent.
This will be a key result for performing inferences at the counterfactual level. Similar to how constraints about layer 2 distributions help bridge the gap between layers 1 and 2, layer 3 constraints allow us to extend our inference capabilities into layer 3. (In fact, most of L3's distributions are not obtainable through experimentation.) While this graphical inductive bias is powerful, the set of NCMs constrained by G is no less expressive than the set of SCMs constrained by G, as shown next.
Theorem 2 (L3-G Expressiveness). For any SCM M that induces causal diagram G, there exists a G-NCM $\widehat{M}(\boldsymbol{\theta}) = \langle \widehat{\mathbf{U}}, \mathbf{V}, \widehat{\mathcal{F}}, \widehat{P}(\widehat{\mathbf{U}}) \rangle$ s.t. $\widehat{M}$ is L3-consistent w.r.t. M.
This result ascertains that the NCM class is as expressive as, and therefore contains the same generative capabilities as, the original generating model. More interestingly, even if the original SCM M does not belong to the NCM class but comes from the larger space of SCMs, there still exists an NCM $\widehat{M}(\boldsymbol{\theta})$ capable of expressing the collection of distributions from all layers of the PCH induced by M.
A visual representation of these two results is shown in Fig. 2. The space of all SCMs is called $\Omega$, and the subspace that contains all SCMs G(Li)-consistent w.r.t. the true SCM M (black dot) is called $\Omega(G(L_i))$. Note that the $\Omega(G(L_i))$ space shrinks with higher layers, indicating a more constrained space with fewer SCMs. Thm. 1 states that all G-NCMs ($\Omega(G)$) are within $\Omega(G(L_3))$, and Thm. 2 states that all SCMs in $\Omega(G(L_3))$ can be represented by a corresponding G-NCM on all three layers.
It may seem intuitive that the G-NCM has these two properties by construction, but these properties are nontrivial and, in fact, not enjoyed by many model classes. Examples can be found in Appendix D. Together, these two theorems ensure that the NCM has both the constraints and the expressiveness necessary for counterfactual inference, elaborated further in the next section.
3 NEURAL COUNTERFACTUAL IDENTIFICATION
The problem of identification is concerned with determining whether a certain quantity is computable from a combination of assumptions, usually encoded in the form of a causal diagram, and a collection of distributions (Pearl, 2000, p. 77). This challenge stems from the fact that, even though the space of SCMs (or NCMs) is constrained upon assuming a certain causal diagram, the quantity of interest may still be underdetermined. In words, there are many SCMs compatible with the same diagram G that generate different answers for the target distribution. In this section, we investigate the problem of identification and decide whether counterfactual quantities (from L3) can be inferred from a combination of a subset of L2 and L1 datasets together with G, as formally defined next.
Definition 4 (Neural Counterfactual Identification). Consider an SCM M and the corresponding causal diagram G. Let $Z = \{P(\mathbf{V}_{\mathbf{z}_k})\}_{k=1}^{\ell}$ be a collection of available interventional (or observational, if $\mathbf{Z}_k = \emptyset$) distributions from M. The counterfactual query $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$ is said to be neural identifiable (identifiable, for short) from the set of G-constrained NCMs $\Omega(G)$ and Z if and only if $P^{\widehat{M}_1}(\mathbf{y} \mid \mathbf{x}) = P^{\widehat{M}_2}(\mathbf{y} \mid \mathbf{x})$ for every pair of models $\widehat{M}_1, \widehat{M}_2 \in \Omega(G)$ s.t. they match M on all distributions in Z (i.e., $Z(M) = Z(\widehat{M}_1) = Z(\widehat{M}_2) > 0$).
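Operationally, Def. 4 suggests a search over $\Omega(G)$: the query is identifiable exactly when no two G-NCMs that match Z can disagree on it. The scaffold below is a conceptual sketch in the spirit of the paper's optimization procedure (Alg. 1), not the algorithm itself; the names, the penalty-style loss, and the hyperparameters are illustrative assumptions. Here, `data_loss` would penalize mismatch with the distributions in Z, and `query` would evaluate the target counterfactual on the NCM, e.g., via the Monte Carlo counterpart of Eq. (1).

```python
# Conceptual min/max identification test implied by Def. 4: fit two G-NCMs that
# both match the available distributions Z while one maximizes and the other
# minimizes the target query. A near-zero achievable gap indicates
# identifiability, in which case either model's value serves as the estimate.
import torch

def fit_extremum(ncm, query, data_loss, sign, steps=1000, lam=10.0, lr=1e-3):
    """sign = +1 to (approximately) maximize the query, -1 to minimize it."""
    opt = torch.optim.Adam(ncm.parameters(), lr=lr)
    for _ in range(steps):
        # Penalize mismatch with Z; push the query toward its extremum.
        loss = lam * data_loss(ncm) - sign * query(ncm)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return query(ncm).item()

def identification_gap(make_ncm, query, data_loss):
    q_max = fit_extremum(make_ncm(), query, data_loss, sign=+1)
    q_min = fit_extremum(make_ncm(), query, data_loss, sign=-1)
    return q_max - q_min   # ~0 => identifiable; clearly > 0 => underdetermined
```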