
experimental data (from $L^*_1$ and $L^*_2$)? This question embodies the challenge of cross-layer inferences, which entail solving two challenging causal problems in tandem: identification and estimation.
[Figure 1 here. Panel (a), Unobserved Nature/Truth: the true SCM $M^* = \langle \mathcal{F}^*, P(\mathbf{U}^*) \rangle$ and the PCH layers $L^*_1, L^*_2, L^*_3$ it induces. Panel (b), Learned/Hypothesized: the NCM $\widehat{M} = \langle \widehat{\mathcal{F}}, P(\widehat{\mathbf{U}}) \rangle$ with layers $L_1, L_2, L_3$, constrained by the causal diagram $G$ ($G$-constraint) and trained so that $L_1 = L^*_1$ and $L_2 = L^*_2$.]

Figure 1: The l.h.s. contains the true SCM $M^*$ that induces the PCH's three layers. The r.h.s. contains a neural model $\widehat{M}$ constrained by inductive bias $G$ (entailed by $M^*$) and matching $M^*$ on $L_1$ and $L_2$ through training.
In the more traditional literature of causal inference, there are different symbolic methods for solving these problems in various settings and under different assumptions. In the context of identification, there exists an arsenal of results that includes celebrated methods such as Pearl's do-calculus (Pearl, 1995), as well as various algorithmic methods for inferences about $L_2$- (Tian & Pearl, 2002; Shpitser & Pearl, 2006; Huang & Valtorta, 2006; Bareinboim & Pearl, 2012; Lee et al., 2019; Lee & Bareinboim, 2020; 2021) and $L_3$-distributions (Heckman, 1992; Pearl, 2001; Avin et al., 2005; Shpitser & Pearl, 2009; Shpitser & Sherman, 2018; Zhang & Bareinboim, 2018; Correa et al., 2021). On the estimation side, there are various methods, including the celebrated Propensity Score/IPW for the backdoor case (Rubin, 1978; Horvitz & Thompson, 1952; Kennedy, 2019; Kallus & Uehara, 2020) and some more relaxed settings (Fulcher et al., 2019; Jung et al., 2020; 2021), but the literature is somewhat scarcer and less developed. In fact, there is a lack of estimation methods for $L_3$-quantities in most settings.
On another thread in the literature, deep learning methods have achieved outstanding empirical
success in solving a wide range of tasks in fields such as computer vision (Krizhevsky et al., 2012),
speech recognition (Graves & Jaitly, 2014), and game playing (Mnih et al., 2013). One key feature
of deep learning is its ability to allow inferences to scale with the data to high-dimensional settings.
We study here the suitability of the neural approach to tackle the problems of causal identification
and estimation while trying to leverage the benefits of these new advances experienced in non-causal
settings.
The idea behind the approach pursued here is illustrated in Fig. 1. Specifically, we will search for a neural model $\widehat{M}$ (r.h.s.) that has the same generative capability as the true, unobserved SCM $M^*$ (l.h.s.); in other words, $\widehat{M}$ should be able to generate the same observed/inputted data, i.e., $L_1 = L^*_1$ and $L_2 = L^*_2$.² To tackle this task in practice, we use an inductive bias for the neural model in the form of a causal diagram (Pearl, 2000; Spirtes et al., 2000; Bareinboim & Pearl, 2016), which is a parsimonious description of the mechanisms ($\mathcal{F}^*$) and exogenous conditions ($P(\mathbf{U}^*)$) of the generating SCM.³ The question then becomes: under what conditions can a model trained using this combination of qualitative inductive bias and the available data be suitable to answer questions about hypothetical counterfactual worlds, as if we had access to the true $M^*$?
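As a toy illustration of this fit-then-query pipeline (not the paper's neural architecture; the binary model, variable names, and closed-form "training" below are our own simplification), consider a Markovian $X \rightarrow Y$ setting where a proxy model is fitted to observational data and then queried interventionally:

```python
import random

random.seed(0)

# Hypothetical true SCM M*: binary X -> Y, Markovian (independent exogenous noise).
#   X := U_x,        with P(U_x = 1) = 0.3
#   Y := X XOR U_y,  with P(U_y = 1) = 0.2
def sample_m_star():
    x = 1 if random.random() < 0.3 else 0
    u_y = 1 if random.random() < 0.2 else 0
    return x, x ^ u_y

data = [sample_m_star() for _ in range(200_000)]  # L1 (observational) data

# "Train" a proxy model M_hat sharing the causal diagram X -> Y by matching the
# observational distribution (closed-form frequency estimates stand in for
# gradient-based training of a neural model).
p_x1 = sum(x for x, _ in data) / len(data)
p_y1_given_x = {
    v: sum(y for x, y in data if x == v) / sum(1 for x, _ in data if x == v)
    for v in (0, 1)
}

# Answer an L2 query by direct evaluation on the proxy: P(Y=1 | do(X=1)).
# In this Markovian model the query is identifiable and equals P(Y=1 | X=1);
# under M* it evaluates to P(U_y = 0) = 0.8.
query = p_y1_given_x[1]
print(query)
```

The point of the sketch is only the workflow: the proxy never recovers $M^*$ itself, yet direct evaluation on the fitted model answers the identifiable query.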
There exists a growing literature that leverages modern neural methods to solve causal inference tasks.¹ Our approach based on proxy causal models will answer causal queries by direct evaluation through a parameterized neural model $\widehat{M}$ fitted on the data generated by $M^*$.⁴ For instance, some recent work solves the estimation of interventional ($L_2$) or counterfactual ($L_3$) distributions from observational ($L_1$) data in Markovian settings, implemented through architectures such as GANs, flows, GNNs, and VGAEs (Kocaoglu et al., 2018; Pawlowski et al., 2020; Zecevic et al., 2021; Sanchez-Martin et al., 2021). In some real-world settings, Markovianity is too stringent a condition (see discussion in App. D.4) and may be violated, which leads to the separation between layers 1 and 2 and, in turn, to issues of causal identification.⁵
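To make this identification issue concrete, here is a minimal sketch (illustrative binary models of our own construction, not examples from the paper) of two SCMs that agree on every observational quantity yet disagree on an interventional one, precisely because of unobserved confounding:

```python
# Two hypothetical SCMs over binary X, Y with exogenous U ~ Bern(0.5).
#   Model A (Markovian):   X := U,  Y := X
#   Model B (confounded):  X := U,  Y := U   (Y ignores its parent X)

def joint(f_y):
    """Observational P(x, y) by enumerating U."""
    dist = {}
    for u in (0, 1):
        x = u
        y = f_y(x, u)
        dist[(x, y)] = dist.get((x, y), 0) + 0.5
    return dist

def do_x(f_y, x_val):
    """Interventional P(Y=1 | do(X=x_val)) by enumerating U."""
    return sum(0.5 for u in (0, 1) if f_y(x_val, u) == 1)

f_a = lambda x, u: x  # Y := X
f_b = lambda x, u: u  # Y := U

# Both models induce the same L1 (observational) distribution...
print(joint(f_a) == joint(f_b))     # prints True
# ...but disagree on the L2 query P(Y=1 | do(X=0)): 0 vs 0.5.
print(do_x(f_a, 0), do_x(f_b, 0))
```

Since the data alone cannot distinguish model A from model B, no amount of training on $L_1$ suffices to pin down the $L_2$ query; this is the gap that the causal diagram, as an inductive bias, is meant to close.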
The proxy approach discussed above was pursued in Xia et al. (2021) to solve the identification and estimation of interventional distributions ($L_2$) from
¹One of our motivations is that these methods showed great promise at estimating effects from observational data under backdoor/ignorability conditions (Shalit et al., 2017; Louizos et al., 2017; Li & Fu, 2017; Johansson et al., 2016; Yao et al., 2018; Yoon et al., 2018; Kallus, 2020; Shi et al., 2019; Du et al., 2020; Guo et al., 2020).
²This represents an extreme case where all $L_1$- and $L_2$-distributions are provided as data. In practice, this may be unrealistic, and our method takes as input any arbitrary subset of distributions from $L_1$ and $L_2$.
³When imposed on neural models, causal diagrams enforce equality constraints connecting layer 1 and layer 2 quantities, defined formally through the causal Bayesian network (CBN) data structure (Bareinboim et al., 2022, Def. 16).
⁴In general, $\widehat{M}$ does not need to, and will not, be equal to the true SCM $M^*$.
⁵Layer 3 differs from lower layers even in Markovian models; see Bareinboim et al. (2022, Ex. 7).