
to represent (conditional) probabilities in other machine learning settings, such as representation learning [Zaheer et al., 2017] and
approximate Bayesian inference [Xu et al., 2022]. We establish the consistency of the method based on Rademacher complexity; this
result is of independent interest and may be relevant in establishing consistency for broader categories of neural mean
embedding approaches, including those of Xu et al. [2021a,b]. We empirically show that the proposed method performs better than other
state-of-the-art neural causal inference methods, including those using kernel feature dictionaries.
This paper is structured as follows. In Section 2, we introduce the causal parameters of interest, and in Section 3 we give a detailed
description of the proposed method. The theoretical analysis is presented in Section 4, followed by a review of related
work in Section 5. We demonstrate the empirical performance of the proposed method in Section 6, covering two settings: a classical
back-door adjustment problem with a binary treatment, and a challenging back-door and front-door setting where the treatment
consists of high-dimensional image data.
2 Problem Setting
In this section, we introduce the causal parameters of interest and the methods used to estimate them, namely back-door adjustment
and front-door adjustment. Throughout the paper, we denote a random variable by a capital letter (e.g. A), a realization of this
random variable by a lowercase letter (e.g. a), and the set in which a random variable takes values by a calligraphic letter (e.g. A). We assume
the data are generated from a distribution P.
Causal Parameters We introduce the target causal parameters using the potential outcome framework [Rubin, 2005]. Let the
treatment and the observed outcome be A ∈ A and Y ∈ Y ⊆ [−R, R]. We denote the potential outcome given treatment a as
Y(a) ∈ Y. Here, we assume no interference, which means that we observe Y = Y(a) when A = a. We denote the hidden confounder
as U ∈ U and assume conditional exchangeability, ∀a ∈ A, Y(a) ⊥⊥ A | U, which means that, given U, the treatment assignment is
independent of the potential outcomes. A typical causal graph is shown in Figure 1a. We may additionally consider an observable
confounder O ∈ O, which is discussed in Appendix B.
A first goal of causal inference is to estimate the Average Treatment Effect (ATE)1 θATE(a) = E[Y(a)], which is the average
potential outcome under A = a. We also consider the Average Treatment Effect on the Treated (ATT), θATT(a; a0) = E[Y(a) | A = a0],
which is the expected potential outcome under A = a for those who received the treatment A = a0. Given the no-interference and
conditional-exchangeability assumptions, these causal parameters can be written in the following form.
Proposition 1 (Rosenbaum and Rubin, 1983, Robins, 1986). Given an unobserved confounder U satisfying no interference and
conditional exchangeability, we have
θATE(a) = EU[E[Y | A = a, U]] , θATT(a; a0) = EU[E[Y | A = a, U] | A = a0].
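To see why these identities hold, consider the ATE: E[Y(a)] = EU[E[Y(a) | U]] by the tower property, which equals EU[E[Y(a) | A = a, U]] by conditional exchangeability, and finally EU[E[Y | A = a, U]] by no interference. The ATT case is analogous, with the outer expectation over U taken conditionally on A = a0.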
If we observe an additional confounder O, we may also consider the Conditional Average Treatment Effect (CATE), the average
potential outcome for the sub-population with O = o, which is discussed in Appendix B. Note that since the confounder U is not
observed, we cannot recover these causal parameters from (A, Y) alone.
Back-door Adjustment In back-door adjustment, we assume access to a back-door variable X ∈ X, which blocks all causal
paths from the unobserved confounder U to the treatment A. See Figure 1b for a typical causal graph. Given the back-door variable, the causal
parameters can be written in terms of the observable variables (A, Y, X) alone, as follows.
Proposition 2 (Pearl, 1995, Theorem 1). Given the back-door variable X, we have
θATE(a) = EX[g(a, X)] , θATT(a; a0) = EX[g(a, X) | A = a0],
where g(a, x) = E[Y | A = a, X = x].
By comparing Proposition 2 to Proposition 1, we can see that the causal parameters can be learned by treating the back-door variable
X as the only “confounder”, despite the presence of the additional hidden confounder U. Hence, we may apply any method based
on the “no unobservable confounder” assumption to back-door adjustment.
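As an illustration of Proposition 2, the following is a minimal plug-in sketch of back-door adjustment; it is not the neural mean embedding method proposed in this paper. The regression g(a, x) is fitted with a generic off-the-shelf regressor (a random forest from scikit-learn, chosen purely for illustration), the outer expectations are replaced by empirical averages over the observed back-door variable, and the simulated data-generating process is an assumption made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative data generated under the back-door graph of Figure 1b:
# the back-door variable X drives both the treatment A and the outcome Y.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 1))                          # back-door variable
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))  # binary treatment, confounded through X
Y = 2.0 * A + X[:, 0] + 0.1 * rng.normal(size=n)     # outcome; true effect of A is 2.0

# Step 1: estimate g(a, x) = E[Y | A = a, X = x] with an off-the-shelf regressor.
g_hat = RandomForestRegressor(n_estimators=200, random_state=0)
g_hat.fit(np.column_stack([A, X]), Y)

def theta_ate(a):
    """theta_ATE(a) = E_X[g(a, X)], with the expectation replaced by the sample mean over all X."""
    return g_hat.predict(np.column_stack([np.full(n, a), X])).mean()

def theta_att(a, a0):
    """theta_ATT(a; a0) = E_X[g(a, X) | A = a0], averaging only over X of units with A = a0."""
    X_a0 = X[A == a0]
    return g_hat.predict(np.column_stack([np.full(len(X_a0), a), X_a0])).mean()

print(theta_ate(1) - theta_ate(0))        # should be close to the true effect 2.0
print(theta_att(1, 1) - theta_att(0, 1))
```

Any consistent estimate of g combined with an empirical average over X yields a plug-in estimator of this two-stage form.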
Front-door Adjustment Another adjustment for causal estimation is front-door adjustment, which uses the causal mechanism mediating
the effect of the treatment on the outcome to determine the causal effect. Assume we observe a front-door variable M ∈ M, which blocks all causal paths from the treatment A to the
outcome Y, as in Figure 1c. Then, we can recover the causal parameters as follows.
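Before the formal statement, the following minimal sketch may help fix intuition. It relies on the standard front-door identification formula θATE(a) = EM|A=a[EA′[E[Y | A′, M]]] of Pearl [1995] in the simplest fully discrete setting, estimating every term by empirical frequencies and cell means; the simulated data and variable names are assumptions made for this example, and this plug-in estimator is not the method developed in this paper.

```python
import numpy as np

# Illustrative data generated under the front-door graph of Figure 1c:
# a hidden confounder U affects both A and Y, while the effect of A on Y
# is fully mediated by the front-door variable M (M independent of U given A).
rng = np.random.default_rng(0)
n = 20_000
U = rng.binomial(1, 0.5, size=n)               # hidden confounder
A = rng.binomial(1, 0.2 + 0.6 * U)             # treatment, confounded by U
M = rng.binomial(1, 0.1 + 0.7 * A)             # front-door variable (mediator)
Y = 1.5 * M + U + 0.1 * rng.normal(size=n)     # outcome; no direct A -> Y edge

# Plug-in estimates of every term in the front-door formula.
p_a = np.array([np.mean(A == a) for a in (0, 1)])                                    # P(A = a')
p_m_given_a = np.array([[np.mean(M[A == a] == m) for m in (0, 1)] for a in (0, 1)])  # P(M = m | A = a)
mu = np.array([[Y[(A == a) & (M == m)].mean() for m in (0, 1)] for a in (0, 1)])     # E[Y | A = a', M = m]

def theta_ate(a):
    # theta_ATE(a) = sum_m P(m | a) * sum_{a'} P(a') * E[Y | a', m]
    return sum(p_m_given_a[a, m] * sum(p_a[a2] * mu[a2, m] for a2 in (0, 1)) for m in (0, 1))

print(theta_ate(1) - theta_ate(0))   # should be close to the true effect 1.5 * 0.7 = 1.05
```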
1 In the binary treatment case A = {0, 1}, the ATE is typically defined as the expectation of the difference of potential outcomes, E[Y(1) − Y(0)]. However, we
define the ATE as the expectation of the potential outcome, E[Y(a)], which is the primary target of interest in the continuous treatment case, also known as the dose-response curve.
The same applies to the ATT.