A Neural Mean Embedding Approach for Back-door and Front-door
Adjustment
Liyuan Xu
Gatsby Unit
liyuan.jo.19@ucl.ac.uk
Arthur Gretton
Gatsby Unit
arthur.gretton@gmail.com
October 14, 2022 (arXiv:2210.06610v1 [cs.LG])
Abstract
We consider the estimation of average and counterfactual treatment effects, under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without having access to a hidden confounder. This
objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the “first stage”
regression), and then taking the (conditional) expectation of this function as a “second stage” procedure. We propose to compute these
conditional expectations directly using a regression function to the learned input features of the first stage, thus avoiding the need
for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural
networks learned adaptively from data, with the sole requirement that the final layer of the first stage should be linear. The proposed
method is shown to converge to the true causal parameter, and outperforms the recent state-of-the-art methods on challenging causal
benchmarks, including settings involving high-dimensional image data.
1 Introduction
The goal of causal inference from observational data is to predict the effect of our actions, or treatments, on the outcome without
performing interventions. Questions of interest can include "what is the effect of smoking on life expectancy?" or counterfactual questions, such as "given the observed health outcome for a smoker, how long would they have lived had they quit smoking?" Answering these questions becomes challenging when a confounder exists, which affects both the treatment and the outcome, and causes bias in the
estimation. Causal estimation requires us to correct for this confounding bias.
A popular assumption in causal inference is the no unmeasured confounder requirement, which means that we observe all the
confounders that cause the bias in the estimation. Although a number of causal inference methods have been proposed under this assumption
[Hill, 2011, Shalit et al., 2017, Shi et al., 2019, Schwab et al., 2020], it rarely holds in practice. In the smoking example, the
confounder can be one’s genetic characteristics or social status, which are difficult to measure for both technical and ethical reasons.
To address this issue, Pearl [1995] proposed back-door adjustment and front-door adjustment, which recover the causal effect
in the presence of hidden confounders using a back-door variable or front-door variable, respectively. The back-door variable is a
covariate that blocks all causal effects directed from the confounder to the treatment. In health care, patients may have underlying
predispositions to illness due to genetic or social factors (hidden), from which measurable symptoms will arise (back-door variable); these symptoms in turn lead to a choice of treatment. By contrast, a front-door variable blocks the path from the treatment to the outcome.
In perhaps the best-known example, the amount of tar in a smoker’s lungs serves as a front-door variable, since it is increased by
smoking, shortens life expectancy, and has no direct link to underlying (hidden) sociological traits. Pearl [1995] showed that causal
quantities can be obtained by taking the (conditional) expectation of the conditional average outcome.
While Pearl [1995] only considered the discrete case, this framework was extended to the continuous case by Singh et al. [2020],
using two-stage regression (a review of this and other recent approaches for the continuous case is given in Section 5). In the first
stage, the approach regresses from the relevant covariates to the outcome of interest, expressing the function as a linear combination
of non-linear feature maps. Then, in the second stage, the causal parameters are estimated by learning the (conditional) expectation of
the non-linear feature map used in the first stage. Unlike competing methods [Colangelo and Lee, 2020, Kennedy et al., 2017], two-
stage regression avoids fitting probability densities, which is challenging in high-dimensional settings [Wasserman, 2006, Section
6.5]. Singh et al. [2020]’s method is shown to converge to the true causal parameters and exhibits better empirical performance than
competing methods.
One limitation of the methods in Singh et al. [2020] is that they use fixed pre-specified feature maps from reproducing kernel
Hilbert spaces, which have a limited expressive capacity when data are complex (images, text, audio). To overcome this, we propose
to employ a neural mean embedding approach to learning task-specific adaptive feature dictionaries. At a high level, we first employ
a neural network with a linear final layer in the first stage. For the second stage, we learn the (conditional) mean of the stage 1 features
in the penultimate layer, again with a neural net. The approach develops the technique of Xu et al. [2021a,b] and enables the model
to capture complex causal relationships for high-dimensional covariates and treatments. Neural network feature means are also used
to represent (conditional) probabilities in other machine learning settings, such as representation learning [Zaheer et al., 2017] and
approximate Bayesian inference [Xu et al., 2022]. We derive the consistency of the method based on the Rademacher complexity,
a result of which is of independent interest and may be relevant in establishing consistency for broader categories of neural mean
embedding approaches, including Xu et al. [2021a,b]. We empirically show that the proposed method performs better than other
state-of-the-art neural causal inference methods, including those using kernel feature dictionaries.
This paper is structured as follows. In Section 2, we introduce the causal parameters we are interested in; we then give a detailed
description of the proposed method in Section 3. The theoretical analysis is presented in Section 4, followed by a review of related
work in Section 5. We demonstrate the empirical performance of the proposed method in Section 6, covering two settings: a classical
back-door adjustment problem with a binary treatment, and a challenging back-door and front-door setting where the treatment
consists of high-dimensional image data.
2 Problem Setting
In this section, we introduce the target causal parameters and the methods used to estimate them, namely back-door adjustment and front-door adjustment. Throughout the paper, we denote a random variable by a capital letter (e.g. $A$), a realization of this random variable in lowercase (e.g. $a$), and the set in which a random variable takes values by a calligraphic letter (e.g. $\mathcal{A}$). We assume data is generated from a distribution $P$.
Causal Parameters We introduce the target causal parameters using the potential outcome framework [Rubin, 2005]. Let the treatment and the observed outcome be $A \in \mathcal{A}$ and $Y \in \mathcal{Y} \subseteq [-R, R]$. We denote the potential outcome given treatment $a$ as $Y(a) \in \mathcal{Y}$. Here, we assume no interference, which means that we observe $Y = Y(a)$ when $A = a$. We denote the hidden confounder as $U \in \mathcal{U}$ and assume conditional exchangeability, $\forall a \in \mathcal{A},\ Y(a) \perp A \mid U$, which means that, conditional on $U$, the potential outcomes are independent of the treatment assignment. A typical causal graph is shown in Figure 1a. We may additionally consider an observable confounder $O \in \mathcal{O}$, which is discussed in Appendix B.
A first goal of causal inference is to estimate the Average Treatment Effect (ATE)¹ $\theta_{\mathrm{ATE}}(a) = \mathbb{E}[Y(a)]$, which is the average potential outcome of $A=a$. We also consider the Average Treatment Effect on the Treated (ATT) $\theta_{\mathrm{ATT}}(a; a') = \mathbb{E}[Y(a) \mid A=a']$, which is the expected potential outcome of $A=a$ for those who received the treatment $A=a'$. Given the no interference and conditional exchangeability assumptions, these causal parameters can be written in the following form.
Proposition 1 (Rosenbaum and Rubin, 1983, Robins, 1986). Given unobserved confounder $U$, which satisfies no interference and conditional exchangeability, we have
$$\theta_{\mathrm{ATE}}(a) = \mathbb{E}_U[\mathbb{E}[Y \mid A=a, U]], \qquad \theta_{\mathrm{ATT}}(a; a') = \mathbb{E}_U[\mathbb{E}[Y \mid A=a, U] \mid A=a'].$$
If we observe an additional confounder $O$, we may also consider the conditional average treatment effect (CATE): the average potential outcome for the sub-population with $O=o$, which is discussed in Appendix B. Note that since the confounder $U$ is not observed, we cannot recover these causal parameters from $(A, Y)$ alone.
Back-door Adjustment In back-door adjustment, we assume access to the back-door variable $X \in \mathcal{X}$, which blocks all causal paths from the unobserved confounder $U$ to the treatment $A$. See Figure 1b for a typical causal graph. Given the back-door variable, the causal parameters can be written in terms of the observable variables $(A, Y, X)$ as follows.
Proposition 2 (Pearl, 1995, Theorem 1). Given the back-door variable $X$, we have
$$\theta_{\mathrm{ATE}}(a) = \mathbb{E}_X[g(a, X)], \qquad \theta_{\mathrm{ATT}}(a; a') = \mathbb{E}_X[g(a, X) \mid A=a'],$$
where $g(a, x) = \mathbb{E}[Y \mid A=a, X=x]$.
By comparing Proposition 2 to Proposition 1, we can see that the causal parameters can be learned by treating the back-door variable $X$ as the only "confounder", despite the presence of the additional hidden confounder $U$. Hence, we may apply any method based on the "no unmeasured confounder" assumption to back-door adjustment.
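As a concrete check of Proposition 2, the following sketch simulates a toy linear Gaussian SEM matching Figure 1b (all structural coefficients below are illustrative choices of ours, not from the paper) and compares the back-door-adjusted effect with a naive regression of $Y$ on $A$:

```python
import numpy as np

# Toy linear SEM for the back-door graph: U -> X -> A, U -> Y, A -> Y,
# with true causal effect 2.0 (all coefficients are illustrative).
rng = np.random.default_rng(0)
n = 100_000
U = rng.normal(size=n)                       # hidden confounder
X = U + rng.normal(size=n)                   # back-door variable
A = X + rng.normal(size=n)                   # treatment
Y = 2.0 * A + U + rng.normal(size=n)         # outcome

# First stage: g(a, x) = E[Y | A=a, X=x], here a linear regression on (A, X)
design = np.column_stack([np.ones(n), A, X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

def theta_ate(a):
    # Proposition 2: theta_ATE(a) = E_X[g(a, X)], estimated by averaging over x_i
    return np.mean(coef[0] + coef[1] * a + coef[2] * X)

effect_adjusted = theta_ate(1.0) - theta_ate(0.0)   # close to the true 2.0
naive_slope = np.cov(A, Y)[0, 1] / np.var(A)        # confounded: about 7/3
```

Adjusting for $X$ recovers the true effect even though $U$ is never observed, while the naive slope absorbs the confounding path through $U$.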
Front-door Adjustment Another adjustment for causal estimation is front-door adjustment, which uses the causal mechanism to determine the causal effect. Assume we observe the front-door variable $M \in \mathcal{M}$, which blocks all causal paths from the treatment $A$ to the outcome $Y$, as in Figure 1c. Then, we can recover the causal parameters as follows.
¹In the binary treatment case $\mathcal{A} = \{0, 1\}$, the ATE is typically defined as the expectation of the difference of potential outcomes $\mathbb{E}[Y(1) - Y(0)]$. However, we define the ATE as the expectation of the potential outcome $\mathbb{E}[Y(a)]$, which is the primary target of interest in the continuous treatment case, also known as the dose-response curve. The same applies to the ATT as well.
Figure 1: Causal graphs we consider in this paper. (a) General causal graph over treatment $A$, outcome $Y$, and hidden confounder $U$. (b) Back-door adjustment, with back-door variable $X$. (c) Front-door adjustment, with front-door variable $M$. The dotted circle denotes the unobservable variable.
Proposition 3 (Pearl, 1995, Theorem 2). Given the front-door variable $M$, we have
$$\theta_{\mathrm{ATE}}(a) = \mathbb{E}_{A'}[\mathbb{E}_M[g(A', M) \mid A=a]], \qquad \theta_{\mathrm{ATT}}(a; a') = \mathbb{E}_M[g(a', M) \mid A=a],$$
where $g(a, m) = \mathbb{E}[Y \mid A=a, M=m]$ and $A' \in \mathcal{A}$ is a random variable that follows the same distribution as the treatment $A$.
Unlike the case of back-door adjustment, we cannot naively apply methods based on the "no unmeasured confounder" assumption here, since Proposition 3 takes a different form from Proposition 1.
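Proposition 3 can be verified by exact enumeration on a small all-binary structural model (the probability tables below are hypothetical): the front-door formula, computed purely from observational quantities, reproduces the interventional mean obtained directly from the model.

```python
import numpy as np

# Binary SCM for the front-door graph: U -> A, A -> M, (M, U) -> Y.
# U is hidden; M is the front-door variable. All tables are hypothetical.
pU = np.array([0.5, 0.5])                   # P(U=u)
pA1 = np.array([0.2, 0.8])                  # P(A=1 | U=u)
pM1 = np.array([0.1, 0.9])                  # P(M=1 | A=a)
pY1 = np.array([[0.1, 0.4], [0.6, 0.9]])    # E[Y | M=m, U=u], indexed [m, u]

def bern(p1, v):                            # P(V=v) for Bernoulli with P(V=1)=p1
    return p1 if v == 1 else 1.0 - p1

def do_truth(a):                            # ground truth E[Y | do(A=a)] from the SCM
    return sum(pU[u] * bern(pM1[a], m) * pY1[m, u]
               for u in (0, 1) for m in (0, 1))

# Observational quantities only, as available to the analyst
pA = np.array([sum(pU[u] * bern(pA1[u], a) for u in (0, 1)) for a in (0, 1)])

def g(a, m):                                # E[Y | A=a, M=m]; here M is indep. of U given A
    post = np.array([pU[u] * bern(pA1[u], a) for u in (0, 1)])
    post /= post.sum()                      # P(U=u | A=a)
    return sum(post[u] * pY1[m, u] for u in (0, 1))

def front_door(a):                          # Proposition 3: E_{A'}[ E_M[ g(A', M) | A=a ] ]
    return sum(bern(pM1[a], m) * pA[ap] * g(ap, m)
               for m in (0, 1) for ap in (0, 1))
```

The two computations agree exactly here, because averaging $g(a', m)$ over the marginal of $A'$ integrates the hidden confounder back out.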
3 Algorithms
In this section, we present our proposed methods. We first present the case with back-door adjustment and then move to front-door
adjustment.
Back-door adjustment The algorithm consists of two stages. In the first stage, we learn the conditional expectation $g(a, x) = \mathbb{E}[Y \mid A=a, X=x]$ with a specific form. We then compute the causal parameter by estimating the expectation of the input features to $g$.
The conditional expectation $g(a, x)$ is learned by regressing $(A, X)$ to $Y$. Here, we consider a specific model $g(a, x) = w^\top(\phi_A(a) \otimes \phi_X(x))$, where $\phi_A : \mathcal{A} \to \mathbb{R}^{d_1}$ and $\phi_X : \mathcal{X} \to \mathbb{R}^{d_2}$ are feature maps represented by neural networks, $w \in \mathbb{R}^{d_1 d_2}$ is a trainable weight vector, and $\otimes$ denotes the tensor product $a \otimes b = \mathrm{vec}(ab^\top)$. Given data $\{(a_i, y_i, x_i)\}_{i=1}^n \sim P$ of size $n$, the feature maps $\phi_A, \phi_X$ and the weight $w$ can be trained by minimizing the following empirical loss:
$$\hat{L}_1(w, \phi_A, \phi_X) = \frac{1}{n}\sum_{i=1}^n \left(y_i - w^\top(\phi_A(a_i) \otimes \phi_X(x_i))\right)^2. \quad (1)$$
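To make the tensor-product model concrete, here is a minimal sketch with fixed toy feature dictionaries standing in for the learned neural features (the feature maps and the data-generating process are our own illustrative assumptions; with fixed features, the minimiser of (1) is ordinary least squares):

```python
import numpy as np

# Minimal sketch of the stage-1 model g(a, x) = w^T (phi_A(a) (x) phi_X(x)).
# The toy dictionaries below stand in for the learned neural feature maps,
# whose last layer must be linear.
def phi_A(a):                       # hypothetical treatment features, d1 = 3
    return np.array([1.0, a, a ** 2])

def phi_X(x):                       # hypothetical back-door features, d2 = 2
    return np.array([1.0, x])

def g(w, a, x):
    # a (x) b = vec(a b^T), implemented with np.kron
    return w @ np.kron(phi_A(a), phi_X(x))

def stage1_loss(w, data):
    # Empirical loss (1): mean squared error over the sample
    return np.mean([(y - g(w, a, x)) ** 2 for a, y, x in data])

# With fixed features, the minimiser of (1) is ordinary least squares
rng = np.random.default_rng(0)
data = [(a, a + x + rng.normal(0, 0.1), x)
        for a, x in rng.normal(size=(200, 2))]
Phi = np.array([np.kron(phi_A(a), phi_X(x)) for a, _, x in data])
Y = np.array([y for _, y, _ in data])
w_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
```

Since the truth $y \approx a + x$ lies in the span of the tensor-product dictionary, the fitted $\hat{g}$ is close to $a + x$; in the paper, gradient descent on (1) trains the features themselves as well.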
We may add any regularization term to this loss, such as weight decay $\lambda\|w\|^2$. Let the minimizers of the loss $\hat{L}_1$ be $\hat{w}, \hat{\phi}_A, \hat{\phi}_X = \arg\min \hat{L}_1$ and the learned regression function be $\hat{g}(a, x) = \hat{w}^\top(\hat{\phi}_A(a) \otimes \hat{\phi}_X(x))$. Then, by substituting $\hat{g}$ for $g$ in Proposition 2, we have
$$\theta_{\mathrm{ATE}}(a) \simeq \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \mathbb{E}\big[\hat{\phi}_X(X)\big]\right), \qquad \theta_{\mathrm{ATT}}(a; a') \simeq \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \mathbb{E}\big[\hat{\phi}_X(X) \mid A=a'\big]\right).$$
This is the advantage of assuming the specific form $g(a, x) = w^\top(\phi_A(a) \otimes \phi_X(x))$: by linearity, we can recover the causal parameters by estimating $\mathbb{E}[\hat{\phi}_X(X)]$ and $\mathbb{E}[\hat{\phi}_X(X) \mid A=a']$. Such (conditional) expectations of features are called (conditional) mean embeddings, and thus we name our method "neural (conditional) mean embedding".
We can estimate the marginal expectation $\mathbb{E}[\hat{\phi}_X(X)]$ as a simple empirical average, $\mathbb{E}[\hat{\phi}_X(X)] \simeq \frac{1}{n}\sum_{i=1}^n \hat{\phi}_X(x_i)$.
The conditional mean embedding $\mathbb{E}[\hat{\phi}_X(X) \mid A=a']$ requires more care, however: it can be learned by a technique proposed in Xu et al. [2021a], in which we train another regression function from the treatment $A$ to the back-door feature $\hat{\phi}_X(X)$. Specifically, we estimate $\mathbb{E}[\hat{\phi}_X(X) \mid A=a']$ by $\hat{f}_{\phi_X}(a')$, where the regression function $\hat{f}_{\phi_X} : \mathcal{A} \to \mathbb{R}^{d_2}$ is given by
$$\hat{f}_{\phi_X} = \arg\min_{f : \mathcal{A} \to \mathbb{R}^{d_2}} \hat{L}_2(f; \phi_X), \qquad \hat{L}_2(f; \phi_X) = \frac{1}{n}\sum_{i=1}^n \big\|\phi_X(x_i) - f(a_i)\big\|^2. \quad (2)$$
Here, $\|\cdot\|$ denotes the Euclidean norm. The loss $\hat{L}_2$ may include an additional regularization term, such as weight decay on the parameters of $f$. We then have
$$\hat{\theta}_{\mathrm{ATE}}(a) = \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \frac{1}{n}\sum_{i=1}^n \hat{\phi}_X(x_i)\right), \qquad \hat{\theta}_{\mathrm{ATT}}(a; a') = \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \hat{f}_{\hat{\phi}_X}(a')\right)$$
as the final estimators for back-door adjustment. The estimator for the ATE, $\hat{\theta}_{\mathrm{ATE}}$, reduces to the average of the predictions, $\hat{\theta}_{\mathrm{ATE}}(a) = \frac{1}{n}\sum_{i=1}^n \hat{g}(a, x_i)$. This coincides with other neural network causal methods [Shalit et al., 2017, Chernozhukov et al., 2022a], which do not assume $g(a, x) = w^\top(\phi_A(a) \otimes \phi_X(x))$. As we have seen, however, this tensor product formulation is essential for estimating the ATT by back-door adjustment. It will also be necessary for front-door adjustment, as we will see next.
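Putting both stages together, the following sketch runs the full back-door procedure on a toy problem, with fixed linear feature maps standing in for the trained networks (the data-generating process and feature maps are illustrative assumptions of ours):

```python
import numpy as np

# Toy two-stage back-door pipeline with fixed features (illustrative only).
rng = np.random.default_rng(1)
n = 5_000
A = rng.normal(size=n)
X = 0.5 * A + rng.normal(size=n)
Y = A + X + 0.1 * rng.normal(size=n)

def phi_A(a):                    # stand-in for the learned treatment features
    return np.array([1.0, a])

def phi_X(x):                    # stand-in for the learned back-door features
    return np.array([1.0, x])

# Stage 1 (loss (1)): least squares over tensor-product features
Phi1 = np.array([np.kron(phi_A(a), phi_X(x)) for a, x in zip(A, X)])
w_hat, *_ = np.linalg.lstsq(Phi1, Y, rcond=None)

# Stage 2 (loss (2)): vector-valued least squares for f(a) ~ E[phi_X(X) | A=a]
Psi = np.array([phi_A(a) for a in A])
Targets = np.array([phi_X(x) for x in X])
B, *_ = np.linalg.lstsq(Psi, Targets, rcond=None)

def f_hat(a):
    return phi_A(a) @ B

# Final estimators
def theta_ate(a):
    return w_hat @ np.kron(phi_A(a), Targets.mean(axis=0))

def theta_att(a, a0):
    return w_hat @ np.kron(phi_A(a), f_hat(a0))

def g_hat(a, x):
    return w_hat @ np.kron(phi_A(a), phi_X(x))

# By linearity, theta_ate(a) equals the average of the predictions g_hat(a, x_i)
check = np.mean([g_hat(1.0, x) for x in X])
```

Here $g(a, x) = a + x$, so the estimates come out near $\hat{\theta}_{\mathrm{ATE}}(1) \approx 1$ (since $\mathbb{E}[X] = 0$) and $\hat{\theta}_{\mathrm{ATT}}(1; 2) \approx 1 + \mathbb{E}[X \mid A=2] = 2$, and the ATE estimator matches the average of predictions up to floating-point error.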
Front-door adjustment We can obtain the estimator for front-door adjustment by following almost the same procedure as for back-door adjustment. Given data $\{(a_i, y_i, m_i)\}_{i=1}^n$, we again fit the regression model $\hat{g}(a, m) = \hat{w}^\top(\hat{\phi}_A(a) \otimes \hat{\phi}_M(m))$ by solving
$$\hat{w}, \hat{\phi}_A, \hat{\phi}_M = \arg\min \frac{1}{n}\sum_{i=1}^n \left(y_i - w^\top(\phi_A(a_i) \otimes \phi_M(m_i))\right)^2,$$
where $\phi_M : \mathcal{M} \to \mathbb{R}^{d_2}$ is a feature map represented as a neural network. From Proposition 3, we have
$$\theta_{\mathrm{ATE}}(a) \simeq \hat{w}^\top\left(\mathbb{E}\big[\hat{\phi}_A(A)\big] \otimes \mathbb{E}\big[\hat{\phi}_M(M) \mid A=a\big]\right), \qquad \theta_{\mathrm{ATT}}(a; a') \simeq \hat{w}^\top\left(\hat{\phi}_A(a') \otimes \mathbb{E}\big[\hat{\phi}_M(M) \mid A=a\big]\right).$$
Again, we estimate the feature embeddings by an empirical average for $\mathbb{E}[\hat{\phi}_A(A)]$, or by solving another regression problem for $\mathbb{E}[\hat{\phi}_M(M) \mid A=a]$. The final estimators for front-door adjustment are given as
$$\hat{\theta}_{\mathrm{ATE}}(a) = \hat{w}^\top\left(\frac{1}{n}\sum_{i=1}^n \hat{\phi}_A(a_i) \otimes \hat{f}_{\hat{\phi}_M}(a)\right), \qquad \hat{\theta}_{\mathrm{ATT}}(a; a') = \hat{w}^\top\left(\hat{\phi}_A(a') \otimes \hat{f}_{\hat{\phi}_M}(a)\right),$$
where $\hat{f}_{\hat{\phi}_M}$ is given by minimizing the loss $\hat{L}_2$ (with an additional regularization term) defined as
$$\hat{L}_2 = \frac{1}{n}\sum_{i=1}^n \big\|\phi_M(m_i) - f(a_i)\big\|^2.$$
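The front-door estimators assemble the fitted pieces in exactly this way. The sketch below uses arbitrary arrays as stand-ins for trained network outputs, and checks the linearity property that justifies placing the averaged features inside the tensor product:

```python
import numpy as np

# Assembling the front-door estimators from already-fitted pieces; every
# array here is an arbitrary stand-in for a trained network output:
#   theta_ATE(a)     = w^T ( (1/n) sum_i phi_A(a_i) (x) f_hat(a) )
#   theta_ATT(a; a') = w^T (        phi_A(a')       (x) f_hat(a) )
rng = np.random.default_rng(2)
d1, d2, n = 3, 2, 50
w_hat = rng.normal(size=d1 * d2)
phiA_sample = rng.normal(size=(n, d1))   # phi_A(a_i) over the sample
f_hat_a = rng.normal(size=d2)            # stage-2 prediction f_hat(a)

theta_ate = w_hat @ np.kron(phiA_sample.mean(axis=0), f_hat_a)

def theta_att(phiA_a0):                  # ATT with phi_A evaluated at a'
    return w_hat @ np.kron(phiA_a0, f_hat_a)

# Linearity check: averaging features inside the tensor product equals
# averaging the per-sample terms outside
avg_outside = np.mean([theta_att(p) for p in phiA_sample])
```

Because the tensor product is linear in each argument, the two computations coincide up to floating-point error; this is the same linearity that the back-door ATE estimator relies on.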
4 Theoretical Analysis
In this section, we prove the consistency of the proposed method. We focus on the back-door adjustment case, since the consistency
of front-door adjustment can be derived identically. The proposed method consists of two successive regression problems. In the first
stage, we learn the conditional expectation g, and then in the second stage, we estimate the feature embeddings. First, we show each
stage’s consistency, then present the overall convergence rate to the causal parameter.
Consistency for the first stage: In this section, we consider the hypothesis space of $g$ as
$$\mathcal{H}_g = \left\{ w^\top(\phi_A(a) \otimes \phi_Z(z)) \;\middle|\; w \in \mathbb{R}^{d_1 d_2},\ \phi_A(a) \in \mathbb{R}^{d_1},\ \phi_Z(z) \in \mathbb{R}^{d_2},\ \|w\|_1 \le R,\ \max_{a \in \mathcal{A}} \|\phi_A(a)\|_\infty \le 1,\ \max_{z \in \mathcal{Z}} \|\phi_Z(z)\|_\infty \le 1 \right\}.$$
Here, we denote the $\ell_1$-norm and infinity norm of a vector $b \in \mathbb{R}^d$ as $\|b\|_1 = \sum_{i=1}^d |b_i|$ and $\|b\|_\infty = \max_{i \in [d]} |b_i|$. Note that from the inequality $\|\phi_A(a) \otimes \phi_Z(z)\|_\infty \le \|\phi_A(a)\|_\infty \|\phi_Z(z)\|_\infty$ and Hölder's inequality, we can show that $h(a, z) \in [-R, R]$ for all $h \in \mathcal{H}_g$. Given this hypothesis space, the following lemma bounds the deviation of the estimated conditional expectation $\hat{g}$ from the true one.
Lemma 1. Given data $S = \{(a_i, y_i, x_i)\}_{i=1}^n$, let the minimizer of the loss $\hat{L}_1$ be $\hat{g} = \arg\min \hat{L}_1$. If the true conditional expectation $g$ is in the hypothesis space, $g \in \mathcal{H}_g$, then with probability at least $1 - 2\delta$, we have
$$\|g - \hat{g}\|_{P(A,X)} \le \sqrt{16 R\, \hat{\mathcal{R}}_S(\mathcal{H}_g) + 8R^2\sqrt{\frac{\log(2/\delta)}{2n}}},$$
where $\hat{\mathcal{R}}_S(\mathcal{H}_g)$ is the empirical Rademacher complexity of $\mathcal{H}_g$ given data $S$.
The proof is given in Appendix A.2. Here, we present the empirical Rademacher complexity when we apply a feed-forward neural network for the features.
Lemma 2. The empirical Rademacher complexity $\hat{\mathcal{R}}_S(\mathcal{H}_g)$ scales as
$$\hat{\mathcal{R}}_S(\mathcal{H}_g) \le O\!\left(C^L/\sqrt{n}\right)$$
for some constant $C$ if we use a specific $L$-layer neural net for the features $\phi_A, \phi_X$.
See Lemma 6 in Appendix A.2 for the detailed expression of the upper bound. Note that this may be of independent interest, since a similar hypothesis class is considered in Xu et al. [2021a,b], and no explicit upper bound is provided on the empirical Rademacher complexity in that work.
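To build intuition for the quantity in Lemma 1, the sketch below Monte-Carlo estimates the empirical Rademacher complexity of a much simpler class, linear functionals with an $\ell_1$-ball constraint, for which the supremum has the closed form $R\,\|\frac{1}{n}\sum_i \sigma_i z_i\|_\infty$. This toy class is not $\mathcal{H}_g$, but it illustrates the $O(1/\sqrt{n})$ decay:

```python
import numpy as np

# Monte-Carlo estimate of the empirical Rademacher complexity of the class
# {z -> w.z : ||w||_1 <= R}. For this class, the supremum over w is attained
# at a vertex of the l1-ball, giving R * ||(1/n) sum_i sigma_i z_i||_inf.
rng = np.random.default_rng(3)

def rademacher_complexity(Z, R=1.0, draws=500):
    n = Z.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(draws, n))    # Rademacher signs
    return float(np.mean(R * np.abs(sigma @ Z / n).max(axis=1)))

Z = rng.normal(size=(6400, 5))
r_small = rademacher_complexity(Z[:100])    # n = 100
r_large = rademacher_complexity(Z)          # n = 6400: roughly sqrt(64) = 8x smaller
```

Growing the sample by a factor of 64 shrinks the estimate by roughly a factor of 8, consistent with $1/\sqrt{n}$ scaling.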