A Neural Mean Embedding Approach for Back-door and Front-door
Adjustment
Liyuan Xu
Gatsby Unit
liyuan.jo.19@ucl.ac.uk
Arthur Gretton
Gatsby Unit
arthur.gretton@gmail.com
October 14, 2022 (arXiv:2210.06610v1 [cs.LG])
Abstract
We consider the estimation of average and counterfactual treatment effects, under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without having access to a hidden confounder. This
objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the “first stage”
regression), and then taking the (conditional) expectation of this function as a “second stage” procedure. We propose to compute these
conditional expectations directly using a regression function to the learned input features of the first stage, thus avoiding the need
for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural
networks learned adaptively from data, with the sole requirement that the final layer of the first stage should be linear. The proposed
method is shown to converge to the true causal parameter, and outperforms the recent state-of-the-art methods on challenging causal
benchmarks, including settings involving high-dimensional image data.
1 Introduction
The goal of causal inference from observational data is to predict the effect of our actions, or treatments, on the outcome without
performing interventions. Questions of interest can include "what is the effect of smoking on life expectancy?" or counterfactual questions, such as "given the observed health outcome for a smoker, how long would they have lived had they quit smoking?" Answering these questions becomes challenging when a confounder exists, which affects both the treatment and the outcome, and causes bias in the
estimation. Causal estimation requires us to correct for this confounding bias.
A popular assumption in causal inference is the no unmeasured confounder requirement, which means that we observe all the
confounders that cause the bias in the estimation. Although a number of causal inference methods have been proposed under this assumption
[Hill, 2011, Shalit et al., 2017, Shi et al., 2019, Schwab et al., 2020], it rarely holds in practice. In the smoking example, the
confounder can be one’s genetic characteristics or social status, which are difficult to measure for both technical and ethical reasons.
To address this issue, Pearl [1995] proposed back-door adjustment and front-door adjustment, which recover the causal effect
in the presence of hidden confounders using a back-door variable or front-door variable, respectively. The back-door variable is a
covariate that blocks all causal effects directed from the confounder to the treatment. In health care, patients may have underlying
predispositions to illness due to genetic or social factors (hidden), from which measurable symptoms will arise (back-door variable); these symptoms in turn lead to a choice of treatment. By contrast, a front-door variable blocks the path from the treatment to the outcome.
In perhaps the best-known example, the amount of tar in a smoker’s lungs serves as a front-door variable, since it is increased by
smoking, shortens life expectancy, and has no direct link to underlying (hidden) sociological traits. Pearl [1995] showed that causal
quantities can be obtained by taking the (conditional) expectation of the conditional average outcome.
While Pearl [1995] only considered the discrete case, this framework was extended to the continuous case by Singh et al. [2020],
using two-stage regression (a review of this and other recent approaches for the continuous case is given in Section 5). In the first
stage, the approach regresses from the relevant covariates to the outcome of interest, expressing the function as a linear combination
of non-linear feature maps. Then, in the second stage, the causal parameters are estimated by learning the (conditional) expectation of
the non-linear feature map used in the first stage. Unlike competing methods [Colangelo and Lee, 2020, Kennedy et al., 2017], two-
stage regression avoids fitting probability densities, which is challenging in high-dimensional settings [Wasserman, 2006, Section
6.5]. Singh et al. [2020]’s method is shown to converge to the true causal parameters and exhibits better empirical performance than
competing methods.
One limitation of the methods in Singh et al. [2020] is that they use fixed pre-specified feature maps from reproducing kernel
Hilbert spaces, which have a limited expressive capacity when data are complex (images, text, audio). To overcome this, we propose
to employ a neural mean embedding approach to learning task-specific adaptive feature dictionaries. At a high level, we first employ
a neural network with a linear final layer in the first stage. For the second stage, we learn the (conditional) mean of the stage 1 features
in the penultimate layer, again with a neural net. The approach develops the technique of Xu et al. [2021a,b] and enables the model
to capture complex causal relationships for high-dimensional covariates and treatments. Neural network feature means are also used
to represent (conditional) probabilities in other machine learning settings, such as representation learning [Zaheer et al., 2017] and
approximate Bayesian inference [Xu et al., 2022]. We derive the consistency of the method based on the Rademacher complexity,
a result of which is of independent interest and may be relevant in establishing consistency for broader categories of neural mean
embedding approaches, including Xu et al. [2021a,b]. We empirically show that the proposed method performs better than other
state-of-the-art neural causal inference methods, including those using kernel feature dictionaries.
This paper is structured as follows. In Section 2, we introduce the causal parameters we are interested in; we then give a detailed
description of the proposed method in Section 3. The theoretical analysis is presented in Section 4, followed by a review of related
work in Section 5. We demonstrate the empirical performance of the proposed method in Section 6, covering two settings: a classical
back-door adjustment problem with a binary treatment, and a challenging back-door and front-door setting where the treatment
consists of high-dimensional image data.
2 Problem Setting
In this section, we introduce the target causal parameters and the methods used to estimate them, namely back-door adjustment and front-door adjustment. Throughout the paper, we denote a random variable by a capital letter (e.g. $A$), a realization of this random variable in lowercase (e.g. $a$), and the set in which a random variable takes values by a calligraphic letter (e.g. $\mathcal{A}$). We assume data is generated from a distribution $P$.
Causal Parameters We introduce the target causal parameters using the potential outcome framework [Rubin, 2005]. Let the treatment and the observed outcome be $A \in \mathcal{A}$ and $Y \in \mathcal{Y} \subseteq [-R, R]$. We denote the potential outcome given treatment $a$ as $Y(a) \in \mathcal{Y}$. Here, we assume no interference, which means that we observe $Y = Y(a)$ when $A = a$. We denote the hidden confounder as $U \in \mathcal{U}$ and assume conditional exchangeability, $\forall a \in \mathcal{A},\ Y(a) \perp A \mid U$, which means that, conditional on $U$, the potential outcomes are independent of the treatment assignment. A typical causal graph is shown in Figure 1a. We may additionally consider an observable confounder $O \in \mathcal{O}$, which is discussed in Appendix B.
A first goal of causal inference is to estimate the Average Treatment Effect (ATE)¹ $\theta_{\mathrm{ATE}}(a) = \mathbb{E}[Y(a)]$, which is the average potential outcome of $A=a$. We also consider the Average Treatment Effect on the Treated (ATT) $\theta_{\mathrm{ATT}}(a; a') = \mathbb{E}[Y(a) \mid A=a']$, which is the expected potential outcome of $A=a$ for those who received the treatment $A=a'$. Given the no interference and conditional exchangeability assumptions, these causal parameters can be written in the following form.
Proposition 1 (Rosenbaum and Rubin, 1983, Robins, 1986). Given unobserved confounder $U$, which satisfies no interference and conditional exchangeability, we have
$$\theta_{\mathrm{ATE}}(a) = \mathbb{E}_U[\mathbb{E}[Y \mid A=a, U]], \qquad \theta_{\mathrm{ATT}}(a; a') = \mathbb{E}_U[\mathbb{E}[Y \mid A=a, U] \mid A=a'].$$
If we observe an additional confounder $O$, we may also consider the conditional average treatment effect (CATE): the average potential outcome for the sub-population with $O=o$, which is discussed in Appendix B. Note that since the confounder $U$ is not observed, we cannot recover these causal parameters from $(A, Y)$ alone.
Back-door Adjustment In back-door adjustment, we assume access to the back-door variable $X \in \mathcal{X}$, which blocks all causal paths from the unobserved confounder $U$ to the treatment $A$. See Figure 1b for a typical causal graph. Given the back-door variable, the causal parameters can be written in terms of the observable variables $(A, Y, X)$ as follows.
Proposition 2 (Pearl, 1995, Theorem 1). Given the back-door variable $X$, we have
$$\theta_{\mathrm{ATE}}(a) = \mathbb{E}_X[g(a, X)], \qquad \theta_{\mathrm{ATT}}(a; a') = \mathbb{E}_X[g(a, X) \mid A=a'],$$
where $g(a, x) = \mathbb{E}[Y \mid A=a, X=x]$.
By comparing Proposition 2 to Proposition 1, we can see that the causal parameters can be learned by treating the back-door variable $X$ as the only "confounder", despite the presence of the additional hidden confounder $U$. Hence, we may apply any method based on the "no unmeasured confounder" assumption to back-door adjustment.
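As a concrete check of Proposition 2, the following sketch simulates a toy linear Gaussian SEM matching Figure 1b (all structural coefficients below are illustrative choices of ours, not from the paper) and compares the back-door-adjusted effect with a naive regression of $Y$ on $A$:

```python
import numpy as np

# Toy linear SEM for the back-door graph: U -> X -> A, U -> Y, A -> Y,
# with true causal effect 2.0 (all coefficients are illustrative).
rng = np.random.default_rng(0)
n = 100_000
U = rng.normal(size=n)                       # hidden confounder
X = U + rng.normal(size=n)                   # back-door variable
A = X + rng.normal(size=n)                   # treatment
Y = 2.0 * A + U + rng.normal(size=n)         # outcome

# First stage: g(a, x) = E[Y | A=a, X=x], here a linear regression on (A, X)
design = np.column_stack([np.ones(n), A, X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)

def theta_ate(a):
    # Proposition 2: theta_ATE(a) = E_X[g(a, X)], estimated by averaging over x_i
    return np.mean(coef[0] + coef[1] * a + coef[2] * X)

effect_adjusted = theta_ate(1.0) - theta_ate(0.0)   # close to the true 2.0
naive_slope = np.cov(A, Y)[0, 1] / np.var(A)        # confounded: about 7/3
```

Adjusting for $X$ recovers the true effect even though $U$ is never observed, while the naive slope absorbs the confounding path through $U$.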
Front-door Adjustment Another adjustment for causal estimation is front-door adjustment, which uses the causal mechanism to determine the causal effect. Assume we observe the front-door variable $M \in \mathcal{M}$, which blocks all causal paths from the treatment $A$ to the outcome $Y$, as in Figure 1c. Then, we can recover the causal parameters as follows.
¹In the binary treatment case $\mathcal{A} = \{0, 1\}$, the ATE is typically defined as the expectation of the difference of potential outcomes $\mathbb{E}[Y(1) - Y(0)]$. However, we define the ATE as the expectation of the potential outcome $\mathbb{E}[Y(a)]$, which is the primary target of interest in the continuous treatment case, also known as the dose-response curve. The same applies to the ATT as well.
Figure 1: Causal graphs we consider in this paper. (a) General causal graph over treatment $A$, outcome $Y$, and hidden confounder $U$. (b) Back-door adjustment, with back-door variable $X$. (c) Front-door adjustment, with front-door variable $M$. The dotted circle denotes the unobservable variable.
Proposition 3 (Pearl, 1995, Theorem 2). Given the front-door variable $M$, we have
$$\theta_{\mathrm{ATE}}(a) = \mathbb{E}_{A'}[\mathbb{E}_M[g(A', M) \mid A=a]], \qquad \theta_{\mathrm{ATT}}(a; a') = \mathbb{E}_M[g(a', M) \mid A=a],$$
where $g(a, m) = \mathbb{E}[Y \mid A=a, M=m]$ and $A' \in \mathcal{A}$ is a random variable that follows the same distribution as the treatment $A$.
Unlike the case of back-door adjustment, we cannot naively apply methods based on the "no unmeasured confounder" assumption here, since Proposition 3 takes a different form from Proposition 1.
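Proposition 3 can be verified by exact enumeration on a small all-binary structural model (the probability tables below are hypothetical): the front-door formula, computed purely from observational quantities, reproduces the interventional mean obtained directly from the model.

```python
import numpy as np

# Binary SCM for the front-door graph: U -> A, A -> M, (M, U) -> Y.
# U is hidden; M is the front-door variable. All tables are hypothetical.
pU = np.array([0.5, 0.5])                   # P(U=u)
pA1 = np.array([0.2, 0.8])                  # P(A=1 | U=u)
pM1 = np.array([0.1, 0.9])                  # P(M=1 | A=a)
pY1 = np.array([[0.1, 0.4], [0.6, 0.9]])    # E[Y | M=m, U=u], indexed [m, u]

def bern(p1, v):                            # P(V=v) for Bernoulli with P(V=1)=p1
    return p1 if v == 1 else 1.0 - p1

def do_truth(a):                            # ground truth E[Y | do(A=a)] from the SCM
    return sum(pU[u] * bern(pM1[a], m) * pY1[m, u]
               for u in (0, 1) for m in (0, 1))

# Observational quantities only, as available to the analyst
pA = np.array([sum(pU[u] * bern(pA1[u], a) for u in (0, 1)) for a in (0, 1)])

def g(a, m):                                # E[Y | A=a, M=m]; here M is indep. of U given A
    post = np.array([pU[u] * bern(pA1[u], a) for u in (0, 1)])
    post /= post.sum()                      # P(U=u | A=a)
    return sum(post[u] * pY1[m, u] for u in (0, 1))

def front_door(a):                          # Proposition 3: E_{A'}[ E_M[ g(A', M) | A=a ] ]
    return sum(bern(pM1[a], m) * pA[ap] * g(ap, m)
               for m in (0, 1) for ap in (0, 1))
```

The two computations agree exactly here, because averaging $g(a', m)$ over the marginal of $A'$ integrates the hidden confounder back out.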
3 Algorithms
In this section, we present our proposed methods. We first present the case with back-door adjustment and then move to front-door
adjustment.
Back-door adjustment The algorithm consists of two stages. In the first stage, we learn the conditional expectation $g(a, x) = \mathbb{E}[Y \mid A=a, X=x]$ with a specific form. We then compute the causal parameter by estimating the expectation of the input features to $g$.
The conditional expectation $g(a, x)$ is learned by regressing $(A, X)$ to $Y$. Here, we consider a specific model $g(a, x) = w^\top(\phi_A(a) \otimes \phi_X(x))$, where $\phi_A : \mathcal{A} \to \mathbb{R}^{d_1}$ and $\phi_X : \mathcal{X} \to \mathbb{R}^{d_2}$ are feature maps represented by neural networks, $w \in \mathbb{R}^{d_1 d_2}$ is a trainable weight vector, and $\otimes$ denotes the tensor product $a \otimes b = \mathrm{vec}(ab^\top)$. Given data $\{(a_i, y_i, x_i)\}_{i=1}^n \sim P$ of size $n$, the feature maps $\phi_A, \phi_X$ and the weight $w$ can be trained by minimizing the following empirical loss:
$$\hat{L}_1(w, \phi_A, \phi_X) = \frac{1}{n}\sum_{i=1}^n \left(y_i - w^\top(\phi_A(a_i) \otimes \phi_X(x_i))\right)^2. \quad (1)$$
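To make the tensor-product model concrete, here is a minimal sketch with fixed toy feature dictionaries standing in for the learned neural features (the feature maps and the data-generating process are our own illustrative assumptions; with fixed features, the minimiser of (1) is ordinary least squares):

```python
import numpy as np

# Minimal sketch of the stage-1 model g(a, x) = w^T (phi_A(a) (x) phi_X(x)).
# The toy dictionaries below stand in for the learned neural feature maps,
# whose last layer must be linear.
def phi_A(a):                       # hypothetical treatment features, d1 = 3
    return np.array([1.0, a, a ** 2])

def phi_X(x):                       # hypothetical back-door features, d2 = 2
    return np.array([1.0, x])

def g(w, a, x):
    # a (x) b = vec(a b^T), implemented with np.kron
    return w @ np.kron(phi_A(a), phi_X(x))

def stage1_loss(w, data):
    # Empirical loss (1): mean squared error over the sample
    return np.mean([(y - g(w, a, x)) ** 2 for a, y, x in data])

# With fixed features, the minimiser of (1) is ordinary least squares
rng = np.random.default_rng(0)
data = [(a, a + x + rng.normal(0, 0.1), x)
        for a, x in rng.normal(size=(200, 2))]
Phi = np.array([np.kron(phi_A(a), phi_X(x)) for a, _, x in data])
Y = np.array([y for _, y, _ in data])
w_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
```

Since the truth $y \approx a + x$ lies in the span of the tensor-product dictionary, the fitted $\hat{g}$ is close to $a + x$; in the paper, gradient descent on (1) trains the features themselves as well.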
We may add any regularization term to this loss, such as weight decay $\lambda\|w\|^2$. Let the minimizers of the loss $\hat{L}_1$ be $\hat{w}, \hat{\phi}_A, \hat{\phi}_X = \arg\min \hat{L}_1$ and the learned regression function be $\hat{g}(a, x) = \hat{w}^\top(\hat{\phi}_A(a) \otimes \hat{\phi}_X(x))$. Then, by substituting $\hat{g}$ for $g$ in Proposition 2, we have
$$\theta_{\mathrm{ATE}}(a) \simeq \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \mathbb{E}\big[\hat{\phi}_X(X)\big]\right), \qquad \theta_{\mathrm{ATT}}(a; a') \simeq \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \mathbb{E}\big[\hat{\phi}_X(X) \mid A=a'\big]\right).$$
This is the advantage of assuming the specific form $g(a, x) = w^\top(\phi_A(a) \otimes \phi_X(x))$: by linearity, we can recover the causal parameters by estimating $\mathbb{E}[\hat{\phi}_X(X)]$ and $\mathbb{E}[\hat{\phi}_X(X) \mid A=a']$. Such (conditional) expectations of features are called (conditional) mean embeddings, and thus we name our method "neural (conditional) mean embedding".
We can estimate the marginal expectation $\mathbb{E}[\hat{\phi}_X(X)]$ as a simple empirical average, $\mathbb{E}[\hat{\phi}_X(X)] \simeq \frac{1}{n}\sum_{i=1}^n \hat{\phi}_X(x_i)$.
The conditional mean embedding $\mathbb{E}[\hat{\phi}_X(X) \mid A=a']$ requires more care, however: it can be learned by a technique proposed in Xu et al. [2021a], in which we train another regression function from the treatment $A$ to the back-door feature $\hat{\phi}_X(X)$. Specifically, we estimate $\mathbb{E}[\hat{\phi}_X(X) \mid A=a']$ by $\hat{f}_{\phi_X}(a')$, where the regression function $\hat{f}_{\phi_X} : \mathcal{A} \to \mathbb{R}^{d_2}$ is given by
$$\hat{f}_{\phi_X} = \arg\min_{f : \mathcal{A} \to \mathbb{R}^{d_2}} \hat{L}_2(f; \phi_X), \qquad \hat{L}_2(f; \phi_X) = \frac{1}{n}\sum_{i=1}^n \big\|\phi_X(x_i) - f(a_i)\big\|^2. \quad (2)$$
Here, $\|\cdot\|$ denotes the Euclidean norm. The loss $\hat{L}_2$ may include an additional regularization term, such as weight decay on the parameters of $f$. We then have
$$\hat{\theta}_{\mathrm{ATE}}(a) = \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \frac{1}{n}\sum_{i=1}^n \hat{\phi}_X(x_i)\right), \qquad \hat{\theta}_{\mathrm{ATT}}(a; a') = \hat{w}^\top\left(\hat{\phi}_A(a) \otimes \hat{f}_{\hat{\phi}_X}(a')\right)$$
as the final estimators for back-door adjustment. The estimator for the ATE, $\hat{\theta}_{\mathrm{ATE}}$, reduces to the average of the predictions, $\hat{\theta}_{\mathrm{ATE}}(a) = \frac{1}{n}\sum_{i=1}^n \hat{g}(a, x_i)$. This coincides with other neural network causal methods [Shalit et al., 2017, Chernozhukov et al., 2022a], which do not assume $g(a, x) = w^\top(\phi_A(a) \otimes \phi_X(x))$. As we have seen, however, this tensor product formulation is essential for estimating the ATT by back-door adjustment. It will also be necessary for front-door adjustment, as we will see next.
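Putting both stages together, the following sketch runs the full back-door procedure on a toy problem, with fixed linear feature maps standing in for the trained networks (the data-generating process and feature maps are illustrative assumptions of ours):

```python
import numpy as np

# Toy two-stage back-door pipeline with fixed features (illustrative only).
rng = np.random.default_rng(1)
n = 5_000
A = rng.normal(size=n)
X = 0.5 * A + rng.normal(size=n)
Y = A + X + 0.1 * rng.normal(size=n)

def phi_A(a):                    # stand-in for the learned treatment features
    return np.array([1.0, a])

def phi_X(x):                    # stand-in for the learned back-door features
    return np.array([1.0, x])

# Stage 1 (loss (1)): least squares over tensor-product features
Phi1 = np.array([np.kron(phi_A(a), phi_X(x)) for a, x in zip(A, X)])
w_hat, *_ = np.linalg.lstsq(Phi1, Y, rcond=None)

# Stage 2 (loss (2)): vector-valued least squares for f(a) ~ E[phi_X(X) | A=a]
Psi = np.array([phi_A(a) for a in A])
Targets = np.array([phi_X(x) for x in X])
B, *_ = np.linalg.lstsq(Psi, Targets, rcond=None)

def f_hat(a):
    return phi_A(a) @ B

# Final estimators
def theta_ate(a):
    return w_hat @ np.kron(phi_A(a), Targets.mean(axis=0))

def theta_att(a, a0):
    return w_hat @ np.kron(phi_A(a), f_hat(a0))

def g_hat(a, x):
    return w_hat @ np.kron(phi_A(a), phi_X(x))

# By linearity, theta_ate(a) equals the average of the predictions g_hat(a, x_i)
check = np.mean([g_hat(1.0, x) for x in X])
```

Here $g(a, x) = a + x$, so the estimates come out near $\hat{\theta}_{\mathrm{ATE}}(1) \approx 1$ (since $\mathbb{E}[X] = 0$) and $\hat{\theta}_{\mathrm{ATT}}(1; 2) \approx 1 + \mathbb{E}[X \mid A=2] = 2$, and the ATE estimator matches the average of predictions up to floating-point error.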
Front-door adjustment We can obtain the estimator for front-door adjustment by following almost the same procedure as for back-door adjustment. Given data $\{(a_i, y_i, m_i)\}_{i=1}^n$, we again fit the regression model $\hat{g}(a, m) = \hat{w}^\top(\hat{\phi}_A(a) \otimes \hat{\phi}_M(m))$ by solving
$$\hat{w}, \hat{\phi}_A, \hat{\phi}_M = \arg\min \frac{1}{n}\sum_{i=1}^n \left(y_i - w^\top(\phi_A(a_i) \otimes \phi_M(m_i))\right)^2,$$
where $\phi_M : \mathcal{M} \to \mathbb{R}^{d_2}$ is a feature map represented as a neural network. From Proposition 3, we have
$$\theta_{\mathrm{ATE}}(a) \simeq \hat{w}^\top\left(\mathbb{E}\big[\hat{\phi}_A(A)\big] \otimes \mathbb{E}\big[\hat{\phi}_M(M) \mid A=a\big]\right), \qquad \theta_{\mathrm{ATT}}(a; a') \simeq \hat{w}^\top\left(\hat{\phi}_A(a') \otimes \mathbb{E}\big[\hat{\phi}_M(M) \mid A=a\big]\right).$$
Again, we estimate the feature embeddings by an empirical average for $\mathbb{E}[\hat{\phi}_A(A)]$, or by solving another regression problem for $\mathbb{E}[\hat{\phi}_M(M) \mid A=a]$. The final estimators for front-door adjustment are given as
$$\hat{\theta}_{\mathrm{ATE}}(a) = \hat{w}^\top\left(\frac{1}{n}\sum_{i=1}^n \hat{\phi}_A(a_i) \otimes \hat{f}_{\hat{\phi}_M}(a)\right), \qquad \hat{\theta}_{\mathrm{ATT}}(a; a') = \hat{w}^\top\left(\hat{\phi}_A(a') \otimes \hat{f}_{\hat{\phi}_M}(a)\right),$$
where $\hat{f}_{\hat{\phi}_M}$ is given by minimizing the loss $\hat{L}_2$ (with an additional regularization term) defined as
$$\hat{L}_2 = \frac{1}{n}\sum_{i=1}^n \big\|\phi_M(m_i) - f(a_i)\big\|^2.$$
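The front-door estimators assemble the fitted pieces in exactly this way. The sketch below uses arbitrary arrays as stand-ins for trained network outputs, and checks the linearity property that justifies placing the averaged features inside the tensor product:

```python
import numpy as np

# Assembling the front-door estimators from already-fitted pieces; every
# array here is an arbitrary stand-in for a trained network output:
#   theta_ATE(a)     = w^T ( (1/n) sum_i phi_A(a_i) (x) f_hat(a) )
#   theta_ATT(a; a') = w^T (        phi_A(a')       (x) f_hat(a) )
rng = np.random.default_rng(2)
d1, d2, n = 3, 2, 50
w_hat = rng.normal(size=d1 * d2)
phiA_sample = rng.normal(size=(n, d1))   # phi_A(a_i) over the sample
f_hat_a = rng.normal(size=d2)            # stage-2 prediction f_hat(a)

theta_ate = w_hat @ np.kron(phiA_sample.mean(axis=0), f_hat_a)

def theta_att(phiA_a0):                  # ATT with phi_A evaluated at a'
    return w_hat @ np.kron(phiA_a0, f_hat_a)

# Linearity check: averaging features inside the tensor product equals
# averaging the per-sample terms outside
avg_outside = np.mean([theta_att(p) for p in phiA_sample])
```

Because the tensor product is linear in each argument, the two computations coincide up to floating-point error; this is the same linearity that the back-door ATE estimator relies on.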
4 Theoretical Analysis
In this section, we prove the consistency of the proposed method. We focus on the back-door adjustment case, since the consistency
of front-door adjustment can be derived identically. The proposed method consists of two successive regression problems. In the first
stage, we learn the conditional expectation g, and then in the second stage, we estimate the feature embeddings. First, we show each
stage’s consistency, then present the overall convergence rate to the causal parameter.
Consistency for the first stage: In this section, we consider the hypothesis space of $g$ as
$$\mathcal{H}_g = \left\{ w^\top(\phi_A(a) \otimes \phi_Z(z)) \;\middle|\; w \in \mathbb{R}^{d_1 d_2},\ \phi_A(a) \in \mathbb{R}^{d_1},\ \phi_Z(z) \in \mathbb{R}^{d_2},\ \|w\|_1 \le R,\ \max_{a \in \mathcal{A}} \|\phi_A(a)\|_\infty \le 1,\ \max_{z \in \mathcal{Z}} \|\phi_Z(z)\|_\infty \le 1 \right\}.$$
Here, we denote the $\ell_1$-norm and infinity norm of a vector $b \in \mathbb{R}^d$ as $\|b\|_1 = \sum_{i=1}^d |b_i|$ and $\|b\|_\infty = \max_{i \in [d]} |b_i|$. Note that from the inequality $\|\phi_A(a) \otimes \phi_Z(z)\|_\infty \le \|\phi_A(a)\|_\infty \|\phi_Z(z)\|_\infty$ and Hölder's inequality, we can show that $h(a, z) \in [-R, R]$ for all $h \in \mathcal{H}_g$. Given this hypothesis space, the following lemma bounds the deviation of the estimated conditional expectation $\hat{g}$ from the true one.
Lemma 1. Given data $S = \{(a_i, y_i, x_i)\}_{i=1}^n$, let the minimizer of the loss $\hat{L}_1$ be $\hat{g} = \arg\min \hat{L}_1$. If the true conditional expectation $g$ is in the hypothesis space, $g \in \mathcal{H}_g$, then with probability at least $1 - 2\delta$, we have
$$\|g - \hat{g}\|_{P(A,X)} \le \sqrt{16 R\, \hat{\mathcal{R}}_S(\mathcal{H}_g) + 8R^2\sqrt{\frac{\log(2/\delta)}{2n}}},$$
where $\hat{\mathcal{R}}_S(\mathcal{H}_g)$ is the empirical Rademacher complexity of $\mathcal{H}_g$ given data $S$.
The proof is given in Appendix A.2. Here, we present the empirical Rademacher complexity when we apply a feed-forward neural network for the features.
Lemma 2. The empirical Rademacher complexity $\hat{\mathcal{R}}_S(\mathcal{H}_g)$ scales as
$$\hat{\mathcal{R}}_S(\mathcal{H}_g) \le O\!\left(C^L/\sqrt{n}\right)$$
for some constant $C$ if we use a specific $L$-layer neural net for the features $\phi_A, \phi_X$.
See Lemma 6 in Appendix A.2 for the detailed expression of the upper bound. Note that this may be of independent interest, since a similar hypothesis class is considered in Xu et al. [2021a,b], and no explicit upper bound is provided on the empirical Rademacher complexity in that work.
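To build intuition for the quantity in Lemma 1, the sketch below Monte-Carlo estimates the empirical Rademacher complexity of a much simpler class, linear functionals with an $\ell_1$-ball constraint, for which the supremum has the closed form $R\,\|\frac{1}{n}\sum_i \sigma_i z_i\|_\infty$. This toy class is not $\mathcal{H}_g$, but it illustrates the $O(1/\sqrt{n})$ decay:

```python
import numpy as np

# Monte-Carlo estimate of the empirical Rademacher complexity of the class
# {z -> w.z : ||w||_1 <= R}. For this class, the supremum over w is attained
# at a vertex of the l1-ball, giving R * ||(1/n) sum_i sigma_i z_i||_inf.
rng = np.random.default_rng(3)

def rademacher_complexity(Z, R=1.0, draws=500):
    n = Z.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(draws, n))    # Rademacher signs
    return float(np.mean(R * np.abs(sigma @ Z / n).max(axis=1)))

Z = rng.normal(size=(6400, 5))
r_small = rademacher_complexity(Z[:100])    # n = 100
r_large = rademacher_complexity(Z)          # n = 6400: roughly sqrt(64) = 8x smaller
```

Growing the sample by a factor of 64 shrinks the estimate by roughly a factor of 8, consistent with $1/\sqrt{n}$ scaling.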