Transfer Learning for Individual Treatment Effect Estimation

Ahmed Aloui*1, Juncheng Dong*1, Cat P. Le1, Vahid Tarokh1
1Department of Electrical and Computer Engineering, Duke University
*Equal contribution.

arXiv:2210.00380v3 [cs.LG] 5 Jun 2023. Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023).
Abstract

This work considers the problem of transferring causal knowledge between tasks for Individual Treatment Effect (ITE) estimation. To this end, we theoretically assess the feasibility of transferring ITE knowledge and present a practical framework for efficient transfer. A lower bound is introduced on the ITE error of the target task to demonstrate that ITE knowledge transfer is challenging due to the absence of counterfactual information. Nevertheless, we establish generalization upper bounds on the counterfactual loss and ITE error of the target task, demonstrating the feasibility of ITE knowledge transfer. Subsequently, we introduce a framework with a new Causal Inference Task Affinity (CITA) measure for ITE knowledge transfer. Specifically, we use CITA to find the source task closest to the target task and utilize it for ITE knowledge transfer. Empirical studies demonstrating the efficacy of the proposed method are provided. We observe that ITE knowledge transfer can significantly (by up to 95%) reduce the amount of data required for ITE estimation.
1 INTRODUCTION

Assessing the effects of treatments on people, i.e., Individual Treatment Effect (ITE) estimation, is of significant interest to various research communities, such as those studying medicine and social policy making. In order to study the causal relationship between the outcome and the treatment, however, researchers must gather sufficient data samples from randomized controlled trials. This process can be both costly and time-consuming [Kaur and Gupta, 2020]. To this end, it is desirable to utilize knowledge from different but closely related problems with transfer learning. For instance, new vaccines must be developed when viruses mutate. If a mutated virus can be related to known ones by a similarity measure, then the effects of vaccine candidates can be quickly estimated based on this similarity using only a small amount of data collected from the new scenario. Hence, this approach can notably accelerate the study.
While the recent progress in transfer learning is very promising [Wang and Deng, 2018, Alyafeai et al., 2020, Pan and Yang, 2010, Zhuang et al., 2021], a major challenge for transferring causal knowledge arises from the non-causal (spurious) correlations to which statistical learning models are vulnerable. For example, a classifier may learn to use background colors to differentiate images of camels and horses, as these objects are frequently depicted against differently colored backgrounds [Arjovsky et al., 2019, Geirhos et al., 2018, Beery et al., 2018]. Moreover, in practice, the performance of ITE estimation models can never be directly evaluated because the counterfactual data is inaccessible, as shown in Figure 1. This problem is known in the literature as the fundamental problem of causal inference [Rubin, 1974, Holland, 1986]. For instance, to compute the effect of vaccination on a person at some given time, that individual would have to both receive the vaccine and remain unvaccinated, which is clearly impossible. This scenario is very different from conventional supervised learning problems, where researchers often use a separate validation set to estimate the accuracy of the trained model.
The aforementioned challenge implies that much attention must be paid to selecting the appropriate source model in causal knowledge transfer. Additionally, scenarios similar to the given target task must be identified using a distance that accounts for the immeasurable counterfactual losses of the scenarios under consideration. In this work, we first present a lower bound and a set of generalization bounds for transfer learning between causal inference tasks in order to demonstrate both the difficulty and the viability of causal knowledge transfer.

Figure 1: Inaccessibility to counterfactual data (e.g., a parallel universe where the treatments are reversed) makes transferring causal knowledge more challenging.

While these theoretical bounds are informative, a method is still needed for selecting the optimal source model from multiple source tasks. This is discussed in Section 5, where we introduce a framework endowed with a new task affinity, namely the Causal Inference Task Affinity (CITA), tailored explicitly for causal knowledge transfer. This task affinity is used for selecting the "closest" source task; subsequently, its knowledge (e.g., trained models, source dataset) is utilized in the learning of the target task, as depicted in Figure 2. Our contributions are summarized below:
1. We establish a new lower bound to demonstrate the challenges of transferring ITE knowledge. Additionally, we prove new regret bounds for learning the counterfactual outcomes and ITEs of the target tasks in causal transfer learning scenarios. These bounds demonstrate the feasibility of transferring ITE knowledge by showing that the error of any source model on the target task is upper bounded by quantifiable measures related to (i) the performance of the source model on the source task and (ii) the differences between the source and the target causal inference tasks.

2. We introduce CITA, a task affinity for causal inference, which captures the symmetry of ITEs (i.e., invariance to the relabeling of treatment assignments under the action of the symmetric group). Additionally, we provide theoretical (e.g., Theorem F.3) and empirical evidence showing that CITA is highly correlated with the counterfactual loss, which is not measurable in practice.

3. We propose an ITE estimation framework and a set of causal inference datasets suitable for studying causal knowledge transfer. The empirical evidence on these datasets demonstrates that our methods can estimate the ITEs of the target task with significantly fewer (up to 95% reduction) data samples compared to the case where transfer learning is not performed.
2 RELATED WORK

Many approaches to transfer learning [Thrun and Pratt, 2012, Blum and Mitchell, 1998, Silver and Bennett, 2008, Sharif Razavian et al., 2014, Finn et al., 2016, Fernando et al., 2017, Rusu et al., 2016, Le et al., 2020] have been proposed, analyzed, and applied in various machine learning applications. Transfer learning techniques inherently assume that prior knowledge in the selected source model helps with learning a target task [Pan and Yang, 2010, Zhuang et al., 2021]. In other words, these methods often do not consider the selection of the base task from which knowledge is transferred. Consequently, in some rare cases, transfer learning may even degrade the performance of the model [Standley et al., 2020]. In order to avoid potential performance loss during knowledge transfer to a target task, task affinity (or task similarity) is used as a selection method that identifies a group of the closest base candidates from the set of previously learned tasks. Task affinity has been investigated and applied in various domains, e.g., transfer learning [Zamir et al., 2018, Dwivedi and Roig, 2019, Wang et al., 2019], neural architecture search [Le et al., 2021, 2022a], few-shot learning [Pal and Balasubramanian, 2019, Le et al., 2022b], multi-task learning [Standley et al., 2020], and continual learning [Kirkpatrick et al., 2017, Chen et al., 2018].
While transfer learning and task affinity have been investigated in numerous application areas, their application to causal inference has yet to be thoroughly investigated. The Neyman-Rubin causal model [Neyman, 1923, Donald, 2005] and Pearl's do-calculus [Pearl, 2009] are popular frameworks for causal studies based on different perspectives. A central question in the Neyman-Rubin framework is determining conditions under which causal quantities such as the Average and Individual Treatment Effects are identifiable. Previous work considered estimators of the Average Treatment Effect based on various methods such as covariate adjustment [Rubin, 1978], weighting methods such as those utilizing propensity scores [Rosenbaum and Rubin, 1983], and doubly robust estimators [Funk et al., 2011]. With the emergence of machine learning techniques, more recent approaches to causal inference include the application of decision trees [Wager and Athey, 2018, Athey and Imbens, 2016], Gaussian processes [Alaa and Van Der Schaar, 2017], and generative modeling [Yoon et al., 2018] to ITE estimation. In particular, deep neural networks have successfully learned ITEs and estimated counterfactual outcomes by balancing the data in a latent domain [Johansson et al., 2016, Shalit et al., 2017]. Note that the transportability of causal relationships is another well-studied, closely related field in the causality literature [Bareinboim and Pearl, 2012]; it studies transferring knowledge of causal relationships in Pearl's do-calculus framework. In contrast, in this paper we are interested in transferring knowledge of ITEs from a source task to a target task in the Neyman-Rubin framework using representation learning. A closely related problem to ours is domain adaptation for ITE estimation, as explored in [Bica and van der Schaar, 2022, Vo et al., 2022, Aglietti et al., 2020]. These works primarily focus on situations where only the distribution of the population changes, leaving the causal functions unaltered. In our research, we provide theoretical analysis and empirical studies for the case where both the population distributions and the causal mechanisms can change.
3 MATHEMATICAL BACKGROUND

3.1 CAUSAL INFERENCE

Let $X \in \mathcal{X} \subseteq \mathbb{R}^d$ be the covariates (i.e., input features), $A \in \{0, \dots, M\}$ be the treatment, and $Y \in \mathcal{Y} \subseteq \mathbb{R}$ be the factual (observed) outcome. For every $j \in \{0, \dots, M\}$, we define $Y_j$ to be the potential outcome [Rubin, 1974] that would have been observed if the treatment $A = j$ had been assigned. In the medical context, for instance, $X$ is the individual's information (e.g., weight, heart rate), $A$ is the treatment assignment (e.g., $A = 0$ if the individual did not receive a vaccine and $A = 1$ if the individual is vaccinated), and $Y$ is the outcome (e.g., mortality). A causal inference dataset is a collection of factual observations $D_F = \{(x_i, a_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of samples. We assume these samples are independently drawn from the same factual distribution $p_F$. In a parallel universe where the roles of the treatment and control groups were reversed, we would have observed a different set of samples $D_{CF}$ drawn from the counterfactual distribution $p_{CF}$. In this work, we present our results for the binary case, i.e., $M = 1$; however, our approach extends readily to any finite $M$. In the binary case, the individuals who received treatments $A = 0$ and $A = 1$ constitute the control and treatment groups, respectively.
Definition 3.1 (ITE). The Individual Treatment Effect (ITE), also referred to as the Conditional Average Treatment Effect (CATE) [Imbens and Rubin, 2015], is defined as:
$$\forall x \in \mathcal{X}, \quad \tau(x) = \mathbb{E}[Y_1 - Y_0 \mid X = x] \tag{1}$$
We assume that the data generation process satisfies overlap, i.e., $\forall x \in \mathcal{X},\ 0 < p(a = 1 \mid x) < 1$, and conditional unconfoundedness, i.e., $(Y_1, Y_0) \perp\!\!\!\perp A \mid X$ [Robins, 1986]. These assumptions are sufficient conditions for the ITE to be identifiable [Imbens, 2004]. We also assume that the true causal relationship is described by a function $f(x, a)$, which can be expressed as an expected value in the non-deterministic case. By definition, $\tau(x) = f(x, 1) - f(x, 0)$. Let $\hat{f}(x, a)$ denote a hypothesis that estimates the true function $f(x, a)$; the ITE can then be estimated as $\hat{\tau}(x) = \hat{f}(x, 1) - \hat{f}(x, 0)$. We use $l_{\hat{f}}(x, a, y)$ to denote a loss function that quantifies the performance of $\hat{f}(\cdot, \cdot)$. A possible example is the $L_2$ loss, defined as $l_{\hat{f}}(x, a, y) = (y - \hat{f}(x, a))^2$.
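To fix notation, the following is a minimal simulation sketch (ours, purely illustrative, not from the paper): a binary-treatment dataset in which the propensity is strictly inside $(0, 1)$ (overlap) and the treatment depends on $X$ alone (unconfoundedness). The outcome functions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=1000, d=5):
    """Simulate a binary-treatment dataset with known potential outcomes."""
    X = rng.normal(size=(n, d))                   # covariates
    e = 1.0 / (1.0 + np.exp(-X[:, 0]))            # propensity p(a=1|x), strictly in (0, 1): overlap
    A = rng.binomial(1, e)                        # treatment depends on X only: unconfoundedness
    beta = rng.normal(size=d)
    f0 = X @ beta                                 # illustrative control-outcome function f(x, 0)
    f1 = f0 + 2.0 + np.sin(X[:, 1])               # illustrative treated-outcome function f(x, 1)
    Y0 = f0 + rng.normal(scale=0.1, size=n)       # potential outcome under control
    Y1 = f1 + rng.normal(scale=0.1, size=n)       # potential outcome under treatment
    Y = np.where(A == 1, Y1, Y0)                  # factual outcome: one potential outcome per unit
    tau = f1 - f0                                 # ground-truth ITE (available only in simulation)
    return X, A, Y, Y0, Y1, tau
```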
Definition 3.2 (Factual Loss). For a hypothesis $\hat{f}$ and a loss function $l_{\hat{f}}$, the factual loss is defined as:
$$\epsilon_F(\hat{f}) = \int_{\mathcal{X} \times \{0,1\} \times \mathcal{Y}} l_{\hat{f}}(x, a, y)\, p_F(x, a, y)\, dx\, da\, dy \tag{2}$$

We also define the factual losses for the treatment ($a = 1$) and control ($a = 0$) groups respectively as:
$$\epsilon_F^{a=1}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 1, y)\, p_F(x, y \mid a = 1)\, dx\, dy \tag{3}$$
and
$$\epsilon_F^{a=0}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 0, y)\, p_F(x, y \mid a = 0)\, dx\, dy \tag{4}$$

Definition 3.3 (Counterfactual Loss). The counterfactual loss is defined as:
$$\epsilon_{CF}(\hat{f}) = \int_{\mathcal{X} \times \{0,1\} \times \mathcal{Y}} l_{\hat{f}}(x, a, y)\, p_{CF}(x, a, y)\, dx\, da\, dy \tag{5}$$

We also define the counterfactual losses for the treatment ($a = 1$) and control ($a = 0$) groups respectively as:
$$\epsilon_{CF}^{a=1}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 1, y)\, p_{CF}(x, y \mid a = 1)\, dx\, dy \tag{6}$$
and
$$\epsilon_{CF}^{a=0}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 0, y)\, p_{CF}(x, y \mid a = 0)\, dx\, dy \tag{7}$$

The counterfactual loss corresponds to the expected loss in a parallel universe where the roles of the control and treatment groups are exchanged.
Definition 3.4. The Expected Precision in Estimating Heterogeneous Treatment Effect (PEHE) is defined as:
$$\varepsilon_{PEHE}(\hat{f}) = \int_{\mathcal{X}} \big(\hat{\tau}(x) - \tau(x)\big)^2\, p_F(x)\, dx. \tag{8}$$
Here, $\varepsilon_{PEHE}$ [Hill, 2011] is often used as the performance metric for the estimation of ITEs [Shalit et al., 2017, Johansson et al., 2016]. A critical connection between the factual loss ($\epsilon_F$), the counterfactual loss ($\epsilon_{CF}$), and $\varepsilon_{PEHE}$ is that for small values of $\epsilon_F$ and $\epsilon_{CF}$, causal models have good performance (i.e., low $\varepsilon_{PEHE}$). However, $\varepsilon_{PEHE}$ is not directly accessible in causal inference scenarios because computing $\tau(x)$ (i.e., the ground-truth ITE values) requires access to the counterfactual values. In this light, we choose a hypothesis that instead optimizes an upper bound on $\varepsilon_{PEHE}$, given in Equation 10.
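Because the simulation sketched above retains both potential outcomes, the normally inaccessible counterfactual loss and $\varepsilon_{PEHE}$ can be evaluated directly there. The following minimal sketch (ours, under the same illustrative assumptions) computes the empirical versions of Equations 2, 5, and 8 for any hypothesis callable `f_hat(X, a)`:

```python
import numpy as np

def evaluate(f_hat, X, A, Y0, Y1, tau):
    """Empirical factual loss, counterfactual loss, and PEHE (simulation only)."""
    y_f  = np.where(A == 1, Y1, Y0)                   # observed outcomes
    y_cf = np.where(A == 1, Y0, Y1)                   # parallel-universe outcomes
    eps_f  = np.mean((y_f  - f_hat(X, A)) ** 2)       # empirical version of Eq. (2)
    eps_cf = np.mean((y_cf - f_hat(X, 1 - A)) ** 2)   # empirical version of Eq. (5)
    tau_hat = f_hat(X, np.ones_like(A)) - f_hat(X, np.zeros_like(A))
    pehe = np.mean((tau_hat - tau) ** 2)              # empirical version of Eq. (8)
    return eps_f, eps_cf, pehe
```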
3.2 REPRESENTATION LEARNING FOR ITE ESTIMATION

In this work, we consider the TARNet model [Shalit et al., 2017] for causal learning. TARNet was developed as a framework to estimate ITEs using counterfactual balancing. It consists of a pair of functions $(\Phi, h)$, where $\Phi: \mathbb{R}^d \to \mathbb{R}^l$ is a representation learning function and $h: \mathbb{R}^l \times \{0,1\} \to \mathbb{R}$ learns the two potential outcome functions in the representation space. The hypothesis for the true causal function is $\hat{f}(x, a) = h(\Phi(x), a)$, and its loss function $l_{\hat{f}}$ is denoted by $l_{(\Phi,h)}$. To ensure similarity between the features of the treatment group and those of the control group in the representation space, TARNet uses the Integral Probability Metric (IPM) to measure the distance between distributions, defined as:
$$\mathrm{IPM}_G(p, q) := \sup_{g \in G} \int_S g(s)\,(p(s) - q(s))\, ds \tag{9}$$
where the supremum is taken over a given class of functions $G$. It follows from the Kantorovich-Rubinstein duality [Villani, 2009] that $\mathrm{IPM}_G$ reduces to the 1-Wasserstein distance when $G$ is the set of 1-Lipschitz functions, as is the case in our numerical experiments. The TARNet model learns to estimate the potential outcomes by minimizing the following objective:
$$L(\Phi, h) = \frac{1}{N} \sum_{i=1}^{N} w_i \cdot l_{(\Phi,h)}(x_i, a_i, y_i) + \alpha \cdot \mathrm{IPM}_G\Big(\{\Phi(x_i)\}_{i: a_i = 0},\ \{\Phi(x_i)\}_{i: a_i = 1}\Big) \tag{10}$$
where $w_i = \frac{a_i}{2v} + \frac{1 - a_i}{2(1 - v)}$, $v = \frac{1}{N}\sum_{i=1}^{N} a_i$, and $\alpha$ is the balancing weight, which controls the trade-off between the similarity of the representations in the latent domain and the model's performance on the factual data.
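To make Equation 10 concrete, the following is a minimal PyTorch sketch (ours, not the authors' released code). For simplicity it uses a difference of latent mean embeddings as a crude stand-in for $\mathrm{IPM}_G$; the experiments described above use the 1-Wasserstein distance instead.

```python
import torch
import torch.nn as nn

class TARNet(nn.Module):
    """Shared representation Phi with separate outcome heads for a = 0 and a = 1."""
    def __init__(self, d, l=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, l), nn.ReLU(), nn.Linear(l, l), nn.ReLU())
        self.h0 = nn.Sequential(nn.Linear(l, l), nn.ReLU(), nn.Linear(l, 1))
        self.h1 = nn.Sequential(nn.Linear(l, l), nn.ReLU(), nn.Linear(l, 1))

    def forward(self, x, a):
        r = self.phi(x)                                   # Phi(x): latent representation
        y0 = self.h0(r).squeeze(-1)                       # h(Phi(x), 0)
        y1 = self.h1(r).squeeze(-1)                       # h(Phi(x), 1)
        return torch.where(a == 1, y1, y0), r

def tarnet_loss(model, x, a, y, alpha=1.0):
    """Weighted factual loss plus a balancing penalty, mirroring Eq. (10)."""
    y_hat, r = model(x, a)
    v = a.mean()                                          # proportion of treated units
    w = a / (2 * v) + (1 - a) / (2 * (1 - v))             # weights w_i from Eq. (10)
    factual = (w * (y - y_hat) ** 2).mean()
    # Difference of latent mean embeddings: a crude surrogate for IPM_G.
    ipm = torch.norm(r[a == 1].mean(0) - r[a == 0].mean(0))
    return factual + alpha * ipm
```

Here `a` is a float tensor of 0/1 treatment indicators; swapping the mean-embedding surrogate for a Wasserstein or MMD penalty recovers the balancing terms used by Shalit et al.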
4 THEORETICAL FRAMEWORK

In this section, we provide learning bounds on the counterfactual loss of the target task and on $\varepsilon_{PEHE}$ (i.e., the error in estimating the ITE). These bounds are inspired by the work of Ben-David et al. [2010] in the non-causal setting. We use superscripts $T$ and $S$ to denote quantities related to the target and source tasks, respectively. Let $\tau^T$ denote the individual treatment effect function of the target task. We consider the performance of a well-trained source model $\hat{f}^S: \mathcal{X} \times \{0,1\} \to \mathcal{Y}$ when applied to a target task:
$$\varepsilon^T_{PEHE}(\hat{f}^S) = \mathbb{E}_{x \sim p^T_F}\Big[\big(\tau^T(x) - [\hat{f}^S(x,1) - \hat{f}^S(x,0)]\big)^2\Big] \tag{11}$$
4.1 THE CHALLENGE OF ITE KNOWLEDGE TRANSFER

We first provide a lower bound on $\varepsilon_{PEHE}$ that involves both the factual and the counterfactual losses. This bound implies that good performance on the counterfactual data is a necessary condition for accurate estimation of the ITE.

Theorem 4.1. Let $\hat{f}^S$ be a model trained on a source task, and let $u = p^T_F(A = 1)$. Then
$$-\,\epsilon^T_F(\hat{f}^S) + \frac{u}{2}\,\epsilon^{T,a=0}_{CF}(\hat{f}^S) \;\le\; \varepsilon^T_{PEHE}(\hat{f}^S) \tag{12}$$

According to the bound in Theorem 4.1, simply minimizing the factual loss on the target may not guarantee good performance: the (immeasurable) counterfactual loss enters the lower bound positively, so a source model chosen for its low (or zero) factual loss on the target task cannot perform well if the counterfactual loss on the target is excessively high. In other words, the performance of the chosen source model can be arbitrarily inadequate even while it appears perfect on factual data.

While Theorem 4.1 implies that causal knowledge cannot be transferred without any assumptions, the learning bounds presented in the following section prove the viability of transferring causal knowledge under reasonable assumptions.
4.2 GENERAL LEARNING BOUNDS

The problem of ITE knowledge transfer can be expressed as two triples $(p^S_F, p^S_{CF}, f^S)$ and $(p^T_F, p^T_{CF}, f^T)$, where:

• $p^S_F$ and $p^T_F$ respectively denote the factual probability distributions of the source and target tasks.
• $p^S_{CF}$ and $p^T_{CF}$ respectively denote the counterfactual distributions of the source and target tasks.
• $f^S$ and $f^T$ respectively denote the underlying causal functions of the source and target tasks.

We use the $L_1$ distance to measure the similarity between probability distributions, defined as:
$$V(p, q) = \int_S |p(s) - q(s)|\, ds. \tag{13}$$
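For intuition, $V$ can be approximated numerically in one dimension. A small sketch (ours) with two Gaussians standing in for the source and target factual densities; note that $V$ ranges from 0 for identical distributions to 2 for distributions with disjoint supports:

```python
import numpy as np
from scipy.stats import norm

def l1_distance(p_pdf, q_pdf, lo=-10.0, hi=10.0, n=200_000):
    """Riemann-sum approximation of V(p, q) = integral of |p(s) - q(s)| ds."""
    s = np.linspace(lo, hi, n)
    ds = s[1] - s[0]
    return np.sum(np.abs(p_pdf(s) - q_pdf(s))) * ds

# Example: a unit mean shift between source and target factual densities.
p = lambda s: norm.pdf(s, loc=0.0)   # stand-in for p_F^S
q = lambda s: norm.pdf(s, loc=1.0)   # stand-in for p_F^T
print(l1_distance(p, q))             # ~0.766, i.e., 2 * (2 * Phi(1/2) - 1)
```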
Theorem 4.2. For any hypothesis $\hat{f}$, we have:
$$\epsilon^T_{CF}(\hat{f}) \le \epsilon^S_F(\hat{f}) + V(p^T_F, p^S_F) + V(p^T_F, p^T_{CF}) + \mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big] \tag{14}$$
and
$$\varepsilon^T_{PEHE}(\hat{f}) \le 4\,\epsilon^S_F(\hat{f}) + 4\,V(p^T_F, p^S_F) + 2\,V(p^T_F, p^T_{CF}) + 4\,\mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big] \tag{15}$$
We note that these learning bounds consist of (1) the source factual loss, (2) the difference between the causal functions, and (3) measures of similarity between probability distributions. However, the $L_1$ distance in Theorem 4.2 is intractable in practice. A more practical candidate is the IPM distance defined in Equation 9; the $L_1$ distance can be replaced with the IPM distance, as demonstrated by the following Theorem 4.3.
Theorem 4.3. Suppose that the function class $G$ is stable under addition and multiplication, and that $\hat{f}, f^T \in G$. Then
$$\epsilon^T_{CF}(\hat{f}) \le \epsilon^S_F(\hat{f}) + \mathrm{IPM}_G(p^T_F, p^S_F) + \mathrm{IPM}_G(p^T_F, p^T_{CF}) + \mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big]$$
and
$$\varepsilon^T_{PEHE}(\hat{f}) \le 4\,\epsilon^S_F(\hat{f}) + 4\,\mathrm{IPM}_G(p^T_F, p^S_F) + 2\,\mathrm{IPM}_G(p^T_F, p^T_{CF}) + 4\,\mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big]$$
4.3 BOUNDS FOR COUNTERFACTUAL BALANCING FRAMEWORKS

Suppose that we have a representation learning model (e.g., TARNet) $\hat{f}^S = (\Phi, h)$ trained on a source causal inference task, and that we apply this source model to a different target task. For notational simplicity, we denote $P(\Phi(X) \mid A = a)$ by $P(\Phi(X_a))$ for $a \in \{0, 1\}$. We make the following assumptions A1, A2, A3:

A1: $\Phi$ is injective (thus $\Psi = \Phi^{-1}$ exists on $\mathrm{Im}(\Phi)$).
A2: There exists a real function space $G$ on $\mathrm{Im}(\Phi)$ such that the function $r \mapsto l_{\Phi,h}(\Psi(r), a, y)$ belongs to $G$.
A3: There exists a function class $G$ on $\mathcal{Y}$ such that $y \mapsto l_{\Phi,h}(x, a, y)$ belongs to $G$.

Theorem 4.3 above guarantees that causal knowledge can be transferred under reasonable assumptions. The following lemma provides an upper bound on the counterfactual loss for transferring causal knowledge.
Lemma 4.4. Suppose that Assumptions A1, A2, A3 hold. Then the counterfactual loss of any model $(\Phi, h)$ on the target task satisfies:
$$\begin{aligned}
\epsilon^T_{CF}(\Phi, h) \le\ & \epsilon^{S,a=1}_F(\Phi, h) + \epsilon^{S,a=0}_F(\Phi, h) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_1)), P(\Phi(X^S_1))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^S_0))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^T_1))\big) + 2\gamma
\end{aligned}$$
where
$$\gamma = \mathbb{E}_{x \sim p^S_F}\Big[\mathrm{IPM}_G\big(P(Y^S_a \mid x), P(Y^T_a \mid x)\big)\Big] \tag{16}$$
measures the fundamental difference between the two causal inference tasks.
Theorem 4.5 (Transferability of Causal Knowledge). Suppose that Assumptions A1, A2, A3 hold. The performance of the source model on the target task, i.e., $\varepsilon^T_{PEHE}(\Phi, h)$, is upper bounded as follows:
$$\begin{aligned}
\varepsilon^T_{PEHE}(\Phi, h) \le 2\Big(& \epsilon^{S,a=1}_F(\Phi, h) + \epsilon^{S,a=0}_F(\Phi, h) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_1)), P(\Phi(X^S_1))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^S_0))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^T_1))\big) + 2\gamma\Big)
\end{aligned}$$
Theorem 4.5 implies that good performance on the target task is guaranteed if (1) the source model has a small factual loss (the first and second terms in the upper bound) and (2) the distributions of the control- and treatment-group features are similar in the latent domain (the last three terms in the upper bound). This upper bound provides a sufficient condition for transfer learning in causal inference scenarios, indicating the transferability of causal knowledge.
5 TASK-AWARE ITE KNOWLEDGE TRANSFER

In Section 4, the regret bounds indicate the transferability of causal knowledge between pairs of causal inference tasks. In this section, we propose a causal inference learning framework (illustrated in Figure 2) capable of identifying the most relevant causal knowledge, when multiple sources exist, for training the target task. Note that although the generalization bounds are informative for understanding the viability of transferring causal knowledge, they may not be the most constructive tool for selecting the best source task, because the ordering of the upper bounds on the errors is not necessarily the same as the ordering of the errors themselves. To this end, we first propose a task affinity (CITA) that satisfies the symmetry property of causal inference tasks (see Sec. 5.2) and use it to find the closest source task to the target task. We observe that CITA strongly correlates with the counterfactual loss. After obtaining the closest task using the computed task distances, its knowledge (e.g., trained model, bundled data) is utilized for training the target task; a sketch of this selection loop is given below.
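As a high-level illustration of this pipeline (our sketch, not the authors' released implementation), where `cita_distance` and `fine_tune` are hypothetical helpers standing in for the task-affinity computation of Section 5.1 and for warm-started training on the target task:

```python
def transfer_ite_knowledge(sources, target_data, cita_distance, fine_tune):
    """Select the closest source task under CITA, then transfer its knowledge.

    sources      -- list of (trained_model, dataset) pairs for prior tasks
    target_data  -- factual observations (X, A, Y) of the target task
    """
    # 1. Score each candidate source task against the target task.
    distances = [cita_distance(model, data, target_data) for model, data in sources]
    # 2. Pick the source task with the smallest CITA distance.
    idx = min(range(len(distances)), key=lambda i: distances[i])
    best_model, best_data = sources[idx]
    # 3. Reuse its knowledge (trained weights, bundled data) for the target task.
    return fine_tune(best_model, best_data, target_data)
```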
5.1 TASK AFFINITY SCORE

Let $(T, D)$ denote the pair of a causal inference task $T$ and its dataset $D = (X, A, Y)$, where $D$ consists of the covariates $X$, the corresponding treatment assignments $A$, and the factual outcomes $Y$. We formalize the notion of a sufficiently well-trained deep network representing a causal task-dataset pair $(T, D)$ in the Appendix (see Sec. F). Here, all the previous tasks' models are assumed to be sufficiently well trained to represent their corresponding tasks. Next, we recall the definitions of the Fisher Information matrix and the Task Affinity Score [Le et al., 2022b,a].
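Before those definitions, a rough sketch may help fix ideas. In Fisher-based task affinities of the kind used by Le et al., a task is represented by the diagonal of the empirical Fisher Information of a shared network on that task's data, and tasks are compared through a Fréchet-style distance between these diagonals. The sketch below is our illustration under those assumptions (`model`, `loss_fn`, and the batch format are hypothetical); it omits CITA's distinguishing ingredient, the minimization over treatment relabelings that gives the symmetry discussed in Sec. 5.2.

```python
import torch

def diagonal_fisher(model, loss_fn, batches):
    """Diagonal of the empirical Fisher Information of a trained network."""
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for x, a, y in batches:                      # batches: list of (x, a, y) tensors
        model.zero_grad()
        loss_fn(model, x, a, y).backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad.detach() ** 2            # squared gradients estimate the Fisher diagonal
    return torch.cat([f.flatten() for f in fisher]) / len(batches)

def fisher_task_distance(model, loss_fn, batches_src, batches_tgt):
    """Frechet-style distance between normalized Fisher diagonals of two tasks."""
    fs = diagonal_fisher(model, loss_fn, batches_src)
    ft = diagonal_fisher(model, loss_fn, batches_tgt)
    fs, ft = fs / fs.sum(), ft / ft.sum()        # normalize to comparable scales
    return torch.norm(fs.sqrt() - ft.sqrt()) / (2 ** 0.5)
```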