Transfer Learning for Individual Treatment Effect Estimation

Ahmed Aloui*1, Juncheng Dong*1, Cat P. Le1, Vahid Tarokh1
1Department of Electrical and Computer Engineering, Duke University
*Equal contribution.

arXiv:2210.00380v3 [cs.LG] 5 Jun 2023. Accepted for the 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023).
Abstract

This work considers the problem of transferring causal knowledge between tasks for Individual Treatment Effect (ITE) estimation. To this end, we theoretically assess the feasibility of transferring ITE knowledge and present a practical framework for efficient transfer. A lower bound is introduced on the ITE error of the target task to demonstrate that ITE knowledge transfer is challenging due to the absence of counterfactual information. Nevertheless, we establish generalization upper bounds on the counterfactual loss and ITE error of the target task, demonstrating the feasibility of ITE knowledge transfer. Subsequently, we introduce a framework with a new Causal Inference Task Affinity (CITA) measure for ITE knowledge transfer. Specifically, we use CITA to find the source task closest to the target task and utilize it for ITE knowledge transfer. Empirical studies demonstrating the efficacy of the proposed method are provided. We observe that ITE knowledge transfer can significantly (by up to 95%) reduce the amount of data required for ITE estimation.
1 INTRODUCTION

Assessing the effects of treatments on people, i.e., Individual Treatment Effect (ITE) estimation, is of significant interest to various research communities, such as those studying medicine and social policy making. In order to study the causal relationship between the outcome and the treatment, however, researchers must gather sufficient data samples from randomized controlled trials. This process can be both costly and time-consuming [Kaur and Gupta, 2020]. To this end, it is desirable to utilize knowledge from different but closely related problems with transfer learning. For instance, new vaccines must be developed when viruses mutate. If a mutated virus can be related to known ones by a similarity measure, then the effects of vaccine candidates can be quickly estimated based on this similarity using only a small amount of data collected from the new scenario. Hence, this approach can notably accelerate the study.
While the recent progress in transfer learning is very promising [Wang and Deng, 2018, Alyafeai et al., 2020, Pan and Yang, 2010, Zhuang et al., 2021], a major challenge for transferring causal knowledge arises from the non-causal (spurious) correlations to which statistical learning models are vulnerable. For example, a classifier may learn to use background colors to differentiate images of camels and horses, as these objects are frequently depicted against differently colored backgrounds [Arjovsky et al., 2019, Geirhos et al., 2018, Beery et al., 2018]. Moreover, in practice, the performance of ITE estimation models can never be directly evaluated because the counterfactual data is inaccessible, as shown in Figure 1. This problem is known in the literature as the fundamental problem of causal inference [Rubin, 1974, Holland, 1986]. For instance, to compute the effect of vaccination on a person at some given time, that individual would have to both receive the vaccine and remain unvaccinated, which is clearly impossible. This scenario is very different from conventional supervised learning problems, where researchers often use a separate validation set to estimate the accuracy of the trained model.
The aforementioned challenge implies that much attention must be paid to selecting the appropriate source model in causal knowledge transfer. Additionally, scenarios similar to the given target task must be identified using a distance that accounts for the immeasurable counterfactual losses of the scenarios under consideration. In this work, we first present a lower bound and a set of generalization bounds for transfer learning between causal inference tasks in order to demonstrate both the difficulty and the viability of causal knowledge transfer.

Figure 1: Inaccessibility to counterfactual data (e.g., a parallel universe where the treatments are reversed) makes transferring causal knowledge more challenging.

While these theoretical bounds are informative, a method is still needed for selecting the optimal source model from multiple source tasks. This is discussed in Section 5, where we introduce a framework endowed with a new task affinity, namely the Causal Inference Task Affinity (CITA), tailored explicitly for causal knowledge transfer. This task affinity is used for selecting the "closest" source task; subsequently, its knowledge (e.g., trained models, source dataset) is utilized in the learning of the target task, as depicted in Figure 2. Our contributions are summarized below:
1. We establish a new lower bound to demonstrate the challenges of transferring ITE knowledge. Additionally, we prove new regret bounds for learning the counterfactual outcomes and ITEs of the target tasks in causal transfer learning scenarios. These bounds demonstrate the feasibility of transferring ITE knowledge by showing that the error of any source model on the target task is upper bounded by quantifiable measures related to (i) the performance of the source model on the source task and (ii) the differences between the source and the target causal inference tasks.

2. We introduce CITA, a task affinity for causal inference, which captures the symmetry of ITEs (i.e., invariance to the relabeling of treatment assignments under the action of the symmetric group). Additionally, we provide theoretical (e.g., Theorem F.3) and empirical evidence showing that CITA is highly correlated with the counterfactual loss, which is not measurable in practice.

3. We propose an ITE estimation framework and a set of causal inference datasets suitable for studying causal knowledge transfer. The empirical evidence on these datasets demonstrates that our methods can estimate the ITEs of the target task with significantly fewer (up to 95% reduction) data samples compared to the case where transfer learning is not performed.
2 RELATED WORK

Many approaches to transfer learning [Thrun and Pratt, 2012, Blum and Mitchell, 1998, Silver and Bennett, 2008, Sharif Razavian et al., 2014, Finn et al., 2016, Fernando et al., 2017, Rusu et al., 2016, Le et al., 2020] have been proposed, analyzed, and applied in various machine learning applications. Transfer learning techniques inherently assume that prior knowledge in the selected source model helps with learning a target task [Pan and Yang, 2010, Zhuang et al., 2021]. In other words, these methods often do not consider the selection of the base task from which knowledge is transferred. Consequently, in some rare cases, transfer learning may even degrade the performance of the model [Standley et al., 2020]. In order to avoid potential performance loss during knowledge transfer to a target task, task affinity (or task similarity) is used as a selection method that identifies a group of the closest base candidates from the set of previously learned tasks. Task affinity has been investigated and applied in various domains, e.g., transfer learning [Zamir et al., 2018, Dwivedi and Roig, 2019, Wang et al., 2019], neural architecture search [Le et al., 2021, 2022a], few-shot learning [Pal and Balasubramanian, 2019, Le et al., 2022b], multi-task learning [Standley et al., 2020], and continual learning [Kirkpatrick et al., 2017, Chen et al., 2018].
While transfer learning and task affinity have been investigated in numerous application areas, their application to causal inference has yet to be thoroughly investigated. The Neyman-Rubin causal model [Neyman, 1923, Donald, 2005] and Pearl's do-calculus [Pearl, 2009] are popular frameworks for causal studies based on different perspectives. A central question in the Neyman-Rubin framework is determining conditions under which causal quantities such as the Average and Individual Treatment Effects are identifiable. Previous work considered estimators of the Average Treatment Effect based on various methods such as covariate adjustment [Rubin, 1978], weighting methods such as those utilizing propensity scores [Rosenbaum and Rubin, 1983], and doubly robust estimators [Funk et al., 2011]. With the emergence of machine learning techniques, more recent approaches to causal inference include the application of decision trees [Wager and Athey, 2018, Athey and Imbens, 2016], Gaussian processes [Alaa and Van Der Schaar, 2017], and generative modeling [Yoon et al., 2018] to ITE estimation. In particular, deep neural networks have successfully learned ITEs and estimated counterfactual outcomes by balancing the data in a latent domain [Johansson et al., 2016, Shalit et al., 2017]. Note that the transportability of causal relationships is another well-studied, closely related field in the causality literature [Bareinboim and Pearl, 2012]; it studies transferring knowledge of causal relationships in Pearl's do-calculus framework. In contrast, in this paper we are interested in transferring knowledge of ITEs from a source task to a target task in the Neyman-Rubin framework using representation learning. A closely related problem to ours is domain adaptation for ITE estimation, as explored in [Bica and van der Schaar, 2022, Vo et al., 2022, Aglietti et al., 2020]. These works primarily focus on situations where only the distribution of the population changes, leaving the causal functions unaltered. In our research, we provide theoretical analysis and empirical studies for the case where both the population distributions and the causal mechanisms can change.
3 MATHEMATICAL BACKGROUND

3.1 CAUSAL INFERENCE

Let $X \in \mathcal{X} \subseteq \mathbb{R}^d$ be the covariates (i.e., input features), $A \in \{0, \dots, M\}$ be the treatment, and $Y \in \mathcal{Y} \subseteq \mathbb{R}$ be the factual (observed) outcome. For every $j \in \{0, \dots, M\}$, we define $Y_j$ to be the potential outcome [Rubin, 1974] that would have been observed if the treatment $A = j$ had been assigned. In the medical context, for instance, $X$ is the individual's information (e.g., weight, heart rate), $A$ is the treatment assignment (e.g., $A = 0$ if the individual did not receive a vaccine and $A = 1$ if the individual is vaccinated), and $Y$ is the outcome (e.g., mortality). A causal inference dataset is a collection of factual observations $D_F = \{(x_i, a_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of samples. We assume these samples are independently drawn from the same factual distribution $p_F$. In a parallel universe where the roles of the treatment and control groups were reversed, we would have observed a different set of samples $D_{CF}$ drawn from the counterfactual distribution $p_{CF}$. In this work, we present our results for the binary case, i.e., $M = 1$; however, our approach extends readily to any finite $M$. In the binary case, the individuals who received treatments $A = 0$ and $A = 1$ constitute the control and treatment groups, respectively.
Definition 3.1 (ITE). The Individual Treatment Effect (ITE), also referred to as the Conditional Average Treatment Effect (CATE) [Imbens and Rubin, 2015], is defined as:
$$\forall x \in \mathcal{X}, \quad \tau(x) = \mathbb{E}[Y_1 - Y_0 \mid X = x] \tag{1}$$
We assume that the data generation process satisfies overlap, i.e., $\forall x \in \mathcal{X},\ 0 < p(a = 1 \mid x) < 1$, and conditional unconfoundedness, i.e., $(Y_1, Y_0) \perp\!\!\!\perp A \mid X$ [Robins, 1986]. These assumptions are sufficient conditions for the ITE to be identifiable [Imbens, 2004]. We also assume that the true causal relationship is described by a function $f(x, a)$, which can be expressed as an expected value in the non-deterministic case. By definition, $\tau(x) = f(x, 1) - f(x, 0)$. Let $\hat{f}(x, a)$ denote a hypothesis that estimates the true function $f(x, a)$; the ITE can then be estimated as $\hat{\tau}(x) = \hat{f}(x, 1) - \hat{f}(x, 0)$. We use $l_{\hat{f}}(x, a, y)$ to denote a loss function that quantifies the performance of $\hat{f}(\cdot, \cdot)$. A possible example is the $L_2$ loss, defined as $l_{\hat{f}}(x, a, y) = (y - \hat{f}(x, a))^2$.
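To fix notation, the following is a minimal simulation sketch (ours, purely illustrative, not from the paper): a binary-treatment dataset in which the propensity is strictly inside $(0, 1)$ (overlap) and the treatment depends on $X$ alone (unconfoundedness). The outcome functions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=1000, d=5):
    """Simulate a binary-treatment dataset with known potential outcomes."""
    X = rng.normal(size=(n, d))                   # covariates
    e = 1.0 / (1.0 + np.exp(-X[:, 0]))            # propensity p(a=1|x), strictly in (0, 1): overlap
    A = rng.binomial(1, e)                        # treatment depends on X only: unconfoundedness
    beta = rng.normal(size=d)
    f0 = X @ beta                                 # illustrative control-outcome function f(x, 0)
    f1 = f0 + 2.0 + np.sin(X[:, 1])               # illustrative treated-outcome function f(x, 1)
    Y0 = f0 + rng.normal(scale=0.1, size=n)       # potential outcome under control
    Y1 = f1 + rng.normal(scale=0.1, size=n)       # potential outcome under treatment
    Y = np.where(A == 1, Y1, Y0)                  # factual outcome: one potential outcome per unit
    tau = f1 - f0                                 # ground-truth ITE (available only in simulation)
    return X, A, Y, Y0, Y1, tau
```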
Definition 3.2 (Factual Loss). For a hypothesis $\hat{f}$ and a loss function $l_{\hat{f}}$, the factual loss is defined as:
$$\epsilon_F(\hat{f}) = \int_{\mathcal{X} \times \{0,1\} \times \mathcal{Y}} l_{\hat{f}}(x, a, y)\, p_F(x, a, y)\, dx\, da\, dy \tag{2}$$

We also define the factual losses for the treatment ($a = 1$) and control ($a = 0$) groups respectively as:
$$\epsilon_F^{a=1}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 1, y)\, p_F(x, y \mid a = 1)\, dx\, dy \tag{3}$$
and
$$\epsilon_F^{a=0}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 0, y)\, p_F(x, y \mid a = 0)\, dx\, dy \tag{4}$$

Definition 3.3 (Counterfactual Loss). The counterfactual loss is defined as:
$$\epsilon_{CF}(\hat{f}) = \int_{\mathcal{X} \times \{0,1\} \times \mathcal{Y}} l_{\hat{f}}(x, a, y)\, p_{CF}(x, a, y)\, dx\, da\, dy \tag{5}$$

We also define the counterfactual losses for the treatment ($a = 1$) and control ($a = 0$) groups respectively as:
$$\epsilon_{CF}^{a=1}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 1, y)\, p_{CF}(x, y \mid a = 1)\, dx\, dy \tag{6}$$
and
$$\epsilon_{CF}^{a=0}(\hat{f}) = \int_{\mathcal{X} \times \mathcal{Y}} l_{\hat{f}}(x, 0, y)\, p_{CF}(x, y \mid a = 0)\, dx\, dy \tag{7}$$

The counterfactual loss corresponds to the expected loss in a parallel universe where the roles of the control and treatment groups are exchanged.
Definition 3.4. The Expected Precision in Estimating Heterogeneous Treatment Effect (PEHE) is defined as:
$$\varepsilon_{PEHE}(\hat{f}) = \int_{\mathcal{X}} \big(\hat{\tau}(x) - \tau(x)\big)^2\, p_F(x)\, dx. \tag{8}$$
Here, $\varepsilon_{PEHE}$ [Hill, 2011] is often used as the performance metric for the estimation of ITEs [Shalit et al., 2017, Johansson et al., 2016]. A critical connection between the factual loss ($\epsilon_F$), the counterfactual loss ($\epsilon_{CF}$), and $\varepsilon_{PEHE}$ is that for small values of $\epsilon_F$ and $\epsilon_{CF}$, causal models have good performance (i.e., low $\varepsilon_{PEHE}$). However, $\varepsilon_{PEHE}$ is not directly accessible in causal inference scenarios because computing $\tau(x)$ (i.e., the ground-truth ITE values) requires access to the counterfactual values. In this light, we choose a hypothesis that instead optimizes an upper bound on $\varepsilon_{PEHE}$, given in Equation 10.
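Because the simulation sketched above retains both potential outcomes, the normally inaccessible counterfactual loss and $\varepsilon_{PEHE}$ can be evaluated directly there. The following minimal sketch (ours, under the same illustrative assumptions) computes the empirical versions of Equations 2, 5, and 8 for any hypothesis callable `f_hat(X, a)`:

```python
import numpy as np

def evaluate(f_hat, X, A, Y0, Y1, tau):
    """Empirical factual loss, counterfactual loss, and PEHE (simulation only)."""
    y_f  = np.where(A == 1, Y1, Y0)                   # observed outcomes
    y_cf = np.where(A == 1, Y0, Y1)                   # parallel-universe outcomes
    eps_f  = np.mean((y_f  - f_hat(X, A)) ** 2)       # empirical version of Eq. (2)
    eps_cf = np.mean((y_cf - f_hat(X, 1 - A)) ** 2)   # empirical version of Eq. (5)
    tau_hat = f_hat(X, np.ones_like(A)) - f_hat(X, np.zeros_like(A))
    pehe = np.mean((tau_hat - tau) ** 2)              # empirical version of Eq. (8)
    return eps_f, eps_cf, pehe
```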
3.2 REPRESENTATION LEARNING FOR ITE ESTIMATION

In this work, we consider the TARNet model [Shalit et al., 2017] for causal learning. TARNet was developed as a framework to estimate ITEs using counterfactual balancing. It consists of a pair of functions $(\Phi, h)$, where $\Phi: \mathbb{R}^d \to \mathbb{R}^l$ is a representation learning function and $h: \mathbb{R}^l \times \{0,1\} \to \mathbb{R}$ learns the two potential outcome functions in the representation space. The hypothesis for the true causal function is $\hat{f}(x, a) = h(\Phi(x), a)$, and its loss function $l_{\hat{f}}$ is denoted by $l_{(\Phi,h)}$. To ensure similarity between the features of the treatment group and those of the control group in the representation space, TARNet uses the Integral Probability Metric (IPM) to measure the distance between distributions, defined as:
$$\mathrm{IPM}_G(p, q) := \sup_{g \in G} \int_S g(s)\,(p(s) - q(s))\, ds \tag{9}$$
where the supremum is taken over a given class of functions $G$. It follows from the Kantorovich-Rubinstein duality [Villani, 2009] that $\mathrm{IPM}_G$ reduces to the 1-Wasserstein distance when $G$ is the set of 1-Lipschitz functions, as is the case in our numerical experiments. The TARNet model learns to estimate the potential outcomes by minimizing the following objective:
$$L(\Phi, h) = \frac{1}{N} \sum_{i=1}^{N} w_i \cdot l_{(\Phi,h)}(x_i, a_i, y_i) + \alpha \cdot \mathrm{IPM}_G\Big(\{\Phi(x_i)\}_{i: a_i = 0},\ \{\Phi(x_i)\}_{i: a_i = 1}\Big) \tag{10}$$
where $w_i = \frac{a_i}{2v} + \frac{1 - a_i}{2(1 - v)}$, $v = \frac{1}{N}\sum_{i=1}^{N} a_i$, and $\alpha$ is the balancing weight, which controls the trade-off between the similarity of the representations in the latent domain and the model's performance on the factual data.
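To make Equation 10 concrete, the following is a minimal PyTorch sketch (ours, not the authors' released code). For simplicity it uses a difference of latent mean embeddings as a crude stand-in for $\mathrm{IPM}_G$; the experiments described above use the 1-Wasserstein distance instead.

```python
import torch
import torch.nn as nn

class TARNet(nn.Module):
    """Shared representation Phi with separate outcome heads for a = 0 and a = 1."""
    def __init__(self, d, l=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d, l), nn.ReLU(), nn.Linear(l, l), nn.ReLU())
        self.h0 = nn.Sequential(nn.Linear(l, l), nn.ReLU(), nn.Linear(l, 1))
        self.h1 = nn.Sequential(nn.Linear(l, l), nn.ReLU(), nn.Linear(l, 1))

    def forward(self, x, a):
        r = self.phi(x)                                   # Phi(x): latent representation
        y0 = self.h0(r).squeeze(-1)                       # h(Phi(x), 0)
        y1 = self.h1(r).squeeze(-1)                       # h(Phi(x), 1)
        return torch.where(a == 1, y1, y0), r

def tarnet_loss(model, x, a, y, alpha=1.0):
    """Weighted factual loss plus a balancing penalty, mirroring Eq. (10)."""
    y_hat, r = model(x, a)
    v = a.mean()                                          # proportion of treated units
    w = a / (2 * v) + (1 - a) / (2 * (1 - v))             # weights w_i from Eq. (10)
    factual = (w * (y - y_hat) ** 2).mean()
    # Difference of latent mean embeddings: a crude surrogate for IPM_G.
    ipm = torch.norm(r[a == 1].mean(0) - r[a == 0].mean(0))
    return factual + alpha * ipm
```

Here `a` is a float tensor of 0/1 treatment indicators; swapping the mean-embedding surrogate for a Wasserstein or MMD penalty recovers the balancing terms used by Shalit et al.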
4 THEORETICAL FRAMEWORK

In this section, we provide learning bounds on the counterfactual loss of the target task and on $\varepsilon_{PEHE}$ (i.e., the error in estimating the ITE). These bounds are inspired by the work of Ben-David et al. [2010] in the non-causal setting. We use superscripts $T$ and $S$ to denote quantities related to the target and source tasks, respectively. Let $\tau^T$ denote the individual treatment effect function of the target task. We consider the performance of a well-trained source model $\hat{f}^S: \mathcal{X} \times \{0,1\} \to \mathcal{Y}$ when applied to a target task:
$$\varepsilon^T_{PEHE}(\hat{f}^S) = \mathbb{E}_{x \sim p^T_F}\Big[\big(\tau^T(x) - [\hat{f}^S(x,1) - \hat{f}^S(x,0)]\big)^2\Big] \tag{11}$$
4.1 THE CHALLENGE OF ITE KNOWLEDGE TRANSFER

We first provide a lower bound on $\varepsilon_{PEHE}$ that involves both the factual and the counterfactual losses. This bound implies that good performance on the counterfactual data is a necessary condition for accurate estimation of the ITE.

Theorem 4.1. Let $\hat{f}^S$ be a model trained on a source task, and let $u = p^T_F(A = 1)$. Then
$$-\,\epsilon^T_F(\hat{f}^S) + \frac{u}{2}\,\epsilon^{T,a=0}_{CF}(\hat{f}^S) \;\le\; \varepsilon^T_{PEHE}(\hat{f}^S) \tag{12}$$

According to the bound in Theorem 4.1, simply minimizing the factual loss on the target may not guarantee good performance: the (immeasurable) counterfactual loss enters the lower bound positively, so a source model chosen for its low (or zero) factual loss on the target task cannot perform well if the counterfactual loss on the target is excessively high. In other words, the performance of the chosen source model can be arbitrarily inadequate even while it appears perfect on factual data.

While Theorem 4.1 implies that causal knowledge cannot be transferred without any assumptions, the learning bounds presented in the following section prove the viability of transferring causal knowledge under reasonable assumptions.
4.2 GENERAL LEARNING BOUNDS

The problem of ITE knowledge transfer can be expressed as two triples $(p^S_F, p^S_{CF}, f^S)$ and $(p^T_F, p^T_{CF}, f^T)$, where:

• $p^S_F$ and $p^T_F$ respectively denote the factual probability distributions of the source and target tasks.
• $p^S_{CF}$ and $p^T_{CF}$ respectively denote the counterfactual distributions of the source and target tasks.
• $f^S$ and $f^T$ respectively denote the underlying causal functions of the source and target tasks.

We use the $L_1$ distance to measure the similarity between probability distributions, defined as:
$$V(p, q) = \int_S |p(s) - q(s)|\, ds. \tag{13}$$
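For intuition, $V$ can be approximated numerically in one dimension. A small sketch (ours) with two Gaussians standing in for the source and target factual densities; note that $V$ ranges from 0 for identical distributions to 2 for distributions with disjoint supports:

```python
import numpy as np
from scipy.stats import norm

def l1_distance(p_pdf, q_pdf, lo=-10.0, hi=10.0, n=200_000):
    """Riemann-sum approximation of V(p, q) = integral of |p(s) - q(s)| ds."""
    s = np.linspace(lo, hi, n)
    ds = s[1] - s[0]
    return np.sum(np.abs(p_pdf(s) - q_pdf(s))) * ds

# Example: a unit mean shift between source and target factual densities.
p = lambda s: norm.pdf(s, loc=0.0)   # stand-in for p_F^S
q = lambda s: norm.pdf(s, loc=1.0)   # stand-in for p_F^T
print(l1_distance(p, q))             # ~0.766, i.e., 2 * (2 * Phi(1/2) - 1)
```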
Theorem 4.2. For any hypothesis $\hat{f}$, we have:
$$\epsilon^T_{CF}(\hat{f}) \le \epsilon^S_F(\hat{f}) + V(p^T_F, p^S_F) + V(p^T_F, p^T_{CF}) + \mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big] \tag{14}$$
and
$$\varepsilon^T_{PEHE}(\hat{f}) \le 4\,\epsilon^S_F(\hat{f}) + 4\,V(p^T_F, p^S_F) + 2\,V(p^T_F, p^T_{CF}) + 4\,\mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big] \tag{15}$$
We note that these learning bounds consist of (1) the source factual loss, (2) the difference between the causal functions, and (3) measures of similarity between probability distributions. However, the $L_1$ distance in Theorem 4.2 is intractable in practice. A more practical candidate is the IPM distance defined in Equation 9; the $L_1$ distance can be replaced with the IPM distance, as demonstrated by the following Theorem 4.3.
Theorem 4.3. Suppose that the function class $G$ is stable under addition and multiplication, and that $\hat{f}, f^T \in G$. Then
$$\epsilon^T_{CF}(\hat{f}) \le \epsilon^S_F(\hat{f}) + \mathrm{IPM}_G(p^T_F, p^S_F) + \mathrm{IPM}_G(p^T_F, p^T_{CF}) + \mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big]$$
and
$$\varepsilon^T_{PEHE}(\hat{f}) \le 4\,\epsilon^S_F(\hat{f}) + 4\,\mathrm{IPM}_G(p^T_F, p^S_F) + 2\,\mathrm{IPM}_G(p^T_F, p^T_{CF}) + 4\,\mathbb{E}_{(x,a)\sim p^S_F}\big[\,|f^S(x,a) - f^T(x,a)|\,\big]$$
4.3 BOUNDS FOR COUNTERFACTUAL BALANCING FRAMEWORKS

Suppose that we have a representation learning model (e.g., TARNet) $\hat{f}^S = (\Phi, h)$ trained on a source causal inference task, and that we apply this source model to a different target task. For notational simplicity, we denote $P(\Phi(X) \mid A = a)$ by $P(\Phi(X_a))$ for $a \in \{0, 1\}$. We make the following assumptions A1, A2, A3:

A1: $\Phi$ is injective (thus $\Psi = \Phi^{-1}$ exists on $\mathrm{Im}(\Phi)$).
A2: There exists a real function space $G$ on $\mathrm{Im}(\Phi)$ such that the function $r \mapsto l_{\Phi,h}(\Psi(r), a, y)$ belongs to $G$.
A3: There exists a function class $G$ on $\mathcal{Y}$ such that $y \mapsto l_{\Phi,h}(x, a, y)$ belongs to $G$.

Theorem 4.3 above guarantees that causal knowledge can be transferred under reasonable assumptions. The following lemma provides an upper bound on the counterfactual loss for transferring causal knowledge.
Lemma 4.4. Suppose that Assumptions A1, A2, A3 hold. Then the counterfactual loss of any model $(\Phi, h)$ on the target task satisfies:
$$\begin{aligned}
\epsilon^T_{CF}(\Phi, h) \le\ & \epsilon^{S,a=1}_F(\Phi, h) + \epsilon^{S,a=0}_F(\Phi, h) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_1)), P(\Phi(X^S_1))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^S_0))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^T_1))\big) + 2\gamma
\end{aligned}$$
where
$$\gamma = \mathbb{E}_{x \sim p^S_F}\Big[\mathrm{IPM}_G\big(P(Y^S_a \mid x), P(Y^T_a \mid x)\big)\Big] \tag{16}$$
measures the fundamental difference between the two causal inference tasks.
Theorem 4.5 (Transferability of Causal Knowledge). Suppose that Assumptions A1, A2, A3 hold. The performance of the source model on the target task, i.e., $\varepsilon^T_{PEHE}(\Phi, h)$, is upper bounded as follows:
$$\begin{aligned}
\varepsilon^T_{PEHE}(\Phi, h) \le 2\Big(& \epsilon^{S,a=1}_F(\Phi, h) + \epsilon^{S,a=0}_F(\Phi, h) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_1)), P(\Phi(X^S_1))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^S_0))\big) \\
& + \mathrm{IPM}_G\big(P(\Phi(X^T_0)), P(\Phi(X^T_1))\big) + 2\gamma\Big)
\end{aligned}$$
Theorem 4.5 implies that good performance on the target task is guaranteed if (1) the source model has a small factual loss (the first and second terms in the upper bound) and (2) the distributions of the control- and treatment-group features are similar in the latent domain (the last three terms in the upper bound). This upper bound provides a sufficient condition for transfer learning in causal inference scenarios, indicating the transferability of causal knowledge.
5 TASK-AWARE ITE KNOWLEDGE TRANSFER

In Section 4, the regret bounds indicate the transferability of causal knowledge between pairs of causal inference tasks. In this section, we propose a causal inference learning framework (illustrated in Figure 2) capable of identifying the most relevant causal knowledge, when multiple sources exist, for training the target task. Note that although the generalization bounds are informative for understanding the viability of transferring causal knowledge, they may not be the most constructive tool for selecting the best source task, because the ordering of the upper bounds on the errors is not necessarily the same as the ordering of the errors themselves. To this end, we first propose a task affinity (CITA) that satisfies the symmetry property of causal inference tasks (see Sec. 5.2) and use it to find the closest source task to the target task. We observe that CITA strongly correlates with the counterfactual loss. After obtaining the closest task using the computed task distances, its knowledge (e.g., trained model, bundled data) is utilized for training the target task; a sketch of this selection loop is given below.
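As a high-level illustration of this pipeline (our sketch, not the authors' released implementation), where `cita_distance` and `fine_tune` are hypothetical helpers standing in for the task-affinity computation of Section 5.1 and for warm-started training on the target task:

```python
def transfer_ite_knowledge(sources, target_data, cita_distance, fine_tune):
    """Select the closest source task under CITA, then transfer its knowledge.

    sources      -- list of (trained_model, dataset) pairs for prior tasks
    target_data  -- factual observations (X, A, Y) of the target task
    """
    # 1. Score each candidate source task against the target task.
    distances = [cita_distance(model, data, target_data) for model, data in sources]
    # 2. Pick the source task with the smallest CITA distance.
    idx = min(range(len(distances)), key=lambda i: distances[i])
    best_model, best_data = sources[idx]
    # 3. Reuse its knowledge (trained weights, bundled data) for the target task.
    return fine_tune(best_model, best_data, target_data)
```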
5.1 TASK AFFINITY SCORE

Let $(T, D)$ denote the pair of a causal inference task $T$ and its dataset $D = (X, A, Y)$, where $D$ consists of the covariates $X$, the corresponding treatment assignments $A$, and the factual outcomes $Y$. We formalize the notion of a sufficiently well-trained deep network representing a causal task-dataset pair $(T, D)$ in the Appendix (see Sec. F). Here, all the previous tasks' models are assumed to be sufficiently well trained to represent their corresponding tasks. Next, we recall the definitions of the Fisher Information matrix and the Task Affinity Score [Le et al., 2022b,a].
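Before those definitions, a rough sketch may help fix ideas. In Fisher-based task affinities of the kind used by Le et al., a task is represented by the diagonal of the empirical Fisher Information of a shared network on that task's data, and tasks are compared through a Fréchet-style distance between these diagonals. The sketch below is our illustration under those assumptions (`model`, `loss_fn`, and the batch format are hypothetical); it omits CITA's distinguishing ingredient, the minimization over treatment relabelings that gives the symmetry discussed in Sec. 5.2.

```python
import torch

def diagonal_fisher(model, loss_fn, batches):
    """Diagonal of the empirical Fisher Information of a trained network."""
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for x, a, y in batches:                      # batches: list of (x, a, y) tensors
        model.zero_grad()
        loss_fn(model, x, a, y).backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad.detach() ** 2            # squared gradients estimate the Fisher diagonal
    return torch.cat([f.flatten() for f in fisher]) / len(batches)

def fisher_task_distance(model, loss_fn, batches_src, batches_tgt):
    """Frechet-style distance between normalized Fisher diagonals of two tasks."""
    fs = diagonal_fisher(model, loss_fn, batches_src)
    ft = diagonal_fisher(model, loss_fn, batches_tgt)
    fs, ft = fs / fs.sum(), ft / ft.sum()        # normalize to comparable scales
    return torch.norm(fs.sqrt() - ft.sqrt()) / (2 ** 0.5)
```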