Transfer Learning on Heterogeneous Feature Spaces
for Treatment Effects Estimation
Ioana Bica
University of Oxford, Oxford, UK
The Alan Turing Institute, London, UK
ioana.bica@eng.ox.ac.uk
Mihaela van der Schaar
University of Cambridge, Cambridge, UK
University of California, Los Angeles, USA
The Alan Turing Institute, London, UK
mv472@cam.ac.uk
Abstract
Consider the problem of improving the estimation of conditional average treatment
effects (CATE) for a target domain of interest by leveraging related information
from a source domain with a different feature space. This heterogeneous transfer
learning problem for CATE estimation is ubiquitous in areas such as healthcare
where we may wish to evaluate the effectiveness of a treatment for a new patient
population for which different clinical covariates and limited data are available.
In this paper, we address this problem by introducing several building blocks that
use representation learning to handle the heterogeneous feature spaces and a flexible
multi-task architecture with shared and private layers to transfer information
between potential outcome functions across domains. Then, we show how these
building blocks can be used to recover transfer learning equivalents of the standard
CATE learners. On a new semi-synthetic data simulation benchmark for heterogeneous
transfer learning, we not only demonstrate performance improvements of
our heterogeneous transfer causal effect learners across datasets, but also provide
insights into the differences between these learners from a transfer perspective.
1 Introduction
Estimating the personalized effects of interventions from observational data is a fundamental problem
in causal inference that is crucial for decision-making in many domains: in healthcare, for determining
which treatments to give to patients [1], in education, for deciding which school curriculum is best for
each student [2, 3], or in public policy, for choosing who would benefit from job training programs [4].
Recently, a large number of machine learning methods have been proposed for estimating conditional
average treatment effects (CATE) which enable such personalized policies [5–16].
Nevertheless, the good performance of these methods on a population of interest relies heavily
on the availability of large enough observational datasets for training [17, 18]. In healthcare, for
instance, this can be challenging when hospitals with few patients cannot collect enough data (e.g.
a large proportion of hospitals in the USA have fewer than 100 beds, which for rare diseases can
result in fewer than 80 training examples per year [19]). Moreover, in situations such as the COVID
pandemic, each hospital will initially have a very limited amount of data from which to learn the
effectiveness of interventions [20]. Compared to the predictive setting, this problem is exacerbated in the
treatment effects setting, where we need to observe both patients who are treated and not treated to be
able to reliably train a model for CATE estimation and obtain personalized treatment recommendations
for the intended patient population. While data from large national registries can be used to build
global models for general use across hospitals, such models do not take into account the particularities
of different patient populations (e.g. different conditional outcome distributions) and consequently
can perform poorly during deployment [19, 21]. Moreover, various hospitals often record different
(but overlapping) sets of patient covariates [19, 22], which makes this transfer learning problem [23]
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06183v1 [cs.LG] 8 Oct 2022
even more challenging as we need to account for the heterogeneous feature spaces. Therefore, it
becomes crucial to build methods for heterogeneous transfer learning that can leverage information
from large source datasets with potentially different feature spaces to improve CATE estimation on
the target datasets of interest.
We consider the Neyman-Rubin potential outcomes (PO) framework [24, 25], where for each individual
we can define two potential outcomes, one with and one without the treatment. Out of these,
we can only observe the factual outcome; the counterfactual outcome is never observed. Under the
identifiability conditions of overlap and ignorability, observational data can be used to estimate the
PO conditional on the patient’s characteristics, which can then be used to obtain CATE. In this paper,
we address the problem of heterogeneous transfer for CATE estimation, where we aim to leverage
related data from a source domain with a different feature space (e.g. data from another hospital) to
improve CATE estimation on a target domain from which we only have few training examples.
Due to the fundamental problem of causal inference of not being able to observe both PO for a patient
[26], heterogeneous transfer learning in the context of CATE estimation becomes significantly more
challenging than for supervised learning. In addition to the feature mismatch between domains, the
PO may also have different conditional distributions, as covariate relationships and their impact
on patients’ response to treatments cannot be expected to stay constant across hospitals/locations
[19]. Moreover, as clinicians may use different criteria for assigning treatments for various patient
populations, this selection bias may create discrepancies in the covariate shift induced by the treated
and control populations in each domain. Consequently, we need to build an approach that can both
handle the heterogeneous feature spaces, and also model the similarities and differences between
both PO functions and treatment assignment mechanisms across the source and target domains.
For the binary treatment setting in a single patient population, a large number of different approaches
for CATE estimation have been proposed where the main design choices involved modelling the PO
functions and handling the selection bias present in observational datasets [10, 14–16, 27–29]. We
discuss these in more detail in Section 2. However, note that each CATE learner has its own
advantages and disadvantages in terms of the inductive biases they use for modelling the PO functions
and the covariate shift induced by the selection bias, and thus, different learners will achieve better
performance in various scenarios [15, 30]. Therefore, we propose a flexible approach for transfer
learning that (1) preserves the characteristics of each learner in a single domain, (2) handles
heterogeneous feature spaces, and (3) shares information between PO functions across domains.
Firstly, we introduce several building blocks that can be used to adapt the most common CATE
learners [10, 15] to transfer information from a source to a target domain. These building blocks
involve handling the heterogeneous feature spaces, sharing information between PO functions across
domains and sharing information between PO functions within a single domain. Secondly, we show
how these building blocks can be used to build heterogeneous transfer causal effect (HTCE-) learner
equivalents of the most common and popular CATE learners based on neural networks [10, 15].
Contributions.
Our contributions are three-fold: (i) we define the problem of heterogeneous transfer
learning in the context of CATE estimation and propose several building blocks that can be used to
construct models to address this problem, (ii) we use these building blocks to construct HTCE-learner
equivalents of the most common CATE learners, and (iii) we propose a new semi-synthetic data
simulation and guidelines for evaluating CATE methods for heterogeneous transfer and perform
extensive experiments that not only show that our HTCE-learners achieve improved performance, but
also provide new insights into the differences between these learners from a transfer perspective.
2 Related works
We tackle the problem of heterogeneous transfer learning in the context of CATE estimation. Thus,
our work sits at the intersection of research on (1) causal inference methods for CATE estimation,
(2) leveraging multiple datasets for CATE estimation, and (3) multi-task/transfer learning and domain
adaptation. Refer to Appendix A for further discussion of related works.
CATE learners.
The estimation of CATE has received a lot of attention in the causal inference
literature and several methods have been proposed to estimate the effects of binary treatments. Out of
these, we consider the most popular approaches that involve using model agnostic learning strategies,
also known as meta-learners, for CATE estimation [15, 31] or neural network-based models that build
shared representations between the PO functions followed by outcome-specific layers [10, 15, 27, 28].
The CATE meta-learners can be split into (a) one-step plug-in learners (indirect meta-learners) that
estimate the PO from the observational data and then set CATE as the difference in the PO [31],
and (b) two-step learners (direct meta-learners) that estimate the PO and/or the propensity score
in the first step, on the basis of which they build a pseudo-outcome and obtain CATE directly by
regressing the pseudo-outcome on the input covariates in the second step [15, 31–33]. Refer to [15]
for a more thorough classification of the different meta-learners. Alternatively, several methods based
on representation learning with neural networks and multi-task learning have been proposed that
involve having shared layers between the PO functions followed by outcome-specific layers. The
most standard architecture for this is TARNet [10], which has been extended to allow for different
types of information sharing between the PO and propensity score in [14, 15, 27]. To account for the
confounding bias present in observational datasets, several approaches have been proposed to extend
this model architecture by building balanced representations (treatment-invariant representations)
[10, 28] and/or incorporating propensity weighting to obtain unbiased estimates of the PO [29, 34].
These different approaches have their own benefits and drawbacks, which is why it is important to
build a heterogeneous transfer learning approach that is general enough to extend all of them.
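As a concrete illustration of the indirect (plug-in) strategy described above, the sketch below fits separate outcome models on the control and treated groups and takes their difference as the CATE estimate. This is a minimal single-domain example of our own; the function names are hypothetical, and the linear least-squares fit is only a stand-in for whatever regressor a real learner would use.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares regression returning a prediction function
    (a toy stand-in for an arbitrary outcome regressor)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.hstack([Xq, np.ones((Xq.shape[0], 1))]) @ beta

def plug_in_cate(X, w, y, Xq):
    """Indirect meta-learner: estimate mu_0 on controls and mu_1 on treated,
    then set CATE(x) = mu_1(x) - mu_0(x)."""
    mu0 = fit_linear(X[w == 0], y[w == 0])
    mu1 = fit_linear(X[w == 1], y[w == 1])
    return mu1(Xq) - mu0(Xq)
```

Direct meta-learners differ in that, after estimating the nuisance functions, they regress a constructed pseudo-outcome on the covariates to model the CATE function itself.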
Transfer and domain adaptation for CATE estimation.
While the problem of transfer learning for CATE estimation has also been addressed by [35], the
proposed approach considers shared feature spaces and consists of a two-stage training procedure
that involves warm-starting on the first domain and fine-tuning on the second domain. Alternatively,
[36] proposes a CATE estimation method that can generalize to distribution shifts in the patient
population in the unsupervised domain adaptation setting. However, they do not assume access to
label information in the target domain and only consider a shared feature space between the two
domains. In addition, [37] leverages data from multiple different environments, with shared feature
spaces, to learn an invariant representation that removes the 'bad controls' which induce bias in
CATE estimation. Then, they use this invariant representation to learn shared PO functions across
the different environments. Refer to Appendix A for more methods that use multiple datasets for
causal inference, although for different purposes than ours.
Multi-task/transfer learning and domain adaptation.
Methods to address these problems have been extensively studied in the predictive (supervised)
setting. We describe here the works most related to ours that consider (a) a shared feature space and
(b) heterogeneous feature spaces. For shared feature spaces, methods in domain adaptation focus
on handling the covariate shift, i.e. the distribution mismatch between the input features across the
different domains, and propose various approaches for learning domain-invariant representations
[38, 39] based on which they learn an outcome function shared between domains. Alternatively,
multi-task/transfer learning methods propose various approaches for neural networks to learn from
related tasks that involve using both shared and task (domain) specific layers [40, 41] to allow
flexible modelling of the different outcomes. To handle heterogeneous feature spaces, [22] proposes
RadialGAN, a method that augments the target dataset with generated samples from the source
datasets. However, RadialGAN involves training separate generators and discriminators for each
domain and consequently also requires access to enough training data in the target domain. After the
data generation, RadialGAN trains separate predictors in each domain that do not share information
with each other. Alternatively, Wiens et al. [19] consider the problem of feature mismatch (in a
specific healthcare application), but do not address the problem of distributional differences in the
outcomes.
We are the first to address the problem of heterogeneous transfer for CATE estimation. We build
HTCE-learners that use representation learning to handle the heterogeneous feature spaces and a
multi-task architecture with shared and private layers to transfer information between PO across
domains, thus also handling the case when different populations respond differently to treatments.
3 Problem formalism
Let random variable $X_i \in \mathcal{X}$ denote a vector of pre-treatment covariates (confounders),
$W_i \in \{0, 1\}$ the assigned binary treatment and $Y_i$ a categorical or continuous observed
outcome for individual $i$. Let $\pi(x) = p(W = 1 \mid X = x)$ denote the treatment assignment
mechanism. As previously mentioned, we work in the Neyman-Rubin potential outcomes (PO)
framework [24, 25] and we consider that each individual has two potential outcomes $Y_i(1)$ and
$Y_i(0)$ for receiving and not receiving the treatment respectively. However, only one of these
outcomes can be observed, such that $Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)$. Let
$\mu_1(x) = \mathbb{E}[Y(1) \mid X = x]$ and $\mu_0(x) = \mathbb{E}[Y(0) \mid X = x]$ be
the PO functions.

Our aim is to estimate the conditional average treatment effect (CATE):
$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] = \mu_1(x) - \mu_0(x) \tag{1}$$
which is the difference between expected outcomes for an individual with covariates $X = x$. Let
$\eta = (\mu_0(x), \mu_1(x), \pi(x))$ be the nuisance functions for this CATE estimation problem.
Assume access to a source dataset $\mathcal{D}^R = \{(X^R_i, W_i, Y_i)\}_{i=1}^{N_R}$ and a target
dataset $\mathcal{D}^T = \{(X^T_i, W_i, Y_i)\}_{i=1}^{N_T}$. Different domains in applications such
as healthcare have heterogeneous feature spaces, such that $X^R_i \in \mathbb{R}^{D_R}$ and
$X^T_i \in \mathbb{R}^{D_T}$, where $D_R \neq D_T$ are the dimensions of the feature spaces. We
also consider that the source and target domains have different distributions
$p(X^R) \neq p(X^T)$ (due to their heterogeneous feature spaces), different treatment assignment
mechanisms $p(W = 1 \mid X^R) \neq p(W = 1 \mid X^T)$ and different conditional distributions
for the PO, $p(Y(w) \mid X^R) \neq p(Y(w) \mid X^T)$. This results in different joint distributions
$p(X^R, W, Y) \neq p(X^T, W, Y)$, which is representative of hospitals recording different types of
patient data, where the relationships between patient covariates, treatments and outcomes can change
across diseases and locations [19, 42]. Nevertheless, we implicitly assume that there is a shared
structure between these conditional distributions across domains to enable transfer.
Our aim is to estimate conditional average treatment effects (CATE) for the target domain:
$$\tau^T(x) = \mu^T_1(x^T) - \mu^T_0(x^T), \tag{2}$$
by using both the source dataset $\mathcal{D}^R$ and the target dataset $\mathcal{D}^T$. In particular,
we want to improve the estimation in the target domain by leveraging information from the source
domain. This is useful in the setting where the target dataset is much smaller than the source one,
$N_T \ll N_R$, and we can leverage shared structure between the source and target outcome
response functions, $\mu^R_0(x^R), \mu^T_0(x^T)$ and $\mu^R_1(x^R), \mu^T_1(x^T)$, and
treatment assignment mechanisms $\pi^R(x^R)$, $\pi^T(x^T)$. To be able to identify the
causal effects from observational data, we make the standard assumptions for both domains.
Assumption 1.
(Unconfoundedness) There are no unobserved confounders, such that the treatment assignment and
PO are conditionally independent given the covariates: $Y(0), Y(1) \perp\!\!\!\perp W \mid X^T$ and
$Y(0), Y(1) \perp\!\!\!\perp W \mid X^R$.
Assumption 2.
(Overlap) $\pi^T(x^T) = p(W = 1 \mid X^T = x^T) > 0, \ \forall x^T \in \mathcal{X}^T$ and
$\pi^R(x^R) = p(W = 1 \mid X^R = x^R) > 0, \ \forall x^R \in \mathcal{X}^R$.
4 Building blocks for CATE transfer learners
In this section, we propose building blocks that enable a flexible transfer approach for CATE learners.
The challenge in this setting is threefold, as we need to (1) handle heterogeneous feature spaces
between the source and target domains, (2) share information between PO functions across the source
and target datasets, $(\mu^R_1, \mu^T_1)$ and $(\mu^R_0, \mu^T_0)$, as well as (3) share
information between PO functions within a single domain, $(\mu^R_0, \mu^R_1)$ and
$(\mu^T_0, \mu^T_1)$.
We start by addressing (1) and (2) and show how the proposed building blocks can be used to obtain
transfer approaches for the most common meta-learning strategies in the treatment effects literature
[15]. Then, we propose a building block for addressing (3) to obtain transfer CATE learners that use
shared layers and outcome specific layers for the potential outcome functions in each domain [10].
4.1 Handling heterogeneous feature spaces between source and target domains
Consider the following split for the source and target covariates, $X^R = (X^s, X^{pR})$ and
$X^T = (X^s, X^{pT})$, such that we have a set of features private (specific) to the source dataset,
$X^{pR} \in \mathbb{R}^{D_{pR}}$, a set of features private to the target dataset,
$X^{pT} \in \mathbb{R}^{D_{pT}}$, and a set of shared features between the two datasets,
$X^s \in \mathbb{R}^{D_S}$. To handle the heterogeneous feature spaces between the source and
target datasets, we propose using several encoders to create a common representation that can be
used as input to the different transfer CATE learners.

Let $\phi^{pR}(x^R): \mathbb{R}^{D_R} \to \mathbb{R}^{D_p}$ and
$\phi^{pT}(x^T): \mathbb{R}^{D_T} \to \mathbb{R}^{D_p}$ be domain-specific (private) encoders
that map the heterogeneous input features to a representation of size $D_p$, such that
$\phi^{pT}(x^T) = z^{pT}$ and $\phi^{pR}(x^R) = z^{pR}$. Moreover, let
$\phi^s(x^s): \mathbb{R}^{D_S} \to \mathbb{R}^{D_s}$ be a shared encoder that maps the shared
features between the source and target domains into a representation of size $D_s$, such that
$\phi^s(x^s) = z^s$.
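A minimal numeric sketch of these encoders follows, assuming single tanh layers and made-up dimensions (in the paper the encoders are neural networks whose sizes are hyperparameters; all names here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_pR_feat, D_pT_feat = 4, 3, 2  # shared / source-private / target-private feature dims
Ds, Dp = 8, 8                        # representation sizes D_s and D_p

# One-layer stand-ins for the shared encoder phi_s and private encoders phi_pR, phi_pT.
W_s  = rng.normal(size=(D_S, Ds))
W_pR = rng.normal(size=(D_S + D_pR_feat, Dp))  # private encoders see all domain features
W_pT = rng.normal(size=(D_S + D_pT_feat, Dp))

def encode_source(x_R):
    """x_R = (x_s, x_pR); returns [z_s || z_pR] of size Ds + Dp."""
    z_s  = np.tanh(x_R[:, :D_S] @ W_s)
    z_pR = np.tanh(x_R @ W_pR)
    return np.concatenate([z_s, z_pR], axis=1)

def encode_target(x_T):
    """x_T = (x_s, x_pT); returns [z_s || z_pT] of size Ds + Dp."""
    z_s  = np.tanh(x_T[:, :D_S] @ W_s)
    z_pT = np.tanh(x_T @ W_pT)
    return np.concatenate([z_s, z_pT], axis=1)
```

Because both domains pass through the same shared encoder, source and target examples end up in representations of identical size $D_s + D_p$, which is what allows downstream layers to be shared.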
Figure 1: Building block for handling the heterogeneous feature space of the source and target
domains. (Diagram: the shared encoder maps $x^s$ to $z^s$, while the source and target private
encoders map $x^R$ and $x^T$ to $z^{pR}$ and $z^{pT}$ respectively.)
As illustrated in Figure 1, a source example $x^R$ is encoded to $[z^s \,\|\, z^{pR}]$ and a target
example $x^T$ to $[z^s \,\|\, z^{pT}]$, where $\|$ denotes concatenation and where both
representations have size $D_s + D_p$. Note that an alternative approach would have been to use the
domain-specific encoders $\phi^p$ only for the private features $x^{pR}$ and $x^{pT}$. However,
inputting the shared features through both types of encoders allows us to learn relationships between
them that are shared across the different domains, as well as interactions which are domain-specific.

To discourage redundancy and ensure that $z^p$ and $z^s$ encode different information from the
input features, we propose using a regularization loss that enforces their orthogonality [39]:
$$\mathcal{L}_{\text{orth}_z} = \| \zeta^{s\top} \zeta^{pR} \|^2_F + \| \zeta^{s\top} \zeta^{pT} \|^2_F \tag{3}$$
where $\zeta^{pR}, \zeta^{pT}$ and $\zeta^s$ are matrices whose rows are the private $z^{pR}$,
$z^{pT}$ and shared $z^s$ representations for the source and target examples respectively, and
$\|\cdot\|^2_F$ is the squared Frobenius norm.
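The orthogonality regularizer of Eq. (3) is straightforward to compute from minibatch representations; a sketch with our own (hypothetical) function name:

```python
import numpy as np

def orth_penalty(Z_s, Z_p):
    """Squared Frobenius norm of Z_s^T Z_p, where the rows of Z_s / Z_p are the
    shared / private representations of a minibatch. The penalty is zero exactly
    when every shared dimension is orthogonal to every private dimension over
    the batch."""
    G = Z_s.T @ Z_p  # (Ds x Dp) cross-Gram matrix
    return float(np.sum(G ** 2))

# Eq. (3): L_orth_z = orth_penalty(Zs_source, ZpR) + orth_penalty(Zs_target, ZpT)
```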
4.2 Sharing information between potential outcome response functions across domains
As treatment responses can vary between different patient populations, it is important to build
a transfer approach that enables learning target-specific outcome functions, while also sharing
information from the source domain. We propose a building block for sharing information between
PO functions across domains that is inspired by the FlexTENet architecture [14] and by works in
multitask learning [41], and that involves having private layers (subspaces) for each domain as well
as shared layers.
Figure 2: Building block for sharing information between PO across domains. (Diagram: for each
treatment $w$ and layer $l$, a shared subspace $h^s_{w,l}$ and private subspaces $h^{pR}_{w,l}$,
$h^{pT}_{w,l}$ produce the estimates $\hat{\mu}^R_w(x^R)$ and $\hat{\mu}^T_w(x^T)$.)
As shown in Figure 2, for each treatment $w \in \{0, 1\}$, we consider a model architecture for
estimating its PO functions in the source and target domains, $\mu^R_w$ and $\mu^T_w$, that
consists of $L$ layers, each having a shared and two private subspaces (one for each domain). For
simplicity, we consider the same number of hidden dimensions for each shared and private subspace.
Let $\tilde{h}^{pR}_{w,l}$, $\tilde{h}^{pT}_{w,l}$, $\tilde{h}^s_{w,l}$ be the inputs and
$h^{pR}_{w,l}$, $h^{pT}_{w,l}$, $h^s_{w,l}$ the outputs of the $l$-th layer. For $l > 1$, similarly
to [14], the inputs to the $(l+1)$-th layer are obtained as follows:
$\tilde{h}^{pR}_{w,l+1} = [h^s_{w,l} \,\|\, h^{pR}_{w,l}]$,
$\tilde{h}^{pT}_{w,l+1} = [h^s_{w,l} \,\|\, h^{pT}_{w,l}]$,
$\tilde{h}^s_{w,l+1} = [h^s_{w,l}]$. For $l = 1$, we set
$\tilde{h}^{pR}_{w,1} = \Phi^R(x^R)$, $\tilde{h}^{pT}_{w,1} = \Phi^T(x^T)$, and
$\tilde{h}^s_{w,1} = \tilde{h}^{pR}_{w,1}$ when using an example from the source domain or
$\tilde{h}^s_{w,1} = \tilde{h}^{pT}_{w,1}$ when using an example from the target domain, where
$\Phi^R(\cdot)$ and $\Phi^T(\cdot)$ are input representations. When sharing the encoders from
Section 4.1 for both treatments, we set $\Phi^R(x^R) = [z^s \,\|\, z^{pR}]$ and
$\Phi^T(x^T) = [z^s \,\|\, z^{pT}]$. However, as we will see in Section 5.1, this input
representation is CATE-learner specific and can be extended (see Section 5.2) by adding more
representation layers to share information between PO functions within each domain. For the last
layer $L$, we build $h^s_{w,L}$, $h^{pR}_{w,L}$, $h^{pT}_{w,L}$ to each have the same dimension
as the potential outcome $y$.
Overall, let $g^R_w$, $g^T_w$ be the hypothesis functions estimating the potential outcomes in the
source and target domains respectively, such that
$g^R_w(\Phi^R(x^R)) = \psi(h^{pR}_{w,L} + h^s_{w,L})$ and
$g^T_w(\Phi^T(x^T)) = \psi(h^{pT}_{w,L} + h^s_{w,L})$, where $\psi$ is the linear function for
continuous outcomes and the sigmoid function for binary ones. This allows us to define the following
loss function for estimating the PO:
$$\mathcal{L}_y = \sum_{i=1}^{N_R} l(y_i, g^R_{w_i}(\Phi^R(x^R_i))) + \sum_{i=1}^{N_T} l(y_i, g^T_{w_i}(\Phi^T(x^T_i))) \tag{4}$$
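To make the layer-wise sharing concrete, the sketch below runs a forward pass through the $L$-layer block for one treatment arm in one domain. It is our own minimal rendering, assuming tanh activations, an identity $\psi$ (continuous outcome), and weight matrices we invent for illustration:

```python
import numpy as np

def forward_po(phi, Ws_list, Wp_list):
    """phi: input representation Phi(x) for one domain, shape (n, d).
    Ws_list / Wp_list: per-layer weight matrices for the shared / private subspace.
    Returns the PO estimate psi(h_p_L + h_s_L) with psi = identity."""
    h_s, h_p = phi, phi  # l = 1: both paths start from Phi(x)
    for W_s, W_p in zip(Ws_list[:-1], Wp_list[:-1]):
        # Private layer l+1 sees [h_s || h_p]; the shared layer sees only h_s.
        h_p = np.tanh(np.concatenate([h_s, h_p], axis=-1) @ W_p)
        h_s = np.tanh(h_s @ W_s)
    # Last layer L: both subspaces map to the outcome dimension; the head adds them.
    out_p = np.concatenate([h_s, h_p], axis=-1) @ Wp_list[-1]
    out_s = h_s @ Ws_list[-1]
    return out_p + out_s
```

The target-domain hypothesis $g^T_w$ would reuse the same shared weights `Ws_list` but its own private weights, which is how source-domain gradients also shape the target PO functions; the loss in Eq. (4) then sums the factual-outcome errors over both domains.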