Transfer Learning on Heterogeneous Feature Spaces
for Treatment Effects Estimation
Ioana Bica
University of Oxford, Oxford, UK
The Alan Turing Institute, London, UK
ioana.bica@eng.ox.ac.uk
Mihaela van der Schaar
University of Cambridge, Cambridge, UK
University of California, Los Angeles, USA
The Alan Turing Institute, London, UK
mv472@cam.ac.uk
Abstract
Consider the problem of improving the estimation of conditional average treatment
effects (CATE) for a target domain of interest by leveraging related information
from a source domain with a different feature space. This heterogeneous transfer
learning problem for CATE estimation is ubiquitous in areas such as healthcare
where we may wish to evaluate the effectiveness of a treatment for a new patient
population for which different clinical covariates and limited data are available.
In this paper, we address this problem by introducing several building blocks that
use representation learning to handle the heterogeneous feature spaces and a flexible
multi-task architecture with shared and private layers to transfer information
between potential outcome functions across domains. Then, we show how these
building blocks can be used to recover transfer learning equivalents of the standard
CATE learners. On a new semi-synthetic data simulation benchmark for heterogeneous
transfer learning, we not only demonstrate performance improvements of
our heterogeneous transfer causal effect learners across datasets, but also provide
insights into the differences between these learners from a transfer perspective.
1 Introduction
Estimating the personalized effects of interventions from observational data is a fundamental problem
in causal inference that is crucial for decision-making in many domains: in healthcare, for determining
which treatments to give to patients [1], in education, for deciding which school curriculum is best for
each student [2, 3], or in public policy, for choosing who would benefit from job training programs [4].
Recently, a large number of machine learning methods have been proposed for estimating conditional
average treatment effects (CATE) which enable such personalized policies [5–16].
Nevertheless, the good performance of these methods on a population of interest relies heavily
on the availability of large enough observational datasets for training [17, 18]. In healthcare, for
instance, this can be challenging when hospitals with few patients cannot collect enough data (e.g.
a large proportion of hospitals in the USA have fewer than 100 beds, which for rare diseases can
result in fewer than 80 training examples per year [19]). Moreover, in situations such as the COVID
pandemic, each hospital will initially have a very limited amount of data from which to learn the
effectiveness of interventions [20]. Compared to the predictive setting, this problem is exacerbated in the
treatment effects setting, where we need to observe both patients who are treated and not treated to be
able to reliably train a model for CATE estimation and obtain personalized treatment recommendations
for the intended patient population. While data from large national registries can be used to build
global models for general use across hospitals, such models do not take into account the particularities
of different patient populations (e.g. different conditional outcome distributions) and consequently
can perform poorly during deployment [19, 21]. Moreover, various hospitals often record different
(but overlapping) sets of patient covariates [19, 22], which makes this transfer learning problem [23]
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06183v1 [cs.LG] 8 Oct 2022
even more challenging as we need to account for the heterogeneous feature spaces. Therefore, it
becomes crucial to build methods for heterogeneous transfer learning that can leverage information
from large source datasets with potentially different feature spaces to improve CATE estimation on
the target datasets of interest.
We consider the Neyman-Rubin potential outcomes (PO) framework [24, 25], where for each individual
we can define two potential outcomes, one with and one without the treatment. Out of these,
we can only observe the factual outcome; the counterfactual outcome is never observed. Under the
identifiability conditions of overlap and ignorability, observational data can be used to estimate the
PO conditional on the patient’s characteristics, which can then be used to obtain CATE. In this paper,
we address the problem of heterogeneous transfer for CATE estimation, where we aim to leverage
related data from a source domain with a different feature space (e.g. data from another hospital) to
improve CATE estimation on a target domain from which we only have few training examples.
Due to the fundamental problem of causal inference of not being able to observe both PO for a patient
[26], heterogeneous transfer learning in the context of CATE estimation becomes significantly more
challenging than for supervised learning. In addition to the feature mismatch between domains, the
PO may also have different conditional distributions, as covariate relationships and their impact
on patients’ response to treatments cannot be expected to stay constant across hospitals/locations
[19]. Moreover, as clinicians may use different criteria for assigning treatments for various patient
populations, this selection bias may create discrepancies in the covariate shift induced by the treated
and control populations in each domain. Consequently, we need to build an approach that can both
handle the heterogeneous feature spaces, and also model the similarities and differences between
both PO functions and treatment assignment mechanisms across the source and target domains.
For the binary treatment setting in a single patient population, a large number of different approaches
for CATE estimation have been proposed where the main design choices involved modelling the PO
functions and handling the selection bias present in observational datasets [10, 14–16, 27–29]. We
discuss these in more detail in Section 2. However, note that each CATE learner has its own
advantages and disadvantages in terms of the inductive biases they use for modelling the PO functions
and the covariate shift induced by the selection bias, and thus, different learners will achieve better
performance in various scenarios [15, 30]. Therefore, we propose a flexible approach for transfer
learning that (1) preserves the characteristics of each learner in a single domain, (2) handles
heterogeneous feature spaces, and (3) shares information between PO functions across domains.
Firstly, we introduce several building blocks that can be used to adapt the most common CATE
learners [10, 15] to transfer information from a source to a target domain. These building blocks
involve handling the heterogeneous feature spaces, sharing information between PO functions across
domains and sharing information between PO functions within a single domain. Secondly, we show
how these building blocks can be used to build heterogeneous transfer causal effect (HTCE-) learner
equivalents of the most common and popular CATE learners based on neural networks [10, 15].
Contributions.
Our contributions are three-fold: (i) we define the problem of heterogeneous transfer
learning in the context of CATE estimation and propose several building blocks that can be used to
construct models to address this problem, (ii) we use these building blocks to construct HTCE-learner
equivalents of the most common CATE learners, and (iii) we propose a new semi-synthetic data
simulation and guidelines for evaluating CATE methods for heterogeneous transfer and perform
extensive experiments that not only show that our HTCE-learners achieve improved performance, but
also provide new insights into the differences between these learners from a transfer perspective.
2 Related works
We tackle the problem of heterogeneous transfer learning in the context of CATE estimation. Thus,
our work sits at the intersection of research on (1) causal inference methods for CATE estimation,
(2) leveraging multiple datasets for CATE estimation, and (3) multi-task/transfer learning and domain
adaptation. Refer to Appendix A for further discussion of related works.
CATE learners.
The estimation of CATE has received a lot of attention in the causal inference
literature and several methods have been proposed to estimate the effects of binary treatments. Out of
these, we consider the most popular approaches that involve using model agnostic learning strategies,
also known as meta-learners, for CATE estimation [15, 31] or neural network-based models that build
shared representations between the PO functions followed by outcome-specific layers [10, 15, 27, 28].
The CATE meta-learners can be split into (a) one-step plug-in learners (indirect meta-learners) that
estimate the PO from the observational data and then set CATE as the difference in the PO [31],
and (b) two-step learners (direct meta-learners) that estimate the PO and/or the propensity score
in the first step, on the basis of which they build a pseudo-outcome and obtain CATE directly by
regressing the pseudo-outcome on the input covariates in the second step [15, 31–33]. Refer to [15]
for a more thorough classification of the different meta-learners. Alternatively, several methods based
on representation learning with neural networks and multi-task learning have been proposed that
involve having shared layers between the PO functions followed by outcome-specific layers. The
most standard architecture for this is TARNet [10], which has been extended to allow for different
types of information sharing between the PO and propensity score in [14, 15, 27]. To account for the
confounding bias present in observational datasets, several approaches have been proposed to extend
this model architecture by building balanced representations (treatment-invariant representations)
[10, 28] and/or incorporating propensity weighting to obtain unbiased estimates of the PO [29, 34].
These different approaches have their own benefits and drawbacks, which is why it is important to
build a heterogeneous transfer learning approach that is general enough to extend all of them.
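As a concrete illustration of the indirect (plug-in) strategy described above, the sketch below fits separate outcome models on the control and treated groups and takes their difference as the CATE estimate. This is a minimal single-domain example of our own; the function names are hypothetical, and the linear least-squares fit is only a stand-in for whatever regressor a real learner would use.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares regression returning a prediction function
    (a toy stand-in for an arbitrary outcome regressor)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xq: np.hstack([Xq, np.ones((Xq.shape[0], 1))]) @ beta

def plug_in_cate(X, w, y, Xq):
    """Indirect meta-learner: estimate mu_0 on controls and mu_1 on treated,
    then set CATE(x) = mu_1(x) - mu_0(x)."""
    mu0 = fit_linear(X[w == 0], y[w == 0])
    mu1 = fit_linear(X[w == 1], y[w == 1])
    return mu1(Xq) - mu0(Xq)
```

Direct meta-learners differ in that, after estimating the nuisance functions, they regress a constructed pseudo-outcome on the covariates to model the CATE function itself.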
Transfer and domain adaptation for CATE estimation.
While the problem of transfer learning for CATE estimation has also been addressed by [35], the
proposed approach considers shared feature spaces and consists of a two-stage training procedure
that involves warm-starting on the first domain and fine-tuning on the second domain. Alternatively,
[36] proposes a CATE estimation method that can generalize to distribution shifts in the patient
population in the unsupervised domain adaptation setting. However, they do not assume access to
label information in the target domain and only consider a shared feature space between the two
domains. In addition, [37] leverages data from multiple different environments, with shared feature
spaces, to learn an invariant representation that removes the 'bad controls' which induce bias in
CATE estimation. Then, they use this invariant representation to learn shared PO functions across
the different environments. Refer to Appendix A for more methods that use multiple datasets for
causal inference, although for different purposes than ours.
Multi-task/transfer learning and domain adaptation.
Methods to address these problems have been extensively studied in the predictive (supervised)
setting. We describe here the works most related to ours that consider (a) a shared feature space and
(b) heterogeneous feature spaces. For shared feature spaces, methods in domain adaptation focus
on handling the covariate shift, i.e. the distribution mismatch between the input features across the
different domains, and propose various approaches for learning domain-invariant representations
[38, 39] based on which they learn an outcome function shared between domains. Alternatively,
multi-task/transfer learning methods propose various approaches for neural networks to learn from
related tasks that involve using both shared and task (domain) specific layers [40, 41] to allow
flexible modelling of the different outcomes. To handle heterogeneous feature spaces, [22] proposes
RadialGAN, a method that augments the target dataset with generated samples from the source
datasets. However, RadialGAN involves training separate generators and discriminators for each
domain and consequently also requires access to enough training data in the target domain. After the
data generation, RadialGAN trains separate predictors in each domain that do not share information
with each other. Alternatively, Wiens et al. [19] consider the problem of feature mismatch (in a
specific healthcare application), but do not address the problem of distributional differences in the
outcomes.
We are the first to address the problem of heterogeneous transfer for CATE estimation. We build
HTCE-learners that use representation learning to handle the heterogeneous feature spaces and a
multi-task architecture with shared and private layers to transfer information between PO across
domains, thus also handling the case when different populations respond differently to treatments.
3 Problem formalism
Let random variable $X_i \in \mathcal{X}$ denote a vector of pre-treatment covariates (confounders),
$W_i \in \{0, 1\}$ the assigned binary treatment and $Y_i$ a categorical or continuous observed
outcome for individual $i$. Let $\pi(x) = p(W = 1 \mid X = x)$ denote the treatment assignment
mechanism. As previously mentioned, we work in the Neyman-Rubin potential outcomes (PO)
framework [24, 25] and we consider that each individual has two potential outcomes $Y_i(1)$ and
$Y_i(0)$ for receiving and not receiving the treatment respectively. However, only one of these
outcomes can be observed, such that $Y_i = W_i Y_i(1) + (1 - W_i) Y_i(0)$. Let
$\mu_1(x) = \mathbb{E}[Y(1) \mid X = x]$ and $\mu_0(x) = \mathbb{E}[Y(0) \mid X = x]$ be
the PO functions.

Our aim is to estimate the conditional average treatment effect (CATE):
$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] = \mu_1(x) - \mu_0(x) \tag{1}$$
which is the difference between expected outcomes for an individual with covariates $X = x$. Let
$\eta = (\mu_0(x), \mu_1(x), \pi(x))$ be the nuisance functions for this CATE estimation problem.
Assume access to a source dataset $\mathcal{D}^R = \{(X^R_i, W_i, Y_i)\}_{i=1}^{N_R}$ and a target
dataset $\mathcal{D}^T = \{(X^T_i, W_i, Y_i)\}_{i=1}^{N_T}$. Different domains in applications such
as healthcare have heterogeneous feature spaces, such that $X^R_i \in \mathbb{R}^{D_R}$ and
$X^T_i \in \mathbb{R}^{D_T}$, where $D_R \neq D_T$ are the dimensions of the feature spaces. We
also consider that the source and target domains have different distributions
$p(X^R) \neq p(X^T)$ (due to their heterogeneous feature spaces), different treatment assignment
mechanisms $p(W = 1 \mid X^R) \neq p(W = 1 \mid X^T)$ and different conditional distributions
for the PO, $p(Y(w) \mid X^R) \neq p(Y(w) \mid X^T)$. This results in different joint distributions
$p(X^R, W, Y) \neq p(X^T, W, Y)$, which is representative of hospitals recording different types of
patient data, where the relationships between patient covariates, treatments and outcomes can change
across diseases and locations [19, 42]. Nevertheless, we implicitly assume that there is a shared
structure between these conditional distributions across domains to enable transfer.
Our aim is to estimate conditional average treatment effects (CATE) for the target domain:
$$\tau^T(x) = \mu^T_1(x^T) - \mu^T_0(x^T), \tag{2}$$
by using both the source dataset $\mathcal{D}^R$ and the target dataset $\mathcal{D}^T$. In particular,
we want to improve the estimation in the target domain by leveraging information from the source
domain. This is useful in the setting where the target dataset is much smaller than the source one,
$N_T \ll N_R$, and we can leverage shared structure between the source and target outcome
response functions, $\mu^R_0(x^R), \mu^T_0(x^T)$ and $\mu^R_1(x^R), \mu^T_1(x^T)$, and
treatment assignment mechanisms $\pi^R(x^R)$, $\pi^T(x^T)$. To be able to identify the
causal effects from observational data, we make the standard assumptions for both domains.
Assumption 1.
(Unconfoundedness) There are no unobserved confounders, such that the treatment assignment and
PO are conditionally independent given the covariates: $Y(0), Y(1) \perp\!\!\!\perp W \mid X^T$ and
$Y(0), Y(1) \perp\!\!\!\perp W \mid X^R$.
Assumption 2.
(Overlap) $\pi^T(x^T) = p(W = 1 \mid X^T = x^T) > 0, \ \forall x^T \in \mathcal{X}^T$ and
$\pi^R(x^R) = p(W = 1 \mid X^R = x^R) > 0, \ \forall x^R \in \mathcal{X}^R$.
4 Building blocks for CATE transfer learners
In this section, we propose building blocks that enable a flexible transfer approach for CATE learners.
The challenge in this setting is threefold, as we need to (1) handle heterogeneous feature spaces
between the source and target domains, (2) share information between PO functions across the source
and target datasets, $(\mu^R_1, \mu^T_1)$ and $(\mu^R_0, \mu^T_0)$, as well as (3) share
information between PO functions within a single domain, $(\mu^R_0, \mu^R_1)$ and
$(\mu^T_0, \mu^T_1)$.
We start by addressing (1) and (2) and show how the proposed building blocks can be used to obtain
transfer approaches for the most common meta-learning strategies in the treatment effects literature
[15]. Then, we propose a building block for addressing (3) to obtain transfer CATE learners that use
shared layers and outcome specific layers for the potential outcome functions in each domain [10].
4.1 Handling heterogeneous feature spaces between source and target domains
Consider the following split for the source and target covariates, $X^R = (X^s, X^{pR})$ and
$X^T = (X^s, X^{pT})$, such that we have a set of features private (specific) to the source dataset,
$X^{pR} \in \mathbb{R}^{D_{pR}}$, a set of features private to the target dataset,
$X^{pT} \in \mathbb{R}^{D_{pT}}$, and a set of shared features between the two datasets,
$X^s \in \mathbb{R}^{D_S}$. To handle the heterogeneous feature spaces between the source and
target datasets, we propose using several encoders to create a common representation that can be
used as input to the different transfer CATE learners.

Let $\phi^{pR}(x^R): \mathbb{R}^{D_R} \to \mathbb{R}^{D_p}$ and
$\phi^{pT}(x^T): \mathbb{R}^{D_T} \to \mathbb{R}^{D_p}$ be domain-specific (private) encoders
that map the heterogeneous input features to a representation of size $D_p$, such that
$\phi^{pT}(x^T) = z^{pT}$ and $\phi^{pR}(x^R) = z^{pR}$. Moreover, let
$\phi^s(x^s): \mathbb{R}^{D_S} \to \mathbb{R}^{D_s}$ be a shared encoder that maps the shared
features between the source and target domains into a representation of size $D_s$, such that
$\phi^s(x^s) = z^s$.
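A minimal numeric sketch of these encoders follows, assuming single tanh layers and made-up dimensions (in the paper the encoders are neural networks whose sizes are hyperparameters; all names here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_pR_feat, D_pT_feat = 4, 3, 2  # shared / source-private / target-private feature dims
Ds, Dp = 8, 8                        # representation sizes D_s and D_p

# One-layer stand-ins for the shared encoder phi_s and private encoders phi_pR, phi_pT.
W_s  = rng.normal(size=(D_S, Ds))
W_pR = rng.normal(size=(D_S + D_pR_feat, Dp))  # private encoders see all domain features
W_pT = rng.normal(size=(D_S + D_pT_feat, Dp))

def encode_source(x_R):
    """x_R = (x_s, x_pR); returns [z_s || z_pR] of size Ds + Dp."""
    z_s  = np.tanh(x_R[:, :D_S] @ W_s)
    z_pR = np.tanh(x_R @ W_pR)
    return np.concatenate([z_s, z_pR], axis=1)

def encode_target(x_T):
    """x_T = (x_s, x_pT); returns [z_s || z_pT] of size Ds + Dp."""
    z_s  = np.tanh(x_T[:, :D_S] @ W_s)
    z_pT = np.tanh(x_T @ W_pT)
    return np.concatenate([z_s, z_pT], axis=1)
```

Because both domains pass through the same shared encoder, source and target examples end up in representations of identical size $D_s + D_p$, which is what allows downstream layers to be shared.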
Figure 1: Building block for handling the heterogeneous feature space of the source and target
domains. (Diagram: the shared encoder maps $x^s$ to $z^s$, while the source and target private
encoders map $x^R$ and $x^T$ to $z^{pR}$ and $z^{pT}$ respectively.)
As illustrated in Figure 1, a source example $x^R$ is encoded to $[z^s \,\|\, z^{pR}]$ and a target
example $x^T$ to $[z^s \,\|\, z^{pT}]$, where $\|$ denotes concatenation and where both
representations have size $D_s + D_p$. Note that an alternative approach would have been to use the
domain-specific encoders $\phi^p$ only for the private features $x^{pR}$ and $x^{pT}$. However,
inputting the shared features through both types of encoders allows us to learn relationships between
them that are shared across the different domains, as well as interactions which are domain-specific.

To discourage redundancy and ensure that $z^p$ and $z^s$ encode different information from the
input features, we propose using a regularization loss that enforces their orthogonality [39]:
$$\mathcal{L}_{\text{orth}_z} = \| \zeta^{s\top} \zeta^{pR} \|^2_F + \| \zeta^{s\top} \zeta^{pT} \|^2_F \tag{3}$$
where $\zeta^{pR}, \zeta^{pT}$ and $\zeta^s$ are matrices whose rows are the private $z^{pR}$,
$z^{pT}$ and shared $z^s$ representations for the source and target examples respectively, and
$\|\cdot\|^2_F$ is the squared Frobenius norm.
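The orthogonality regularizer of Eq. (3) is straightforward to compute from minibatch representations; a sketch with our own (hypothetical) function name:

```python
import numpy as np

def orth_penalty(Z_s, Z_p):
    """Squared Frobenius norm of Z_s^T Z_p, where the rows of Z_s / Z_p are the
    shared / private representations of a minibatch. The penalty is zero exactly
    when every shared dimension is orthogonal to every private dimension over
    the batch."""
    G = Z_s.T @ Z_p  # (Ds x Dp) cross-Gram matrix
    return float(np.sum(G ** 2))

# Eq. (3): L_orth_z = orth_penalty(Zs_source, ZpR) + orth_penalty(Zs_target, ZpT)
```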
4.2 Sharing information between potential outcome response functions across domains
As treatment responses can vary between different patient populations, it is important to build
a transfer approach that enables learning target-specific outcome functions, while also sharing
information from the source domain. We propose a building block for sharing information between
PO functions across domains that is inspired by the FlexTENet architecture [14] and by works in
multitask learning [41], and that involves having private layers (subspaces) for each domain as well
as shared layers.
Figure 2: Building block for sharing information between PO across domains. (Diagram: for each
treatment $w$ and layer $l$, a shared subspace $h^s_{w,l}$ and private subspaces $h^{pR}_{w,l}$,
$h^{pT}_{w,l}$ produce the estimates $\hat{\mu}^R_w(x^R)$ and $\hat{\mu}^T_w(x^T)$.)
As shown in Figure 2, for each treatment $w \in \{0, 1\}$, we consider a model architecture for
estimating its PO functions in the source and target domains, $\mu^R_w$ and $\mu^T_w$, that
consists of $L$ layers, each having a shared and two private subspaces (one for each domain). For
simplicity, we consider the same number of hidden dimensions for each shared and private subspace.
Let $\tilde{h}^{pR}_{w,l}$, $\tilde{h}^{pT}_{w,l}$, $\tilde{h}^s_{w,l}$ be the inputs and
$h^{pR}_{w,l}$, $h^{pT}_{w,l}$, $h^s_{w,l}$ the outputs of the $l$-th layer. For $l > 1$, similarly
to [14], the inputs to the $(l+1)$-th layer are obtained as follows:
$\tilde{h}^{pR}_{w,l+1} = [h^s_{w,l} \,\|\, h^{pR}_{w,l}]$,
$\tilde{h}^{pT}_{w,l+1} = [h^s_{w,l} \,\|\, h^{pT}_{w,l}]$,
$\tilde{h}^s_{w,l+1} = [h^s_{w,l}]$. For $l = 1$, we set
$\tilde{h}^{pR}_{w,1} = \Phi^R(x^R)$, $\tilde{h}^{pT}_{w,1} = \Phi^T(x^T)$, and
$\tilde{h}^s_{w,1} = \tilde{h}^{pR}_{w,1}$ when using an example from the source domain or
$\tilde{h}^s_{w,1} = \tilde{h}^{pT}_{w,1}$ when using an example from the target domain, where
$\Phi^R(\cdot)$ and $\Phi^T(\cdot)$ are input representations. When sharing the encoders from
Section 4.1 for both treatments, we set $\Phi^R(x^R) = [z^s \,\|\, z^{pR}]$ and
$\Phi^T(x^T) = [z^s \,\|\, z^{pT}]$. However, as we will see in Section 5.1, this input
representation is CATE-learner specific and can be extended (see Section 5.2) by adding more
representation layers to share information between PO functions within each domain. For the last
layer $L$, we build $h^s_{w,L}$, $h^{pR}_{w,L}$, $h^{pT}_{w,L}$ to each have the same dimension
as the potential outcome $y$.
Overall, let $g^R_w$, $g^T_w$ be the hypothesis functions estimating the potential outcomes in the
source and target domains respectively, such that
$g^R_w(\Phi^R(x^R)) = \psi(h^{pR}_{w,L} + h^s_{w,L})$ and
$g^T_w(\Phi^T(x^T)) = \psi(h^{pT}_{w,L} + h^s_{w,L})$, where $\psi$ is the linear function for
continuous outcomes and the sigmoid function for binary ones. This allows us to define the following
loss function for estimating the PO:
$$\mathcal{L}_y = \sum_{i=1}^{N_R} l(y_i, g^R_{w_i}(\Phi^R(x^R_i))) + \sum_{i=1}^{N_T} l(y_i, g^T_{w_i}(\Phi^T(x^T_i))) \tag{4}$$
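To make the layer-wise sharing concrete, the sketch below runs a forward pass through the $L$-layer block for one treatment arm in one domain. It is our own minimal rendering, assuming tanh activations, an identity $\psi$ (continuous outcome), and weight matrices we invent for illustration:

```python
import numpy as np

def forward_po(phi, Ws_list, Wp_list):
    """phi: input representation Phi(x) for one domain, shape (n, d).
    Ws_list / Wp_list: per-layer weight matrices for the shared / private subspace.
    Returns the PO estimate psi(h_p_L + h_s_L) with psi = identity."""
    h_s, h_p = phi, phi  # l = 1: both paths start from Phi(x)
    for W_s, W_p in zip(Ws_list[:-1], Wp_list[:-1]):
        # Private layer l+1 sees [h_s || h_p]; the shared layer sees only h_s.
        h_p = np.tanh(np.concatenate([h_s, h_p], axis=-1) @ W_p)
        h_s = np.tanh(h_s @ W_s)
    # Last layer L: both subspaces map to the outcome dimension; the head adds them.
    out_p = np.concatenate([h_s, h_p], axis=-1) @ Wp_list[-1]
    out_s = h_s @ Ws_list[-1]
    return out_p + out_s
```

The target-domain hypothesis $g^T_w$ would reuse the same shared weights `Ws_list` but its own private weights, which is how source-domain gradients also shape the target PO functions; the loss in Eq. (4) then sums the factual-outcome errors over both domains.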