Sample Constrained Treatment Eﬀect Estimation Raghavendra Addanki Adobe Research

2025-04-24 0 0 1.26MB 27 页 10玖币

侵权投诉

Sample Constrained Treatment Eﬀect Estimation

Raghavendra Addanki

Adobe Research

raddanki@adobe.com

David Arbour

Adobe Research

darbour26@gmail.com

Tung Mai

Adobe Research

tumai@adobe.com

Cameron Musco

University of Massachusetts Amherst

cmusco@cs.umass.edu

Anup Rao

Adobe Research

anuprao@adobe.com

October 14, 2022

Abstract

Treatment eﬀect estimation is a fundamental problem in causal inference. We focus on

designing eﬃcient randomized controlled trials, to accurately estimate the eﬀect of some treatment

on a population of

individuals. In particular, we study sample-constrained treatment eﬀect

estimation, where we must select a subset of

sn

individuals from the population to experiment

on. This subset must be further partitioned into treatment and control groups. Algorithms for

partitioning the entire population into treatment and control groups, or for choosing a single

representative subset, have been well-studied. The key challenge in our setting is jointly choosing

a representative subset and a partition for that set.

We focus on both individual and average treatment eﬀect estimation, under a linear eﬀects

model. We give provably eﬃcient experimental designs and corresponding estimators, by

identifying connections to discrepancy minimization and leverage-score-based sampling used in

randomized numerical linear algebra. Our theoretical results obtain a smooth transition to known

guarantees when

equals the population size. We also empirically demonstrate the performance

of our algorithms.

1 Introduction

Experimentation has long been held as a gold standard for inferring causal eﬀects since one can

explicitly enforce independence between treatment assignment and other variables which inﬂuence

the outcome of interest. We consider the potential outcomes framework [

Ney23

Rub80

], where each

individual is associated with a control and treatment value (also called the potential outcomes) and

based on the treatment assignment, we can observe only one of these values. Eﬃcient designs of

experimentation for estimating individual treatment eﬀects which measure the diﬀerence between

treatment and control values for each individual, and the average treatment eﬀect which measures

the average individual treatment eﬀect has been well-studied [

MR12

]. In the absence of assumptions

on the functional form of the potential outcomes, the minimax optimal approach for conducting

an experiment is to assign individuals to treatment or control completely at random, without

Author ordering is alphabetical. A preliminary version of this work appeared in the proceedings of 36th Conference

on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2210.06594v1 [cs.LG] 12 Oct 2022

consideration of baseline covariates (features) [

Kal17

]. However, by considering covariates for each

individual, and using additional assumptions of smoothness, substantial gains can be made in terms

of the variance of the treatment eﬀect estimate via alternative assignment procedures. The most

common approach attempts to minimize imbalance, i.e., the diﬀerence between the baseline covariates

in the treatment and control groups [ADR21,Kal17,MR12].

While experimental designs that minimize imbalance increase the power of an experiment for

a given pool of subjects, there are many practical applications where the experimenter wishes to

minimize the total number of subjects who are placed into the experiment. For example, in medicine,

clinical trials may carry nontrivial risk to patients. Within industrial applications, experiments may

carry substantial costs in terms of testing changes, which decrease the quality of the user experience,

or have direct monetary costs.

In this paper, we examine the problem of selecting a subset of

individuals from a larger

population and assigning treatments such that the estimated treatment eﬀect has a small error. We

consider two diﬀerent estimands: individual treatment eﬀect (ITE) and average treatment eﬀect

(ATE).

A bit more formally, we represent the

-covariates of a population of

individuals using

X∈Rn×d

We assume that the treatment and control values, denoted by

y1,y0∈Rn

, are functions of the

covariates, i.e.,

(

X,ζ

ζ1

)and

(

X,ζ

ζ0

)where

ζ0,ζ

ζ1∈Rn

are noise vectors. The ITE for

the

ith

individual is

i−y0

and ATE is the average of all the ITE values. We further assume a

linear model, i.e., the functions

f, g

are linear in

and

ζ1,ζ

ζ0

. The goal is to pick a subset of

individuals and partition this subset into control and treatment groups. For an individual

in the

treatment group, we measure

, and for an individual

in the control, we measure

. From this

small set of measurements, we seek to estimate the ITE or ATE over the full population.

Without parametric assumptions, ITE estimation is not feasible [

SJS17

]. We focus on linear

models in particular, since they are important in developing theory. E.g., in the literature on

optimal designs in active learning, much of the foundational theory is built around linear models.

Identifying estimators based on linearity assumptions is an active area of study in the causal inference

literature [HSSZ19,WDTT16].

Our setup is similar to active learning [

Set09

], where the goal is to minimize the number of

individual labels that we access for solving linear regression or other downstream tasks. The key

diﬀerence is that we must select both a subset of individuals, and for each

, can measure only one

of two labels:

. In particular, ITE estimation can be thought of as solving two simultaneous

active linear regression problems – one for the treatment outcomes and one for the control outcomes.

Thus, standard active learning-based approaches, such as [

CP19

CDL13

M+11

], fall short. Even

when

equals the population size

, i.e., when active learning becomes trivial, our problem does

not. We must still pick a partition of the full population into treatment and control groups. Overall,

sample constrained treatment eﬀect estimation by designing eﬃcient randomized controlled trials

has received little attention, compared to various approaches that use observational data, such

as [JTvA+21,QWZ21,SSS+19].

1.1 Our Contributions

For ITE estimation, we propose an algorithm using leverage score sampling [

Woo14

], which is a

popular approach to subset selection for fast linear algebraic computation. For ATE estimation, we

employ a recursive application of a covariate balancing design [

HSSZ19

]. We provide a theoretical

analysis in terms of root mean squared error (ITE) and deviation error (ATE).

Recall that we assume the treatment and control values are linear functions of the covariates

plus Gaussian noise, i.e.,

Xβ

β1

ζ1

and

Xβ

β0

ζ0

where

ζ1,ζ

ζ0∈Rn

have i.i.d. mean zero,

variance σ2Gaussian entries, and β

β1,β

β0∈Rdare coeﬃcient vectors.

ITE estimation.

For ITE estimation, we give a randomized algorithm that selects Θ(

dlog d

)

individuals in expectation, using leverage scores, which measure the importance of an individual

based on their covariates. Our algorithm obtains, with high probability, root mean squared error

Oplog d/n ·

(



β

β1



β

β0



) +

σ

(see Corollary 3.7). We argue that this is optimal up to constants

and a √log dfactor, even for approaches that experiment on the full population.

The key challenge in achieving this bound is to extend leverage scores to our simultaneous linear

regression setting, ensuring that we do not share samples across the treatment and control eﬀect

estimation problems. To do this, we introduce a smoothed covariate matrix, whose leverage scores

are bounded. This ensures that, when applying independent leverage score sampling, with high

probability few individuals are randomly assigned to both control and treatment, and thus removing

such individuals from one of the groups does not introduce too much error.

ATE estimation.

For ATE estimation we give a randomized algorithm that selects at most

individuals for treatment/control assignment and obtains an error of

Oσ/√s+ (

β

β1

+

β

β0

)/s

where

(

)hides logarithmic factors (see Theorem 4.3). The error decreases with increasing values

of sand when s=n, it matches state-of-the-art guarantees due to Harshaw et al. [HSSZ19].

Our algorithm for ATE estimation is based on covariate balancing. This is a popular approach

where one attempts to assign similar individuals to the treatment and control groups, to ensure

that the observed eﬀect is attributed to the administered treatment alone. Harshaw et al. [

HSSZ19

]

designed an algorithm by minimizing the discrepancy of an augmented covariate matrix, which

achieves low ATE estimation error. To extend their approach to our setting, ﬁrst, we need to select

a subset of

individuals that are representative of the entire population, and then balance the

covariates. Uniform sampling or importance sampling techniques give high error here. Instead, we

employ a recursive strategy, which repeatedly partitions the individuals into two subsets by balancing

covariates, and selects the smaller subset to recurse on, until we have selected at most

individuals.

We observe that our techniques for ITE and ATE estimation should extend to the setting when

the outcomes are non-linear functions of the covariates, which are linear in some higher-dimensional

kernel space. This is immediate for our discrepancy minimization design for ATE, which only requires

knowing the pairwise inner products of the covariate vectors. For ITE estimation, leverage score

sampling for kernel ridge regression [

AM15

] is most likely applicable. Extensions to broader classes

of non-linear models are beyond the scope of this work, but they are an interesting future direction.

Finally, in Section 5, we provide an empirical evaluation of the performance of our ITE and

ATE estimation methods, comparing against uniform sampling and other baselines on several

datasets. Our results suggest that our techniques can help reduce the costs associated with running

a randomized controlled trials substantially using only a small fraction of the population.

1.2 Other Related Work

For ATE estimation, the most well-studied approaches to experiment design are covariate balancing

and randomization. A variety of design techniques have been studied based on these approaches,

such as blocking [

GLSR04

], matching [

Ima08

Stu10

], rerandomization [

LDR18

MR12

], and opti-

mization [

Kal17

]. Using observational data, treatment eﬀect estimation using covariate regression

adjustment [

Lin13

] and various active learning-based sampling techniques have gained recent atten-

tion [

JTvA+21

NJC+13

SSS+19

]. Compared to ATE, estimating ITE is signiﬁcantly harder and

has received attention only recently using machine learning methods [

AI16

SJS17

WA18

]. There

has been a lot of recent work on eﬃcient experimental designs to minimize experimental costs, in

various domains, such as causal discovery [

AKMM20

AMM21

Ebe07

GSKB18

KDV17

SKDV15

multi-arm bandits [ACD21,KW21,NPS21], and group testing [BCS+20,CCJS11,DHH00].

2 Preliminaries

Notation.

We use bold capital letters, e.g.,

to denote matrices and bold lowercase letters, e.g.,

to denote vectors. We use

[

:] and

, j

]to denote the

ith

row and

jth

column of

respectively,

which we always view as column vectors. The

ith

largest singular value of

is denoted by

σi

(

For any vector x, the Euclidean norm or the `2-norm is denoted by kxk.

For a population of

individuals, we represent each with an integer in [

]where we denote

[

]

def

,··· , n}

. Each individual

j∈

[

]is associated with a treatment and a control value,

denoted

j,y0

j∈R+

, respectively. The vectors associated with all

treatment and control values

are denoted

and

. Additionally, each individual is associated with a

-dimensional covariate

vector. Combined, they comprise the rows of the covariate matrix X∈Rn×d.

In this paper, we consider the ﬁnite population framework, where the potential outcomes of

individuals are ﬁxed and the randomness is only due to treatment assignment [

DLM17

]. We make the

SUTVA assumption, i.e., the treatment outcome value of any individual is independent of treatment

assignments of others in the population [Wag20].

Assumption 2.1

(Linearity Assumption)

Under the linearity assumption, the treatment and control

values are a linear function of the covariates. Formally, for some β

β0,β

β1∈Rd,

y1=Xβ

β1+ζ

ζ1and y0=Xβ

β0+ζ

ζ0,

where

ζ1,ζ

ζ0∈Rn

are noise vectors, with each coordinate drawn independently from the Gaussian

distribution with zero mean and variance

σ2

, i.e.,

, σ2

). We further assume that

is row-

normalized, i.e., kX[i, :]k ≤ 1∀i∈[n].

Deﬁnition 2.2

(Individual Treatment Eﬀect)

Given a population of

individuals, the individual

treatment eﬀect (ITE) of j∈[n]is the diﬀerence between the treatment and control values:

ITE(j)def

=y1

j−y0

Deﬁnition 2.3

(Average Treatment Eﬀect)

Given a population of

individuals, the average

treatment eﬀect (ATE), denoted by τ, is the average individual treatment eﬀect:

τdef

j∈[n]

ITE(j) = 1

j∈[n]

j−y0

Deﬁnition 2.4

(Root Mean Squared Error)

For a set of estimated individual treatment eﬀects,

ITE(j)for j∈[n], the root mean squared error ( RMSE) is deﬁned as:

RMSE def

√n·



d

ITE(j)−ITE(j)



.

Deﬁnition 2.5

(Leverage Score)

Given a matrix

X∈Rn×d

, the leverage score of

jth

row

[

:],

denoted by `j(X), is deﬁned as:

`j(X)def

=X[j, :]T(XTX)+X[j, :],

where +denotes the Moore–Penrose pseudo-inverse.

3 Individual Treatment Eﬀect Estimation

In this section, we describe our algorithm for ITE estimation. The algorithm identiﬁes a subset of

the population to experiment on, using importance based sampling techniques, that are well-studied

in randomized numerical linear algebra [

Woo14

]. Missing proof details in this section are presented

in Appendix A.1.

Overview of our approach.

Under the linearity assumption (Assumption 2.1), we can reformulate

the problem of estimating the ITE for every individual as simultaneously solving two linear regression

instances: one for control and one for treatment, i.e., we regress

y0,y1

. However, there are two

challenges: 1) we would like to solve these regression problems using measurements from just a small

subset of

individuals and 2) we only have access to either the control or treatment measurement

jor y1

jfor any individual in this set.

To tackle the ﬁrst challenge, we use a sampling technique based on the importance of each row

, captured via its leverage score (Defn. 2.5). Intuitively, we want to select

individuals (or

equivalently rows) that capture the entire row space of

and use them to estimate the ITE of all

other individuals. Leverage scores capture the importance of a row in making up the row space.

E.g., if a row is orthogonal to all the other rows, it’s leverage score will be the maximum value of 1.

Unfortunately, if we apply leverage score sampling independently to the regression problems for

and

, rows with high leverage leverage scores may be sampled for both instances. This presents

a problem, since we can only read at most one of

. To mitigate this issue, we construct

asmoothed matrix

X∗

, which consists of

projected onto its singular vectors with high singular

values. Intuitively, this dampens the eﬀects of high leverage score ‘outlier’ rows that don’t contribute

signiﬁcantly to the spectrum of

. Formally, we prove that the maximum leverage score of

X∗

bounded, which let’s us solve our two regression problems via independent sampling. There will be

few repeated samples across our subsets, which introduce minimal error.

3.1 Leverage Score Sampling

For some

γ≥

0, to be ﬁxed later, we deﬁne a smoothed matrix for

, the projection onto singular

vectors with high singular values, as follows:

Deﬁnition 3.1

(Smoothed matrix)

Given

X∈Rn×d

with singular value decomposition

UΣVT

let Γ

∗

be the set of indices corresponding to singular values greater than

√γ

, i.e., Γ

∗def

={i|σi

(

)

≥

√γ}

; we denote

d0def

∗|

. Let

Σ∗

(Γ

∗,

∗

)denote the principal sub-matrix of

associated

with these large singular values. Similarly, let

U∗∈Rn×d0,V∗∈Rd×d0

be the associated column

sub-matrices of Uand V. Then, we deﬁne:

X∗def

=U∗Σ∗V∗T.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

SampleConstrainedTreatmentEectEstimationRaghavendraAddankiAdobeResearchraddanki@adobe.comDavidArbourAdobeResearchdarbour26@gmail.comTungMaiAdobeResearchtumai@adobe.comCameronMuscoUniversityofMassachusettsAmherstcmusco@cs.umass.eduAnupRaoAdobeResearchanuprao@adobe.comOctober14,2022AbstractTreatmente...

展开>> 收起<<

Sample Constrained Treatment Eﬀect Estimation Raghavendra Addanki Adobe Research.pdf

共27页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Sample Constrained Treatment Eﬀect Estimation Raghavendra Addanki Adobe Research

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: