Sample Constrained Treatment Effect Estimation Raghavendra Addanki Adobe Research

2025-04-24 0 0 1.26MB 27 页 10玖币
侵权投诉
Sample Constrained Treatment Effect Estimation
Raghavendra Addanki
Adobe Research
raddanki@adobe.com
David Arbour
Adobe Research
darbour26@gmail.com
Tung Mai
Adobe Research
tumai@adobe.com
Cameron Musco
University of Massachusetts Amherst
cmusco@cs.umass.edu
Anup Rao
Adobe Research
anuprao@adobe.com
October 14, 2022
Abstract
Treatment effect estimation is a fundamental problem in causal inference. We focus on
designing efficient randomized controlled trials, to accurately estimate the effect of some treatment
on a population of
n
individuals. In particular, we study sample-constrained treatment effect
estimation, where we must select a subset of
sn
individuals from the population to experiment
on. This subset must be further partitioned into treatment and control groups. Algorithms for
partitioning the entire population into treatment and control groups, or for choosing a single
representative subset, have been well-studied. The key challenge in our setting is jointly choosing
a representative subset and a partition for that set.
We focus on both individual and average treatment effect estimation, under a linear effects
model. We give provably efficient experimental designs and corresponding estimators, by
identifying connections to discrepancy minimization and leverage-score-based sampling used in
randomized numerical linear algebra. Our theoretical results obtain a smooth transition to known
guarantees when
s
equals the population size. We also empirically demonstrate the performance
of our algorithms.
1 Introduction
Experimentation has long been held as a gold standard for inferring causal effects since one can
explicitly enforce independence between treatment assignment and other variables which influence
the outcome of interest. We consider the potential outcomes framework [
Ney23
,
Rub80
], where each
individual is associated with a control and treatment value (also called the potential outcomes) and
based on the treatment assignment, we can observe only one of these values. Efficient designs of
experimentation for estimating individual treatment effects which measure the difference between
treatment and control values for each individual, and the average treatment effect which measures
the average individual treatment effect has been well-studied [
MR12
]. In the absence of assumptions
on the functional form of the potential outcomes, the minimax optimal approach for conducting
an experiment is to assign individuals to treatment or control completely at random, without
Author ordering is alphabetical. A preliminary version of this work appeared in the proceedings of 36th Conference
on Neural Information Processing Systems (NeurIPS 2022).
1
arXiv:2210.06594v1 [cs.LG] 12 Oct 2022
consideration of baseline covariates (features) [
Kal17
]. However, by considering covariates for each
individual, and using additional assumptions of smoothness, substantial gains can be made in terms
of the variance of the treatment effect estimate via alternative assignment procedures. The most
common approach attempts to minimize imbalance, i.e., the difference between the baseline covariates
in the treatment and control groups [ADR21,Kal17,MR12].
While experimental designs that minimize imbalance increase the power of an experiment for
a given pool of subjects, there are many practical applications where the experimenter wishes to
minimize the total number of subjects who are placed into the experiment. For example, in medicine,
clinical trials may carry nontrivial risk to patients. Within industrial applications, experiments may
carry substantial costs in terms of testing changes, which decrease the quality of the user experience,
or have direct monetary costs.
In this paper, we examine the problem of selecting a subset of
s
individuals from a larger
population and assigning treatments such that the estimated treatment effect has a small error. We
consider two different estimands: individual treatment effect (ITE) and average treatment effect
(ATE).
A bit more formally, we represent the
d
-covariates of a population of
n
individuals using
XRn×d
.
We assume that the treatment and control values, denoted by
y1,y0Rn
, are functions of the
covariates, i.e.,
y1
=
f
(
X,ζ
ζ
ζ1
)and
y0
=
g
(
X,ζ
ζ
ζ0
)where
ζ
ζ
ζ0,ζ
ζ
ζ1Rn
are noise vectors. The ITE for
the
ith
individual is
y1
iy0
i
and ATE is the average of all the ITE values. We further assume a
linear model, i.e., the functions
f, g
are linear in
X
and
ζ
ζ
ζ1,ζ
ζ
ζ0
. The goal is to pick a subset of
s
individuals and partition this subset into control and treatment groups. For an individual
i
in the
treatment group, we measure
y1
i
, and for an individual
j
in the control, we measure
y0
j
. From this
small set of measurements, we seek to estimate the ITE or ATE over the full population.
Without parametric assumptions, ITE estimation is not feasible [
SJS17
]. We focus on linear
models in particular, since they are important in developing theory. E.g., in the literature on
optimal designs in active learning, much of the foundational theory is built around linear models.
Identifying estimators based on linearity assumptions is an active area of study in the causal inference
literature [HSSZ19,WDTT16].
Our setup is similar to active learning [
Set09
], where the goal is to minimize the number of
individual labels that we access for solving linear regression or other downstream tasks. The key
difference is that we must select both a subset of individuals, and for each
i
, can measure only one
of two labels:
y1
i
or
y0
i
. In particular, ITE estimation can be thought of as solving two simultaneous
active linear regression problems – one for the treatment outcomes and one for the control outcomes.
Thus, standard active learning-based approaches, such as [
CP19
,
CDL13
,
M+11
], fall short. Even
when
s
equals the population size
n
, i.e., when active learning becomes trivial, our problem does
not. We must still pick a partition of the full population into treatment and control groups. Overall,
sample constrained treatment effect estimation by designing efficient randomized controlled trials
has received little attention, compared to various approaches that use observational data, such
as [JTvA+21,QWZ21,SSS+19].
1.1 Our Contributions
For ITE estimation, we propose an algorithm using leverage score sampling [
Woo14
], which is a
popular approach to subset selection for fast linear algebraic computation. For ATE estimation, we
employ a recursive application of a covariate balancing design [
HSSZ19
]. We provide a theoretical
analysis in terms of root mean squared error (ITE) and deviation error (ATE).
2
Recall that we assume the treatment and control values are linear functions of the covariates
plus Gaussian noise, i.e.,
y1
=
Xβ
β
β1
+
ζ
ζ
ζ1
and
y0
=
Xβ
β
β0
+
ζ
ζ
ζ0
where
ζ
ζ
ζ1,ζ
ζ
ζ0Rn
have i.i.d. mean zero,
variance σ2Gaussian entries, and β
β
β1,β
β
β0Rdare coefficient vectors.
ITE estimation.
For ITE estimation, we give a randomized algorithm that selects Θ(
dlog d
)
individuals in expectation, using leverage scores, which measure the importance of an individual
based on their covariates. Our algorithm obtains, with high probability, root mean squared error
Oplog d/n ·
(
β
β
β1
+
β
β
β0
) +
σ
(see Corollary 3.7). We argue that this is optimal up to constants
and a log dfactor, even for approaches that experiment on the full population.
The key challenge in achieving this bound is to extend leverage scores to our simultaneous linear
regression setting, ensuring that we do not share samples across the treatment and control effect
estimation problems. To do this, we introduce a smoothed covariate matrix, whose leverage scores
are bounded. This ensures that, when applying independent leverage score sampling, with high
probability few individuals are randomly assigned to both control and treatment, and thus removing
such individuals from one of the groups does not introduce too much error.
ATE estimation.
For ATE estimation we give a randomized algorithm that selects at most
s
individuals for treatment/control assignment and obtains an error of
e
Oσ/s+ (
β
β
β1
+
β
β
β0
)/s
,
where
e
O
(
·
)hides logarithmic factors (see Theorem 4.3). The error decreases with increasing values
of sand when s=n, it matches state-of-the-art guarantees due to Harshaw et al. [HSSZ19].
Our algorithm for ATE estimation is based on covariate balancing. This is a popular approach
where one attempts to assign similar individuals to the treatment and control groups, to ensure
that the observed effect is attributed to the administered treatment alone. Harshaw et al. [
HSSZ19
]
designed an algorithm by minimizing the discrepancy of an augmented covariate matrix, which
achieves low ATE estimation error. To extend their approach to our setting, first, we need to select
a subset of
s
individuals that are representative of the entire population, and then balance the
covariates. Uniform sampling or importance sampling techniques give high error here. Instead, we
employ a recursive strategy, which repeatedly partitions the individuals into two subsets by balancing
covariates, and selects the smaller subset to recurse on, until we have selected at most
s
individuals.
We observe that our techniques for ITE and ATE estimation should extend to the setting when
the outcomes are non-linear functions of the covariates, which are linear in some higher-dimensional
kernel space. This is immediate for our discrepancy minimization design for ATE, which only requires
knowing the pairwise inner products of the covariate vectors. For ITE estimation, leverage score
sampling for kernel ridge regression [
AM15
] is most likely applicable. Extensions to broader classes
of non-linear models are beyond the scope of this work, but they are an interesting future direction.
Finally, in Section 5, we provide an empirical evaluation of the performance of our ITE and
ATE estimation methods, comparing against uniform sampling and other baselines on several
datasets. Our results suggest that our techniques can help reduce the costs associated with running
a randomized controlled trials substantially using only a small fraction of the population.
1.2 Other Related Work
For ATE estimation, the most well-studied approaches to experiment design are covariate balancing
and randomization. A variety of design techniques have been studied based on these approaches,
3
such as blocking [
GLSR04
], matching [
Ima08
,
Stu10
], rerandomization [
LDR18
,
MR12
], and opti-
mization [
Kal17
]. Using observational data, treatment effect estimation using covariate regression
adjustment [
Lin13
] and various active learning-based sampling techniques have gained recent atten-
tion [
JTvA+21
,
NJC+13
,
SSS+19
]. Compared to ATE, estimating ITE is significantly harder and
has received attention only recently using machine learning methods [
AI16
,
SJS17
,
WA18
]. There
has been a lot of recent work on efficient experimental designs to minimize experimental costs, in
various domains, such as causal discovery [
AKMM20
,
AMM21
,
Ebe07
,
GSKB18
,
KDV17
,
SKDV15
],
multi-arm bandits [ACD21,KW21,NPS21], and group testing [BCS+20,CCJS11,DHH00].
2 Preliminaries
Notation.
We use bold capital letters, e.g.,
X
to denote matrices and bold lowercase letters, e.g.,
y
to denote vectors. We use
X
[
i,
:] and
X
[:
, j
]to denote the
ith
row and
jth
column of
X
respectively,
which we always view as column vectors. The
ith
largest singular value of
X
is denoted by
σi
(
X
).
For any vector x, the Euclidean norm or the `2-norm is denoted by kxk.
For a population of
n
individuals, we represent each with an integer in [
n
]where we denote
[
n
]
def
={
1
,
2
,··· , n}
. Each individual
j
[
n
]is associated with a treatment and a control value,
denoted
y1
j,y0
jR+
, respectively. The vectors associated with all
n
treatment and control values
are denoted
y1
and
y0
. Additionally, each individual is associated with a
d
-dimensional covariate
vector. Combined, they comprise the rows of the covariate matrix XRn×d.
In this paper, we consider the finite population framework, where the potential outcomes of
individuals are fixed and the randomness is only due to treatment assignment [
DLM17
]. We make the
SUTVA assumption, i.e., the treatment outcome value of any individual is independent of treatment
assignments of others in the population [Wag20].
Assumption 2.1
(Linearity Assumption)
.
Under the linearity assumption, the treatment and control
values are a linear function of the covariates. Formally, for some β
β
β0,β
β
β1Rd,
y1=Xβ
β
β1+ζ
ζ
ζ1and y0=Xβ
β
β0+ζ
ζ
ζ0,
where
ζ
ζ
ζ1,ζ
ζ
ζ0Rn
are noise vectors, with each coordinate drawn independently from the Gaussian
distribution with zero mean and variance
σ2
, i.e.,
N
(0
, σ2
). We further assume that
X
is row-
normalized, i.e., kX[i, :]k ≤ 1i[n].
Definition 2.2
(Individual Treatment Effect)
.
Given a population of
n
individuals, the individual
treatment effect (ITE) of j[n]is the difference between the treatment and control values:
ITE(j)def
=y1
jy0
j.
Definition 2.3
(Average Treatment Effect)
.
Given a population of
n
individuals, the average
treatment effect (ATE), denoted by τ, is the average individual treatment effect:
τdef
=1
nX
j[n]
ITE(j) = 1
nX
j[n]
y1
jy0
j.
Definition 2.4
(Root Mean Squared Error)
.
For a set of estimated individual treatment effects,
d
ITE(j)for j[n], the root mean squared error ( RMSE) is defined as:
RMSE def
=1
n·
d
ITE(j)ITE(j)
.
4
Definition 2.5
(Leverage Score)
.
Given a matrix
XRn×d
, the leverage score of
jth
row
X
[
j,
:],
denoted by `j(X), is defined as:
`j(X)def
=X[j, :]T(XTX)+X[j, :],
where +denotes the Moore–Penrose pseudo-inverse.
3 Individual Treatment Effect Estimation
In this section, we describe our algorithm for ITE estimation. The algorithm identifies a subset of
the population to experiment on, using importance based sampling techniques, that are well-studied
in randomized numerical linear algebra [
Woo14
]. Missing proof details in this section are presented
in Appendix A.1.
Overview of our approach.
Under the linearity assumption (Assumption 2.1), we can reformulate
the problem of estimating the ITE for every individual as simultaneously solving two linear regression
instances: one for control and one for treatment, i.e., we regress
y0,y1
on
X
. However, there are two
challenges: 1) we would like to solve these regression problems using measurements from just a small
subset of
s
individuals and 2) we only have access to either the control or treatment measurement
y0
jor y1
jfor any individual in this set.
To tackle the first challenge, we use a sampling technique based on the importance of each row
in
X
, captured via its leverage score (Defn. 2.5). Intuitively, we want to select
s
individuals (or
equivalently rows) that capture the entire row space of
X
and use them to estimate the ITE of all
other individuals. Leverage scores capture the importance of a row in making up the row space.
E.g., if a row is orthogonal to all the other rows, it’s leverage score will be the maximum value of 1.
Unfortunately, if we apply leverage score sampling independently to the regression problems for
y0
and
y1
, rows with high leverage leverage scores may be sampled for both instances. This presents
a problem, since we can only read at most one of
y0
j
or
y1
j
. To mitigate this issue, we construct
asmoothed matrix
X
, which consists of
X
projected onto its singular vectors with high singular
values. Intuitively, this dampens the effects of high leverage score ‘outlier’ rows that don’t contribute
significantly to the spectrum of
X
. Formally, we prove that the maximum leverage score of
X
is
bounded, which let’s us solve our two regression problems via independent sampling. There will be
few repeated samples across our subsets, which introduce minimal error.
3.1 Leverage Score Sampling
For some
γ
0, to be fixed later, we define a smoothed matrix for
X
, the projection onto singular
vectors with high singular values, as follows:
Definition 3.1
(Smoothed matrix)
.
Given
XRn×d
with singular value decomposition
X
=
UΣVT
,
let Γ
be the set of indices corresponding to singular values greater than
γ
, i.e., Γ
def
={i|σi
(
X
)
γ}
; we denote
d0def
=|
Γ
|
. Let
Σ
=
Σ
,
Γ
)denote the principal sub-matrix of
Σ
associated
with these large singular values. Similarly, let
URn×d0,VRd×d0
be the associated column
sub-matrices of Uand V. Then, we define:
Xdef
=UΣVT.
5
摘要:

SampleConstrainedTreatmentEectEstimationRaghavendraAddankiAdobeResearchraddanki@adobe.comDavidArbourAdobeResearchdarbour26@gmail.comTungMaiAdobeResearchtumai@adobe.comCameronMuscoUniversityofMassachusettsAmherstcmusco@cs.umass.eduAnupRaoAdobeResearchanuprao@adobe.comOctober14,2022AbstractTreatmente...

展开>> 收起<<
Sample Constrained Treatment Effect Estimation Raghavendra Addanki Adobe Research.pdf

共27页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:27 页 大小:1.26MB 格式:PDF 时间:2025-04-24

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 27
客服
关注