Doubly-robust and heteroscedasticity-aware sample trimming for
causal inference
Samir Khan
Stanford University
samirk@stanford.edu
Johan Ugander
Stanford University
jugander@stanford.edu
January 30, 2024
Abstract
A popular method for variance reduction in causal inference is propensity-based trimming,
the practice of removing units with extreme propensities from the sample. This practice has
theoretical grounding when the data are homoscedastic and the propensity model is parametric
(Yang and Ding, 2018; Crump et al., 2009), but in modern settings where heteroscedastic data
are analyzed with non-parametric models, existing theory fails to support current practice.
In this work, we address this challenge by developing new methods and theory for sample
trimming. Our contributions are three-fold: first, we describe novel procedures for selecting
which units to trim. Our procedures differ from previous works in that we trim not only units
with small propensities, but also units with extreme conditional variances. Second, we give
new theoretical guarantees for inference after trimming. In particular, we show how to perform
inference on the trimmed subpopulation without requiring that our regressions converge at
parametric rates. Instead, we make only fourth-root rate assumptions like those in the double
machine learning literature. This result applies to conventional propensity-based trimming as
well and thus may be of independent interest. Finally, we propose a bootstrap-based method
for constructing simultaneously valid confidence intervals for multiple trimmed sub-populations,
which are valuable for navigating the trade-off between sample size and variance reduction
inherent in trimming. We validate our methods in simulation, on the 2007-2008 National Health
and Nutrition Examination Survey, and on a semi-synthetic Medicare dataset and find promising
results in all settings.
1 Introduction
Traditional methods for estimating causal effects from observational data typically rely on two
standard assumptions: unconfoundedness and overlap (Rosenbaum and Rubin, 1984). In practice,
observational data often have limited overlap, especially in high-dimensional settings (D’Amour
et al., 2021), and this leads to extreme propensity scores and high-variance estimates of the treat-
ment effect. A large body of literature addresses this challenge by modifying the estimand to either
exclude or down-weight units with extreme propensity scores, and these methods have been widely
adopted in practice (Yang and Ding, 2018; Li et al., 2018; Crump et al., 2009). However, modern
data can pose an additional challenge in the form of heavy tails and heteroscedasticity (Burke et al.,
2019; Tripuraneni et al., 2021).
In this paper, we address this challenge by exploring sample trimming methods that reduce vari-
ance by trimming not only units with extreme propensities, but also units with extreme conditional
variances. In order to provide valid statistical inferences when using these trimming methods, we
also develop new methods for inference after sample trimming that offer greater flexibility and are
valid in a wider range of settings than previous such methods.
Motivation and interpretation. To motivate this approach, consider a single unit (X, Y, Z)
drawn from a super-population distribution, where X is a covariate vector, Y is a response, Z is
a treatment indicator, and e(X) is the probability of treatment. An inverse-propensity weighted
estimator for E[Y] is YZ/e(X), which is unbiased, but is well known to suffer from extremely high
variance when e(X) takes small values (Basu, 1971; Khan and Ugander, 2021). As such, the goal of
existing sample trimming methods is to preclude this possibility by removing units for which e(X)
takes small values, executing a change of estimand that makes what is essentially a bias–variance
trade-off. On the other hand, the inverse-propensity weighted estimate YZ/e(X) will also have
high variance if var(Y|X) is large, an issue which is not addressed by existing methods but may
be a major obstacle when var(Y|X) is extremely large for some values of X. Put simply: if we
do not believe that we can accurately estimate treatment effects on units with propensities of, say,
0.01, then we must also acknowledge that we cannot accurately estimate treatment effects on units
with conditional variances of, say, 100, and so we propose to trim these latter units as well.
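To make this concrete, here is a minimal simulation sketch (our own illustration; the data-generating process and estimator implementation are not from the paper) showing that the variance of the inverse-propensity weighted estimate blows up both when propensities are small and when conditional variances are large:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 500

def ipw_sd(e_min, sigma_max):
    """Monte Carlo standard deviation of the IPW estimate of E[Y] when propensities
    can be as small as e_min and conditional standard deviations as large as sigma_max."""
    estimates = []
    for _ in range(reps):
        x = rng.uniform(size=n)
        e = e_min + (0.5 - e_min) * x           # propensities range over [e_min, 0.5]
        sigma = 1.0 + (sigma_max - 1.0) * x     # conditional std. dev. ranges over [1, sigma_max]
        z = rng.binomial(1, e)
        y = 1.0 + sigma * rng.normal(size=n)    # E[Y | X] = 1 for every x
        estimates.append(np.mean(y * z / e))    # IPW estimate of E[Y], unbiased in all three scenarios
    return np.std(estimates)

print(ipw_sd(e_min=0.25, sigma_max=1.0))   # well-behaved baseline
print(ipw_sd(e_min=0.01, sigma_max=1.0))   # extreme propensities inflate the variance
print(ipw_sd(e_min=0.25, sigma_max=10.0))  # extreme conditional variances inflate it as well
```

Propensity-based trimming removes the units responsible for the second failure mode; the proposal here is to also remove the units responsible for the third.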
An important difference between this proposal and existing propensity-based trimming methods
is that the sub-population found by propensity-based methods can be interpreted as a population
that is likely to receive either treatment or control (sometimes called an equipoise population), and
thus may be a natural population of interest. This interpretation does not extend to variance-
based trimming methods—instead, variance-based methods can be interpreted as identifying a
small population of outliers in the data, whose behavior and response to treatment is very different
from that of other units, and trimming these units to focus on an “inlier” population on which
treatment effects can be estimated more accurately. In many cases, this inlier population is also of
natural interest, since treatment effects on the full population may be dominated by the outliers,
and the treatment effect on the inlier population may be more representative of how treatment
will affect the majority of units. We demonstrate this phenomenon, along with further interpretive
issues, as part of a data example in Section 6.3.
In general, the question of whether or not a particular subpopulation is of interest to an analyst
is dependent on both domain considerations and the level of precision with which treatment effects
for that subpopulation can be estimated. The problem of selecting a subpopulation of interest from
a set of candidates is fundamental to the trimming literature, and not unique to our work—even
with propensity trimming alone, the choice of propensity cut-off induces a similar family of sub-
populations, and choosing between those sub-populations requires a similar balancing of variance
and relevance. Our methods can be understood as more effectively navigating this trade-off be-
tween variance and relevance than existing methods, thus offering practitioners a better set of
sub-populations to choose between.
Inference after trimming. After applying any sample trimming procedure, another challenge
immediately arises: how to perform valid inference on the trimmed sub-population. Thus our second
contribution in the present work is to provide new theoretical results on inference after sample
trimming. Existing work typically makes strong rate assumptions or parametric assumptions on
the estimation of nuisance components (Crump et al., 2009; Yang and Ding, 2018), and we extend
this work by using doubly-robust estimators to show how valid inference can be performed under
weaker conditions on the estimation of nuisance components. The application of doubly-robust
estimators to this setting requires a subtle choice of estimand as well as careful handling of cross-
fitting, both of which we address.
These results apply both to our variance-based trimming and to classical propensity-based
trimming methods, thus connecting the recent literature on double machine learning and cross-
fitting with the long-standing practice of sample trimming.
Our third contribution addresses a more subtle, previously unconsidered, aspect of inference
after trimming. Roughly speaking, there are several features of a sample trimming method we
may be interested in: the amount of variance reduction offered by the trimming, the size of the
resulting sub-population, the point estimate on that sub-population, and perhaps even covariate
distributions within the sub-population. However, there is no way to smoothly navigate the trade-
offs between these considerations. If we trim the sample one way, perform inference, and find the
results unfavorable for some reason, we cannot then trim the sample another way and perform valid
inference without conditioning on the results of the first sample trimming; this is the problem of
selective inference (Taylor and Tibshirani, 2015). As a remedy, we introduce a bootstrap-based
method that allows an analyst to pre-commit to a small number of trimming methods, and then
constructs simultaneously-valid confidence intervals for the sub-populations found by each trimming
method. An analyst can then choose freely between the different sub-populations based on problem
specific considerations while retaining statistical validity. One drawback of our methods is that, in
simulations, we require relatively large sample sizes to obtain the target coverage level, meaning
that analysts should be more cautious of results in small sample sizes.
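As a rough sketch of the general idea only (the paper's actual bootstrap procedure is developed in later sections; the function and names below are our own), simultaneous coverage over a small set of pre-committed estimates can be obtained by calibrating all intervals with a single bootstrapped max-|t| critical value:

```python
import numpy as np

def simultaneous_cis(estimates, boot_estimates, alpha=0.05):
    """Simultaneous confidence intervals from a bootstrapped max-|t| statistic.

    estimates      : shape (m,)   point estimates, one per pre-committed trimming rule
    boot_estimates : shape (B, m) the same m estimates recomputed on B bootstrap resamples
    """
    se = boot_estimates.std(axis=0)                   # bootstrap standard errors
    t = np.abs(boot_estimates - estimates) / se       # studentized deviations, shape (B, m)
    crit = np.quantile(t.max(axis=1), 1 - alpha)      # critical value of the max-|t| statistic
    return np.column_stack([estimates - crit * se, estimates + crit * se])
```

Because every interval is widened by the same max-based critical value, an analyst can look at all m sub-populations and choose between them freely without losing nominal coverage; the price is intervals that are somewhat wider than the per-estimate ones.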
Response-based trimming. One potential objection to our approach is that our trimming meth-
ods will use the responses when modeling conditional variances, and thus our trimming procedures
are response-dependent. This raises two concerns, one statistical and one philosophical. From a
statistical perspective, one may be concerned that this compromises the validity of the analysis, but
we show in Section 4 that under appropriate assumptions on the fitting of the conditional variance,
our inferences remain valid despite the fact that we have used the response when trimming. From a
philosophical perspective, units with extreme responses may be the units most in need of treatment,
and should not be trimmed. However, if these extreme units are actually the ones of most interest,
then a measure like the average treatment effect is perhaps not even appropriate, since it will
account for the effect of the treatment on all other units as well. Nonetheless, because our methods
provide simultaneously valid confidence intervals across multiple sub-populations, we still provide
a point estimate and confidence interval for the average treatment effect on the full population,
including any potential units of special importance, when sample trimming.
To summarize, our work both proposes a new criterion for sample trimming based on conditional
variances and propensity scores rather than on propensity scores alone, and develops new theoretical
tools for inference after sample trimming. We validate all of our methods with experiments on
synthetic, semi-synthetic, and real data and find that our new trimming methods reduce variance
beyond what propensity-based methods alone can achieve, identify interesting sub-populations of
the full sample by removing possible outliers, and lead to statistically significant conclusions on
some of these sub-populations even when no such conclusion was possible on the full population.
1.1 Related work
Our work directly builds on the extensive sample trimming literature, and especially on Crump
et al. (2009) (which is itself a journal version of Crump et al. (2006)) and Yang and Ding (2018). We
offer a more detailed comparison with these works in Section 4, but at a high-level, we differ from
these previous works in our more complete treatment of heteroscedasticity and in assuming weaker
conditions on the modeling of nuisance components. For example, Theorem 1 of Crump et al.
(2009) calculates an optimal trimming set in the heteroscedastic case, but then quickly specializes
to the homoscedastic setting in Corollary 1, and so the main methodological work is under the
homoscedasticity assumption. In contrast, we provide a full methodological toolbox for tackling
heteroscedasticity, including allowing for complex nonparametric estimates of conditional variances,
and present simultaneous inference methods that can be used to compare subpopulations.
One prior work with a similar idea to ours is Chaudhuri and Hill (2014), which proposes to
remove units whose contribution to the inverse-propensity weighted estimator is extremely large,
which also amounts to removing units with extreme response values. However, Chaudhuri and Hill
(2014) are considering a largely different problem than us: they are not concerned with variance
minimization, consider only classical inverse-propensity weighted estimators, and do not modify
the estimand as is done in the sample trimming literature.
Our current proposal is also conceptually related to methods in robust statistics and outlier
removal. For example, removing units with large residuals from an ordinary least-squares analysis
is similar in spirit to the methods we propose here, as are other methods that identify and remove
extreme units from the data such as Rohatgi and Syrgkanis (2022). We differ from these methods
in that our motivation for dropping units is based on variance reduction, not on a contamination
model for the data, and in that we emphasize the problem of inference after dropping these units.
2 Model and notation
We adopt a potential outcomes framework with $n$ units where the tuples $(Y_i(1), Y_i(0), X_i, Z_i)$ are
i.i.d. from a super-population distribution $P$ over $\mathbb{R}^2 \times \mathcal{X} \times \{0,1\}$. We assume that $Y_i(1)$ and
$Y_i(0)$ both have finite variance and that we observe $Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$. We write
$e(x) = \operatorname{pr}(Z_i = 1 \mid X_i = x)$ for the propensity score, $\mu_w(x) = E[Y_i(w) \mid X_i = x]$, where $w \in \{0,1\}$,
for the conditional means, and $\sigma_w^2(x) = \operatorname{var}(Y_i(w) \mid X_i = x)$ for the conditional variances. We
make the standard unconfoundedness and overlap assumptions that $Z_i \perp (Y_i(1), Y_i(0)) \mid X_i$ and
$\eta \le e(x) \le 1 - \eta$ (Rosenbaum and Rubin, 1984).
Our target of inference is the sample average treatment effect (SATE) and its trimmed analogs,
$$
\tau = \frac{1}{n} \sum_{i=1}^{n} \tau(X_i), \qquad \tau_A = \frac{1}{n_A} \sum_{i=1}^{n} \tau(X_i)\, 1\{X_i \in A\},
$$
where $\tau(x) = \mu_1(x) - \mu_0(x)$ is a conditional average treatment effect (CATE), $A \subseteq \mathcal{X}$ is the subset
of covariate space we are restricting the covariates to, and $n_A = \sum_{i=1}^{n} 1\{X_i \in A\}$ is the number of
sample units whose covariates lie in $A$.
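In code, these two estimands are simply an unrestricted and a restricted average of the CATE values; a minimal sketch (our own illustration) assuming a vector of CATEs and a membership indicator for $A$:

```python
import numpy as np

def sate(tau_x):
    """Sample average treatment effect: the average of tau(X_i) over all n units."""
    return np.mean(tau_x)

def trimmed_sate(tau_x, in_A):
    """Trimmed analog tau_A: the average of tau(X_i) over the n_A units with X_i in A."""
    in_A = np.asarray(in_A, dtype=bool)
    return np.sum(np.asarray(tau_x) * in_A) / np.sum(in_A)
```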
In subsequent sections, we employ empirical process notation (Wellner et al., 2013; Kennedy,
2016). We let $W_i = (X_i, Y_i, Z_i)$ be the entire triplet we observe for unit $i$, and we write
$P_n f = \frac{1}{n} \sum_i f(W_i)$ and $P f = \int f(w)\, dP(w)$. Note that for a random function $\hat{f}$, $P\hat{f}$ is a random variable,
since we do not integrate over the randomness in $\hat{f}$. In contrast, $E[\hat{f}]$ is a deterministic quantity
that integrates out the randomness in a new sample and in $\hat{f}$. We also define the norm
$\|f\|_{L^q(P)} = (P |f|^q)^{1/q}$.
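The distinction between $P\hat{f}$ and $E[\hat{f}]$ can be seen in a few lines of simulation (an illustrative sketch of our own, with a toy "fitted" function): fixing $\hat{f}$ and averaging it over fresh draws from $P$ gives a quantity that still varies with the training sample, while additionally averaging over training samples gives a fixed number.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_fhat(train):
    """A toy fitted function: fhat(w) = w * mean(train), random only through the training sample."""
    m = train.mean()
    return lambda w: w * m

def P_fhat(fhat, fresh=100_000):
    """Approximates P fhat = integral of fhat(w) dP(w) using a large fresh sample from P = N(1, 1).
    The result is still random, because fhat itself depends on the training data."""
    return fhat(rng.normal(loc=1.0, size=fresh)).mean()

# P fhat changes from one training sample to the next ...
print([round(P_fhat(fit_fhat(rng.normal(loc=1.0, size=50))), 3) for _ in range(3)])

# ... while E[fhat], which also integrates over the training sample, is a single number (equal to 1 here).
print(round(np.mean([P_fhat(fit_fhat(rng.normal(loc=1.0, size=50))) for _ in range(300)]), 3))
```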
3 Trimming methods
In this section, we present a framework for sample trimming methods and use this framework to
propose a trimming method that accounts for conditional variances. As a starting point, recall
the result of Hirano et al. (2003) that the variance of an efficient estimator (such as the AIPW
estimator) of $\tau_A$ is given by
$$
V_A^{\mathrm{eff}} = \frac{1}{\operatorname{pr}(X \in A)^2}\, E\left[ 1\{X \in A\} \left( \frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} \right) \right]. \qquad (1)
$$
Based on (1), we can extract the key quantity that determines a unit's contribution to the asymp-
totic variance, calling it $k(x)$:
$$
k(x) = \frac{\sigma_1^2(x)}{e(x)} + \frac{\sigma_0^2(x)}{1 - e(x)}. \qquad (2)
$$
That is, if many units have large values of $k(X_i)$, then the variance of our estimate of $\tau_A$ will be
large, and vice-versa. This idea was made precise by Crump et al. (2009), who showed that, if
$\sigma_0^2(x)$ and $\sigma_1^2(x)$ are bounded, (1) is minimized for the set $A$ that thresholds $k(x)$ at a cut-off $\gamma$,
that is,
$$
\operatorname*{argmin}_{A} V_A^{\mathrm{eff}} = \{x : k(x) \le \gamma\}, \qquad (3)
$$
for some cut-off $\gamma \in \mathbb{R}$. This result motivates us to consider trimming sets $A$ that have this form,
i.e., that threshold the function $k(x)$.
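A small brute-force check of the discrete analog of this fact (our own illustration with made-up values of $k$): over all subsets of a handful of units, the subset minimizing the plug-in version of (1) is always of the form "keep the units with the smallest $k$."

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
k = np.sort(rng.exponential(size=12))   # hypothetical values of k(X_i) for 12 units, sorted for readability

def V(A):
    """Discrete analog of (1) for a uniform distribution over the units:
    (1 / pr(X in A)^2) * E[1{X in A} k(X)]."""
    p_A = len(A) / len(k)
    return k[list(A)].sum() / len(k) / p_A**2

# Brute force over every non-empty subset of units.
best_V, best_A = min((V(A), A) for r in range(1, len(k) + 1)
                     for A in itertools.combinations(range(len(k)), r))

print(sorted(best_A))               # always a prefix of the sorted k's, i.e., a threshold set {i : k_i <= gamma}
print(best_V <= V(range(len(k))))   # the optimal trimmed set never has larger variance than keeping everyone
```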
Of course, in practice, we do not have direct access to the function $k$ or the choice of $\gamma$ for which
the minimum in (3) is attained. Instead, both must be learned from the data, giving us an estimated
function $\hat{k}(x)$, an estimated cut-off $\hat{\gamma}$, and a corresponding trimming set $\hat{A} = \{x : \hat{k}(x) \le \hat{\gamma}\}$. The
difference between $\hat{A}$ and $A$ is subtle, but will play a crucial role in what follows, particularly in
our discussion of inferential issues in Section 4. We now discuss several choices for $\hat{k}$ and $\hat{\gamma}$.
3.1 Choices of $\hat{k}$
How we estimate $k(x)$ depends on what assumptions we are willing to make on $\sigma_1^2(x)$ and $\sigma_0^2(x)$.
In particular, we distinguish between two possibilities:

Homoscedasticity assumed: if we assume that $\sigma_1^2(x), \sigma_0^2(x)$ are constant in $x$ and equal to each
other, then we have that $k(x) \propto 1/(e(x)(1 - e(x)))$, and so we can estimate $k$ by first estimating
the propensity score by $\hat{e}(x)$, and then setting $\hat{k}(x) = 1/(\hat{e}(x)(1 - \hat{e}(x)))$. Note that thresholding
this choice of $\hat{k}$ is equivalent to thresholding on $\hat{e}(x)$ itself, and so recovers standard propensity
trimming (Crump et al., 2009).

Heteroscedasticity allowed: if we are not willing to make the homoscedasticity assumption, then
we must also estimate the conditional variances by $\hat{\sigma}_1^2(x), \hat{\sigma}_0^2(x)$, and then use the estimate
$$
\hat{k}(x) = \hat{\sigma}_1^2(x)/\hat{e}(x) + \hat{\sigma}_0^2(x)/(1 - \hat{e}(x)). \qquad (4)
$$
Thus, the usual propensity-based trimming corresponds to choosing $\hat{k}$ based on a homoscedas-
ticity assumption that may or may not be satisfied. In some cases, such as when $Y_i$ is binary
and so $\sigma_w^2(x)$ is bounded by $1/4$ for all $x$, deviations from this assumption may be negligible.
However, in other cases, such as when $Y_i$ is real-valued and has potentially unbounded variance,
deviations from this assumption may be significant and worth capturing. In such situations, we
propose instead trimming based on the “heteroscedasticity-aware” $\hat{k}$ defined in (4). This is in con-
trast to propensity-based trimming, which we refer to as “homoscedastic trimming” in light of the
underlying homoscedasticity assumption. Going forward, we state all of our results for general $\hat{k}$,
making them relevant to both existing (homoscedastic, propensity-based) procedures and our new
procedures.
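The following is a minimal sketch of the heteroscedasticity-aware choice in (4) (our own illustration: the specific nuisance estimators, the residual-squared regression for the conditional variances, and the clipping are assumptions rather than the paper's prescription; in particular, the paper's inferential results rely on cross-fitting, discussed in Section 4, which this sketch omits):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def heteroscedastic_khat(X, Y, Z, gamma):
    """Estimate k-hat(x) as in (4) and return it with the trimming mask {x : k-hat(x) <= gamma}."""
    e_hat = GradientBoostingClassifier().fit(X, Z).predict_proba(X)[:, 1]
    e_hat = np.clip(e_hat, 0.01, 0.99)                   # keep the plug-in weights away from 0 and 1

    k_hat = np.zeros(len(Y))
    for w, weight in [(1, e_hat), (0, 1.0 - e_hat)]:
        in_arm = (Z == w)
        mu_w = GradientBoostingRegressor().fit(X[in_arm], Y[in_arm])           # conditional mean in arm w
        resid2 = (Y[in_arm] - mu_w.predict(X[in_arm])) ** 2
        var_w = GradientBoostingRegressor().fit(X[in_arm], resid2)             # sigma_w^2(x) via squared residuals
        k_hat += np.maximum(var_w.predict(X), 0.0) / weight                    # sigma_w^2(x)/e(x) or /(1 - e(x))

    return k_hat, k_hat <= gamma
```

Setting every $\hat{\sigma}_w^2(x)$ to a common constant here reduces $\hat{k}$ to the homoscedastic choice and hence to standard propensity trimming.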