Doubly-robust and heteroscedasticity-aware sample trimming for
causal inference
Samir Khan
Stanford University
samirk@stanford.edu
Johan Ugander
Stanford University
jugander@stanford.edu
January 30, 2024
Abstract
A popular method for variance reduction in causal inference is propensity-based trimming,
the practice of removing units with extreme propensities from the sample. This practice has
theoretical grounding when the data are homoscedastic and the propensity model is parametric
(Yang and Ding, 2018; Crump et al., 2009), but in modern settings where heteroscedastic data
are analyzed with non-parametric models, existing theory fails to support current practice.
In this work, we address this challenge by developing new methods and theory for sample
trimming. Our contributions are three-fold: first, we describe novel procedures for selecting
which units to trim. Our procedures differ from previous works in that we trim not only units
with small propensities, but also units with extreme conditional variances. Second, we give
new theoretical guarantees for inference after trimming. In particular, we show how to perform
inference on the trimmed subpopulation without requiring that our regressions converge at
parametric rates. Instead, we make only fourth-root rate assumptions like those in the double
machine learning literature. This result applies to conventional propensity-based trimming as
well and thus may be of independent interest. Finally, we propose a bootstrap-based method
for constructing simultaneously valid confidence intervals for multiple trimmed sub-populations,
which are valuable for navigating the trade-off between sample size and variance reduction
inherent in trimming. We validate our methods in simulation, on the 2007-2008 National Health
and Nutrition Examination Survey, and on a semi-synthetic Medicare dataset and find promising
results in all settings.
1 Introduction
Traditional methods for estimating causal effects from observational data typically rely on two
standard assumptions: unconfoundedness and overlap (Rosenbaum and Rubin, 1984). In practice,
observational data often have limited overlap, especially in high-dimensional settings (D’Amour
et al., 2021), and this leads to extreme propensity scores and high-variance estimates of the treat-
ment effect. A large body of literature addresses this challenge by modifying the estimand to either
exclude or down-weight units with extreme propensity scores, and these methods have been widely
adopted in practice (Yang and Ding, 2018; Li et al., 2018; Crump et al., 2009). However, modern
data can pose an additional challenge in the form of heavy tails and heteroscedasticity (Burke et al.,
2019; Tripuraneni et al., 2021).
In this paper, we address this challenge by exploring sample trimming methods that reduce vari-
ance by trimming not only units with extreme propensities, but also units with extreme conditional
variances. In order to provide valid statistical inferences when using these trimming methods, we
also develop new methods for inference after sample trimming that offer greater flexibility and are
valid in a wider range of settings than previous such methods.
Motivation and interpretation. To motivate this approach, consider a single unit (X, Y, Z)
drawn from a super-population distribution, where X is a covariate vector, Y is a response, Z is
a treatment indicator, and e(X) is the probability of treatment. An inverse-propensity weighted
estimator for E[Y] is YZ/e(X), which is unbiased, but is well known to suffer from extremely high
variance when e(X) takes small values (Basu, 1971; Khan and Ugander, 2021). As such, the goal of
existing sample trimming methods is to preclude this possibility by removing units for which e(X)
takes small values, executing a change of estimand that makes what is essentially a bias–variance
trade-off. On the other hand, the inverse-propensity weighted estimate YZ/e(X) will also have
high variance if var(Y|X) is large, an issue which is not addressed by existing methods but may
be a major obstacle when var(Y|X) is extremely large for some values of X. Put simply: if we
do not believe that we can accurately estimate treatment effects on units with propensities of, say,
0.01, then we must also acknowledge that we cannot accurately estimate treatment effects on units
with conditional variances of, say, 100, and so we propose to trim these latter units as well.
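To make this concrete, here is a minimal simulation sketch (our own illustration; the data-generating process and estimator implementation are not from the paper) showing that the variance of the inverse-propensity weighted estimate blows up both when propensities are small and when conditional variances are large:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 2000, 500

def ipw_sd(e_min, sigma_max):
    """Monte Carlo standard deviation of the IPW estimate of E[Y] when propensities
    can be as small as e_min and conditional standard deviations as large as sigma_max."""
    estimates = []
    for _ in range(reps):
        x = rng.uniform(size=n)
        e = e_min + (0.5 - e_min) * x           # propensities range over [e_min, 0.5]
        sigma = 1.0 + (sigma_max - 1.0) * x     # conditional std. dev. ranges over [1, sigma_max]
        z = rng.binomial(1, e)
        y = 1.0 + sigma * rng.normal(size=n)    # E[Y | X] = 1 for every x
        estimates.append(np.mean(y * z / e))    # IPW estimate of E[Y], unbiased in all three scenarios
    return np.std(estimates)

print(ipw_sd(e_min=0.25, sigma_max=1.0))   # well-behaved baseline
print(ipw_sd(e_min=0.01, sigma_max=1.0))   # extreme propensities inflate the variance
print(ipw_sd(e_min=0.25, sigma_max=10.0))  # extreme conditional variances inflate it as well
```

Propensity-based trimming removes the units responsible for the second failure mode; the proposal here is to also remove the units responsible for the third.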
An important difference between this proposal and existing propensity-based trimming methods
is that the sub-population found by propensity-based methods can be interpreted as a population
that is likely to receive either treatment or control (sometimes called an equipoise population), and
thus may be a natural population of interest. This interpretation does not extend to variance-
based trimming methods—instead, variance-based methods can be interpreted as identifying a
small population of outliers in the data, whose behavior and response to treatment is very different
from that of other units, and trimming these units to focus on an “inlier” population on which
treatment effects can be estimated more accurately. In many cases, this inlier population is also of
natural interest, since treatment effects on the full population may be dominated by the outliers,
and the treatment effect on the inlier population may be more representative of how treatment
will affect the majority of units. We demonstrate this phenomenon, along with further interpretive
issues, as part of a data example in Section 6.3.
In general, the question of whether or not a particular subpopulation is of interest to an analyst
is dependent on both domain considerations and the level of precision with which treatment effects
for that subpopulation can be estimated. The problem of selecting a subpopulation of interest from
a set of candidates is fundamental to the trimming literature, and not unique to our work—even
with propensity trimming alone, the choice of propensity cut-off induces a similar family of sub-
populations, and choosing between those sub-populations requires a similar balancing of variance
and relevance. Our methods can be understood as more effectively navigating this trade-off be-
tween variance and relevance than existing methods, thus offering practitioners a better set of
sub-populations to choose between.
Inference after trimming. After applying any sample trimming procedure, another challenge
immediately arises: how to perform valid inference on the trimmed sub-population. Thus our second
contribution in the present work is to provide new theoretical results on inference after sample
trimming. Existing work typically makes strong rate assumptions or parametric assumptions on
the estimation of nuisance components (Crump et al., 2009; Yang and Ding, 2018), and we extend
this work by using doubly-robust estimators to show how valid inference can be performed under
weaker conditions on the estimation of nuisance components. The application of doubly-robust
estimators to this setting requires a subtle choice of estimand as well as careful handling of cross-
fitting, both of which we address.
These results apply both to our variance-based trimming and to classical propensity-based
trimming methods, thus connecting the recent literature on double machine learning and cross-
fitting with the long-standing practice of sample trimming.
Our third contribution addresses a more subtle, previously unconsidered, aspect of inference
after trimming. Roughly speaking, there are several features of a sample trimming method we
may be interested in: the amount of variance reduction offered by the trimming, the size of the
resulting sub-population, the point estimate on that sub-population, and perhaps even covariate
distributions within the sub-population. However, there is no way to smoothly navigate the trade-
offs between these considerations. If we trim the sample one way, perform inference, and find the
results unfavorable for some reason, we cannot then trim the sample another way and perform valid
inference without conditioning on the results of the first sample trimming; this is the problem of
selective inference (Taylor and Tibshirani, 2015). As a remedy, we introduce a bootstrap-based
method that allows an analyst to pre-commit to a small number of trimming methods, and then
constructs simultaneously-valid confidence intervals for the sub-populations found by each trimming
method. An analyst can then choose freely between the different sub-populations based on problem
specific considerations while retaining statistical validity. One drawback of our methods is that, in
simulations, we require relatively large sample sizes to obtain the target coverage level, meaning
that analysts should be more cautious of results in small sample sizes.
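As a rough sketch of the general idea only (the paper's actual bootstrap procedure is developed in later sections; the function and names below are our own), simultaneous coverage over a small set of pre-committed estimates can be obtained by calibrating all intervals with a single bootstrapped max-|t| critical value:

```python
import numpy as np

def simultaneous_cis(estimates, boot_estimates, alpha=0.05):
    """Simultaneous confidence intervals from a bootstrapped max-|t| statistic.

    estimates      : shape (m,)   point estimates, one per pre-committed trimming rule
    boot_estimates : shape (B, m) the same m estimates recomputed on B bootstrap resamples
    """
    se = boot_estimates.std(axis=0)                   # bootstrap standard errors
    t = np.abs(boot_estimates - estimates) / se       # studentized deviations, shape (B, m)
    crit = np.quantile(t.max(axis=1), 1 - alpha)      # critical value of the max-|t| statistic
    return np.column_stack([estimates - crit * se, estimates + crit * se])
```

Because every interval is widened by the same max-based critical value, an analyst can look at all m sub-populations and choose between them freely without losing nominal coverage; the price is intervals that are somewhat wider than the per-estimate ones.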
Response-based trimming. One potential objection to our approach is that our trimming meth-
ods will use the responses when modeling conditional variances, and thus our trimming procedures
are response-dependent. This raises two concerns, one statistical and one philosophical. From a
statistical perspective, one may be concerned that this compromises the validity of the analysis, but
we show in Section 4 that under appropriate assumptions on the fitting of the conditional variance,
our inferences remain valid despite the fact that we have used the response when trimming. From a
philosophical perspective, units with extreme responses may be the units most in need of treatment,
and should not be trimmed. However, if these extreme units are actually the ones of most interest,
then a measure like the average treatment effect is perhaps not even appropriate, since it will
account for the effect of the treatment on all other units as well. Nonetheless, because our methods
provide simultaneously valid confidence intervals across multiple sub-populations, we still provide
a point estimate and confidence interval for the average treatment effect on the full population,
including any potential units of special importance, when sample trimming.
To summarize, our work both proposes a new criterion for sample trimming based on conditional
variances and propensity scores rather than on propensity scores alone, and develops new theoretical
tools for inference after sample trimming. We validate all of our methods with experiments on
synthetic, semi-synthetic, and real data and find that our new trimming methods reduce variance
beyond what propensity-based methods alone can achieve, identify interesting sub-populations of
the full sample by removing possible outliers, and lead to statistically significant conclusions on
some of these sub-populations even when no such conclusion was possible on the full population.
1.1 Related work
Our work directly builds on the extensive sample trimming literature, and especially on Crump
et al. (2009) (which is itself a journal version of Crump et al. (2006)) and Yang and Ding (2018). We
offer a more detailed comparison with these works in Section 4, but at a high-level, we differ from
these previous works in our more complete treatment of heteroscedasticity and in assuming weaker
conditions on the modeling of nuisance components. For example, Theorem 1 of Crump et al.
(2009) calculates an optimal trimming set in the heteroscedastic case, but then quickly specializes
to the homoscedastic setting in Corollary 1, and so the main methodological work is under the
homoscedasticity assumption. In contrast, we provide a full methodological toolbox for tackling
heteroscedasticity, including allowing for complex nonparametric estimates of conditional variances,
and present simultaneous inference methods that can be used to compare subpopulations.
One prior work with a similar idea to ours is Chaudhuri and Hill (2014), which proposes to
remove units whose contribution to the inverse-propensity weighted estimator is extremely large,
which also amounts to removing units with extreme response values. However, Chaudhuri and Hill
(2014) are considering a largely different problem than us: they are not concerned with variance
minimization, consider only classical inverse-propensity weighted estimators, and do not modify
the estimand as is done in the sample trimming literature.
Our current proposal is also conceptually related to methods in robust statistics and outlier
removal. For example, removing units with large residuals from an ordinary least-squares analysis
is similar in spirit to the methods we propose here, as are other methods that identify and remove
extreme units from the data such as Rohatgi and Syrgkanis (2022). We differ from these methods
in that our motivation for dropping units is based on variance reduction, not on a contamination
model for the data, and in that we emphasize the problem of inference after dropping these units.
2 Model and notation
We adopt a potential outcomes framework with $n$ units where the tuples $(Y_i(1), Y_i(0), X_i, Z_i)$ are
i.i.d. from a super-population distribution $P$ over $\mathbb{R}^2 \times \mathcal{X} \times \{0,1\}$. We assume that $Y_i(1)$ and
$Y_i(0)$ both have finite variance and that we observe $Y_i = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$. We write
$e(x) = \operatorname{pr}(Z_i = 1 \mid X_i = x)$ for the propensity score, $\mu_w(x) = E[Y_i(w) \mid X_i = x]$, where $w \in \{0,1\}$,
for the conditional means, and $\sigma_w^2(x) = \operatorname{var}(Y_i(w) \mid X_i = x)$ for the conditional variances. We
make the standard unconfoundedness and overlap assumptions that $Z_i \perp (Y_i(1), Y_i(0)) \mid X_i$ and
$\eta \le e(x) \le 1 - \eta$ (Rosenbaum and Rubin, 1984).
Our target of inference is the sample average treatment effect (SATE) and its trimmed analogs,
$$
\tau = \frac{1}{n} \sum_{i=1}^{n} \tau(X_i), \qquad \tau_A = \frac{1}{n_A} \sum_{i=1}^{n} \tau(X_i)\, 1\{X_i \in A\},
$$
where $\tau(x) = \mu_1(x) - \mu_0(x)$ is a conditional average treatment effect (CATE), $A \subseteq \mathcal{X}$ is the subset
of covariate space we are restricting the covariates to, and $n_A = \sum_{i=1}^{n} 1\{X_i \in A\}$ is the number of
sample units whose covariates lie in $A$.
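In code, these two estimands are simply an unrestricted and a restricted average of the CATE values; a minimal sketch (our own illustration) assuming a vector of CATEs and a membership indicator for $A$:

```python
import numpy as np

def sate(tau_x):
    """Sample average treatment effect: the average of tau(X_i) over all n units."""
    return np.mean(tau_x)

def trimmed_sate(tau_x, in_A):
    """Trimmed analog tau_A: the average of tau(X_i) over the n_A units with X_i in A."""
    in_A = np.asarray(in_A, dtype=bool)
    return np.sum(np.asarray(tau_x) * in_A) / np.sum(in_A)
```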
In subsequent sections, we employ empirical process notation (Wellner et al., 2013; Kennedy,
2016). We let $W_i = (X_i, Y_i, Z_i)$ be the entire triplet we observe for unit $i$, and we write
$P_n f = \frac{1}{n} \sum_i f(W_i)$ and $P f = \int f(w)\, dP(w)$. Note that for a random function $\hat{f}$, $P\hat{f}$ is a random variable,
since we do not integrate over the randomness in $\hat{f}$. In contrast, $E[\hat{f}]$ is a deterministic quantity
that integrates out the randomness in a new sample and in $\hat{f}$. We also define the norm
$\|f\|_{L^q(P)} = (P |f|^q)^{1/q}$.
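The distinction between $P\hat{f}$ and $E[\hat{f}]$ can be seen in a few lines of simulation (an illustrative sketch of our own, with a toy "fitted" function): fixing $\hat{f}$ and averaging it over fresh draws from $P$ gives a quantity that still varies with the training sample, while additionally averaging over training samples gives a fixed number.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_fhat(train):
    """A toy fitted function: fhat(w) = w * mean(train), random only through the training sample."""
    m = train.mean()
    return lambda w: w * m

def P_fhat(fhat, fresh=100_000):
    """Approximates P fhat = integral of fhat(w) dP(w) using a large fresh sample from P = N(1, 1).
    The result is still random, because fhat itself depends on the training data."""
    return fhat(rng.normal(loc=1.0, size=fresh)).mean()

# P fhat changes from one training sample to the next ...
print([round(P_fhat(fit_fhat(rng.normal(loc=1.0, size=50))), 3) for _ in range(3)])

# ... while E[fhat], which also integrates over the training sample, is a single number (equal to 1 here).
print(round(np.mean([P_fhat(fit_fhat(rng.normal(loc=1.0, size=50))) for _ in range(300)]), 3))
```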
3 Trimming methods
In this section, we present a framework for sample trimming methods and use this framework to
propose a trimming method that accounts for conditional variances. As a starting point, recall
the result of Hirano et al. (2003) that the variance of an efficient estimator (such as the AIPW
estimator) of $\tau_A$ is given by
$$
V_A^{\mathrm{eff}} = \frac{1}{\operatorname{pr}(X \in A)^2}\, E\left[ 1\{X \in A\} \left( \frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} \right) \right]. \qquad (1)
$$
Based on (1), we can extract the key quantity that determines a unit's contribution to the asymp-
totic variance, calling it $k(x)$:
$$
k(x) = \frac{\sigma_1^2(x)}{e(x)} + \frac{\sigma_0^2(x)}{1 - e(x)}. \qquad (2)
$$
That is, if many units have large values of $k(X_i)$, then the variance of our estimate of $\tau_A$ will be
large, and vice-versa. This idea was made precise by Crump et al. (2009), who showed that, if
$\sigma_0^2(x)$ and $\sigma_1^2(x)$ are bounded, (1) is minimized for the set $A$ that thresholds $k(x)$ at a cut-off $\gamma$,
that is,
$$
\operatorname*{argmin}_{A} V_A^{\mathrm{eff}} = \{x : k(x) \le \gamma\}, \qquad (3)
$$
for some cut-off $\gamma \in \mathbb{R}$. This result motivates us to consider trimming sets $A$ that have this form,
i.e., that threshold the function $k(x)$.
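A small brute-force check of the discrete analog of this fact (our own illustration with made-up values of $k$): over all subsets of a handful of units, the subset minimizing the plug-in version of (1) is always of the form "keep the units with the smallest $k$."

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
k = np.sort(rng.exponential(size=12))   # hypothetical values of k(X_i) for 12 units, sorted for readability

def V(A):
    """Discrete analog of (1) for a uniform distribution over the units:
    (1 / pr(X in A)^2) * E[1{X in A} k(X)]."""
    p_A = len(A) / len(k)
    return k[list(A)].sum() / len(k) / p_A**2

# Brute force over every non-empty subset of units.
best_V, best_A = min((V(A), A) for r in range(1, len(k) + 1)
                     for A in itertools.combinations(range(len(k)), r))

print(sorted(best_A))               # always a prefix of the sorted k's, i.e., a threshold set {i : k_i <= gamma}
print(best_V <= V(range(len(k))))   # the optimal trimmed set never has larger variance than keeping everyone
```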
Of course, in practice, we do not have direct access to the function $k$ or the choice of $\gamma$ for which
the minimum in (3) is attained. Instead, both must be learned from the data, giving us an estimated
function $\hat{k}(x)$, an estimated cut-off $\hat{\gamma}$, and a corresponding trimming set $\hat{A} = \{x : \hat{k}(x) \le \hat{\gamma}\}$. The
difference between $\hat{A}$ and $A$ is subtle, but will play a crucial role in what follows, particularly in
our discussion of inferential issues in Section 4. We now discuss several choices for $\hat{k}$ and $\hat{\gamma}$.
3.1 Choices of $\hat{k}$
How we estimate $k(x)$ depends on what assumptions we are willing to make on $\sigma_1^2(x)$ and $\sigma_0^2(x)$.
In particular, we distinguish between two possibilities:

Homoscedasticity assumed: if we assume that $\sigma_1^2(x), \sigma_0^2(x)$ are constant in $x$ and equal to each
other, then we have that $k(x) \propto 1/(e(x)(1 - e(x)))$, and so we can estimate $k$ by first estimating
the propensity score by $\hat{e}(x)$, and then setting $\hat{k}(x) = 1/(\hat{e}(x)(1 - \hat{e}(x)))$. Note that thresholding
this choice of $\hat{k}$ is equivalent to thresholding on $\hat{e}(x)$ itself, and so recovers standard propensity
trimming (Crump et al., 2009).

Heteroscedasticity allowed: if we are not willing to make the homoscedasticity assumption, then
we must also estimate the conditional variances by $\hat{\sigma}_1^2(x), \hat{\sigma}_0^2(x)$, and then use the estimate
$$
\hat{k}(x) = \hat{\sigma}_1^2(x)/\hat{e}(x) + \hat{\sigma}_0^2(x)/(1 - \hat{e}(x)). \qquad (4)
$$
Thus, the usual propensity-based trimming corresponds to choosing $\hat{k}$ based on a homoscedas-
ticity assumption that may or may not be satisfied. In some cases, such as when $Y_i$ is binary
and so $\sigma_w^2(x)$ is bounded by $1/4$ for all $x$, deviations from this assumption may be negligible.
However, in other cases, such as when $Y_i$ is real-valued and has potentially unbounded variance,
deviations from this assumption may be significant and worth capturing. In such situations, we
propose instead trimming based on the “heteroscedasticity-aware” $\hat{k}$ defined in (4). This is in con-
trast to propensity-based trimming, which we refer to as “homoscedastic trimming” in light of the
underlying homoscedasticity assumption. Going forward, we state all of our results for general $\hat{k}$,
making them relevant to both existing (homoscedastic, propensity-based) procedures and our new
procedures.
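The following is a minimal sketch of the heteroscedasticity-aware choice in (4) (our own illustration: the specific nuisance estimators, the residual-squared regression for the conditional variances, and the clipping are assumptions rather than the paper's prescription; in particular, the paper's inferential results rely on cross-fitting, discussed in Section 4, which this sketch omits):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def heteroscedastic_khat(X, Y, Z, gamma):
    """Estimate k-hat(x) as in (4) and return it with the trimming mask {x : k-hat(x) <= gamma}."""
    e_hat = GradientBoostingClassifier().fit(X, Z).predict_proba(X)[:, 1]
    e_hat = np.clip(e_hat, 0.01, 0.99)                   # keep the plug-in weights away from 0 and 1

    k_hat = np.zeros(len(Y))
    for w, weight in [(1, e_hat), (0, 1.0 - e_hat)]:
        in_arm = (Z == w)
        mu_w = GradientBoostingRegressor().fit(X[in_arm], Y[in_arm])           # conditional mean in arm w
        resid2 = (Y[in_arm] - mu_w.predict(X[in_arm])) ** 2
        var_w = GradientBoostingRegressor().fit(X[in_arm], resid2)             # sigma_w^2(x) via squared residuals
        k_hat += np.maximum(var_w.predict(X), 0.0) / weight                    # sigma_w^2(x)/e(x) or /(1 - e(x))

    return k_hat, k_hat <= gamma
```

Setting every $\hat{\sigma}_w^2(x)$ to a common constant here reduces $\hat{k}$ to the homoscedastic choice and hence to standard propensity trimming.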