Data-driven Automated Negative Control Estimation (DANCE): Search for, Validation of, and Causal Inference with Negative Controls

Erich Kummerfeld¹, Jaewon Lim², and Xu Shi³

¹ Institute for Health Informatics, University of Minnesota

² Department of Biostatistics, University of Washington

³ Department of Biostatistics, University of Michigan

Abstract

Negative control variables are increasingly used to adjust for unmeasured confounding bias in causal inference using observational data. They are typically identified by subject matter knowledge, and there is currently a severe lack of data-driven methods to find negative controls. In this paper, we present a statistical test for discovering negative controls of a special type, disconnected negative controls, that can serve as surrogates of the unmeasured confounder, and we incorporate that test into the Data-driven Automated Negative Control Estimation (DANCE) algorithm. DANCE first uses the new validation test to identify subsets of a set of candidate negative control variables that satisfy the assumptions of disconnected negative controls. It then applies a negative control method to each pair of these validated negative control variables, and aggregates the output to produce an unbiased point estimate and confidence interval for a causal effect in the presence of unmeasured confounding. We (1) prove the correctness of this validation test, and thus of DANCE; (2) demonstrate via simulation experiments that DANCE outperforms both a naive analysis that ignores unmeasured confounding and a negative control method with randomly selected candidate negative controls; and (3) demonstrate the effectiveness of DANCE on a challenging real-world problem.

Keywords: causal discovery; graphical models; negative control; unmeasured confounding; vanishing tetrad.

arXiv:2210.00528v1 [stat.ME] 2 Oct 2022

1 Introduction

There are many causal questions in science and medicine that cannot be answered with randomized experiments now or in the foreseeable future. For such questions, our best estimates must rely on observational data instead. The rich field of causal inference has developed in response, providing methods that offer some level of assurance and confidence when learning causal information from observational data (Pearl 2009, Rubin 1974). Many causal inference methods assume that there are no unmeasured common causes of treatment and outcome, but it is generally believed that in reality unmeasured confounders are widespread. This is a serious limitation of the methods that make such assumptions. One of the most frequently used approaches to mitigate unmeasured confounding is the instrumental variable (IV) approach (Angrist & Krueger 1991, Angrist et al. 1996, Hernán & Robins 2006), which has been studied extensively (Greenland 2000, Baiocchi et al. 2014, Garabedian et al. 2014, Burgess et al. 2017, Swanson et al. 2018).

A more recently developed strategy is the use of negative control (NC) methods (Lipsitch et al. 2010, Shi et al. 2020a, Tchetgen et al. 2020). Negative controls are variables associated with the unmeasured confounders but not causally related to either the treatment or outcome variables of primary interest. One can detect residual confounding bias by leveraging such known null effects: the presence of an association between the negative control and the exposure or outcome constitutes compelling evidence of residual confounding bias, while the absence of such an association implies no empirical evidence of such bias. NCs have traditionally been used to rule out non-causal explanations of empirical findings (Rosenbaum 1989, Weiss 2002, Lipsitch et al. 2010, Glass 2014). Recently, a sequence of NC methods has been developed to identify causal effects and correct for unmeasured confounding bias (Miao, Geng & Tchetgen Tchetgen 2018, Deaner 2018, Shi et al. 2020b, Singh 2020, Cui et al. 2020, Ying et al. 2021, Kallus et al. 2021, Dukes et al. 2021, Li et al. 2022).

A key challenge in the use of NC methods is that, until now, NC variables have had to be identified laboriously from background knowledge. It also had to be assumed that the identified variables were genuine NCs, as no validation test existed unless one was willing to make additional assumptions. Such situations are common in causal inference; for example, the assumption of no unmeasured confounding is also untestable. Nevertheless, we will show that under certain conditions, it is possible to leverage certain subcovariance matrix rank constraints to validate a particular class of NC variables satisfying a specific causal structural model. We refer to this class as disconnected NCs and define it formally in Section 2.1.
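As a rough illustration of what a covariance rank constraint looks like, the sketch below checks the classical vanishing tetrad constraint named in the keywords. It is not the validation test of Section 3. It assumes a hypothetical linear model in which four observed variables share a single latent cause; the loadings, sample size, and the helper name `tetrad_differences` are illustrative choices, not taken from the paper.

```python
import numpy as np

def tetrad_differences(cov):
    """Return the three tetrad differences of a 4x4 covariance matrix."""
    s = cov
    return np.array([
        s[0, 1] * s[2, 3] - s[0, 2] * s[1, 3],
        s[0, 1] * s[2, 3] - s[0, 3] * s[1, 2],
        s[0, 2] * s[1, 3] - s[0, 3] * s[1, 2],
    ])

rng = np.random.default_rng(0)
n = 200_000
u = rng.normal(size=n)                       # latent common cause (assumed)
loadings = np.array([0.9, 0.7, -0.6, 0.8])   # hypothetical effects of U
x = u[:, None] * loadings + rng.normal(size=(n, 4))

# One latent cause, no other links: every tetrad difference is near zero.
clean = tetrad_differences(np.cov(x, rowvar=False))

# Add a direct effect of variable 0 on variable 1; some tetrads no longer vanish.
x_bad = x.copy()
x_bad[:, 1] += 0.5 * x_bad[:, 0]
broken = tetrad_differences(np.cov(x_bad, rowvar=False))

print("single latent cause:", np.round(clean, 3))
print("with a direct edge: ", np.round(broken, 3))
```

In this toy setting the off-diagonal covariances factor as products of loadings, which is why the differences cancel. A direct link between two of the observed variables breaks that factorization.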

In this paper, we use some lesser-known theory on the relationship between subcovariance matrix rank constraints and the graphical structure of causal models to provide both theory and algorithms for evaluating NC variables. First, we provide a statistical test that can be used to determine whether the variables in a triplet of candidate NCs are genuine disconnected NCs. Second, we provide a simple algorithm for searching among a set of candidate NCs and identifying subsets of those variables that collectively meet the conditions of being disconnected NCs. Third, we combine our proposed method for finding valid NC variables with a recently developed double-NC method for causal inference (Miao, Shi & Tchetgen Tchetgen 2018, Shi et al. 2020b, Cui et al. 2020), creating an algorithm that accurately estimates and makes inferences about causal effects from observational data. We refer to the proposed method as the Data-driven Automated Negative Control Estimation (DANCE) algorithm. We prove that our proposed methods are correct under fairly general assumptions, evaluate their finite sample performance with a series of numerical experiments, and demonstrate their usability on a real-world data set.

The rest of the paper is organized as follows. In Section 2 we review the three main topics that the work in this paper builds upon: negative controls, structural models, and rank constraints. We then present a statistical validation test for disconnected NCs in Section 3, and prove its correctness in Section 3.3. Section 4 presents an algorithm that searches a set of candidate NC variables to find sets of disconnected NCs that pass the validation test. Section 5 presents the DANCE algorithm, which combines this search with the double-NC method to construct an all-in-one method for producing a valid causal effect estimate from a data set containing a collection of candidate NC variables, some of which are not necessarily valid disconnected NCs. Section 6 presents numerical experiments to evaluate our proposed test and algorithms, and compares them to two methods: a simple regression method ignoring unmeasured confounding, and a random selection of candidate NCs followed by the double-NC method. An application of DANCE to a real clinical data set is described in Section 7. Section 8 summarizes the strengths and limitations of the methods presented in this paper, and points towards promising directions for future work.

2 Background

2.1 Unmeasured Confounding and Negative Control Methods

We adopt the potential outcome framework under the Stable Unit Treatment Value Assumption (SUTVA) (Rubin 1974, 1980, Cox 1992) and let (O(1), O(0)) denote the pair of potential outcomes under treatment and control conditions, respectively. We are interested in estimating the average treatment effect (ATE), defined as ∆ = E[O(1) − O(0)]. It suffices to identify the counterfactual mean E[O(t)] for t ∈ {0, 1}. Let O denote the observed outcome and T denote the binary treatment. We suppress measured covariates for simplicity; adjustment for measured covariates is discussed in Section 5.1.

Instead of making the no unmeasured confounding assumption, we allow the presence of an unmeasured confounder U with a latent ignorability assumption that O(t) ⊥⊥ T | U. If U were measured, then E[O(t)] would be identified under the ignorability assumption (Robins 1986). However, when U is unobserved and not adjusted for, ATE estimation will be biased. In this case, additional information is needed to identify and make inferences about the ATE.
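The contrast between adjusting and not adjusting for U can be seen in a small simulation. The sketch below is only an illustration under an assumed data-generating model: the coefficients, the threshold rule for T, and the linear outcome equation are hypothetical, and the "oracle" regression is possible only because the simulation can see U.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
true_ate = 1.0                                   # hypothetical value of Delta

u = rng.normal(size=n)                           # unmeasured confounder U
t = (u + rng.normal(size=n) > 0).astype(float)   # binary treatment T depends on U
o = true_ate * t + 2.0 * u + rng.normal(size=n)  # O(t) = true_ate * t + 2U + noise

# Naive contrast: difference in observed means, ignoring U.
naive = o[t == 1].mean() - o[t == 0].mean()

# Oracle adjustment: regress O on (1, T, U), as if U had been measured.
design = np.column_stack([np.ones(n), t, u])
oracle = np.linalg.lstsq(design, o, rcond=None)[0][1]

print(f"true ATE: {true_ate:.2f}  naive: {naive:.2f}  oracle: {oracle:.2f}")
```

Here O(t) ⊥⊥ T | U holds by construction, so the regression that includes U recovers the ATE, while the naive contrast absorbs the effect of U on both T and O.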

An increasingly popular approach to mitigate bias due to unmeasured confounding is to use its proxies. For example, as shown in Figure 1, if U can be measured with error via proxy variables Z and W, then one can leverage Z and W to identify the confounding bias due to U and remove such bias from the estimated causal effect. Such proxy variables have been referred to as negative controls (Lipsitch et al. 2010, Shi et al. 2020a). Formally, a negative control outcome, denoted as W, is a variable known not to be causally affected by the treatment of interest. Likewise, a negative control exposure, denoted as Z, is a variable known not to causally affect the outcome of interest. The negative control exposure and outcome variables should share a confounding mechanism with the exposure and outcome variables of primary interest. In summary, Z and W satisfy

(T, Z) ⊥⊥ (O(t), W) | U.    (1)

There are a number of causal graphs that satisfy the NC assumptions (Shi et al. 2020a). For example, both a valid instrumental variable independent of the unmeasured confounder and an invalid instrumental variable associated with the unmeasured confounder are valid negative control exposures. Alternative directed acyclic graphs encoding the NC assumptions are available in Shi et al. (2020a).

Figure 1 presents a special case where Z and W are causally related to neither the treatment nor the outcome of interest, hence Z and W can serve as either negative control exposure or negative control outcome (Shi et al. 2020b, Tchetgen et al. 2020). We refer to such a special class of NC variables as the disconnected NCs. Formally, the disconnected NCs satisfy the following assumption:

(Z, W) ⊥⊥ (T, O) | U.

Compared to the fundamental NC assumption (1), the disconnected NCs satisfy the additional assumptions that Z ⊥⊥ T | U and W ⊥⊥ O | U.
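To make the disconnected NC condition concrete, the following sketch simulates one variable that satisfies it and one that does not, then inspects partial correlations given U. This is an oracle check that is only possible in simulation, since U is unobserved in practice, and a zero partial correlation is only a linear proxy for conditional independence. The model, coefficients, and the helper `partial_corr` are assumptions made for illustration.

```python
import numpy as np

def partial_corr(a, b, given):
    """Correlation between a and b after linearly regressing out `given`."""
    g = np.column_stack([np.ones(len(given)), given])
    res_a = a - g @ np.linalg.lstsq(g, a, rcond=None)[0]
    res_b = b - g @ np.linalg.lstsq(g, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]

rng = np.random.default_rng(2)
n = 200_000
u = rng.normal(size=n)                           # unmeasured confounder U
t = (u + rng.normal(size=n) > 0).astype(float)   # treatment T
o = 1.0 * t + 2.0 * u + rng.normal(size=n)       # outcome O
z = 0.8 * u + rng.normal(size=n)                 # child of U only: disconnected NC
v = 0.8 * u + 0.7 * t + rng.normal(size=n)       # also affected by T: not disconnected

marginal_z_o = np.corrcoef(z, o)[0, 1]   # nonzero: Z carries information about U
z_t_given_u = partial_corr(z, t, u)      # near zero
z_o_given_u = partial_corr(z, o, u)      # near zero
v_t_given_u = partial_corr(v, t, u)      # clearly nonzero

print(f"corr(Z, O)      = {marginal_z_o:.3f}")
print(f"corr(Z, T | U)  = {z_t_given_u:.3f}")
print(f"corr(Z, O | U)  = {z_o_given_u:.3f}")
print(f"corr(V, T | U)  = {v_t_given_u:.3f}")
```

The variable Z is associated with T and O only through U, which is the pattern the assumption (Z, W) ⊥⊥ (T, O) | U describes. The variable V keeps a dependence on T even after U is accounted for.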

[Figure 1 graphic: nodes T (treatment), O (outcome), U (unmeasured confounders), Z (negative control), and W (negative control).]

Figure 1: Causal graph of two disconnected NCs, Z and W, suppressing the measured covariates X, which are implicitly conditioned on in all arguments.
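For a model with the variables of Figure 1, a linear special case shows how a pair of NCs can remove confounding bias. The sketch below uses a two-stage least squares procedure of the kind used in the proximal causal inference literature. It is not the DANCE algorithm and not the nonparametric identification result; it assumes a hypothetical linear structural model with made-up coefficients, and the helper `ols` is introduced only for this example.

```python
import numpy as np

def ols(y, *columns):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(design, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 200_000
true_effect = 1.0                                    # hypothetical effect of T on O

u = rng.normal(size=n)                               # unmeasured confounder U
t = (u + rng.normal(size=n) > 0).astype(float)       # treatment T
o = true_effect * t + 2.0 * u + rng.normal(size=n)   # outcome O
z = 0.8 * u + rng.normal(size=n)                     # NC used as exposure proxy
w = 0.9 * u + rng.normal(size=n)                     # NC used as outcome proxy

# Naive regression of O on T, ignoring U.
naive = ols(o, t)[1]

# Stage 1: linear prediction of W from (T, Z).
a = ols(w, t, z)
w_hat = a[0] + a[1] * t + a[2] * z

# Stage 2: regress O on (T, predicted W); the coefficient on T is the estimate.
two_stage = ols(o, t, w_hat)[1]

print(f"true: {true_effect:.2f}  naive: {naive:.2f}  two-stage: {two_stage:.2f}")
```

The reasoning behind the second stage is that, under this linear model, the linear projections of both O − true_effect·T and W onto (T, Z) are proportional to the projection of U, so the predicted W stands in for the unobserved confounder. The estimate is noisier than the oracle adjustment because Z and W are imperfect proxies.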

Using a pair of negative control exposure and outcome variables, referred to as the double-NC, Miao, Geng & Tchetgen Tchetgen (2018) established nonparametric identification of the average treatment effect (ATE). Intuitively, having additional children of U that are conditionally independent of T and O allows for identification of the unmeasured confounding bias due to the influence of U on T and O, and subsequently this quantity can be removed from the association between T and O, leaving an unbiased estimate of T's effect on O. Recently, the NC framework has been extended to proximal causal inference, which partitions measured covariates into proxies satisfying NC conditions, acknowledging that covariate measurements are at best proxies of the underlying confounding mechanisms (Tchetgen et al.
