Flexible Instrumental Variable Models With Bayesian Additive
Regression Trees
Charles Spanbauer* and Wei Pan
Division of Biostatistics, University of Minnesota, Minneapolis, MN
Abstract
Methods utilizing instrumental variables have been a fundamental statistical approach to estima-
tion in the presence of unmeasured confounding, usually occurring in non-randomized observational
data common to fields such as economics and public health. However, such methods usually make
restrictive linearity and additivity assumptions that are ill-suited to the complex modeling
challenges of today. The growing body of observational data being collected will necessitate flexible
regression modeling that can also control for confounding using instrumental variables.
Therefore, this article presents a nonlinear instrumental variable regression model based on Bayesian
regression tree ensembles to estimate such relationships, including interactions, in the presence of
confounding. One exciting application of this method is to use genetic variants as instruments,
known as Mendelian randomization. Body mass index is one factor that is hypothesized to have a
nonlinear relationship with cardiovascular risk factors such as blood pressure while interacting with
age. Heterogeneity in patient characteristics such as age could be clinically interesting from a pre-
cision medicine perspective where individualized treatment is emphasized. We present our flexible
Bayesian instrumental variable regression tree method with an example from the UK Biobank where
body mass index is related to blood pressure using genetic variants as the instruments.
Keywords: causality, genetics, instrumental variables, machine learning, Mendelian randomization
1 Introduction
The presence of unmeasured confounding, particularly in non-randomized observational data, can bias the esti-
mated effect of an exposure on an outcome and make the interpretation of scientific results difficult. Obtaining
unbiased results in such a case has traditionally been done through instrumental variable (IV) methods. IV
methods incorporate an additional variable, the instrument, to induce exposure variability that is independent
of the confounders, thereby yielding unbiased estimation of the exposure effect (Stock & Trebbi, 2003). However,
this strategy is only valid if the instrument satisfies three properties, usually known as the instrumental vari-
able assumptions. Fields such as economics and public health collect vast quantities of observational data with
confounding issues and so researchers in these fields frequently turn to IV analyses to obtain unbiased results.
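The bias induced by unmeasured confounding is easy to see in a small simulation. The sketch below is illustrative only (all parameter values and variable names are hypothetical, not from this paper): the exposure and outcome share an unobserved confounder, so a naive regression overstates the true exposure effect of 1.0.

```python
import numpy as np

# Hypothetical setup: an unmeasured confounder u drives both the
# exposure t and the outcome y; the true exposure effect is beta = 1.0.
rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                       # unmeasured confounder
t = 0.5 * u + rng.normal(size=n)             # exposure depends on u
y = 1.0 * t + 1.0 * u + rng.normal(size=n)   # outcome depends on t and u

# Naive OLS of y on t absorbs part of u's effect on y and so
# overestimates the exposure effect.
beta_ols = np.cov(t, y)[0, 1] / np.var(t)
print(beta_ols)  # well above the true value of 1.0
```

Because the confounder raises both the exposure and the outcome, the ordinary least-squares slope mixes the causal path with the confounder path.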
The three properties central to obtaining unbiased inference are as follows. First, the instruments must be
predictive of the exposure in order to induce exposure variability.
Second, the instruments must be uncorrelated with the confounders which ensures that this induced variability is
independent of the confounders. Finally, there must be no direct effect between the instruments and the outcome,
i.e. the instruments only affect the outcome through the exposure. Taken together, these properties imply that
associating the instrument-induced exposure to the outcome will be free of confounder influence, yielding an
unbiased estimate of the exposure effect that can be interpreted causally. Unfortunately, only the first of these
assumptions is testable with the gathered data. As such, finding valid instruments for a particular research
question is notoriously difficult and usually requires domain knowledge about the data-generating mechanism.
For example, a classical question in economics seeks to correctly identify the extent
to which education levels correlate with wages, but that relationship is hypothesized to be confounded by the
innate ability levels of the subjects (Card, 1999). Not only is ability level unmeasured, but there may not even
be consensus as to what constitutes innate ability. Nevertheless, IV methods can be used to assess this research
question because ability level only needs to be defined enough to justify the choice of instrument and whether
*Corresponding author, spanb008@umn.edu
arXiv:2210.01872v1 [stat.ME] 4 Oct 2022
it satisfies the IV assumptions. Geographic proximity to a two- or four-year college has been used as an
instrument in this case, the choice of which is based on domain knowledge within the economics and education
literature.
In biomedical research, IV methods have played a large and increasingly important role in the analysis of
observational data, which is often plagued by confounding. Furthermore, the volume and complexity of these datasets
are growing over time, necessitating the development of more sophisticated IV methods. See Baiocchi et al. (2014)
for an overview of IV methods in biomedical sciences. For example, the use of IV methods has recently shown
promise in the area of statistical genetics where a causal genetic relationship can be established by using genetic
variants as instruments, specifically single-nucleotide polymorphisms (SNPs), for various traits or phenotypes.
This is commonly called Mendelian randomization (MR) because genetic instruments are used as a proxy for
the true randomization that is absent in observational data. The IV assumptions in such a case are biologically
plausible as long as these SNPs are carefully chosen so as to not be associated with environmental confounders.
Most IV approaches are based on the linear two-stage least-squares (2SLS) methodology which predicts the
exposure from the instruments in a first stage regression and then uses the predicted exposure to estimate the
unbiased effect of the exposure on the outcome in a second stage regression. Based on the IV assumptions,
the variability in the predicted exposure should not be associated with the confounders and so the estimated
effect of the exposure will be unbiased. Therefore, the results from this second stage regression will have a causal
interpretation assuming the validity of the instruments. However, all 2SLS methods rely on linearity assumptions
that can be problematic (Horowitz, 2011). This is particularly true in MR methods that use genetic variants as
instruments. Work by Grinberg & Wallace (2021) suggests that linear models may not be sufficient to capture the
effect of genetic variants on different traits. Relaxing the linearity assumption could lead to improved prediction
in the first stage and improved inference via better power in the second stage. Therefore, methods that relax
linearity and additivity assumptions may prove beneficial over ones that do not in the statistical genetics context.
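The 2SLS procedure described above can be sketched in a few lines. This is a minimal illustration under assumed data (a single valid instrument and a linear truth, with illustrative parameter values), not code from the paper:

```python
import numpy as np

# Simulated data: z is a valid instrument (predicts t, independent of
# the confounder u, no direct effect on y); the true effect is 1.0.
rng = np.random.default_rng(1)
n = 100_000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unmeasured confounder
t = 0.8 * z + 0.5 * u + rng.normal(size=n)   # first-stage truth
y = 1.0 * t + 1.0 * u + rng.normal(size=n)   # second-stage truth

# Stage 1: regress the exposure on the instrument; keep the fitted values.
Z = np.column_stack([np.ones(n), z])
t_hat = Z @ np.linalg.lstsq(Z, t, rcond=None)[0]

# Stage 2: regress the outcome on the predicted exposure. The induced
# variability is independent of u, so the slope estimate is unbiased.
X = np.column_stack([np.ones(n), t_hat])
beta_2sls = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(beta_2sls)  # close to the true effect of 1.0
```

The first stage isolates the instrument-induced variability in the exposure; regressing the outcome on those fitted values then recovers the exposure effect without the confounder bias.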
Beyond genetic variants, the traits of interest themselves may have relationships hypothesized to be nonlinear,
the estimates of which can provide a more complete picture of the underlying regression relationship. There are
a variety of examples in biomedical research (Peter et al., 2015; Scarneciu et al., 2017), including body
mass index with blood pressure, an example that is explored in Section 4 using the UK Biobank dataset. There
are additional examples where nonlinear effects have been investigated in fields including, but not limited to:
economics (Burke et al., 2015; Botev et al., 2019), marketing (Tuu & Olsen, 2010; Yin et al., 2017), and political
science (Kiel, 2000; Lipsitz & Padilla, 2021).
Homogeneity of the exposure effect has been called an implicit assumption of IV methods (Lousdal, 2018),
an assumption increasingly regarded as untenable in clinical settings whose focus is on precision medicine and
individualized treatment. Individualizing the estimation of the exposure effect by allowing flexible interactions
with patient characteristics could be incredibly useful in forming beneficial treatment strategies. Estimating the
global effect of a trait without considering age may give an incomplete picture of the results. For example, age
is known to have a heterogeneous impact on genetic risk for various diseases (Jiang et al., 2021). The National
Academy of Medicine lists “increasing evidence generation” as one of five main challenges facing researchers in
the area of precision medicine (Dzau & Ginsburg, 2016).
To this end, nonparametric IV regression, which relaxes the linearity and additivity assumptions of 2SLS, has
been a popular area of research for statisticians and econometricians. Examples of such methodologies
include Newey & Powell (2003), which is based on series approximations; Burgess et al. (2014), which stratifies
the exposure and calculates the causal effect within each stratum; and Guo & Small (2016), which adjusts for the
first stage errors in the second stage model, an approach known as control function estimation. In another example,
Chetverikov & Wilhelm (2017) impose monotonicity, which can help stabilize the high variance typically found
with series approximations. Stratifying the exposure is an effective strategy but requires the cut points to be
chosen by the analyst; this method also relies on the ratio estimator, which accommodates only a single instrument.
Finally, these methods do not allow for general heterogeneity in the exposure effect unless the interactions are
specified by the analyst.
Bayesian inference has also been used for IV models as an alternative to methods based on 2SLS or control
functions. For examples, see Lopes & Polson (2014) and Wiesenfarth et al. (2014). One benefit is the inherent
regularization that is provided by the prior which can help guard against the bias that arises when the instruments
are not predictive of the exposure, a violation of the IV assumptions. Another benefit is that both stages are
modeled simultaneously with a full probability distribution. This means that the uncertainty inherent in the
first stage will be propagated to the second stage during estimation, in contrast to 2SLS where the uncertainty
associated with the first stage predictions is not accounted for. Indeed, Bayesian inference allows the uncertainty
of any quantity of interest to be estimated when MCMC is employed to sample from the posterior distribution.
One final benefit is that Bayesian inference allows for easy extendibility through hierarchical modeling in order
to handle more complicated analysis scenarios.
To allow for both nonlinearity and heterogeneity, Bayesian additive regression trees (Chipman et al., 2010),
or BART, priors can be incorporated into the Bayesian framework for IV analysis. BART has shown promise
in flexibly estimating general regression relationships without needing any assumptions on the functional form
such as higher-order or interaction terms. This article introduces such a heterogeneous nonparametric model for
use in instrumental variable analyses, called npivBART-h, along with simple default settings for the prior so
that the method can be used easily and efficiently with a minimal learning curve. BART has seen growing usage
over the past decade, along with the development of multiple extensions to handle increasingly complex
data. For example, BART has support for many alternative outcome types such as binary, count (Murray, 2021),
and survival (Sparapani et al., 2016), including competing risks (Sparapani et al., 2020a) and recurrent events
(Sparapani et al., 2020b). Repeated measure outcomes can also be handled through a random effect specification
(Tan et al., 2018; Spanbauer & Sparapani, 2021) that is combined with a BART prior to estimate nonlinear fixed
effects. There has also been research toward applying BART in high-dimensional sparse situations (Ročková &
van der Pas, 2017; Linero, 2018). Finally, work has been done on applying BART to precision medicine using
individualized treatment rules (Logan et al., 2019). All of these extensions could easily be incorporated into
the npivBART-h framework, broadening the scope of its applicability for causal inference.
This article is organized as follows. In Section 2, the npivBART-h model is specified with advice for setting
the priors. Section 3 presents a simulation study showing the unbiased inference for these methods in the presence
of confounding. The estimation consequences of weak instruments, an IV assumption violation, are also
explored through simulation. The results from the UK Biobank data relating body mass index to blood
pressure with heterogeneity in age and sex using npivBART-h are presented in Section 4. The paper concludes
with a discussion in Section 5.
2 Instrumental Variable Analysis with BART
The IV model using Bayesian Additive Regression Trees is described in this section. A brief overview of parametric
Bayesian linear IV analysis methods is presented and then our extension to nonlinear relationships using BART
is given in Section 2.1. The default regularization priors used in traditional BART are discussed as well as how
these priors can be adapted to the IV setting in Section 2.2. Section 2.3 gives a brief treatment of simpler
semiparametric models that can estimate interpretable linear effects while Section 2.4 delves into the Dirichlet
process mixture model specification of the errors.
2.1 Linear Simultaneous Equations Model and npivBART-h
The instrumental variables strategy commonly used in Bayesian IV analyses defines a linear simultaneous equa-
tions model as
\begin{align}
t_i &= \gamma' z_i + \epsilon_{ti}, \tag{1} \\
y_i &= \beta t_i + \delta' x_i + \epsilon_{yi}, \tag{2}
\end{align}
where $\beta$ is the causal effect of interest, assuming the IV assumptions are met. The dependency between the
equations is adjusted for through the error specification $\epsilon_i = (\epsilon_{ti}, \epsilon_{yi})' \sim N_2(0_2, \Sigma)$. The covariance
between the errors controls the degree of confounding. Equation (1) represents the first stage model relating the
instruments $z_i$ to the exposure of interest $t_i$. Equation (2) represents the second stage model with the outcome
$y_i$ modeled as a linear function of the exposure and some other measured covariates $x_i$. We assume centered
outcomes and so the intercepts are omitted for simplicity. This model is defined in Rossi et al. (2012) and Lopes
& Polson (2014).
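As a concrete illustration (not taken from the paper, with illustrative parameter values), data can be generated from Equations (1) and (2) using a bivariate normal error; the off-diagonal entry of Sigma is what induces the confounding:

```python
import numpy as np

# Simulate the linear simultaneous-equations model with correlated errors.
rng = np.random.default_rng(2)
n = 50_000
gamma, beta, delta = 0.8, 1.0, 0.5
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])   # cov(eps_t, eps_y) = 0.6 induces confounding

z = rng.normal(size=n)           # instrument
x = rng.normal(size=n)           # measured covariate
eps = rng.multivariate_normal(np.zeros(2), Sigma, size=n)  # (eps_t, eps_y)

t = gamma * z + eps[:, 0]              # Equation (1)
y = beta * t + delta * x + eps[:, 1]   # Equation (2)

# The error covariance shows up as correlation between the exposure t
# and the second-stage error, which is what biases naive regression.
r = np.corrcoef(t, eps[:, 1])[0, 1]
print(r)
```

The printed correlation is nonzero precisely because of the error covariance, which is why a naive regression of $y_i$ on $t_i$ would be biased while the instrument-induced part of $t_i$ remains clean.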
The goal is to relax the linearity assumption in the above model while still retaining unbiased estimation of
the causal exposure effect. This can be done with:
\begin{align}
t_i &= f_1(z_i) + \epsilon_{ti}, \tag{3} \\
y_i &= f_2(t_i, x_i) + \epsilon_{yi}, \tag{4}
\end{align}
where $\epsilon_i$ is defined as above. The model that Equations (3) and (4) specify is analogous to the above model,
but $f_2$ replaces $\beta$ as the causal exposure effect of interest. Note that the other covariates $x_i$ are incorporated
into the function $f_2$, which yields a heterogeneous causal effect that is not necessarily linear. BART is used to
place such a nonlinear prior on $f_1$ and $f_2$, which implies the following specification:
\begin{align*}
f_1(z_i) &= \sum_{h=1}^{H_t} g(z_i; T_h, M_h) \\
f_2(t_i, x_i) &= \sum_{h=1}^{H_y} g(t_i, x_i; S_h, L_h).
\end{align*}
Here, the $g$ are recursively defined piecewise-constant functions, otherwise known as regression trees. In this
definition, $T_h$ and $S_h$ represent the probabilistic structure for partitioning the predictor space in the function
$g$, while $M_h$ and $L_h$ represent the terminal node values that serve as the output of $g$. For simplicity, let
$(T, M)$ denote the set of all $H_t$ trees approximating $f_1$, and let $(S, L)$ be denoted similarly for $f_2$. In a
Bayesian sense, specifying priors on $f_1$ and $f_2$ can be done by specifying priors on $(T, M)$ and $(S, L)$,
respectively.
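The sum-of-trees form can be made concrete with a toy example. The sketch below hard-codes a few depth-one trees (stumps) and sums their outputs; actual BART instead samples the tree structures and terminal-node values from their posterior, so this only illustrates the functional form of the approximation, with all split points and leaf values invented for illustration:

```python
import numpy as np

def stump(x, split, left_val, right_val):
    """A depth-1 regression tree: one split point, two terminal-node values."""
    return np.where(x < split, left_val, right_val)

def f(x):
    # H = 3 trees, each contributing a small piece of the overall fit;
    # the ensemble prediction is simply the sum of the tree outputs.
    return (stump(x, 0.0, -0.5, 0.5)
            + stump(x, 1.0, 0.1, 0.4)
            + stump(x, -1.0, -0.2, 0.0))

print(f(np.array([-2.0, 0.5, 2.0])))
```

Summing many such weak, piecewise-constant learners is what lets the ensemble approximate smooth nonlinear surfaces and interactions without a prespecified functional form.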