Flexible Instrumental Variable Models With Bayesian Additive
Regression Trees
Charles Spanbauer* and Wei Pan
Division of Biostatistics, University of Minnesota, Minneapolis, MN
Abstract
Methods utilizing instrumental variables have been a fundamental statistical approach to estima-
tion in the presence of unmeasured confounding, usually occurring in non-randomized observational
data common to fields such as economics and public health. However, such methods usually make
restrictive linearity and additivity assumptions that are ill-suited to the complex modeling
challenges of today. The growing body of observational data being collected will necessitate flexible
regression modeling that can also control for confounding using instrumental variables.
Therefore, this article presents a nonlinear instrumental variable regression model based on Bayesian
regression tree ensembles to estimate such relationships, including interactions, in the presence of
confounding. One exciting application of this method is to use genetic variants as instruments,
known as Mendelian randomization. Body mass index is one factor that is hypothesized to have a
nonlinear relationship with cardiovascular risk factors such as blood pressure while interacting with
age. Heterogeneity in patient characteristics such as age could be clinically interesting from a pre-
cision medicine perspective where individualized treatment is emphasized. We present our flexible
Bayesian instrumental variable regression tree method with an example from the UK Biobank where
body mass index is related to blood pressure using genetic variants as the instruments.
Keywords: causality, genetics, instrumental variables, machine learning, Mendelian randomization
1 Introduction
The presence of unmeasured confounding, particularly in non-randomized observational data, can bias the esti-
mated effect of an exposure on an outcome and make the interpretation of scientific results difficult. Obtaining
unbiased results in such a case has traditionally been done through instrumental variable (IV) methods. IV
methods incorporate an additional variable, the instrument, to induce exposure variability that is independent
of the confounders, thereby yielding unbiased estimation of the exposure effect (Stock & Trebbi, 2003). However,
this strategy is only valid if the instrument satisfies three properties, usually known as the instrumental vari-
able assumptions. Fields such as economics and public health collect vast quantities of observational data with
confounding issues and so researchers in these fields frequently turn to IV analyses to obtain unbiased results.
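The bias induced by unmeasured confounding is easy to see in a small simulation. The sketch below is illustrative only (all parameter values and variable names are hypothetical, not from this paper): the exposure and outcome share an unobserved confounder, so a naive regression overstates the true exposure effect of 1.0.

```python
import numpy as np

# Hypothetical setup: an unmeasured confounder u drives both the
# exposure t and the outcome y; the true exposure effect is beta = 1.0.
rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                       # unmeasured confounder
t = 0.5 * u + rng.normal(size=n)             # exposure depends on u
y = 1.0 * t + 1.0 * u + rng.normal(size=n)   # outcome depends on t and u

# Naive OLS of y on t absorbs part of u's effect on y and so
# overestimates the exposure effect.
beta_ols = np.cov(t, y)[0, 1] / np.var(t)
print(beta_ols)  # well above the true value of 1.0
```

Because the confounder raises both the exposure and the outcome, the ordinary least-squares slope mixes the causal path with the confounder path.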
The three properties central to obtaining unbiased inference are as follows. First, the instruments must be
predictive of the exposure in order to induce exposure variability.
Second, the instruments must be uncorrelated with the confounders which ensures that this induced variability is
independent of the confounders. Finally, there must be no direct effect between the instruments and the outcome,
i.e. the instruments only affect the outcome through the exposure. Taken together, these properties imply that
associating the instrument-induced exposure to the outcome will be free of confounder influence, yielding an
unbiased estimate of the exposure effect that can be interpreted causally. Unfortunately, only the first of these
assumptions is testable with the gathered data. As such, finding valid instruments for a particular research
question is notoriously difficult and usually requires domain knowledge about the data-generating mechanism.
For example, a classical question in economics seeks to correctly identify the extent
to which education levels correlate with wages, but that relationship is hypothesized to be confounded by the
innate ability levels of the subjects (Card, 1999). Not only is ability level unmeasured, but there may not even
be consensus as to what constitutes innate ability. Nevertheless, IV methods can be used to assess this research
question because ability level only needs to be defined enough to justify the choice of instrument and whether
*Corresponding author, spanb008@umn.edu
arXiv:2210.01872v1 [stat.ME] 4 Oct 2022
it satisfies the IV assumptions. Geographic proximity to a two- or four-year college has been used as an
instrument in this case, the choice of which is based on domain knowledge within the economics and education
literature.
In biomedical research, IV methods have played a large and increasingly important role in the analysis of
observational data, which is often plagued by confounding. Furthermore, the volume and complexity of these datasets
are growing over time, necessitating the development of more sophisticated IV methods. See Baiocchi et al. (2014)
for an overview of IV methods in biomedical sciences. For example, the use of IV methods has recently shown
promise in the area of statistical genetics where a causal genetic relationship can be established by using genetic
variants as instruments, specifically single-nucleotide polymorphisms (SNPs), for various traits or phenotypes.
This is commonly called Mendelian randomization (MR) because genetic instruments are used as a proxy for
the true randomization that is absent in observational data. The IV assumptions in such a case are biologically
plausible as long as these SNPs are carefully chosen so as to not be associated with environmental confounders.
Most IV approaches are based on the linear two-stage least-squares (2SLS) methodology which predicts the
exposure from the instruments in a first stage regression and then uses the predicted exposure to estimate the
unbiased effect of the exposure on the outcome in a second stage regression. Based on the IV assumptions,
the variability in the predicted exposure should not be associated with the confounders and so the estimated
effect of the exposure will be unbiased. Therefore, the results from this second stage regression will have a causal
interpretation assuming the validity of the instruments. However, all 2SLS methods rely on linearity assumptions
that can be problematic (Horowitz, 2011). This is particularly true in MR methods that use genetic variants as
instruments. Work by Grinberg & Wallace (2021) suggests that linear models may not be sufficient to capture the
effect of genetic variants on different traits. Relaxing the linearity assumption could lead to improved prediction
in the first stage and improved inference via better power in the second stage. Therefore, methods that relax
linearity and additivity assumptions may prove beneficial over ones that do not in the statistical genetics context.
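The 2SLS procedure described above can be sketched in a few lines. This is a minimal illustration under assumed data (a single valid instrument and a linear truth, with illustrative parameter values), not code from the paper:

```python
import numpy as np

# Simulated data: z is a valid instrument (predicts t, independent of
# the confounder u, no direct effect on y); the true effect is 1.0.
rng = np.random.default_rng(1)
n = 100_000
z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unmeasured confounder
t = 0.8 * z + 0.5 * u + rng.normal(size=n)   # first-stage truth
y = 1.0 * t + 1.0 * u + rng.normal(size=n)   # second-stage truth

# Stage 1: regress the exposure on the instrument; keep the fitted values.
Z = np.column_stack([np.ones(n), z])
t_hat = Z @ np.linalg.lstsq(Z, t, rcond=None)[0]

# Stage 2: regress the outcome on the predicted exposure. The induced
# variability is independent of u, so the slope estimate is unbiased.
X = np.column_stack([np.ones(n), t_hat])
beta_2sls = np.linalg.lstsq(X, y, rcond=None)[0][1]
print(beta_2sls)  # close to the true effect of 1.0
```

The first stage isolates the instrument-induced variability in the exposure; regressing the outcome on those fitted values then recovers the exposure effect without the confounder bias.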
Beyond genetic variants, the traits of interest themselves may have relationships hypothesized to be nonlinear,
the estimates of which can provide a more complete picture of the underlying regression relationship. There are
a variety of examples in biomedical research (Peter et al., 2015; Scarneciu et al., 2017), including body
mass index with blood pressure, an example that is explored in Section 4 using the UK Biobank dataset. There
are additional examples where nonlinear effects have been investigated in fields including, but not limited to:
economics (Burke et al., 2015; Botev et al., 2019), marketing (Tuu & Olsen, 2010; Yin et al., 2017), and political
science (Kiel, 2000; Lipsitz & Padilla, 2021).
Homogeneity of the exposure effect has been called an implicit assumption of IV methods (Lousdal, 2018),
an assumption increasingly regarded as untenable in clinical settings whose focus is on precision medicine and
individualized treatment. Individualizing the estimation of the exposure effect by allowing flexible interactions
with patient characteristics could be incredibly useful in forming beneficial treatment strategies. Estimating the
global effect of a trait without considering age may give an incomplete picture of the results. For example, age
is known to have a heterogeneous impact on genetic risk for various diseases (Jiang et al., 2021). The National
Academy of Medicine lists “increasing evidence generation” as one of five main challenges facing researchers in
the area of precision medicine (Dzau & Ginsburg, 2016).
To this end, nonparametric IV regression, which relaxes the linearity and additivity assumptions of 2SLS, has
been a popular area of research for statisticians and econometricians. Examples of such methodologies
include Newey & Powell (2003), which is based on series approximations; Burgess et al. (2014), which stratifies
the exposure and calculates the causal effect within each stratum; and Guo & Small (2016), which adjusts for the
first stage errors in the second stage model, an approach known as control function estimation. In another example,
Chetverikov & Wilhelm (2017) impose monotonicity, which can help stabilize the high variance typically found
with series approximations. Stratifying the exposure is an effective strategy but requires the cut points to be
chosen by the analyst; this method also relies on the ratio estimator, which accommodates only a single instrument.
Finally, these methods do not allow for general heterogeneity in the exposure effect unless the interactions are
specified by the analyst.
Bayesian inference has also been used for IV models as an alternative to methods based on 2SLS or control
functions. For examples, see Lopes & Polson (2014) and Wiesenfarth et al. (2014). One benefit is the inherent
regularization that is provided by the prior which can help guard against the bias that arises when the instruments
are not predictive of the exposure, a violation of the IV assumptions. Another benefit is that both stages are
modeled simultaneously with a full probability distribution. This means that the uncertainty inherent in the
first stage will be propagated to the second stage during estimation, in contrast to 2SLS where the uncertainty
associated with the first stage predictions is not accounted for. Indeed, Bayesian inference allows the uncertainty
of any quantity of interest to be estimated when MCMC is employed to sample from the posterior distribution.
One final benefit is that Bayesian inference allows for easy extendibility through hierarchical modeling in order
to handle more complicated analysis scenarios.
To allow for both nonlinearity and heterogeneity, Bayesian additive regression trees (Chipman et al., 2010),
or BART, priors can be incorporated into the Bayesian framework for IV analysis. BART has shown promise
in flexibly estimating general regression relationships without needing any assumptions on the functional form
such as higher-order or interaction terms. This article introduces such a heterogeneous nonparametric model for
use in instrumental variable analyses, called npivBART-h, along with simple default settings for the prior so
that the method can be used easily and efficiently with a minimal learning curve. BART has seen growing usage
over the past decade, along with the development of multiple extensions to handle increasingly complex
data. For example, BART has support for many alternative outcome types such as binary, count (Murray, 2021),
and survival (Sparapani et al., 2016), including competing risks (Sparapani et al., 2020a) and recurrent events
(Sparapani et al., 2020b). Repeated measure outcomes can also be handled through a random effect specification
(Tan et al., 2018; Spanbauer & Sparapani, 2021) that is combined with a BART prior to estimate nonlinear fixed
effects. There has also been research toward applying BART in high-dimensional sparse situations (Ročková &
van der Pas, 2017; Linero, 2018). Finally, work has been done on applying BART to precision medicine using
individualized treatment rules (Logan et al., 2019). All of these extensions could easily be incorporated into
the npivBART-h framework, broadening the scope of its applicability for causal inference.
This article is organized as follows. In Section 2, the npivBART-h model is specified with advice for setting
the priors. Section 3 presents a simulation study showing the unbiased inference for these methods in the presence
of confounding. The estimation consequences of weak instruments, an IV assumption violation, are also
explored through simulation. The results from the UK Biobank data relating body mass index to blood
pressure with heterogeneity in age and sex using npivBART-h are presented in Section 4. The paper concludes
with a discussion in Section 5.
2 Instrumental Variable Analysis with BART
The IV model using Bayesian Additive Regression Trees is described in this section. A brief overview of parametric
Bayesian linear IV analysis methods is presented and then our extension to nonlinear relationships using BART
is given in Section 2.1. The default regularization priors used in traditional BART are discussed as well as how
these priors can be adapted to the IV setting in Section 2.2. Section 2.3 gives a brief treatment of simpler
semiparametric models that can estimate interpretable linear effects while Section 2.4 delves into the Dirichlet
process mixture model specification of the errors.
2.1 Linear Simultaneous Equations Model and npivBART-h
The instrumental variables strategy commonly used in Bayesian IV analyses defines a linear simultaneous equa-
tions model as
\begin{align}
t_i &= \gamma' z_i + \epsilon_{ti}, \tag{1} \\
y_i &= \beta t_i + \delta' x_i + \epsilon_{yi}, \tag{2}
\end{align}
where $\beta$ is the causal effect of interest, assuming the IV assumptions are met. The dependency between the
equations is adjusted for through the error specification $\epsilon_i = (\epsilon_{ti}, \epsilon_{yi})' \sim N_2(0_2, \Sigma)$. The covariance
between the errors controls the degree of confounding. Equation (1) represents the first stage model relating the
instruments $z_i$ to the exposure of interest $t_i$. Equation (2) represents the second stage model with the outcome
$y_i$ modeled as a linear function of the exposure and some other measured covariates $x_i$. We assume centered
outcomes and so the intercepts are omitted for simplicity. This model is defined in Rossi et al. (2012) and Lopes
& Polson (2014).
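As a concrete illustration (not taken from the paper, with illustrative parameter values), data can be generated from Equations (1) and (2) using a bivariate normal error; the off-diagonal entry of Sigma is what induces the confounding:

```python
import numpy as np

# Simulate the linear simultaneous-equations model with correlated errors.
rng = np.random.default_rng(2)
n = 50_000
gamma, beta, delta = 0.8, 1.0, 0.5
Sigma = np.array([[1.0, 0.6],
                  [0.6, 1.0]])   # cov(eps_t, eps_y) = 0.6 induces confounding

z = rng.normal(size=n)           # instrument
x = rng.normal(size=n)           # measured covariate
eps = rng.multivariate_normal(np.zeros(2), Sigma, size=n)  # (eps_t, eps_y)

t = gamma * z + eps[:, 0]              # Equation (1)
y = beta * t + delta * x + eps[:, 1]   # Equation (2)

# The error covariance shows up as correlation between the exposure t
# and the second-stage error, which is what biases naive regression.
r = np.corrcoef(t, eps[:, 1])[0, 1]
print(r)
```

The printed correlation is nonzero precisely because of the error covariance, which is why a naive regression of $y_i$ on $t_i$ would be biased while the instrument-induced part of $t_i$ remains clean.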
The goal is to relax the linearity assumption in the above model while still retaining unbiased estimation of
the causal exposure effect. This can be done with:
\begin{align}
t_i &= f_1(z_i) + \epsilon_{ti}, \tag{3} \\
y_i &= f_2(t_i, x_i) + \epsilon_{yi}, \tag{4}
\end{align}
where $\epsilon_i$ is defined as above. The model that Equations (3) and (4) specify is analogous to the above model,
but $f_2$ replaces $\beta$ as the causal exposure effect of interest. Note that the other covariates $x_i$ are incorporated
into the function $f_2$, which yields a heterogeneous causal effect that is not necessarily linear. BART is used to
place such a nonlinear prior on $f_1$ and $f_2$, which implies the following specification:
\begin{align*}
f_1(z_i) &= \sum_{h=1}^{H_t} g(z_i; T_h, M_h) \\
f_2(t_i, x_i) &= \sum_{h=1}^{H_y} g(t_i, x_i; S_h, L_h).
\end{align*}
Here, the $g$ are recursively defined piecewise-constant functions, otherwise known as regression trees. In this
definition, $T_h$ and $S_h$ represent the probabilistic structure for partitioning the predictor space in the function
$g$, while $M_h$ and $L_h$ represent the terminal node values that serve as the output of $g$. For simplicity, let
$(T, M)$ denote the set of all $H_t$ trees approximating $f_1$, and let $(S, L)$ be denoted similarly for $f_2$. In a
Bayesian sense, specifying priors on $f_1$ and $f_2$ can be done by specifying priors on $(T, M)$ and $(S, L)$,
respectively.
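The sum-of-trees form can be made concrete with a toy example. The sketch below hard-codes a few depth-one trees (stumps) and sums their outputs; actual BART instead samples the tree structures and terminal-node values from their posterior, so this only illustrates the functional form of the approximation, with all split points and leaf values invented for illustration:

```python
import numpy as np

def stump(x, split, left_val, right_val):
    """A depth-1 regression tree: one split point, two terminal-node values."""
    return np.where(x < split, left_val, right_val)

def f(x):
    # H = 3 trees, each contributing a small piece of the overall fit;
    # the ensemble prediction is simply the sum of the tree outputs.
    return (stump(x, 0.0, -0.5, 0.5)
            + stump(x, 1.0, 0.1, 0.4)
            + stump(x, -1.0, -0.2, 0.0))

print(f(np.array([-2.0, 0.5, 2.0])))
```

Summing many such weak, piecewise-constant learners is what lets the ensemble approximate smooth nonlinear surfaces and interactions without a prespecified functional form.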