it satisfies the IV assumptions. Geographic proximity to a two- or four-year college has been used as an
instrument in this setting, a choice based on domain knowledge from the economics and education
literature.
In biomedical research, IV methods have played a large and increasingly important role in the analysis of
observational data, which is often plagued by confounding. Furthermore, the volume and complexity of these datasets
are growing over time, necessitating the development of more sophisticated IV methods. See Baiocchi et al. (2014)
for an overview of IV methods in the biomedical sciences. For example, IV methods have recently shown
promise in statistical genetics, where causal relationships can be established by using genetic
variants, specifically single-nucleotide polymorphisms (SNPs), as instruments for various traits or phenotypes.
This is commonly called Mendelian randomization (MR) because genetic instruments are used as a proxy for
the true randomization that is absent in observational data. The IV assumptions in such a case are biologically
plausible as long as these SNPs are carefully chosen so as to not be associated with environmental confounders.
Most IV approaches are based on the linear two-stage least-squares (2SLS) methodology which predicts the
exposure from the instruments in a first stage regression and then uses the predicted exposure to estimate the
unbiased effect of the exposure on the outcome in a second stage regression. Based on the IV assumptions,
the variability in the predicted exposure should not be associated with the confounders and so the estimated
effect of the exposure will be unbiased. Therefore, the results from this second stage regression will have a causal
interpretation assuming the validity of the instruments. However, all 2SLS methods rely on linearity assumptions
that can be problematic (Horowitz, 2011). This is particularly true in MR methods that use genetic variants as
instruments. Work by Grinberg & Wallace (2021) suggests that linear models may not be sufficient to capture the
effect of genetic variants on different traits. Relaxing the linearity assumption could lead to improved prediction
in the first stage and improved inference via better power in the second stage. Therefore, methods that relax
linearity and additivity assumptions may prove beneficial over ones that do not in the statistical genetics context.
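As a concrete illustration of the 2SLS logic described above, the following sketch uses a hypothetical simulation (not drawn from any dataset in this article; all coefficients are illustrative) to show how regressing the outcome on the first-stage prediction removes the confounding bias that afflicts a naive regression:

```python
import numpy as np

# Hypothetical simulation: u is an unobserved confounder of the exposure x
# and outcome y; z is a valid instrument (predictive of x, independent of u).
rng = np.random.default_rng(0)
n = 50_000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument
x = 0.8 * z + u + rng.normal(size=n)         # exposure (confounded by u)
y = 2.0 * x + 1.5 * u + rng.normal(size=n)   # true causal effect of x is 2.0

def slope(design, response):
    """Least-squares coefficient on the last column of `design`."""
    return np.linalg.lstsq(design, response, rcond=None)[0][-1]

ones = np.ones(n)
Z = np.column_stack([ones, z])

# Naive OLS of y on x is biased upward because u drives both x and y.
beta_ols = slope(np.column_stack([ones, x]), y)

# 2SLS: stage 1 predicts x from z; stage 2 regresses y on that prediction,
# whose variability is (by the IV assumptions) unrelated to u.
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
beta_2sls = slope(np.column_stack([ones, x_hat]), y)
```

In this design `beta_ols` is biased upward, while `beta_2sls` centers on the true effect of 2.0.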
Beyond genetic variants, the relationships among the traits of interest themselves may be hypothesized to be
nonlinear, and estimating such relationships can provide a more complete picture of the underlying regression relationship. There are
a variety of examples in biomedical research (Peter et al., 2015; Scarneciu et al., 2017), including the relationship
between body mass index and blood pressure, an example explored in Section 4 using the UK Biobank dataset. There
are additional examples where nonlinear effects have been investigated in fields including, but not limited to:
economics (Burke et al., 2015; Botev et al., 2019), marketing (Tuu & Olsen, 2010; Yin et al., 2017), and political
science (Kiel, 2000; Lipsitz & Padilla, 2021).
Homogeneity of the exposure effect has been called an implicit assumption of IV methods (Lousdal, 2018),
an assumption increasingly regarded as untenable in clinical settings whose focus is on precision medicine and
individualized treatment. Individualizing the estimation of the exposure effect by allowing flexible interactions
with patient characteristics could be incredibly useful in forming beneficial treatment strategies. Estimating the
global effect of a trait without considering age may give an incomplete picture of the results. For example, age
is known to have a heterogeneous impact on genetic risk for various diseases (Jiang et al., 2021). The National
Academy of Medicine lists “increasing evidence generation” as one of five main challenges facing researchers in
the area of precision medicine (Dzau & Ginsburg, 2016).
To this end, nonparametric IV regression, which relaxes the linearity and additivity assumptions of 2SLS, has
been a popular area of research for statisticians and econometricians. Examples of such methodologies
include Newey & Powell (2003), which is based on series approximations; Burgess et al. (2014), which stratifies
the exposure and calculates the causal effect within each stratum; and Guo & Small (2016), which adjusts for the
first-stage errors in the second-stage model, an approach known as control function estimation. In another example,
Chetverikov & Wilhelm (2017) impose monotonicity, which can help stabilize the high variance typically found
with series approximations. Stratifying the exposure is an effective strategy but requires the cut points to be
chosen by the analyst; this method also relies on the ratio estimator, which accommodates only a single instrument.
Finally, these methods do not allow for general heterogeneity in the exposure effect unless the interactions are
specified by the analyst.
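A minimal sketch of the control function idea mentioned above, on a hypothetical simulated design with illustrative coefficients (not taken from any cited study), is to include the first-stage residuals as an extra regressor in the second stage:

```python
import numpy as np

# Hypothetical simulation: u confounds exposure x and outcome y; z is a
# valid instrument. The true causal effect of x on y is 2.0.
rng = np.random.default_rng(1)
n = 50_000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * x + 1.5 * u + rng.normal(size=n)

ones = np.ones(n)
Z = np.column_stack([ones, z])

# Stage 1: the residuals v_hat capture the endogenous part of x.
v_hat = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Stage 2: regress y on x AND v_hat; including the first-stage errors as a
# regressor "controls" for the confounding, so the coefficient on x is
# again close to the true value.
beta_cf = np.linalg.lstsq(np.column_stack([ones, x, v_hat]), y,
                          rcond=None)[0][1]
```

In the linear Gaussian case this control function estimate coincides with 2SLS; its appeal is that the residual-adjustment idea extends more naturally to nonlinear second-stage models.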
Bayesian inference has also been used for IV models as an alternative to methods based on 2SLS or control
functions. For examples, see Lopes & Polson (2014) and Wiesenfarth et al. (2014). One benefit is the inherent
regularization that is provided by the prior which can help guard against the bias that arises when the instruments
are not predictive of the exposure, a violation of the IV assumptions. Another benefit is that both stages are
modeled simultaneously with a full probability distribution. This means that the uncertainty inherent in the
first stage will be propagated to the second stage during estimation, in contrast to 2SLS where the uncertainty
associated with the first stage predictions is not accounted for. Indeed, Bayesian inference allows the uncertainty
of any quantity of interest to be estimated when MCMC is employed to sample from the posterior distribution.
One final benefit is that Bayesian inference allows for easy extensibility through hierarchical modeling in order
to handle more complicated analysis scenarios.
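To make the propagation point concrete, a crude Monte Carlo sketch on simulated data (illustrative coefficients; not a full Bayesian analysis, where both stages would be modeled jointly) draws first-stage coefficients from their approximate sampling distribution and re-fits the second stage, so the spread of estimates reflects first-stage uncertainty that a plug-in 2SLS point estimate ignores:

```python
import numpy as np

# Hypothetical simulation; a modest n keeps first-stage uncertainty visible.
rng = np.random.default_rng(2)
n = 2_000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.5 * z + u + rng.normal(size=n)
y = 2.0 * x + 1.5 * u + rng.normal(size=n)   # true effect of x is 2.0

ones = np.ones(n)
Z = np.column_stack([ones, z])
g_hat = np.linalg.lstsq(Z, x, rcond=None)[0]
sigma2 = np.sum((x - Z @ g_hat) ** 2) / (n - 2)
cov_g = sigma2 * np.linalg.inv(Z.T @ Z)  # approx. covariance of stage-1 fit

# Draw plausible first-stage coefficient vectors and re-fit stage 2; the
# spread of `betas` reflects uncertainty inherited from the first stage.
betas = []
for _ in range(1_000):
    g = rng.multivariate_normal(g_hat, cov_g)
    x_hat = Z @ g
    betas.append(np.linalg.lstsq(np.column_stack([ones, x_hat]), y,
                                 rcond=None)[0][1])
betas = np.asarray(betas)
```

The standard deviation of `betas` is strictly positive even though the data are fixed, which is exactly the component of uncertainty that a plug-in 2SLS point estimate discards and that joint Bayesian estimation retains automatically.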
To allow for both nonlinearity and heterogeneity, Bayesian additive regression trees (Chipman et al., 2010),
or BART, priors can be incorporated into the Bayesian framework for IV analysis. BART has shown promise
in flexibly estimating general regression relationships without requiring assumptions on the functional form,
such as higher-order or interaction terms. This article introduces such a heterogeneous nonparametric model for
use in instrumental variable analyses called npivBART-h, along with simple default settings for the prior so