
Experiment-Selector CV-TMLE A PREPRINT
Yet combining these data types comes with the risk of introducing bias from multiple sources, including measurement
error, selection bias, and confounding (Bareinboim and Pearl, 2016). Bareinboim and Pearl (2016) previously defined a
structural causal model-based framework for causal inference when multiple data sources are utilized, a setting known
as data fusion. Using directed acyclic graphs (DAGs), this framework helps researchers understand what assumptions,
testable or untestable, must be made in order to identify a causal effect from the combined data. Once observational
data are introduced, randomization no longer guarantees that there are no unmeasured common causes of the
intervention and the outcome in the pooled data. Furthermore, causal identification of the
average treatment effect is generally precluded if the conditional expectations of the counterfactual outcomes given the
measured covariates are different for those in the trial compared to those in the RWD (Rudolph and van der Laan, 2017).
Such a difference can occur for a number of reasons, including changes in medical care over time or health benefits
simply from being enrolled in a clinical trial (Ghadessi et al., 2020; Viele et al., 2014; Pocock, 1976; Chow et al.,
2013). In 1976, Stuart Pocock developed a set of criteria for evaluating whether historical control groups are sufficiently
comparable to trial controls such that the suspected bias of combining data sources would be small (Pocock, 1976). We
are not limited to historical information, however, but could also incorporate data from prospectively followed cohorts
in established health care systems. These and other considerations proposed by subsequent authors are vital when
designing a hybrid randomized-RWD study to make included populations similar, minimize measurement error, and
measure relevant confounding factors (FDA, 2018; Franklin et al., 2020; Ghadessi et al., 2020).
Despite careful consideration of an appropriate real-world control group, the possibility of bias remains, casting doubt
on whether effect estimates from combined RCT-RWD analyses are truly causal. A growing number of data fusion
estimators — discussed in Related Literature below — attempt to estimate the bias from including RWD in order
to decide whether to incorporate RWD or how to weight RWD in a combined analysis. A key insight from this
literature is that there is an inherent tradeoff between maximizing power when unbiased external data are available
and maintaining close to nominal coverage across the spectrum of potential magnitudes of RWD bias (Chen et al.,
2021; Oberst et al., 2022). The strengths and limitations of existing methods led us to consider an alternate approach to
augmenting the control arm of an RCT with external data, one that incorporates multiple estimates of bias both to boost
potential power gains and to provide robust inference when necessary identification assumptions are violated. Framing the
decision of whether to integrate RWD (and by extension, which RWD to integrate) as a problem of data-adaptive
experiment selection, we develop a novel cross-validated targeted maximum likelihood estimator for this context that 1)
incorporates an estimate of the average treatment effect on a negative control outcome (NCO) into the bias estimate,
2) uses cross-validation to separate bias estimation from effect estimation, and 3) constructs confidence intervals by
sampling from the estimated limit distribution of this estimator, where the sampling process includes an estimate of the
bias, further promoting accurate inference.
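To fix intuition about role 1), recall that a negative control outcome is one known a priori to be unaffected by treatment, so its true average treatment effect is zero. The following stylized display (an illustration of the logic only, not the exact form of our estimator) shows how the estimated effect on the NCO, $\hat{\Psi}_{\text{NCO}}$, signals bias:

```latex
\[
  \hat{\Psi}_{\text{NCO}} \approx 0
  \;\Longrightarrow\;
  \text{combining RCT and RWD controls has introduced little bias,}
\]
\[
  \hat{\Psi}_{\text{NCO}} \not\approx 0
  \;\Longrightarrow\;
  \text{residual confounding or selection bias is likely,}
\]
```

so that $\hat{\Psi}_{\text{NCO}}$ may be folded into an estimate of the bias introduced by including RWD.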
The remainder of this paper is organized as follows. In Section 2, we discuss related data fusion estimators. In Section
3, we introduce the problem of data-adaptive experiment selection and discuss issues of causal identification, including
estimation of bias due to inclusion of RWD. In Section 4, we introduce potential criteria for including RWD based on
optimizing the bias-variance tradeoff and utilizing the estimated effect of treatment on an NCO. In Section 5, we develop
an extension of the cross-validated targeted maximum likelihood estimator (CV-TMLE) (Zheng and van der Laan, 2010;
Hubbard et al., 2016) for this new context of data-adaptive experiment selection and define the limit distribution of this
estimator under varying amounts of bias. In Section 6, we set up a simulation to assess the performance of our estimator
and describe four potential comparator methods: two test-then-pool approaches (Viele et al., 2014), one method of
Bayesian dynamic borrowing (Schmidli et al., 2014), and a difference-in-differences (DID) approach to adjusting for
bias based on a negative control outcome (Sofer et al., 2016; Shi et al., 2020b). We also introduce a CV-TMLE-based
version of this DID method. In Section 7, we compare the coverage, power, bias, variance, and mean squared
error of the experiment-selector CV-TMLE to these four methods as well as to a CV-TMLE and t-test for the RCT only.
In Section 8, we demonstrate the use of the experiment-selector CV-TMLE to distinguish biased from unbiased external
controls in a real data analysis of the effect of liraglutide versus placebo on improvement in glycemic control in the
Central/South America subgroup of the LEADER trial.
2 Related Literature
A growing literature highlights different strategies for combined RCT-RWD analyses. One set of approaches, known as
Bayesian dynamic borrowing, generates a prior distribution of the RCT control parameter based on external control
data, with different approaches to down-weighting the observational information (Pocock, 1976; Ibrahim and Chen,
2000; Hobbs et al., 2012; Schmidli et al., 2014). These methods generally require assumptions on the distributions of
the involved parameters, which may significantly impact the effect estimates (Galwey, 2017; Dejardin et al., 2018).
While these methods can decrease bias compared to pooling alone, multiple studies have noted either increased type I
error or decreased power when there is heterogeneity between the historical and RCT control groups (Dejardin et al.,
2018; Viele et al., 2014; Galwey, 2017; Cuffe, 2011; Harun et al., 2020).
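As one concrete example of this family, the power prior of Ibrahim and Chen (2000) down-weights the likelihood of the historical control data $D_0$ by an exponent $a_0$:

```latex
\[
  \pi(\theta \mid D_0, a_0)
  \;\propto\;
  L(\theta \mid D_0)^{a_0}\, \pi_0(\theta),
  \qquad a_0 \in [0, 1],
\]
```

where $\pi_0(\theta)$ is a base prior: $a_0 = 0$ discards the historical data entirely, $a_0 = 1$ pools them fully, and intermediate values, whether fixed or given their own prior, control the degree of borrowing.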