Efficient surrogate-assisted inference for patient-reported
outcome measures with complex missing mechanism
Jaeyoung Park*, Muxuan Liang†, Ying-Qi Zhao‡, Xiang Zhong§
Abstract
Patient-reported outcome (PRO) measures are increasingly collected as a means of measuring
healthcare quality and value. The capability to predict such measures enables patient-provider shared
decision making and the delivery of patient-centered care. However, PRO measures often suffer
from high missing rates, and the missingness may depend on many patient factors. Under such a
complex missing mechanism, developing a predictive model for PRO measures with valid inference
procedures is challenging, especially when flexible imputation models such as machine learning or
nonparametric methods are used. Specifically, the slow convergence rate of the flexible imputation
model may lead to non-negligible bias, and the traditional missing propensity, capable of removing such
a bias, is hard to estimate due to the complex missing mechanism. To efficiently infer the parameters
of interest, we propose to use an informative surrogate that enables a flexible imputation model lying
in a low-dimensional subspace. To remove the bias due to the flexible imputation model, we identify
a class of weighting functions as alternatives to the traditional propensity score and estimate the
low-dimensional one within the identified function class. Based on the estimated low-dimensional
weighting function, we construct a one-step debiased estimator without using any information about the
true missing propensity. We establish the asymptotic normality of the one-step debiased estimator.
Simulation and an application to real-world data demonstrate the superiority of the proposed method.
Keywords: Missing Data; Dimension Reduction; Semiparametric Inference; Semi-supervised
Learning; Double Machine Learning.
*Booth School of Business, University of Chicago
†Department of Biostatistics, University of Florida
‡Public Health Sciences Divisions, Fred Hutchinson Cancer Center
§Department of Industrial and Systems Engineering, University of Florida
arXiv:2210.09362v2 [stat.ME] 27 Feb 2023
1 Introduction
Patient-reported outcome (PRO) measures are increasingly collected before and after an intervention
or a treatment as a means of measuring healthcare quality and value, which is an important step
toward patient-centered care. Knowing only whether the measure goes up or down might not be sufficient to
determine the effectiveness of the intervention. More importantly, whether the measure has changed
with a sufficiently large margin, known as the minimally clinically important difference (MCID), needs
to be evaluated. If the intervention is an elective surgery, identifying patients at risk of not achieving
an MCID, particularly before the surgery, is important for pre-surgical decisions. There is a growing
interest in applying machine learning techniques to predict whether a patient is likely to achieve an
MCID before their surgery and identify predictive factors associated with post-surgical PRO measures.
The increasing adoption of electronic health record (EHR) systems has provided unprecedented
opportunities to learn an interpretable model for predicting PRO measures using massive observational
data. Although the volume of observational data is large, the quality of such observational data may
be uncertain. One of the major difficulties is missing data, especially missing outcome data.
In our motivating example, the MCIDs can only be observed from the participants who take both
pre- and post-surgical surveys. The participants who completed both surveys may only account
for a small portion (e.g., 1/3) of the participants whose EHR data is available, according to the
response rates reported in the literature (Ho et al., 2019; Pronk et al., 2019) and from our own data.
Unfortunately, low survey response rates are not uncommon in healthcare and other service industries.
In this work, our objective is to develop an interpretable predictive model for the outcome subject to
missing. Specifically, we aim at developing a linear prediction model by minimizing the deviance of
a generalized linear model (GLM) with a valid inference procedure for the coefficients under possible
model misspecification.
Many approaches have been developed to deal with missing outcomes under the assumption of
missing at random (MAR) (Kang and Schafer, 2007). One seminal work is the propensity inverse
weighting approach (Rosenbaum and Rubin, 1983; Horvitz and Thompson, 1952). For this approach,
one first estimates the probability of missingness given the covariates (also called the propensity) and then
uses the inverse of the estimated propensity to adjust for the selection bias. When the propensity is
poorly estimated, the propensity inverse weighting methods may not perform well. Another major
type of approach is known as imputation. This approach first learns an imputation model using
the fully observed part of the data; then, imputes the missing outcomes with the predicted values;
and finally, refits the predictive model based on the imputed outcomes (Rubin, 2004). When the
estimated imputation model is misspecified, the refitted predictive model may also be biased. To
maintain robustness against the possible misspecification in the propensity and the imputation models,
one possible solution is to use the doubly robust methods (Robins et al., 1994). The doubly robust
methods that incorporate both the propensity score and the imputation models can lead to a consistent
estimate for the outcome as long as either model is correctly specified (Tan, 2006, 2010; Qin et al.,
2008; Qin and Zhang, 2007; Rubin and van der Laan, 2008; Cao et al., 2009; Han, 2012; Rotnitzky
et al., 2012; Han et al., 2016).
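To make the three strategies above concrete, the following is a minimal sketch on simulated data, not the estimators studied in this paper. It targets the simple mean E[Y] rather than GLM coefficients, and every name in it (`simulate`, `cell_means`, `estimators`) and every numeric setting is an illustrative assumption. The covariate is binary, so the propensity and imputation models can be estimated by saturated cell means.

```python
import random

def simulate(n, seed=0):
    """Toy MAR data (assumed setup): binary covariate x, outcome y, response indicator r."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)   # true mean E[Y] = 2.0
        p_obs = 0.8 if x == 0 else 0.3            # response depends on x only (MAR)
        r = 1 if rng.random() < p_obs else 0
        data.append((x, y if r else None, r))     # y is unavailable when r == 0
    return data

def cell_means(data):
    """Saturated estimates of P(R=1 | x) and E[Y | x, R=1] for x in {0, 1}."""
    prop, imp = {}, {}
    for x in (0, 1):
        rows = [d for d in data if d[0] == x]
        obs = [d[1] for d in rows if d[2] == 1]
        prop[x] = len(obs) / len(rows)
        imp[x] = sum(obs) / len(obs)
    return prop, imp

def estimators(data):
    prop, imp = cell_means(data)
    n = len(data)
    # Complete-case mean: ignores the selection bias.
    naive = sum(y for _, y, r in data if r) / sum(r for _, _, r in data)
    # Inverse propensity weighting: reweight each respondent by 1 / P(R=1 | x).
    ipw = sum(y / prop[x] for x, y, r in data if r) / n
    # Imputation: fill in missing outcomes with the predicted value, then average.
    imputation = sum(y if r else imp[x] for x, y, r in data) / n
    # Doubly robust (augmented IPW): imputation plus a weighted residual correction.
    aipw = sum(imp[x] + (r * (y - imp[x]) / prop[x] if r else 0.0)
               for x, y, r in data) / n
    return {"naive": naive, "ipw": ipw, "imputation": imputation, "aipw": aipw}

est = estimators(simulate(20000))
```

Because both nuisance models are saturated in this toy setting, the three adjusted estimates coincide up to floating-point error, while the complete-case mean is biased downward. The estimators differ in practice once either working model is restricted or misspecified, which is the situation the doubly robust construction is designed for.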
Statistical inference for the parameters in predictive modeling with outcome missingness is also
challenging. In particular, when the missing mechanism is dependent on multiple covariates through a
nonlinear relationship, an unbiased estimator for the missing propensity with a fast convergence rate
may be infeasible. For the inverse weighting approaches and the doubly robust methods, a parametric
model for the propensity may not capture the potential non-linearity. To ensure an unbiased propensity
estimate, nonparametric regressions and machine learning methods have been adopted. These methods
may lead to a slower convergence rate and hinder the inference of the parameters in the predictive
model, especially when the number of the covariates is large. When the number of the covariates is
small, to address the slow convergence rate, the double machine learning approach was proposed in
Chernozhukov et al. (2018). They adopted a cross-fitting algorithm using a doubly robust formulation
and proposed to estimate both the propensity and the imputation model using nonparametric or
machine learning methods. They proved that, as long as the product of the convergence rates of
the propensity and imputation estimates is of smaller order than $n^{-1/2}$, a valid inference procedure for the
parameters in the predictive model is possible, where $n$ represents the sample size. However, a large
number of covariates and a failure to meet the smoothness condition on the true propensity may invalidate
the required rate condition.
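The cross-fitting idea described above can be sketched as follows. This is a toy illustration under stated assumptions, not the algorithm of Chernozhukov et al. (2018) in full generality: the target is the mean E[Y], the covariate is binary, the nuisance functions are cell means, and the names `simulate`, `fit_nuisances`, and `cross_fit_aipw` are hypothetical. The point is only that each observation's doubly robust score is evaluated with nuisance estimates fitted on the other folds.

```python
import math
import random

def simulate(n, seed=1):
    """Toy MAR data (assumed setup); y is stored as 0.0 when unobserved and never used then."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)   # true mean E[Y] = 2.0
        r = 1 if rng.random() < (0.8 if x == 0 else 0.3) else 0
        rows.append((x, y if r else 0.0, r))
    return rows

def fit_nuisances(train):
    """Estimate P(R=1 | x) and E[Y | x, R=1] on the training folds only."""
    prop, imp = {}, {}
    for x in (0, 1):
        cell = [d for d in train if d[0] == x]
        obs = [d[1] for d in cell if d[2] == 1]
        prop[x] = len(obs) / len(cell)
        imp[x] = sum(obs) / len(obs)
    return prop, imp

def cross_fit_aipw(rows, k=5):
    """K-fold cross-fitted doubly robust estimate of E[Y] with a Wald interval."""
    n = len(rows)
    scores = []
    for fold in range(k):
        held_out = [d for i, d in enumerate(rows) if i % k == fold]
        train = [d for i, d in enumerate(rows) if i % k != fold]
        prop, imp = fit_nuisances(train)
        for x, y, r in held_out:
            # Doubly robust score; the residual term vanishes when r == 0.
            scores.append(imp[x] + r * (y - imp[x]) / prop[x])
    est = sum(scores) / n
    var = sum((s - est) ** 2 for s in scores) / (n - 1)
    se = math.sqrt(var / n)
    return est, se, (est - 1.96 * se, est + 1.96 * se)

est, se, ci = cross_fit_aipw(simulate(20000))
```

With flexible learners in place of the cell means, the same loop applies unchanged. The rate condition discussed above is what justifies treating the averaged scores as approximately normal.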
To help address the above statistical inference challenge due to the presence of a large number of
covariates, one possible strategy is to leverage a surrogate outcome. The surrogate outcomes herein are
defined as alternative clinical outcomes that are likely to predict the clinical benefit of primary interest.
In our motivating example, the MCID of the global physical health T-score in the Patient-Reported
Outcomes Measurement Information System (PROMIS) survey is a well acknowledged measurement
for evaluating surgery benefit. There are other PRO measures collected that represent different but
related mental or physical health performances that can be considered as surrogate outcomes. In
many applications, a surrogate outcome can help improve the efficiency or overcome the difficulties
due to complex missing mechanisms. In the application of causal inference (Prentice, 1989; Frangakis
and Rubin, 2002; Fleming et al., 1994; Cheng et al., 2018; Anderer et al., 2022), a surrogate can be
used to improve the efficiency of estimating the average treatment effect (ATE). In the application
of semi-supervised inference, under the assumption of missing completely at random (MCAR), Hou
et al. (2021) showed that a surrogate can help infer the predicted risk derived from a high-dimensional
working model even when the true risk prediction model depends on multiple covariates. However,
their approach cannot be applied under the assumption of missing at random (MAR), which is the
setting we need to deal with.
In this work, we focus on how to use surrogate outcomes to develop interpretable predictive models
with outcome missingness. The parameter of interest herein is defined as the minimizer of the deviance
under a GLM with possible model misspecification. We propose a concept of an informative surrogate,
defined as a surrogate outcome that enables a low-dimensional imputation model conditional on the
surrogate and the covariates (i.e., the imputation model lies in a low-dimensional subspace generated
by the surrogate and covariates). Under the MAR assumption, we exploit the role of this informative
surrogate to 1) allow for a low-dimensional imputation model under a large number of covariates;
2) avoid estimating the complex missing propensity. To harvest the potential benefit brought by
informative surrogate outcomes, we propose the following procedure. First, we estimate a flexible
imputation model (e.g., using kernel regression or basis expansion) in a reduced subspace that is
constructed by leveraging the information from informative surrogate outcomes. Subsequently, we
can impute the missing outcomes and obtain an initial estimator for the parameters of interest. Then,
we bypass the estimation of the complex missing propensity and instead estimate a low-dimensional
weighting function based on the reduced subspace to adjust for the possible bias due to the estimated
imputation model. Finally, a one-step debiased estimator for the parameters in the predictive model
can be constructed. Both the point and interval estimates of the parameters can be obtained from
the proposed procedure. We show that the proposed method can provide a valid inference procedure
for the parameters of interest without requiring a consistent propensity estimation. In addition,
when the true propensity lies in the same subspace as the imputation model, the proposed method
leads to a semiparametric efficient estimator for the parameters in the predictive model. Extensive
simulation and an analysis of real-world data are provided to demonstrate the superior performance
of the proposed method.
The remainder of the paper is organized as follows. In Section 2, we define the parameter of
interest and introduce our proposed method. In Section 3, we demonstrate the theoretical validity
of the proposed method. In Section 4, we provide numerical studies to demonstrate the advantages of the
proposed method over existing methods and over methods that do not use the surrogate information. In
Section 5, we apply the proposed method to derive a predictive rule to infer post-surgery improvement
for joint replacement surgery patients. In Section 6, we discuss possible future work.
2 Method
Let $X$ be a $p$-dimensional covariate and $Y$ be a binary, categorical, or continuous outcome of interest.
Without loss of generality, we choose a GLM as a working model for $E[Y \mid X]$. Following the
notation of exponential family distributions (Shao, 2003), a GLM assumes that $E[Y \mid X] = b'(X^\top \beta)$,
where $b'(\cdot)$, the derivative of the function $b(\cdot)$, is a known link function. The parameter of interest, $\beta^*$, is
often defined as the minimizer of the deviance (or equivalently, the negative log-likelihood) under the
working model, i.e., $\beta^* = \arg\min_{\beta} E[\ell(\beta)]$, where $\ell(\beta) = b(X^\top \beta) - Y X^\top \beta$. If the working model is
misspecified, i.e., $E[Y \mid X] \neq b'(X^\top \beta^*)$, the $\beta^*$ that minimizes the deviance, a goodness-of-fit statistic,
is still meaningful. For a linear working model, the link function $b'(t)$ is the identity function, and
the function $b(t) = t^2/2$; the objective is equivalent to least squares. Notice that the parameter of
interest $\beta^*$ is defined under the full distribution where $Y$ and $X$ are always observed. To ensure that
$\beta^*$ can be identified under the full distribution, we assume that $b''(\cdot)$ is always positive and $E[XX^\top]$
is positive definite.
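As an illustration of this definition, the sketch below computes the empirical minimizer of the mean of $\ell(\beta) = b(X^\top \beta) - Y X^\top \beta$ on fully observed simulated data. The logistic choice $b(t) = \log(1 + e^t)$, the single covariate with intercept, the Newton-Raphson solver, and all function names are assumptions made for the example. They are not prescribed by the text.

```python
import math
import random

def b(t):
    """b(t) = log(1 + exp(t)) for a logistic working model (assumed choice)."""
    return math.log1p(math.exp(t))

def b_prime(t):
    """b'(t): the mean function, here the sigmoid."""
    return 1.0 / (1.0 + math.exp(-t))

def b_double_prime(t):
    """b''(t): strictly positive, so the objective is convex."""
    m = b_prime(t)
    return m * (1.0 - m)

def mean_loss(beta, xs, ys):
    """Empirical version of E[b(X'beta) - Y X'beta] with X = (1, x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        t = beta[0] + beta[1] * x
        total += b(t) - y * t
    return total / len(xs)

def mean_gradient(beta, xs, ys):
    """Gradient of mean_loss: average of (b'(X'beta) - Y) * X."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        resid = b_prime(beta[0] + beta[1] * x) - y
        g0 += resid
        g1 += resid * x
    return g0 / len(xs), g1 / len(xs)

def newton(xs, ys, iters=25):
    """Newton-Raphson for the two-parameter problem, solving the 2x2 system by hand."""
    beta = [0.0, 0.0]
    for _ in range(iters):
        g0, g1 = mean_gradient(beta, xs, ys)
        h00 = h01 = h11 = 0.0
        for x in xs:
            w = b_double_prime(beta[0] + beta[1] * x) / len(xs)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        beta[0] -= (h11 * g0 - h01 * g1) / det
        beta[1] -= (h00 * g1 - h01 * g0) / det
    return beta

rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ys = [1.0 if rng.random() < b_prime(-0.5 + 1.0 * x) else 0.0 for x in xs]
beta_hat = newton(xs, ys)
grad_at_hat = mean_gradient(beta_hat, xs, ys)
```

The two conditions stated above play a visible role here: a positive $b''$ and a positive definite second-moment matrix keep the 2x2 Hessian invertible, so each Newton step is well defined. The same minimizer exists when the data are not generated from the working model, which is the sense in which $\beta^*$ remains meaningful under misspecification.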
For actual data, the outcome $Y$ can be missing. We collect the covariate $X$, the outcome $Y$, the
informative surrogate outcome $Z$, and the missing indicator $R$ from all samples. The missing indicator
$R$ indicates whether $Y$ is observed ($R = 1$) or not ($R = 0$). We also assume that the surrogate $Z$ can
be fully observed. Collectively, the observed data can be denoted as $(X, Z, R, RY)$. To ensure the
identifiability of $\beta^*$ using the actual data, we assume that $Y \perp R \mid X, Z$.
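One way to see why this conditional independence gives identifiability is the short derivation below. It is a sketch that additionally assumes $P(R = 1 \mid X, Z) > 0$, a positivity condition not stated in the text above, so that conditioning on $R = 1$ is well defined.

```latex
\begin{align*}
E[Y \mid X, Z] &= E[Y \mid X, Z, R = 1]
  && \text{(since } Y \perp R \mid X, Z\text{)}, \\
E[\ell(\beta)] &= E\big[b(X^\top \beta) - Y X^\top \beta\big] \\
  &= E\big[b(X^\top \beta) - E[Y \mid X, Z, R = 1]\, X^\top \beta\big]
  && \text{(tower property)}.
\end{align*}
```

The last expression involves only quantities that can be computed from the observed data $(X, Z, R, RY)$, so the minimizer $\beta^*$ is determined by the observed-data distribution.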
2.1 First step: dimension reduction through informative surrogate
In this section, we propose a two-step procedure under the assumption of $Y \perp R \mid X, Z$. To start
with, we formally define the concept of informative surrogate outcomes and introduce the required
assumption for the identifiability of $\beta^*$.
A surrogate outcome $Z$ is informative if there exists a $(p+1) \times d$ matrix, $\Gamma$, with orthogonal
columns satisfying $Y \perp \widetilde{X} \mid \Gamma^\top \widetilde{X}$ and $d < p$, where $\widetilde{X}^\top = (Z, X^\top)$. This definition implies that,
conditioning on the surrogate outcome, the dimension of the space constructed by the covariates and
the surrogate can be reduced to $d$, which is expected to be much smaller than $p$. The columns of $\Gamma$
represent the reduced subspace. Thus, if the surrogate is informative, $Q(Z, X) := E[Y \mid Z, X]$ is a
function lying in a low-dimensional subspace, i.e., there exists an unknown link function $g$ such that
$Q(Z, X) = g(\Gamma^\top \widetilde{X})$. Consequently, an efficient estimator of this low-dimensional imputation model
may have a faster convergence rate than directly using kernel regression to estimate $E[Y \mid X]$