Efficient surrogate-assisted inference for patient-reported outcome measures with complex missing mechanism

Jaeyoung Park∗, Muxuan Liang†, Ying-Qi Zhao‡, Xiang Zhong§

Abstract

Patient-reported outcome (PRO) measures are increasingly collected as a means of measuring healthcare quality and value. The capability to predict such measures enables patient-provider shared decision making and the delivery of patient-centered care. However, PRO measures often suffer from high missing rates, and the missingness may depend on many patient factors. Under such a complex missing mechanism, developing a predictive model for PRO measures with valid inference procedures is challenging, especially when flexible imputation models such as machine learning or nonparametric methods are used. Specifically, the slow convergence rate of the flexible imputation model may lead to non-negligible bias, and the traditional missing propensity, which could remove such a bias, is hard to estimate due to the complex missing mechanism. To efficiently infer the parameters of interest, we propose to use an informative surrogate that enables a flexible imputation model lying in a low-dimensional subspace. To remove the bias due to the flexible imputation model, we identify a class of weighting functions as alternatives to the traditional propensity score and estimate the low-dimensional one within the identified function class. Based on the estimated low-dimensional weighting function, we construct a one-step debiased estimator without using any information about the true missing propensity. We establish the asymptotic normality of the one-step debiased estimator. Simulations and an application to real-world data demonstrate the superiority of the proposed method.

Keywords: Missing Data; Dimension Reduction; Semiparametric Inference; Semi-supervised Learning; Double Machine Learning.

∗ Booth School of Business, University of Chicago
† Department of Biostatistics, University of Florida
‡ Public Health Sciences Divisions, Fred Hutchinson Cancer Center
§ Department of Industrial and Systems Engineering, University of Florida

arXiv:2210.09362v2 [stat.ME] 27 Feb 2023

1 Introduction

Patient-reported outcome (PRO) measures are increasingly collected before and after an intervention or a treatment as a means of measuring healthcare quality and value, which is an important step toward patient-centered care. Knowing only whether the measure goes up or down might not be sufficient to determine the effectiveness of the intervention. More importantly, whether the measure has changed by a sufficiently large margin, known as the minimal clinically important difference (MCID), needs to be evaluated. If the intervention is an elective surgery, identifying patients at risk of not achieving an MCID, particularly before the surgery, is important for pre-surgical decisions. There is a growing interest in applying machine learning techniques to predict whether a patient is likely to achieve an MCID before their surgery and to identify predictive factors associated with post-surgical PRO measures.

The increasing adoption of electronic health record (EHR) systems has provided unprecedented opportunities to learn an interpretable model for predicting PRO measures using massive observational data. Although the volume of observational data is large, the quality of such data may be uncertain. One of the major difficulties is missing data, especially missing outcome data. In our motivating example, the MCIDs can only be observed for the participants who take both the pre- and post-surgical surveys. The participants who completed both surveys may account for only a small portion (e.g., 1/3) of the participants whose EHR data are available, according to the response rates reported in the literature (Ho et al., 2019; Pronk et al., 2019) and in our own data. Unfortunately, low survey response rates are not uncommon in healthcare and other service industries. In this work, our objective is to develop an interpretable predictive model for an outcome subject to missingness. Specifically, we aim to develop a linear prediction model by minimizing the deviance of a generalized linear model (GLM), with a valid inference procedure for the coefficients under possible model misspecification.

Many approaches have been developed to deal with missing outcomes under the assumption of missing at random (MAR) (Kang and Schafer, 2007). One seminal work is the propensity inverse weighting approach (Rosenbaum and Rubin, 1983; Horvitz and Thompson, 1952). In this approach, one first estimates the probability of missingness given the covariates (also called the propensity) and then uses the inverse of the estimated propensity to adjust for the selection bias. When the propensity is poorly estimated, the propensity inverse weighting methods may not perform well. Another major type of approach is known as imputation. This approach first learns an imputation model using the fully observed part of the data, then imputes the missing outcomes with the predicted values, and finally refits the predictive model based on the imputed outcomes (Rubin, 2004). When the estimated imputation model is misspecified, the refitted predictive model may also be biased. To maintain robustness against possible misspecification of the propensity and the imputation models, one possible solution is to use doubly robust methods (Robins et al., 1994). Doubly robust methods, which incorporate both the propensity score and the imputation model, can lead to a consistent estimate for the outcome as long as either model is correctly specified (Tan, 2006, 2010; Qin et al., 2008; Qin and Zhang, 2007; Rubin and van der Laan, 2008; Cao et al., 2009; Han, 2012; Rotnitzky et al., 2012; Han et al., 2016).
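The three families of estimators described above can be compared on a toy problem. The following is a minimal sketch, not the method proposed in this paper: it targets the simple mean E[Y] rather than GLM coefficients, and the data-generating process, the function names (`fit_logistic`, `mean_estimators`, `simulate`), and the choice of a logistic propensity with a linear outcome model are all illustrative assumptions.

```python
import numpy as np

def fit_logistic(D, r, n_iter=25):
    """Logistic regression by Newton-Raphson; D must contain an intercept column."""
    beta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-D @ beta))
        grad = D.T @ (r - p)                          # score of the log-likelihood
        hess = (D * (p * (1.0 - p))[:, None]).T @ D   # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def mean_estimators(X, y, r):
    """Return (IPW, imputation, doubly robust) estimates of E[Y].

    y may be NaN wherever r == 0; those entries are never used.
    """
    n = len(r)
    D = np.column_stack([np.ones(n), X])
    obs = r == 1
    # Propensity model: P(R = 1 | X), fitted on all samples.
    pi_hat = 1.0 / (1.0 + np.exp(-D @ fit_logistic(D, r)))
    # Imputation model: E[Y | X], fitted by least squares on observed outcomes only.
    coef, *_ = np.linalg.lstsq(D[obs], y[obs], rcond=None)
    m_hat = D @ coef
    y_obs = np.where(obs, y, 0.0)
    ipw = np.mean(r * y_obs / pi_hat)
    imputation = np.mean(m_hat)
    doubly_robust = np.mean(m_hat + r * (y_obs - m_hat) / pi_hat)
    return ipw, imputation, doubly_robust

def simulate(n=20000, seed=0):
    """Hypothetical MAR data: missingness depends on the first covariate; E[Y] = 1."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    y = 1.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
    p_obs = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))
    r = rng.binomial(1, p_obs)
    return X, np.where(r == 1, y, np.nan), r

X, y, r = simulate()
ipw, imputation, doubly_robust = mean_estimators(X, y, r)
complete_case = np.nanmean(y)  # naive mean over responders only
print(f"complete-case={complete_case:.3f}  IPW={ipw:.3f}  "
      f"imputation={imputation:.3f}  DR={doubly_robust:.3f}")
```

In this simulated setting both working models are correctly specified, so all three adjusted estimators should land near the true mean of 1, while the complete-case mean is biased upward because responders tend to have larger values of the first covariate.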

Statistical inference for the parameters in predictive modeling with outcome missingness is also challenging. In particular, when the missing mechanism depends on multiple covariates through a nonlinear relationship, an unbiased estimator for the missing propensity with a fast convergence rate may be infeasible. For the inverse weighting approaches and the doubly robust methods, a parametric model for the propensity may not capture the potential nonlinearity. To ensure an unbiased propensity estimate, nonparametric regressions and machine learning methods have been adopted. These methods may lead to a slower convergence rate and hinder inference for the parameters in the predictive model, especially when the number of covariates is large. When the number of covariates is small, the double machine learning approach of Chernozhukov et al. (2018) was proposed to address the slow convergence rate. They adopted a cross-fitting algorithm using a doubly robust formulation and proposed to estimate both the propensity and the imputation model using nonparametric or machine learning methods. They proved that, as long as the product of the convergence rates of the propensity and imputation estimates is of smaller order than $n^{-1/2}$, where $n$ represents the sample size, a valid inference procedure for the parameters in the predictive model is possible. However, a large number of covariates and a failure to meet the smoothness condition on the true propensity may violate the required rate condition.
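The cross-fitting idea can be illustrated with a small sketch. It is a simplified illustration under stated assumptions, not the algorithm of Chernozhukov et al. (2018) in full generality nor the method of this paper: the target is the mean E[Y] with a single covariate, both nuisance functions are estimated by a hand-written k-nearest-neighbour average, and the names `knn_predict`, `cross_fit_dr_mean`, and `simulate`, the clipping threshold 0.05, and the simulated data are all hypothetical.

```python
import numpy as np

def knn_predict(x_train, t_train, x_test, k=50):
    """Average of t_train over the k nearest training points (one-dimensional x)."""
    dist = np.abs(x_test[:, None] - x_train[None, :])
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]  # indices of the k smallest distances
    return t_train[idx].mean(axis=1)

def cross_fit_dr_mean(x, y, r, n_folds=5, k=50, seed=0):
    """Cross-fitted doubly robust estimate of E[Y] and its standard error.

    Nuisance functions are fitted on the other folds and evaluated on the
    held-out fold, so no observation is scored by a model trained on itself.
    """
    n = len(x)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds
    psi = np.empty(n)
    for f in range(n_folds):
        test = folds == f
        train = ~test
        obs = train & (r == 1)
        # Propensity P(R = 1 | x), clipped away from zero (an ad hoc safeguard).
        pi_hat = np.clip(knn_predict(x[train], r[train].astype(float), x[test], k), 0.05, 1.0)
        # Imputation model E[Y | x], fitted on observed training outcomes only.
        m_hat = knn_predict(x[obs], y[obs], x[test], k)
        y_test = np.where(r[test] == 1, y[test], 0.0)
        psi[test] = m_hat + r[test] * (y_test - m_hat) / pi_hat
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)

def simulate(n=4000, seed=1):
    """Hypothetical MAR data with a nonlinear outcome model; E[Y] = 1/3."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(2.0 * x) + x**2 + 0.5 * rng.normal(size=n)
    r = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.0 + x))))
    return x, np.where(r == 1, y, np.nan), r

x, y, r = simulate()
est, se = cross_fit_dr_mean(x, y, r)
print(f"estimate={est:.3f}  95% CI=({est - 1.96 * se:.3f}, {est + 1.96 * se:.3f})")
```

With one covariate the nearest-neighbour estimates converge quickly enough for the product-rate condition to be plausible. With many covariates the same estimator slows down sharply, which is the difficulty the paragraph above describes.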

To help address the above statistical inference challenge due to the presence of a large number of covariates, one possible strategy is to leverage a surrogate outcome. Surrogate outcomes are defined here as alternative clinical outcomes that are likely to predict the clinical benefit of primary interest. In our motivating example, the MCID of the global physical health T-score in the Patient-Reported Outcomes Measurement Information System (PROMIS) survey is a well-acknowledged measurement for evaluating surgery benefit. Other collected PRO measures, which represent different but related aspects of mental or physical health, can be considered as surrogate outcomes. In many applications, a surrogate outcome can help improve efficiency or overcome the difficulties due to complex missing mechanisms. In causal inference (Prentice, 1989; Frangakis and Rubin, 2002; Fleming et al., 1994; Cheng et al., 2018; Anderer et al., 2022), a surrogate can be used to improve the efficiency of estimating the average treatment effect (ATE). In semi-supervised inference, under the assumption of missing completely at random (MCAR), Hou et al. (2021) showed that a surrogate can help infer the predicted risk derived from a high-dimensional working model even when the true risk prediction model depends on multiple covariates. However, their approach cannot be applied under the assumption of missing at random (MAR), which is the setting we need to deal with.

In this work, we focus on how to use surrogate outcomes to develop interpretable predictive models with outcome missingness. The parameter of interest is defined as the minimizer of the deviance under a GLM with possible model misspecification. We propose the concept of an informative surrogate, defined as a surrogate outcome that enables a low-dimensional imputation model conditional on the surrogate and the covariates (i.e., the imputation model lies in a low-dimensional subspace generated by the surrogate and covariates). Under the MAR assumption, we exploit this informative surrogate to 1) allow for a low-dimensional imputation model under a large number of covariates, and 2) avoid estimating the complex missing propensity. To harvest the potential benefit brought by informative surrogate outcomes, we propose the following procedure. First, we estimate a flexible imputation model (e.g., using kernel regression or basis expansion) in a reduced subspace that is constructed by leveraging the information from informative surrogate outcomes. Subsequently, we impute the missing outcomes and obtain an initial estimator for the parameters of interest. Then, we bypass the estimation of the complex missing propensity and instead estimate a low-dimensional weighting function based on the reduced subspace to adjust for the possible bias due to the estimated imputation model. Finally, a one-step debiased estimator for the parameters in the predictive model is constructed. Both point and interval estimates of the parameters can be obtained from the proposed procedure. We show that the proposed method provides a valid inference procedure for the parameters of interest without requiring consistent propensity estimation. In addition, when the true propensity lies in the same subspace as the imputation model, the proposed method leads to a semiparametric efficient estimator for the parameters in the predictive model. Extensive simulations and an analysis of real-world data are provided to demonstrate the superior performance of the proposed method.

The remainder of the paper is organized as follows. In Section 2, we define the parameter of interest and introduce our proposed method. In Section 3, we demonstrate the theoretical validity of the proposed method. In Section 4, we provide numerical studies comparing the proposed method with existing methods and with methods that do not use information from the surrogate. In Section 5, we apply the proposed method to derive a predictive rule to infer post-surgery improvement for joint replacement surgery patients. In Section 6, we discuss possible future work.

2 Method

Let $X$ be a $p$-dimensional covariate and $Y$ be a binary, categorical, or continuous outcome of interest. Without loss of generality, we choose a GLM as a working model for $E[Y \mid X]$. Following the notation of exponential family distributions (Shao, 2003), a GLM assumes that $E[Y \mid X] = b'(X^\top \beta)$, where $b'(\cdot)$, the derivative of the function $b(\cdot)$, is a known link function. The parameter of interest, $\beta^*$, is often defined as the minimizer of the deviance (or equivalently, the negative log-likelihood) under the working model, i.e., $\beta^* = \arg\min_{\beta} E[\ell(\beta)]$, where $\ell(\beta) = b(X^\top \beta) - Y X^\top \beta$. If the working model is misspecified, i.e., $E[Y \mid X] \neq b'(X^\top \beta^*)$, the $\beta^*$ that minimizes the deviance, a goodness-of-fit statistic, is still meaningful. For a linear working model, the link function $b'(t)$ is the identity function and $b(t) = t^2/2$, so the objective is equivalent to least squares. Notice that the parameter of interest $\beta^*$ is defined under the full distribution, where $Y$ and $X$ are always observed. To ensure that $\beta^*$ can be identified under the full distribution, we assume that $b''(\cdot)$ is always positive and $E[XX^\top]$ is positive definite.
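To make the role of the two identifiability conditions explicit, the following worked derivation may help. It is a sketch that assumes differentiation and expectation can be interchanged, a regularity condition not stated in the text above.

```latex
\begin{align*}
\nabla_\beta\, E[\ell(\beta)]   &= E\big[X\{b'(X^\top\beta) - Y\}\big], \\
\nabla_\beta^2\, E[\ell(\beta)] &= E\big[b''(X^\top\beta)\, X X^\top\big].
\end{align*}
% For any fixed v \neq 0:
%   v^\top E[b''(X^\top\beta) X X^\top] v = E[b''(X^\top\beta) (v^\top X)^2] > 0,
% since b'' > 0 everywhere and E[(v^\top X)^2] = v^\top E[X X^\top] v > 0.
```

The Hessian is therefore positive definite, the objective $E[\ell(\beta)]$ is strictly convex, and it has at most one minimizer. When the minimizer exists, it is characterized by the estimating equation $E[X\{b'(X^\top\beta^*) - Y\}] = 0$. In the linear case $b(t) = t^2/2$, this reduces to the familiar least-squares solution $\beta^* = E[XX^\top]^{-1} E[XY]$.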

For actual data, the outcome $Y$ can be missing. We collect the covariate $X$, the outcome $Y$, the informative surrogate outcome $Z$, and the missing indicator $R$ from all samples. The missing indicator $R$ indicates whether $Y$ is observed ($R = 1$) or not ($R = 0$). We also assume that the surrogate $Z$ can be fully observed. Collectively, the observed data can be denoted as $(X, Z, R, RY)$. To ensure the identifiability of $\beta^*$ using the actual data, we assume that $Y \perp R \mid X, Z$.
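A short derivation shows why this conditional independence assumption suffices for identification. It is a sketch that additionally assumes $P(R = 1 \mid X, Z) > 0$, so that conditioning on $R = 1$ is well defined. The symbol $Q(Z, X)$ follows the notation introduced in Section 2.1.

```latex
\begin{align*}
Q(Z, X) := E[Y \mid Z, X] &= E[Y \mid Z, X, R = 1]
  && \text{(by } Y \perp R \mid X, Z\text{)}, \\
E[\ell(\beta)] &= E\big[b(X^\top\beta) - Y X^\top\beta\big] \\
  &= E\big[b(X^\top\beta) - E[Y \mid Z, X]\, X^\top\beta\big]
  && \text{(tower property)} \\
  &= E\big[b(X^\top\beta) - Q(Z, X)\, X^\top\beta\big].
\end{align*}
```

Because $\ell(\beta)$ is linear in $Y$, replacing $Y$ by $Q(Z, X)$ leaves the population objective unchanged. The function $Q$ depends only on the distribution of the responders ($R = 1$), and $(X, Z)$ is observed for everyone, so the last line, and hence $\beta^*$, is determined by the distribution of the observed data $(X, Z, R, RY)$.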

2.1 First step: dimension reduction through informative surrogate

In this section, we propose a two-step procedure under the assumption of $Y \perp R \mid X, Z$. To start with, we formally define the concept of informative surrogate outcomes and introduce the required assumption for the identifiability of $\beta^*$.

A surrogate outcome $Z$ is informative if there exists a $(p+1) \times d$ matrix $\Gamma$ with orthogonal columns satisfying $Y \perp \widetilde{X} \mid \Gamma^\top \widetilde{X}$ and $d < p$, where $\widetilde{X}^\top = (Z, X^\top)$. This definition implies that, conditioning on the surrogate outcome, the dimension of the space constructed by the covariates and the surrogate can be reduced to $d$, which is expected to be much smaller than $p$. The columns of $\Gamma$ span the reduced subspace. Thus, if the surrogate is informative, $Q(Z, X) := E[Y \mid Z, X]$ is a function lying in a low-dimensional subspace, i.e., there exists an unknown link function $g$ such that $Q(Z, X) = g(\Gamma^\top \widetilde{X})$. Consequently, an efficient estimator of this low-dimensional imputation model may have a faster convergence rate than directly using kernel regression to estimate $E[Y \mid X]$
