Approximating Partial Likelihood Estimators
via Optimal Subsampling
Haixiang Zhang1, Lulu Zuo1, HaiYing Wang2and Liuquan Sun3
1Center for Applied Mathematics, Tianjin University, Tianjin 300072, China
2Department of Statistics, University of Connecticut, Storrs, Mansfield, CT 06269, USA
3Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
Abstract
With the growing availability of large-scale biomedical data, it is often time-consuming
or infeasible to directly perform traditional statistical analysis with relatively limited
computing resources at hand. We propose a fast subsampling method to effectively
approximate the full data maximum partial likelihood estimator in Cox’s model, which
largely reduces the computational burden when analyzing massive survival data. We
establish consistency and asymptotic normality of a general subsample-based estima-
tor. The optimal subsampling probabilities with explicit expressions are determined
via minimizing the trace of the asymptotic variance-covariance matrix for a linearly
transformed parameter estimator. We propose a two-step subsampling algorithm for
practical implementation, which significantly reduces computing time compared with the
full data method. The asymptotic properties of the resulting two-step subsample-based
estimator are also established. Extensive numerical experiments and
a real-world example are provided to assess our subsampling strategy. Supplemental
materials for this article are available online.
Keywords: Asymptotic normality; Empirical process; L-optimality criterion; Mas-
sive data; Survival analysis.
Corresponding author. Email: haixiang.zhang@tju.edu.cn (Haixiang Zhang)
arXiv:2210.04581v2 [stat.ME] 17 May 2023
1 Introduction
With the development of science and technology, the amount of available data has been
increasing rapidly in recent years. A major bottleneck in analyzing huge datasets is that the
data volume exceeds the capacity of available computational resources. It is not always possible
to meet the demands of computational speed and storage memory when traditional analyses
are performed directly on large datasets with a single computer. To cope with big data,
many statistical methods in the literature address the heavy computation and storage burden.
Broadly, these methods fall into three categories: (i)
divide-and-conquer approach (Zhao et al., 2016; Battey et al., 2018; Shi et al., 2018; Jordan
et al., 2019; Volgushev et al., 2019; Chen et al., 2022; Fan et al., 2021). (ii) online updating
approach (Schifano et al., 2016; Luo and Song, 2020; Lin et al., 2020; Luo et al., 2022;
Wang et al., 2022b). (iii) subsampling-based approach. Subsampling is an emerging
field for big data, and many related papers have been published in recent years. For example,
Wang et al. (2018) and Wang (2019) studied the optimal subsampling for massive logistic
regression. Wang et al. (2019) presented an information-based subdata selection approach
for linear regression with big datasets. Zhang et al. (2020) studied an effective sketching
method for massive datasets via A-optimal subsampling. Yao and Wang (2019), Han et al.
(2020) and Yao et al. (2021) proposed several subsampling methods for large-scale multiclass
logistic regression. Yu et al. (2022) considered optimal Poisson subsampling for maximum
quasi-likelihood estimator with massive data. Zhang et al. (2021) proposed a response-free
optimal sampling procedure for generalized linear models under measurement constraints.
Wang and Ma (2021) studied the optimal subsampling for quantile regression in big data.
Liu et al. (2021) proposed an optimal subsampling method for the functional linear model via
L-optimality criterion. Zhang and Wang (2021) and Zuo et al. (2021b) considered optimal
distributed subsampling methods for big data in the context of linear and logistic models,
respectively. Ai et al. (2021) studied the optimal subsampling method for generalized linear
models under the A-optimality criterion. Wang and Zhang (2022) proposed an optimal
subsampling procedure for multiplicative regression with massive data. For more related
results on massive data analysis, we refer to several review papers by Wang et al. (2016),
Lee and Ng (2020), Yao and Wang (2021), Chen et al. (2021b), Li and Meng (2021) and Yu
et al. (2023).
The aforementioned investigations focused on developing statistical methods for large
datasets with uncensored observations. In recent years, huge biomedical datasets have become
increasingly common, and they are often subject to censoring (Kleinbaum and Klein, 2005).
There have been several recent papers on statistical analysis of massive censored survival
data. For example, Xue et al. (2019) and Wu et al. (2021) studied the online updating
approach for streams of survival data. Keret and Gorfine (2020) presented an optimal Cox
regression subsampling procedure with rare events. Tarkhan and Simon (2020) and Xu
et al. (2020) used stochastic gradient descent algorithms to analyze large-scale survival
datasets with Cox’s model and the accelerated failure time models, respectively. Li et al.
(2020) proposed a batch screening iterative Lasso method for large-scale and ultrahigh-
dimensional Cox models. Zuo et al. (2021a) proposed a sampling-based method for massive
survival data with additive hazards model. Wang et al. (2021) studied an efficient divide-and-
conquer algorithm to fit high-dimensional Cox regression for massive datasets. Yang et al.
(2022) studied optimal subsampling algorithms for parametric accelerated failure time
models with massive survival data. Despite the aforementioned work, existing research
on massive survival data remains relatively limited, and it is worthwhile to further investigate
statistical theory in the area of large-scale survival analysis.
It is worth mentioning that subsampling is an emerging area of research, which has
attracted considerable attention in both statistics and computer science (Ma et al., 2015; Bai
et al., 2021). Subsampling methods focus on selecting a small proportion of the full data
as a surrogate to perform statistical computations. A key to the success of subsampling is
to design nonuniform sampling probabilities so that those influential or informative data
points are sampled with high probabilities. Although significant progress has been made
towards developing optimal subsampling theory for uncensored observations, to the best
of our knowledge, the research on optimal subsampling for large-scale survival data lags
behind. In consideration of the important role of Cox’s model in the field of survival analysis
(Cox, 1972; Fleming and Harrington, 1991), it is desirable to develop effective subsampling
methods in the context of Cox’s model for massive survival data. This paper aims to close this
gap by developing a subsample-based estimator to fast and effectively approximate the full
data maximum partial likelihood estimator. Our aim is to design an efficient subsampling
and estimation strategy to better balance the trade-off between computational efficiency
and statistical efficiency. Here are some key differences between our proposed subsampling
approach and some recently developed approaches on Cox’s model with large-scale data: (i)
Keret and Gorfine (2020) proposed a subsampling-based estimation for Cox’s model with
rare events by including all observed failures, while our optimal subsampling method is
developed for Cox’s model under the regular setting that observed failure times are not rare
compared with the observed censoring times. (ii) Tarkhan and Simon (2020) presented a
stochastic gradient descent (SGD) procedure for Cox’s model. This method is primarily intended
to resolve the problem that the whole dataset cannot be easily loaded into memory; its main
aim is to address the out-of-memory issue rather than to speed up the computation. (iii)
Li et al. (2020) and Wang et al. (2021) studied the variable selection problem for ultrahigh-
dimensional Cox’s model, which is different from the focus of our paper on dealing with very
large sample sizes.
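The inverse-probability weighting idea behind such nonuniform subsampling schemes can be sketched generically. The snippet below is only an illustration of subsampling with replacement in general, not the optimal procedure developed in this paper; the probabilities `probs` are a hypothetical placeholder that a concrete method would derive from the data:

```python
import numpy as np

def weighted_subsample(n, r, probs, rng=None):
    """Draw r indices with replacement using sampling probabilities `probs`
    (summing to 1), and return the indices together with inverse-probability
    weights 1 / (r * probs[i]), so that weighted sums over the subsample are
    unbiased for the corresponding full-data sums."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(n, size=r, replace=True, p=probs)
    weights = 1.0 / (r * probs[idx])
    return idx, weights

# Toy example: recover a full-data mean from a small weighted subsample.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
# Hypothetical "informativeness" probabilities, proportional to |x| + 0.01.
p = np.abs(x) + 0.01
p /= p.sum()
idx, w = weighted_subsample(len(x), 500, p, rng=1)
approx_mean = np.sum(w * x[idx]) / len(x)  # close to x.mean()
```

With probabilities proportional to a suitable informativeness measure, the weighted estimator stays unbiased while its variance can be far smaller than under uniform sampling, which is the motivation for choosing the probabilities optimally.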
The main contributions of our proposed subsampling method include three aspects: First,
the computation of our subsample-based estimator is much faster than that of the full data
estimator calculated by the standard R function coxph. Therefore, it effectively reduces the
computational burden when analyzing massive survival data with Cox’s model. Second, we
provide an explicit expression for the optimal subsampling distribution, which has much bet-
ter performance than the uniform subsampling distribution in terms of statistical efficiency.
Third, we establish consistency and asymptotic normality of the proposed subsample estima-
tor, which is useful for performing statistical inference (e.g. constructing confidence intervals
and testing hypotheses).
The remainder of this paper is organized as follows. In Section 2, we review the setup and
notation for Cox’s model. A general subsample-based estimator is proposed to approximate
the full data maximum partial likelihood estimator. In Section 3, we establish consistency
and asymptotic normality of a general subsample-based estimator. The optimal subsampling
probabilities are explicitly specified in the context of L-optimality criterion. In Section 4,
we give a two-step subsampling algorithm together with the asymptotic properties of the
resulting estimator. In Section 5, extensive simulations together with an application are
conducted to verify the validity of the proposed subsampling procedure. Some concluding
remarks are presented in Section 6. Technical proofs are given in the supplement.
2 Model and Subsample-Based Estimation
In many biomedical applications, the outcome of interest is measured as a “time-to-event”,
such as death or the onset of cancer. The time to the occurrence of an event is referred to as
a failure time (Kalbfleisch and Prentice, 2002), and a typical characteristic of such data is
that they are subject to possible right censoring. For i = 1, · · · , n, let T_i be the failure
time, C_i be the censoring time, and X_i be the p-dimensional vector of time-independent
covariates (e.g., treatment indicator, blood pressure, age, and gender). We assume that T_i
and C_i are conditionally independent given X_i. The observed failure time is
Y_i = min(T_i, C_i), and the failure indicator is ∆_i = I(T_i ≤ C_i), where I(·) is the
indicator function. For convenience, we denote the full data of independent and identically
distributed observations from the population as D_n = {(X_i, ∆_i, Y_i), i = 1, · · · , n}.
Cox’s proportional hazards regression model (Cox,
1972) is commonly used to describe the relationship between covariates of an individual and
the risk of experiencing an event. This model assumes that the conditional hazard rate
function of T_i given X_i is

$$\lambda(t \mid X_i) = \lambda_0(t)\,\exp(\beta' X_i), \qquad (1)$$
where λ_0(t) is an unknown baseline hazard function, β = (β_1, · · · , β_p)′ is a p-dimensional
vector of regression parameters, and its true value belongs to a compact set Θ ⊂ R^p. To
estimate β, Cox (1975) proposed a novel partial likelihood method. The negative log-partial
likelihood function is

$$\ell(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{\tau}\left[\beta' X_i - \log\left\{\sum_{j=1}^{n} I(Y_j \ge t)\,\exp(\beta' X_j)\right\}\right]\mathrm{d}N_i(t), \qquad (2)$$
where N_i(t) = I(∆_i = 1, Y_i ≤ t) is a counting process and τ is a prespecified positive
constant. One advantage of Cox’s partial likelihood method is that the criterion function
given in (2) does not involve the nonparametric baseline hazard function λ_0(t).
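To make the notation concrete, the following sketch simulates right-censored data under model (1) with a hypothetical constant baseline hazard λ_0(t) = 1 and evaluates the criterion (2) directly. It is a toy illustration under stated assumptions, not the implementation used in the paper:

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, Y, delta):
    """Direct O(n^2) evaluation of (2): each observed failure (delta_i = 1)
    contributes beta'X_i minus the log of the sum of exp(beta'X_j) over the
    risk set {j : Y_j >= Y_i}; the total is averaged over n and negated."""
    eta = X @ beta
    total = 0.0
    for i in np.flatnonzero(delta == 1):
        total += eta[i] - np.log(np.exp(eta[Y >= Y[i]]).sum())
    return -total / len(Y)

# Toy data from model (1) with lambda_0(t) = 1, so T_i is exponential with
# rate exp(beta'X_i); censoring times C_i are independent exponentials.
rng = np.random.default_rng(42)
n, beta_true = 2000, np.array([0.5, -0.5])
X = rng.normal(size=(n, 2))
T = rng.exponential(1.0 / np.exp(X @ beta_true))
C = rng.exponential(2.0, size=n)
Y = np.minimum(T, C)              # observed time Y_i = min(T_i, C_i)
delta = (T <= C).astype(int)      # failure indicator Delta_i = I(T_i <= C_i)

# The criterion is smaller near the true parameter than far away from it.
loss_true = neg_log_partial_likelihood(beta_true, X, Y, delta)
loss_far = neg_log_partial_likelihood(np.array([3.0, 3.0]), X, Y, delta)
```

The direct evaluation above costs O(n²); in practice the inner risk-set sums are accumulated in a single pass over the observations sorted by time, which is what makes subsampling r ≪ n observations so much cheaper than the full-data fit.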