Approximating Partial Likelihood Estimators
via Optimal Subsampling
Haixiang Zhang1, Lulu Zuo1, HaiYing Wang2and Liuquan Sun3
1Center for Applied Mathematics, Tianjin University, Tianjin 300072, China
2Department of Statistics, University of Connecticut, Storrs, Mansfield, CT 06269, USA
3Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
Abstract
With the growing availability of large-scale biomedical data, it is often time-consuming
or infeasible to directly perform traditional statistical analysis with relatively limited
computing resources at hand. We propose a fast subsampling method to effectively
approximate the full data maximum partial likelihood estimator in Cox’s model, which
largely reduces the computational burden when analyzing massive survival data. We
establish consistency and asymptotic normality of a general subsample-based estima-
tor. The optimal subsampling probabilities with explicit expressions are determined
via minimizing the trace of the asymptotic variance-covariance matrix for a linearly
transformed parameter estimator. We propose a two-step subsampling algorithm for
practical implementation, which significantly reduces computing time compared with the
full data method. The asymptotic properties of the resulting two-step subsample-based
estimator are also established. Extensive numerical experiments and
a real-world example are provided to assess our subsampling strategy. Supplemental
materials for this article are available online.
Keywords: Asymptotic normality; Empirical process; L-optimality criterion; Mas-
sive data; Survival analysis.
Corresponding author. Email: haixiang.zhang@tju.edu.cn (Haixiang Zhang)
arXiv:2210.04581v2 [stat.ME] 17 May 2023
1 Introduction
With the development of science and technology, the amount of available data has been
increasing rapidly in recent years. A major bottleneck in analyzing huge datasets is that the
data volume exceeds the capacity of available computational resources. It is not always possible
to meet the demands of computational speed and storage memory when traditional analyses
are performed directly on large datasets with a single computer. To cope with big data,
many statistical methods in the literature address the heavy computation and storage burden.
Broadly, these methods fall into three categories: (i)
divide-and-conquer approach (Zhao et al., 2016; Battey et al., 2018; Shi et al., 2018; Jordan
et al., 2019; Volgushev et al., 2019; Chen et al., 2022; Fan et al., 2021). (ii) online updating
approach (Schifano et al., 2016; Luo and Song, 2020; Lin et al., 2020; Luo et al., 2022;
Wang et al., 2022b). (iii) subsampling-based approach. Subsampling is an emerging
field for big data, and many related papers have been published in recent years. For example,
Wang et al. (2018) and Wang (2019) studied the optimal subsampling for massive logistic
regression. Wang et al. (2019) presented an information-based subdata selection approach
for linear regression with big datasets. Zhang et al. (2020) studied an effective sketching
method for massive datasets via A-optimal subsampling. Yao and Wang (2019), Han et al.
(2020) and Yao et al. (2021) proposed several subsampling methods for large-scale multiclass
logistic regression. Yu et al. (2022) considered optimal Poisson subsampling for maximum
quasi-likelihood estimator with massive data. Zhang et al. (2021) proposed a response-free
optimal sampling procedure for generalized linear models under measurement constraints.
Wang and Ma (2021) studied the optimal subsampling for quantile regression in big data.
Liu et al. (2021) proposed an optimal subsampling method for the functional linear model via
L-optimality criterion. Zhang and Wang (2021) and Zuo et al. (2021b) considered optimal
distributed subsampling methods for big data in the context of linear and logistic models,
respectively. Ai et al. (2021) studied the optimal subsampling method for generalized linear
models under the A-optimality criterion. Wang and Zhang (2022) proposed an optimal
subsampling procedure for multiplicative regression with massive data. For more related
results on massive data analysis, we refer to several review papers by Wang et al. (2016),
Lee and Ng (2020), Yao and Wang (2021), Chen et al. (2021b), Li and Meng (2021) and Yu
et al. (2023).
The aforementioned investigations focused on developing statistical methods for large
datasets with uncensored observations. In recent years, huge biomedical datasets have become
increasingly common, and they are often subject to censoring (Kleinbaum and Klein, 2005).
There have been several recent papers on statistical analysis of massive censored survival
data. For example, Xue et al. (2019) and Wu et al. (2021) studied the online updating
approach for streams of survival data. Keret and Gorfine (2020) presented an optimal Cox
regression subsampling procedure with rare events. Tarkhan and Simon (2020) and Xu
et al. (2020) used stochastic gradient descent algorithms to analyze large-scale survival
datasets with Cox’s model and the accelerated failure time models, respectively. Li et al.
(2020) proposed a batch screening iterative Lasso method for large-scale and ultrahigh-
dimensional Cox models. Zuo et al. (2021a) proposed a sampling-based method for massive
survival data with additive hazards model. Wang et al. (2021) studied an efficient divide-and-
conquer algorithm to fit high-dimensional Cox regression for massive datasets. Yang et al.
(2022) studied optimal subsampling algorithms for parametric accelerated failure time
models with massive survival data. Despite the aforementioned work, existing research
on massive survival data remains relatively limited, and it is worthwhile to further investigate
statistical theory in the area of large-scale survival analysis.
It is worth mentioning that subsampling is an emerging area of research, which has
attracted considerable attention in both statistics and computer science (Ma et al., 2015; Bai
et al., 2021). Subsampling methods focus on selecting a small proportion of the full data
as a surrogate to perform statistical computations. A key to the success of subsampling is
to design nonuniform sampling probabilities so that those influential or informative data
points are sampled with high probabilities. Although significant progress has been made
towards developing optimal subsampling theory for uncensored observations, to the best
of our knowledge, the research on optimal subsampling for large-scale survival data lags
behind. In consideration of the important role of Cox’s model in the field of survival analysis
(Cox, 1972; Fleming and Harrington, 1991), it is desirable to develop effective subsampling
methods in the context of Cox’s model for massive survival data. This paper aims to close this
gap by developing a subsample-based estimator to fast and effectively approximate the full
data maximum partial likelihood estimator. Our aim is to design an efficient subsampling
and estimation strategy to better balance the trade-off between computational efficiency
and statistical efficiency. Here are some key differences between our proposed subsampling
approach and some recently developed approaches on Cox’s model with large-scale data: (i)
Keret and Gorfine (2020) proposed a subsampling-based estimation for Cox’s model with
rare events by including all observed failures, while our optimal subsampling method is
developed for Cox’s model under the regular setting that observed failure times are not rare
compared with the observed censoring times. (ii) Tarkhan and Simon (2020) presented a
stochastic gradient descent (SGD) procedure for Cox’s model. This method is primarily intended
to resolve the problem that the whole dataset cannot be easily loaded into memory; its main
aim is to address the out-of-memory issue rather than to speed up the computation. (iii)
Li et al. (2020) and Wang et al. (2021) studied the variable selection problem for ultrahigh-
dimensional Cox’s model, which is different from the focus of our paper on dealing with very
large sample sizes.
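The inverse-probability weighting idea behind such nonuniform subsampling schemes can be sketched generically. The snippet below is only an illustration of subsampling with replacement in general, not the optimal procedure developed in this paper; the probabilities `probs` are a hypothetical placeholder that a concrete method would derive from the data:

```python
import numpy as np

def weighted_subsample(n, r, probs, rng=None):
    """Draw r indices with replacement using sampling probabilities `probs`
    (summing to 1), and return the indices together with inverse-probability
    weights 1 / (r * probs[i]), so that weighted sums over the subsample are
    unbiased for the corresponding full-data sums."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(n, size=r, replace=True, p=probs)
    weights = 1.0 / (r * probs[idx])
    return idx, weights

# Toy example: recover a full-data mean from a small weighted subsample.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
# Hypothetical "informativeness" probabilities, proportional to |x| + 0.01.
p = np.abs(x) + 0.01
p /= p.sum()
idx, w = weighted_subsample(len(x), 500, p, rng=1)
approx_mean = np.sum(w * x[idx]) / len(x)  # close to x.mean()
```

With probabilities proportional to a suitable informativeness measure, the weighted estimator stays unbiased while its variance can be far smaller than under uniform sampling, which is the motivation for choosing the probabilities optimally.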
The main contributions of our proposed subsampling method include three aspects: First,
the computation of our subsample-based estimator is much faster than that of the full data
estimator calculated by the standard R function coxph. Therefore, it effectively reduces the
computational burden when analyzing massive survival data with Cox’s model. Second, we
provide an explicit expression for the optimal subsampling distribution, which has much bet-
ter performance than the uniform subsampling distribution in terms of statistical efficiency.
Third, we establish consistency and asymptotic normality of the proposed subsample estima-
tor, which is useful for performing statistical inference (e.g. constructing confidence intervals
and testing hypotheses).
The remainder of this paper is organized as follows. In Section 2, we review the setup and
notation for Cox’s model. A general subsample-based estimator is proposed to approximate
the full data maximum partial likelihood estimator. In Section 3, we establish consistency
and asymptotic normality of a general subsample-based estimator. The optimal subsampling
probabilities are explicitly specified in the context of L-optimality criterion. In Section 4,
we give a two-step subsampling algorithm together with the asymptotic properties of the
resulting estimator. In Section 5, extensive simulations together with an application are
conducted to verify the validity of the proposed subsampling procedure. Some concluding
remarks are presented in Section 6. Technical proofs are given in the supplement.
2 Model and Subsample-Based Estimation
In many biomedical applications, the outcome of interest is measured as a “time-to-event”,
such as death or the onset of cancer. The time to the occurrence of an event is referred to as
a failure time (Kalbfleisch and Prentice, 2002), and a typical characteristic of such data is
that they are subject to possible right censoring. For i = 1, · · · , n, let T_i be the failure
time, C_i be the censoring time, and X_i be the p-dimensional vector of time-independent
covariates (e.g., treatment indicator, blood pressure, age, and gender). We assume that T_i
and C_i are conditionally independent given X_i. The observed failure time is
Y_i = min(T_i, C_i), and the failure indicator is ∆_i = I(T_i ≤ C_i), where I(·) is the
indicator function. For convenience, we denote the full data of independent and identically
distributed observations from the population as D_n = {(X_i, ∆_i, Y_i), i = 1, · · · , n}.
Cox’s proportional hazards regression model (Cox,
1972) is commonly used to describe the relationship between covariates of an individual and
the risk of experiencing an event. This model assumes that the conditional hazard rate
function of T_i given X_i is

$$\lambda(t \mid X_i) = \lambda_0(t)\,\exp(\beta' X_i), \qquad (1)$$
where λ_0(t) is an unknown baseline hazard function, β = (β_1, · · · , β_p)′ is a p-dimensional
vector of regression parameters, and its true value belongs to a compact set Θ ⊂ R^p. To
estimate β, Cox (1975) proposed a novel partial likelihood method. The negative log-partial
likelihood function is

$$\ell(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{\tau}\left[\beta' X_i - \log\left\{\sum_{j=1}^{n} I(Y_j \ge t)\,\exp(\beta' X_j)\right\}\right]\mathrm{d}N_i(t), \qquad (2)$$
where N_i(t) = I(∆_i = 1, Y_i ≤ t) is a counting process and τ is a prespecified positive
constant. One advantage of Cox’s partial likelihood method is that the criterion function
given in (2) does not involve the nonparametric baseline hazard function λ_0(t).
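To make the notation concrete, the following sketch simulates right-censored data under model (1) with a hypothetical constant baseline hazard λ_0(t) = 1 and evaluates the criterion (2) directly. It is a toy illustration under stated assumptions, not the implementation used in the paper:

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, Y, delta):
    """Direct O(n^2) evaluation of (2): each observed failure (delta_i = 1)
    contributes beta'X_i minus the log of the sum of exp(beta'X_j) over the
    risk set {j : Y_j >= Y_i}; the total is averaged over n and negated."""
    eta = X @ beta
    total = 0.0
    for i in np.flatnonzero(delta == 1):
        total += eta[i] - np.log(np.exp(eta[Y >= Y[i]]).sum())
    return -total / len(Y)

# Toy data from model (1) with lambda_0(t) = 1, so T_i is exponential with
# rate exp(beta'X_i); censoring times C_i are independent exponentials.
rng = np.random.default_rng(42)
n, beta_true = 2000, np.array([0.5, -0.5])
X = rng.normal(size=(n, 2))
T = rng.exponential(1.0 / np.exp(X @ beta_true))
C = rng.exponential(2.0, size=n)
Y = np.minimum(T, C)              # observed time Y_i = min(T_i, C_i)
delta = (T <= C).astype(int)      # failure indicator Delta_i = I(T_i <= C_i)

# The criterion is smaller near the true parameter than far away from it.
loss_true = neg_log_partial_likelihood(beta_true, X, Y, delta)
loss_far = neg_log_partial_likelihood(np.array([3.0, 3.0]), X, Y, delta)
```

The direct evaluation above costs O(n²); in practice the inner risk-set sums are accumulated in a single pass over the observations sorted by time, which is what makes subsampling r ≪ n observations so much cheaper than the full-data fit.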