Scalable Estimation and Inference for Censored Quantile Regression Process

Submitted to the Annals of Statistics
SCALABLE ESTIMATION AND INFERENCE FOR
CENSORED QUANTILE REGRESSION PROCESS
BY XUMING HE1, XIAOOU PAN2,†, KEAN MING TAN1,* AND WEN-XIN ZHOU2,‡
1Department of Statistics, University of Michigan, xmhe@umich.edu; *keanming@umich.edu
2Department of Mathematics, University of California, San Diego, †xip024@ucsd.edu; ‡wez243@ucsd.edu
Censored quantile regression (CQR) has become a valuable tool to study
the heterogeneous association between a possibly censored outcome and a
set of covariates, yet computation and statistical inference for CQR have re-
mained a challenge for large-scale data with many covariates. In this paper,
we focus on a smoothed martingale-based sequential estimating equations
approach, to which scalable gradient-based algorithms can be applied. Theo-
retically, we provide a unified analysis of the smoothed sequential estimator
and its penalized counterpart in increasing dimensions. When the covariate
dimension grows with the sample size at a sublinear rate, we establish the
uniform convergence rate (over a range of quantile indexes) and provide a
rigorous justification for the validity of a multiplier bootstrap procedure for
inference. In high-dimensional sparse settings, our results considerably im-
prove the existing work on CQR by relaxing an exponential term of sparsity.
We also demonstrate the advantage of the smoothed CQR over existing meth-
ods with both simulated experiments and data applications.
1. Introduction. Censored data are prevalent in many applications where the response
variable of interest is partially observed, mostly due to loss to follow-up. For instance, in
a lung cancer study considered by Shedden et al. [53], 46.6% of the lung cancer patients’
survival times are censored, due to either early withdrawal from the study or death from
causes unrelated to lung cancer. Two commonly used tools for studying the association
between the censored response and explanatory variables (covariates) are the
Cox proportional hazards model and the accelerated failure time (AFT) model [1,40].
Both models assume homogeneous covariate effects and are not applicable to cases in which
the lower and upper quantiles of the conditional distribution of the censored response, po-
tentially with different covariate effects, are of interest. Moreover, in many scientific studies,
higher or lower quantiles of the response variable are more of interest than the mean. To cap-
ture heterogeneous covariate effects and to better predict the response at different quantile
levels, various censored quantile regression (CQR) methods have been developed under dif-
ferent assumptions on the censoring mechanism [51,52,67,6,9,25,48,61,41,66,10,11].
We refer the reader to Chapters 6 and 7 in [34] as well as [47] for a comprehensive review of
censored quantile regression.
We consider the random right censoring mechanism, in which the censoring points are
unknown for the uncensored observations. Statistical methods for CQR were first proposed
under the stringent assumption that the uncensored response variable (not observable due to
censoring) is marginally independent of the censoring variable; see, for example, [67,25].
Under a more relaxed conditional independence assumption, conditioned on the covariates,
[48] generalized the Kaplan-Meier estimator for estimating the (univariate) survival function
MSC2020 subject classifications:Primary 62J05, 62J07; secondary 62F40.
Keywords and phrases: Censored quantile regression, Smoothing, High dimensional survival data, Non-
asymptotic theory, Weighted bootstrap.
arXiv:2210.12629v1 [math.ST] 23 Oct 2022
to the regression setting, based on Efron [14]’s redistribution-of-mass construction. From a
different perspective, [45] employed a martingale-based approach for fitting CQR, and the
resulting method has been shown to be closely related to [48]’s method [43,46]. Both [48]’s
and [45]’s methods, along with their variants, involve solving a series of quantile regression
problems that can be reformulated as linear programs, solvable by the simplex or interior
point method [3,49,38]. Statistical properties of the aforementioned methods have been well
studied, assuming that the number of covariates, p, is fixed [43,45,50,46]. To date, the
impact of dimensionality in the increasing-p regime, in which p is allowed to increase with
the number of observations, has remained unclear in the presence of censored outcomes.
In the high-dimensional setting in which p>n, convex and nonconvex penalty functions
are often employed to perform variable selection and to achieve a trade-off between statistical
bias and model complexity. While penalized Cox proportional hazards and AFT models have
been well studied [16,29,7,5], existing work on penalized CQR under the framework of
[48] and [45] in the high-dimensional setting is still lagging. Large-sample properties of
penalized CQR estimators were first derived under the fixed-p setting (p < n), mainly due
to the technical challenges introduced by the sequential nature of the procedure [55,62,65].
More recently, [70] studied a penalized CQR estimator, extending the method of [45] to the
high-dimensional setting (p > n). They showed that the estimation error (under the $\ell_2$-norm) of
the $\ell_1$-penalized CQR estimator is upper bounded by $O(\exp(Cs)\sqrt{s\log(p)/n})$ with high
probability, where C > 0 is a dimension-free constant. Compared to the $\ell_1$-penalized QR
for uncensored data [4], whose convergence rate is of order $O(\sqrt{s\log(p)/n})$, there is a
substantial gap in terms of the impact of the sparsity parameter s.
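To get a feel for the size of this gap, the two orders can be evaluated side by side. The sketch below is only illustrative: the constant C is unknown, so C = 1 is a placeholder assumption, the function names are hypothetical, and the values of p and n are borrowed from the gene expression example discussed later in the text.

```python
import math

def qr_rate(s, p, n):
    """Order of the l1-penalized QR rate: sqrt(s * log(p) / n)."""
    return math.sqrt(s * math.log(p) / n)

def cqr_rate_with_exp(s, p, n, C=1.0):
    """Same order multiplied by exp(C * s); C = 1.0 is a placeholder, not a known constant."""
    return math.exp(C * s) * qr_rate(s, p, n)

p, n = 22283, 442  # dimensions of the gene expression data described in the text
for s in (1, 2, 5, 10):
    # columns: sparsity, rate without the exponential factor, rate with it
    print(s, round(qr_rate(s, p, n), 3), round(cqr_rate_with_exp(s, p, n), 3))
```

The ratio between the two columns is exp(C s) by construction, so under this placeholder constant the bound with the exponential factor becomes uninformative already for moderate s.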
In addition to the above theoretical issues, our study is motivated by the computational
hardness of CQR under the framework of [48] and [45] for problems with large dimension.
Recall that this framework involves fitting a series of quantile regressions sequentially over
a dense grid of quantile indexes, each of which is solvable by the Frisch-Newton algorithm
with computational complexity that grows as a cubic function of p [49]. Moreover, under the
regime in which p < n, the asymptotic covariance matrix of the estimator is rather compli-
cated and thus resampling methods are often used to perform statistical inference [48,45].
A sample-based inference procedure (without resampling) for Peng-Huang’s estimator [45]
is available by adapting the plug-in covariance estimation method from [56]. In the
high-dimensional setting (p > n), computation of the $\ell_1$-penalized QR is based on either
reformulation as linear programs [39] or alternating direction method of multipliers algorithms
[68,22]. These algorithms are generic and applicable to a broad spectrum of problems but
lack scalability. Since the $\ell_1$-penalized CQR not only requires the estimation of the whole
quantile regression process, but also relies on cross-validation to select the sequence of
(mostly different) penalty levels, the state-of-the-art methods [70,17] can be highly
inefficient when applied to large-p problems.
To illustrate the computational challenge for CQR, we compare the $\ell_1$-penalized CQR
proposed by Zheng et al. [70] and our proposed method by analyzing a gene expression dataset
studied in [53]. In this study, 22,283 genes from 442 lung adenocarcinomas are incorporated
to predict the survival time in lung cancer, with 46.6% of the subjects censored. We implement
both methods with the quantile grid set as {0.1, 0.11, ..., 0.7}, and use a predetermined
sequence of regularization parameters. For Zheng et al. [70], we use the rqPen package to
compute the $\ell_1$-penalized QR estimator at each quantile level [54]. The computational time
and maximum allocated memory are reported in Table 1. The reference machine for this
experiment is a worker node with 2.5 GHz 32-core processor and 512 GB of memory in a
high-performance computing cluster.
Methods                   Runtime       Allocated memory
$\ell_1$-penalized CQR    170 hours+    38 GB
Proposed method           2 minutes     926 MB

TABLE 1
Computational runtime and maximum allocated memory for fitting $\ell_1$-penalized CQR and the proposed method
on the gene expression data with censored response in [53]. One gigabyte (GB) equals 1024 megabytes (MB).

In this paper, we develop a smoothed framework for CQR that is scalable to problems
with large dimension p in both low- and high-dimensional settings. Our proposed method
is motivated by the smoothed estimating equation approach that has surfaced mostly in the
econometrics literature [63,64,31,12,18,23], which can be applied to the stochastic inte-
gral based sequential estimation procedure proposed by Peng and Huang [45] for CQR. We
show in Section 2.2 that the smoothed sequential estimating equations method can be refor-
mulated as solving a sequence of optimization problems with (at least) twice-differentiable
and convex loss functions for which gradient-based algorithms are available. Large-scale sta-
tistical inference can then be performed efficiently via multiplier/weighted bootstrap. In the
high-dimensional setting, we propose and analyze $\ell_1$-penalized smoothed CQR estimators
obtained by sequentially minimizing smoothed convex loss functions plus an $\ell_1$-penalty, which
we solve using a scalable and efficient majorize-minimization-type algorithm, as evidenced
in Table 1.
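The role of smoothing can be illustrated in the simplest possible case: a single quantile of an uncensored sample. The sketch below is not the construction of Section 2.2; it assumes a logistic kernel, a fixed bandwidth h = 0.2 and plain gradient descent, and the function names are hypothetical. It replaces the indicator in the quantile estimating equation by a smooth surrogate, which turns the problem into minimizing a convex, twice-differentiable loss.

```python
import numpy as np

def smoothed_indicator(u, h):
    """Logistic-kernel surrogate for 1{u >= 0}; approaches the indicator as h -> 0."""
    return 1.0 / (1.0 + np.exp(-u / h))

def smoothed_quantile(y, tau, h=0.2, n_iter=300):
    """Gradient descent on a smoothed check loss for a scalar quantile b."""
    y = np.asarray(y, dtype=float)
    b = y.mean()          # starting value
    step = 4.0 * h        # 1/L: the second derivative of the loss is at most 1/(4h)
    for _ in range(n_iter):
        # gradient of the smoothed loss: mean of smoothed 1{y_i <= b} minus tau
        grad = smoothed_indicator(b - y, h).mean() - tau
        b -= step * grad
    return b

rng = np.random.default_rng(0)
sample = rng.standard_normal(2000)
print(smoothed_quantile(sample, 0.5), np.median(sample))
```

Because the gradient is a smooth, nondecreasing function of b, first-order methods apply directly, which is the property the smoothed framework exploits at scale.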
Theoretically, we provide a unified analysis for the proposed smoothed estimator in both
low- and high-dimensional settings. In the low-dimensional case where the dimension is al-
lowed to increase with the sample size, we establish the uniform rate of convergence and
a uniform Bahadur-type representation for the smoothed CQR estimator. We also provide a
rigorous justification for the validity of a weighted/multiplier bootstrap procedure with ex-
plicit error bounds as functions of (n, p). To our knowledge, these are the first results for
censored quantile regression in the increasing-p regime with p < n. The main challenges are
as follows. To fit the QR process with censored response variables, the stochastic integral
based approach entails a sequence of estimating equations that correspond to a prespecified
grid of quantile indexes. A sequence of pointwise estimators can then be sequentially ob-
tained by solving these equations. The sequential nature of this procedure poses technical
challenges because at each quantile level, the objective function (or the estimating equation)
depends on all of the previous estimates. To establish convergence rates for the estimated
regression process, a delicate analysis beyond what is used in [23] is required to deal with
the accumulated estimation error sequentially. The mesh width of the grid should converge
to zero at a proper rate in order to balance the accumulated estimation error and discretiza-
tion error. In the high-dimensional setting, we show that with suitably chosen penalty levels
and bandwidth, the $\ell_1$-penalized smoothed CQR estimator has a uniform convergence rate of
$O(\sqrt{s\log(p)/n})$, provided the sample size satisfies $n \gtrsim s^3\log(p)$. The technical arguments
used in this case are also very different from those in [70] and subsequent work [17], and as
a result, our conclusion improves that of Zheng et al. [70] by relaxing the exponential term
$\exp(Cs)$ in the convergence rate to a linear term in s. Such an improvement is significant
when the effective model size s is allowed to grow with n and p in the context of censored
quantile regression.
The rest of the article is organized as follows. In Section 2, we provide a formal formula-
tion of the CQR. We then briefly review the martingale-based estimating equation estimator
proposed by Peng and Huang [45] in Section 2.1. The proposed smoothed CQR is detailed
in Section 2.2, along with the multiplier bootstrap method for large-scale inference in Section 2.3.
We then provide a comprehensive theoretical analysis for the smoothed CQR estimator
in Section 3 and its bootstrap counterpart. In Section 4, we generalize the smoothed
CQR to the high-dimensional setting by incorporating a penalty function to the smoothed
CQR loss and study the theoretical properties of the regularized estimator. Extensive numerical
studies and data applications are in Sections 5 and 6. The R code that implements the
proposed method is available at https://github.com/XiaoouPan/scqr.
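NOTATION. For any two real numbers a and b, we write $a \wedge b = \min\{a, b\}$ and
$a \vee b = \max\{a, b\}$. Given a pair of vectors $u, v \in \mathbb{R}^p$, we use $u^T v$ and $\langle u, v \rangle$ interchangeably to
denote their inner product. For a positive semi-definite matrix $\Sigma \in \mathbb{R}^{p \times p}$, we define the
$\Sigma$-induced $\ell_2$-norm $\|u\|_\Sigma = \|\Sigma^{1/2} u\|_2$ for any $u \in \mathbb{R}^p$. For every $r \ge 0$, we use
$\mathbb{B}^p(r) = \{\beta \in \mathbb{R}^p : \|\beta\|_2 \le r\}$ and $\mathbb{S}^{p-1}(r) = \{\beta \in \mathbb{R}^p : \|\beta\|_2 = r\}$ to denote the Euclidean ball and sphere,
respectively, with radius r. In particular, we write $\mathbb{S}^{p-1} = \mathbb{S}^{p-1}(1)$. Given an event/subset A,
$\mathbb{1}\{A\}$ or $\mathbb{1}_A$ represents the indicator function of this event/subset. For two non-negative
arrays $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$, we write $a_n \lesssim b_n$ if $a_n \le C b_n$ for some constant C > 0
independent of n, $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $a_n \gtrsim b_n$.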
2. Censored Quantile Regression. Let $z \in \mathbb{R}$ be a response variable of interest, and
$x = (x_1, \ldots, x_p)^T$ be a p-vector ($p \ge 2$) of random covariates with $x_1 \equiv 1$. In this work,
we focus on a global conditional quantile model on z described as follows. Given a closed
interval $[\tau_L, \tau_U] \subseteq (0, 1)$, assume that the $\tau$-th conditional quantile of z given x takes the
form
$$F^{-1}_{z|x}(\tau) = x^T \beta(\tau) \quad \text{for any } \tau \in [\tau_L, \tau_U], \tag{1}$$
where $\beta(\tau) \in \mathbb{R}^p$, formulated as a function of $\tau$, is the unknown vector of regression
coefficients.
We assume that z is subject to right censoring by C, a random variable that is conditionally
independent of z given the covariates x. Let $y = z \wedge C$ be the censored outcome, and $\Delta = \mathbb{1}(z \le C)$
be an event indicator. The observed samples $\{y_i, \Delta_i, x_i\}_{i=1}^n$ consist of independent and
identically distributed (i.i.d.) replicates of the triplet $(y, \Delta, x)$. In addition, we assume at the
outset that the lowest quantile of interest $\tau_L$ satisfies $P\{y \le x^T \beta(\tau_L), \Delta = 0\} = 0$. This
condition, interpreted as no censoring below the $\tau_L$-th quantile, is commonly imposed in
the context of CQR; see, e.g., Condition C in [48] and Assumption 3.1 in [70]. Moreover,
our quantiles of interest are confined up to $\tau_U < 1$ subject to some identifiability concerns,
which is a subtle issue for CQR problems. Briefly speaking, the model (1) may become
non-identifiable as $\tau$ moves towards 1, due to the large amount of censored information in the upper
tail. In practice, determining $\tau_U$ is usually a compromise between the inference range of interest
and the data censoring rate, and $\tau_L$ can be chosen to be close to 0 if censoring occurs at early
stages. Theoretically, the above assumption on $\tau_L$ helps us simplify the technical analysis.
The above model is broadly defined, yet it is inspired by approaching survival data with
quantile regression [37]. To briefly illustrate, let T be a non-negative random variable
representing the failure time to an event. The conditional quantile model (1) on z = log(T) can be
viewed as a generalization of the standard AFT model in the sense that coefficients not only
shift the location but also affect the shape and dispersion of the conditional distributions.
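A small simulation may help fix ideas. The sketch below is an illustration under assumptions that are not from the paper: a location-scale design z = x^T gamma + (x^T eta) * eps with standard normal eps and x^T eta > 0, a uniform censoring variable, and hypothetical function names. Under this design model (1) holds with beta(tau) = gamma + eta * Q_eps(tau), so every coefficient with a nonzero entry in eta varies with tau; the AFT-type case, in which only the intercept varies, corresponds to eta = (1, 0, ..., 0). The design does not enforce the no-censoring condition at tau_L.

```python
import numpy as np
from statistics import NormalDist

def simulate_censored(n, gamma, eta, rng):
    """Draw (x, y, delta) plus the latent z from a location-scale model with right censoring."""
    gamma, eta = np.asarray(gamma, float), np.asarray(eta, float)
    p = gamma.size
    # first covariate is the intercept; the rest lie in [0, 1] so that x @ eta > 0
    x = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, size=(n, p - 1))])
    z = x @ gamma + (x @ eta) * rng.standard_normal(n)   # latent response
    c = rng.uniform(0.0, 6.0, size=n)                    # censoring variable (assumed design)
    y = np.minimum(z, c)                                 # observed outcome y = min(z, C)
    delta = (z <= c).astype(int)                         # event indicator 1(z <= C)
    return x, y, delta, z

def beta_true(tau, gamma, eta):
    """Quantile coefficients implied by the location-scale design."""
    return np.asarray(gamma, float) + np.asarray(eta, float) * NormalDist().inv_cdf(tau)

rng = np.random.default_rng(1)
gamma, eta = [1.0, 1.0], [1.0, 0.5]
x, y, delta, z = simulate_censored(5000, gamma, eta, rng)
print("censoring rate:", 1 - delta.mean())
print("beta(0.25):", beta_true(0.25, gamma, eta), "beta(0.75):", beta_true(0.75, gamma, eta))
```

Comparing beta(0.25) with beta(0.75) shows the slope changing across quantile levels, which is the heterogeneity a homogeneous-effect model cannot express.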
2.1. Martingale-based estimating equation estimator. Under the global linear model (1),
two well-known methods are the recursively re-weighted estimator of [48] and the stochastic
integral based estimating equation estimator of [45]. Both methods are grid-based algorithms
that iteratively solve a sequence of (weighted) check function minimization problems over a
predetermined grid of τ-values. Motivated by the recent success of smoothing methods for
uncensored quantile regressions [18,23,57], we propose a smoothed estimating equation
approach for CQR in the next subsection. We start with a brief introduction of [45]’s method
that is built upon the martingale structure of randomly censored data.
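To this end, denote by $\Lambda_{z|x}(t) = -\log\{1 - P(z \le t \mid x)\}$ the cumulative conditional hazard
function of z given x, and define the counting processes $N_i(t) = \mathbb{1}\{y_i \le t, \Delta_i = 1\}$
and $N_{0i}(t) = \mathbb{1}\{y_i \le t, \Delta_i = 0\}$ for i = 1, ..., n, where $\Delta_i = \mathbb{1}(z_i \le C_i)$. Define
$\mathcal{F}_i(s) = \sigma\{N_i(u), N_{0i}(u) : u \le s\}$ as the $\sigma$-algebra generated by the foregoing processes. Note that
$\{\mathcal{F}_i(s) : s \in \mathbb{R}\}$ is an increasing family of sub-$\sigma$-algebras, also known as a filtration, and $N_i(t)$
is an adapted sub-martingale. By the unique Doob-Meyer decomposition, one can construct
an $\mathcal{F}_i(t)$-martingale $M_i(t) = N_i(t) - \Lambda_{z_i|x_i}(y_i \wedge t)$ satisfying $E\{M_i(t) \mid x_i\} = 0$; see
Section 1.3 of [19] for details. Taking $t = x_i^T \beta(\tau)$ for each i, the martingale property implies
$$E\Big[ \sum_{i=1}^n \Big\{ N_i\big(x_i^T \beta(\tau)\big) - \Lambda_{z_i|x_i}\big(y_i \wedge x_i^T \beta(\tau)\big) \Big\} x_i \Big] = 0.$$
This lays the foundation for the stochastic integral based estimating equation approach. The
monotonicity of the function $\tau \mapsto x^T \beta(\tau)$, implied by the global linearity in (1), leads to
$$\Lambda_{z_i|x_i}\big(y_i \wedge x_i^T \beta(\tau)\big) = H(\tau) \wedge H\big(P(z_i \le y_i \mid x_i)\big) = \int_0^\tau \mathbb{1}\{y_i \ge x_i^T \beta(u)\} \, dH(u)$$
for $\tau \in [\tau_L, \tau_U]$, where $H(u) := -\log(1 - u)$ for 0 < u < 1. This motivates Peng and
Huang’s estimator [45], which solves the following estimating equation
$$\frac{1}{n} \sum_{i=1}^n \Big[ N_i\big(x_i^T \beta(\tau)\big) - \int_0^\tau \mathbb{1}\{y_i \ge x_i^T \beta(u)\} \, dH(u) \Big] x_i = 0, \quad \text{for every } \tau_L \le \tau \le \tau_U.$$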
However, the exact solution to the above equation is not directly obtainable. By adapting
Euler’s forward method for ordinary differential equations, [45] proposed a grid-based
sequential estimating procedure as follows. Let $\tau_L = \tau_0 < \tau_1 < \cdots < \tau_m = \tau_U$ be a grid
of quantile indices. Noting that $P\{y \le x^T \beta(\tau_0), \Delta = 0\} = 0$, we have
$E \int_0^{\tau_0} \mathbb{1}\{y_i \ge x_i^T \beta(u)\} \, dH(u) = \tau_0$, and hence $\beta(\tau_0)$ can be estimated by solving the usual quantile
equation $(1/n) \sum_{i=1}^n \{N_i(x_i^T \beta) - \tau_0\} x_i = 0$. Denote by $\tilde\beta(\tau_0)$ the solution to the above
equation. At grid points $\tau_k$, k = 1, ..., m, the estimators $\tilde\beta(\tau_k)$ are sequentially obtained
by solving
$$\frac{1}{n} \sum_{i=1}^n \Big[ N_i(x_i^T \beta) - \sum_{j=0}^{k-1} \int_{\tau_j}^{\tau_{j+1}} \mathbb{1}\{y_i \ge x_i^T \tilde\beta(\tau_j)\} \, dH(u) - \tau_0 \Big] x_i = 0. \tag{2}$$
The resulting estimated function $\tilde\beta(\cdot) : [\tau_L, \tau_U] \mapsto \mathbb{R}^p$ is right-continuous and piecewise
constant, jumping only at the grid points. Computationally, solving the above equation
is equivalent to minimizing an $\ell_1$-type convex objective function after introducing a sufficiently
large pseudo point to the data. The minimizer, however, is not always uniquely defined.
To avoid this lack of uniqueness as well as grid dependence, [28] introduced a more
general (population) integral equation, and then proposed a Progressive Localized Minimiza-
tion (PLMIN) algorithm to solve its empirical version exactly. This algorithm automatically
determines the breakpoints of the solution and thus is grid-free. Under a continuity condition
on the density functions (see, e.g. condition (C2) in [28]), the estimating functions used in
[45] and [28] are asymptotically equivalent.
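The recursion in (2) can be made concrete in a deliberately simplified case. The sketch below assumes an intercept-only model (x_i = 1, so p = 1, unlike the p >= 2 setting of the paper), in which each step reduces to picking an order statistic of the uncensored outcomes; the function names and the tie-breaking rule for the generalized solution are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def H(u):
    """H(u) = -log(1 - u), as defined in the text."""
    return -np.log1p(-u)

def peng_huang_intercept_only(y, delta, taus):
    """Grid-based sequential estimates of equation (2) when x_i = 1 for all i."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = y.size
    events = np.sort(y[delta == 1])                  # uncensored outcomes, sorted
    counts = np.arange(1, events.size + 1) / n       # (1/n) sum_i N_i(b) at b = events[k]

    def solve(target):
        # generalized solution: smallest uncensored y with mean N_i(b) >= target
        idx = np.searchsorted(counts, target, side="left")
        return events[min(idx, events.size - 1)]     # clamp if the target is unreachable

    offsets = np.full(n, taus[0])                    # the tau_0 term in (2)
    est = [solve(offsets.mean())]
    for k in range(1, len(taus)):
        step = H(taus[k]) - H(taus[k - 1])           # integral of dH over [tau_{k-1}, tau_k]
        offsets = offsets + (y >= est[-1]) * step    # adds the j = k-1 term of the sum
        est.append(solve(offsets.mean()))
    return np.array(est)

rng = np.random.default_rng(0)
taus = np.linspace(0.1, 0.7, 61)
z = rng.standard_normal(4000)
c = rng.uniform(-1.0, 3.0, size=4000)                # no censoring below the 0.1 quantile
y_obs, d_obs = np.minimum(z, c), (z <= c).astype(int)
est_cens = peng_huang_intercept_only(y_obs, d_obs, taus)
print(est_cens[[0, 30, 60]], np.quantile(z, taus[[0, 30, 60]]))
```

Each pass through the loop uses only the estimate from the previous grid point, which mirrors the sequential dependence that complicates the theoretical analysis; in the covariate case the call to `solve` would be replaced by a weighted quantile regression fit.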
2.2. A smoothed estimating equation approach. Due to the discontinuity stemming from
the indicator function in the counting process $N_i(\cdot)$, exact solutions to the estimating equations
(2) may not exist. In fact, $\tilde\beta(\tau_j)$ for j = 0, ..., m are defined as the general solutions
to generalized estimating equations [20], which correspond to subgradients of some convex