Scalable Estimation and Inference for Censored Quantile Regression Process


Submitted to the Annals of Statistics

SCALABLE ESTIMATION AND INFERENCE FOR CENSORED QUANTILE REGRESSION PROCESS

BY XUMING HE1, XIAOOU PAN2,†, KEAN MING TAN1,* AND WEN-XIN ZHOU2,‡

1Department of Statistics, University of Michigan, xmhe@umich.edu; *keanming@umich.edu

2Department of Mathematics, University of California, San Diego, †xip024@ucsd.edu; ‡wez243@ucsd.edu

Censored quantile regression (CQR) has become a valuable tool to study the heterogeneous association between a possibly censored outcome and a set of covariates, yet computation and statistical inference for CQR have remained a challenge for large-scale data with many covariates. In this paper, we focus on a smoothed martingale-based sequential estimating equations approach, to which scalable gradient-based algorithms can be applied. Theoretically, we provide a unified analysis of the smoothed sequential estimator and its penalized counterpart in increasing dimensions. When the covariate dimension grows with the sample size at a sublinear rate, we establish the uniform convergence rate (over a range of quantile indexes) and provide a rigorous justification for the validity of a multiplier bootstrap procedure for inference. In high-dimensional sparse settings, our results considerably improve the existing work on CQR by relaxing an exponential term of sparsity. We also demonstrate the advantage of the smoothed CQR over existing methods with both simulated experiments and data applications.

1. Introduction. Censored data are prevalent in many applications where the response variable of interest is partially observed, mostly due to loss to follow-up. For instance, in a lung cancer study considered by Shedden et al. [53], the survival times of 46.6% of the lung cancer patients are censored, due to either early withdrawal from the study or death from causes unrelated to lung cancer. Commonly used methods to study the association between the censored response and explanatory variables (covariates) are the Cox proportional hazards model and the accelerated failure time (AFT) model [1,40]. Both models assume homogeneous covariate effects and are not applicable to cases in which the lower and upper quantiles of the conditional distribution of the censored response, potentially with different covariate effects, are of interest. Moreover, in many scientific studies, higher or lower quantiles of the response variable are of more interest than the mean. To capture heterogeneous covariate effects and to better predict the response at different quantile levels, various censored quantile regression (CQR) methods have been developed under different assumptions on the censoring mechanism [51,52,67,6,9,25,48,61,41,66,10,11]. We refer the reader to Chapters 6 and 7 in [34] as well as [47] for a comprehensive review of censored quantile regression.

We consider the random right censoring mechanism, in which the censoring points are unknown for the uncensored observations. Statistical methods for CQR were first proposed under the stringent assumption that the uncensored response variable (not observable due to censoring) is marginally independent of the censoring variable; see, for example, [67,25]. Under a more relaxed conditional independence assumption, conditioned on the covariates, [48] generalized the Kaplan-Meier estimator for estimating the (univariate) survival function to the regression setting, based on Efron's [14] redistribution-of-mass construction. From a different perspective, [45] employed a martingale-based approach for fitting CQR, and the resulting method has been shown to be closely related to [48]'s method [43,46]. Both [48]'s and [45]'s methods, along with their variants, involve solving a series of quantile regression problems that can be reformulated as linear programs, solvable by the simplex or interior point method [3,49,38]. Statistical properties of the aforementioned methods have been well studied, assuming that the number of covariates, $p$, is fixed [43,45,50,46]. To date, the impact of dimensionality in the increasing-$p$ regime, in which $p$ is allowed to increase with the number of observations, has remained unclear in the presence of censored outcomes.

MSC2020 subject classifications: Primary 62J05, 62J07; secondary 62F40.
Keywords and phrases: Censored quantile regression, Smoothing, High dimensional survival data, Non-asymptotic theory, Weighted bootstrap.
arXiv:2210.12629v1 [math.ST] 23 Oct 2022

In the high-dimensional setting in which $p > n$, convex and nonconvex penalty functions are often employed to perform variable selection and to achieve a trade-off between statistical bias and model complexity. While penalized Cox proportional hazards and AFT models have been well studied [16,29,7,5], existing work on penalized CQR under the framework of [48] and [45] in the high-dimensional setting is still lagging. Large-sample properties of penalized CQR estimators were first derived under the fixed-$p$ setting ($p < n$), mainly due to the technical challenges introduced by the sequential nature of the procedure [55,62,65]. More recently, [70] studied a penalized CQR estimator, extending the method of [45] to the high-dimensional setting ($p > n$). They showed that the estimation error (under the $\ell_2$-norm) of the $\ell_1$-penalized CQR estimator is upper bounded by $O\big(\exp(Cs)\sqrt{s\log(p)/n}\big)$ with high probability, where $C > 0$ is a dimension-free constant. Compared to the $\ell_1$-penalized QR for uncensored data [4], whose convergence rate is of order $O\big(\sqrt{s\log(p)/n}\big)$, there is a substantial gap in terms of the impact of the sparsity parameter $s$.
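To get a feel for the size of this gap, the two rate expressions can be evaluated side by side. The sketch below is only an illustration: it sets all hidden constants to 1, takes $C = 1$ by default, and uses illustrative values of $n$ and $p$; the function names are hypothetical and do not come from the paper.

```python
import math


def rate_uncensored(s, p, n):
    """Order sqrt(s * log(p) / n) of the l1-penalized QR bound (constants set to 1)."""
    return math.sqrt(s * math.log(p) / n)


def rate_with_exp_factor(s, p, n, C=1.0):
    """Order exp(C * s) * sqrt(s * log(p) / n); C = 1 is an assumption for illustration."""
    return math.exp(C * s) * rate_uncensored(s, p, n)


if __name__ == "__main__":
    n, p = 442, 22283  # illustrative sample size and dimension (assumption)
    for s in (1, 2, 5, 10):
        base = rate_uncensored(s, p, n)
        inflated = rate_with_exp_factor(s, p, n)
        print(f"s={s:2d}  sqrt-rate={base:.3f}  with exp(Cs) factor={inflated:.3e}")
```

Under these assumptions the ratio of the two expressions is exactly $\exp(Cs)$, so the bound with the exponential factor becomes uninformative already for moderate sparsity levels.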

In addition to the above theoretical issues, our study is motivated by the computational hardness of CQR under the framework of [48] and [45] for problems with large dimension. Recall that this framework involves fitting a series of quantile regressions sequentially over a dense grid of quantile indexes, each of which is solvable by the Frisch-Newton algorithm with computational complexity that grows as a cubic function of $p$ [49]. Moreover, under the regime in which $p < n$, the asymptotic covariance matrix of the estimator is rather complicated and thus resampling methods are often used to perform statistical inference [48,45]. A sample-based inference procedure (without resampling) for Peng-Huang's estimator [45] is available by adapting the plug-in covariance estimation method from [56]. In the high-dimensional setting ($p > n$), computation of the $\ell_1$-penalized QR is based on either reformulation as linear programs [39] or alternating direction method of multipliers algorithms [68,22]. These algorithms are generic and applicable to a broad spectrum of problems but lack scalability. Since the $\ell_1$-penalized CQR not only requires the estimation of the whole quantile regression process, but also relies on cross-validation to select the sequence of (mostly different) penalty levels, the state-of-the-art methods [70,17] can be highly inefficient when applied to large-$p$ problems.

To illustrate the computational challenge for CQR, we compare the $\ell_1$-penalized CQR proposed by Zheng et al. [70] and our proposed method by analyzing a gene expression dataset studied in [53]. In this study, 22,283 genes from 442 lung adenocarcinomas are incorporated to predict the survival time in lung cancer, with 46.6% of the subjects censored. We implement both methods with the quantile grid set as {0.1, 0.11, ..., 0.7}, and use a predetermined sequence of regularization parameters. For Zheng et al. [70], we use the rqPen package to compute the $\ell_1$-penalized QR estimator at each quantile level [54]. The computational time and maximum allocated memory are reported in Table 1. The reference machine for this experiment is a worker node with a 2.5 GHz 32-core processor and 512 GB of memory in a high-performance computing cluster.

| Methods | Runtime | Allocated memory |
|---|---|---|
| $\ell_1$-penalized CQR | 170 hours+ | 38 GB |
| Proposed method | 2 minutes | 926 MB |

TABLE 1. Computational runtime and maximum allocated memory for fitting $\ell_1$-penalized CQR and the proposed method on the gene expression data with censored response in [53]. One gigabyte (GB) equals 1024 megabytes (MB).

In this paper, we develop a smoothed framework for CQR that is scalable to problems with large dimension $p$ in both low- and high-dimensional settings. Our proposed method is motivated by the smoothed estimating equation approach that has surfaced mostly in the econometrics literature [63,64,31,12,18,23], which can be applied to the stochastic integral based sequential estimation procedure proposed by Peng and Huang [45] for CQR. We show in Section 2.2 that the smoothed sequential estimating equations method can be reformulated as solving a sequence of optimization problems with (at least) twice-differentiable and convex loss functions for which gradient-based algorithms are available. Large-scale statistical inference can then be performed efficiently via multiplier/weighted bootstrap. In the high-dimensional setting, we propose and analyze $\ell_1$-penalized smoothed CQR estimators obtained by sequentially minimizing smoothed convex loss functions plus $\ell_1$-penalty, which we solve using a scalable and efficient majorize-minimization-type algorithm, as evidenced in Table 1.
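The reason smoothing opens the door to gradient-based algorithms can be seen on a single indicator. The sketch below shows the general idea only and is not necessarily the authors' exact construction: it replaces $\mathbb{1}\{y \leq t\}$ with a Gaussian-CDF version at bandwidth $h$, so that the function becomes differentiable in $t$. The kernel choice, the bandwidth values and all function names are assumptions made for illustration.

```python
import math


def indicator(y, t):
    """Nonsmooth indicator 1{y <= t}."""
    return 1.0 if y <= t else 0.0


def smoothed_indicator(y, t, h):
    """Gaussian-CDF smoothing of 1{y <= t} with bandwidth h > 0 (illustrative kernel)."""
    return 0.5 * (1.0 + math.erf((t - y) / (h * math.sqrt(2.0))))


def smoothed_indicator_grad(y, t, h):
    """Derivative of the smoothed indicator in t: a Gaussian density with scale h."""
    u = (t - y) / h
    return math.exp(-0.5 * u * u) / (h * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    y, t = 0.0, 0.05
    for h in (1.0, 0.1, 0.01):
        print(h, indicator(y, t), smoothed_indicator(y, t, h), smoothed_indicator_grad(y, t, h))
```

As $h$ shrinks, the smoothed value approaches the indicator, while for any fixed $h > 0$ the derivative exists everywhere, which is what a gradient-based solver needs.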

Theoretically, we provide a unified analysis for the proposed smoothed estimator in both low- and high-dimensional settings. In the low-dimensional case where the dimension is allowed to increase with the sample size, we establish the uniform rate of convergence and a uniform Bahadur-type representation for the smoothed CQR estimator. We also provide a rigorous justification for the validity of a weighted/multiplier bootstrap procedure with explicit error bounds as functions of $(n, p)$. To our knowledge, these are the first results for censored quantile regression in the increasing-$p$ regime with $p < n$. The main challenges are as follows. To fit the QR process with censored response variables, the stochastic integral based approach entails a sequence of estimating equations that correspond to a prespecified grid of quantile indexes. A sequence of pointwise estimators can then be sequentially obtained by solving these equations. The sequential nature of this procedure poses technical challenges because at each quantile level, the objective function (or the estimating equation) depends on all of the previous estimates. To establish convergence rates for the estimated regression process, a delicate analysis beyond what is used in [23] is required to deal with the accumulated estimation error sequentially. The mesh width of the grid should converge to zero at a proper rate in order to balance the accumulated estimation error and discretization error. In the high-dimensional setting, we show that with suitably chosen penalty levels and bandwidth, the $\ell_1$-penalized smoothed CQR estimator has a uniform convergence rate of $O\big(\sqrt{s\log(p)/n}\big)$, provided the sample size satisfies $n \gtrsim s^3\log(p)$. The technical arguments used in this case are also very different from those in [70] and subsequent work [17], and as a result, our conclusion improves that of Zheng et al. [70] by relaxing the exponential term $\exp(Cs)$ in the convergence rate to a linear term in $s$. Such an improvement is significant when the effective model size $s$ is allowed to grow with $n$ and $p$ in the context of censored quantile regression.

The rest of the article is organized as follows. In Section 2, we provide a formal formulation of the CQR. We then briefly review the martingale-based estimating equation estimator proposed by Peng and Huang [45] in Section 2.1. The proposed smoothed CQR is detailed in Section 2.2, along with the multiplier bootstrap method for large-scale inference in Section 2.3. We then provide a comprehensive theoretical analysis for the smoothed CQR estimator and its bootstrap counterpart in Section 3. In Section 4, we generalize the smoothed CQR to the high-dimensional setting by incorporating a penalty function to the smoothed CQR loss and study the theoretical properties of the regularized estimator. Extensive numerical studies and data applications are in Sections 5 and 6. The R code that implements the proposed method is available at https://github.com/XiaoouPan/scqr.

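NOTATION. For any two real numbers $a$ and $b$, we write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. Given a pair of vectors $u, v \in \mathbb{R}^p$, we use $u^{T}v$ and $\langle u, v \rangle$ interchangeably to denote their inner product. For a positive semi-definite matrix $\Sigma \in \mathbb{R}^{p \times p}$, we define the $\Sigma$-induced $\ell_2$-norm $\|u\|_{\Sigma} = \|\Sigma^{1/2}u\|_2$ for any $u \in \mathbb{R}^p$. For every $r \geq 0$, we use $\mathbb{B}^p(r) = \{\beta \in \mathbb{R}^p : \|\beta\|_2 \leq r\}$ and $\mathbb{S}^{p-1}(r) = \{\beta \in \mathbb{R}^p : \|\beta\|_2 = r\}$ to denote the Euclidean ball and sphere, respectively, with radius $r$. In particular, we write $\mathbb{S}^{p-1} = \mathbb{S}^{p-1}(1)$. Given an event/subset $A$, $\mathbb{1}\{A\}$ or $\mathbb{1}_A$ represents the indicator function of this event/subset. For two non-negative arrays $\{a_n\}_{n \geq 1}$ and $\{b_n\}_{n \geq 1}$, we write $a_n \lesssim b_n$ if $a_n \leq Cb_n$ for some constant $C > 0$ independent of $n$, $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $a_n \gtrsim b_n$.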

2. Censored Quantile Regression. Let $z \in \mathbb{R}$ be a response variable of interest, and $x = (x_1, \ldots, x_p)^{T}$ be a $p$-vector ($p \geq 2$) of random covariates with $x_1 \equiv 1$. In this work, we focus on a global conditional quantile model on $z$ described as follows. Given a closed interval $[\tau_L, \tau_U] \subseteq (0, 1)$, assume that the $\tau$-th conditional quantile of $z$ given $x$ takes the form

$$F^{-1}_{z|x}(\tau) = x^{T}\beta^*(\tau) \quad \text{for any } \tau \in [\tau_L, \tau_U], \tag{1}$$

where $\beta^*(\tau) \in \mathbb{R}^p$, formulated as a function of $\tau$, is the unknown vector of regression coefficients.

We assume that $z$ is subject to right censoring by $C$, a random variable that is conditionally independent of $z$ given the covariates $x$. Let $y = z \wedge C$ be the censored outcome, and $\Delta = \mathbb{1}(z \leq C)$ be an event indicator. The observed samples $\{y_i, \Delta_i, x_i\}_{i=1}^n$ consist of independent and identically distributed (i.i.d.) replicates of the triplet $(y, \Delta, x)$. In addition, we assume at the outset that the lowest quantile of interest $\tau_L$ satisfies $P\{y \leq x^{T}\beta^*(\tau_L), \Delta = 0\} = 0$. This condition, interpreted as no censoring below the $\tau_L$-th quantile, is commonly imposed in the context of CQR; see, e.g., Condition C in [48] and Assumption 3.1 in [70]. Moreover, our quantiles of interest are confined up to $\tau_U < 1$ subject to some identifiability concerns, which is a subtle issue for CQR problems. Briefly speaking, the model (1) may become non-identifiable as $\tau$ moves towards 1, due to the large amount of censored information in the upper tail. In practice, determining $\tau_U$ is usually a compromise between the inference range of interest and the data censoring rate, and $\tau_L$ can be chosen to be close to 0 if censoring occurs at early stages. Theoretically, the above assumption on $\tau_L$ helps us simplify the technical analysis.

The above model is broadly defined, yet it is inspired by approaching survival data with quantile regression [37]. To briefly illustrate, let $T$ be a non-negative random variable representing the failure time to an event. The conditional quantile model (1) on $z = \log(T)$ can be viewed as a generalization of the standard AFT model in the sense that coefficients not only shift the location but also affect the shape and dispersion of the conditional distributions.
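A small simulation can make the setup concrete. The sketch below draws data from a location-scale design, which is one example satisfying the global linear model (1), and then applies random right censoring. The specific design (uniform covariate, Gaussian noise, shifted exponential censoring times), the parameter values and the function names are all assumptions chosen for illustration and are not taken from the paper.

```python
import random
from statistics import NormalDist


def beta_star(tau, gamma=(1.0, 2.0), eta=0.5):
    """True coefficients under z = gamma0 + gamma1*x2 + (1 + eta*x2)*eps, eps ~ N(0, 1)."""
    q = NormalDist().inv_cdf(tau)  # tau-th quantile of the noise
    return (gamma[0] + q, gamma[1] + eta * q)


def simulate(n, gamma=(1.0, 2.0), eta=0.5, c_shift=3.0, seed=0):
    """Return a list of observed triplets (y, delta, x) with x = (1, x2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x2 = rng.random()                      # covariate in [0, 1)
        x = (1.0, x2)                          # first coordinate is the intercept
        z = gamma[0] + gamma[1] * x2 + (1.0 + eta * x2) * rng.gauss(0.0, 1.0)
        c = c_shift + rng.expovariate(1.0)     # censoring time, independent of z given x
        y = min(z, c)                          # censored outcome y = z ^ C
        delta = 1 if z <= c else 0             # event indicator 1(z <= C)
        data.append((y, delta, x))
    return data


if __name__ == "__main__":
    sample = simulate(2000)
    rate = sum(1 - d for _, d, _ in sample) / len(sample)
    print("censoring rate:", round(rate, 3))
    print("beta*(0.1):", beta_star(0.1), " beta*(0.5):", beta_star(0.5))
```

In this design every censoring time is at least 3, while the conditional $\tau$-th quantile stays below 3 for all $\tau < 0.5$, so the no-censoring-below-$\tau_L$ condition holds for any $\tau_L < 0.5$. The slope coefficient varies with $\tau$, which is the kind of heterogeneous effect an AFT model with a common shift cannot represent.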

2.1. Martingale-based estimating equation estimator. Under the global linear model (1), two well-known methods are the recursively re-weighted estimator of [48] and the stochastic integral based estimating equation estimator of [45]. Both methods are grid-based algorithms that iteratively solve a sequence of (weighted) check function minimization problems over a predetermined grid of $\tau$-values. Motivated by the recent success of smoothing methods for uncensored quantile regressions [18,23,57], we propose a smoothed estimating equation approach for CQR in the next subsection. We start with a brief introduction of [45]'s method that is built upon the martingale structure of randomly censored data.


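To this end, denote by $\Lambda_{z|x}(t) = -\log\{1 - P(z \leq t \mid x)\}$ the cumulative conditional hazard function of $z$ given $x$, and define the counting processes $N_i(t) = \mathbb{1}\{y_i \leq t, \Delta_i = 1\}$ and $N_{0i}(t) = \mathbb{1}\{y_i \leq t, \Delta_i = 0\}$ for $i = 1, \ldots, n$, where $\Delta_i = \mathbb{1}(z_i \leq C_i)$. Define $\mathcal{F}_i(s) = \sigma\{N_i(u), N_{0i}(u) : u \leq s\}$ as the $\sigma$-algebra generated by the foregoing processes. Note that $\{\mathcal{F}_i(s) : s \in \mathbb{R}\}$ is an increasing family of sub-$\sigma$-algebras, also known as a filtration, and $N_i(t)$ is an adapted sub-martingale. By the unique Doob-Meyer decomposition, one can construct an $\mathcal{F}_i(t)$-martingale $M_i(t) = N_i(t) - \Lambda_{z_i|x_i}(y_i \wedge t)$ satisfying $E\{M_i(t) \mid x_i\} = 0$; see Section 1.3 of [19] for details. Taking $t = x_i^{T}\beta^*(\tau)$ for each $i$, the martingale property implies

$$E\Bigg[\sum_{i=1}^n \Big\{N_i\big(x_i^{T}\beta^*(\tau)\big) - \Lambda_{z_i|x_i}\big(y_i \wedge x_i^{T}\beta^*(\tau)\big)\Big\}x_i\Bigg] = 0.$$

This lays the foundation for the stochastic integral based estimating equation approach. The monotonicity of the function $\tau \mapsto x^{T}\beta^*(\tau)$, implied by the global linearity in (1), leads to

$$\Lambda_{z_i|x_i}\big(y_i \wedge x_i^{T}\beta^*(\tau)\big) = H(\tau) \wedge H\big(P(z_i \leq y_i \mid x_i)\big) = \int_0^{\tau} \mathbb{1}\{y_i \geq x_i^{T}\beta^*(u)\}\,dH(u)$$

for $\tau \in [\tau_L, \tau_U]$, where $H(u) := -\log(1 - u)$ for $0 < u < 1$. This motivates Peng and Huang's estimator [45], which solves the following estimating equation

$$\frac{1}{n}\sum_{i=1}^n \Bigg[N_i\big(x_i^{T}\beta(\tau)\big) - \int_0^{\tau} \mathbb{1}\{y_i \geq x_i^{T}\beta(u)\}\,dH(u)\Bigg]x_i = 0, \quad \text{for every } \tau_L \leq \tau \leq \tau_U.$$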
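The identity linking the cumulative hazard to the integral with respect to $H$, on which this estimating equation rests, can be checked numerically in a simple case. The sketch below assumes a covariate-free setting with $z \sim \mathrm{Exp}(1)$, for which the quantile function is $Q(u) = -\log(1 - u)$ and the cumulative hazard is $\Lambda(t) = t$; this distribution, the Riemann-Stieltjes discretization and the function names are illustrative assumptions.

```python
import math


def H(u):
    """H(u) = -log(1 - u) for 0 <= u < 1."""
    return -math.log(1.0 - u)


def exp_cdf(t):
    """CDF of the Exp(1) distribution (illustrative choice)."""
    return 1.0 - math.exp(-t)


def exp_quantile(u):
    """Quantile function of the Exp(1) distribution."""
    return -math.log(1.0 - u)


def integral_side(y, tau, quantile, m=100000):
    """Riemann-Stieltjes sum for the integral of 1{y >= Q(u)} dH(u) over [0, tau]."""
    total = 0.0
    for j in range(m):
        u0 = tau * j / m
        u1 = tau * (j + 1) / m
        if y >= quantile(u0):          # integrand evaluated at the left endpoint
            total += H(u1) - H(u0)     # increment of H over the cell
    return total


def closed_form(y, tau, cdf):
    """H(tau) ^ H(P(z <= y)), the middle expression of the identity."""
    return min(H(tau), H(cdf(y)))


if __name__ == "__main__":
    for y in (0.3, 2.0):
        print(y, integral_side(y, 0.5, exp_quantile), closed_form(y, 0.5, exp_cdf))
```

For Exp(1) both sides reduce to $\min\{y, H(\tau)\}$, which is also $\Lambda(y \wedge Q(\tau))$, so the two printed values should agree up to the discretization error of the grid.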

However, the exact solution to the above estimating equation is not directly obtainable. By adapting Euler's forward method for ordinary differential equations, [45] proposed a grid-based sequential estimating procedure as follows. Let $\tau_L = \tau_0 < \tau_1 < \cdots < \tau_m = \tau_U$ be a grid of quantile indices. Noting that $P\{y \leq x^{T}\beta^*(\tau_0), \Delta = 0\} = 0$, we have $E\int_0^{\tau_0} \mathbb{1}\{y_i \geq x_i^{T}\beta^*(u)\}\,dH(u) = \tau_0$, and hence $\beta^*(\tau_0)$ can be estimated by solving the usual quantile equation $(1/n)\sum_{i=1}^n \{N_i(x_i^{T}\beta) - \tau_0\}x_i = 0$. Denote by $\tilde\beta(\tau_0)$ the solution to the above equation. At grid points $\tau_k$, $k = 1, \ldots, m$, the estimators $\tilde\beta(\tau_k)$ are sequentially obtained by solving

$$\frac{1}{n}\sum_{i=1}^n \Bigg[N_i(x_i^{T}\beta) - \sum_{j=0}^{k-1}\int_{\tau_j}^{\tau_{j+1}} \mathbb{1}\{y_i \geq x_i^{T}\tilde\beta(\tau_j)\}\,dH(u) - \tau_0\Bigg]x_i = 0. \tag{2}$$

The resulting estimated function $\tilde\beta(\cdot) : [\tau_L, \tau_U] \mapsto \mathbb{R}^p$ is right-continuous and piecewise constant, jumping only at the grid points. Computationally, solving the above equation is equivalent to minimizing an $\ell_1$-type convex objective function after introducing a sufficiently large pseudo point to the data. The minimizer, however, is not always uniquely defined. To avoid this lack of uniqueness as well as grid dependence, [28] introduced a more general (population) integral equation, and then proposed a Progressive Localized Minimization (PLMIN) algorithm to solve its empirical version exactly. This algorithm automatically determines the breakpoints of the solution and thus is grid-free. Under a continuity condition on the density functions (see, e.g., condition (C2) in [28]), the estimating functions used in [45] and [28] are asymptotically equivalent.
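The sequential structure of (2) is easiest to see in a stripped-down case. The sketch below assumes an intercept-only model ($x_i \equiv 1$), which is outside the paper's setting of $p \geq 2$ and is used purely for illustration: each step then reduces to finding the smallest observed event time at which the empirical frequency of uncensored outcomes reaches a running target. With covariates, each step would instead require solving a weighted quantile regression problem. The function name and the handling of unattainable targets are assumptions.

```python
import math


def H(u):
    """H(u) = -log(1 - u) for 0 <= u < 1."""
    return -math.log(1.0 - u)


def sequential_intercept_cqr(y, delta, taus):
    """Grid-based sequential estimates at taus[0] < ... < taus[m], intercept-only case."""
    n = len(y)
    events = sorted(yi for yi, di in zip(y, delta) if di == 1)  # uncensored outcomes

    def solve(target):
        # Smallest event time b with (1/n) * #{y_i <= b, delta_i = 1} >= target.
        # This is a generalized solution: the step function may jump over the target.
        k = max(math.ceil(target * n - 1e-12), 1)
        if k > len(events):
            return None  # target not attainable from the uncensored observations
        return events[k - 1]

    estimates = [solve(taus[0])]   # step 0: usual quantile equation at tau_0
    offset = taus[0]               # running value of tau_0 plus the accumulated integrals
    for k in range(1, len(taus)):
        prev = estimates[-1]
        if prev is None:
            estimates.append(None)
            continue
        at_risk = sum(1 for yi in y if yi >= prev) / n       # (1/n) sum of 1{y_i >= b(tau_{k-1})}
        offset += at_risk * (H(taus[k]) - H(taus[k - 1]))    # integral of dH over the grid cell
        estimates.append(solve(offset))
    return estimates


if __name__ == "__main__":
    y = list(range(1, 101))        # toy data: no censoring
    delta = [1] * 100
    print(sequential_intercept_cqr(y, delta, [0.1, 0.2, 0.3, 0.4, 0.5]))
```

Each estimate depends on all earlier ones through the accumulated offset, which is the source of the error propagation discussed in the text. On uncensored toy data with a coarse grid the estimates track the sample quantiles only approximately, reflecting the discretization error of the forward Euler step.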

2.2. A smoothed estimating equation approach. Due to the discontinuity stemming from the indicator function in the counting process $N_i(\cdot)$, exact solutions to the estimating equations (2) may not exist. In fact, $\tilde\beta(\tau_j)$ for $j = 0, \ldots, m$ are defined as the general solutions to generalized estimating equations [20], which correspond to subgradients of some convex
