Factor-Augmented Regularized Model for

Hazard Regression

Pierre Bayle∗ and Jianqing Fan∗

Department of Operations Research and Financial Engineering

Princeton University

September 30, 2022

Abstract

A prevalent feature of high-dimensional data is the dependence among covariates, and model selection is known to be challenging when covariates are highly correlated. To perform model selection for the high-dimensional Cox proportional hazards model in the presence of correlated covariates with factor structure, we propose a new model, Factor-Augmented Regularized Model for Hazard Regression (FarmHazard), which builds upon latent factors that drive covariate dependence and extends Cox’s model. This new model generates procedures that operate in two steps by learning factors and idiosyncratic components from high-dimensional covariate vectors and then using them as new predictors. Cox’s model is a widely used semi-parametric model for survival analysis, where censored data and time-dependent covariates bring additional technical challenges. We prove model selection consistency and estimation consistency under mild conditions. We also develop a factor-augmented variable screening procedure to deal with strong correlations in ultra-high dimensional problems. Extensive simulations and real data experiments demonstrate that our procedures enjoy good performance and achieve better results on model selection, out-of-sample C-index and screening than alternative methods.

Keywords: High-dimensional, Cox’s proportional hazards model, Factor model, Model selection, Censored data, Variable screening.

∗The authors gratefully acknowledge the support of NIH grant 2R01-GM072611-14 and NSF grant DMS-2053832. Emails: pbayle@princeton.edu, jqfan@princeton.edu.

arXiv:2210.01067v1 [stat.ME] 3 Oct 2022

1 Introduction

An enormous volume of data is accessible in many fields, including biomedicine and clinical trials, and efficient and valid statistical methods are necessary to study it. In survival analysis, the outcome variable is time-to-event, such as biological death, relapse, failure of a mechanical engine or credit default, and often some observations are censored. For example, a study can come to an end while a fraction of subjects have not experienced the event of interest, or a subject can leave the study before its end. In this context, a widely used semi-parametric model is Cox’s proportional hazards model (Cox, 1972, 1975). Andersen and Gill (1982) formulated it into a counting process framework. In the fixed-dimension setting, Tsiatis (1981) and Andersen and Gill (1982) proved the consistency and asymptotic normality of the maximum partial likelihood estimator. Yet, modern datasets frequently have more predictors than samples. Despite the very large number of predictors, most of them are often irrelevant for explaining the outcome, leading to sparse models. Reducing high-dimensional data to the true set of relevant covariates is one of the most important tasks in high-dimensional statistics and is a challenge in the analysis of big data. Doing so makes models more interpretable and predictions more accurate. To this end, several regularized regression techniques have been extended to Cox’s proportional hazards model (Tibshirani, 1997; Fan and Li, 2002). Bradic et al. (2011) established model selection consistency and strong oracle properties for a large class of penalty functions in the ultra-high dimensional setting, with LASSO and SCAD as special cases. Huang et al. (2013) and Kong and Nan (2014) studied oracle inequalities for LASSO under different conditions.

When variables are correlated, most model selection techniques fail to recover the set of important predictors, in both high-dimensional and ultra-high dimensional settings. Fan et al. (2020a) suggested FarmSelect, a two-step procedure that learns factors and idiosyncratic components and uses them as new predictors, to overcome the dependence problem among covariates in the setting of $\ell_1$-penalized generalized linear models. An even more demanding task is to consider models that go beyond generalized linear models, such as Cox’s proportional hazards model, where censored data and time-dependent covariates bring additional technical challenges. To cope with correlation in high dimensions within the challenging survival analysis setting, we propose Factor-Augmented Regularized Model for Hazard Regression (FarmHazard). High-dimensional genomics and genetic data are naturally strongly correlated, and our model is designed to address this kind of issue. It has important applications, such as the prediction of the outcome of chemotherapy based on gene-expression profiles coming from DNA microarrays (Rosenwald et al., 2002).
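
The two-step idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a static (time-independent) design matrix, an approximate factor structure $x_i = B f_i + u_i$, a known number of factors `K`, and principal component analysis as the factor estimator. The function name `factor_decompose` and all variable names are hypothetical.

```python
import numpy as np

def factor_decompose(X, K):
    """Split a centered n x p matrix X into estimated factors F (n x K),
    loadings B (p x K) and idiosyncratic components U = X - F B^T.
    Illustrative PCA estimator; K is assumed known."""
    n, p = X.shape
    # Eigen-decomposition of the symmetric n x n Gram matrix X X^T.
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    # eigh returns eigenvalues in ascending order; keep the K largest.
    top = eigvecs[:, ::-1][:, :K]
    F = np.sqrt(n) * top      # normalization: F^T F / n = I_K
    B = X.T @ F / n           # least-squares loadings given F
    U = X - F @ B.T           # idiosyncratic residuals
    return F, B, U

# Synthetic data with K = 2 latent factors (illustrative values only).
rng = np.random.default_rng(0)
n, p, K = 200, 50, 2
F_true = rng.standard_normal((n, K))
B_true = rng.standard_normal((p, K))
X = F_true @ B_true.T + 0.5 * rng.standard_normal((n, p))
X = X - X.mean(axis=0)        # center the covariates

F_hat, B_hat, U_hat = factor_decompose(X, K)
# Step 2 would use the estimated components as new predictors
# in a regularized Cox regression.
W = np.hstack([U_hat, F_hat])
```

By construction the estimated idiosyncratic components are orthogonal to the estimated factors, which is what removes the common source of correlation from the new predictors in this sketch.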

In ultra-high dimensional problems, characterized by a dimension that grows with the sample size in a non-polynomial fashion, regularized regression faces multiple statistical and computational challenges (Fan et al., 2009). To remedy this, screening methods (Fan and Lv, 2008; Fan and Song, 2010; Wang and Leng, 2016) have been developed; they enjoy statistical guarantees and are computationally efficient. Fan et al. (2010) extended the key idea of sure independence screening to Cox’s model, and Zhao and Li (2012) provided theoretical support. Yet, screening methods tend to include too many variables when strong correlations exist among covariates (Fan and Lv, 2008; Wang and Leng, 2016). We propose a factor-augmented variable screening procedure that is able to deal with these strong correlations for Cox’s proportional hazards model.

The paper is organized as follows. Section 2 formulates the problem. In Section 3, we introduce FarmHazard and present properties of the estimated factors and idiosyncratic components. We provide the main theoretical guarantees in Section 4, and perform extensive simulations and real data experiments in Section 5. Proofs of the various results can be found in the Appendix.

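We introduce some notation used throughout the paper. For any integer $n$, we denote $[n] = \{1, \ldots, n\}$. $I_n$ denotes the $n \times n$ identity matrix and $0_n$ represents the all-zero vector in $\mathbb{R}^n$. For a vector $\gamma = (\gamma_1, \ldots, \gamma_m)^\top \in \mathbb{R}^m$ and $q \in \mathbb{N}^\star$, denote the $\ell_q$ norm $\|\gamma\|_q = (\sum_{i=1}^m |\gamma_i|^q)^{1/q}$ and $\|\gamma\|_\infty = \max_{i \le m} |\gamma_i|$. The support set $\mathrm{supp}(\gamma)$ is $\{i \in [m] : \gamma_i \neq 0\}$, and $\mathrm{sign}(\gamma)$ is the vector $(\mathrm{sign}(\gamma_i))_{i \in [m]}$, where $\mathrm{sign}(\gamma_i) = 1$, $0$, or $-1$ for $\gamma_i > 0$, $= 0$ or $< 0$, respectively. For a set or an event $A$, we use $I\{A\}$ to denote the indicator function of $A$. For a set $A$, let $|A|$ be its cardinality. For a matrix $M$, we denote by $\|M\|_{\max} = \max_{i,j} |M_{ij}|$ its max norm, and by $\|M\|_q$ its induced $q$-norm for $q \in \mathbb{N}^\star \cup \{\infty\}$. For $M \in \mathbb{R}^{n \times m}$, $I \subseteq [n]$ and $J \subseteq [m]$, define $M_{IJ} = (M_{ij})_{i \in I, j \in J}$, $M_{I\cdot} = (M_{ij})_{i \in I, j \in [m]}$ and $M_{\cdot J} = (M_{ij})_{i \in [n], j \in J}$. For a vector $\gamma \in \mathbb{R}^m$, define $\gamma^{\otimes 0} = 1$, $\gamma^{\otimes 1} = \gamma$, $\gamma^{\otimes 2} = \gamma\gamma^\top$, and $\gamma_S = (\gamma_i)_{i \in S}$ when $S \subseteq [m]$. Let $\nabla$ and $\nabla^2$ be the gradient and Hessian operators. For $f : \mathbb{R}^p \to \mathbb{R}$, $x \in \mathbb{R}^p$ and $I, J \subseteq [p]$, define $\nabla_I f(x) = (\nabla f(x))_I$ and $\nabla^2_{IJ} f(x) = (\nabla^2 f(x))_{IJ}$. $N(\mu, \Sigma)$ refers to the normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. For two numbers $a$ and $b$, $a \vee b$ and $a \wedge b$ denote their maximum and minimum, respectively.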
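
For readers who prefer to check notation numerically, the following small NumPy sketch evaluates several of these quantities on toy inputs. The arrays are arbitrary illustrative values, and NumPy indices are 0-based whereas the paper's index sets are 1-based.

```python
import numpy as np

gamma = np.array([3.0, 0.0, -4.0])
M = np.array([[1.0, -2.0],
              [3.0, 4.0]])

l1 = np.sum(np.abs(gamma))           # l_1 norm of gamma
l2 = np.sqrt(np.sum(gamma ** 2))     # l_2 norm of gamma
linf = np.max(np.abs(gamma))         # l_infinity norm of gamma
support = np.flatnonzero(gamma)      # supp(gamma), as 0-based indices
signs = np.sign(gamma)               # sign(gamma), entries in {1, 0, -1}

max_norm = np.max(np.abs(M))         # max norm: largest absolute entry
ind_1 = np.linalg.norm(M, 1)         # induced 1-norm: max absolute column sum
ind_inf = np.linalg.norm(M, np.inf)  # induced inf-norm: max absolute row sum

outer2 = np.outer(gamma, gamma)      # gamma^{⊗2} = gamma gamma^T
sub = M[np.ix_([0], [0, 1])]         # M_{IJ} with I = {1}, J = {1, 2} in paper indexing
```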

2 Problem Setup

2.1 Cox’s proportional hazards model

Let $T$, $C$ and $\{x(t) \in \mathbb{R}^p : 0 \le t \le \tau\}$ denote the survival time, censoring time and predictable covariate process, respectively, where $\tau < \infty$ is the study ending time. For each sample, only one of the survival and censoring times is observed, whichever happens first. Let $Z = T \wedge C$ be the observed time and $\delta = I\{T \le C\}$ be the censoring indicator. $T$ and $C$ are assumed to be conditionally independent given the covariates $\{x(t) : 0 \le t \le \tau\}$. The observed data is an independent and identically distributed (i.i.d.) sample $\{(\{x_i(t) : 0 \le t \le \tau\}, Z_i, \delta_i)\}_{i \in [n]}$ from the population $(\{x(t) : 0 \le t \le \tau\}, Z, \delta)$, and for simplicity it is assumed that there are no tied observations and that the covariates are centered.
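
As a hedged illustration of how the observed pair $(Z, \delta)$ arises from the latent times, the sketch below simulates right-censored data. The distributional choices are assumptions made only for this example: time-independent Gaussian covariates, a constant baseline hazard equal to 1, exponential censoring, and the specific values of `n`, `p`, `tau` and `beta`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, tau = 500, 5, 3.0
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                        # centered covariates
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # sparse coefficient vector (illustrative)

# With baseline hazard 1, T given x is exponential with rate exp(x^T beta).
T = rng.exponential(scale=1.0 / np.exp(X @ beta))
# Censoring times are drawn independently of T and capped at the study end tau.
C = np.minimum(rng.exponential(scale=2.0, size=n), tau)

Z = np.minimum(T, C)                          # observed time Z = T ∧ C
delta = (T <= C).astype(int)                  # 1 if the event is observed, 0 if censored
```

Only `X`, `Z` and `delta` would be available to the analyst; `T` and `C` are never jointly observed.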

Cox’s proportional hazards model (Cox, 1972, 1975) is a semi-parametric model widely used to model time-to-event outcomes. In this model, the conditional hazard function $\lambda(t \mid x(t))$ of the survival time $T$ at time $t$ given the covariate vector $x(t) \in \mathbb{R}^p$ is given by

\[
\lambda(t \mid x(t)) = \lambda_0(t) \exp(x(t)^\top \beta^\star), \tag{2.1}
\]

where $\lambda_0(\cdot)$ is a baseline hazard function and $\beta^\star = (\beta^\star_1, \ldots, \beta^\star_p)^\top \in \mathbb{R}^p$. The function $\lambda_0(\cdot)$ is unspecified in this semi-parametric model: it is a nuisance function, and $\beta^\star$ is the parameter vector of interest, which is assumed to be sparse.
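
A short worked step may help explain the name "proportional hazards". Under (2.1), the ratio of the hazards at time $t$ for two covariate values $x$ and $\tilde{x}$ (symbols introduced here only for illustration) does not involve the baseline hazard:

```latex
\[
\frac{\lambda(t \mid x)}{\lambda(t \mid \tilde{x})}
  = \frac{\lambda_0(t)\exp(x^\top \beta^\star)}{\lambda_0(t)\exp(\tilde{x}^\top \beta^\star)}
  = \exp\{(x - \tilde{x})^\top \beta^\star\}.
\]
```

This cancellation is why $\lambda_0(\cdot)$ can be left unspecified when the target is $\beta^\star$.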

Let $X(t) \in \mathbb{R}^{n \times p}$ be the design matrix at time $0 \le t \le \tau$, and $X = \{X(t) : 0 \le t \le \tau\}$. Let $N$ be the number of failures (satisfying $\delta = 1$) and $t_1 < \cdots < t_N$ be the ordered failure times. For $j \in [N]$, let $(j)$ denote the label of the sample failing at time $t_j$, i.e., the individual with the $j$th shortest survival time. The risk set $R_j = \{i : Z_i \ge t_j\}$ at time $t_j$ is the set of samples still at risk at $t_j$. Cox’s log-partial likelihood is given by

\[
Q(\beta; X, Z, \delta) = \sum_{j=1}^{N} \Big\{ x_{(j)}(t_j)^\top \beta - \log \sum_{i \in R_j} \exp(x_i(t_j)^\top \beta) \Big\}.
\]

Define the loss $L(\beta; X, Z, \delta) = -n^{-1} Q(\beta; X, Z, \delta)$. Note that $L$ depends on $X$ and $\beta$ only through the entries of the product $X\beta$. Throughout the paper, we will use the notation $L(X\beta)$, and its gradient and Hessian matrix will be taken with respect to $\beta$. Keeping the design matrix $X$ in the notation will be useful as soon as we introduce factor modeling.
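
The loss can be sketched in a few lines of NumPy. The version below is an illustration under simplifying assumptions (time-independent covariates, no ties, no numerical safeguards against overflow), and the function name `cox_loss` is hypothetical. It takes the linear predictors $X\beta$ as input, mirroring the fact that the loss depends on $X$ and $\beta$ only through that product.

```python
import numpy as np

def cox_loss(eta, Z, delta):
    """Negative log-partial likelihood divided by n.
    eta: linear predictors x_i^T beta, Z: observed times,
    delta: event indicators. Assumes no tied observations."""
    n = len(Z)
    total = 0.0
    for j in np.flatnonzero(delta == 1):   # loop over observed failures
        at_risk = Z >= Z[j]                # risk set at this failure time
        total += eta[j] - np.log(np.sum(np.exp(eta[at_risk])))
    return -total / n

# Toy data: three subjects, the second one is censored.
Z = np.array([2.0, 1.0, 3.0])
delta = np.array([1, 0, 1])
X = np.array([[0.5], [-0.5], [0.0]])
beta = np.array([1.0])
loss = cox_loss(X @ beta, Z, delta)
```

In this toy example the failure at time 2 has risk set {subject 1, subject 3} and the failure at time 3 has risk set {subject 3}, so the second failure contributes zero to the sum.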

2.2 Counting process formulation

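We adopt the counting process formulation of Andersen and Gill (1982). For $i \in [n]$ and $t \in [0, \tau]$, define the counting process $N_i(t) = I\{Z_i \le t, \delta_i = 1\}$ and the at-risk indicator process $Y_i(t) = I\{Z_i \ge t\}$. Let $N(t) = \sum_{i=1}^n N_i(t)$. Using this notation, the loss $L$ is given by

\[
L(X\beta) = -\frac{1}{n} \sum_{i=1}^{n} \int_0^\tau \{x_i(t)^\top \beta\} \, dN_i(t) + \frac{1}{n} \int_0^\tau \log\Big[ \sum_{i=1}^{n} Y_i(t) \exp(x_i(t)^\top \beta) \Big] \, dN(t).
\]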
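
Because each $N_i$ jumps by one at an observed failure time and is constant elsewhere, the integrals in this expression reduce to sums over failure times. The sketch below evaluates the loss that way; it assumes time-independent linear predictors and no ties, and the helper names `N_i`, `Y_i` and `cox_loss_counting` are hypothetical.

```python
import numpy as np

def N_i(t, Z, delta):
    """Counting processes N_i(t) = I{Z_i <= t, delta_i = 1}, for all i."""
    return ((Z <= t) & (delta == 1)).astype(int)

def Y_i(t, Z):
    """At-risk indicators Y_i(t) = I{Z_i >= t}, for all i."""
    return (Z >= t).astype(int)

def cox_loss_counting(eta, Z, delta):
    """Loss in counting process form, with integrals against dN_i and dN
    evaluated as sums over the jump times (observed failure times)."""
    n = len(Z)
    first, second = 0.0, 0.0
    for i in np.flatnonzero(delta == 1):
        t = Z[i]                           # N_i jumps by 1 at t = Z_i
        first += eta[i]
        second += np.log(np.sum(Y_i(t, Z) * np.exp(eta)))
    return -first / n + second / n

# Toy data: three subjects, the second one is censored.
Z = np.array([2.0, 1.0, 3.0])
delta = np.array([1, 0, 1])
eta = np.array([0.5, -0.5, 0.0])
loss = cox_loss_counting(eta, Z, delta)
counts_at_2_5 = N_i(2.5, Z, delta)         # failures observed by time 2.5
at_risk_at_2_5 = Y_i(2.5, Z)               # subjects still at risk at time 2.5
```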

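For $\ell \in \{0, 1, 2\}$, define the following quantities

\[
S^{(\ell)}(X, \beta, t) = \frac{1}{n} \sum_{i=1}^{n} Y_i(t) \{x_i(t)\}^{\otimes \ell} \exp(x_i(t)^\top \beta), \qquad s^{(\ell)}_x(\beta, t) = E[S^{(\ell)}(X, \beta, t)].
\]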

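To simplify future notation, also define the following

\[
V(X, \beta, t) = \frac{S^{(2)}(X, \beta, t)}{S^{(0)}(X, \beta, t)} - \left[ \frac{S^{(1)}(X, \beta, t)}{S^{(0)}(X, \beta, t)} \right]^{\otimes 2},
\qquad
v_x(\beta, t) = \frac{s^{(2)}_x(\beta, t)}{s^{(0)}_x(\beta, t)} - \left[ \frac{s^{(1)}_x(\beta, t)}{s^{(0)}_x(\beta, t)} \right]^{\otimes 2}. \tag{2.2}
\]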
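
The empirical quantities $S^{(0)}$, $S^{(1)}$, $S^{(2)}$ and $V$ can be computed directly from their definitions. The following is a minimal sketch assuming a time-independent $n \times p$ design matrix; the function name `S_quantities` is hypothetical. Written this way, $V$ is a weighted covariance matrix of the covariates over the subjects at risk at time $t$, with weights proportional to $Y_i(t)\exp(x_i^\top \beta)$.

```python
import numpy as np

def S_quantities(X, beta, Z, t):
    """Return S^(0) (scalar), S^(1) (p-vector), S^(2) (p x p matrix) and V
    at time t, for a time-independent n x p design X. Requires at least
    one subject at risk at time t so that S^(0) > 0."""
    n = X.shape[0]
    w = (Z >= t) * np.exp(X @ beta)    # Y_i(t) exp(x_i^T beta)
    S0 = w.sum() / n
    S1 = X.T @ w / n
    S2 = (X.T * w) @ X / n             # (1/n) sum_i w_i x_i x_i^T
    ratio = S1 / S0
    V = S2 / S0 - np.outer(ratio, ratio)
    return S0, S1, S2, V

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 2))
Z = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
S0, S1, S2, V = S_quantities(X, np.zeros(2), Z, t=3.0)
```

With $\beta = 0$ all at-risk subjects receive equal weight, so in this example `V` coincides with the (biased) sample covariance of the covariates of the four subjects with $Z_i \ge 3$.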
