Factor-Augmented Regularized Model for
Hazard Regression
Pierre Bayle and Jianqing Fan
Department of Operations Research and Financial Engineering
Princeton University
September 30, 2022
Abstract
A prevalent feature of high-dimensional data is the dependence among covariates,
and model selection is known to be challenging when covariates are highly correlated.
To perform model selection for the high-dimensional Cox proportional hazards model in
the presence of correlated covariates with factor structure, we propose a new model,
Factor-Augmented Regularized Model for Hazard Regression (FarmHazard), which builds upon
latent factors that drive covariate dependence and extends Cox’s model. This new
model generates procedures that operate in two steps by learning factors and
idiosyncratic components from high-dimensional covariate vectors and then using them as new
predictors. Cox’s model is a widely used semi-parametric model for survival analysis,
where censored data and time-dependent covariates bring additional technical
challenges. We prove model selection consistency and estimation consistency under mild
conditions. We also develop a factor-augmented variable screening procedure to deal
with strong correlations in ultra-high dimensional problems. Extensive simulations and
real data experiments demonstrate that our procedures enjoy good performance and
achieve better results on model selection, out-of-sample C-index and screening than
alternative methods.
Keywords: High-dimensional, Cox’s proportional hazards model, Factor model, Model selection,
Censored data, Variable screening.
The authors gratefully acknowledge the support of NIH grant 2R01-GM072611-14 and NSF grant
DMS-2053832. Emails: pbayle@princeton.edu, jqfan@princeton.edu.
arXiv:2210.01067v1 [stat.ME] 3 Oct 2022
1 Introduction
An enormous volume of data is accessible in many fields, including biomedicine and clinical
trials, and efficient and valid statistical methods are necessary to study it. In survival
analysis, the outcome variable is time-to-event, such as biological death, relapse, failure of a
mechanical engine or credit default, and often some observations are censored. For example,
a study can come to an end while a fraction of subjects have not experienced the event
of interest, or a subject can leave the study before its end. In this context, a widely used
semi-parametric model is Cox’s proportional hazards model (Cox, 1972, 1975). Andersen
and Gill (1982) formulated it in a counting process framework. In the fixed-dimension
setting, Tsiatis (1981) and Andersen and Gill (1982) proved the consistency and asymptotic
normality of the maximum partial likelihood estimator. Yet, modern datasets frequently
have more predictors than samples. Despite the very large number of predictors, most of
them are often irrelevant for explaining the outcome, leading to sparse models. Reducing
high-dimensional data to the true set of relevant covariates is one of the most important tasks
in high-dimensional statistics and is a challenge in the analysis of big data. Doing so makes
models more interpretable and predictions more accurate. To this end, several regularized
regression techniques have been extended to Cox’s proportional hazards model (Tibshirani,
1997; Fan and Li, 2002). Bradic et al. (2011) established model selection consistency and
strong oracle properties for a large class of penalty functions in the ultra-high dimensional
setting, with LASSO and SCAD as special cases. Huang et al. (2013) and Kong and Nan
(2014) studied oracle inequalities for LASSO under different conditions.
When variables are correlated, most model selection techniques fail to recover the set
of important predictors, in both high-dimensional and ultra-high dimensional settings. Fan
et al. (2020a) suggested FarmSelect, a two-step procedure that learns factors and
idiosyncratic components and uses them as new predictors, to overcome the dependence problem
among covariates in the setting of $\ell_1$-penalized generalized linear models. An even more
demanding task is to consider models that go beyond generalized linear models, such as
Cox’s proportional hazards model, where censored data and time-dependent covariates bring
additional technical challenges. To cope with correlation in high dimensions within the
challenging survival analysis setting, we propose Factor-Augmented Regularized Model for
Hazard Regression (FarmHazard). High-dimensional genomics and genetic data are
naturally strongly correlated, and our model is designed to address this kind of issue. It has
important applications, such as the prediction of the outcome of chemotherapy based on
gene-expression profiles coming from DNA microarrays (Rosenwald et al., 2002).
In ultra-high dimensional problems, characterized by a dimension that grows with the
sample size in a non-polynomial fashion, regularized regression faces multiple statistical and
computational challenges (Fan et al., 2009). To remedy this, screening methods (Fan and Lv,
2008; Fan and Song, 2010; Wang and Leng, 2016) have been developed; they enjoy statistical
guarantees and are computationally efficient. Fan et al. (2010) extended the key idea of sure
independence screening to Cox’s model, and Zhao and Li (2012) provided theoretical support.
Yet, screening methods tend to include too many variables when strong correlations exist
among covariates (Fan and Lv, 2008; Wang and Leng, 2016). We propose a factor-augmented
variable screening procedure that is able to deal with these strong correlations for Cox’s
proportional hazards model.
The paper is organized as follows. Section 2 formulates the problem. In Section 3,
we introduce FarmHazard and present properties of the estimated factors and idiosyncratic
components. We provide the main theoretical guarantees in Section 4, and perform extensive
simulations and real data experiments in Section 5. Proofs of the various results can be found
in the Appendix.
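We introduce a few notations used throughout the paper. For any integer $n$, we denote
$[n] = \{1, \ldots, n\}$. $I_n$ denotes the $n \times n$ identity matrix and $0_n$ represents the all-zero vector
in $\mathbb{R}^n$. For a vector $\gamma = (\gamma_1, \ldots, \gamma_m)^\top \in \mathbb{R}^m$ and $q \in \mathbb{N}^\star$, denote the $\ell_q$ norm
$\|\gamma\|_q = (\sum_{i=1}^m |\gamma_i|^q)^{1/q}$ and $\|\gamma\|_\infty = \max_{i \le m} |\gamma_i|$. The support set $\mathrm{supp}(\gamma)$ is
$\{i \in [m] : \gamma_i \neq 0\}$, and $\mathrm{sign}(\gamma)$ is the vector $(\mathrm{sign}(\gamma_i))_{i \in [m]}$, where $\mathrm{sign}(\gamma_i) = 1$, $0$, or $-1$
for $\gamma_i > 0$, $= 0$ or $< 0$, respectively. For a set or an event $A$, we use $I\{A\}$ to denote the
indicator function of $A$. For a set $A$, let $|A|$ be its cardinality. For a matrix $M$, we denote by
$\|M\|_{\max} = \max_{i,j} |M_{ij}|$ its max norm, and by $\|M\|_q$ its induced $q$-norm for $q \in \mathbb{N}^\star \cup \{\infty\}$.
For $M \in \mathbb{R}^{n \times m}$, $I \subseteq [n]$ and $J \subseteq [m]$, define $M_{IJ} = (M_{ij})_{i \in I, j \in J}$, $M_{I\cdot} = (M_{ij})_{i \in I, j \in [m]}$
and $M_{\cdot J} = (M_{ij})_{i \in [n], j \in J}$. For a vector $\gamma \in \mathbb{R}^m$, define $\gamma^{\otimes 0} = 1$, $\gamma^{\otimes 1} = \gamma$,
$\gamma^{\otimes 2} = \gamma\gamma^\top$, and $\gamma_S = (\gamma_i)_{i \in S}$ when $S \subseteq [m]$. Let $\nabla$ and $\nabla^2$ be the gradient and
Hessian operators. For $f : \mathbb{R}^p \to \mathbb{R}$, $x \in \mathbb{R}^p$ and $I, J \subseteq [p]$, define $\nabla_I f(x) = (\nabla f(x))_I$ and
$\nabla^2_{IJ} f(x) = (\nabla^2 f(x))_{IJ}$. $N(\mu, \Sigma)$ refers to the normal distribution
with mean vector $\mu$ and covariance matrix $\Sigma$. For two numbers $a$ and $b$, $a \vee b$ and $a \wedge b$
denote their maximum and minimum, respectively.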
2 Problem Setup
2.1 Cox’s proportional hazards model
Let $T$, $C$ and $\{x(t) \in \mathbb{R}^p : 0 \le t \le \tau\}$ denote the survival time, censoring time and
predictable covariate process, respectively, where $\tau < \infty$ is the study ending time. For each
sample, only one of the survival and censoring times is observed, whichever happens first.
Let $Z = T \wedge C$ be the observed time and $\delta = I\{T \le C\}$ be the censoring indicator. $T$ and
$C$ are assumed to be conditionally independent given the covariates $\{x(t) : 0 \le t \le \tau\}$. The
observed data is an independent and identically distributed (i.i.d.) sample
$\{(\{x_i(t) : 0 \le t \le \tau\}, Z_i, \delta_i)\}_{i \in [n]}$ from the population $(\{x(t) : 0 \le t \le \tau\}, Z, \delta)$, and for simplicity it is
assumed that there are no tied observations and that the covariates are centered.
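As a small illustration of the observation scheme described above, the following sketch builds the observed time (the minimum of survival and censoring times) and the censoring indicator (1 when the failure is observed) from simulated values. The helper name `observe` and the exponential distributions used to draw the times are assumptions made purely for illustration; they are not taken from the paper.

```python
import random

def observe(survival_times, censoring_times):
    """Return (Z, delta): Z_i = min(T_i, C_i), delta_i = 1 if T_i <= C_i else 0."""
    Z = [min(t, c) for t, c in zip(survival_times, censoring_times)]
    delta = [1 if t <= c else 0 for t, c in zip(survival_times, censoring_times)]
    return Z, delta

# Hypothetical simulated data: exponential survival and censoring times.
rng = random.Random(0)
T = [rng.expovariate(1.0) for _ in range(5)]
C = [rng.expovariate(0.5) for _ in range(5)]
Z, delta = observe(T, C)
for z, d in zip(Z, delta):
    print(f"observed time = {z:.3f}, failure observed = {d}")
```

Only `Z` and `delta` (together with the covariates) would be available to the analyst; the latent `T` and `C` are never both observed.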
Cox’s proportional hazards model (Cox, 1972, 1975) is a semi-parametric model widely
used to model time-to-event outcomes. In this model, the conditional hazard function
$\lambda(t \mid x(t))$ of the survival time $T$ at time $t$ given the covariate vector $x(t) \in \mathbb{R}^p$ is given
by
$$\lambda(t \mid x(t)) = \lambda_0(t) \exp(x(t)^\top \beta^\star), \tag{2.1}$$
where $\lambda_0(\cdot)$ is a baseline hazard function and $\beta^\star = (\beta^\star_1, \ldots, \beta^\star_p)^\top \in \mathbb{R}^p$. The function $\lambda_0(\cdot)$ is
unspecified in this semi-parametric model: it is a nuisance function, and $\beta^\star$ is the parameter
vector of interest, which is assumed to be sparse.
Let $X(t) \in \mathbb{R}^{n \times p}$ be the design matrix at time $0 \le t \le \tau$, and $X = \{X(t) : 0 \le t \le \tau\}$.
Let $N$ be the number of failures (satisfying $\delta = 1$) and $t_1 < \cdots < t_N$ be the ordered failure
times. For $j \in [N]$, let $(j)$ denote the label of the sample failing at time $t_j$, i.e., the individual
with the $j$th shortest survival time. The risk set $R_j = \{i : Z_i \ge t_j\}$ at time $t_j$ is the set of samples
still at risk at $t_j$. Cox’s log-partial likelihood is given by
$$Q(\beta; X, Z, \delta) = \sum_{j=1}^{N} \Big\{ x_{(j)}(t_j)^\top \beta - \log \sum_{i \in R_j} \exp(x_i(t_j)^\top \beta) \Big\}.$$
Define the loss $L(\beta; X, Z, \delta) = -n^{-1} Q(\beta; X, Z, \delta)$. Note that $L$ depends on $X$ and $\beta$ only
through the entries of the product $X\beta$. Throughout the paper, we will use the notation
$L(X\beta)$, and its gradient and Hessian matrix will be taken with respect to $\beta$. Keeping the
design matrix $X$ in the notation will be useful as soon as we introduce factor modeling.
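The loss defined above can be evaluated directly from its definition. The sketch below is a minimal, unoptimized illustration that assumes time-independent covariates (each subject has a single covariate vector) and no tied failure times; the function name `cox_loss` and the toy data are hypothetical and not part of the paper.

```python
import math

def cox_loss(beta, X, Z, delta):
    """Negative log-partial likelihood divided by n.

    Assumes time-independent covariates: X[i] is the covariate vector of
    subject i, Z[i] its observed time and delta[i] its censoring indicator.
    """
    n = len(Z)
    # Linear predictors x_i^T beta.
    eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
    Q = 0.0
    for j in range(n):
        if delta[j] == 1:  # only observed failures contribute
            # Risk set: subjects whose observed time is at least Z[j].
            risk = [i for i in range(n) if Z[i] >= Z[j]]
            Q += eta[j] - math.log(sum(math.exp(eta[i]) for i in risk))
    return -Q / n

# Toy data (hypothetical): three subjects, one covariate.
X = [[0.5], [-1.0], [2.0]]
Z = [1.0, 2.0, 3.0]
delta = [1, 0, 1]
print(cox_loss([0.0], X, Z, delta))  # at beta = 0 this equals log(3) / 3
print(cox_loss([0.3], X, Z, delta))
```

At `beta = 0` every subject has the same weight, so each failure contributes the log of its risk-set size; this gives a quick sanity check on any implementation. The double loop costs O(n^2); sorting by observed time would allow cumulative sums instead, which is a design choice left out here for clarity.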
2.2 Counting process formulation
We adopt the counting process formulation of Andersen and Gill (1982). For $i \in [n]$ and
$t \in [0, \tau]$, define the counting process $N_i(t) = I\{Z_i \le t, \delta_i = 1\}$ and the at-risk indicator
process $Y_i(t) = I\{Z_i \ge t\}$. Let $N(t) = \sum_{i=1}^{n} N_i(t)$. Using this notation, the loss $L$ is given
by
$$L(X\beta) = -\frac{1}{n} \sum_{i=1}^{n} \int_0^\tau \{x_i(t)^\top \beta\} \, dN_i(t) + \frac{1}{n} \int_0^\tau \log\Big[ \sum_{i=1}^{n} Y_i(t) \exp(x_i(t)^\top \beta) \Big] \, dN(t).$$
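For $\ell \in \{0, 1, 2\}$, define the following quantities
$$S^{(\ell)}(X, \beta, t) = \frac{1}{n} \sum_{i=1}^{n} Y_i(t) \{x_i(t)\}^{\otimes \ell} \exp(x_i(t)^\top \beta), \qquad s_x^{(\ell)}(\beta, t) = E[S^{(\ell)}(X, \beta, t)].$$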
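To simplify future notation, also define the following
$$V(X, \beta, t) = \frac{S^{(2)}(X, \beta, t)}{S^{(0)}(X, \beta, t)} - \left\{ \frac{S^{(1)}(X, \beta, t)}{S^{(0)}(X, \beta, t)} \right\}^{\otimes 2},$$
$$v_x(\beta, t) = \frac{s_x^{(2)}(\beta, t)}{s_x^{(0)}(\beta, t)} - \left[ \frac{s_x^{(1)}(\beta, t)}{s_x^{(0)}(\beta, t)} \right]^{\otimes 2}. \tag{2.2}$$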
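The empirical quantities above can be computed in a few lines. The sketch below evaluates the zeroth, first and second weighted moments of the covariates over the risk set at a given time, and then the matrix V, which is the covariance of the covariates among subjects still at risk under weights proportional to the exponential of the linear predictor. It assumes time-independent covariates and at least one subject at risk at the chosen time (otherwise the zeroth moment is zero and the ratio is undefined); the function names `s_moments` and `v_matrix` are hypothetical.

```python
import math

def s_moments(X, Z, beta, t):
    """Return (S0, S1, S2): weighted moments of order 0, 1, 2 at time t."""
    n, p = len(X), len(X[0])
    S0 = 0.0
    S1 = [0.0] * p
    S2 = [[0.0] * p for _ in range(p)]
    for x, z in zip(X, Z):
        if z >= t:  # at-risk indicator equals 1
            w = math.exp(sum(bk * xk for bk, xk in zip(beta, x)))
            S0 += w / n
            for a in range(p):
                S1[a] += w * x[a] / n
                for c in range(p):
                    S2[a][c] += w * x[a] * x[c] / n  # outer product x x^T
    return S0, S1, S2

def v_matrix(X, Z, beta, t):
    """Return V = S2 / S0 - (S1 / S0)(S1 / S0)^T as a list of rows."""
    S0, S1, S2 = s_moments(X, Z, beta, t)
    p = len(S1)
    return [[S2[a][c] / S0 - (S1[a] / S0) * (S1[c] / S0) for c in range(p)]
            for a in range(p)]

# Toy data (hypothetical): two subjects, one covariate, beta = 0.
X = [[1.0], [3.0]]
Z = [1.0, 2.0]
print(v_matrix(X, Z, [0.0], 0.5))  # both at risk: variance of {1, 3} is 1.0
print(v_matrix(X, Z, [0.0], 1.5))  # one subject at risk: variance is 0.0
```

Because V is a weighted covariance matrix, it is positive semi-definite, which is what makes the loss convex in the regression coefficients.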