When to encourage using Gaussian regression for feature selection tasks with
time-to-event outcome
Rong Lu, PhD*
*The Quantitative Sciences Unit, Division of Biomedical Informatics Research,
Department of Medicine, Stanford University, Stanford, California
Corresponding Author: Rong Lu, PhD (https://orcid.org/0000-0003-4321-9144);
ronglu@stanford.edu; 3180 Porter Drive, Palo Alto, CA 94304.
Keywords: feature selection, time-to-event outcome, survival analysis, glmnet Cox
Conflicts of Interest: None
Funding Sources: This work is partially supported by the Biostatistics Shared
Resource (BSR) of the NIH-funded Stanford Cancer Institute (P30CA124435) and the
Stanford Center for Clinical and Translational Research and Education
(UL1-TR003142).
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the
study; collection, management, analysis, and interpretation of the data; preparation,
review, or approval of the manuscript; and decision to submit the manuscript for
publication.
KEY POINTS
Question: Which statistical methods should be used for feature selection with respect
to time-to-event outcomes if the sample size is small or some of the key covariates are
not measured?
Findings: Univariate Cox regression is not the best-performing model for feature
selection or effect size ranking if the true models are multivariate Cox regression with
Gaussian covariates and half of the covariates are not measured, regardless of the
correlation strength between features. The regularized Cox regression with λ = λ_1se and
the Gaussian regression of log-transformed survival time with two covariates (the event
indicator plus one feature at a time) are better models for feature selection when the
total number of events is small or modest (<500) and the true models are multivariate Cox
regression with Gaussian covariates.
Meaning: This study demonstrates the importance of including Gaussian regression of
log-transformed survival time in survival analysis when the sample size is small.
ABSTRACT
IMPORTANCE: Feature selection with respect to time-to-event outcomes is one of the
fundamental problems in clinical trials and biomarker discovery studies. However, it is
unclear which statistical methods should be used when the sample size is small or some of
the key covariates are not measured.
DESIGN: In this simulation study, the true models are multivariate Cox proportional
hazards models with 10 covariates. It is assumed that only 5 of the 10 true features are
observed/measured for all model fitting, along with 5 random noise features. Each
sample size scenario is explored using 10,000 simulation datasets. Eight regression
models are applied to each dataset to estimate feature effects, including both
regularized Gaussian regression (elastic net penalty) and regularized Cox regression
(glmnet Cox).
RESULTS: If the covariates are highly correlated Gaussian, the Gaussian regression of
log-transformed survival time with only two covariates outperforms all tested Cox
regression models when the total number of events is below 500.
INTRODUCTION
Feature selection with respect to time-to-event outcomes is one of the fundamental
problems in clinical trials and biomarker discovery studies [1-4]. Many cancer trials use
either the overall survival or the progression-free survival as the primary outcome to
explore or validate the efficacy of new treatments [1-3]. To date, the Cox
proportional hazards model is still the most commonly used method for testing the effect
of intervention in randomized clinical trials [3,5]. But in biomarker discovery studies
that focus on screening large numbers of genetic markers, regularized Cox regression
seems to have become more popular in recent years [6-12]. It is unclear why regularized
Cox regression has gained popularity only in screening large numbers of biomarkers, but not in
analyses of randomized trials. One potential explanation is that many researchers are
choosing statistical methods based on how many features were measured/available for
analysis. Such a motivation is questionable if one believes that we should never assume
that factors that were not measured have no effect on the outcome of interest.
If we want to make inference starting from the assumption that unknown/unmeasured
factors can significantly affect the outcome of interest, it will be very important to study
the best way of choosing statistical methods and compare different methods’
performance under the same assumption. However, this assumption of unmeasured
features seems to be rarely used in developing methodologies within the framework of
multivariate regression models. Most simulation studies assumed that true features
used for generating survival time were all available for feature selection analysis [9-17].
Motivated by this observation, this work will explore and compare the performance of
several regression models in a simulation study of time-to-event outcomes, assuming
that half of the true covariates are not measured for feature selection.
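
The simulation setting described above (10 true covariates, 5 of them observed, plus 5 noise features) can be sketched as follows. This is only a minimal illustration, not the code used in this study: the exponential baseline hazard, the equicorrelated covariate structure, the censoring mechanism, the coefficient values, and the function name `simulate_cox_dataset` are all assumptions made for the example.

```python
import math
import random

def simulate_cox_dataset(n, betas, rho=0.5, baseline_rate=0.1,
                         censor_rate=0.05, n_observed=5, n_noise=5, seed=0):
    """Simulate one dataset from a Cox model with a constant baseline hazard.

    Returns (features, time, event). `features` holds only the first
    `n_observed` true covariates followed by `n_noise` pure-noise columns,
    so the remaining true covariates are unmeasured. All defaults are
    hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    p = len(betas)
    features, time, event = [], [], []
    for _ in range(n):
        # Equicorrelated Gaussian covariates: shared factor plus own noise,
        # giving pairwise correlation rho and unit variance.
        shared = rng.gauss(0.0, 1.0)
        x = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(p)]
        # Inverse-transform sampling under hazard h0 * exp(x . beta):
        # T = -log(U) / (h0 * exp(x . beta)).
        linpred = sum(b * xi for b, xi in zip(betas, x))
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        t_event = -math.log(u) / (baseline_rate * math.exp(linpred))
        # Independent exponential censoring time (an assumption).
        t_censor = rng.expovariate(censor_rate)
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        features.append(x[:n_observed] + noise)
        time.append(min(t_event, t_censor))
        event.append(1 if t_event <= t_censor else 0)
    return features, time, event

# Hypothetical effect sizes for the 10 true covariates.
betas = [0.5, -0.5, 0.4, -0.4, 0.3, -0.3, 0.2, -0.2, 0.1, -0.1]
X, time, event = simulate_cox_dataset(n=200, betas=betas, seed=1)
```

Each row of `X` contains 10 columns: the 5 measured true features followed by 5 noise features. The other 5 true covariates still influence `time` but are withheld from any model fitting.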
In the literature, Gaussian regression methods also seem to be rarely used to study
time-to-event outcomes, despite their wide use in analyzing other data types where
Gaussian model assumptions cannot all hold strictly in practice [18-21]. Many
methodological variants of Cox regression are being developed actively and are
specifically designed for handling application challenges such as time-varying
covariates/coefficients and violation of the proportional hazards assumption [22-26]. While
tailoring Cox regression to different application scenarios remains an active research
topic, little effort seems to be devoted to comparing models that were not originally
proposed for survival analysis with typical survival models. In this simulation study, I am
also interested in testing the performance of a few simple Gaussian models in feature
selection tasks with time-to-event outcomes, by using data generated from multivariate
Cox proportional hazards models.
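
One of the simple Gaussian models considered here, the regression of log-transformed survival time on the event indicator plus one feature at a time, can be sketched as an ordinary least squares fit. The sketch below is a hedged illustration only: it assumes the observed follow-up time is log-transformed for every subject (censored or not), ranks features by the absolute t-statistic of the feature coefficient, and uses hypothetical helper names (`solve_linear`, `two_covariate_fit`, `rank_features`) and toy data that are not taken from this study.

```python
import math
import random

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def two_covariate_fit(time, event, feature):
    """OLS of log(time) on intercept, event indicator, and one feature.

    Returns (coefficient, t_statistic) for the feature. Requires both
    censored and uncensored subjects, otherwise the event indicator is
    collinear with the intercept.
    """
    y = [math.log(t) for t in time]
    rows = [[1.0, float(d), float(x)] for d, x in zip(event, feature)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve_linear(xtx, xty)
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(rows, y)]
    sigma2 = sum(e * e for e in resid) / (len(y) - k)
    # Var(feature coefficient) = sigma^2 * [(X'X)^-1]_{33}; solving against
    # the third unit vector yields the third column of the inverse.
    inv_col = solve_linear(xtx, [0.0, 0.0, 1.0])
    se = math.sqrt(sigma2 * inv_col[2])
    return coef[2], coef[2] / se

def rank_features(time, event, features):
    """Fit one feature at a time and sort (index, coef, t) by |t|, largest first."""
    stats = []
    for j in range(len(features[0])):
        column = [row[j] for row in features]
        coef, t_stat = two_covariate_fit(time, event, column)
        stats.append((j, coef, t_stat))
    return sorted(stats, key=lambda s: abs(s[2]), reverse=True)

# Toy data (hypothetical, not Cox-generated): feature 0 shortens log-time,
# feature 1 is pure noise.
rng = random.Random(0)
features = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
event = [1 if rng.random() < 0.7 else 0 for _ in range(200)]
time = [math.exp(1.0 + 0.5 * d - 2.0 * row[0] + rng.gauss(0, 0.1))
        for d, row in zip(event, features)]
ranking = rank_features(time, event, features)
```

Because each fit involves only three parameters, the model can be estimated even when the number of events is small, and the ordering in `ranking` gives one possible way to rank features by estimated effect.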
Many researchers also prefer to first fit univariate Cox regression models or perform
a literature review to select features that might have a significant association with
time-to-event outcomes, before including only the selected features in a multivariate
Cox regression to estimate effect sizes [27-31]. This strategy
makes intuitive sense only if the selection performed by univariate Cox regression or
literature review can significantly increase the likelihood of identifying all features with