When to encourage using Gaussian regression for feature selection tasks with time-to-event outcome Rong Lu PhD


When to encourage using Gaussian regression for feature selection tasks with

time-to-event outcome

Rong Lu, PhD*

*The Quantitative Sciences Unit, Division of Biomedical Informatics Research,

Department of Medicine, Stanford University, Stanford, California

Corresponding Author: Rong Lu, PhD (https://orcid.org/0000-0003-4321-9144);

ronglu@stanford.edu; 3180 Porter Drive, Palo Alto, CA 94304.

Keywords: feature selection, time-to-event outcome, survival analysis, glmnet Cox

Conflicts of Interest: None

Funding Sources: This work is partially supported by the Biostatistics Shared

Resource (BSR) of the NIH-funded Stanford Cancer Institute (P30CA124435) and the

Stanford Center for Clinical and Translational Research and Education (UL1-

TR003142).

Role of the Funder/Sponsor: The funders had no role in the design and conduct of the

study; collection, management, analysis, and interpretation of the data; preparation,

review, or approval of the manuscript; and decision to submit the manuscript for

publication.

KEY POINTS

Question: Which statistical methods should be used for feature selection with respect

to time-to-event outcomes if the sample size is small or some of the key covariates are

not measured?

Findings: Univariate Cox regression is not the best-performing model for feature

selection or effect size ranking if the true models are multivariate Cox regression with

Gaussian covariates and half of the covariates are not measured, regardless of the

correlation strength between features. The regularized Cox regression with λ = λ_1se and

the Gaussian regression of log-transformed survival time with two covariates (the event

indicator plus one feature at a time) are better models for feature selection when the total

number of events is small/modest (<500) and the true models are multivariate Cox

regression with Gaussian covariates.

Meaning: This study demonstrates the importance of including Gaussian regression of

log-transformed survival time in survival analysis when the sample size is small.

ABSTRACT

IMPORTANCE: Feature selection with respect to time-to-event outcomes is one of the

fundamental problems in clinical trials and biomarker discovery studies. However, it is unclear

which statistical methods should be used when the sample size is small or some of the key

covariates are not measured.

DESIGN: In this simulation study, the true models are multivariate Cox proportional

hazards models with 10 covariates. It is assumed that only 5 out of the 10 true features are

observed/measured for all model fitting, along with 5 random noise features. Each

sample size scenario is explored using 10,000 simulation datasets. Eight regression

models are applied to each dataset to estimate feature effects, including both

regularized Gaussian regression (elastic net penalty) and regularized Cox regression

(glmnet Cox).
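The data-generating setup described in DESIGN can be sketched as follows. This is a minimal illustration, not the study's actual simulation code: the constant baseline hazard, the equicorrelated covariance structure, the coefficient values, the exponential censoring mechanism, and the function name `simulate_cox_dataset` are all assumptions introduced here, since this section does not specify them.

```python
import numpy as np

def simulate_cox_dataset(n, rho=0.5, baseline_hazard=0.1, censor_rate=0.05, seed=0):
    """Simulate one dataset from a Cox model with 10 true Gaussian covariates.

    Only 5 of the 10 true covariates are returned as observed, together with
    5 pure-noise features. All numeric settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n_true = 10

    # Equicorrelated Gaussian covariates (assumed correlation structure).
    cov = np.full((n_true, n_true), rho)
    np.fill_diagonal(cov, 1.0)
    x_true = rng.multivariate_normal(np.zeros(n_true), cov, size=n)

    # Hypothetical log hazard ratios for the 10 true covariates.
    beta = np.linspace(0.1, 1.0, n_true)

    # Under a constant baseline hazard h0, the event time is an Exp(1) draw
    # divided by the subject-specific hazard h0 * exp(x'beta).
    event_time = rng.exponential(1.0, size=n) / (baseline_hazard * np.exp(x_true @ beta))

    # Independent exponential censoring (assumed mechanism).
    censor_time = rng.exponential(1.0 / censor_rate, size=n)
    time = np.minimum(event_time, censor_time)
    event = (event_time <= censor_time).astype(int)

    # Keep the first 5 true covariates and append 5 random noise features.
    noise = rng.standard_normal((n, 5))
    x_obs = np.hstack([x_true[:, :5], noise])
    return time, event, x_obs
```

Calling `simulate_cox_dataset(200, seed=1)` returns the observed time, the event indicator, and a 200 x 10 feature matrix whose last 5 columns are noise. Repeating the call with different seeds would give the repeated datasets used for each sample size scenario.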

RESULTS: If the covariates are highly correlated Gaussian, the Gaussian regression of

log-transformed survival time with only two covariates outperforms all tested Cox

regression models when the total number of events is below 500.

INTRODUCTION

Feature selection with respect to time-to-event outcomes is one of the fundamental

problems in clinical trials and biomarker discovery studies [1-4]. Many cancer trials use

either the overall survival or the progression-free survival as the primary outcome to

explore or validate the efficacy of new treatments [1-3]. To date, the Cox

proportional hazards model is still the most commonly used method for testing the effect of

intervention in randomized clinical trials [3,5]. But in biomarker discovery studies that

focus on screening large numbers of genetic markers, the regularized Cox regression

seems to be more popular in recent years [6-12]. It is unclear why regularized Cox

regression only gained popularity in screening large numbers of biomarkers, but not in

analyses of randomized trials. One potential explanation is that many researchers are

choosing statistical methods based on how many features were measured/available for

analysis. The appropriateness of such motivation might be questionable if one believes

that we should never assume anything that’s not measured must have no effect on the

outcome of interest.

If we want to make inference starting from the assumption that unknown/unmeasured

factors can significantly affect the outcome of interest, it will be very important to study

the best way of choosing statistical methods and compare different methods’

performance under the same assumption. However, this assumption of unmeasured

features seems to be rarely used in developing methodologies within the framework of

multivariate regression models. Most simulation studies assumed that true features

used for generating survival time were all available for feature selection analysis [9-17].

Motivated by this observation, this work will explore and compare the performance of

several regression models in a simulation study of time-to-event outcomes, assuming

that half of the true covariates are not measured for feature selection.

In the literature, it also seems that Gaussian regression methods are rarely used to study

time-to-event outcomes, despite their wide usage in analyzing other data types where

Gaussian models’ assumptions cannot all hold strictly in practice [18-21]. Many

methodological variants of Cox regression are being developed actively and are

specifically designed for handling application challenges such as time-varying

covariates/coefficients and violation of proportional hazards assumption [22-26]. While

tailoring Cox regression to different application scenarios remains a very hot research

topic, not much effort seems to be devoted to comparing models that were not originally

proposed for survival analysis with typical survival models. In this simulation study, I am

also interested in testing the performance of a few simple Gaussian models in feature

selection tasks with time-to-event outcomes, by using data generated from multivariate

Cox proportional hazards models.
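As a rough illustration of the simplest Gaussian model mentioned in the key points (log-transformed survival time regressed on the event indicator plus one feature at a time), the sketch below fits one ordinary least squares model per feature. The function name `log_time_ols_per_feature`, the use of t-statistics for ranking, and the toy data are assumptions made for this example; the text above does not state which statistic is used for ranking.

```python
import numpy as np

def log_time_ols_per_feature(time, event, x):
    """For each column j of x, fit log(time) ~ 1 + event + x[:, j] by OLS.

    Returns the feature coefficient and its t-statistic for every column.
    Assumes the event indicator is not constant, so the design has full rank.
    """
    y = np.log(time)
    n, p = x.shape
    coefs = np.empty(p)
    t_stats = np.empty(p)
    for j in range(p):
        # Two covariates (event indicator and one feature) plus an intercept.
        design = np.column_stack([np.ones(n), event, x[:, j]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        sigma2 = resid @ resid / (n - design.shape[1])
        cov_beta = sigma2 * np.linalg.inv(design.T @ design)
        coefs[j] = beta[2]
        t_stats[j] = beta[2] / np.sqrt(cov_beta[2, 2])
    return coefs, t_stats

# Toy data: only column 0 affects the hazard (hypothetical log hazard ratio 1.0).
rng = np.random.default_rng(0)
n = 300
x = rng.standard_normal((n, 4))
event_time = rng.exponential(1.0, size=n) / np.exp(1.0 * x[:, 0])
censor_time = rng.exponential(2.0, size=n)
time = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(int)

coefs, t_stats = log_time_ols_per_feature(time, event, x)
# Rank features from strongest to weakest absolute t-statistic.
ranking = np.argsort(-np.abs(t_stats))
```

A feature that raises the hazard shortens survival time, so its coefficient on the log-time scale is expected to be negative. Ranking by absolute t-statistic is one possible choice; ranking by p-value or by standardized coefficient would be equally reasonable under this sketch.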

It is also observed that many researchers prefer to first fit univariate Cox regression

models or perform a literature review to select features that might have significant

association with time-to-event outcomes, before including only those selected features

in fitting multivariate Cox regression to estimate effect sizes [27-31]. This strategy

makes intuitive sense only if the selection performed by univariate Cox regression or

literature review can significantly increase the likelihood of identifying all features with
