Predicting Survival Outcomes in the Presence of Unlabeled Data

Fateme Nateghi Haredasht (a,b,*), Celine Vens (a,b)
aKU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, 8500 Kortrijk,
Belgium
bITEC - imec and KU Leuven, Etienne Sabbelaan 51, 8500 Kortrijk, Belgium
Abstract
Many clinical studies require the follow-up of patients over time. This is challenging: apart from frequently
observed drop-out, organizational and financial constraints can lead to reduced data collection and, in turn,
complicate subsequent analyses. In contrast, there is often plenty of baseline data available from patients
with similar characteristics and background information, e.g., from patients who fall outside the study time
window. In this article, we investigate whether we can benefit from the inclusion of such unlabeled data
instances to predict accurate survival times. In other words, we introduce a third level of supervision in the
context of survival analysis: apart from fully observed and censored instances, we also include unlabeled
instances.
also include unlabeled instances. We propose three approaches to deal with this novel setting and provide
an empirical comparison over fifteen real-life clinical and gene expression survival datasets. Our results
demonstrate that all approaches are able to increase the predictive performance over independent test data.
We also show that integrating the partial supervision provided by censored data in a semi-supervised wrapper
approach generally provides the best results, often achieving substantial improvements compared to not using
unlabeled data.
Keywords: Survival analysis, Semi-supervised learning, Random survival forest, Self-training
1. Introduction
Many clinical studies require following subjects over time and measuring the time until a certain event is
experienced (e.g., death, progression, or hospital discharge). The resulting collected datasets are typically
analyzed with survival analysis techniques. Survival analysis is a branch of statistics that analyzes the
expected duration until an event of interest occurs [1]. Censoring is an essential concept in survival analysis
which makes it challenging compared to other analytical methods. Censoring can occur due to various
reasons, such as drop-out, and means that the observed time is different from the actual event time. In the
case of right censoring, for instance, we know that the actual event time is greater than the observed time [2].

* Corresponding author
Email addresses: fateme.nateghi@kuleuven.be (Fateme Nateghi Haredasht), celine.vens@kuleuven.be (Celine Vens)

arXiv:2210.13891v1 [cs.LG] 25 Oct 2022
Traditional survival analysis methods include the Cox Proportional Hazards model (CPH) [3]. CPH is
essentially a linear regression model that simultaneously estimates the effect of several risk factors on survival
time. However, these standard survival models encounter some challenges when it comes to real-world
datasets. For instance, they cannot easily capture nonlinear relationships between the covariates. In addition,
in many applications, the presence of high-dimensional data is quite common, e.g., gene expression data;
however, these traditional methods are not able to efficiently deal with such high-dimensional data. As a
result, machine learning-based techniques have become increasingly popular in the survival analysis context
over recent years [4]. Applying machine learning methods directly to censored data is challenging since the
value of a measurement or observation is only partially known. Several studies have successfully modified
machine learning algorithms to make use of censored information in survival analysis, e.g., decision trees
[5], artificial neural networks (ANN) [6], and support vector machines (SVM) [7] to name a few. Popular
ensemble-based frameworks include bagging survival trees [8] and random survival forests [9]. Also, more
advanced learning tasks such as active learning [10] and transfer learning [11] have been extended toward
survival analysis.
Long-term follow-up of patients is often expensive, in terms of time, effort, and money. As a result, the
number of subjects that are included in a study and followed in time is often limited. However, many more
subjects may exist (e.g., through retrospective data collection) that meet the inclusion/exclusion criteria of
the follow-up study. If the study aims to predict outcomes based on variables collected at baseline, then we
hypothesize that these extra (unlabeled) data points might actually boost the predictive performance of the
resulting model, if used wisely. This corresponds to a semi-supervised learning set-up [12], which deals with
scenarios where only a small part of the instances in the training data have an outcome label attached, but
the rest is unlabeled. To our knowledge, such a semi-supervised learning set-up has never been investigated
in the context of survival analysis, and with this article, we aim to fill this gap.
Including unlabeled instances in a survival analysis task leads to three distinct subsets of data, that differ
in the amount of supervised information they contain: a set of (1) fully observed, (2) partially observed
(censored), and (3) unobserved data points. Our goal is to look at these three subsets of data altogether.
In particular, we address two research questions: (1) can the predictive performance over an independent
test set be increased by including unlabeled instances (i.e., does the semi-supervised learning setting carry
over to the survival analysis context)?, and (2) what is the best approach to integrate the 3 subsets of data
in the analysis? To address this second question, we propose and compare three different approaches. The
first approach is to treat the unlabeled instances as censored with the censoring time equal to zero and
apply a machine learning-based survival analysis technique. For the second approach, we apply a standard
semi-supervised learning approach. In particular, we use the widely used self-training wrapper technique
[13, 14]. This technique first builds a classifier over the labeled (in our case, observed and censored) data
points and iteratively augments the labeled set with highly confident predictions over the unlabeled dataset.
In the third approach, we propose an adaptation of the second one, in which we initially add the censored
instances to the unlabeled set, and exploit the censored information in the data augmentation process, to
decide how many instances to add to the labeled set in each iteration. In all three approaches, we use random
survival forests as base learner [9]. In order to answer the research questions, we apply and compare the
approaches using fifteen real-life datasets from the healthcare domain.
The remainder of this article is organized as follows. Section 2 introduces the background and reviews some
concepts of the employed models including random survival forest and self-training approaches. Section 3
describes related work. In section 4, the three proposed approaches are introduced, two of which are self-
training-based frameworks that cope with survival data. Section 5 presents the experimental set-up,
including dataset description, unlabeled data generation, performance evaluation, and comparison methods
and parameter instantiation. Results are presented in section 6. Conclusions are drawn in section 7.
2. Background
In this section, we first review some concepts of using machine learning methods for survival analysis.
Afterward, we explain the self-training technique and how one can apply it to a survival analysis problem.
2.1. Random survival forest
Random survival forests are well-known ensemble-based learning models that have been widely used in
many survival applications and have been shown to be superior to traditional survival models [15]. Random
survival forest (RSF) [9] is quite close to the original Random Forest by Breiman [16]. The random forest
algorithm makes a prediction based on tree-structured models. Similar to the random forest, RSF combines
bootstrapping, tree building, and prediction aggregating. However, in the splitting criterion to grow a tree
and in the predictions returned in the leaf nodes, RSF explicitly considers survival time and censoring
information. RSF has three main steps. As the first step, it draws B bootstrap samples from the original
data. In the second step, for each bootstrap sample, a survival tree is grown. At each node of a tree,
p candidate variables are randomly selected, where p is a parameter, often defined as a proportion of the
original number of variables. The task is to split the node into two child nodes using the best candidate
variable and split point, as determined by the log-rank test [17]. The best split is the one that maximizes
survival differences between the two child nodes. Growing the obtained tree structure is continued until a
stop criterion holds (e.g., until the number of observed instances in the terminal nodes drops below a specified
value). In the last step, the cumulative hazard function (CHF) associated with each terminal node in a tree
is calculated by the Nelson-Aalen estimator, which is a non-parametric estimator of the CHF [18]. All cases
[Figure 1 depicts the self-training loop as a cycle of boxes: train a base model using labeled data; make
predictions for unlabeled data; find the most confident predictions; add the corresponding observations,
together with their predictions, to the labeled data; stop when the stopping criterion is met.]

Figure 1: Self-training framework. The framework takes a set of labeled and unlabeled data instances as input and starts in
the top left box.
within the same terminal node have the same CHF. The ensemble CHF is constructed as the average over
the CHF of the B survival trees.
Note that the survival function and the cumulative hazard function are linked as follows [19]:

S(t) = exp(-H(t))
where H(t) and S(t) denote the cumulative hazard function and the survival function, respectively.
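The Nelson-Aalen estimator and the link above can be illustrated with a short sketch. This is a minimal pure-Python implementation written for illustration (not taken from the paper): H(t) is accumulated as the sum of d_i / n_i over event times, where d_i is the number of events and n_i the number at risk at time t_i, and the survival function is then recovered via S(t) = exp(-H(t)).

```python
from math import exp

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard H(t).

    times  : observed times (event or censoring time per subject)
    events : 1 if the event occurred, 0 if right-censored
    Returns a list of (t, H(t)) pairs at the distinct event times.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    H, curve = 0.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = removed = 0
        # count events (d) and total removals at this time point
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            removed += 1
            i += 1
        if d > 0:
            H += d / n_at_risk  # increment: d_i / n_i
            curve.append((t, H))
        n_at_risk -= removed
    return curve

def survival_from_hazard(H_t):
    """Recover S(t) from the cumulative hazard via S(t) = exp(-H(t))."""
    return exp(-H_t)
```

For example, with times [1, 2, 2, 3] and events [1, 1, 0, 1], the increments are 1/4 at t=1, 1/3 at t=2 (one of the two subjects at t=2 is censored), and 1/1 at t=3.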
2.2. Self-training method
The semi-supervised learning (SSL) paradigm is a combination of supervised and unsupervised learning
and has been widely used in many applications such as healthcare [20, 21, 22]. The primary goal of SSL
methods is to take advantage of the unlabeled data in addition to the labeled data, in order to obtain a
better prediction model. The acquisition of labeled data is usually expensive, time-consuming, and often
difficult, specifically when it comes to healthcare and follow-up data. Hence, achieving good performance
with supervised techniques is challenging, since the number of labeled instances is often too small. Over
the years, many SSL techniques have been proposed [23, 24]. In this article, we will focus on self-training
(sometimes also called self-learning) [13], one of the most widely used algorithms for SSL. Self-training has
been used in different approaches like deep neural networks [25], face recognition [26], and parsing [27]. This
framework overcomes the issue of insufficient labeled data by augmenting the training set with unlabeled
instances. It starts with training a model using a base learner on the labeled set and then augments this
set with the predictions for the unlabeled instances that the model is most confident in (see Figure 1).
This procedure is repeated until a certain stopping criterion is met. This stopping criterion, the number
of instances to augment in each iteration, and the definition of confidence are instantiated according to the
problem at hand.
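The loop described above can be sketched as follows. This is an illustrative skeleton, not the paper's exact procedure: the base learner here is a toy 1-nearest-neighbour predictor on scalar inputs whose confidence is the (negated) distance to the closest labeled point, and the stopping criterion is simply an exhausted unlabeled pool or an iteration cap.

```python
def self_train(labeled, unlabeled, n_add=1, max_iter=10):
    """Generic self-training loop (toy instantiation for illustration).

    labeled   : list of (x, y) pairs, x a float
    unlabeled : list of floats
    n_add     : number of confident predictions added per iteration
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(max_iter):
        if not unlabeled:  # stopping criterion: pool exhausted
            break
        # "Train" and predict: label each unlabeled x with the y of its
        # nearest labeled neighbour; smaller distance = higher confidence.
        scored = []
        for x in unlabeled:
            xl, yl = min(labeled, key=lambda p: abs(p[0] - x))
            scored.append((abs(xl - x), x, yl))
        scored.sort()
        # Augment the labeled set with the n_add most confident predictions
        for _dist, x, y in scored[:n_add]:
            labeled.append((x, y))
            unlabeled.remove(x)
    return labeled
```

The three design choices instantiated here (confidence measure, batch size per iteration, stopping rule) are exactly the ones the text notes must be adapted to the problem at hand.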
3. Related work
Semi-supervised learning (SSL) methods have been applied in many different domains [23, 24]. However,
few efforts have been made in order to generalize SSL algorithms to be suitable for survival analysis.
Bair and Tibshirani [28] combine supervised and unsupervised learning to predict survival times for cancer
patients. They first employ a supervised approach to select a subset of genes from a gene expression dataset
that correlates with survival. Then, unsupervised clustering is applied to these gene subsets to identify
cancer subtypes. Once such subtypes are identified, they apply again supervised learning techniques to
classify future patients into the appropriate subgroup or to predict their survival. Although the authors call
the resulting approach semi-supervised, their setting is clearly different from ours.
There has also been some work that models a survival analysis task as a semi-supervised learning problem
by employing a self-training strategy to predict event times from observed and censored data points. Both
[29, 30] treat the censored data points as unlabeled, thereby ignoring the time-to-event information that they
contain. Liang et al. [31] do use some information from the censored times, in the sense that they disregard
data points for which the model predicts a value lower than the right-censored time points. They combine
Cox proportional hazard (Cox) and accelerated failure time (AFT) model in a semi-supervised set-up to
predict the treatment risk and the survival time of cancer patients. Regularization is used for gene selection,
which is an essential task in cancer survival analysis. The authors found that many censored data points
persistently violate the constraint that the predicted survival time should be higher than the censored time,
restricting the full exploitation of the censored data. Therefore, in follow-up work [32], they embedded a
self-paced learning mechanism in their framework to gradually introduce more complex data samples in the
training process, leading to a more accurate estimation for the censored samples. An important difference
between our work and the discussed studies is that we consider situations where apart from fully observed
and censored instances, we also have a third category, namely extra data points that are unlabeled. To our
knowledge, this is the first study to investigate the use of unlabeled instances in the survival context.
4. Methodology
In order to predict event times in the presence of observed, censored, and unlabeled instances, we propose
three approaches.
The first approach is a straightforward application of a survival analysis method (in our case, RSF), in which
we add the unlabeled set as censored instances, with the corresponding event time set to zero. We call the
first approach random survival forest with unlabeled data (RSF+UD). Figure 2 depicts the block diagram
of the first proposed pipeline.
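As a sketch of this construction (a hypothetical helper with an assumed record layout, not the paper's code), the three subsets can be merged into one survival dataset in which each record carries a time and an event indicator, so that a survival learner such as RSF can consume all instances at once:

```python
def build_rsf_ud_dataset(observed, censored, unlabeled):
    """Merge the three subsets into one survival dataset.

    observed  : list of (features, event_time) with the event observed
    censored  : list of (features, censoring_time)
    unlabeled : list of features only
    Returns records (features, time, event), event = 1 iff observed.
    """
    data = [(x, t, 1) for x, t in observed]       # fully observed events
    data += [(x, t, 0) for x, t in censored]      # right-censored at time t
    data += [(x, 0.0, 0) for x in unlabeled]      # unlabeled: censored at t = 0
    return data
```

Encoding an unlabeled instance as censored at time zero conveys only the trivial information that its event time is greater than zero, which is precisely the intent of the RSF+UD baseline.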
In the second approach, we apply a semi-supervised learning approach called self-trained random survival forest.