Predicting Survival Outcomes in the Presence of Unlabeled Data

Fateme Nateghi Haredasht (a,b,*), Celine Vens (a,b)
aKU Leuven, Campus KULAK - Department of Public Health and Primary Care, Etienne Sabbelaan 53, 8500 Kortrijk,
Belgium
bITEC - imec and KU Leuven, Etienne Sabbelaan 51, 8500 Kortrijk, Belgium
Abstract
Many clinical studies require the follow-up of patients over time. This is challenging: apart from frequently
observed drop-out, organizational and financial constraints can lead to reduced data collection and, in turn,
complicate subsequent analyses. In contrast, there is often plenty of baseline data available from patients
with similar characteristics and background information, e.g., from patients who fall outside the study time
window. In this article, we investigate whether we can benefit from the inclusion of such unlabeled data
instances to predict accurate survival times. In other words, we introduce a third level of supervision in the
context of survival analysis: apart from fully observed and censored instances, we also include unlabeled
instances.
also include unlabeled instances. We propose three approaches to deal with this novel setting and provide
an empirical comparison over fifteen real-life clinical and gene expression survival datasets. Our results
demonstrate that all approaches are able to increase the predictive performance over independent test data.
We also show that integrating the partial supervision provided by censored data in a semi-supervised wrapper
approach generally provides the best results, often achieving substantial improvements compared to not using
unlabeled data.
Keywords: Survival analysis, Semi-supervised learning, Random survival forest, Self-training
1. Introduction
Many clinical studies require following subjects over time and measuring the time until a certain event is
experienced (e.g., death, progression, or hospital discharge). The resulting collected datasets are typically
analyzed with survival analysis techniques. Survival analysis is a branch of statistics that analyzes the
expected duration until an event of interest occurs [1]. Censoring is an essential concept in survival analysis
which makes it challenging compared to other analytical methods. Censoring can occur due to various
reasons, such as drop-out, and means that the observed time is different from the actual event time. In the
case of right censoring, for instance, we know that the actual event time is greater than the observed time [2].

* Corresponding author
Email addresses: fateme.nateghi@kuleuven.be (Fateme Nateghi Haredasht), celine.vens@kuleuven.be (Celine Vens)

arXiv:2210.13891v1 [cs.LG] 25 Oct 2022
Traditional survival analysis methods include the Cox Proportional Hazards model (CPH) [3]. CPH is
essentially a linear regression model that simultaneously estimates the effect of several risk factors on survival
time. However, these standard survival models encounter some challenges when it comes to real-world
datasets. For instance, they cannot easily capture nonlinear relationships between the covariates. In addition,
in many applications, the presence of high-dimensional data is quite common, e.g., gene expression data;
however, these traditional methods are not able to efficiently deal with such high-dimensional data. As a
result, machine learning-based techniques have become increasingly popular in the survival analysis context
over recent years [4]. Applying machine learning methods directly to censored data is challenging since the
value of a measurement or observation is only partially known. Several studies have successfully modified
machine learning algorithms to make use of censored information in survival analysis, e.g., decision trees
[5], artificial neural networks (ANN) [6], and support vector machines (SVM) [7] to name a few. Popular
ensemble-based frameworks include bagging survival trees [8] and random survival forests [9]. Also, more
advanced learning tasks such as active learning [10] and transfer learning [11] have been extended toward
survival analysis.
Long-term follow-up of patients is often expensive, in terms of time, effort, and money. As a result, the
number of subjects that are included in a study and followed in time is often limited. However, many more
subjects may exist (e.g., through retrospective data collection) that meet the inclusion/exclusion criteria of
the follow-up study. If the study aims to predict outcomes based on variables collected at baseline, then we
hypothesize that these extra (unlabeled) data points might actually boost the predictive performance of the
resulting model, if used wisely. This corresponds to a semi-supervised learning set-up [12], which deals with
scenarios where only a small part of the instances in the training data have an outcome label attached, but
the rest is unlabeled. To our knowledge, such a semi-supervised learning set-up has never been investigated
in the context of survival analysis, and with this article, we aim to fill this gap.
Including unlabeled instances in a survival analysis task leads to three distinct subsets of data, that differ
in the amount of supervised information they contain: a set of (1) fully observed, (2) partially observed
(censored), and (3) unobserved data points. Our goal is to look at these three subsets of data altogether.
In particular, we address two research questions: (1) can the predictive performance over an independent
test set be increased by including unlabeled instances (i.e., does the semi-supervised learning setting carry
over to the survival analysis context)?, and (2) what is the best approach to integrate the 3 subsets of data
in the analysis? To address this second question, we propose and compare three different approaches. The
first approach is to treat the unlabeled instances as censored with the censoring time equal to zero and
apply a machine learning-based survival analysis technique. For the second approach, we apply a standard
semi-supervised learning approach. In particular, we use the widely used self-training wrapper technique
[13, 14]. This technique first builds a classifier over the labeled (in our case, observed and censored) data
points and iteratively augments the labeled set with highly confident predictions over the unlabeled dataset.
In the third approach, we propose an adaptation of the second one, in which we initially add the censored
instances to the unlabeled set, and exploit the censored information in the data augmentation process, to
decide how many instances to add to the labeled set in each iteration. In all three approaches, we use random
survival forests as base learner [9]. In order to answer the research questions, we apply and compare the
approaches using fifteen real-life datasets from the healthcare domain.
The remainder of this article is organized as follows. Section 2 introduces the background and reviews some
concepts of the employed models including random survival forest and self-training approaches. Section 3
describes related work. In section 4, the three proposed approaches are introduced, two of which are self-
training-based frameworks that cope with survival data. Section 5 presents the experimental set-up,
including dataset description, unlabeled data generation, performance evaluation, and comparison methods
and parameter instantiation. Results are presented in section 6. Conclusions are drawn in section 7.
2. Background
In this section, we first review some concepts of using machine learning methods for survival analysis.
Afterward, we explain the self-training technique and how one can apply it to a survival analysis problem.
2.1. Random survival forest
Random survival forests are well-known ensemble-based learning models that have been widely used in
many survival applications and have been shown to be superior to traditional survival models [15]. Random
survival forest (RSF) [9] is quite close to the original Random Forest by Breiman [16]. The random forest
algorithm makes a prediction based on tree-structured models. Similar to the random forest, RSF combines
bootstrapping, tree building, and prediction aggregating. However, in the splitting criterion to grow a tree
and in the predictions returned in the leaf nodes, RSF explicitly considers survival time and censoring
information. RSF has three main steps. As the first step, it draws B bootstrap samples from the original
data. In the second step, for each bootstrap sample, a survival tree is grown. At each node of a tree,
p candidate variables are randomly selected, where p is a parameter, often defined as a proportion of the
original number of variables. The task is to split the node into two child nodes using the best candidate
variable and split point, as determined by the log-rank test [17]. The best split is the one that maximizes
survival differences between the two child nodes. Growing the obtained tree structure is continued until a
stop criterion holds (e.g., until the number of observed instances in the terminal nodes drops below a specified
value). In the last step, the cumulative hazard function (CHF) associated with each terminal node in a tree
is calculated by the Nelson-Aalen estimator, which is a non-parametric estimator of the CHF [18]. All cases
[Figure 1 depicts the self-training loop as a cycle of boxes: train a base model using labeled data; make
predictions for unlabeled data; find the most confident predictions; add the corresponding observations,
together with their predictions, to the labeled data; stop when the stopping criterion is met.]

Figure 1: Self-training framework. The framework takes a set of labeled and unlabeled data instances as input and starts in
the top left box.
within the same terminal node have the same CHF. The ensemble CHF is constructed as the average over
the CHF of the B survival trees.
Note that the survival function and the cumulative hazard function are linked as follows [19]:

S(t) = exp(-H(t))
where H(t) and S(t) denote the cumulative hazard function and the survival function, respectively.
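The Nelson-Aalen estimator and the link above can be illustrated with a short sketch. This is a minimal pure-Python implementation written for illustration (not taken from the paper): H(t) is accumulated as the sum of d_i / n_i over event times, where d_i is the number of events and n_i the number at risk at time t_i, and the survival function is then recovered via S(t) = exp(-H(t)).

```python
from math import exp

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard H(t).

    times  : observed times (event or censoring time per subject)
    events : 1 if the event occurred, 0 if right-censored
    Returns a list of (t, H(t)) pairs at the distinct event times.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    H, curve = 0.0, []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = removed = 0
        # count events (d) and total removals at this time point
        while i < len(data) and data[i][0] == t:
            d += data[i][1]
            removed += 1
            i += 1
        if d > 0:
            H += d / n_at_risk  # increment: d_i / n_i
            curve.append((t, H))
        n_at_risk -= removed
    return curve

def survival_from_hazard(H_t):
    """Recover S(t) from the cumulative hazard via S(t) = exp(-H(t))."""
    return exp(-H_t)
```

For example, with times [1, 2, 2, 3] and events [1, 1, 0, 1], the increments are 1/4 at t=1, 1/3 at t=2 (one of the two subjects at t=2 is censored), and 1/1 at t=3.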
2.2. Self-training method
The semi-supervised learning (SSL) paradigm is a combination of supervised and unsupervised learning
and has been widely used in many applications such as healthcare [20, 21, 22]. The primary goal of SSL
methods is to take advantage of the unlabeled data in addition to the labeled data, in order to obtain a
better prediction model. The acquisition of labeled data is usually expensive, time-consuming, and often
difficult, specifically when it comes to healthcare and follow-up data. Hence, achieving good performance
with supervised techniques is challenging, since the number of labeled instances is often too small. Over
the years, many SSL techniques have been proposed [23, 24]. In this article, we will focus on self-training
(sometimes also called self-learning) [13], one of the most widely used algorithms for SSL. Self-training has
been used in different approaches like deep neural networks [25], face recognition [26], and parsing [27]. This
framework overcomes the issue of insufficient labeled data by augmenting the training set with unlabeled
instances. It starts with training a model using a base learner on the labeled set and then augments this
set with the predictions for the unlabeled instances that the model is most confident in (see Figure 1).
This procedure is repeated until a certain stopping criterion is met. This stopping criterion, the number
of instances to augment in each iteration, and the definition of confidence are instantiated according to the
problem at hand.
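The loop described above can be sketched as follows. This is an illustrative skeleton, not the paper's exact procedure: the base learner here is a toy 1-nearest-neighbour predictor on scalar inputs whose confidence is the (negated) distance to the closest labeled point, and the stopping criterion is simply an exhausted unlabeled pool or an iteration cap.

```python
def self_train(labeled, unlabeled, n_add=1, max_iter=10):
    """Generic self-training loop (toy instantiation for illustration).

    labeled   : list of (x, y) pairs, x a float
    unlabeled : list of floats
    n_add     : number of confident predictions added per iteration
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(max_iter):
        if not unlabeled:  # stopping criterion: pool exhausted
            break
        # "Train" and predict: label each unlabeled x with the y of its
        # nearest labeled neighbour; smaller distance = higher confidence.
        scored = []
        for x in unlabeled:
            xl, yl = min(labeled, key=lambda p: abs(p[0] - x))
            scored.append((abs(xl - x), x, yl))
        scored.sort()
        # Augment the labeled set with the n_add most confident predictions
        for _dist, x, y in scored[:n_add]:
            labeled.append((x, y))
            unlabeled.remove(x)
    return labeled
```

The three design choices instantiated here (confidence measure, batch size per iteration, stopping rule) are exactly the ones the text notes must be adapted to the problem at hand.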
3. Related work
Semi-supervised learning (SSL) methods have been applied in many different domains [23, 24]. However,
few efforts have been made in order to generalize SSL algorithms to be suitable for survival analysis.
Bair and Tibshirani [28] combine supervised and unsupervised learning to predict survival times for cancer
patients. They first employ a supervised approach to select a subset of genes from a gene expression dataset
that correlates with survival. Then, unsupervised clustering is applied to these gene subsets to identify
cancer subtypes. Once such subtypes are identified, they apply again supervised learning techniques to
classify future patients into the appropriate subgroup or to predict their survival. Although the authors call
the resulting approach semi-supervised, their setting is clearly different from ours.
There has also been some work that models a survival analysis task as a semi-supervised learning problem
by employing a self-training strategy to predict event times from observed and censored data points. Both
[29, 30] treat the censored data points as unlabeled, thereby ignoring the time-to-event information that they
contain. Liang et al. [31] do use some information from the censored times, in the sense that they disregard
data points for which the model predicts a value lower than the right-censored time points. They combine
Cox proportional hazard (Cox) and accelerated failure time (AFT) model in a semi-supervised set-up to
predict the treatment risk and the survival time of cancer patients. Regularization is used for gene selection,
which is an essential task in cancer survival analysis. The authors found that many censored data points
persistently violate the constraint that the predicted survival time should be higher than the censored time,
restricting the full exploitation of the censored data. Therefore, in follow-up work [32], they embedded a
self-paced learning mechanism in their framework to gradually introduce more complex data samples in the
training process, leading to a more accurate estimation for the censored samples. An important difference
between our work and the discussed studies is that we consider situations where apart from fully observed
and censored instances, we also have a third category, namely extra data points that are unlabeled. To our
knowledge, this is the first study to investigate the use of unlabeled instances in the survival context.
4. Methodology
In order to predict event times in the presence of observed, censored, and unlabeled instances, we propose
three approaches.
The first approach is a straightforward application of a survival analysis method (in our case, RSF), in which
we add the unlabeled set as censored instances, with the corresponding event time set to zero. We call the
first approach random survival forest with unlabeled data (RSF+UD). Figure 2 depicts the block diagram
of the first proposed pipeline.
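As a sketch of this construction (a hypothetical helper with an assumed record layout, not the paper's code), the three subsets can be merged into one survival dataset in which each record carries a time and an event indicator, so that a survival learner such as RSF can consume all instances at once:

```python
def build_rsf_ud_dataset(observed, censored, unlabeled):
    """Merge the three subsets into one survival dataset.

    observed  : list of (features, event_time) with the event observed
    censored  : list of (features, censoring_time)
    unlabeled : list of features only
    Returns records (features, time, event), event = 1 iff observed.
    """
    data = [(x, t, 1) for x, t in observed]       # fully observed events
    data += [(x, t, 0) for x, t in censored]      # right-censored at time t
    data += [(x, 0.0, 0) for x in unlabeled]      # unlabeled: censored at t = 0
    return data
```

Encoding an unlabeled instance as censored at time zero conveys only the trivial information that its event time is greater than zero, which is precisely the intent of the RSF+UD baseline.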
In the second approach, we apply a semi-supervised learning approach called self-trained random survival forest.