[2].
Traditional survival analysis methods include the Cox Proportional Hazards model (CPH) [3]. CPH is essentially a linear model that estimates the simultaneous effect of several risk factors on survival time.
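Concretely, the CPH model is usually written as follows (standard notation, introduced here for illustration rather than taken from this article):
\[
h(t \mid x) = h_0(t) \exp\big(\beta^\top x\big),
\]
where $h_0(t)$ is an unspecified baseline hazard, $x$ is the vector of covariates, and $\beta$ contains the regression coefficients; the covariates enter only through the linear predictor $\beta^\top x$, which is where the model's linearity assumption comes in.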
However, these standard survival models face several challenges on real-world datasets. For instance, they cannot easily capture nonlinear relationships between the covariates and the outcome. In addition, high-dimensional data are common in many applications, e.g., gene expression data, yet these traditional methods are unable to deal with such data efficiently. As a
result, machine learning-based techniques have become increasingly popular in the survival analysis context
over recent years [4]. Applying machine learning methods directly to censored data is challenging since the
value of a measurement or observation is only partially known. Several studies have successfully modified machine learning algorithms to make use of censored information in survival analysis; examples include decision trees [5], artificial neural networks (ANN) [6], and support vector machines (SVM) [7]. Popular
ensemble-based frameworks include bagging survival trees [8] and random survival forests [9]. More advanced learning tasks, such as active learning [10] and transfer learning [11], have also been extended to survival analysis.
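As a concrete illustration of such an ensemble method, the following minimal sketch fits a random survival forest on synthetic right-censored data; the scikit-survival library and all names and hyperparameters below are our own choices for illustration, not part of the cited works.

    # Fit a random survival forest on synthetic right-censored data.
    import numpy as np
    from sksurv.ensemble import RandomSurvivalForest
    from sksurv.util import Surv

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                 # baseline covariates
    time = rng.exponential(scale=10.0, size=200)  # follow-up times
    event = rng.random(200) < 0.7                 # True = event, False = censored
    y = Surv.from_arrays(event=event, time=time)  # structured (event, time) array

    rsf = RandomSurvivalForest(n_estimators=100, min_samples_leaf=10,
                               random_state=0)
    rsf.fit(X, y)
    risk_scores = rsf.predict(X[:5])              # higher score = higher estimated risk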
Long-term follow-up of patients is often expensive in terms of time, effort, and money. As a result, the number of subjects that are included in a study and followed over time is often limited. However, many more
subjects may exist (e.g., through retrospective data collection) that meet the inclusion/exclusion criteria of
the follow-up study. If the study aims to predict outcomes based on variables collected at baseline, then we
hypothesize that these extra (unlabeled) data points might actually boost the predictive performance of the
resulting model, if used wisely. This corresponds to a semi-supervised learning set-up [12], which deals with
scenarios where only a small fraction of the instances in the training data have an outcome label attached, while the rest are unlabeled. To our knowledge, such a semi-supervised learning set-up has never been investigated
in the context of survival analysis, and with this article, we aim to fill this gap.
Including unlabeled instances in a survival analysis task leads to three distinct subsets of data that differ in the amount of supervised information they contain: a set of (1) fully observed, (2) partially observed (censored), and (3) unobserved data points. Our goal is to consider these three subsets of data jointly.
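In notation that we introduce here for clarity (it is not taken from the article itself), the training data can be written as
\[
\mathcal{D} = \{(x_i, t_i, \delta_i = 1)\}_{\text{observed}} \cup \{(x_i, t_i, \delta_i = 0)\}_{\text{censored}} \cup \{x_j\}_{\text{unlabeled}},
\]
where $x$ denotes the baseline covariates, $t$ the observed event or censoring time, and $\delta$ the event indicator; for unlabeled instances, neither $t$ nor $\delta$ is available.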
In particular, we address two research questions: (1) can the predictive performance on an independent test set be increased by including unlabeled instances (i.e., does the semi-supervised learning setting carry over to the survival analysis context)? and (2) what is the best approach to integrate the three subsets of data
in the analysis? To address this second question, we propose and compare three different approaches. The first approach is to treat the unlabeled instances as censored, with censoring time equal to zero, and to apply a machine learning-based survival analysis technique.
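A minimal sketch of this first approach is given below; the helper function, variable names, and the use of NumPy and scikit-survival are our assumptions for illustration, not the article's implementation.

    # Append unlabeled instances as censored-at-zero observations.
    import numpy as np
    from sksurv.util import Surv

    def augment_with_unlabeled(X_lab, event_lab, time_lab, X_unlab):
        """Treat unlabeled rows as censored with censoring time zero."""
        n_unlab = len(X_unlab)
        X_all = np.vstack([X_lab, X_unlab])
        event_all = np.concatenate([event_lab, np.zeros(n_unlab, dtype=bool)])
        # Some estimators require strictly positive times; a tiny epsilon
        # (e.g., 1e-8) can then be used instead of exact zeros.
        time_all = np.concatenate([time_lab, np.zeros(n_unlab)])
        return X_all, Surv.from_arrays(event=event_all, time=time_all)

    # Any survival model can then be trained on the augmented data, e.g.:
    # from sksurv.ensemble import RandomSurvivalForest
    # X_all, y_all = augment_with_unlabeled(X_lab, event_lab, time_lab, X_unlab)
    # rsf = RandomSurvivalForest(random_state=0).fit(X_all, y_all)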
For the second approach, we apply a standard semi-supervised learning method; in particular, we use the widely used self-training wrapper technique