time-to-event data with dependent censoring. First devel-
oped in the machine learning community [79, 22] and then
translated into the statistical world [23], gradient boosting
is one of the most effective machine learning tools currently
available [58]. In particular, recent implementations
such as XGBoost [8], LightGBM [43] and CatBoost [18] have
been shown to work extremely well in many situations
[61, 67], and a boosting model that deals with dependent
censoring is highly desirable.
We will tackle the problem with the help of copulas
[85]. Copulas are cumulative distribution functions with
uniform marginals, which can be interpreted as the de-
pendence structure between two or more distributions.
In this paper, we use a Clayton copula to model
the dependency between the event time and the censoring
time. This approach is not new in the literature; it
started with the seminal work of [96], in which a generalization
of the Kaplan-Meier estimator, called the copula-graphic
estimator, was developed to handle dependent censoring. Extensions
of that work include [75], which focus on Archimedean
copulas, and [4], which study it in the fixed-design situation.
Among more recent works, we mention [89], which
consider the left-truncation case, and [11], which relax the
assumption that the parameter defining the copula function
is known. For a full description of the use of copulas
to model dependent censoring in time-to-event data problems,
we refer the reader to the book of [20].
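To make the notion of a copula concrete, the following is a minimal sketch of the bivariate Clayton copula, C(u, v) = (u^(-θ) + v^(-θ) - 1)^(-1/θ) for θ > 0, together with a sampler based on conditional inversion. The function names (`clayton_cdf`, `sample_clayton`, `kendall_tau`) are illustrative choices and are not taken from the paper or from any library.

```python
import numpy as np


def clayton_cdf(u, v, theta):
    """Bivariate Clayton copula C(u, v) for theta > 0 and u, v in (0, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def sample_clayton(n, theta, rng):
    """Draw n pairs (u, v) with uniform marginals and Clayton dependence.

    Uses conditional inversion: draw u and an independent uniform w,
    then solve dC/du (u, v) = w for v.
    """
    # 1 - uniform[0, 1) lies in (0, 1], which avoids raising 0 to a negative power.
    u = 1.0 - rng.uniform(size=n)
    w = 1.0 - rng.uniform(size=n)
    v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** -theta + 1.0) ** (-1.0 / theta)
    return u, v


def kendall_tau(theta):
    """Kendall's tau implied by a Clayton copula with parameter theta."""
    return theta / (theta + 2.0)
```

In this sketch, a larger θ gives stronger positive dependence, and θ close to 0 approaches independence. Transforming `u` and `v` through the inverse marginal distribution functions of the event and censoring times would give dependent (T, U) pairs.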
The rest of the paper is organized as follows. In Section
1 we describe dependent censoring and introduce the basic
concepts of our approach: the Clayton copula, the accelerated
failure time model, and the boosting algorithm. The
novel approach is presented in Section 2 and evaluated via
simulation and on real data in Section 3. Section 4 ends
the paper with some remarks.
1 Methods
1.1 Time-to-event prediction and
likelihood-based inference
In survival analysis, the term survival time, or event time,
refers to the time elapsed from an origin to the occurrence
of an event. While the target variable is commonly
referred to as "time", it can also be measured in other units such
as cycles, rounds, or even friction, as will be seen in Section
3. A common feature of the different applications of
time-to-event prediction is the presence of censored data.
There are three main types of censoring. In right censoring,
the event time is known to be greater than a certain
value, while in left censoring, the event time is known to
be less than a certain value. In interval censoring, the
event time is known to lie between two values. This paper
focuses on right censoring, as this is the most common
type. However, the proposed methodology is
easily extended to handle left and interval censoring by
making a small change to the likelihood function.
In addition to the three types of censoring, the censoring
mechanism can have different characteristics. Type
I censoring means that the event is censored only if it
happens after a pre-specified time. This is also called
administrative censoring, and all subjects remaining at the
specified time are right censored. Type II censoring
means that an experiment stops after a pre-specified number
of events has occurred, and the rest of the subjects
are right censored. When the censoring is independent and
non-informative, every subject has a probability of being
censored that is statistically independent of the event
time. In contrast, if the censoring is referred to as
dependent, the subject is censored by a mechanism related to
the event time. This paper concerns the latter
case and proposes a method that makes statistical
learning methods applicable to this type of censoring,
rather than only to independent and non-informative
censoring, which is often assumed in the literature. Before
going into the details, we review the classical terminology
of time-to-event data, which is used throughout this paper.
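As an illustration of the Type I and Type II mechanisms described above, the following minimal sketch applies each of them to a vector of fully known event times. The helper names `type_i_censor` and `type_ii_censor` are hypothetical, and the Type II helper assumes continuous event times without ties.

```python
import numpy as np


def type_i_censor(event_times, cutoff):
    """Type I (administrative) censoring at a pre-specified time `cutoff`."""
    T = np.asarray(event_times, dtype=float)
    observed_time = np.minimum(T, cutoff)
    event_observed = (T <= cutoff).astype(int)  # 1 = event seen, 0 = right censored
    return observed_time, event_observed


def type_ii_censor(event_times, n_events):
    """Type II censoring: stop once `n_events` events have occurred.

    Assumes no ties among the event times, so that exactly `n_events`
    subjects have an observed event.
    """
    T = np.asarray(event_times, dtype=float)
    stop_time = np.sort(T)[n_events - 1]  # time of the n_events-th event
    observed_time = np.minimum(T, stop_time)
    event_observed = (T <= stop_time).astype(int)
    return observed_time, event_observed
```

In both helpers the stopping rule does not depend on an individual subject's covariates; dependent censoring, the focus of this paper, is not covered by this sketch.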
Consider two random variables: T is the event time
and U is the censoring time. The two variables are mutually
exclusive, and only one of T or U is observed. We
observe T if the event occurs no later than censoring (T ≤
U), and we do not observe T if censoring happens earlier
than the event (U < T). An observation i in time-to-event
data consists of (t, δ, x), where t is the event time or
censoring time, depending on which comes first, δ is the
censoring indicator, which is 1 if the event time is observed
and 0 if the observation is censored, and x is the vector
of covariates. To put it differently, t = min{T, U} and
δ = I(T ≤ U).
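The construction of the observed pair (t, δ) from the latent times can be sketched as follows. The function name `observe` is an illustrative choice. In real data only (t, δ, x) is available, so access to both T and U is an assumption that holds only in simulations.

```python
import numpy as np


def observe(event_time, censoring_time):
    """Return (t, delta) with t = min(T, U) and delta = I(T <= U)."""
    T = np.asarray(event_time, dtype=float)
    U = np.asarray(censoring_time, dtype=float)
    t = np.minimum(T, U)
    delta = (T <= U).astype(int)  # 1 = event observed, 0 = censored
    return t, delta


# Three subjects: event first, censoring first, and a tie (counted as an event).
t, delta = observe([2.0, 5.0, 3.0], [4.0, 1.0, 3.0])
```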
To perform likelihood-based inference on time-to-event
data, one should consider both the case of censored and
complete observations, such that the likelihood becomes
\[
L = \Pr(T = t,\, U > t \mid x)^{\delta} \, \Pr(T > t,\, U = t \mid x)^{1-\delta}, \tag{1}
\]
which yields a computationally difficult expression, due to
the two joint probability distribution functions. A common
assumption made to simplify this expression is that of
independent and non-informative censoring
(as defined in [20]):
• Independent censoring: Event time and censoring
time are independent given the covariates.
• Non-informative censoring: The censoring distribution
does not involve any parameters related to the
distribution of the survival times.
In real-world situations, censoring is usually non-informative
if it is independent, and in the rest of the paper we
assume that independent censoring implies non-informative
censoring. Note that the independent censoring assumption
states that T and U are conditionally independent
given x. In other words, even if there is dependency
between T and U, the independent censoring assumption
holds when the covariates contain all information
about that dependency. This specific case is not explored
further in this paper, as most real-world situations rarely
provide all the necessary information in the covariates.
Under the assumption of independent censoring, the
likelihood function can be rewritten as
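\[
\begin{aligned}
L &= [\Pr(T = t \mid x)\,\Pr(U > t \mid x)]^{\delta}\,[\Pr(T > t \mid x)\,\Pr(U = t \mid x)]^{1-\delta} \\
  &= [f_T(t \mid x)\,S_U(t \mid x)]^{\delta}\,[S_T(t \mid x)\,f_U(t \mid x)]^{1-\delta} \\
  &= [f_T(t \mid x)^{\delta}\,S_T(t \mid x)^{1-\delta}]\,[f_U(t \mid x)^{1-\delta}\,S_U(t \mid x)^{\delta}],
\end{aligned}
\tag{2}
\]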
where $S_T(t \mid x) = \Pr(T > t \mid x)$, $f_T(t \mid x) = -dS_T(t \mid x)/dt$,
$S_U(t \mid x) = \Pr(U > t \mid x)$, and $f_U(t \mid x) = -dS_U(t \mid x)/dt$.
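Under non-informative censoring, the second bracket in (2) does not involve the parameters of the event-time distribution, so only the first bracket matters when estimating them. The following minimal sketch evaluates the logarithm of that first bracket and, purely as an illustrative assumption, uses an exponential event-time model without covariates. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def event_log_likelihood(t, delta, log_f, log_S):
    """Sum over observations of delta * log f_T(t) + (1 - delta) * log S_T(t)."""
    t = np.asarray(t, dtype=float)
    delta = np.asarray(delta, dtype=float)
    return float(np.sum(delta * log_f(t) + (1.0 - delta) * log_S(t)))


def exponential_log_likelihood(t, delta, rate):
    """Event-time part of (2) for an exponential model with hazard `rate`."""
    def log_f(s):
        return np.log(rate) - rate * s  # log density

    def log_S(s):
        return -rate * s  # log survival function

    return event_log_likelihood(t, delta, log_f, log_S)


# Toy data: observed times and censoring indicators (1 = event observed).
t = np.array([2.0, 1.0, 3.0, 0.5])
delta = np.array([1, 0, 1, 1])

# For the exponential model the maximizer has a closed form:
# number of observed events divided by the total observed time.
rate_hat = delta.sum() / t.sum()
```

Censored observations contribute only through the survival function, which is why they still carry information about the event-time distribution.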