Introduction
The distributional comparison of two populations with censored time-to-event data is one of the
most common inferential problems in survival analysis. The log-rank test is used as a standard tool
in many medical or clinical studies. It is known to be optimal under the assumption of proportional
hazards (PH). However, this assumption is often not met in practice due to various forms of deviation
such as crossing hazards or early/late differences in survival curves. Kristiansen1 conducted a
survey revealing that in 70% of studies with crossing survival curves the log-rank test was used
even though this leads to a loss of power. Furthermore, Trinquart et al.2 revisited 54 phase III
oncology studies from five leading medical journals (New England Journal of Medicine, Lancet,
Lancet Oncology, Journal of Clinical Oncology, Journal of the American Medical Association) and
found that for almost a quarter of the comparisons the proportional hazards assumption was rejected.
Non-proportionality as severe as crossing can appear when the treatment effects change over time.
A common example is seen in immunotherapy, which carries a high early risk but a long-term
benefit.3,4 Thus, the question of how to deal with non-proportional hazards is of high interest and
has been investigated by many authors. For example, Royston and Parmar5 published a simulation
study comparing nine methods implemented in Stata and showed a preference for modified weighted
log-rank tests.6–9 However, they did not include the situation of crossing hazards. Another overview
was given by Lin et al.10, focusing on combined weighted Kaplan-Meier and weighted log-rank tests.
They concluded that, in the absence of prior knowledge, the MaxCombo test showed the most
robust behavior among the tests under consideration.10 Perhaps the most extensive study regarding
crossing hazards was conducted by Li et al.,11 who compared 21 tests designed to handle crossing hazards.
They stated that the two-stage test by Qiu and Sheng12 or the test by Kraus13 are the most suitable
among the studied tests. A general overview of existing methods and recommendations regarding
trial design was provided by Ananthakrishnan et al.14 without numerical comparison. None of
these reviews considered recent results on projection-type, sample-space-partition, or
area-under-the-survival-curve tests.15–18 Recently, some of the new procedures have shown considerable power
advantages in illustrative data analyses.19
We therefore enrich these investigations by comparing the best performers from the existing
simulation studies above with more recent approaches. Our comprehensive simulation
study covers 20 representative scenarios including four null scenarios, four scenarios with PH, four
scenarios with non-PH (excluding crossing structures) and eight scenarios with a special emphasis
on crossing hazards. Since most procedures exhibit good properties for large samples, our study
focuses on small to moderate sample sizes. In the next section, we review the tests under study
and their implementation in more detail. Afterwards, the different simulation and parameter
settings are presented together with the results of the simulation study. The utility of the tests
is further evaluated using reconstructed data from a phase III clinical trial with moderate sample
size. The findings are then discussed and conclusions are drawn, particularly focusing on the tests’
power.