A comparative study to alternatives to the log-rank test
Ina Dormuth1, Tiantian Liu2, Jin Xu3, Markus Pauly1,4, and Marc Ditzhaus5
1Department of Statistics, TU Dortmund University, Dortmund, Germany
2Technion – Israel Institute of Technology, Haifa, Israel
3East China Normal University, Shanghai, China
4Research Center Trustworthy Data Science and Security, UA Ruhr, Dortmund
5Department of Mathematics, Otto von Guericke University Magdeburg,
Magdeburg, Germany
October 25, 2022
Abstract
Studies to compare the survival of two or more groups using time-to-event data are of high
importance in medical research. The gold standard is the log-rank test, which is optimal under
proportional hazards. As the latter is not a mere regularity assumption, we are interested in
evaluating the power of various statistical tests under different settings, including proportional
and non-proportional hazards, with a special emphasis on crossing hazards. This question
has been studied for many years, and multiple methods have already been investigated
in extensive simulation studies. However, in recent years new omnibus tests and methods
based on the restricted mean survival time appeared that have been strongly recommended in
biometric literature. Thus, to give updated recommendations, we perform a vast simulation
study to compare tests that showed high power in previous studies with these more recent
approaches. We thereby analyze various simulation settings with varying survival and censoring
distributions, unequal censoring between groups, small sample sizes and unbalanced group
sizes. Overall, omnibus tests are more robust in terms of power against deviations from the
proportional hazards assumption.
KEYWORDS: survival analysis, crossing hazards, non-proportional hazards, simulation study,
log-rank
arXiv:2210.13258v1 [stat.ME] 24 Oct 2022
Introduction
The distributional comparison of two populations with censored time-to-event data is one of the
most common inferential problems in survival analysis. The log-rank test is used as a standard tool
in many medical or clinical studies. It is known to be optimal under the assumption of proportional
hazards (PH). However, this assumption is often not met in reality due to various forms of deviation
such as crossing hazards, or early/late differences in survival curves. Kristiansen1 conducted a
survey revealing that in 70% of studies with crossing survival curves the log-rank test was used
even though this leads to a loss of power. Furthermore, Trinquart et al.2 revisited 54 phase III
oncology studies from five leading medical journals (New England Journal of Medicine, Lancet,
Lancet Oncology, Journal of Clinical Oncology, Journal of the American Medical Association) and
found that for almost a fourth of the comparisons the proportional hazard assumption was rejected.
Non-proportionality as severe as crossing can appear when the treatment effects change over time.
A common example is seen in immunotherapy which bears an early high risk but a long-term
benefit.3,4 Thus, the question of how to deal with non-proportional hazards is of high interest and
has been investigated by many authors. For example, Royston and Parmar5 published a simulation
study comparing nine methods implemented in Stata and showed a preference for modified weighted
log-rank tests6–9. However, they did not include the situation of crossing hazards. Another overview
was given by Lin et al.10, focusing on combined weighted Kaplan-Meier and weighted log-rank tests.
They concluded that, in the absence of prior knowledge, the MaxCombo test showed the most
robust behavior among the tests under consideration.10 Perhaps the most extensive study regarding
crossing hazards was given in Li et al.11 who compared 21 tests designed to handle crossing hazards.
They stated that the two-stage test by Qiu and Sheng12 or the test by Kraus13 are the most suitable
among the studied tests. A general overview of existing methods and recommendations regarding
trial design was created by Ananthakrishnan et al.14 without numerical comparison. None of the
mentioned reviews considered new results on projection-type tests, sample space partition tests or tests based on the area under
the survival curve.15–18 Recently, some of the new procedures have shown considerable power
advantages in illustrative data analyses.19
We therefore extend these investigations by comparing the best performers from the existing
simulation studies above with more recent approaches. Our comprehensive simulation
study covers 20 representative scenarios including four null scenarios, four scenarios with PH, four
scenarios with non-PH (excluding crossing structures) and eight scenarios with a special emphasis
on crossing hazards. Since most procedures exhibit good properties for large samples, our study
focuses on small to moderate sample sizes. In the next section we review the
tests under study and their implementation. Afterwards, the different simulation and parameter
settings are presented alongside the results of the simulation study. The utility of the tests
is further evaluated using reconstructed data from a phase III clinical trial with moderate sample
size. The findings are then discussed and conclusions are drawn, particularly focusing on the tests’
power.
Methods
Multiple approaches to test the hypothesis of two equal survival functions have been developed.
For ease of presentation, we categorize them in four groups and review the main ideas of the
recommended ones in each group. Details on the methods can be found in the cited literature as
well as the extended methods section in the Supplement.
Log-rank test and its weighted variants
The standard to compare two survival functions S1 and S2 is the log-rank test (LR).20 It belongs
to the class of weighted log-rank tests21 that use the difference between the expected and observed
number of events to derive a test statistic. These tests differ in the weight functions that they are
employing. For instance, the log-rank test gives the same weight to all event times. Therefore, it
is optimal under proportional hazards. The Peto-Peto test (PP) uses the Kaplan-Meier estimator
Ŝ(t) of the survival function as weight, which leads to a test that is more sensitive to early differences.22
Various approaches to compute sample sizes for log-rank tests have been introduced, with
Schoenfeld’s formula being the most popular.23
In practice, prior information about the survival behavior of the populations being compared
is often lacking, and any mismatch between the chosen weight (or test) and the true difference in
the survival functions will lead to sub-optimal power.11
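As an illustration of how the weight enters the test statistic, the following is a minimal sketch of a two-sample weighted log-rank statistic with weights of the form Ŝ(t−)^rho, where Ŝ is the pooled Kaplan-Meier estimate just before each event time. It is written in plain Python rather than R, it is not the code used in the study, and the function name `weighted_logrank` and its arguments are illustrative assumptions. Setting rho = 0 gives equal weights (log-rank) and rho = 1 gives Peto-Peto-type weights.

```python
import math


def weighted_logrank(times, events, groups, rho=0.0):
    """Two-sample weighted log-rank statistic with weights S(t-)^rho.

    times  : observed times
    events : 1 = event observed, 0 = censored
    groups : group labels, 0 or 1
    Returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    data = sorted(zip(times, events, groups))
    n = len(data)
    at_risk = n                                   # pooled number at risk
    at_risk_1 = sum(g for _, _, g in data)        # number at risk in group 1
    surv = 1.0                                    # pooled KM just before t
    num, var = 0.0, 0.0
    i = 0
    while i < n:
        t = data[i][0]
        d = d1 = leaving = leaving_1 = 0
        j = i
        while j < n and data[j][0] == t:          # collect ties at time t
            _, e, g = data[j]
            d += e
            d1 += e * g
            leaving += 1
            leaving_1 += g
            j += 1
        if d > 0:
            w = surv ** rho
            frac = at_risk_1 / at_risk
            num += w * (d1 - d * frac)            # observed minus expected
            if at_risk > 1:                       # hypergeometric variance
                var += w * w * d * frac * (1 - frac) * (at_risk - d) / (at_risk - 1)
            surv *= 1 - d / at_risk               # update pooled KM
        at_risk -= leaving
        at_risk_1 -= leaving_1
        i = j
    if var == 0:
        raise ValueError("variance is zero; statistic is undefined")
    chi2 = num * num / var
    p_value = math.erfc(math.sqrt(chi2 / 2))      # upper tail of chi-square(1)
    return chi2, p_value
```

Calling `weighted_logrank(times, events, groups, rho=1.0)` on the same data down-weights late event times, which is why a weight that does not match the true pattern of differences costs power.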
Two-stage test
The two-stage (TS) method introduced by Qiu and Sheng12 provides a solution to the weight
selection problem in dealing with possible non-PH situations. The procedure gets its name from
the sequential testing approach. More specifically, it conducts the standard log-rank test in the
first stage. If the LR test does not reject the null hypothesis, an asymptotically independent test
for crossing hazards is carried out. It is shown to be efficient with good adaptation and reliable in
power performance under both PH and non-PH situations.12,18 The approach was extended to the
k-sample case employing asymptotically independent tests.24
Omnibus tests
Another remedy to avoid potentially sub-optimal power performance is to use an omnibus test that
is not tailored to a specific alternative hypothesis.
The mdir test proposed by Brendel et al.15 and revisited by Ditzhaus and Friedrich16 uses
a quadratic form-type statistic in multiple weighted LR statistics to cover broader alternatives.
The test has high power for all alternatives corresponding to the chosen weights and combinations
thereof. The test should be used especially when no prior information is available, because with
prior knowledge a weighted test with a single suitable weight would have higher power. A
notable feature of the mdir test is that its permutation version can handle small samples
with satisfactory type-I error and power performance.15,16 The mdir test was extended to handle the
one-sided testing problem as well as factorial designs.25,26 A procedure for sample size calculation
does not exist yet.
The class of maximum weighted log-rank tests takes a different approach to combining multiple
weighted log-rank tests. Here, multiple test statistics with different weights are considered and the
final test statistic is defined as the maximum over all of them. The MaxCombo test (MC) proposed
by Lin et al.10 combines four weighted log-rank tests with Fleming-Harrington type weights
targeting differences in the survival functions under PH, late differences, middle differences, and early
differences, respectively. An iterative sample size calculation approach was provided by Roychoudhury
et al.27 The test can also be used for one-sided hypotheses.
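To make the four weight profiles concrete, the sketch below evaluates Fleming-Harrington weights of the form Ŝ(t−)^rho (1 − Ŝ(t−))^gamma. The (rho, gamma) pairs (0,0), (0,1), (1,1) and (1,0) are an assumption: they are the combination commonly associated with the MaxCombo test and match the PH, late, middle and early targets named above, but they are not stated in this text. The function names are illustrative, and the p-value calculation, which needs the joint distribution of the four statistics, is deliberately left out.

```python
# Assumed (rho, gamma) pairs for the four component tests; not taken from the text.
MAXCOMBO_WEIGHTS = {
    "PH": (0, 0),      # constant weight, ordinary log-rank
    "late": (0, 1),    # weight grows as the pooled survival drops
    "middle": (1, 1),  # weight peaks when the pooled survival is 0.5
    "early": (1, 0),   # weight is largest at the start of follow-up
}


def fh_weight(surv_left, rho, gamma):
    """Fleming-Harrington weight at pooled survival value S(t-)."""
    return surv_left ** rho * (1.0 - surv_left) ** gamma


def weight_profile(surv_left):
    """Weights of all four component tests at one pooled survival value."""
    return {name: fh_weight(surv_left, rho, gamma)
            for name, (rho, gamma) in MAXCOMBO_WEIGHTS.items()}


def max_statistic(z_values):
    """Two-sided maximum over standardized weighted log-rank statistics."""
    return max(abs(z) for z in z_values)
```

For example, `weight_profile(0.9)` shows that early in follow-up the "early" weight dominates while the "late" weight is close to zero.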
Gorfine et al.17 introduced K-sample omnibus non-proportional hazards (KONP) tests based
on sample space partitions that also handle right-censored data. P-values are obtained by
a censoring-friendly permutation procedure. The provided tests are based on two different test
statistics, namely the log-likelihood ratio (KONP llr) and the chi-squared test statistic (KONP chi).
Extensive simulation studies17 showed that the choice of test statistic does not influence the
performance. Hence, we only consider the KONP chi test in our study.
Tests based on the area under the survival curve
Tests based on restricted mean survival times (RMST) are often advocated in the context of crossing
hazards.2,28–30 The RMST can be interpreted as the mean event-free survival time up to τ, where
τ is a pre-defined truncation time up to which the mean is of interest. In practice, it is recommended to set τ
to 90% of the minimum over the two groups of the largest observed (censored or uncensored) time.31
The RMST-based test is easy to interpret and distribution-free.29 Moreover,
it can be used to test superiority or non-inferiority.
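The definition of the RMST as the area under the survival curve up to τ can be sketched as follows. This is a plain-Python illustration under simple assumptions (Kaplan-Meier estimate, step-function integration), not the survRM2 implementation used in the study; the function names `km_steps`, `rmst` and `default_tau` are hypothetical.

```python
def km_steps(times, events):
    """Kaplan-Meier estimate as a list of (event_time, survival_after_time)."""
    data = sorted(zip(times, events))
    n = len(data)
    surv, steps, i = 1.0, [], 0
    while i < n:
        t = data[i][0]
        j, d = i, 0
        while j < n and data[j][0] == t:   # count events among ties at t
            d += data[j][1]
            j += 1
        if d > 0:
            surv *= 1 - d / (n - i)        # n - i subjects are at risk at t
            steps.append((t, surv))
        i = j
    return steps


def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    area, last_t, last_s = 0.0, 0.0, 1.0
    for t, s in km_steps(times, events):
        if t >= tau:
            break
        area += last_s * (t - last_t)
        last_t, last_s = t, s
    return area + last_s * (tau - last_t)


def default_tau(times_a, times_b, fraction=0.9):
    """The 90% rule: fraction of the smaller of the two largest observed times."""
    return fraction * min(max(times_a), max(times_b))
```

The difference `rmst(times_a, events_a, tau) - rmst(times_b, events_b, tau)` is the quantity an RMST-based test compares against zero; its variance estimate is omitted here.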
The test proposed by Liu et al.18 aims to detect crossing survival curves based on the area
between the curves (ABC). It can capture the alternative of two crossing survival functions that
produce the same RMST. The test obtains its p-value by (group-wise) bootstrapping, which allows
different censoring distributions between groups. This test is shown to be more powerful than other
distance-based tests such as the modified Kolmogorov-Smirnov test32 and the generalized Cramér-von
Mises test33. Since the test statistic quantifies the difference in absolute value, it cannot be
used for superiority or non-inferiority testing.
Simulation study
To evaluate the performance of the presented methods, we employed extensive Monte Carlo simu-
lations for different scenarios and settings. We simulated data for two groups under exponential,
Weibull, Gompertz and log-normal distributions. Thus, we follow well-established recommendations
on the choice of survival distributions for simulation studies.34
Scenarios
We considered four null scenarios, each with a different distribution function. For alternatives, we
considered (i) four scenarios with proportional hazards, (ii) four scenarios with non-proportional
and non-crossing hazards, and (iii) eight scenarios with crossing hazards. The concrete survival
and hazard functions can be found in the Supplement, see Tables S2-S6 therein. For each scenario
we vary the group sizes (from 20 to 100), the censoring rates (from 0% to 60%) and the censoring
distributions (uniform, exponential) as listed in Table S1 in the Supplement. Thus, we studied
20 (scenarios) × 5 (sample sizes) × 4 (censoring rates) × 2 (censoring distributions) = 800 different settings.
We list three exemplary scenarios in Table 1.
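The full factorial structure of the design can be sketched as a simple grid enumeration. Only the number of levels per factor and the two censoring distributions are taken from the text; the concrete sample sizes and censoring rates are listed in Table S1 of the Supplement, so the labels below are placeholders and not the values used in the study.

```python
from itertools import product

# 4 null + 4 PH + 4 non-PH (non-crossing) + 8 crossing scenarios
scenarios = [f"scenario_{k}" for k in range(1, 21)]
# Placeholder labels: the actual levels are given in Table S1 of the Supplement.
sample_size_settings = [f"sample_size_setting_{k}" for k in range(1, 6)]
censoring_rate_settings = [f"censoring_rate_setting_{k}" for k in range(1, 5)]
censoring_distributions = ["uniform", "exponential"]

settings = list(product(scenarios,
                        sample_size_settings,
                        censoring_rate_settings,
                        censoring_distributions))

print(len(settings))  # 20 * 5 * 4 * 2 = 800
```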
For each setting, 5,000 replications were performed. Throughout, we set the nominal level to
0.05. The actual type-I error and power were estimated by the rejection rates. For 10 out of the 800
settings (all with small sample sizes), the MC test failed to provide a result in some replications (fewer than 0.5%
of them). In these cases, the rejection rate was computed from the slightly smaller
number of replications with a result. Throughout, we used R (version 4.0.0)35 for all simulations.
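To judge how precisely a rejection rate is estimated from 5,000 replications, one can use the standard binomial approximation for its Monte Carlo standard error. This calculation is not part of the original text; it is a minimal sketch, and the function name is illustrative.

```python
import math


def mc_standard_error(rate, replications):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1 - rate) / replications)


# Precision of an empirical type-I error when the true level is 0.05
se = mc_standard_error(0.05, 5000)
lower, upper = 0.05 - 1.96 * se, 0.05 + 1.96 * se
print(round(se, 4), round(lower, 3), round(upper, 3))
```

Under this approximation, an empirical type-I error outside roughly 0.044 to 0.056 would not be explained by simulation noise alone.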
Implementation details
The LR and PP tests can be called in R using the function survdiff from the survival package.36
The test is selected via the argument rho (rho = 0 for LR and rho = 1 for PP).
The R package TSHRC37 contains the implementation of the TS test via the function twostage.
The mdir test is included in the R package mdir.logrank.38 Below, we refer to this test as mdir-x, where
‘x’ stands for the number of weights considered. For the MC we use the weights proposed by Lin
et al.10 and its implementation in the R package nphsim.39 The KONP tests are implemented in the
R package KONPsurv.40 The package provides tests based on two different test statistics, namely
the log-likelihood ratio and the chi-squared test statistic. Since the authors did not detect any
difference in performance, we only consider the chi-squared test statistic (KONP). An RMST-based
test for two-group comparisons is given in the R package survRM2.41 The function used here is
rmst2, which requires a truncation time tau. The published R code for the ABC test is
provided on GitHub (https://github.com/LTTGH/RBT4TCSC). For both tests, τ was set to 90%
of the minimum over the two groups of the largest observed (censored or uncensored) time.31