A comparative study to alternatives to the log-rank test

Ina Dormuth1, Tiantian Liu2, Jin Xu3, Markus Pauly1,4, and Marc Ditzhaus5

1Department of Statistics, TU Dortmund University, Dortmund, Germany

2Technion – Israel Institute of Technology, Haifa, Israel

3East China Normal University, Shanghai, China

4Research Center Trustworthy Data Science and Security, UA Ruhr, Dortmund

5Department of Mathematics, Otto von Guericke University Magdeburg,

Magdeburg, Germany

October 25, 2022

Abstract

Studies to compare the survival of two or more groups using time-to-event data are of high

importance in medical research. The gold standard is the log-rank test, which is optimal under

proportional hazards. As the latter is no simple regularity assumption, we are interested in

evaluating the power of various statistical tests under different settings including proportional

and non-proportional hazards with a special emphasis on crossing hazards. This question

has been studied for many years and multiple methods have already been investigated

in extensive simulation studies. However, in recent years new omnibus tests and methods

based on the restricted mean survival time appeared that have been strongly recommended in

biometric literature. Thus, to give updated recommendations, we perform a vast simulation

study to compare tests that showed high power in previous studies with these more recent

approaches. We thereby analyze various simulation settings with varying survival and censoring

distributions, unequal censoring between groups, small sample sizes and unbalanced group

sizes. Overall, omnibus tests are more robust in terms of power against deviations from the

proportional hazards assumption.

KEYWORDS: survival analysis, crossing hazards, non-proportional hazards, simulation study,

log-rank

arXiv:2210.13258v1 [stat.ME] 24 Oct 2022

Introduction

The distributional comparison of two populations with censored time-to-event data is one of the

most common inferential problems in survival analysis. The log-rank test is used as a standard tool

in many medical or clinical studies. It is known to be optimal under the assumption of proportional

hazards (PH). However, this assumption is often not met in reality due to various forms of deviation

such as crossing hazards, or early/late differences in survival curves. Kristiansen1 conducted a

survey revealing that in 70% of studies with crossing survival curves the log-rank test was used

even though this leads to a loss in power. Furthermore, Trinquart et al.2 revisited 54 phase III

oncology studies from five leading medical journals (New England Journal of Medicine, Lancet,

Lancet Oncology, Journal of Clinical Oncology, Journal of the American Medical Association) and

found that for almost a fourth of the comparisons the proportional hazards assumption was rejected.

Non-proportionality as severe as crossing can appear when the treatment effects change over time.

A common example is seen in immunotherapy which bears an early high risk but a long-term

benefit.3,4 Thus, the question of how to deal with non-proportional hazards is of high interest and

has been investigated by many authors. For example, Royston and Parmar5 published a simulation

study comparing nine methods implemented in Stata and showed a preference for modified weighted

log-rank tests6–9. However, they did not include the situation of crossing hazards. Another overview

was given by Lin et al.10, focusing on combined weighted Kaplan-Meier and weighted log-rank tests.

They concluded that, in the absence of prior knowledge, the MaxCombo test showed the most

robust behavior among the tests under consideration.10 Perhaps the most extensive study regarding

crossing hazards was given in Li et al.11 who compared 21 tests designed to handle crossing hazards.

They stated that the two-stage test by Qiu and Sheng12 or the test by Kraus13 are the most suitable

among the studied tests. A general overview of existing methods and recommendations regarding

trial design was created by Ananthakrishnan et al.14 without numerical comparison. None of the

mentioned reviews considered new results on projection type, sample space partition or area under

the survival curve tests.15–18 Recently, some of the new procedures have shown considerable power

advantages in illustrative data analyses.19

We therefore enrich these investigations by comparing the best performers from the existing

simulation studies above with more recent approaches. Our comprehensive simulation

study covers 20 representative scenarios including four null scenarios, four scenarios with PH, four

scenarios with non-PH (excluding crossing structures) and eight scenarios with a special emphasis

on crossing hazards. Since most procedures exhibit good properties for large samples, our study

focuses on small to moderate sample sizes. In the next section we will review more details on the

tests under study and their implementation. Afterwards, the different simulation and parameter

settings are presented alongside the results of the simulation study. The utility of the tests

is further evaluated using reconstructed data from a phase III clinical trial with moderate sample

size. The findings are then discussed and conclusions are drawn, particularly focusing on the tests’

power.

Methods

Multiple approaches to test the hypothesis of two equal survival functions have been developed.

For ease of presentation, we categorize them into four groups and review the main ideas of the

recommended ones in each group. Details on the methods can be found in the cited literature as

well as the extended methods section in the Supplement.

Log-rank test and its weighted variants

The standard to compare two survival functions S1 and S2 is the log-rank test (LR).20 It belongs

to the class of weighted log-rank tests21 that use the difference between the expected and observed

number of events to derive a test statistic. These tests differ in the weight functions that they

employ. For instance, the log-rank test gives the same weight to all event times. Therefore, it

is optimal under proportional hazards. The Peto-Peto test (PP) uses the Kaplan-Meier estimator

S(t) of the survival function as weight, which leads to a test that is more sensitive to early

differences.22 Various approaches to compute sample sizes for log-rank tests have been introduced, with

Schoenfeld’s formula being the most popular.23
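
To make the sample size remark concrete, the following minimal Python sketch evaluates the commonly stated form of Schoenfeld's formula for the required number of events under proportional hazards. The function name, the default values and the allocation argument are illustrative assumptions and are not taken from the study.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.80, alloc=0.5):
    """Approximate number of events for a two-sided log-rank test.

    Uses d = (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * log(HR)^2),
    where p is the share of patients allocated to one group
    (hypothetical argument name `alloc`).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = (z_alpha + z_beta) ** 2 / (
        alloc * (1 - alloc) * math.log(hazard_ratio) ** 2
    )
    return math.ceil(d)


# Illustrative call: hazard ratio 0.5, 5% two-sided level, 80% power.
events_needed = schoenfeld_events(0.5)
print(events_needed)
```

The formula counts events rather than patients, so the total sample size additionally depends on the expected event probability, which is not modeled in this sketch.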

In practice, prior information about the survival behavior of the populations being compared is

usually lacking, and any mismatch between the selected weight (or test) and the true difference in

the survival functions leads to sub-optimal power.11
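
The role of the weight function can be illustrated with a small, self-contained Python sketch of a two-sample weighted log-rank statistic. It assumes weights of the form S(t-)^rho based on the pooled Kaplan-Meier estimator, so that rho = 0 gives the log-rank test and rho = 1 a Peto-Peto-type test. This is an illustration only: it is not the code used in the study, and its results may differ in detail from those of survdiff.

```python
import math


def weighted_logrank(times, events, groups, rho=0.0):
    """Standardized weighted log-rank statistic for two groups coded 0/1.

    The weight at each event time is S(t-)^rho, with S the pooled
    Kaplan-Meier estimator (an assumption of this sketch).
    Returns (z, two-sided normal p-value).
    """
    data = sorted(zip(times, events, groups))
    n_total = len(data)
    n1 = sum(g == 1 for _, _, g in data)  # group-1 subjects at risk
    surv = 1.0  # pooled Kaplan-Meier estimate just before the current time
    num = var = 0.0
    i = 0
    while i < n_total:
        t = data[i][0]
        j = i
        d = d1 = leaving1 = 0
        while j < n_total and data[j][0] == t:  # collect ties at time t
            _, e, g = data[j]
            d += e
            d1 += e * (g == 1)
            leaving1 += (g == 1)
            j += 1
        at_risk = n_total - i
        if d > 0:
            w = surv ** rho
            num += w * (d1 - d * n1 / at_risk)  # observed minus expected
            if at_risk > 1:
                frac = n1 / at_risk
                var += w ** 2 * d * frac * (1 - frac) * (at_risk - d) / (at_risk - 1)
            surv *= 1 - d / at_risk
        n1 -= leaving1
        i = j
    z = num / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Hypothetical toy data: 1 = event observed, 0 = censored.
times = [3, 5, 7, 9, 12, 4, 6, 8, 15, 18]
events = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
for rho in (0.0, 1.0):
    z, p = weighted_logrank(times, events, groups, rho)
    print(f"rho={rho}: z={z:.3f}, p={p:.3f}")
```

With rho = 1 the later event times receive smaller weights, which is why such a test reacts more strongly to early differences.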

Two-stage test

The two-stage (TS) method introduced by Qiu and Sheng12 provides a solution to the weight

selection problem in dealing with possible non-PH situations. The procedure gets its name from

the sequential testing approach. More specifically, it conducts the standard log-rank test in the

first stage. If the LR test does not reject the null hypothesis, an asymptotically independent test

for crossing hazards is carried out. It has been shown to be efficient, to adapt well and to be reliable in

power performance under both PH and non-PH situations.12,18 The approach was extended to the

k-sample case employing asymptotically independent tests.24
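
The sequential logic can be sketched in a few lines of Python. The sketch assumes that the overall level alpha is split into stage-wise levels alpha1 and alpha2 with alpha1 + (1 - alpha1) * alpha2 = alpha, which is what asymptotic independence of the two tests suggests. The function names and the default split are hypothetical and do not reproduce the TSHRC implementation.

```python
def stage2_level(alpha=0.05, alpha1=0.025):
    """Stage-2 level alpha2 solving alpha1 + (1 - alpha1) * alpha2 = alpha."""
    return (alpha - alpha1) / (1 - alpha1)


def two_stage_decision(p_logrank, p_crossing, alpha=0.05, alpha1=0.025):
    """Return (reject, stage) for given stage-wise p-values.

    p_logrank  : p-value of the log-rank test (stage 1)
    p_crossing : p-value of the crossing-hazards test (stage 2)
    """
    if p_logrank <= alpha1:
        return True, "stage 1"
    if p_crossing <= stage2_level(alpha, alpha1):
        return True, "stage 2"
    return False, "none"


# Illustrative p-values: the log-rank test misses, the second stage rejects.
print(two_stage_decision(p_logrank=0.20, p_crossing=0.01))
```

The second test is only consulted when the first stage does not reject, so a larger alpha1 favors proportional hazards alternatives and a smaller one favors crossing alternatives.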

Omnibus tests

Another remedy to avoid potentially sub-optimal power performance is to use an omnibus test that

is not tailored to any particular alternative hypothesis.

The mdir test proposed by Brendel et al.15 and revisited by Ditzhaus and Friedrich16 uses

a quadratic form-type statistic in multiple weighted LR statistics to cover broader alternatives.

The test has high power for all alternatives corresponding to the chosen weights and combinations

thereof. The test should be used especially when no prior information is available, because with

prior knowledge a weighted test with only one suitable weight would have a higher power. A

notable feature of the mdir test is that its permutation version can handle small sample cases

with satisfactory type-I error and power performance.15,16 The mdir test was extended to handle the

one-sided testing problem as well as factorial designs25,26. A procedure for sample size calculation

does not exist yet.

The class of maximum weighted log-rank tests takes a different approach to combining multiple

weighted log-rank tests. Here, multiple test statistics with different weights are considered and the

final test statistic is defined as the maximum over all of them. The MaxCombo test (MC)

proposed by Lin et al.10 combines four weighted log-rank tests with Fleming-Harrington type weights

targeting differences in survival functions with PH, late difference, middle difference, and early

difference, respectively. An iterative sample size calculation approach was provided by Roychoudhury

et al.27. The test can also be used for one-sided hypotheses.
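
For orientation, the Fleming-Harrington weights and the resulting maximum statistic can be written as follows. The notation is an assumption of this note: \hat S denotes the pooled Kaplan-Meier estimator, Z_{\rho,\gamma} the standardized weighted log-rank statistic, and the four parameter pairs are listed in the order PH, late, middle and early difference. The absolute value corresponds to a two-sided version.

```latex
% Fleming-Harrington weight based on the left-continuous pooled Kaplan-Meier estimator
w_{\rho,\gamma}(t) = \hat S(t-)^{\rho}\,\bigl(1-\hat S(t-)\bigr)^{\gamma},
\qquad (\rho,\gamma) \in \{(0,0),\,(0,1),\,(1,1),\,(1,0)\}

% Combined statistic over the four standardized weighted log-rank statistics
Z_{\max} = \max_{(\rho,\gamma)} \lvert Z_{\rho,\gamma} \rvert
```

Because the four statistics are correlated, the p-value of the maximum has to account for their joint distribution rather than treating them as separate tests.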

Gorfine et al.17 introduced K-sample omnibus non-proportional hazards (KONP) tests based

on sample space partition that also handle right-censored data. P-values are obtained by

a censoring-friendly permutation procedure. The provided tests are based on two different test

statistics, namely the log-likelihood ratio (KONP llr) and the chi-squared test statistic (KONP chi).

Extensive simulation studies17 showed that the choice of test statistic does not influence the

performance. Hence, we only consider the KONP chi test in our study.

Tests based on the area under the survival curve

Tests based on restricted mean survival times (RMST) are often advocated in the context of crossing

hazards.2,28–30 The RMST can be interpreted as the mean event-free survival time up to τ, where

τ is a pre-defined time up to which the truncated mean is of interest. In practice, τ is recommended

to be 90% of the minimum of the largest censored or uncensored event-time in the two groups.31

The RMST-based test enjoys the merit of easy interpretation and is distribution free.29 Moreover,

it can be used to test superiority or non-inferiority.
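
As an illustration of the quantity being compared, the following Python sketch computes the RMST as the area under the Kaplan-Meier curve up to τ and applies the 90% rule for τ described above. It is a minimal sketch with hypothetical function names and toy data, and it does not reproduce the variance estimation or the test of the survRM2 package.

```python
def km_steps(times, events):
    """Kaplan-Meier estimator as a list of (event_time, survival) steps."""
    data = sorted(zip(times, events))
    n = len(data)
    surv, steps, i = 1.0, [], 0
    while i < n:
        t = data[i][0]
        j, d = i, 0
        while j < n and data[j][0] == t:  # collect ties at time t
            d += data[j][1]
            j += 1
        if d > 0:
            surv *= 1 - d / (n - i)  # n - i subjects are at risk at t
            steps.append((t, surv))
        i = j
    return steps


def rmst(times, events, tau):
    """Area under the Kaplan-Meier curve on [0, tau]."""
    area, last_t, last_s = 0.0, 0.0, 1.0
    for t, s in km_steps(times, events):
        if t >= tau:
            break
        area += last_s * (t - last_t)
        last_t, last_s = t, s
    return area + last_s * (tau - last_t)


def default_tau(times_a, times_b, fraction=0.9):
    """90% of the smaller of the two largest observed times."""
    return fraction * min(max(times_a), max(times_b))


# Hypothetical toy data: 1 = event observed, 0 = censored.
times_a, events_a = [2, 4, 6, 8, 10], [1, 1, 0, 1, 1]
times_b, events_b = [1, 3, 5, 7, 20], [1, 0, 1, 1, 1]
tau = default_tau(times_a, times_b)
print(tau, rmst(times_a, events_a, tau), rmst(times_b, events_b, tau))
```

The difference of the two group-wise RMST values is the effect measure on which an RMST-based test is built.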

The test proposed by Liu et al.18 aims to detect crossing survival curves based on the area

between the curves (ABC). It can capture the alternative of two crossing survival functions that

produce the same RMST. The test obtains its p-value by (group-wise) bootstrapping, which allows

different censoring distributions between groups. This test is shown to be more powerful than other

distance-based tests such as the modified Kolmogorov-Smirnov test32 and the generalized

Cramér-von Mises test33. Since the test statistic quantifies the difference in absolute value, it cannot be

used for superiority or non-inferiority testing.
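
Why an absolute-area statistic can detect what an RMST difference misses is shown by the following Python sketch. It integrates the signed and the absolute difference of two hypothetical step-shaped survival curves that cross and have identical RMST. The exact statistic and the bootstrap procedure of Liu et al. are not reproduced here; the functions and curves are assumptions made for illustration.

```python
def step_value(steps, t):
    """Value at time t of a right-continuous step function starting at 1."""
    value = 1.0
    for time, s in steps:
        if time > t:
            break
        value = s
    return value


def integrate_difference(steps_a, steps_b, tau, absolute):
    """Integral over [0, tau] of S_a - S_b, or of |S_a - S_b|."""
    knots = sorted({0.0, tau} | {t for t, _ in steps_a + steps_b if 0 < t < tau})
    total = 0.0
    for left, right in zip(knots, knots[1:]):
        diff = step_value(steps_a, left) - step_value(steps_b, left)
        total += (abs(diff) if absolute else diff) * (right - left)
    return total


# Hypothetical crossing curves: A drops early to 0.5, B drops later to 0.25.
curve_a = [(1.0, 0.5)]
curve_b = [(2.0, 0.25)]
tau = 4.0
rmst_difference = integrate_difference(curve_a, curve_b, tau, absolute=False)
area_between = integrate_difference(curve_a, curve_b, tau, absolute=True)
print(rmst_difference, area_between)
```

In this example the signed integral, which equals the RMST difference, vanishes, while the area between the curves is strictly positive.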

Simulation study

To evaluate the performance of the presented methods, we employed extensive Monte Carlo

simulations for different scenarios and settings. We simulated data for two groups under exponential,

Weibull, Gompertz and log-normal distributions. Thus, we follow well-established recommendations

on the choice of survival distributions for simulation studies.34

Scenarios

We considered four null scenarios, each with a different distribution function. For alternatives, we

considered (i) four scenarios with proportional hazards, (ii) four scenarios with non-proportional

and non-crossing hazards, and (iii) eight scenarios with crossing hazards. The concrete survival

and hazard functions can be found in the Supplement, see Tables S2-S6 therein. For each scenario

we vary the group sizes (from 20 to 100), the censoring rates (from 0% to 60%) and the censoring

distributions (uniform, exponential) as listed in Table S1 in the Supplement. Thus, we studied

20 (scenarios) × 5 (sample sizes) × 4 (censoring rates) × 2 (censoring distributions) = 800 different settings.

We list three exemplary scenarios in Table 1.

For each setting 5,000 replications were performed. Throughout, we set the nominal size to be

0.05. The actual type-I error and power were estimated by the rejection rates. For 10 out of 800

settings (all with small sample sizes) the MC test failed to provide a result in less than 0.5% of

the replications. In these cases, the mean rejection rate was thus computed from a slightly smaller

number of observed results. Throughout, we used R 4.0.0 35 for all simulations.
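
To give a sense of the precision of such estimated rejection rates, the following Python sketch computes the Monte Carlo standard error of a rejection rate under a binomial model. The function name and the normal-approximation interval are illustrative assumptions and are not part of the reported analysis.

```python
import math


def mc_standard_error(rate, replications):
    """Binomial standard error of an estimated rejection rate."""
    return math.sqrt(rate * (1 - rate) / replications)


# Nominal level 0.05 with 5,000 replications, as in the simulation design.
se = mc_standard_error(0.05, 5000)
lower, upper = 0.05 - 1.96 * se, 0.05 + 1.96 * se
print(f"se={se:.4f}, approximate 95% range=({lower:.4f}, {upper:.4f})")
```

Under this approximation, empirical type-I error rates of a test that exactly keeps the level would mostly fall within about 0.006 of the nominal 0.05.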

Implementation details

The LR as well as the PP can be called in R using the function survdiff from the survival 36

package. The concrete execution depends on the choice of rho (rho = 0 for LR and 1 for PP).

The R package TSHRC 37 contains the implementation of the TS test via the function twostage.

The mdir test is included in the R package mdir.logrank 38. Below, we refer to the test as mdir-x, where

‘x’ stands for the number of weights considered. For the MC we use the weights proposed by Lin

et al.10 and its implementation in the R package nphsim 39. The KONP is implemented in the

R package KONPsurv 40. The package provides tests based on two different test statistics, namely

the log-likelihood ratio and the chi-squared test statistic. Since the authors did not detect any

difference in performance we only consider the chi-squared test statistic (KONP). An RMST-based

test for two group comparisons is given in the R package survRM2 41. The function used here is

rmst2, where we need to define a truncation time tau. The published R code for the ABC test is

provided on GitHub (https://github.com/LTTGH/RBT4TCSC). For both tests, τ was set to 90%

of the minimum of the largest censored or uncensored event-time in the two groups.31
