1 Introduction
Cluster analysis is ubiquitous in medical research (see McLachlan [1992] for a comprehensive overview), where it is used for data classification, data exploration, and hypothesis generation [Xu and Wunsch, 2008]. Clustering works by grouping homogeneous observations into disjoint subgroups, or clusters. When multivariate data are clustered, it is common to seek to identify which variables distinguish two or more of the estimated clusters, in order to interpret the clustering structure and characterise the groups of observations and how they differ from each other.
Despite the widespread use of clustering, Hennig et al. [2015] state that there is no commonly accepted, formal definition of what clusters are. In fact, the definition of what a cluster should be varies depending on the context and the specifics of the analysis. Here we use the definition from Everitt and Hothorn [2006], which includes only two criteria: i) homogeneity of observations within a cluster, and ii) separability of observations between two different clusters. These two criteria are general enough to encompass the majority of working definitions of clusters. Both can be quantified using various approaches such as distances or similarity metrics, the shape of the distribution [Steinbach et al., 2004], multimodality [Kalogeratos and Likas, 2012, Siffer et al., 2018], or distributional assumptions [Liu et al., 2008, Kimes et al., 2017].
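As a purely illustrative sketch of how these two criteria can be quantified with distances (the data, sample sizes, and library choices below are our own and are not taken from this work), the following Python code computes a within-cluster homogeneity measure and a between-cluster separation measure from simulated two-dimensional data:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))  # hypothetical cluster 1
X2 = rng.normal(loc=5.0, scale=1.0, size=(50, 2))  # hypothetical cluster 2

# Criterion i): homogeneity, measured here as the mean pairwise Euclidean
# distance among observations of the same cluster.
within = np.mean([cdist(X, X)[np.triu_indices(len(X), k=1)].mean() for X in (X1, X2)])

# Criterion ii): separability, measured here as the mean Euclidean distance
# between observations belonging to two different clusters.
between = cdist(X1, X2).mean()

print(f"within-cluster homogeneity: {within:.2f}, between-cluster separation: {between:.2f}")

A small within-cluster distance combined with a large between-cluster distance is consistent with the two criteria above; many other quantifications (e.g. silhouette scores or model-based criteria) could serve the same purpose.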
While clustering is a multivariate methodology that takes all variables into account, only a subset of the variables can be expected to differentiate two particular clusters (i.e. separate their observations, according to the second criterion of our definition above). This question of which variables separate clusters of individuals is particularly relevant for high-dimensional data such as omics data [Ntranos et al., 2019, Vandenbon and Diez, 2020]. The current practice to identify such variables is often based on post-clustering hypothesis testing. It leads to a two-step pipeline (a first step of clustering and a second step of inference) that actually tests data-driven hypotheses, a process sometimes referred to as “double dipping” [Kriegeskorte et al., 2009]. This approach does not control the type I error rate when testing for differences between clusters. Indeed, it is always possible to cluster the data with a clustering method, even if there is no real process separating groups of observations. In this case, the clustering method artificially enforces differences between the observations by dividing them into different clusters, and the significant differences between clusters identified during the inference step could merely be an artifact of the previous clustering step. To illustrate this phenomenon, we consider data generated from a univariate Gaussian distribution with mean 0 and variance 1 (Figure 1, panel A). Two clusters can be built, e.g., using hierarchical clustering with Ward’s method and the Euclidean distance (Figure 1, panel B). These two estimated clusters are not truly separated, since all observations come from the same Gaussian distribution. One way to infer their separation is to test for a mean shift between them, for example using the classical t-test. Since there is no real process separating these two clusters, the resulting p-values should be uniformly distributed. However, over 2000 simulations of the data, the p-values of the t-test are far too small, leading to false positives (Figure 1, panel B); a minimal simulation sketch of this example is given below. This simple example illustrates how it is possible to infer a separation of two clusters even when this separation is not explained by any real process in the data. Classical inference requires an a priori hypothesis; in this toy example, the hypothesis, i.e. the lack of separation of the two clusters, is based on clusters derived from the data themselves. Moreover, by clustering the observations we force differences between the resulting groups, so the clustering does not represent the true structure of the data. The discoveries are thus only the result of the clustering algorithm and of this double use of the data, not of a true biological signal. For example, in the context of RNA-seq data analysis, accounting for the clustering step during inference is one of the open problems among the eleven grand challenges in single-cell data science identified by Lähnemann et al. [2020].
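The toy example above can be reproduced with a few lines of code. The sketch below is our own illustration; the per-simulation sample size is a hypothetical choice and the exact simulation settings of Figure 1 may differ. It draws univariate N(0, 1) data, builds two clusters with Ward hierarchical clustering on Euclidean distances, and then applies a naive t-test between the two estimated clusters:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_obs, n_sim = 100, 2000          # 100 observations per run is an illustrative choice
pvals = np.empty(n_sim)

for s in range(n_sim):
    x = rng.normal(0.0, 1.0, size=n_obs)             # no true cluster structure
    Z = linkage(x.reshape(-1, 1), method="ward")     # Ward's method, Euclidean distance
    labels = fcluster(Z, t=2, criterion="maxclust")  # force two clusters
    pvals[s] = ttest_ind(x[labels == 1], x[labels == 2]).pvalue

# Under a valid test the p-values would be uniform on [0, 1]; here they
# concentrate near zero, so the naive two-step pipeline yields false positives.
print("proportion of p-values below 0.05:", np.mean(pvals < 0.05))

Running such a simulation gives a rejection rate far above the nominal 5% level, which is the anti-conservative behaviour illustrated in Figure 1.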
Our goal is to propose new methods for post-clustering inference that take into account the clustering step and the potential artificial differences it may introduce. For any clustering method that can be applied to all features of the data to build clusters, we are interested in testing the null hypothesis that a particular feature does not truly separate two of the estimated clusters. In particular, this null hypothesis covers two situations: the feature i) is not involved in the separation of the two subgroups and is not affected by the clustering step, or ii) is only involved in this separation because the clustering method applied to the data forced differences.
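One possible way to formalize such a null hypothesis, written here in an illustrative notation that is not necessarily the one adopted later in this work, is the following: for a feature $j$ and two clusters $\hat{C}_k$ and $\hat{C}_l$ estimated from the data,
\[
H_0^{\{k,l\}, j} : \; \mu_{j, \hat{C}_k} = \mu_{j, \hat{C}_l},
\]
where $\mu_{j, \hat{C}_k}$ denotes the mean of feature $j$ among the observations assigned to $\hat{C}_k$. Because $\hat{C}_k$ and $\hat{C}_l$ are themselves functions of the data, a classical test of this hypothesis is not valid unless the clustering step is accounted for.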
Recently, some methodological work has been done on post-clustering inference. Since the data are used twice, many of these methods rely on selective inference [Tibshirani et al., 2016, Lee et al., 2016] to account for the clustering step. Selective inference aims to control the selective type I error. This is defined as the probability