Post-clustering difference testing: valid inference and practical
considerations
Benjamin Hivert*,1,2,3, Denis Agniel4, Rodolphe Thiébaut1,2,3,5, and Boris P Hejblum1,2,3
1Univ. Bordeaux, Inserm Bordeaux Population Health Research Center, SISTM team,
UMR 1219, Bordeaux F33076, France
2INRIA Bordeaux Sud-Ouest, SISTM team, Talence F-33400, France
3Vaccine Research Institute, VRI, Hôpital Henri Mondor, Créteil F-94000, France
4RAND Corporation, Santa Monica, CA 90401, USA
5CHU Pellegrin, Groupe Hospitalier Pellegrin, Bordeaux F-33076, France
October 25, 2022
Abstract
Clustering is an unsupervised analysis method that groups samples into homogeneous and well-separated subgroups of observations, called clusters. To interpret the clusters, statistical hypothesis testing is often used to infer which variables significantly separate the estimated clusters from each other. However, because these hypotheses are derived from the clustering results, the inference process relies on data-driven hypotheses. This double use of the data leads traditional hypothesis tests to fail to control the Type I error rate, in particular because of the uncertainty in the clustering process and the potential artificial differences it can create. We propose three novel statistical hypothesis tests that account for the clustering process. Our tests efficiently control the Type I error rate by identifying only those variables that contain a true signal separating groups of observations.
Key words: Clustering, hypothesis testing, double-dipping, circular analysis, selective inference,
multimodality test, Dip Test
Corresponding author: benjamin.hivert@u-bordeaux.fr
1 Introduction
Cluster analysis is ubiquitous in medical research (see McLachlan [1992] for a comprehensive overview), where it is used for data classification, data exploration, and hypothesis generation [Xu and Wunsch, 2008]. Clustering works by grouping homogeneous observations into disjoint subgroups, or clusters. When multivariate data are clustered, it is common to seek to identify which variables distinguish two or more of the estimated clusters, in order to interpret the clustering structure and characterise the groups of observations and how they differ from each other.
Despite the widespread use of clustering, Hennig et al. [2015] state that there is no commonly accepted and formal definition of what clusters are. In fact, the definition of what a cluster should be varies depending
on the context and the analysis specifics. Here we will use the definition from Everitt and Hothorn [2006],
which includes only two criteria: i) homogeneity of observations within a cluster and ii) separability of
observations between two different clusters. These two criteria are general enough to encompass the majority
of the working definitions of clusters. Both can be quantified using various approaches such as distance or similarity metrics, shape of distribution [Steinbach et al., 2004], multimodality [Kalogeratos and Likas, 2012, Siffer et al., 2018], or distributional assumptions [Liu et al., 2008, Kimes et al., 2017].
While clustering is a multivariate methodology that takes all variables into account, only a subset of the variables can typically be expected to differentiate two particular clusters (i.e. separate their observations, according to the second criterion of our definition above). This question of which variables separate clusters of individuals is particularly relevant for high-dimensional data such as omics data [Ntranos et al., 2019, Vandenbon and Diez, 2020]. The current practice to identify such variables is often based on post-clustering hypothesis testing. This leads to a two-step pipeline (a first step of clustering and a second step of inference) that actually tests data-driven hypotheses, in a process sometimes referred to as "double dipping" [Kriegeskorte et al., 2009]. This approach fails to control the type I error rate when testing for differences between clusters. In fact, it is always possible to cluster the data using a clustering method, even if there is no real process separating groups of observations. In this case, the clustering method artificially enforces differences between the observations by dividing them into different clusters. The significant differences between clusters identified during the inference process could then just be an artifact of the previous clustering
step. To illustrate this phenomenon, we consider data generated from a univariate Gaussian distribution with mean 0 and variance 1 (Figure 1 panel A). Two clusters can be built, e.g., using hierarchical clustering with Ward's method and Euclidean distance (Figure 1 panel B). These two estimated clusters are not truly separated, since all observations come from the same Gaussian distribution. One way to infer their separation is to test for a mean shift between them, for example using the classical t-test. Since there is no real process separating these two clusters, the resulting p-values should be uniformly distributed. However, when we look at the p-values of the t-test over 2,000 simulations of the data, the resulting p-values are far too small, leading to false positives (Figure 1 panel C). This simple example illustrates how it is possible to infer a separation of two clusters even when this separation is not explained by any real process in the data. Classical inference requires an a priori hypothesis; in this toy example, the hypothesis, i.e. the lack of separation of the two clusters, is based on clusters derived from the data themselves. Moreover, here we force differences between groups of observations by clustering them, so the clustering results do not represent the true structure of the data. Because of this double use of the data and the artificial structure forced by the clustering, the resulting discoveries reflect the clustering algorithm rather than a true biological signal. For example, in the context of RNA-seq data analysis, accounting for this clustering step during inference is one of the eleven grand challenges in single-cell data science identified by Lähnemann et al. [2020]. A minimal simulation reproducing this pitfall is sketched below.
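The following minimal Python sketch reproduces this toy experiment (an illustrative reimplementation, not the authors' code; the use of scipy and the fixed seed are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed, for reproducibility of the sketch
n_sim, n_obs = 2000, 200
pvals = np.empty(n_sim)
for s in range(n_sim):
    # a single N(0, 1) sample: there is no true cluster structure
    x = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    # hierarchical clustering (Ward's method, Euclidean distance), cut into 2 clusters
    labels = fcluster(linkage(x.reshape(-1, 1), method="ward"), t=2, criterion="maxclust")
    # naive two-sample t-test comparing the two estimated clusters
    pvals[s] = ttest_ind(x[labels == 1], x[labels == 2], equal_var=True).pvalue

# a valid test would reject in ~5% of simulations at level 0.05
print(f"empirical type I error at 5%: {np.mean(pvals < 0.05):.3f}")
```

With a valid test, the printed rate would be close to 0.05; the naive post-clustering t-test rejects far more often, as shown in Figure 1 panel C.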
Our goal is to propose new methods for post-clustering inference that take into account the clustering step and the potential artificial differences it may introduce. For any clustering method that can be applied to all features of the data to build clusters, we are interested in testing the null hypothesis that a particular feature does not truly separate two of the estimated clusters. In particular, this null hypothesis covers a feature that: i) is not involved in the separation of the two subgroups and is not affected by the clustering step, or ii) is only involved in this separation because the clustering method applied to the data forced differences.

Recently, some methodological work has been done on post-clustering inference. Since the data are used twice, many of these methods rely on selective inference [Tibshirani et al., 2016, Lee et al., 2016] to account for the clustering step. Selective inference aims to control the selective type I error.
[Figure 1 here: panel A shows the density of X1; panel B shows the density of X1 coloured by the two estimated clusters; panel C shows the ECDF of −log(p-values) and the p-value distributions for the Gao et al. test, the t-test, and the uniform reference.]
Figure 1: Artificial differences created by clustering. Panel A: data generated according to 200 realisations of a Gaussian distribution with mean 0 and variance 1. Panel B: hierarchical clustering with Ward's method and Euclidean distance is applied to build two clusters. Panel C: t-test p-values and p-values given by the test proposed by Gao et al. [2022] for separating the two estimated clusters. The uniform distribution is also shown for comparison.
The selective type I error is defined as the probability, under the null, of rejecting the null hypothesis given that the model and the null hypothesis have been selected using the data. When data splitting is not possible, Fithian et al. [2014] proposed to condition on this selection event during statistical hypothesis testing. In this approach, two sources of information are used: the data themselves, to construct the model and the hypothesis, and the distribution of the data conditional on this selection (i.e. the information not yet exploited), to perform the test. This leads to statistical hypothesis tests that efficiently control the selective type I error. Selective inference was first proposed for linear regression, then for change-point detection [Jewell et al., 2019], and more recently for tree regression [Neufeld et al., 2021]. Clustering is also a framework
in which selective inference has recently been applied. For post-clustering inference on RNA-seq data, Zhang et al. [2019] developed a truncated-normal statistic that uses selective inference and leads to valid p-values under their null hypothesis of no differential expression. However, in addition to selective inference, they use data splitting, which is only possible when the number of observations is large enough. They also use a supervised approach to predict, on the remaining half of the data, the partition formed on the first half. Instead of conditioning on the clustering event in their statistical hypothesis test, they condition on the fact that the labels of the observations in the remaining half are predicted by this supervised approach. More recently, Gao et al. [2022] developed a multivariate selective test to investigate whether two estimated clusters are truly separated or whether the observations they contain come from a single cluster. By using selective inference, they account for the clustering step. Their approach is suitable for cluster validation because their null hypothesis is the equality of two cluster centers. This method also leads to valid p-values under the null hypothesis (Figure 1 panel C). However, this method is not suitable for our purpose, since in our context the goal is to study the separation of two clusters at the feature level, i.e., in a univariate setting.
In this paper, we introduce three new methods for post-clustering inference. First, we adapt the method proposed by Gao et al. [2022] to univariate hypotheses, in order to investigate whether individual features contain information about the group (clustering) structure. In doing so, we use a data-driven but fixed clustering of the data to preserve interpretability. To deal with the case of more than two clusters, we also present an extension of this first test based on an aggregation of its p-values. Second, we propose another approach using a test of multimodality that accounts for the clustering step by investigating the presence of a continuum in the distribution of the variable. The paper proceeds as follows. In the Methods section, we describe the methods we propose for post-clustering inference. These approaches are then evaluated and compared in the Results section using extensive numerical simulations and a real ecological dataset. Some final comments can be found in the Discussion section.
2 Methods
In the following, let $X$ be an $n \times p$ random matrix of $n$ observations of $p$ features, with $g$th column $X_g$. On $X$ we apply a clustering method $c(\cdot)$ to create $c(X)$, a partition of the $n$ observations into $K$ disjoint clusters $C_1, \dots, C_K$. We are interested in the ability of a given variable $X_g$ to separate two clusters $C_k$ and $C_l$ estimated using all the information contained in $X$ with the clustering method $c(\cdot)$.
2.1 Selective test
To develop our statistical hypothesis tests, we first specify a generative model for the observations along $X_g$. We assume that each of the $n$ observations of $X_g$ comes from an independent Gaussian distribution with unknown mean $\mu_{gi}$ and known variance $\sigma_g^2$. Then, for all $i \in \{1, \dots, n\}$, $X_{gi} \sim \mathcal{N}(\mu_{gi}, \sigma_g^2)$. Because of the independence between the $X_{gi}$, the distribution of $X_g$ is the multivariate Gaussian $\mathcal{N}_n(\mu_g, \sigma_g^2 I_n)$ with mean $\mu_g = (\mu_{g1}, \dots, \mu_{gn})^t$ and covariance matrix $\Sigma = \sigma_g^2 I_n$. Let $x_g$ be the realisation of $X_g$ observed in $X$. Now, for a cluster $C_k$, let

$$\mu_g^{C_k} = \frac{1}{|C_k|} \sum_{i \in C_k} \mu_{gi} \quad \text{and} \quad \bar{X}_g^{C_k} = \frac{1}{|C_k|} \sum_{i \in C_k} X_{gi}$$

be the true mean and the empirical mean, respectively, of the variable $X_g$ in cluster $C_k$. Testing for a mean shift between two clusters is a straightforward way to evaluate their separation along $X_g$. Thus, we define the two following hypotheses:

$$H_0: \mu_g^{C_k} = \mu_g^{C_l} \quad \text{vs} \quad H_1: \mu_g^{C_k} \neq \mu_g^{C_l} \qquad (1)$$
By introducing a contrast vector $\eta \in \mathbb{R}^n$ defined by $\eta_i = \frac{\mathbb{1}_{\{i \in C_k\}}}{|C_k|} - \frac{\mathbb{1}_{\{i \in C_l\}}}{|C_l|}$ for $i = 1, \dots, n$, following Jewell et al. [2019] and Gao et al. [2022], we can rewrite (1) above as:

$$H_0: \mu_g^t \eta = 0 \quad \text{vs} \quad H_1: \mu_g^t \eta \neq 0 \qquad (2)$$
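Indeed, the equivalence between (1) and (2) follows directly from the definition of $\eta$:

$$\mu_g^t \eta = \sum_{i=1}^{n} \mu_{gi}\, \eta_i = \frac{1}{|C_k|} \sum_{i \in C_k} \mu_{gi} - \frac{1}{|C_l|} \sum_{i \in C_l} \mu_{gi} = \mu_g^{C_k} - \mu_g^{C_l},$$

so that $\mu_g^t \eta = 0$ holds if and only if $\mu_g^{C_k} = \mu_g^{C_l}$.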
$H_0$ in (2) is actually generated by a function of the data, $c(X)$, which clearly places us in the context of selective inference. Conditioning on this clustering event within statistical inference procedures is thus required. In particular, we derive an adaptation of the p-value proposed by Jewell et al. [2019] (originally intended for change-point detection) to our clustering purposes:

$$p_g^{C_k, C_l} = \mathbb{P}_{H_0}\left( |X_g^t \eta| > |x_g^t \eta| \;\middle|\; C_k, C_l \in c(X) \right) \qquad (3)$$

Here we condition on the estimation of $C_k$ and $C_l$ by $c(X)$, which leads to the definition of $H_0$, and the resulting p-value (3) accounts for the clustering as well as for the uncertainty associated with the estimation of these two clusters. $p_g^{C_k, C_l}$ quantifies the probability, under $H_0$ and given the observed clustering structure, that the mean difference between $C_k$ and $C_l$ is as large as the observed difference. Its calculation relies on all possible realisations of $X_g$ resulting in the same estimation of $C_k$ and $C_l$ when we apply $c(\cdot)$ to $X$. Yet, enumerating all such data sets is intractable. To get more tractable p-values, we follow Jewell et al. [2019] and Gao et al. [2022] in constraining the randomness in the random variable $X_g$, and we define our p-value as follows:

$$\tilde{p}_g^{C_k, C_l} = \mathbb{P}_{H_0}\left( |X_g^t \eta| > |x_g^t \eta| \;\middle|\; C_k, C_l \in c(X),\; \pi_\eta^\perp X_g = \pi_\eta^\perp x_g \right) \qquad (4)$$
where $\pi_\eta^\perp = I_n - \frac{\eta \eta^t}{\|\eta\|_2^2}$ restricts the random variable $X_g$ to the space defined by $\pi_\eta^\perp x_g$, leaving randomness only in the scalar $X_g^t \eta$, without losing control of the type I error [Gao et al., 2022]. The p-value (4) can be rewritten as (see Supplementary Materials for the proof):

$$\tilde{p}_g^{C_k, C_l} = \mathbb{P}_{H_0}\left( |\phi_g| > |x_g^t \eta| \;\middle|\; \phi_g \in S_g \right) \qquad (5)$$

where $S_g = \{\phi_g : C_k, C_l \in c(x'(\phi_g))\}$ is the set of perturbations of the $g$th variable of $X$ for which both $C_k$ and $C_l$ are conserved by $c(\cdot)$, and $\phi_g = X_g^t \eta \overset{H_0}{\sim} \mathcal{N}\left(0, \sigma_g^2 \|\eta\|_2^2\right)$. Here $x'(\phi_g)$ represents a perturbed version of the data $X$ in which only the $g$th variable is perturbed:

$$x'_g(\phi_g) = x_g - \frac{\eta \eta^t x_g}{\|\eta\|_2^2} + \frac{\eta\, \phi_g}{\|\eta\|_2^2}$$
This perturbation has a clear interpretation: if $|\phi_g| > |x_g^t \eta|$, the data from the two clusters are split further apart along $X_g$ than is observed in the data; whereas if $|\phi_g| < |x_g^t \eta|$, they are instead brought closer together along $X_g$ (and if $\phi_g = x_g^t \eta$, the data are actually not perturbed, because in this case $x'(\phi_g) = x$). Note that (5) can be rewritten as $\mathbb{P}_{H_0}\left( |\phi_g| > |x_g^t \eta|, \phi_g \in S_g \right) / \mathbb{P}_{H_0}\left( \phi_g \in S_g \right)$. So if $C_k$ and $C_l$ can only be preserved when the observations are perturbed further apart, then (5) will be large, since $\mathbb{P}_{H_0}\left( |\phi_g| > |x_g^t \eta|, \phi_g \in S_g \right) \simeq \mathbb{P}_{H_0}\left( \phi_g \in S_g \right)$. In conclusion, this selective test can be interpreted in terms of the separability of the two clusters considered (even though it is based on a difference in means), as it boils down to quantifying the possibility of bringing the observations from the two clusters closer together while preserving their separation.
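As a quick numerical sanity check of the perturbation formula above (an illustrative Python snippet under the same notation; the ten observations and the two equal-size clusters are made up for the example), one can verify that choosing $\phi_g = x_g^t \eta$ indeed leaves the data unperturbed:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)                      # one variable x_g for n = 10 observations
labels = np.array([1] * 5 + [2] * 5)         # two made-up clusters C_k and C_l
eta = (labels == 1) / 5 - (labels == 2) / 5  # contrast vector eta
phi = x @ eta                                # choose phi_g = x_g^t eta: no perturbation
x_pert = x - eta * (x @ eta) / (eta @ eta) + eta * phi / (eta @ eta)
assert np.allclose(x_pert, x)                # x'(x_g^t eta) = x_g, as claimed
```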
In order to explicitly describe the set $S_g$ while retaining as much generality as possible about $c(\cdot)$, we follow Gao et al. [2022] and use Monte-Carlo simulations to approximate $\tilde{p}_g^{C_k, C_l}$. This strategy relies on (5) being rewritten as:

$$\tilde{p}_g^{C_k, C_l} = \frac{\mathbb{E}\left[ \mathbb{1}\left\{ |\phi_g| > |x_g^t \eta|, \phi_g \in S_g \right\} \right]}{\mathbb{E}\left[ \mathbb{1}\left\{ \phi_g \in S_g \right\} \right]} \qquad (6)$$

Namely, we sample $\phi_g^1, \dots, \phi_g^N \overset{i.i.d.}{\sim} \mathcal{N}\left(0, \sigma_g^2 \|\eta\|_2^2\right)$ for some large value of $N$, and replace the expectations in (6) with sums over all samples. This Monte-Carlo procedure avoids the need to formally describe $S_g$. In order to enhance numerical efficiency, Gao et al. [2022] use an importance sampling approach, originally proposed by Yang et al. [2016], to improve the likelihood of preserving the clustering in the perturbed data. A sketch of the plain Monte-Carlo approximation is given below.
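Below is a hedged sketch of this Monte-Carlo approximation in Python, reusing the toy univariate setting from the Introduction (Ward hierarchical clustering, $K = 2$). The helpers `cluster_fn` and `clusters_preserved` are illustrative assumptions, not the authors' implementation, and the plain Monte-Carlo loop omits the importance-sampling refinement of Gao et al. [2022]:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_fn(x):
    """c(.): Ward hierarchical clustering of a univariate sample, cut into two clusters."""
    return fcluster(linkage(x.reshape(-1, 1), method="ward"), t=2, criterion="maxclust")

def clusters_preserved(new_labels, labels, k, l):
    """Illustrative check that C_k and C_l both reappear (up to label permutation)."""
    old = {frozenset(np.flatnonzero(labels == k)), frozenset(np.flatnonzero(labels == l))}
    new = {frozenset(np.flatnonzero(new_labels == c)) for c in np.unique(new_labels)}
    return old <= new

def selective_pvalue(x, labels, k, l, sigma, n_mc=2000, rng=None):
    """Plain Monte-Carlo approximation of (6) for one variable x (known sigma)."""
    rng = rng or np.random.default_rng()
    eta = (labels == k) / np.sum(labels == k) - (labels == l) / np.sum(labels == l)
    eta_norm2 = eta @ eta                     # ||eta||_2^2
    stat = abs(x @ eta)                       # observed |x_g^t eta|
    x_perp = x - eta * (x @ eta) / eta_norm2  # pi_eta^perp x_g, held fixed by conditioning
    phi = rng.normal(0.0, sigma * np.sqrt(eta_norm2), size=n_mc)  # phi ~ N(0, sigma^2 ||eta||^2)
    num = den = 0
    for p in phi:
        x_pert = x_perp + eta * p / eta_norm2  # perturbed g-th variable x'(phi_g)
        if clusters_preserved(cluster_fn(x_pert), labels, k, l):  # phi_g in S_g
            den += 1
            num += abs(p) > stat
    return num / den if den > 0 else float("nan")  # ratio of Monte-Carlo sums in (6)

# Usage on the toy data: selective p-value for the two estimated clusters.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
labels = cluster_fn(x)
print(selective_pvalue(x, labels, k=1, l=2, sigma=1.0, rng=rng))
```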