1 Introduction
Cluster analysis is ubiquitous in medical research (see McLachlan [1992] for a comprehensive overview), where it is used for data classification, data exploration, and hypothesis generation [Xu and Wunsch, 2008]. Clustering works by grouping homogeneous observations into disjoint subgroups, or clusters. When multivariate data are clustered, it is common to seek to identify which variables distinguish two or more of the estimated clusters, in order to interpret the clustering structure and characterise the groups of observations and how they differ from each other.
Despite the widespread use of clustering, Hennig et al. [2015] state that there is no commonly accepted, formal definition of what clusters are. In fact, the definition of what a cluster should be varies depending on the context and the specifics of the analysis. Here we use the definition from Everitt and Hothorn [2006], which includes only two criteria: i) homogeneity of observations within a cluster, and ii) separability of observations between two different clusters. These two criteria are general enough to encompass the majority of working definitions of clusters. Both can be quantified using various approaches such as distances or similarity metrics, the shape of the distribution [Steinbach et al., 2004], multimodality [Kalogeratos and Likas, 2012, Siffer et al., 2018], or distributional assumptions [Liu et al., 2008, Kimes et al., 2017].
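As a purely illustrative sketch of how these two criteria can be quantified with distances (the data, sample sizes, and library choices below are our own and are not taken from this work), the following Python code computes a within-cluster homogeneity measure and a between-cluster separation measure from simulated two-dimensional data:

import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, scale=1.0, size=(50, 2))  # hypothetical cluster 1
X2 = rng.normal(loc=5.0, scale=1.0, size=(50, 2))  # hypothetical cluster 2

# Criterion i): homogeneity, measured here as the mean pairwise Euclidean
# distance among observations of the same cluster.
within = np.mean([cdist(X, X)[np.triu_indices(len(X), k=1)].mean() for X in (X1, X2)])

# Criterion ii): separability, measured here as the mean Euclidean distance
# between observations belonging to two different clusters.
between = cdist(X1, X2).mean()

print(f"within-cluster homogeneity: {within:.2f}, between-cluster separation: {between:.2f}")

A small within-cluster distance combined with a large between-cluster distance is consistent with the two criteria above; many other quantifications (e.g. silhouette scores or model-based criteria) could serve the same purpose.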
While clustering is a multivariate methodology that takes all variables into account, only a subset of the variables can be expected to differentiate two particular clusters (i.e. separate their observations, according to the second criterion of our definition above). This question of which variables separate clusters of individuals is particularly relevant for high-dimensional data such as omics data [Ntranos et al., 2019, Vandenbon and Diez, 2020]. The current practice to identify such variables is often based on post-clustering hypothesis testing. It leads to a two-step pipeline (a first step of clustering and a second step of inference) that actually tests data-driven hypotheses, a process sometimes referred to as “double dipping” [Kriegeskorte et al., 2009]. This approach does not control the type I error rate when testing for differences between clusters. Indeed, it is always possible to cluster the data with a clustering method, even if there is no real process separating groups of observations. In this case, the clustering method artificially enforces differences between the observations by dividing them into different clusters, and the significant differences between clusters identified during the inference step could merely be an artifact of the previous clustering step. To illustrate this phenomenon, we consider data generated from a univariate Gaussian distribution with mean 0 and variance 1 (Figure 1, panel A). Two clusters can be built, e.g., using hierarchical clustering with Ward’s method and the Euclidean distance (Figure 1, panel B). These two estimated clusters are not truly separated, since all observations come from the same Gaussian distribution. One way to infer their separation is to test for a mean shift between them, for example using the classical t-test. Since there is no real process separating these two clusters, the resulting p-values should be uniformly distributed. However, over 2000 simulations of the data, the p-values of the t-test are far too small, leading to false positives (Figure 1, panel B); a minimal simulation sketch of this example is given below. This simple example illustrates how it is possible to infer a separation of two clusters even when this separation is not explained by any real process in the data. Classical inference requires an a priori hypothesis; in this toy example, the hypothesis, i.e. the lack of separation of the two clusters, is based on clusters derived from the data themselves. Moreover, by clustering the observations we force differences between the resulting groups, so the clustering does not represent the true structure of the data. The discoveries are thus only the result of the clustering algorithm and of this double use of the data, not of a true biological signal. For example, in the context of RNA-seq data analysis, accounting for the clustering step during inference is one of the open problems among the eleven grand challenges in single-cell data science identified by Lähnemann et al. [2020].
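The toy example above can be reproduced with a few lines of code. The sketch below is our own illustration; the per-simulation sample size is a hypothetical choice and the exact simulation settings of Figure 1 may differ. It draws univariate N(0, 1) data, builds two clusters with Ward hierarchical clustering on Euclidean distances, and then applies a naive t-test between the two estimated clusters:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_obs, n_sim = 100, 2000          # 100 observations per run is an illustrative choice
pvals = np.empty(n_sim)

for s in range(n_sim):
    x = rng.normal(0.0, 1.0, size=n_obs)             # no true cluster structure
    Z = linkage(x.reshape(-1, 1), method="ward")     # Ward's method, Euclidean distance
    labels = fcluster(Z, t=2, criterion="maxclust")  # force two clusters
    pvals[s] = ttest_ind(x[labels == 1], x[labels == 2]).pvalue

# Under a valid test the p-values would be uniform on [0, 1]; here they
# concentrate near zero, so the naive two-step pipeline yields false positives.
print("proportion of p-values below 0.05:", np.mean(pvals < 0.05))

Running such a simulation gives a rejection rate far above the nominal 5% level, which is the anti-conservative behaviour illustrated in Figure 1.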
Our goal is to propose new methods for post-clustering inference that take into account the clustering step and the potential artificial differences it may introduce. For any clustering method that can be applied to all features of the data to build clusters, we are interested in testing the null hypothesis that a particular feature does not truly separate two of the estimated clusters. In particular, this null hypothesis covers two situations: the feature i) is not involved in the separation of the two subgroups and is not affected by the clustering step, or ii) is only involved in this separation because the clustering method applied to the data forced differences.
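One possible way to formalize such a null hypothesis, written here in an illustrative notation that is not necessarily the one adopted later in this work, is the following: for a feature $j$ and two clusters $\hat{C}_k$ and $\hat{C}_l$ estimated from the data,
\[
H_0^{\{k,l\}, j} : \; \mu_{j, \hat{C}_k} = \mu_{j, \hat{C}_l},
\]
where $\mu_{j, \hat{C}_k}$ denotes the mean of feature $j$ among the observations assigned to $\hat{C}_k$. Because $\hat{C}_k$ and $\hat{C}_l$ are themselves functions of the data, a classical test of this hypothesis is not valid unless the clustering step is accounted for.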
Recently, some methodological work has been done on post-clustering inference. Since the data are used twice, many of these methods rely on selective inference [Tibshirani et al., 2016, Lee et al., 2016] to account for the clustering step. Selective inference aims to control the selective type I error. This is defined as the probability