algebraic topological construction gives rise to a complex geometry, which hinders their direct applicability to
classical statistical and machine learning methods. The development of techniques to use persistent homology
in statistics and machine learning has been an active area of research in the TDA community. Largely, the
techniques entail either modifying the structure of persistence diagrams to take the form of vectors (referred
to as embedding or vectorization) so that existing statistical and machine learning methodology may be
applied directly (e.g., Bubenik, 2015a; Adams et al., 2017a); or developing new statistical and machine
learning methods to accommodate the algebraic structure of persistence diagrams (e.g., Reininghaus et al.,
2015). The former approach is more widely used in applications, while the latter approach tends to be more
technically challenging and requires taking into account the intricate geometry of persistence diagrams and
the space of all persistence diagrams. Our work studies both approaches experimentally and contributes a
new theoretical result in the latter setting.
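As a rough illustration of the embedding approach, the sketch below maps a diagram to a fixed-length vector in the style of a persistence image: each point contributes a persistence-weighted Gaussian bump on a grid over birth-persistence coordinates. The grid resolution, bandwidth, and the assumption that coordinates lie in [0, 1] are illustrative choices for this sketch, not the settings of any cited method.

```python
import numpy as np

def vectorize_diagram(diagram, grid_size=8, sigma=0.1):
    """Crude persistence-image-style embedding: each (birth, death)
    point contributes a persistence-weighted Gaussian bump on a grid
    over (birth, persistence) coordinates."""
    diagram = np.asarray(diagram, dtype=float)
    births = diagram[:, 0]
    pers = diagram[:, 1] - diagram[:, 0]      # persistence = death - birth
    xs = np.linspace(0.0, 1.0, grid_size)     # assumes coordinates scaled to [0, 1]
    gx, gy = np.meshgrid(xs, xs)
    img = np.zeros((grid_size, grid_size))
    for b, p in zip(births, pers):
        img += p * np.exp(-((gx - b) ** 2 + (gy - p) ** 2) / (2 * sigma ** 2))
    return img.ravel()                        # fixed-length vector for ML pipelines

# Example: a diagram with two topological features.
vec = vectorize_diagram([(0.1, 0.4), (0.2, 0.9)])
print(vec.shape)  # (64,)
```

Because the output is a fixed-length vector regardless of the number of diagram points, standard Euclidean methods such as k-means apply to it directly.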
In machine learning, a fundamental task is clustering, which groups observations of a similar nature
to determine structure within the data. The k-means clustering algorithm is one of the most well-known
clustering methods. For structured data in linear spaces, the k-means algorithm is widely used, with well-
established theory on stability and convergence. However, for unstructured data in nonlinear spaces, such
as persistence diagrams, it is a nontrivial task to apply the k-means framework. The key challenges are
twofold: (i) how to define the mean of data in a nonlinear space, and (ii) how to compute the distance
between the data points. These challenges are especially significant in the task of obtaining convergence
guarantees for the algorithm. For persistence diagrams specifically, the cost function is neither differentiable nor
subdifferentiable in the traditional sense because of the complicated metric geometry of the space of diagrams.
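The distance in question is typically a Wasserstein distance between diagrams, under which points may be matched either to points of the other diagram or to the diagonal. A minimal sketch of the 2-Wasserstein distance follows, using SciPy's Hungarian solver on a cost matrix augmented with diagonal slots; it is a reference sketch rather than an optimized implementation, and not necessarily the code used in any experiment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein2(dgm1, dgm2):
    """2-Wasserstein distance between two persistence diagrams.

    Each diagram is padded with "diagonal" slots so that every point
    may be matched either to a point of the other diagram or to its
    projection onto the diagonal; diagonal-to-diagonal matches cost 0."""
    d1 = np.asarray(dgm1, dtype=float).reshape(-1, 2)
    d2 = np.asarray(dgm2, dtype=float).reshape(-1, 2)
    n1, n2 = len(d1), len(d2)
    # Squared Euclidean cost of sending a point (b, d) to the diagonal.
    to_diag1 = (d1[:, 1] - d1[:, 0]) ** 2 / 2.0
    to_diag2 = (d2[:, 1] - d2[:, 0]) ** 2 / 2.0
    cost = np.zeros((n1 + n2, n2 + n1))
    cost[:n1, :n2] = ((d1[:, None, :] - d2[None, :, :]) ** 2).sum(axis=2)
    cost[:n1, n2:] = to_diag1[:, None]   # d1 point -> diagonal
    cost[n1:, :n2] = to_diag2[None, :]   # diagonal -> d2 point
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].sum()))

# Example: identical diagrams are at distance zero.
print(wasserstein2([(0.0, 1.0)], [(0.0, 1.0)]))  # 0.0
```

Note that the number of points varies from diagram to diagram, which is precisely why an optimal-matching distance, rather than a coordinatewise one, is needed.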
In this paper, we study the k-means algorithm on the space of persistence diagrams both theoretically and
experimentally. Theoretically, we establish convergence of the k-means algorithm. In particular, we provide a
characterization of the solution to the optimization problem in the Karush–Kuhn–Tucker (KKT) framework
and show that the solution is a partial optimal point, a KKT point, and a local minimum. Experimentally,
we study various representations of persistent homology in the k-means algorithm: on persistence diagram
embeddings, on persistence diagrams themselves, and on their generalizations, persistence measures.
To implement k-means clustering on persistence diagram and persistence measure spaces, we reinterpret
the algorithm, taking into account the appropriate metrics on these spaces as well as the constructions of their
respective means. Broadly, our work contributes to the growing body of increasingly important and relevant
non-Euclidean statistics (e.g., Blanchard & Jaffe, 2022) that deals with, for example, spaces of matrices
under nonlinear constraints (Dryden et al., 2009); quotients of Euclidean spaces defined by group actions
(Le & Kume, 2000); and Riemannian manifolds (Miolane et al., 2020).
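To make the reinterpreted Lloyd iteration concrete, the sketch below runs it over a precomputed matrix of pairwise distances (such as Wasserstein distances between diagrams). Because exact means in a general metric space require specialized algorithms, this illustrative sketch substitutes a medoid update for the mean update; the function name `kmedoids` and this substitution are choices made for the sketch, not the construction studied in the paper.

```python
import numpy as np

def kmedoids(dist, k, n_iter=50, seed=0):
    """Lloyd-style clustering from a precomputed matrix of pairwise
    distances.  The Euclidean centroid update of k-means is replaced
    by a medoid update, a crude stand-in for the mean when the data
    live in a general metric space."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)      # assignment step
        new_medoids = medoids.copy()
        for j in range(k):                                # update step
            members = np.flatnonzero(labels == j)
            if members.size:
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1), medoids

# Example: two well-separated groups recovered from distances alone.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
labels, _ = kmedoids(np.abs(x[:, None] - x[None, :]), k=2)
print(labels)
```

The assignment step is exactly that of k-means; only the notion of cluster center changes, which is the point at which the geometry of diagram space enters the analysis.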
The remainder of this paper is organized as follows. We end this introductory section with an overview of
existing literature relevant to our study. Section 2 provides the background on both persistent homology and
k-means clustering. We give details on the various components of persistent homology needed to implement
the k-means clustering algorithm. In Section 3, we provide our main result on the convergence of the k-means
algorithm in persistence diagram space in the KKT framework. Section 4 presents results of numerical
experiments comparing the performance of k-means clustering on embedded persistence diagrams versus
persistence diagrams and persistence measures. We close with a discussion on our contributions and ideas
for future research in Section 5.
Related Work A previous study of k-means clustering in persistence diagram space compares its
performance to that of other classification algorithms; it also establishes local minimality of the solution to the
optimization problem in the convergence of the algorithm (Marchese et al., 2017). Our convergence study
extends this previous work to the more general KKT framework, which provides a more detailed convergence
analysis, and differs experimentally by studying the performance of k-means clustering specifically
in the context of persistent homology on simulated data as well as a benchmark dataset.
Clustering and TDA more generally have also been surveyed recently, with an explanation of how topological
concepts may apply to the problem of clustering (Panagopoulos, 2022). Other work studies topological data
analytic clustering for time series data as well as space-time data (Islambekov & Gel, 2019; Majumdar &
Laha, 2020).