k-Means Clustering for Persistent Homology
Yueqi Cao1,†, Prudence Leung1,†, and Anthea Monod1,†
1 Department of Mathematics, Imperial College London, UK
† Corresponding e-mail: y.cao21@imperial.ac.uk
Abstract
Persistent homology is a methodology central to topological data analysis that extracts and summarizes
the topological features within a dataset as a persistence diagram; it has recently gained much popularity
from its myriad successful applications to many domains. However, its algebraic construction induces a
metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence
of the k-means clustering algorithm on persistence diagram space and establish theoretical properties of
the solution to the optimization problem in the Karush–Kuhn–Tucker framework. Additionally, we perform
numerical experiments on various representations of persistent homology, including embeddings of persistence
diagrams as well as diagrams themselves and their generalizations as persistence measures; we find that k-means clustering performed directly on persistence diagrams and measures outperforms clustering on their vectorized representations.
Keywords: Alexandrov geometry; Karush–Kuhn–Tucker optimization; k-means clustering; persistence
diagrams; persistent homology.
1 Introduction
Topological data analysis (TDA) is a recently-emerged field of data science that lies at the intersection of
pure mathematics and statistics. The core approach of the field is to adapt classical algebraic topology to the
computational setting for data analysis. The roots of this approach first appeared in the early 1990s in work by Frosini (1992) on image analysis and pattern recognition in computer vision (Verri et al., 1993); soon afterwards,
these earlier ideas by Frosini & Landi (2001) were reinterpreted, developed, and extended concurrently by
Edelsbrunner et al. (2002) and Zomorodian & Carlsson (2005). These developments laid the foundations for
a comprehensive set of computational topological tools which have then enjoyed great success in applications
to a wide range of data settings in recent decades. Some examples include sensor network coverage problems
(de Silva & Ghrist, 2007); biological and biomedical applications, including understanding the structure of
protein data (Gameiro et al., 2014; Kovacev-Nikolic et al., 2016; Xia et al., 2016) and the 3D structure of DNA (Emmett et al., 2015), and predicting disease outcome in cancer (Crawford et al., 2020); robotics and the problem of goal-directed path planning (Bhattacharya et al., 2015), as well as other general planning and motion problems, such as goal-directed path planning (Pokorny et al., 2016) and determining the cost of bipedal walking (Vasudevan et al., 2013); materials science to understand the structure of amorphous solids (Hiraoka et al., 2016); characterizing chemical reactions and chemical structure (Murayama et al., 2023); financial market prediction (Ismail et al., 2020); quantifying the structure of music (Bergomi & Baratè, 2020); among many others (e.g., Otter et al., 2017).
Persistent homology is the most popular methodology from TDA due to its intuitive construction and
interpretability in applications; it extracts the topological features of data and computes a robust statistical
representation of the data under study. A particular advantage of persistent homology is that it is applicable
to very general data structures, such as point clouds, functions, and networks. Notice that among these
examples, the data may be endowed with only a notion of relatedness (point clouds) or may not even be a
metric space at all (networks).
A major limitation of persistent homology, however, is that its output takes a somewhat unusual form of
a collection of intervals; more specifically, it is a multiset of half-open intervals together with the set of zero-
length intervals with infinite multiplicity, known as a persistence diagram. Statistical methods for interval-
valued data (e.g., Billard & Diday, 2000) cannot be used to analyze persistence diagrams because their
algebraic topological construction gives rise to a complex geometry, which hinders their direct applicability to
classical statistical and machine learning methods. The development of techniques to use persistent homology
in statistics and machine learning has been an active area of research in the TDA community. Largely, the
techniques entail either modifying the structure of persistence diagrams to take the form of vectors (referred
to as embedding or vectorization) so that existing statistical and machine learning methodology may be
applied directly (e.g., Bubenik, 2015a; Adams et al., 2017a); or developing new statistical and machine
learning methods to accommodate the algebraic structure of persistence diagrams (e.g., Reininghaus et al.,
2015). The former approach is more widely used in applications, while the latter approach tends to be more
technically challenging and requires taking into account the intricate geometry of persistence diagrams and
the space of all persistence diagrams. Our work studies both approaches experimentally and contributes a
new theoretical result in the latter setting.
In machine learning, a fundamental task is clustering, which groups observations of a similar nature
to determine structure within the data. The k-means clustering algorithm is one of the most well-known
clustering methods. For structured data in linear spaces, the k-means algorithm is widely used, with well-
established theory on stability and convergence. However, for unstructured data in nonlinear spaces, such
as persistence diagrams, it is a nontrivial task to apply the k-means framework. The key challenges are
twofold: (i) how to define the mean of data in a nonlinear space, and (ii) how to compute the distance
between the data points. These challenges are especially significant in the task of obtaining convergence
guarantees for the algorithm. For persistence diagrams specifically, the cost function is neither differentiable nor subdifferentiable in the traditional sense because of their complicated metric geometry.
In this paper, we study the k-means algorithm on the space of persistence diagrams both theoretically and
experimentally. Theoretically, we establish convergence of the k-means algorithm. In particular, we provide a
characterization of the solution to the optimization problem in the Karush–Kuhn–Tucker (KKT) framework
and show that the solution is a partial optimal point, KKT point, as well as a local minimum. Experimentally,
we study various representations of persistent homology in the k-means algorithm: persistence diagram embeddings, persistence diagrams themselves, as well as their generalizations as persistence measures.
To implement k-means clustering on persistence diagram and persistence measure space, we reinterpret
the algorithm taking into account the appropriate metrics on these spaces as well as constructions of their
respective means. Broadly, our work contributes to the growing body of increasingly important and relevant
non-Euclidean statistics (e.g., Blanchard & Jaffe, 2022) that deals with, for example, spaces of matrices
under nonlinear constraints (Dryden et al., 2009); quotients of Euclidean spaces defined by group actions (Le & Kume, 2000); and Riemannian manifolds (Miolane et al., 2020).
The remainder of this paper is organized as follows. We end this introductory section with an overview of
existing literature relevant to our study. Section 2 provides the background on both persistent homology and k-means clustering. We give details on the various components of persistent homology needed to implement the k-means clustering algorithm. In Section 3, we provide our main result on the convergence of the k-means algorithm in persistence diagram space in the KKT framework. Section 4 presents results of numerical
experiments comparing the performance of k-means clustering on embedded persistence diagrams versus
persistence diagrams and persistence measures. We close with a discussion on our contributions and ideas
for future research in Section 5.
Related Work A previous study of k-means clustering in persistence diagram space compares its performance to other classification algorithms; it also establishes local minimality of the solution to the optimization problem in the convergence of the algorithm (Marchese et al., 2017). Our convergence study extends this previous work to the more general KKT framework, which provides a more detailed convergence analysis, and differs experimentally by studying the performance of k-means clustering specifically in the context of persistent homology on simulated data as well as a benchmark dataset.
Clustering and TDA more generally have also been recently overviewed, explaining how topological concepts may apply to the problem of clustering (Panagopoulos, 2022). Other work studies topological data
analytic clustering for time series data as well as space-time data (Islambekov & Gel, 2019; Majumdar &
Laha, 2020).
Figure 1: Example of a VR filtration and VR complexes at various increasing scales on a point cloud.
2 Background and Preliminaries
In this section, we provide details on persistent homology as the context in which our study takes place; we
also outline the k-means clustering algorithm. We then provide more technical details on the specifics of
persistent homology that are needed to implement the k-means clustering algorithm, as well as of Karush–
Kuhn–Tucker optimization.
2.1 Persistent Homology
Algebraic topology is a field in pure mathematics that uses abstract algebra to study general topological
spaces. A primary goal of the field is to characterize topological spaces and classify them according to
properties that do not change when the space is subjected to smooth deformations, such as stretching,
compressing, rotating, and reflecting; these properties are referred to as invariants. Invariants and their
associated theory can be seen as relevant to complex and general real-world data settings in the sense that
they provide a way to quantify and rigorously describe the “shape” and “size” of data, which can then be considered as summaries for complex, non-Euclidean data structures and settings that are becoming increasingly important and challenging to study. A particular advantage of TDA, given its algebraic topological roots, is
its flexibility and applicability both when there is a clear formalism to the data (e.g., rigorous metric space
structure) and when there is a less obvious structure (e.g., networks and text).
Persistent homology adapts the theory of homology from classical algebraic topology to data settings in a
dynamic manner. Homology algebraically identifies and counts invariants as a way to distinguish topological
spaces from one another; various different homology theories exist depending on the underlying topological
space of study. When considering the homology of a dataset as a finite topological space X, it is common to
use simplicial homology over a field. A simplicial complex is a reduction of a general topological space to a
discrete version, as a union of simpler components, glued together in a combinatorial fashion. An advantage
of considering simplicial complexes and studying simplicial homology is that there are efficient algorithms
to compute homology. More formally, these basic components are $k$-simplices, where each $k$-simplex is the convex hull of $k+1$ affinely independent points $x_0, x_1, \ldots, x_k$, denoted by $[x_0, x_1, \ldots, x_k]$. A set constructed combinatorially of $k$-simplices is a simplicial complex, $K$.
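As a minimal illustration of this combinatorial structure (a Python sketch for exposition only; the data structure and helper names are our own and not from the paper), a simplicial complex can be stored as a collection of vertex tuples that is closed under taking faces:

```python
from itertools import combinations

def closure(simplices):
    """Return the face closure of a collection of simplices,
    each given as a tuple of vertex labels."""
    K = set()
    for s in simplices:
        for k in range(1, len(s) + 1):
            K.update(combinations(sorted(s), k))  # all faces of s, including s itself
    return K

# A hollow triangle glued to a filled triangle along a shared vertex:
K = closure([(0, 1), (1, 2), (0, 2), (2, 3, 4)])
# K contains the vertices (0,), ..., the edges (0, 1), ..., and the 2-simplex (2, 3, 4).
```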
Persistent homology is a dynamic version of homology, which is built on studying homology over a
filtration. A filtration is a nested sequence of topological spaces, $X_0 \subseteq X_1 \subseteq \cdots \subseteq X_n = X$; there exist different kinds of filtrations, which are defined by their nesting rule. In this paper, we study Vietoris–Rips (VR) filtrations for finite metric spaces $(X, d_X)$, which are commonly used in applications and real data settings in TDA. Let $\epsilon_1 \leq \epsilon_2 \leq \cdots \leq \epsilon_n$ be an increasing sequence of parameters, where each $\epsilon_i \in \mathbb{R}_{\geq 0}$. The Vietoris–Rips complex of $X$ at scale $\epsilon_i$, denoted $\mathrm{VR}(X, \epsilon_i)$, is constructed by adding a node for each $x_j \in X$ and a $k$-simplex for each set $\{x_{j_1}, x_{j_2}, \ldots, x_{j_{k+1}}\}$ whose diameter is smaller than $\epsilon_i$, i.e., $d_X(x_{j_a}, x_{j_b}) < \epsilon_i$ for all $1 \leq a, b \leq k+1$. We then obtain a Vietoris–Rips filtration
$$\mathrm{VR}(X, \epsilon_1) \subseteq \mathrm{VR}(X, \epsilon_2) \subseteq \cdots \subseteq \mathrm{VR}(X, \epsilon_n),$$
which is a sequence of inclusions of sets.
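To make the nesting concrete, the following Python sketch (an illustration under our own assumptions, not the implementation used in the paper) builds the simplices of $\mathrm{VR}(X, \epsilon)$ up to dimension 2 for a few increasing scales and checks the inclusion property of the filtration:

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def vietoris_rips(points, eps, max_dim=2):
    """Simplices of VR(X, eps) up to dimension max_dim: all subsets of at most
    max_dim + 1 points whose pairwise distances are all below eps."""
    dist = squareform(pdist(points))
    n = len(points)
    simplices = [(i,) for i in range(n)]   # vertices are always included
    for k in range(2, max_dim + 2):        # simplices on k vertices (edges, triangles, ...)
        for subset in combinations(range(n), k):
            if all(dist[a, b] < eps for a, b in combinations(subset, 2)):
                simplices.append(subset)
    return set(simplices)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))
filtration = [vietoris_rips(X, eps) for eps in (0.5, 1.0, 1.5)]
# Nestedness of the filtration: VR(X, 0.5) is contained in VR(X, 1.0), etc.
assert filtration[0] <= filtration[1] <= filtration[2]
```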
The sequence of inclusions given by the filtration induces maps in homology for any fixed dimension $\ell$. Let $H_\ell(X, \epsilon_i)$ be the $\ell$-th homology group of $\mathrm{VR}(X, \epsilon_i)$ with coefficients in a field. This gives rise to the following sequence of vector spaces:
$$H_\ell(X, \epsilon_1) \to H_\ell(X, \epsilon_2) \to \cdots \to H_\ell(X, \epsilon_n).$$
The collection of vector spaces $H_\ell(X, \epsilon_i)$, together with their maps (vector space homomorphisms) $H_\ell(X, \epsilon_i) \to H_\ell(X, \epsilon_j)$, is called a persistence module.
The algebraic interpretation of a persistence module and how it corresponds to topological (more specifically, homological) information of data is formally given as follows. For each finite dimensional $H_\ell(X, \epsilon_i)$, the persistence module can be decomposed into rank one summands corresponding to birth and death times of homology classes (Chazal et al., 2016): Let $\alpha \in H_\ell(X, \epsilon_i)$ be a nontrivial homology class. $\alpha$ is born at $\epsilon_i$ if it is not in the image of $H_\ell(X, \epsilon_{i-1}) \to H_\ell(X, \epsilon_i)$ and, similarly, it dies entering $\epsilon_j$ if the image of $\alpha$ via $H_\ell(X, \epsilon_i) \to H_\ell(X, \epsilon_{j-1})$ is not in the image of $H_\ell(X, \epsilon_{i-1}) \to H_\ell(X, \epsilon_{j-1})$, but the image of $\alpha$ via $H_\ell(X, \epsilon_i) \to H_\ell(X, \epsilon_j)$ is in the image of $H_\ell(X, \epsilon_{i-1}) \to H_\ell(X, \epsilon_j)$. The collection of these birth–death intervals $[\epsilon_i, \epsilon_j)$ represents the persistent homology of the Vietoris–Rips filtration of $X$. Taking each interval as an ordered pair of birth–death coordinates and plotting each in the plane $\mathbb{R}^2$ yields a persistence diagram; see
Figure 3a. This algebraic structure and its construction generate the complex geometry of the space of all
persistence diagrams, which must be taken into account when performing statistical analyses on persistence
diagrams representing the topological information of data.
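In practice, the birth–death pairs are computed by software rather than by hand. As a hedged sketch assuming the GUDHI library (one possible choice; the paper does not prescribe a particular implementation), a persistence diagram of a Vietoris–Rips filtration built on a noisy circle can be obtained as follows:

```python
import numpy as np
import gudhi

rng = np.random.default_rng(1)
# Sample points from a noisy circle, so dimension-1 homology has one prominent class.
theta = rng.uniform(0, 2 * np.pi, 60)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((60, 2))

rips = gudhi.RipsComplex(points=X, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
diag = st.persistence()                                   # (dimension, (birth, death)) pairs
diagram_h1 = st.persistence_intervals_in_dimension(1)     # (birth, death) pairs for H_1
# The most persistent H_1 pair should reflect the single loop of the circle.
print(diagram_h1)
```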
More intuitively, homology studies the holes of a topological space as a kind of invariant, since the
occurrence of holes and how many there are do not change under continuous deformations of the space.
Dimension 0 homology corresponds to connected components; dimension 1 homology corresponds to loops;
and dimension 2 homology corresponds to voids or bubbles. These ideas can be generalized to higher
dimensions and the holes are topological features of the space (or dataset). Persistent homology tracks
the appearance of these features with an interval, where the left endpoint of the interval signifies the first
appearance of a feature within the filtration and the right endpoint signifies its disappearance. The length
of the interval can therefore be thought of as the lifetime of the topological feature. The collection of the
intervals is known as a barcode and summarizes the topological information of the dataset (according to
dimension). A barcode can be equivalently represented as a persistence diagram, which is a scatterplot of
the birth and death points of each interval. See Figure 2 for a visual illustration.
Definition 1. A persistence diagram $D$ is a locally finite multiset of points in the half-plane $\Omega = \{(x, y) \in \mathbb{R}^2 \mid x < y\}$ together with points on the diagonal $\partial\Omega = \{(x, x) \in \mathbb{R}^2\}$, counted with infinite multiplicity. Points in $\Omega$ are called off-diagonal points. The persistence diagram with no off-diagonal points is called the empty persistence diagram, denoted by $D_{\emptyset}$.
Intuitively, points that are far away from the diagonal are the long intervals corresponding to topological
features that have a long lifetime in the filtration. These points are said to have high persistence and
are generally considered to be topological signal, while points closer to the diagonal are considered to be
topological noise.
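As a small illustrative follow-up to the GUDHI sketch above (the threshold value and helper name are hypothetical, not from the paper), the lifetime death minus birth of each off-diagonal point can be used to separate high-persistence points from near-diagonal points:

```python
import numpy as np

def split_by_persistence(diagram, threshold):
    """Split (birth, death) pairs into high-persistence points ("signal")
    and near-diagonal points ("noise") according to their lifetimes."""
    diagram = np.asarray(diagram, dtype=float)
    lifetimes = diagram[:, 1] - diagram[:, 0]
    return diagram[lifetimes >= threshold], diagram[lifetimes < threshold]

# e.g., with diagram_h1 from the previous sketch:
# signal, noise = split_by_persistence(diagram_h1, threshold=0.5)
```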
Persistence Measures A more general representation of persistence diagrams that aims to facilitate
statistical analyses and computations is the persistence measure. The construction of persistence measures
arises from the equivalent representation of persistence diagrams as measures on $\Omega$ of the form $\sum_{x \in D} n_x \delta_x$, where $x$ ranges over the off-diagonal points in a persistence diagram $D$, $n_x$ is the multiplicity of $x$, and $\delta_x$ is the Dirac measure at $x$ (Divol & Chazal, 2019). Considering then all Radon measures supported on $\Omega$ gives
the following definition.
Definition 2 (Divol & Lacombe (2021b)). Let $\mu$ be a Radon measure supported on $\Omega$. For $1 \leq p < \infty$, the $p$-total persistence in this measure setting is defined as
$$\mathrm{Pers}_p(\mu) = \int_{\Omega} \|x - \bar{x}\|_q^p \, \mathrm{d}\mu(x),$$
where $\bar{x}$ is the projection of $x$ onto the diagonal $\partial\Omega$. Any Radon measure with finite $p$-total persistence is called a persistence measure.
Persistence measures resemble heat map versions of persistence diagrams on the upper half-plane where individual persistence points are smoothed, so the intuition on their birth–death lifetimes as described previously and illustrated in Figure 2 is lost. However, they gain the advantage of a linear structure where statistical quantities are well-defined and much more straightforward to compute (Divol & Lacombe, 2021a).
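For the special case where $\mu$ is a persistence diagram viewed as a discrete measure, the $p$-total persistence of Definition 2 reduces to a sum of distances to the diagonal. The following sketch assumes the Euclidean norm ($q = 2$) and illustrative helper names, which are our own choices rather than the paper's:

```python
import numpy as np

def total_persistence(diagram, p=2):
    """p-total persistence of a diagram viewed as a discrete measure: the sum over
    off-diagonal points of ||x - xbar||_2^p, where xbar is the Euclidean projection
    of x = (birth, death) onto the diagonal."""
    diagram = np.asarray(diagram, dtype=float)
    births, deaths = diagram[:, 0], diagram[:, 1]
    # The projection of (b, d) onto the diagonal is ((b + d) / 2, (b + d) / 2),
    # so the Euclidean distance to the diagonal is |d - b| / sqrt(2).
    dist_to_diag = np.abs(deaths - births) / np.sqrt(2.0)
    return float(np.sum(dist_to_diag ** p))

# Example with two off-diagonal points:
print(total_persistence([(0.1, 0.9), (0.2, 0.3)], p=2))  # 0.32 + 0.005 = 0.325
```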
Figure 2: Example illustrating the procedure of persistent homology (Ghrist, 2008). The panel on the left shows an annulus, which is the underlying space of interest; the topology of an annulus is characterized by 1 connected component (corresponding to $H_0$), 1 loop (corresponding to $H_1$), and 0 voids (corresponding to $H_2$). 17 points are sampled from the annulus and a simplicial complex (in order to be able to use simplicial homology) is constructed from these samples in the following way: The sampled points are taken to be centers of balls with radius $\epsilon$, which grows from 0 to $\infty$. At each value of $\epsilon$, whenever two balls intersect, they are connected by an edge; when three balls mutually intersect, they are connected by a triangle (face); this construction continues for higher dimensions. The simplicial complex resulting from this overlapping of balls as the radius $\epsilon$ grows is shown to the right of each snapshot at different values of $\epsilon$. As $\epsilon$ grows, connected components merge, and cycles are formed and then fill up. These events are tracked by a bar for each dimension of homology, shown on the right; the length of each bar corresponds to the lifetime of a topological feature as $\epsilon$ grows. Notice that as $\epsilon$ reaches infinity, there remains a single connected component, tracked by the longest bar for $H_0$. For $H_1$, there are bars of varying length, including several longer bars, which suggests irregular sampling of the annulus; the longest bar corresponds to the single loop of the annulus. Notice also that there is one short $H_2$ bar, which is likely to be a spurious topological feature corresponding to noisy sampling.