k-Means Clustering for Persistent Homology
Yueqi Cao1,†, Prudence Leung1,†, and Anthea Monod1,†
1 Department of Mathematics, Imperial College London, UK
† Corresponding e-mail: y.cao21@imperial.ac.uk
Abstract
Persistent homology is a methodology central to topological data analysis that extracts and summarizes
the topological features within a dataset as a persistence diagram; it has recently gained much popularity
from its myriad successful applications to many domains. However, its algebraic construction induces a
metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence
of the k-means clustering algorithm on persistence diagram space and establish theoretical properties of
the solution to the optimization problem in the Karush–Kuhn–Tucker framework. Additionally, we perform
numerical experiments on various representations of persistent homology, including embeddings of persistence
diagrams as well as diagrams themselves and their generalizations as persistence measures; we find that k-means clustering performed directly on persistence diagrams and measures outperforms clustering on their vectorized representations.
Keywords: Alexandrov geometry; Karush–Kuhn–Tucker optimization; k-means clustering; persistence
diagrams; persistent homology.
1 Introduction
Topological data analysis (TDA) is a recently-emerged field of data science that lies at the intersection of
pure mathematics and statistics. The core approach of the field is to adapt classical algebraic topology to the
computational setting for data analysis. The roots of this approach first appeared in the early 1990s in work by Frosini (1992) on image analysis and pattern recognition in computer vision (Verri et al., 1993); soon afterwards,
these earlier ideas by Frosini & Landi (2001) were reinterpreted, developed, and extended concurrently by
Edelsbrunner et al. (2002) and Zomorodian & Carlsson (2005). These developments laid the foundations for
a comprehensive set of computational topological tools which have then enjoyed great success in applications
to a wide range of data settings in recent decades. Some examples include sensor network coverage problems
(de Silva & Ghrist, 2007); biological and biomedical applications, including understanding the structure of
protein data (Gameiro et al., 2014; Kovacev-Nikolic et al., 2016; Xia et al., 2016) and the 3D structure of DNA (Emmett et al., 2015), and predicting disease outcome in cancer (Crawford et al., 2020); robotics and the problem of goal-directed path planning (Bhattacharya et al., 2015), as well as other general planning and motion problems, such as goal-directed path planning (Pokorny et al., 2016) and determining the cost of bipedal walking (Vasudevan et al., 2013); materials science to understand the structure of amorphous solids (Hiraoka et al., 2016); characterizing chemical reactions and chemical structure (Murayama et al., 2023); financial market prediction (Ismail et al., 2020); quantifying the structure of music (Bergomi & Baratè, 2020); among many others (e.g., Otter et al., 2017).
Persistent homology is the most popular methodology from TDA due to its intuitive construction and
interpretability in applications; it extracts the topological features of data and computes a robust statistical
representation of the data under study. A particular advantage of persistent homology is that it is applicable
to very general data structures, such as point clouds, functions, and networks. Notice that among these
examples, the data may be endowed with only a notion of relatedness (point clouds) or may not even be a
metric space at all (networks).
A major limitation of persistent homology, however, is that its output takes a somewhat unusual form of
a collection of intervals; more specifically, it is a multiset of half-open intervals together with the set of zero-
length intervals with infinite multiplicity, known as a persistence diagram. Statistical methods for interval-
valued data (e.g., Billard & Diday, 2000) cannot be used to analyze persistence diagrams because their
algebraic topological construction gives rise to a complex geometry, which hinders their direct applicability to
classical statistical and machine learning methods. The development of techniques to use persistent homology
in statistics and machine learning has been an active area of research in the TDA community. Largely, the
techniques entail either modifying the structure of persistence diagrams to take the form of vectors (referred
to as embedding or vectorization) so that existing statistical and machine learning methodology may be
applied directly (e.g., Bubenik, 2015a; Adams et al., 2017a); or developing new statistical and machine
learning methods to accommodate the algebraic structure of persistence diagrams (e.g., Reininghaus et al.,
2015). The former approach is more widely used in applications, while the latter approach tends to be more
technically challenging and requires taking into account the intricate geometry of persistence diagrams and
the space of all persistence diagrams. Our work studies both approaches experimentally and contributes a
new theoretical result in the latter setting.
In machine learning, a fundamental task is clustering, which groups observations of a similar nature
to determine structure within the data. The k-means clustering algorithm is one of the most well-known
clustering methods. For structured data in linear spaces, the k-means algorithm is widely used, with well-
established theory on stability and convergence. However, for unstructured data in nonlinear spaces, such
as persistence diagrams, it is a nontrivial task to apply the k-means framework. The key challenges are
twofold: (i) how to define the mean of data in a nonlinear space, and (ii) how to compute the distance
between the data points. These challenges are especially significant in the task of obtaining convergence
guarantees for the algorithm. For persistence diagrams specifically, the cost function is neither differentiable nor subdifferentiable in the traditional sense because of their complicated metric geometry.
In this paper, we study the k-means algorithm on the space of persistence diagrams both theoretically and
experimentally. Theoretically, we establish convergence of the k-means algorithm. In particular, we provide a
characterization of the solution to the optimization problem in the Karush–Kuhn–Tucker (KKT) framework
and show that the solution is a partial optimal point, KKT point, as well as a local minimum. Experimentally,
we study various representations of persistent homology in the k-means algorithm: persistence diagram embeddings, persistence diagrams themselves, as well as their generalizations as persistence measures.
To implement k-means clustering on persistence diagram and persistence measure space, we reinterpret
the algorithm taking into account the appropriate metrics on these spaces as well as constructions of their
respective means. Broadly, our work contributes to the growing body of increasingly important and relevant
non-Euclidean statistics (e.g., Blanchard & Jaffe, 2022) that deals with, for example, spaces of matrices
under nonlinear constraints (Dryden et al., 2009); quotients of Euclidean spaces defined by group actions (Le & Kume, 2000); and Riemannian manifolds (Miolane et al., 2020).
The remainder of this paper is organized as follows. We end this introductory section with an overview of
existing literature relevant to our study. Section 2 provides the background on both persistent homology and k-means clustering. We give details on the various components of persistent homology needed to implement the k-means clustering algorithm. In Section 3, we provide our main result on the convergence of the k-means algorithm in persistence diagram space in the KKT framework. Section 4 presents results of numerical
experiments comparing the performance of k-means clustering on embedded persistence diagrams versus
persistence diagrams and persistence measures. We close with a discussion on our contributions and ideas
for future research in Section 5.
Related Work A previous study of k-means clustering in persistence diagram space compares its performance to other classification algorithms; it also establishes local minimality of the solution to the optimization problem in the convergence of the algorithm (Marchese et al., 2017). Our convergence study extends this previous work to the more general KKT framework, which provides a more detailed convergence analysis, and differs experimentally by studying the performance of k-means clustering specifically in the context of persistent homology on simulated data as well as a benchmark dataset.
Clustering and TDA more generally have also been recently overviewed, explaining how topological concepts may apply to the problem of clustering (Panagopoulos, 2022). Other work studies topological data
analytic clustering for time series data as well as space-time data (Islambekov & Gel, 2019; Majumdar &
Laha, 2020).
Figure 1: Example of a VR filtration and VR complexes at various increasing scales on a point cloud.
2 Background and Preliminaries
In this section, we provide details on persistent homology as the context in which our study takes place; we
also outline the k-means clustering algorithm. We then provide more technical details on the specifics of
persistent homology that are needed to implement the k-means clustering algorithm, as well as of Karush–
Kuhn–Tucker optimization.
2.1 Persistent Homology
Algebraic topology is a field in pure mathematics that uses abstract algebra to study general topological
spaces. A primary goal of the field is to characterize topological spaces and classify them according to
properties that do not change when the space is subjected to smooth deformations, such as stretching,
compressing, rotating, and reflecting; these properties are referred to as invariants. Invariants and their
associated theory can be seen as relevant to complex and general real-world data settings in the sense that
they provide a way to quantify and rigorously describe the “shape” and “size” of data, which can then be considered as summaries for complex, non-Euclidean data structures and settings that are becoming increasingly important and challenging to study. A particular advantage of TDA, given its algebraic topological roots, is
its flexibility and applicability both when there is a clear formalism to the data (e.g., rigorous metric space
structure) and when there is a less obvious structure (e.g., networks and text).
Persistent homology adapts the theory of homology from classical algebraic topology to data settings in a
dynamic manner. Homology algebraically identifies and counts invariants as a way to distinguish topological
spaces from one another; various different homology theories exist depending on the underlying topological
space of study. When considering the homology of a dataset as a finite topological space X, it is common to
use simplicial homology over a field. A simplicial complex is a reduction of a general topological space to a
discrete version, as a union of simpler components, glued together in a combinatorial fashion. An advantage
of considering simplicial complexes and studying simplicial homology is that there are efficient algorithms
to compute homology. More formally, these basic components are $k$-simplices, where each $k$-simplex is the convex hull of $k+1$ affinely independent points $x_0, x_1, \ldots, x_k$, denoted by $[x_0, x_1, \ldots, x_k]$. A set constructed combinatorially of $k$-simplices is a simplicial complex, $K$.
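As a minimal illustration of this combinatorial structure (a Python sketch for exposition only; the data structure and helper names are our own and not from the paper), a simplicial complex can be stored as a collection of vertex tuples that is closed under taking faces:

```python
from itertools import combinations

def closure(simplices):
    """Return the face closure of a collection of simplices,
    each given as a tuple of vertex labels."""
    K = set()
    for s in simplices:
        for k in range(1, len(s) + 1):
            K.update(combinations(sorted(s), k))  # all faces of s, including s itself
    return K

# A hollow triangle glued to a filled triangle along a shared vertex:
K = closure([(0, 1), (1, 2), (0, 2), (2, 3, 4)])
# K contains the vertices (0,), ..., the edges (0, 1), ..., and the 2-simplex (2, 3, 4).
```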
Persistent homology is a dynamic version of homology, which is built on studying homology over a
filtration. A filtration is a nested sequence of topological spaces, $X_0 \subseteq X_1 \subseteq \cdots \subseteq X_n = X$; there exist different kinds of filtrations, which are defined by their nesting rule. In this paper, we study Vietoris–Rips (VR) filtrations for finite metric spaces $(X, d_X)$, which are commonly used in applications and real data settings in TDA. Let $\epsilon_1 \leq \epsilon_2 \leq \cdots \leq \epsilon_n$ be an increasing sequence of parameters, where each $\epsilon_i \in \mathbb{R}_{\geq 0}$. The Vietoris–Rips complex of $X$ at scale $\epsilon_i$, denoted $\mathrm{VR}(X, \epsilon_i)$, is constructed by adding a node for each $x_j \in X$ and a $k$-simplex for each set $\{x_{j_1}, x_{j_2}, \ldots, x_{j_{k+1}}\}$ whose diameter is smaller than $\epsilon_i$, i.e., $d_X(x_{j_a}, x_{j_b}) < \epsilon_i$ for all $1 \leq a, b \leq k+1$. We then obtain a Vietoris–Rips filtration
$$\mathrm{VR}(X, \epsilon_1) \subseteq \mathrm{VR}(X, \epsilon_2) \subseteq \cdots \subseteq \mathrm{VR}(X, \epsilon_n),$$
which is a sequence of inclusions of sets.
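To make the nesting concrete, the following Python sketch (an illustration under our own assumptions, not the implementation used in the paper) builds the simplices of $\mathrm{VR}(X, \epsilon)$ up to dimension 2 for a few increasing scales and checks the inclusion property of the filtration:

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def vietoris_rips(points, eps, max_dim=2):
    """Simplices of VR(X, eps) up to dimension max_dim: all subsets of at most
    max_dim + 1 points whose pairwise distances are all below eps."""
    dist = squareform(pdist(points))
    n = len(points)
    simplices = [(i,) for i in range(n)]   # vertices are always included
    for k in range(2, max_dim + 2):        # simplices on k vertices (edges, triangles, ...)
        for subset in combinations(range(n), k):
            if all(dist[a, b] < eps for a, b in combinations(subset, 2)):
                simplices.append(subset)
    return set(simplices)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))
filtration = [vietoris_rips(X, eps) for eps in (0.5, 1.0, 1.5)]
# Nestedness of the filtration: VR(X, 0.5) is contained in VR(X, 1.0), etc.
assert filtration[0] <= filtration[1] <= filtration[2]
```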
The sequence of inclusions given by the filtration induces maps in homology for any fixed dimension $\ell$. Let $H_\ell(X, \epsilon_i)$ be the $\ell$-th homology group of $\mathrm{VR}(X, \epsilon_i)$ with coefficients in a field. This gives rise to the following sequence of vector spaces:
$$H_\ell(X, \epsilon_1) \to H_\ell(X, \epsilon_2) \to \cdots \to H_\ell(X, \epsilon_n).$$
The collection of vector spaces $H_\ell(X, \epsilon_i)$, together with their maps (vector space homomorphisms) $H_\ell(X, \epsilon_i) \to H_\ell(X, \epsilon_j)$, is called a persistence module.
The algebraic interpretation of a persistence module and how it corresponds to topological (more specifically, homological) information of data is formally given as follows. For each finite dimensional $H_\ell(X, \epsilon_i)$, the persistence module can be decomposed into rank one summands corresponding to birth and death times of homology classes (Chazal et al., 2016): Let $\alpha \in H_\ell(X, \epsilon_i)$ be a nontrivial homology class. $\alpha$ is born at $\epsilon_i$ if it is not in the image of $H_\ell(X, \epsilon_{i-1}) \to H_\ell(X, \epsilon_i)$ and, similarly, it dies entering $\epsilon_j$ if the image of $\alpha$ via $H_\ell(X, \epsilon_i) \to H_\ell(X, \epsilon_{j-1})$ is not in the image of $H_\ell(X, \epsilon_{i-1}) \to H_\ell(X, \epsilon_{j-1})$, but the image of $\alpha$ via $H_\ell(X, \epsilon_i) \to H_\ell(X, \epsilon_j)$ is in the image of $H_\ell(X, \epsilon_{i-1}) \to H_\ell(X, \epsilon_j)$. The collection of these birth–death intervals $[\epsilon_i, \epsilon_j)$ represents the persistent homology of the Vietoris–Rips filtration of $X$. Taking each interval as an ordered pair of birth–death coordinates and plotting each in the plane $\mathbb{R}^2$ yields a persistence diagram; see
Figure 3a. This algebraic structure and its construction generate the complex geometry of the space of all
persistence diagrams, which must be taken into account when performing statistical analyses on persistence
diagrams representing the topological information of data.
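In practice, the birth–death pairs are computed by software rather than by hand. As a hedged sketch assuming the GUDHI library (one possible choice; the paper does not prescribe a particular implementation), a persistence diagram of a Vietoris–Rips filtration built on a noisy circle can be obtained as follows:

```python
import numpy as np
import gudhi

rng = np.random.default_rng(1)
# Sample points from a noisy circle, so dimension-1 homology has one prominent class.
theta = rng.uniform(0, 2 * np.pi, 60)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((60, 2))

rips = gudhi.RipsComplex(points=X, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
diag = st.persistence()                                   # (dimension, (birth, death)) pairs
diagram_h1 = st.persistence_intervals_in_dimension(1)     # (birth, death) pairs for H_1
# The most persistent H_1 pair should reflect the single loop of the circle.
print(diagram_h1)
```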
More intuitively, homology studies the holes of a topological space as a kind of invariant, since the
occurrence of holes and how many there are do not change under continuous deformations of the space.
Dimension 0 homology corresponds to connected components; dimension 1 homology corresponds to loops;
and dimension 2 homology corresponds to voids or bubbles. These ideas can be generalized to higher
dimensions and the holes are topological features of the space (or dataset). Persistent homology tracks
the appearance of these features with an interval, where the left endpoint of the interval signifies the first
appearance of a feature within the filtration and the right endpoint signifies its disappearance. The length
of the interval can therefore be thought of as the lifetime of the topological feature. The collection of the
intervals is known as a barcode and summarizes the topological information of the dataset (according to
dimension). A barcode can be equivalently represented as a persistence diagram, which is a scatterplot of
the birth and death points of each interval. See Figure 2 for a visual illustration.
Definition 1. A persistence diagram $D$ is a locally finite multiset of points in the half-plane $\Omega = \{(x, y) \in \mathbb{R}^2 \mid x < y\}$ together with points on the diagonal $\partial\Omega = \{(x, x) \in \mathbb{R}^2\}$, counted with infinite multiplicity. Points in $\Omega$ are called off-diagonal points. The persistence diagram with no off-diagonal points is called the empty persistence diagram, denoted by $D_{\emptyset}$.
Intuitively, points that are far away from the diagonal are the long intervals corresponding to topological
features that have a long lifetime in the filtration. These points are said to have high persistence and
are generally considered to be topological signal, while points closer to the diagonal are considered to be
topological noise.
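As a small illustrative follow-up to the GUDHI sketch above (the threshold value and helper name are hypothetical, not from the paper), the lifetime death minus birth of each off-diagonal point can be used to separate high-persistence points from near-diagonal points:

```python
import numpy as np

def split_by_persistence(diagram, threshold):
    """Split (birth, death) pairs into high-persistence points ("signal")
    and near-diagonal points ("noise") according to their lifetimes."""
    diagram = np.asarray(diagram, dtype=float)
    lifetimes = diagram[:, 1] - diagram[:, 0]
    return diagram[lifetimes >= threshold], diagram[lifetimes < threshold]

# e.g., with diagram_h1 from the previous sketch:
# signal, noise = split_by_persistence(diagram_h1, threshold=0.5)
```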
Persistence Measures A more general representation of persistence diagrams that aims to facilitate
statistical analyses and computations is the persistence measure. The construction of persistence measures
arises from the equivalent representation of persistence diagrams as measures on $\Omega$ of the form $\sum_{x \in D} n_x \delta_x$, where $x$ ranges over the off-diagonal points in a persistence diagram $D$, $n_x$ is the multiplicity of $x$, and $\delta_x$ is the Dirac measure at $x$ (Divol & Chazal, 2019). Considering then all Radon measures supported on $\Omega$ gives
the following definition.
Definition 2 (Divol & Lacombe (2021b)). Let $\mu$ be a Radon measure supported on $\Omega$. For $1 \leq p < \infty$, the $p$-total persistence in this measure setting is defined as
$$\mathrm{Pers}_p(\mu) = \int_{\Omega} \|x - \bar{x}\|_q^p \, \mathrm{d}\mu(x),$$
where $\bar{x}$ is the projection of $x$ onto the diagonal $\partial\Omega$. Any Radon measure with finite $p$-total persistence is called a persistence measure.
Persistence measures resemble heat map versions of persistence diagrams on the upper half-plane where individual persistence points are smoothed, so the intuition on their birth–death lifetimes as described previously and illustrated in Figure 2 is lost. However, they gain the advantage of a linear structure where statistical quantities are well-defined and much more straightforward to compute (Divol & Lacombe, 2021a).
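For the special case where $\mu$ is a persistence diagram viewed as a discrete measure, the $p$-total persistence of Definition 2 reduces to a sum of distances to the diagonal. The following sketch assumes the Euclidean norm ($q = 2$) and illustrative helper names, which are our own choices rather than the paper's:

```python
import numpy as np

def total_persistence(diagram, p=2):
    """p-total persistence of a diagram viewed as a discrete measure: the sum over
    off-diagonal points of ||x - xbar||_2^p, where xbar is the Euclidean projection
    of x = (birth, death) onto the diagonal."""
    diagram = np.asarray(diagram, dtype=float)
    births, deaths = diagram[:, 0], diagram[:, 1]
    # The projection of (b, d) onto the diagonal is ((b + d) / 2, (b + d) / 2),
    # so the Euclidean distance to the diagonal is |d - b| / sqrt(2).
    dist_to_diag = np.abs(deaths - births) / np.sqrt(2.0)
    return float(np.sum(dist_to_diag ** p))

# Example with two off-diagonal points:
print(total_persistence([(0.1, 0.9), (0.2, 0.3)], p=2))  # 0.32 + 0.005 = 0.325
```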
Figure 2: Example illustrating the procedure of persistent homology (Ghrist, 2008). The panel on the left shows an annulus, which is the underlying space of interest; the topology of an annulus is characterized by 1 connected component (corresponding to $H_0$), 1 loop (corresponding to $H_1$), and 0 voids (corresponding to $H_2$). 17 points are sampled from the annulus and a simplicial complex (in order to be able to use simplicial homology) is constructed from these samples in the following way: The sampled points are taken to be centers of balls with radius $\epsilon$, which grows from 0 to $\infty$. At each value of $\epsilon$, whenever two balls intersect, they are connected by an edge; when three balls mutually intersect, they are connected by a triangle (face); this construction continues for higher dimensions. The simplicial complex resulting from this overlapping of balls as the radius $\epsilon$ grows is shown to the right of each snapshot at different values of $\epsilon$. As $\epsilon$ grows, connected components merge, and cycles are formed and then fill up. These events are tracked by a bar for each dimension of homology, shown on the right; the length of each bar corresponds to the lifetime of a topological feature as $\epsilon$ grows. Notice that as $\epsilon$ reaches infinity, there remains a single connected component, tracked by the longest bar for $H_0$. For $H_1$, there are bars of varying length, including several longer bars, which suggests irregular sampling of the annulus; the longest bar corresponds to the single loop of the annulus. Notice also that there is one short $H_2$ bar, which is likely to be a spurious topological feature corresponding to noisy sampling.