Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions Pawe l D lotko1 Niklas Hellmer1 Lukasz Stettner1 Rafa l Topolnicki1

2025-05-06 5 0 1.39MB 26 页 10玖币

侵权投诉

Topology-Driven Goodness-of-Fit Tests in Arbitrary

Dimensions

Pawe l D lotko1, Niklas Hellmer1, Lukasz Stettner1, Rafa l Topolnicki1

1Institute of Mathematics, Polish Academy of Sciences, ´

Sniadeckich 8, Warsaw, 00-656,

Poland.

Contributing authors: pawel.dlotko@impan.pl;nhellmer@impan.pl;stettner@impan.pl;

rafal.topolnicki@impan.pl;

Abstract

This paper adopts a tool from computational topology, the Euler characteristic curve (ECC) of a

sample, to perform one- and two-sample goodness of ﬁt tests. We call our procedure TopoTests. The

presented tests work for samples of arbitrary dimension, having comparable power to the state-of-

the-art tests in the one-dimensional case. It is demonstrated that the type I error of TopoTests can

be controlled and their type II error vanishes exponentially with increasing sample size. Extensive

numerical simulations of TopoTests are conducted to demonstrate their power for samples of various

sizes.

Keywords: Goodness-of-ﬁt test, Euler characteristic curve, high-dimensional inference, topological data

analysis

1 Introduction

Goodness-of-ﬁt (GoF) testing is one of the stan-

dard tasks in statistics. The testing procedure

can be stated in the one-sample or two-sample

setting. In case of the one-sample problem, we

observe a sample of mindependent realizations

{x1, . . . , xm}of a d-dimensional random vector

Xwith an unknown distribution function G, i.e.

xi∼G. The task is to test whether Gis equal to

a speciﬁc distribution F, i.e. we would like to test

H0:G=Fvs. H1:G̸=F. (1)

In the setting of the two-sample problem we are

given two independent samples consisting of m

and n(m̸=nin general) independent realizations

of d-dimensional random vectors Xand Ywith an

unknown distribution function Fand G, respec-

tively. This means X={x1, . . . , xm}, xi∼Fand

Y={y1, . . . , yn}, yj∼G, while the hypothesis is

the same as in (1).

In this paper, we consider a more general

notion of equivalence, replacing the equal sign

above by the relation of being Euler equivalent (cf.

Deﬁnition 2.1).

We are interested in the setting in which

the underlying distribution is continuous. In this

case, prominent GoF tests for samples from R

rely on the empirical distribution function, see

[1, chapter 4]. These include, in the one dimen-

sional case, the Kolmogorov-Smirnov, Cram´er-

von-Mises and Anderson-Darling tests. In higher

dimensions, Kolmogorov-Smirnov leads to Fasano-

Franceschini[2] and Peacock[3] tests; a general

arXiv:2210.14965v3 [stat.ME] 11 Oct 2023

case was considered by Justel [4]. A multivari-

ate version of Cram´er-von-Mises was proposed

by Chiu and Liu[5]. Since those tests are based

on empirical distribution function, their gener-

alization to Rdfor d≥2 is conceptually and

computationally diﬃcult. Moreover, we are not

aware of an eﬃcient implementation of a general

goodness of ﬁt tests for high dimensional samples.

To tackle this challenge we propose to replace

the cumulative distribution function with Euler

characteristic curves (ECCs) [6–8], a tool from

computational topology that provides a signature

of the considered sample. To a given sample X,

this notion associates a function χ(X): [0,∞)→

Z, which can serve as a stand-in for the empiri-

cal distribution function in arbitrary dimensions.

Subsequently, for one-sample tests, inspired by

the Kolmogorov-Smirnov test, we deﬁne the test

statistic to be the supremum distance between the

ECC of the sample and the expected ECC for

the distribution. This topologically driven testing

scheme will be referred to as “TopoTest” for short.

The key characteristic of any goodness of ﬁt

test is its power, i.e. the type II error should be

small, under the requirement that the type I error

is ﬁxed at level α. We show that the proposed

test satisﬁes this condition and that it performs

very well in practical cases. In particular, even

restricted to one dimensional samples, its power is

comparable to those of the standard GoF tests.

The paper is organized as follows: Section 1.1

reviews the necessary background from topol-

ogy as well as the current work in the topic.

In Section 2we present the theoretical justiﬁca-

tion of our method. In Section 3the algorithms

implementing proposed GoF tests are detailed.

Sections 4and 5present the numerical experi-

ments and comparison of the presented technique

to existing methods. In particular, comparing to

a higher dimensional version of the Kolmogorov-

Smirnov test, we ﬁnd that our procedure provides

better power and takes less time to compute.

Finally, in Section 7the conclusions are drawn.

1.1 Background

Since the seminal work of Edelsbrunner et al. [9]

and Carlsson & Zomorodian [10], topological Data

Analysis (TDA) is a fast growing interdisciplinary

area combining tools and results of such diverse

ﬁelds of science as algebra, topology, statistics

(a) χ= 9 (b) χ= 9 −1=8

Fig. 1 With increasing scale parameter, we draw in edges

and triangles. We keep track of the number of components,

which is here #points −#edges + #triangles.

and machine learning, just to name few. For a

survey from a statistician’s perspective, see [11].

One of the areas in which TDA can contribute

to statistics is related to applications of topolog-

ical summaries of the data to hypothesis testing.

Despite ongoing research and growing interest in

TDA methods, attempts to construct statistical

tests within the classical Neyman-Person hypoth-

esis testing paradigm based on persistent homol-

ogy, the most popular topological summaries of

data, are limited because the distributions of test

statistics under the null hypothesis are unknown.

Therefore, the approaches that are most common

in the literature utilize sampling and permutation

based techniques [12–14]. In this work, a diﬀer-

ent topological summary of the data, namely the

Euler characteristic curve (ECC), is used to con-

struct one-sample and two-sample statistical tests.

The application of ECCs is motivated by recent

theoretical ﬁndings regarding the asymptotic dis-

tribution of ECC, which enables us to construct

tests in rigorous fashion. Since the ﬁnite sample

distributions of ECCs remain unknown, exten-

sive Monte Carlo simulations were conducted to

investigate the properties and performances of the

proposed tests.

Tools from Computational Topology

To start with an example, let us consider the

set Xof nine points in R2(Figure 1a). The

most elementary way of assigning a numeric quan-

tity to them is to simply count them. This is

a topological invariant, the number of connected

components. Now if two points coincide, they

should not be regarded as separate. If they are

very close together, say less than some given ε > 0

apart, we can also connect them. So let us draw

an edge between them (Figure 1b). The number

of connected components is now one less, suggest-

ing we should subtract the number of edges from

the number of points. In order to formalize what

we mean by points that are close to each other,

we introduce a scale parameter r∈R≥0. Then we

draw edges between pairs of points whose distance

is at most r. Letting r= 0 initially and increasing

it, we draw more and more edges, thereby reduc-

ing the number of connected components (Figure

1c). Once three points are within distance rof

each other, according to our intuition they should

be considered as one connected component. But

we have three points and three edges, which yield

a diﬀerence of zero. To correct this mismatch

with our intuition, we add the number of triangles

(Figure 1d). This procedure continues to higher

dimensions: Once kpoints are within distance r

of each other, we add (−1)k−1.

These ideas will now be formalized. For a text-

book reference on these topics, we refer the reader

to [15].

Deﬁnition 1.1.An abstract simplicial complex K

is a collection of nonempty sets which are closed

under the subset operation:

τ∈Kand σ⊆τ⇒σ∈K.

The elements of Kare called simplices. If σ⊊τ∈

K, we say that σis a face of τ. The dimension

of a simplex σ∈Kis dim(σ) = |σ| − 1, where

|·|denotes the cardinality of a set. The dimension

of Kis the the maximal dimension of any of its

simplices.

The construction of drawing edges, triangles

etc. between points which are close to each other

can be formalized in slightly diﬀerent ﬂavours.

Perhaps the simplest is the Vietoris-Rips construc-

tion:

Deﬁnition 1.2.For a ﬁnite subset X⊆Rdand

r≥0 deﬁne the Vietoris-Rips complex at scale r

to be the abstract simplicial complex

Rr(X) = {σ⊆X: diam(σ)≤2r}

where diam is the diameter of the simplex

diam(σ) = max{d(x, x′): x, x′∈σ, x ̸=x′}.

A closely related notion is the ˇ

Cech complex:

Deﬁnition 1.3.For a ﬁnite subset X⊆Rdand

r≥0 deﬁne the ˇ

Cech complex at scale rto be the

abstract simplicial complex

Cr(X) = (σ⊆X:\

x∈σ

Br(x)̸=∅),

where Br(x) is the closed ball of radius rcentered

at x.

Finally, the Alpha complex (which is the

most useful in practice and used in our imple-

mentations), requires the following notion from

computational geometry:

Deﬁnition 1.4.Let X⊆Rdbe a ﬁnite set. The

Voronoi cell of x∈Xis the subset of points in Rd

that have xas a closest point in X,

VX(x) = {y∈Rd:∀x′∈X∥y−x∥≤∥y−x′∥}.

Deﬁnition 1.5.For a ﬁnite subset X⊆Rdand

r≥0 deﬁne the Alpha complex at scale rto be

the abstract simplicial complex

Ar(X) = (σ⊆X:\

x∈σ

Br(x)∩VX(x)̸=∅).

For illustrations of the Alpha, ˇ

Cech and

Vietoris-Rips complex on a small sample, consider

Figures 2a,2b and 2c, respectively. We refer to r

as the scale parameter or the ﬁltration value. The

latter name comes from the fact that for r < r′,

the complex at scale ris a subcomplex of the one

at scale r′.

The main advantage of the Alpha complex

is its small size in low dimensions [16]; namely

the Alpha complex on a random sample scales

exponentially with the dimension of the sample

and linearly with the sample size, see [17] for

a further discussion. This is acceptable for low

dimension, but impractical for higher ones. The

Vietoris-Rips complex does not scale with the

dimension but it scales exponentially with the

sample size. For small samples in high dimensions,

this construction should be preferred.

Counting the simplices with a sign yields the

Euler characteristic, a fundamental topological

invariant.

(a) Alpha (b) ˇ

Cech (c) Vietoris-Rips

Fig. 2 We consider three diﬀerent constructions of ﬁltered simplicial complexes with a ﬁxed sample as vertex set.

Deﬁnition 1.6.Let Kbe a ﬁnite abstract simpli-

cial complex. Its Euler characteristic is

χ(K) = X

σ∈K

(−1)dim(σ).

In the following we use the ˇ

Cech construction

in the theoretical part. Due to its sparse nature

the Alpha construction is used in the implemen-

tation. They are topologically equivalent by the

nerve lemma [15, III.2], hence they give the same

ECC.

It should be noted that, for a given sample X,

the Euler characteristics of its Vietoris-Rips com-

plex, χ(Rr(X)), may be diﬀerent from χ(Ar(X))

and χ(Cr(X)). An example can be found in the

sample presented in Figure 2c in which the 2-

simplex (triangle) on the left is ﬁlled in the

Vietoris-Rips complex, but empty for the ˇ

Cech

and Alpha complex.

Keeping track of how the Euler characteristic

changes with the scale parameter yields the main

tool of our interest:

Deﬁnition 1.7.Given a ﬁnite subset X⊆Rd,

deﬁne its Euler characteristic curve (ECC) as

χ(X): [0,∞)→Z, r 7→ χ(Ar(X)).

The ECC of the sample from Figure 1a is dis-

played in Figure 3. First applications of the ECC

date to back to work of Worsley on astrophysics

and medical imaging [8].

Topology of Random Geometric

Complexes

In the considered setting, the vertex set from

which we build simplicial complexes is sampled

from some unknown distribution. The litera-

ture distinguishes two approaches, Poisson and

Bernoulli sampling; see [18] for a survey. In the

0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4

Filtration r

Euler Characteristic Curve

χ(Ar(X))

(a) (b) (c) (d)

Fig. 3 The ECC of the sample from Figure 1a. The ﬁltra-

tion values (a)-(d) correspond to the complexes in Figure

1a-1d.

ﬁrst setting, the samples are assumed to be gen-

erated by a spatial Poisson process. We focus on

the Bernoulli sampling scheme in this paper. This

means that we consider samples of npoints sam-

pled i.i.d. from some ddimensional distribution.

Furthermore, there are three regimes to be con-

sidered when the sample size goes to inﬁnity [19,

Section 1.4]. We consider the geometric complex

at scale rnfor a sequence rn→0 whose topology

is determined by the behaviour whether

n·rnd→









∞,

λ∈(0,∞) constant,

In the supercritical regime,n·rnd→ ∞, so that

the domain gets densely sampled the geomet-

ric complex is highly connected. Intuitively, this

regime maintains only global topological infor-

mation and forgets about local density. In the

subcritical regime,n·rnd→0, so that the domain

gets sparsely sampled and the geometric complex

is, informally speaking, disconnected (consult [18]

for details). In this paper, we focus on the ther-

modynamic regime, i.e. we keep the quantity n·

rnd=λconstant. Up to a constant factor, the

quantity n·rd

nis the average number of points in a

ball of radius rn[18, Section 1]. This value neither

goes to zero nor to inﬁnity as n→ ∞ in the ther-

modynamic regime, leading to complex topology;

see for instance [19, Chapter 9]. Now it is straight-

forward to observe that a subset of our sample

σ⊆Xforms a simplex in the ˇ

Cech complex at

scale rniﬀ

x∈σ

Brn(x)̸=∅ ⇔ \

x∈n1/dσ

Bλ(x)̸=∅.

This is because for any x∈X, x′∈Rd, we have

∥x′−x∥ ≤ rn⇔n1/d∥x′−x∥ ≤ n1/drn

⇔ ∥n1/dx′−n1/dx∥ ≤ λ1/d

This observation motivates us to scale a sample of

size nby n1/d. In fact, this setup aligns with the

approach of [20]. Due to this scaling, the aver-

age number of points in a ball of radius r=λ1/d

stays the same as we increase n→ ∞. There-

fore, it makes sense to compare ECCs at ﬁxed

radius r=λ1/d for samples of diﬀerent sizes. Visu-

ally speaking, we can compare (expected) ECCs

from samples of diﬀerent sizes in a common coor-

dinate system using the r-axis scaled in this way.

In particular, one can study the point-wise limit of

the expected ECC; that is, when the sample size

approaches inﬁnity for a ﬁxed r. Moreover, this

rescaling allows us to conduct two sample tests

with samples of diﬀerent sizes, cf. Section 2.2.

1.2 Previous Work

Let us brieﬂy review some related work on the

intersection of topology and statistics. The most

popular tool of TDA is persistent homology. Its

key property is stability [21]; informally speak-

ing, a small perturbation of the input yields a

small change in the output. However, persistent

homology is a complicated setting for statistics;

for example, there are no unique means [22].

For a survey on the topology of random geo-

metric complexes see [18]. A text book for the

case of one-dimensional complexes, i.e. graphs,

is [19]. The Euler characteristic of random geo-

metric complexes has been studied in [23,24].

Notably, in [24], the limiting ECC in the ther-

modynamic regime is computed for the uniform

distribution on [0,1]3. More recently, [25] provided

a functional central limit theorem for ECCs, which

was subsequently generalized by [20]. The Euler

characteristic has been studied in the context of

random ﬁelds [26] by Adler and Taylor. Adler

suggested to use it for model selection purposes

and normality testing [27, Section 7]. Building

on this work, such a normality test has been

extensively studied in [28]. Using topological sum-

maries for statistical testing has moreover been

suggested by [29] for persistence vineyards, [30]

for persistent Betti numbers and [31] for multipa-

rameter persistent Betti numbers. Mukherjee and

Vejdemo-Johansson [14] describe a framework for

multiple hypothesis testing for persistent homol-

ogy. Very recently, Vishwanath et al. [32] provided

criteria to check the injectivity of topological

summary statistics including ECCs.

1.3 Our Contributions

In this paper, to the best of our knowledge, we

present the ﬁrst mathematically rigorous approach

using the Euler characteristic curves to perform

general goodness-of-ﬁt testing. Our procedure

is theoretically justiﬁed by Theorem 2.4. The

concentration inequality for Gaussian processes

(Lemma 2.2) might be of independent interest.

Simulations conducted in Section 4and 5

indicate that TopoTest outperforms the Kolmogo-

rov-Smirnov test we used as a baseline in arbitrary

dimension both in terms of the test power but

also in terms of computational time for moderate

sample sizes and dimensions.

The implementation of TopoTest is publicly

available at https://github.com/dioscuri-tda/

topotests.

2 Method

2.1 One-sample test

While topological descriptors are computable and

have a strong theory underlying them, they are

not complete invariants of the underlying distri-

butions, as recently pointed out in [32]. Hence

the statement of the null hypothesis and the

alternative require some care.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

Topology-DrivenGoodness-of-FitTestsinArbitraryDimensionsPawelDlotko1,NiklasHellmer1,LukaszStettner1,RafalTopolnicki11InstituteofMathematics,PolishAcademyofSciences,´Sniadeckich8,Warsaw,00-656,Poland.Contributingauthors:pawel.dlotko@impan.pl;nhellmer@impan.pl;stettner@impan.pl;rafal.topolnicki@impan....

展开>> 收起<<

Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions Pawe l D lotko1 Niklas Hellmer1 Lukasz Stettner1 Rafa l Topolnicki1.pdf

共26页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions Pawe l D lotko1 Niklas Hellmer1 Lukasz Stettner1 Rafa l Topolnicki1

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: