Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions Pawe l D lotko1 Niklas Hellmer1 Lukasz Stettner1 Rafa l Topolnicki1

2025-05-06 0 0 1.39MB 26 页 10玖币
侵权投诉
Topology-Driven Goodness-of-Fit Tests in Arbitrary
Dimensions
Pawe l D lotko1, Niklas Hellmer1, Lukasz Stettner1, Rafa l Topolnicki1
1Institute of Mathematics, Polish Academy of Sciences, ´
Sniadeckich 8, Warsaw, 00-656,
Poland.
Contributing authors: pawel.dlotko@impan.pl;nhellmer@impan.pl;stettner@impan.pl;
rafal.topolnicki@impan.pl;
Abstract
This paper adopts a tool from computational topology, the Euler characteristic curve (ECC) of a
sample, to perform one- and two-sample goodness of fit tests. We call our procedure TopoTests. The
presented tests work for samples of arbitrary dimension, having comparable power to the state-of-
the-art tests in the one-dimensional case. It is demonstrated that the type I error of TopoTests can
be controlled and their type II error vanishes exponentially with increasing sample size. Extensive
numerical simulations of TopoTests are conducted to demonstrate their power for samples of various
sizes.
Keywords: Goodness-of-fit test, Euler characteristic curve, high-dimensional inference, topological data
analysis
1 Introduction
Goodness-of-fit (GoF) testing is one of the stan-
dard tasks in statistics. The testing procedure
can be stated in the one-sample or two-sample
setting. In case of the one-sample problem, we
observe a sample of mindependent realizations
{x1, . . . , xm}of a d-dimensional random vector
Xwith an unknown distribution function G, i.e.
xiG. The task is to test whether Gis equal to
a specific distribution F, i.e. we would like to test
H0:G=Fvs. H1:G̸=F. (1)
In the setting of the two-sample problem we are
given two independent samples consisting of m
and n(m̸=nin general) independent realizations
of d-dimensional random vectors Xand Ywith an
unknown distribution function Fand G, respec-
tively. This means X={x1, . . . , xm}, xiFand
Y={y1, . . . , yn}, yjG, while the hypothesis is
the same as in (1).
In this paper, we consider a more general
notion of equivalence, replacing the equal sign
above by the relation of being Euler equivalent (cf.
Definition 2.1).
We are interested in the setting in which
the underlying distribution is continuous. In this
case, prominent GoF tests for samples from R
rely on the empirical distribution function, see
[1, chapter 4]. These include, in the one dimen-
sional case, the Kolmogorov-Smirnov, Cram´er-
von-Mises and Anderson-Darling tests. In higher
dimensions, Kolmogorov-Smirnov leads to Fasano-
Franceschini[2] and Peacock[3] tests; a general
1
arXiv:2210.14965v3 [stat.ME] 11 Oct 2023
case was considered by Justel [4]. A multivari-
ate version of Cram´er-von-Mises was proposed
by Chiu and Liu[5]. Since those tests are based
on empirical distribution function, their gener-
alization to Rdfor d2 is conceptually and
computationally difficult. Moreover, we are not
aware of an efficient implementation of a general
goodness of fit tests for high dimensional samples.
To tackle this challenge we propose to replace
the cumulative distribution function with Euler
characteristic curves (ECCs) [68], a tool from
computational topology that provides a signature
of the considered sample. To a given sample X,
this notion associates a function χ(X): [0,)
Z, which can serve as a stand-in for the empiri-
cal distribution function in arbitrary dimensions.
Subsequently, for one-sample tests, inspired by
the Kolmogorov-Smirnov test, we define the test
statistic to be the supremum distance between the
ECC of the sample and the expected ECC for
the distribution. This topologically driven testing
scheme will be referred to as “TopoTest” for short.
The key characteristic of any goodness of fit
test is its power, i.e. the type II error should be
small, under the requirement that the type I error
is fixed at level α. We show that the proposed
test satisfies this condition and that it performs
very well in practical cases. In particular, even
restricted to one dimensional samples, its power is
comparable to those of the standard GoF tests.
The paper is organized as follows: Section 1.1
reviews the necessary background from topol-
ogy as well as the current work in the topic.
In Section 2we present the theoretical justifica-
tion of our method. In Section 3the algorithms
implementing proposed GoF tests are detailed.
Sections 4and 5present the numerical experi-
ments and comparison of the presented technique
to existing methods. In particular, comparing to
a higher dimensional version of the Kolmogorov-
Smirnov test, we find that our procedure provides
better power and takes less time to compute.
Finally, in Section 7the conclusions are drawn.
1.1 Background
Since the seminal work of Edelsbrunner et al. [9]
and Carlsson & Zomorodian [10], topological Data
Analysis (TDA) is a fast growing interdisciplinary
area combining tools and results of such diverse
fields of science as algebra, topology, statistics
(a) χ= 9 (b) χ= 9 1=8
(c) χ= 9 4 = 5 (d) χ= 9 6 + 1 = 4
Fig. 1 With increasing scale parameter, we draw in edges
and triangles. We keep track of the number of components,
which is here #points #edges + #triangles.
and machine learning, just to name few. For a
survey from a statistician’s perspective, see [11].
One of the areas in which TDA can contribute
to statistics is related to applications of topolog-
ical summaries of the data to hypothesis testing.
Despite ongoing research and growing interest in
TDA methods, attempts to construct statistical
tests within the classical Neyman-Person hypoth-
esis testing paradigm based on persistent homol-
ogy, the most popular topological summaries of
data, are limited because the distributions of test
statistics under the null hypothesis are unknown.
Therefore, the approaches that are most common
in the literature utilize sampling and permutation
based techniques [1214]. In this work, a differ-
ent topological summary of the data, namely the
Euler characteristic curve (ECC), is used to con-
struct one-sample and two-sample statistical tests.
The application of ECCs is motivated by recent
theoretical findings regarding the asymptotic dis-
tribution of ECC, which enables us to construct
tests in rigorous fashion. Since the finite sample
distributions of ECCs remain unknown, exten-
sive Monte Carlo simulations were conducted to
investigate the properties and performances of the
proposed tests.
Tools from Computational Topology
To start with an example, let us consider the
set Xof nine points in R2(Figure 1a). The
2
most elementary way of assigning a numeric quan-
tity to them is to simply count them. This is
a topological invariant, the number of connected
components. Now if two points coincide, they
should not be regarded as separate. If they are
very close together, say less than some given ε > 0
apart, we can also connect them. So let us draw
an edge between them (Figure 1b). The number
of connected components is now one less, suggest-
ing we should subtract the number of edges from
the number of points. In order to formalize what
we mean by points that are close to each other,
we introduce a scale parameter rR0. Then we
draw edges between pairs of points whose distance
is at most r. Letting r= 0 initially and increasing
it, we draw more and more edges, thereby reduc-
ing the number of connected components (Figure
1c). Once three points are within distance rof
each other, according to our intuition they should
be considered as one connected component. But
we have three points and three edges, which yield
a difference of zero. To correct this mismatch
with our intuition, we add the number of triangles
(Figure 1d). This procedure continues to higher
dimensions: Once kpoints are within distance r
of each other, we add (1)k1.
These ideas will now be formalized. For a text-
book reference on these topics, we refer the reader
to [15].
Definition 1.1.An abstract simplicial complex K
is a collection of nonempty sets which are closed
under the subset operation:
τKand στσK.
The elements of Kare called simplices. If στ
K, we say that σis a face of τ. The dimension
of a simplex σKis dim(σ) = |σ| − 1, where
|·|denotes the cardinality of a set. The dimension
of Kis the the maximal dimension of any of its
simplices.
The construction of drawing edges, triangles
etc. between points which are close to each other
can be formalized in slightly different flavours.
Perhaps the simplest is the Vietoris-Rips construc-
tion:
Definition 1.2.For a finite subset XRdand
r0 define the Vietoris-Rips complex at scale r
to be the abstract simplicial complex
Rr(X) = {σX: diam(σ)2r}
where diam is the diameter of the simplex
diam(σ) = max{d(x, x): x, xσ, x ̸=x}.
A closely related notion is the ˇ
Cech complex:
Definition 1.3.For a finite subset XRdand
r0 define the ˇ
Cech complex at scale rto be the
abstract simplicial complex
Cr(X) = (σX:\
xσ
Br(x)̸=),
where Br(x) is the closed ball of radius rcentered
at x.
Finally, the Alpha complex (which is the
most useful in practice and used in our imple-
mentations), requires the following notion from
computational geometry:
Definition 1.4.Let XRdbe a finite set. The
Voronoi cell of xXis the subset of points in Rd
that have xas a closest point in X,
VX(x) = {yRd:xXyx∥≤∥yx∥}.
Definition 1.5.For a finite subset XRdand
r0 define the Alpha complex at scale rto be
the abstract simplicial complex
Ar(X) = (σX:\
xσ
Br(x)VX(x)̸=).
For illustrations of the Alpha, ˇ
Cech and
Vietoris-Rips complex on a small sample, consider
Figures 2a,2b and 2c, respectively. We refer to r
as the scale parameter or the filtration value. The
latter name comes from the fact that for r < r,
the complex at scale ris a subcomplex of the one
at scale r.
The main advantage of the Alpha complex
is its small size in low dimensions [16]; namely
the Alpha complex on a random sample scales
exponentially with the dimension of the sample
and linearly with the sample size, see [17] for
a further discussion. This is acceptable for low
dimension, but impractical for higher ones. The
Vietoris-Rips complex does not scale with the
dimension but it scales exponentially with the
sample size. For small samples in high dimensions,
this construction should be preferred.
Counting the simplices with a sign yields the
Euler characteristic, a fundamental topological
invariant.
3
(a) Alpha (b) ˇ
Cech (c) Vietoris-Rips
Fig. 2 We consider three different constructions of filtered simplicial complexes with a fixed sample as vertex set.
Definition 1.6.Let Kbe a finite abstract simpli-
cial complex. Its Euler characteristic is
χ(K) = X
σK
(1)dim(σ).
In the following we use the ˇ
Cech construction
in the theoretical part. Due to its sparse nature
the Alpha construction is used in the implemen-
tation. They are topologically equivalent by the
nerve lemma [15, III.2], hence they give the same
ECC.
It should be noted that, for a given sample X,
the Euler characteristics of its Vietoris-Rips com-
plex, χ(Rr(X)), may be different from χ(Ar(X))
and χ(Cr(X)). An example can be found in the
sample presented in Figure 2c in which the 2-
simplex (triangle) on the left is filled in the
Vietoris-Rips complex, but empty for the ˇ
Cech
and Alpha complex.
Keeping track of how the Euler characteristic
changes with the scale parameter yields the main
tool of our interest:
Definition 1.7.Given a finite subset XRd,
define its Euler characteristic curve (ECC) as
χ(X): [0,)Z, r 7→ χ(Ar(X)).
The ECC of the sample from Figure 1a is dis-
played in Figure 3. First applications of the ECC
date to back to work of Worsley on astrophysics
and medical imaging [8].
Topology of Random Geometric
Complexes
In the considered setting, the vertex set from
which we build simplicial complexes is sampled
from some unknown distribution. The litera-
ture distinguishes two approaches, Poisson and
Bernoulli sampling; see [18] for a survey. In the
0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4
Filtration r
0
2
4
6
8
10
Euler Characteristic Curve
χ(Ar(X))
(a) (b) (c) (d)
Fig. 3 The ECC of the sample from Figure 1a. The filtra-
tion values (a)-(d) correspond to the complexes in Figure
1a-1d.
first setting, the samples are assumed to be gen-
erated by a spatial Poisson process. We focus on
the Bernoulli sampling scheme in this paper. This
means that we consider samples of npoints sam-
pled i.i.d. from some ddimensional distribution.
Furthermore, there are three regimes to be con-
sidered when the sample size goes to infinity [19,
Section 1.4]. We consider the geometric complex
at scale rnfor a sequence rn0 whose topology
is determined by the behaviour whether
n·rnd
,
λ(0,) constant,
0.
In the supercritical regime,n·rnd→ ∞, so that
the domain gets densely sampled the geomet-
ric complex is highly connected. Intuitively, this
regime maintains only global topological infor-
mation and forgets about local density. In the
subcritical regime,n·rnd0, so that the domain
gets sparsely sampled and the geometric complex
is, informally speaking, disconnected (consult [18]
4
for details). In this paper, we focus on the ther-
modynamic regime, i.e. we keep the quantity n·
rnd=λconstant. Up to a constant factor, the
quantity n·rd
nis the average number of points in a
ball of radius rn[18, Section 1]. This value neither
goes to zero nor to infinity as n→ ∞ in the ther-
modynamic regime, leading to complex topology;
see for instance [19, Chapter 9]. Now it is straight-
forward to observe that a subset of our sample
σXforms a simplex in the ˇ
Cech complex at
scale rniff
\
xσ
Brn(x)̸=∅ ⇔ \
xn1/dσ
Bλ(x)̸=.
This is because for any xX, xRd, we have
xx∥ ≤ rnn1/dxx∥ ≤ n1/drn
⇔ ∥n1/dxn1/dx∥ ≤ λ1/d
This observation motivates us to scale a sample of
size nby n1/d. In fact, this setup aligns with the
approach of [20]. Due to this scaling, the aver-
age number of points in a ball of radius r=λ1/d
stays the same as we increase n→ ∞. There-
fore, it makes sense to compare ECCs at fixed
radius r=λ1/d for samples of different sizes. Visu-
ally speaking, we can compare (expected) ECCs
from samples of different sizes in a common coor-
dinate system using the r-axis scaled in this way.
In particular, one can study the point-wise limit of
the expected ECC; that is, when the sample size
approaches infinity for a fixed r. Moreover, this
rescaling allows us to conduct two sample tests
with samples of different sizes, cf. Section 2.2.
1.2 Previous Work
Let us briefly review some related work on the
intersection of topology and statistics. The most
popular tool of TDA is persistent homology. Its
key property is stability [21]; informally speak-
ing, a small perturbation of the input yields a
small change in the output. However, persistent
homology is a complicated setting for statistics;
for example, there are no unique means [22].
For a survey on the topology of random geo-
metric complexes see [18]. A text book for the
case of one-dimensional complexes, i.e. graphs,
is [19]. The Euler characteristic of random geo-
metric complexes has been studied in [23,24].
Notably, in [24], the limiting ECC in the ther-
modynamic regime is computed for the uniform
distribution on [0,1]3. More recently, [25] provided
a functional central limit theorem for ECCs, which
was subsequently generalized by [20]. The Euler
characteristic has been studied in the context of
random fields [26] by Adler and Taylor. Adler
suggested to use it for model selection purposes
and normality testing [27, Section 7]. Building
on this work, such a normality test has been
extensively studied in [28]. Using topological sum-
maries for statistical testing has moreover been
suggested by [29] for persistence vineyards, [30]
for persistent Betti numbers and [31] for multipa-
rameter persistent Betti numbers. Mukherjee and
Vejdemo-Johansson [14] describe a framework for
multiple hypothesis testing for persistent homol-
ogy. Very recently, Vishwanath et al. [32] provided
criteria to check the injectivity of topological
summary statistics including ECCs.
1.3 Our Contributions
In this paper, to the best of our knowledge, we
present the first mathematically rigorous approach
using the Euler characteristic curves to perform
general goodness-of-fit testing. Our procedure
is theoretically justified by Theorem 2.4. The
concentration inequality for Gaussian processes
(Lemma 2.2) might be of independent interest.
Simulations conducted in Section 4and 5
indicate that TopoTest outperforms the Kolmogo-
rov-Smirnov test we used as a baseline in arbitrary
dimension both in terms of the test power but
also in terms of computational time for moderate
sample sizes and dimensions.
The implementation of TopoTest is publicly
available at https://github.com/dioscuri-tda/
topotests.
2 Method
2.1 One-sample test
While topological descriptors are computable and
have a strong theory underlying them, they are
not complete invariants of the underlying distri-
butions, as recently pointed out in [32]. Hence
the statement of the null hypothesis and the
alternative require some care.
5
摘要:

Topology-DrivenGoodness-of-FitTestsinArbitraryDimensionsPawelDlotko1,NiklasHellmer1,LukaszStettner1,RafalTopolnicki11InstituteofMathematics,PolishAcademyofSciences,´Sniadeckich8,Warsaw,00-656,Poland.Contributingauthors:pawel.dlotko@impan.pl;nhellmer@impan.pl;stettner@impan.pl;rafal.topolnicki@impan....

展开>> 收起<<
Topology-Driven Goodness-of-Fit Tests in Arbitrary Dimensions Pawe l D lotko1 Niklas Hellmer1 Lukasz Stettner1 Rafa l Topolnicki1.pdf

共26页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:26 页 大小:1.39MB 格式:PDF 时间:2025-05-06

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 26
客服
关注