Nonparametric testing via partial sorting
K. Bisewski, H.M. Jansen, Y. Nazarathy
Abstract
In this paper we introduce the idea of partially sorting data to design nonparametric
tests. This approach gives rise to tests that are sensitive to both the order and the
underlying distribution of the data. We focus in particular on a test that uses the
bubble sort algorithm to partially sort the data. We show that a function of the data,
referred to as the empirical bubble sort curve, converges uniformly to a limiting curve.
We define a goodness-of-fit test based on the distance between the empirical curve and
its limit. The asymptotic distribution of the test statistic is a generalization of the
Kolmogorov distribution. We apply the test in several examples and observe that it
outperforms classical nonparametric tests for appropriately chosen sorting levels.
Keywords: Nonparametric statistics; goodness-of-fit; Brownian bridge; partial sorting.
AMS MSC 2010: Primary 62G10, 62G30; Secondary 62G20, 60F99.
1 Introduction
Determining whether a data set constitutes a random sample from a given distribution is one of the central problems in nonparametric statistics. A common procedure is the standard Kolmogorov–Smirnov (KS) test for real-valued observations $X_1, \ldots, X_n$. The KS test considers the hypotheses
$$H_0 \colon X_i \overset{\text{iid}}{\sim} F_0, \qquad \text{vs.} \qquad H_1 \colon \text{otherwise}. \tag{1.1}$$
Here $F_0$ is some distribution specified as part of $H_0$. The procedure computes the KS test statistic
$$\widehat{D}_n = \sqrt{n} \sup_{x \in \mathbb{R}} \bigl| \widehat{F}_n(x) - F_0(x) \bigr|, \qquad \text{with} \qquad \widehat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},$$
where $F_0$ is the cdf and $\widehat{F}_n$ is the empirical cumulative distribution function (ecdf).
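As a small numeric sketch of the statistic above ($\widehat{D}_n$), note that the supremum of $|\widehat{F}_n - F_0|$ is attained at the order statistics, so it can be computed by checking each sorted data point from both sides of the ecdf jump. This is our plain-Python illustration, not code from the paper:

```python
import math

def ks_statistic(xs, cdf):
    """Compute sqrt(n) * sup_x |F_n(x) - F0(x)| for a sample xs.

    The supremum is attained at the order statistics, so it suffices
    to check each sorted point from both sides of the ecdf jump.
    """
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        u = cdf(x)
        d = max(d, i / n - u, u - (i - 1) / n)
    return math.sqrt(n) * d

# Example: a small sample tested against the uniform(0, 1) cdf.
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
print(ks_statistic([0.1, 0.35, 0.5, 0.62, 0.9], uniform_cdf))
```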
The KS test does not prescribe or require any sorting of the observations. Even if $X_1, \ldots, X_n$ are fully or partially sorted, this does not affect the test statistic $\widehat{D}_n$. In other words, the KS test is insensitive to the order of the data. In this paper we generalize KS by introducing and studying a modified procedure that is based on partially sorting $X_1, \ldots, X_n$. The advantage of using partial sorting is that it makes the modified procedure sensitive to the order as well as the underlying distribution of the data. We study the properties of the modified procedure and provide examples in which KS is outperformed by our generalization. Before describing the main idea in more detail, let us review some aspects of KS.
For any continuous $F_0$ under $H_0$, as $n$ grows, the sequence $\widehat{D}_n$ converges in distribution to a random variable $D$ that follows the Kolmogorov distribution with cdf
$$\mathbb{P}(D \le x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 x^2}. \tag{1.2}$$
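Equation (1.2) is straightforward to evaluate numerically by truncating the alternating series; the terms decay like $e^{-2k^2x^2}$, so a modest truncation is accurate for any $x$ bounded away from zero. A minimal sketch (our illustration):

```python
import math

def kolmogorov_cdf(x, terms=100):
    """P(D <= x) from equation (1.2), truncating the alternating series.

    Terms decay like exp(-2 k^2 x^2), so a modest truncation is
    accurate for x bounded away from zero.
    """
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

# The 0.99 quantile of the Kolmogorov distribution is near 1.63:
print(kolmogorov_cdf(1.6276))
```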
University of Lausanne, Lausanne, Switzerland.
University College Roosevelt, Middelburg, the Netherlands.
The University of Queensland, St. Lucia, Queensland, Australia.
arXiv:2210.14546v1 [math.ST] 26 Oct 2022
Figure 1: iid normal data ($n = 1{,}000$): Increasing levels of partial sorting, from no sorted data (top left) to fully sorted data (bottom right). The scaled normal cdf is plotted in red.
The quantiles of the Kolmogorov distribution are easily computed numerically, so the $p$-value or the test decision for a pre-specified confidence level $\alpha$ can be easily determined. The distribution of $D$ is derived based on a Brownian bridge approximation for the process $\sqrt{n}\,\bigl(\widehat{F}_n(x) - F_0(x)\bigr)$. See for example [27, Sec. 19.3] for details, or the initial publications [20, 24, 25]. An overview of additional recent work related to KS tests is provided below.
Importantly, KS tests are not only used for testing the distribution $F_0$ but also for the independence stated under $H_0$, or lack thereof. As an extreme example where this independence certainly does not hold, assume that $X_1 = X_2 = \cdots = X_n$, with $X_1$ distributed according to $F_0$. In this case
$$\widehat{D}_n = \sqrt{n} \max\bigl(F_0(X_1),\, 1 - F_0(X_1)\bigr),$$
which is distributed uniformly in the range $[\sqrt{n}/2, \sqrt{n}]$. Now with a specified type I error of $\alpha$, the null hypothesis $H_0$ is rejected almost with certainty as $n$ grows. For example, with $\alpha = 0.01$, the $0.99$ quantile of the Kolmogorov distribution is approximately $1.63$. Then $H_0$ is already rejected with certainty for $n = 12$, since $\sqrt{12}/2 \approx 1.73$. This extreme example indicates that in general KS statistics may be used to test for independence. Similarly, even if the lack of independence is not as extreme as having all observations identical, a KS test may often reject $H_0$ if the observations exhibit a form of dependence. It is thus common to use KS tests for testing lack of independence.
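The extreme example above is easy to verify numerically. The following sketch (our illustration, with the 0.99 quantile value 1.63 hard-coded from standard tables) simulates $U = F_0(X_1)$ directly:

```python
import math
import random

random.seed(0)
n, reps = 12, 10_000

# With X_1 = ... = X_n, the ecdf jumps from 0 to 1 at X_1, so the KS
# statistic reduces to sqrt(n) * max(U, 1 - U), where U = F_0(X_1) is
# uniform(0, 1) under H_0.
stats = [math.sqrt(n) * max(u, 1 - u)
         for u in (random.random() for _ in range(reps))]

print(min(stats) >= math.sqrt(n) / 2)       # support starts at sqrt(12)/2 ≈ 1.73
print(sum(s > 1.63 for s in stats) / reps)  # rejection rate at alpha = 0.01
```

Since the support of the statistic starts at $\sqrt{12}/2 \approx 1.73 > 1.63$, every replication rejects.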
Another important statistical test, focusing specifically on independence, is the Wald–Wolfowitz (WW) test [28]. It is based on the runs statistic, which counts the number of times a data point above the median is followed by a data point below the median, or vice versa. While both KS and WW may be used to test for lack of independence, these two tests focus on different aspects of the data. The KS test takes the distribution of the data into account but is oblivious to its order. This means that it is sensitive to changes in the values of the data but not to changes in the order of the data. In contrast, the WW test relies on a statistic that takes the order of the data into account but is oblivious to the distribution.
This means that it is sensitive to changes in the order of the data but not to changes in the values of the data, especially if data points remain on the same side of the median.

Figure 2: iid uniform data ($\beta = 0.25$): The empirical bubble sort curve is the right frontier of the points. It converges to a deterministic curve as $n \to \infty$.
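The runs statistic that the WW test builds on can be sketched in a few lines (a minimal illustration of the counting rule only; the full WW test additionally uses a normal approximation for the number of runs):

```python
def runs_count(xs):
    """Count runs of the above/below-median sign sequence.

    Ties at the median are dropped; a new run starts whenever
    consecutive observations fall on opposite sides of the median.
    """
    med = sorted(xs)[len(xs) // 2]  # simple sample median (odd n)
    signs = [x > med for x in xs if x != med]
    if not signs:
        return 0
    return 1 + sum(a != b for a, b in zip(signs, signs[1:]))

print(runs_count([1, 5, 2, 6, 3, 7, 4]))  # alternating sides: 6 runs
print(runs_count([1, 2, 3, 5, 6, 7, 4]))  # nearly sorted: 2 runs
```

Alternating data produces many runs, while sorted or trending data produces few; both extremes are evidence against independence.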
Ideally, we would like to use a statistic that is sensitive to both aspects. For this we present a new nonparametric paradigm based on partial sorting. We believe that the ideas we present here may lead to a new suite of tests that can, in certain cases, detect a lack of independence better than current methods. We develop and analyze one such specific test in this paper. We also present examples of cases where partial sorting based tests are valuable.
The overarching idea of partial sorting based tests is to apply a sorting algorithm to the data without running the algorithm to completion. This leads to a partially sorted sample, say $X^\beta_1, \ldots, X^\beta_n$, where $\beta \in (0, 1]$ indicates the level of partial sorting and is referred to as the sorting level. In the extreme case of $\beta = 1$ the sample is fully sorted, while for lower values of $\beta$ the sample is less sorted. A test statistic is then computed based on the partially sorted sample $X^\beta_1, \ldots, X^\beta_n$. For the test statistic that we present here, there is a known asymptotic distribution under $H_0$ that is not influenced by $F_0$. Our results then yield a statistical test similar in nature to KS, but based on the partially sorted sample and often more sensitive to a lack of independence.
With $\beta < 1$ the partial sorting approach halts the sorting algorithm in mid-flight. Thus the specific sorting algorithm at hand plays a key role in the analysis: different sorting algorithms generate different probability laws for $X^\beta_1, \ldots, X^\beta_n$. In this paper our sole focus is on partial sorting with the classic and simple bubble sort algorithm, which is a strong contender for being the most intuitive sorting algorithm. See for example [19, Section 5.2.2] or [2] for a historical overview. As its name may suggest, bubble sort works by 'bubbling up' observations based on pairwise comparisons of adjacent values. Whenever adjacent observations are out of order, they are swapped. Each such pass over the data is called an iteration, and the algorithm applies $n$ iterations to yield a fully sorted sample. With bubble sort, a natural interpretation of $\beta$ is that the number of iterations performed is the nearest integer to $\beta n$.
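The partial sorting step described above can be sketched as follows (our minimal implementation of round($\beta n$) bubble sort passes; the function name is ours, not the paper's):

```python
def partial_bubble_sort(xs, beta):
    """Run round(beta * n) passes of bubble sort over a copy of xs.

    After k passes, the k largest observations occupy the top k
    positions in sorted order; the rest are only partially sorted.
    """
    a = list(xs)
    n = len(a)
    k = round(beta * n)
    for it in range(k):
        # positions n-1-it .. n-1 already hold the largest elements
        for j in range(n - 1 - it):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

print(partial_bubble_sort([5, 1, 4, 2, 8, 3], beta=0.5))  # → [1, 2, 3, 4, 5, 8]
```

With $\beta = 1$ this is ordinary bubble sort and the sample comes out fully sorted, matching the $\beta = 1$ case in the text.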
Our bubble sort based procedure generalizes the KS test, since it exactly agrees with KS if $\beta = 1$. Importantly, we show examples where using $\beta < 1$ outperforms both KS and WW. Our purpose with this paper is to show the viability of this method, together with a complete derivation of the properties of the associated stochastic processes needed to obtain the asymptotic distribution of the test statistic. In doing so, we derive a generalization of the Kolmogorov distribution, which we call the generalized Kolmogorov distribution, with random variable denoted $D^\beta$. Tabulation of the cdf and/or quantiles of $D^\beta$ is simply a matter of computation, similar to the Kolmogorov distribution, and like the Kolmogorov distribution, our generalized Kolmogorov distribution is not influenced by $F_0$. Further, the underlying stochastic process that yields the derivation of $D^\beta$ is based on a generalization of the Brownian bridge process used for KS. Our analysis includes a Glivenko–Cantelli like (strong law of large numbers) theorem for this process, which is followed by a second order analysis establishing convergence of probability measures.
To get a feel for the potential applicability of partial sorting with bubble sort, consider Figure 1, which is based on an iid sequence of $n = 1{,}000$ observations from a standard normal distribution. The figure presents the original data, followed by a display of the partially sorted data for $\beta = 0.25$, $\beta = 0.5$, and finally $\beta = 1$ (fully sorted). The nature of bubble sort is that with sorting level $\beta$ the highest valued $\beta n$ (rounded to an integer) observations are perfectly sorted in the respective top positions of the array, but the remaining $(1 - \beta) n$ observations are only partially sorted. Nevertheless, as is evident from the figure, a pattern emerges. Specifically, observe the right frontier in the cases of $\beta = 0.25$ and $\beta = 0.5$ and note that it follows a regular shape different from the cdf of the observations. We use this regular pattern to define the bubble sort curve in the next section and make use of it throughout the paper. In fact, we are able to characterize the bubble sort curve in terms of $F_0$ and $\beta$.
Our statistical method compares the bubble sort curve expected under $H_0$ with an empirical bubble sort curve obtained from the right frontier of the data. This operates in a manner that resembles the KS test. Remarkably, as in the case of the KS test, we are able to characterize the deviations between the empirical bubble sort curve and the theoretical curve under $H_0$. Specifically, as $n \to \infty$, the scaled deviations between the curves converge to a well defined Gaussian process. We use this process to develop a test statistic whose distribution under $H_0$ is agnostic to $F_0$ and thus directly generalizes the Kolmogorov distribution associated with the KS test.
To get a feel for the large sample asymptotics, consider Figure 2, which is based on samples of standard iid uniform observations, partially sorted with $\beta = 0.25$. The figure hints that as $n \to \infty$ the empirical bubble sort curve (the right frontier of the data) converges. In Theorem 3.1 we establish the associated first order convergence, and in Theorem 3.2 we study the fluctuations around this frontier and characterize a limiting stochastic process of these fluctuations. Finally, in Theorem 3.4 we compute the distribution of a statistic arising from the fluctuations, similar to the KS statistic.
Similarly to the analysis that surrounds the KS test, our analysis is based both on a law of large numbers type theorem and on a second order theorem dealing with convergence of probability measures. The former is an analogue of the celebrated Glivenko–Cantelli theorem; see [27, Ch. 19] and [12]. The latter is analogous to Donsker's theorem for cdfs; see [7]. Hence our work is an extension of the basic development of the asymptotics of the KS test as in [7], [8], and [9]. Specifically, we develop process level convergence that is similar in spirit to the process level convergence which appears in the analysis of the KS test.
A good historical account of Kolmogorov's original paper [20] can be found in [26]. Since that initial work, there have been dozens of extensions and generalizations, different in nature from our work here. We now mention a few notable publications in this space. In [21] the authors consider Rényi-type statistics (see [23]) and adapt them to create modified KS tests that are sensitive to tail alternatives. In [22] the authors continue the investigation of fitting tails of the distribution and study exact properties of the so-called $M_n$ statistics of Berk and Jones [4]. Related is the recent work [10] dealing with two-sample KS type tests using local levels. In [16] the authors consider goodness of fit tests via $\varphi$-divergences. Beyond these works, many other papers have explored variations and adaptations of KS tests; see the notable papers [6, 11, 13, 14, 15, 17, 18]. Nevertheless, to the best of our knowledge, the use of partial sorting to enhance goodness of fit procedures, as we present here, is new.
The remainder of this paper is structured as follows. In Section 2 we define the bubble sort algorithm and the goodness of fit test that uses the partially sorted sample. This also includes the definition of the generalized Kolmogorov distribution. In Section 3 the main theoretical results of this paper are presented. Then Sections 4, 5, and 6 contain the derivations and proofs of the results. In Section 4 we establish the law of large numbers type convergence results. In Section 5 we establish the weak convergence results. In Section 6 these results are assembled to obtain the distribution of $D^\beta$. In Section 7 we present numerical examples of our proposed procedure, and we conclude in Section 8.

Figure 3: Left: Densities of the generalized Kolmogorov distribution for several sorting levels (the case $\beta = 1$ is the Kolmogorov distribution). Right: Critical value quantiles as a function of the sorting level.
2 The Test Procedure
The bubble sort test considers the null hypothesis that the data consist of independent observations from a given continuous distribution $F_0$. The procedure starts by partially sorting the data and computing the ecdf of the running maximum of the partially sorted data. We call this ecdf the empirical bubble sort curve. Then, similarly to the KS test, insight into the stochastic behavior of the empirical bubble sort curve under $H_0$ allows us to compare it to the bubble sort curve associated with $F_0$ and $\beta$. The distance between these two curves yields the bubble sort statistic. Finally, we compare the value of this statistic to the quantiles of its theoretical distribution under $H_0$, which is a generalized Kolmogorov distribution that only depends on $\beta$ and not on the underlying distribution $F_0$. See Figure 3, where we present densities for a few selected $\beta$ values as well as common quantiles as a function of $\beta$. Note that at $\beta = 1$ the distribution is the Kolmogorov distribution (1.2), and as is evident from the quantile plots, for $\beta$ in the approximate range 0.7–1.0 the generalized Kolmogorov distributions are very close to the Kolmogorov distribution. This is because with bubble sort and such high $\beta$ values, while the sample is not guaranteed to be fully sorted, it almost always is.
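The construction just described (the ecdf of the running maximum of the partially sorted data) can be sketched as follows. This is our reading of the prose above; the paper's formal definitions are (2.1)–(2.2):

```python
def running_maximum(xs):
    """Running maximum of a (partially sorted) sample; its ecdf is the
    empirical bubble sort curve described in the text."""
    out, m = [], float("-inf")
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

def ecdf(values, x):
    """Empirical cdf of `values` evaluated at x."""
    return sum(v <= x for v in values) / len(values)

# A sample after a few bubble sort passes (largest values already on top):
partially_sorted = [0.2, 0.1, 0.5, 0.4, 0.7, 0.9]
rm = running_maximum(partially_sorted)
print(rm)  # → [0.2, 0.2, 0.5, 0.5, 0.7, 0.9]
print(ecdf(rm, 0.5))  # fraction of running-max values <= 0.5
```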
Steps 1–5 below describe the bubble sort test procedure for a data sample $x_1, \ldots, x_n$. Key objects in this description include the partially sorted sample (2.1), the empirical bubble sort curve (2.2), the limiting bubble sort curve (2.3), the bubble sort statistic (2.4), and the generalized Kolmogorov distribution (2.5). This distribution is defined via probabilities of the form (2.6) associated with Brownian bridges.