Nonparametric testing via partial sorting
K. Bisewski, H.M. Jansen, Y. Nazarathy
Abstract
In this paper we introduce the idea of partially sorting data to design nonparametric
tests. This approach gives rise to tests that are sensitive to both the order and the
underlying distribution of the data. We focus in particular on a test that uses the
bubble sort algorithm to partially sort the data. We show that a function of the data,
referred to as the empirical bubble sort curve, converges uniformly to a limiting curve.
We define a goodness-of-fit test based on the distance between the empirical curve and
its limit. The asymptotic distribution of the test statistic is a generalization of the
Kolmogorov distribution. We apply the test in several examples and observe that it
outperforms classical nonparametric tests for appropriately chosen sorting levels.
Keywords: Nonparametric statistics; goodness-of-fit; Brownian bridge; partial sorting.
AMS MSC 2010: Primary 62G10, 62G30; Secondary 62G20, 60F99.
1 Introduction
Determining whether a data set constitutes a random sample from a given distribution is one of the central problems in nonparametric statistics. A common procedure is the standard Kolmogorov–Smirnov (KS) test for real-valued observations $X_1, \ldots, X_n$. The KS test considers the hypotheses
$$H_0 \colon X_i \overset{\text{iid}}{\sim} F_0, \qquad \text{vs.} \qquad H_1 \colon \text{otherwise}. \tag{1.1}$$
Here $F_0$ is some distribution specified as part of $H_0$. The procedure computes the KS test statistic
$$\widehat{D}_n = \sqrt{n} \sup_{x \in \mathbb{R}} \bigl| \widehat{F}_n(x) - F_0(x) \bigr|, \qquad \text{with} \qquad \widehat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},$$
where $F_0$ is the cdf and $\widehat{F}_n$ is the empirical cumulative distribution function (ecdf).
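As a small numeric sketch of the statistic above ($\widehat{D}_n$), note that the supremum of $|\widehat{F}_n - F_0|$ is attained at the order statistics, so it can be computed by checking each sorted data point from both sides of the ecdf jump. This is our plain-Python illustration, not code from the paper:

```python
import math

def ks_statistic(xs, cdf):
    """Compute sqrt(n) * sup_x |F_n(x) - F0(x)| for a sample xs.

    The supremum is attained at the order statistics, so it suffices
    to check each sorted point from both sides of the ecdf jump.
    """
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        u = cdf(x)
        d = max(d, i / n - u, u - (i - 1) / n)
    return math.sqrt(n) * d

# Example: a small sample tested against the uniform(0, 1) cdf.
uniform_cdf = lambda x: min(max(x, 0.0), 1.0)
print(ks_statistic([0.1, 0.35, 0.5, 0.62, 0.9], uniform_cdf))
```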
The KS test does not prescribe or require any sorting of the observations. Even if $X_1, \ldots, X_n$ are fully or partially sorted, this does not affect the test statistic $\widehat{D}_n$. In other words, the KS test is insensitive to the order of the data. In this paper we generalize KS by introducing and studying a modified procedure that is based on partially sorting $X_1, \ldots, X_n$. The advantage of using partial sorting is that it makes the modified procedure sensitive to the order as well as the underlying distribution of the data. We study the properties of the modified procedure and provide examples in which KS is outperformed by our generalization. Before describing the main idea in more detail, let us review some aspects of KS.
For any continuous $F_0$ under $H_0$, as $n$ grows, the sequence $\widehat{D}_n$ converges in distribution to a random variable $D$ that follows the Kolmogorov distribution with cdf
$$\mathbb{P}(D \le x) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 x^2}. \tag{1.2}$$
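Equation (1.2) is straightforward to evaluate numerically by truncating the alternating series; the terms decay like $e^{-2k^2x^2}$, so a modest truncation is accurate for any $x$ bounded away from zero. A minimal sketch (our illustration):

```python
import math

def kolmogorov_cdf(x, terms=100):
    """P(D <= x) from equation (1.2), truncating the alternating series.

    Terms decay like exp(-2 k^2 x^2), so a modest truncation is
    accurate for x bounded away from zero.
    """
    if x <= 0:
        return 0.0
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * x * x)
            for k in range(1, terms + 1))
    return 1.0 - 2.0 * s

# The 0.99 quantile of the Kolmogorov distribution is near 1.63:
print(kolmogorov_cdf(1.6276))
```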
University of Lausanne, Lausanne, Switzerland.
University College Roosevelt, Middelburg, the Netherlands.
The University of Queensland, St. Lucia, Queensland, Australia.
arXiv:2210.14546v1 [math.ST] 26 Oct 2022
Figure 1: iid normal data ($n = 1{,}000$): Increasing levels of partial sorting, from no sorted data (top left) to fully sorted data (bottom right). The scaled normal cdf is plotted in red.
The quantiles of the Kolmogorov distribution are easily computed numerically, so the $p$-value or the test decision for a pre-specified confidence level $\alpha$ can be easily determined. The distribution of $D$ is derived based on a Brownian bridge approximation for the process $\sqrt{n}\,\bigl(\widehat{F}_n(x) - F_0(x)\bigr)$. See for example [27, Sec. 19.3] for details, or the initial publications [20, 24, 25]. An overview of additional recent work related to KS tests is provided below.
Importantly, KS tests are not only used for testing the distribution $F_0$ but also for the independence stated under $H_0$, or lack thereof. As an extreme example where this independence certainly does not hold, assume that $X_1 = X_2 = \cdots = X_n$, with $X_1$ distributed according to $F_0$. In this case
$$\widehat{D}_n = \sqrt{n} \max\bigl(F_0(X_1),\, 1 - F_0(X_1)\bigr),$$
which is distributed uniformly in the range $[\sqrt{n}/2, \sqrt{n}]$. Now with a specified type I error of $\alpha$, the null hypothesis $H_0$ is rejected almost with certainty as $n$ grows. For example, with $\alpha = 0.01$, the $0.99$ quantile of the Kolmogorov distribution is approximately $1.63$. Then $H_0$ is already rejected with certainty for $n = 12$, since $\sqrt{12}/2 \approx 1.73$. This extreme example indicates that in general KS statistics may be used to test for independence. Similarly, even if the lack of independence is not as extreme as having all observations identical, a KS test may often reject $H_0$ if the observations exhibit a form of dependence. It is thus common to use KS tests for testing lack of independence.
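The extreme example above is easy to verify numerically. The following sketch (our illustration, with the 0.99 quantile value 1.63 hard-coded from standard tables) simulates $U = F_0(X_1)$ directly:

```python
import math
import random

random.seed(0)
n, reps = 12, 10_000

# With X_1 = ... = X_n, the ecdf jumps from 0 to 1 at X_1, so the KS
# statistic reduces to sqrt(n) * max(U, 1 - U), where U = F_0(X_1) is
# uniform(0, 1) under H_0.
stats = [math.sqrt(n) * max(u, 1 - u)
         for u in (random.random() for _ in range(reps))]

print(min(stats) >= math.sqrt(n) / 2)       # support starts at sqrt(12)/2 ≈ 1.73
print(sum(s > 1.63 for s in stats) / reps)  # rejection rate at alpha = 0.01
```

Since the support of the statistic starts at $\sqrt{12}/2 \approx 1.73 > 1.63$, every replication rejects.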
Another important statistical test, focusing specifically on independence, is the Wald–Wolfowitz (WW) test [28]. It is based on the runs statistic, which counts the number of times a data point above the median is followed by a data point below the median, or vice versa. While both KS and WW may be used to test for lack of independence, these two tests focus on different aspects of the data. The KS test takes the distribution of the data into account but is oblivious to its order. This means that it is sensitive to changes in the values of the data but not to changes in the order of the data. In contrast, the WW test relies on a statistic that takes the order of the data into account but is oblivious to the distribution.
This means that it is sensitive to changes in the order of the data but not to changes in the values of the data, especially if data points remain on the same side of the median.

Figure 2: iid uniform data ($\beta = 0.25$): The empirical bubble sort curve is the right frontier of the points. It converges to a deterministic curve as $n \to \infty$.
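The runs statistic that the WW test builds on can be sketched in a few lines (a minimal illustration of the counting rule only; the full WW test additionally uses a normal approximation for the number of runs):

```python
def runs_count(xs):
    """Count runs of the above/below-median sign sequence.

    Ties at the median are dropped; a new run starts whenever
    consecutive observations fall on opposite sides of the median.
    """
    med = sorted(xs)[len(xs) // 2]  # simple sample median (odd n)
    signs = [x > med for x in xs if x != med]
    if not signs:
        return 0
    return 1 + sum(a != b for a, b in zip(signs, signs[1:]))

print(runs_count([1, 5, 2, 6, 3, 7, 4]))  # alternating sides: 6 runs
print(runs_count([1, 2, 3, 5, 6, 7, 4]))  # nearly sorted: 2 runs
```

Alternating data produces many runs, while sorted or trending data produces few; both extremes are evidence against independence.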
Ideally, we would like to use a statistic that is sensitive to both aspects. For this we present a new nonparametric paradigm based on partial sorting. We believe that the ideas we present here may lead to a new suite of tests that can, in certain cases, detect a lack of independence better than current methods. We develop and analyze one such specific test in this paper. We also present examples of cases where partial sorting based tests are valuable.
The overarching idea of partial sorting based tests is to apply a sorting algorithm to the data without running the algorithm to completion. This leads to a partially sorted sample, say $X^\beta_1, \ldots, X^\beta_n$, where $\beta \in (0, 1]$ indicates the level of partial sorting and is referred to as the sorting level. In the extreme case of $\beta = 1$ the sample is fully sorted, while for lower values of $\beta$ the sample is less sorted. A test statistic is then computed based on the partially sorted sample $X^\beta_1, \ldots, X^\beta_n$. For the test statistic that we present here, there is a known asymptotic distribution under $H_0$ that is not influenced by $F_0$. Our results then yield a statistical test similar in nature to KS, but based on the partially sorted sample and often more sensitive to a lack of independence.
With $\beta < 1$ the partial sorting approach halts the sorting algorithm in mid-flight. Thus the specific sorting algorithm at hand plays a key role in the analysis: different sorting algorithms generate different probability laws for $X^\beta_1, \ldots, X^\beta_n$. In this paper our sole focus is on partial sorting with the classic and simple bubble sort algorithm, which is a strong contender for being the most intuitive sorting algorithm. See for example [19, Section 5.2.2] or [2] for a historical overview. As its name may suggest, bubble sort works by 'bubbling up' observations based on pairwise comparisons of adjacent values. Whenever adjacent observations are out of order, they are swapped. Each such pass over the data is called an iteration, and the algorithm applies $n$ iterations to yield a fully sorted sample. With bubble sort, a natural interpretation of $\beta$ is that the number of iterations performed is the nearest integer to $\beta n$.
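The partial sorting step described above can be sketched as follows (our minimal implementation of round($\beta n$) bubble sort passes; the function name is ours, not the paper's):

```python
def partial_bubble_sort(xs, beta):
    """Run round(beta * n) passes of bubble sort over a copy of xs.

    After k passes, the k largest observations occupy the top k
    positions in sorted order; the rest are only partially sorted.
    """
    a = list(xs)
    n = len(a)
    k = round(beta * n)
    for it in range(k):
        # positions n-1-it .. n-1 already hold the largest elements
        for j in range(n - 1 - it):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

print(partial_bubble_sort([5, 1, 4, 2, 8, 3], beta=0.5))  # → [1, 2, 3, 4, 5, 8]
```

With $\beta = 1$ this is ordinary bubble sort and the sample comes out fully sorted, matching the $\beta = 1$ case in the text.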
Our bubble sort based procedure generalizes the KS test, since it exactly agrees with KS if $\beta = 1$. Importantly, we show examples where using $\beta < 1$ outperforms both KS and WW. Our purpose with this paper is to show the viability of this method, together with a complete derivation of the properties of the associated stochastic processes needed to obtain the asymptotic distribution of the test statistic. In doing so, we derive a generalization of the Kolmogorov distribution, which we call the generalized Kolmogorov distribution, with random variable denoted $D^\beta$. Tabulation of the cdf and/or quantiles of $D^\beta$ is simply a matter of computation, similar to the Kolmogorov distribution, and like the Kolmogorov distribution, our generalized Kolmogorov distribution is not influenced by $F_0$. Further, the underlying stochastic process that yields the derivation of $D^\beta$ is based on a generalization of the Brownian bridge process used for KS. Our analysis includes a Glivenko–Cantelli like (strong law of large numbers) theorem for this process, which is followed by a second order analysis establishing convergence of probability measures.
To get a feel for the potential applicability of partial sorting with bubble sort, consider Figure 1, which is based on an iid sequence of $n = 1{,}000$ observations from a standard normal distribution. The figure presents the original data, followed by a display of the partially sorted data for $\beta = 0.25$, $\beta = 0.5$, and finally $\beta = 1$ (fully sorted). The nature of bubble sort is that with sorting level $\beta$ the highest valued $\beta n$ (rounded to an integer) observations are perfectly sorted in the respective top positions of the array, but the remaining $(1 - \beta) n$ observations are only partially sorted. Nevertheless, as is evident from the figure, a pattern emerges. Specifically, observe the right frontier in the cases of $\beta = 0.25$ and $\beta = 0.5$ and note that it follows a regular shape different from the cdf of the observations. We use this regular pattern to define the bubble sort curve in the next section and make use of it throughout the paper. In fact, we are able to characterize the bubble sort curve in terms of $F_0$ and $\beta$.
Our statistical method compares the bubble sort curve expected under $H_0$ with an empirical bubble sort curve obtained from the right frontier of the data. This operates in a manner that resembles the KS test. Remarkably, as in the case of the KS test, we are able to characterize the deviations between the empirical bubble sort curve and the theoretical curve under $H_0$. Specifically, as $n \to \infty$, the scaled deviations between the curves converge to a well defined Gaussian process. We use this process to develop a test statistic whose distribution under $H_0$ is agnostic to $F_0$ and thus directly generalizes the Kolmogorov distribution associated with the KS test.
To get a feel for the large sample asymptotics, consider Figure 2, which is based on samples of standard iid uniform observations, partially sorted with $\beta = 0.25$. The figure hints that as $n \to \infty$ the empirical bubble sort curve (the right frontier of the data) converges. In Theorem 3.1 we establish the associated first order convergence, and in Theorem 3.2 we study the fluctuations around this frontier and characterize a limiting stochastic process of these fluctuations. Finally, in Theorem 3.4 we compute the distribution of a statistic arising from the fluctuations, similar to the KS statistic.
Similarly to the analysis that surrounds the KS test, our analysis is based both on a law of large numbers type theorem and on a second order theorem dealing with convergence of probability measures. The former is an analogue of the celebrated Glivenko–Cantelli theorem; see [27, Ch. 19] and [12]. The latter is analogous to Donsker's theorem for cdfs; see [7]. Hence our work is an extension of the basic development of the asymptotics of the KS test as in [7], [8], and [9]. Specifically, we develop process level convergence that is similar in spirit to the process level convergence which appears in the analysis of the KS test.
A good historical account of Kolmogorov's original paper [20] can be found in [26]. Since that initial work, there have been dozens of extensions and generalizations, different in nature from our work here. We now mention a few notable publications in this space. In [21] the authors consider Rényi-type statistics (see [23]) and adapt them to create modified KS tests that are sensitive to tail alternatives. In [22] the authors continue the investigation of fitting tails of the distribution and study exact properties of the so-called $M_n$ statistics of Berk and Jones [4]. Related is the recent work [10] dealing with two-sample KS type tests using local levels. In [16] the authors consider goodness of fit tests via $\varphi$-divergences. Beyond these works, many other papers have explored variations and adaptations of KS tests; see the notable papers [6, 11, 13, 14, 15, 17, 18]. Nevertheless, to the best of our knowledge, the use of partial sorting to enhance goodness of fit procedures, as we present here, is new.
The remainder of this paper is structured as follows. In Section 2 we define the bubble sort algorithm and the goodness of fit test that uses the partially sorted sample. This also includes the definition of the generalized Kolmogorov distribution. In Section 3 the main theoretical results of this paper are presented. Then Sections 4, 5, and 6 contain the derivations and proofs of the results. In Section 4 we establish the law of large numbers type convergence results. In Section 5 we establish the weak convergence results. In Section 6 these results are assembled to obtain the distribution of $D^\beta$. In Section 7 we present numerical examples of our proposed procedure, and we conclude in Section 8.

Figure 3: Left: Densities of the generalized Kolmogorov distribution for several sorting levels (the case $\beta = 1$ is the Kolmogorov distribution). Right: Critical value quantiles as a function of the sorting level.
2 The Test Procedure
The bubble sort test considers the null hypothesis that the data consist of independent observations from a given continuous distribution $F_0$. The procedure starts by partially sorting the data and computing the ecdf of the running maximum of the partially sorted data. We call this ecdf the empirical bubble sort curve. Then, similarly to the KS test, insight into the stochastic behavior of the empirical bubble sort curve under $H_0$ allows us to compare it to the bubble sort curve associated with $F_0$ and $\beta$. The distance between these two curves yields the bubble sort statistic. Finally, we compare the value of this statistic to the quantiles of its theoretical distribution under $H_0$, which is a generalized Kolmogorov distribution that only depends on $\beta$ and not on the underlying distribution $F_0$. See Figure 3, where we present densities for a few selected $\beta$ values as well as common quantiles as a function of $\beta$. Note that at $\beta = 1$ the distribution is the Kolmogorov distribution (1.2), and as is evident from the quantile plots, for $\beta$ in the approximate range 0.7–1.0 the generalized Kolmogorov distributions are very close to the Kolmogorov distribution. This is because with bubble sort and such high $\beta$ values, while the sample is not guaranteed to be fully sorted, it almost always is.
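The construction just described (the ecdf of the running maximum of the partially sorted data) can be sketched as follows. This is our reading of the prose above; the paper's formal definitions are (2.1)–(2.2):

```python
def running_maximum(xs):
    """Running maximum of a (partially sorted) sample; its ecdf is the
    empirical bubble sort curve described in the text."""
    out, m = [], float("-inf")
    for x in xs:
        m = max(m, x)
        out.append(m)
    return out

def ecdf(values, x):
    """Empirical cdf of `values` evaluated at x."""
    return sum(v <= x for v in values) / len(values)

# A sample after a few bubble sort passes (largest values already on top):
partially_sorted = [0.2, 0.1, 0.5, 0.4, 0.7, 0.9]
rm = running_maximum(partially_sorted)
print(rm)  # → [0.2, 0.2, 0.5, 0.5, 0.7, 0.9]
print(ecdf(rm, 0.5))  # fraction of running-max values <= 0.5
```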
Steps 1–5 below describe the bubble sort test procedure for a data sample $x_1, \ldots, x_n$. Key objects in this description include the partially sorted sample (2.1), the empirical bubble sort curve (2.2), the limiting bubble sort curve (2.3), the bubble sort statistic (2.4), and the generalized Kolmogorov distribution (2.5). This distribution is defined via probabilities of the form (2.6) associated with Brownian bridges.