A Kernel Measure of Dissimilarity between M
Distributions
Zhen Huang
Department of Statistics, Columbia University
e-mail: zh2395@columbia.edu
and
Bodhisattva Sen*
Department of Statistics, Columbia University
e-mail: bodhi@stat.columbia.edu
October 18, 2022
Abstract
Given $M \geq 2$ distributions defined on a general measurable space, we introduce a nonparametric (kernel) measure of multi-sample dissimilarity (KMD), a parameter that quantifies the difference between the $M$ distributions. The population KMD, which takes values between 0 and 1, is 0 if and only if all the $M$ distributions are the same, and 1 if and only if all the distributions are mutually singular. Moreover, KMD possesses many properties commonly associated with $f$-divergences, such as the data processing inequality and invariance under bijective transformations. The sample estimate of KMD, based on independent observations from the $M$ distributions, can be computed in near linear time (up to logarithmic factors) using $k$-nearest neighbor graphs (for fixed $k \geq 1$). We develop an easily implementable test for the equality of the $M$ distributions based on the sample KMD that is consistent against all alternatives where at least two distributions are not equal. We prove central limit theorems for the sample KMD, and provide a complete characterization of the asymptotic power of the test, as well as its detection threshold. The usefulness of our measure is demonstrated via real and synthetic data examples; our method is also implemented in an R package.
Keywords: Asymptotic power behavior, detection threshold, $k$-nearest neighbor graph, multi-distribution $f$-divergence, nonparametric test for equality of distributions
1 Introduction
Suppose that $\mathcal{Z}$ is a general measurable space, and we have $M$ distributions $P_1, \ldots, P_M$ on $\mathcal{Z}$ from which we observe independent samples. A natural statistical question to ask here is: "Are these $M$ distributions the same? If they are not the same, then how different are they?". In this paper, we answer these questions by proposing a nonparametric measure that quantifies the differences between the multiple samples. To rephrase this in statistical language, we define a measure $\eta \equiv \eta(P_1, \ldots, P_M)$ such that:
(i) $\eta$ is a deterministic number in $[0, 1]$;
*Supported by NSF grant DMS-2015376.
[Figure 1: four panels ("Female 1 speaking 1", "Female 2 speaking 1", "Male 1 speaking 1", "Male 1 speaking 2"), each plotting MFCC value against time, with one curve per coefficient 1-13.]
Figure 1: Four instances of spoken Arabic digits, each of which contains time series of 13 Mel Frequency Cepstrum Coefficients (MFCCs).
(ii) $\eta = 0$ if and only if $P_1 = \cdots = P_M$ (i.e., all the $M$ distributions are the same), and
(iii) $\eta = 1$ if and only if the $M$ distributions are mutually singular, i.e., there exist disjoint measurable sets $\{A_i\}_{i=1}^M$ such that $P_i(A_i) = 1$, for $i = 1, \ldots, M$. Thus, $\eta = 1$ indicates that the $M$ distributions are maximally different.
Moreover, any value of $\eta$ between 0 and 1 conveys an idea of how different these $M$ distributions are. We call our proposed $\eta$ a kernel measure of multi-sample dissimilarity (KMD), as its definition involves a positive semi-definite kernel matrix.
While there is a rich literature on the multi-sample testing problem
$$H_0: P_1 = \cdots = P_M \quad \text{against} \quad H_1: P_i \neq P_j \text{ for some } 1 \leq i < j \leq M, \tag{1}$$
(see Section 1.1 for a detailed review), most of these tests do not quantify to what extent the distributions differ when the null is violated. Moreover, many popular distances or similarities that quantify the difference between two distributions, such as the KL-divergence, the Hellinger distance, and other $f$-divergences [59, 38, 7], can be difficult to estimate when $\mathcal{Z}$ is not a Euclidean space. In modern statistical applications, it is quite often necessary to compare more than two distributions, and these distributions may be defined on a general measurable space. Two motivating applications in this direction are:
Example 1 (Multivariate functional data): In speech recognition, data may be viewed as functional inputs. For example, Figure 1 shows four instances of spoken Arabic digits from the ArabicDigits data set in Gorecki et al. [32], each of which contains time series of 13 Mel Frequency Cepstrum Coefficients (MFCCs). Hence, in this scenario, $\mathcal{Z}$ is the space of all 13-dimensional time series. A natural problem here is to distinguish the 10 spoken digits, which can be seen as $M = 10$ different distributions on $\mathcal{Z}$. Knowing how different these $M$ distributions are, on a scale from 0 to 1, provides a sense of how well any machine learning algorithm can perform in distinguishing the $M$ digits. One can also study how the spoken digits differ by gender. For example, how males speak the number 1 should differ from how males speak the number 2, but this difference should be smaller than that between how females speak 1 and how males speak 2. A statistical measure that quantifies the extent of these differences can be obtained by our procedure.
Example 2 (Distribution over documents): In sentiment analysis [48, 65], the goal is to identify the sentiment (e.g., positive, negative, neutral) of a document. These documents may each contain hundreds of words, and are typically of different lengths. For example, Maas et al. [48] considered 2000 movie reviews, each of which is either positive or negative. If there are in total $M$ sentiments, then there are naturally $M$ distributions on the space of documents, each distribution corresponding to a different sentiment. Let $\mathcal{V}$ be the vocabulary, i.e., the set of all possible words. Then, in this case, $\mathcal{Z} = \{(v_1, \ldots, v_n) : v_i \in \mathcal{V}, n \in \mathbb{N}\}$, with each element of $\mathcal{Z}$ denoting a generic document. Given a specific data set with $n_i$ documents conveying the $i$-th sentiment, for $i = 1, \ldots, M$, a data analyst may want to know how different the $M$ sentiments are in this data set. Our procedure can quantify this difference, and thus suggest whether it is possible to effectively classify the different sentiments.
Often in practice, we have a distance between two objects in $\mathcal{Z}$ that has already been demonstrated to be useful in the domain area. For the first example above, a useful metric is the dynamic time warping (DTW) distance between multi-dimensional time series [9, 31], a distance that can account for differences in speaking rates between speakers and can be computed even when the two time series are discretized into different lengths. For the second example above, one can use the Jaccard distance or other distances between documents [65]. In these situations, applying the usual Euclidean methods may require additional embedding efforts (see, e.g., [35] on document embedding). In contrast, it will be seen that our method directly provides an easy, practical, and interpretable way to quantify the difference between multiple distributions, as long as a metric is available on the space $\mathcal{Z}$.
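As a concrete illustration of such a domain metric, here is a minimal Python sketch of one common variant of the Jaccard distance, treating a document simply as its set of words (this representation, and the nonempty-document assumption, are ours purely for illustration). A pairwise distance matrix built from any such metric can be fed directly into the estimator sketched after the list of contributions below.

```python
def jaccard_distance(doc1, doc2):
    """Jaccard distance between two documents, each viewed as its set of words:
    d(A, B) = 1 - |A & B| / |A | B|  (assumes at least one document is nonempty)."""
    s1, s2 = set(doc1.split()), set(doc2.split())
    return 1.0 - len(s1 & s2) / len(s1 | s2)

# e.g., two short "reviews" sharing 2 of their 5 distinct words:
print(jaccard_distance("a great uplifting movie", "a dull movie"))  # 0.6
```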
In this paper, given $n_i$ independent observations from $P_i$, $i = 1, \ldots, M$, we propose and study an empirical estimator $\hat{\eta}$ of $\eta$, constructed using geometric graphs [21, 10] (e.g., the $k$-nearest neighbor ($k$-NN) graph for $k \geq 1$). The main contributions of the paper, and some important properties of $\eta$ and $\hat{\eta}$, are summarized below:
1. We propose a nonparametric measure of multi-sample dissimilarity $\eta = \eta(P_1, \ldots, P_M) \in [0, 1]$ that satisfies properties (i)-(iii) mentioned above (see Theorem 1). Moreover, any value of $\eta$ between 0 and 1 conveys an idea of how different these distributions are. For example, in a large class of location and scale families, $\eta$ increases as the "difference" between the parameters gets larger (see Proposition 1). Our $\eta$ also satisfies many desirable properties commonly associated with $f$-divergences [20], including the data processing inequality, invariance under bijective transformations (Proposition 2), and joint convexity (Proposition 3). Indeed, $\eta$ is a member of the class of multi-distribution $f$-divergences proposed in García-García and Williamson [30]; however, no estimation strategy was given in [30].
2. We develop an estimator $\hat{\eta}$ of $\eta$ which is consistent (Theorem 2), interpretable, easily implementable, and computationally efficient; a short illustrative sketch is given after this list. It can be computed in near linear time (up to logarithmic factors); this is an enormous reduction from the $O(n^2)$ complexity of energy and kernel based methods [70, 33], and other $O(n^3)$-complexity / NP-complete approaches [62, 13, 22, 49, 37]. We further show that $\hat{\eta}$ is a generalization of the two-sample statistic based on $k$-NN proposed in Schilling [63] and Henze [40] (see Remark 2). The multi-sample test of (1) based on $\hat{\eta}$ is also consistent against all alternatives for which $P_i \neq P_j$ for some $1 \leq i < j \leq M$ (Corollary 2).
3. The asymptotic distribution of $\hat{\eta}$ is Gaussian, which yields easy-to-use asymptotic tests. Under the null hypothesis in (1), the permutation distribution and the unconditional distribution of $\hat{\eta}$ are both asymptotically normal (Theorem 3). (Given the pooled sample without the information on the sample identities, any permutation of the sample identities is equally likely under the null hypothesis in (1); this conditional distribution is referred to as the permutation distribution in this paper.) Further, when $\mathcal{Z}$ is a Euclidean space, the asymptotic null distribution is distribution-free if the common distribution has a Lebesgue density (see Theorem 4 for a precise statement).
4. We further provide in Section 5 a complete characterization of the asymptotic power of the test of (1) based on $\hat{\eta}$ (see Theorem 5), and of its detection threshold, using the technique of Poissonization (cf. the recent paper Bhattacharya [11], where the detection threshold for a class of graph-based two-sample tests is studied). In particular, we show that, under both fixed alternatives and shrinking alternatives converging to the null, $\hat{\eta}$ has an asymptotically normal distribution after proper centering; see Appendix B (Theorems B.1 and B.2).
5. In Section 6, we demonstrate the usefulness of the proposed methodology via real and synthetic data examples. Our method is also implemented in an R package (see https://cran.r-project.org/package=KMD and https://github.com/zh2395/KMD).
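To make the estimator concrete, here is a minimal Python sketch of such a $k$-NN-based estimate (the authors' actual implementation is the R package cited in item 5; the precise definition of $\hat{\eta}$ appears in Section 3, so treat this only as an illustrative sketch). It assumes the discrete kernel $I(x = y)$ and a plug-in form patterned after the KMAc construction of Deb et al. [21]: it compares the fraction of $k$-NN pairs sharing a label against the same fraction for random pairs. It takes a precomputed distance matrix, so any metric on $\mathcal{Z}$ (e.g., DTW or Jaccard) can be used.

```python
import numpy as np

def kmd_hat(D, labels, k=1):
    """Illustrative plug-in KMD estimate with the discrete kernel I(x = y).

    D      : (n, n) matrix of pairwise distances on Z (any metric, e.g. DTW)
    labels : length-n integer array of sample identities in {0, ..., M-1}
    k      : number of nearest neighbors defining the geometric graph
    """
    labels = np.asarray(labels)
    n = labels.size
    D = np.array(D, dtype=float)                   # copy; caller's matrix untouched
    np.fill_diagonal(D, np.inf)                    # a point is never its own neighbor
    nbrs = np.argpartition(D, k, axis=1)[:, :k]    # indices of the k closest points
    # a: fraction of k-NN edges whose endpoints carry the same label
    a = np.mean(labels[nbrs] == labels[:, None])
    # b: probability that two distinct points from the pooled sample share a label
    counts = np.bincount(labels)
    b = np.sum(counts * (counts - 1)) / (n * (n - 1))
    # near 0 when labels are unrelated to positions; near 1 when every point's
    # neighbors share its label (mutually singular, well-separated samples)
    return (a - b) / (1 - b)
```

Note that a dense distance matrix makes this sketch $O(n^2)$; the near-linear running time mentioned above relies on fast $k$-NN search (e.g., $k$-d trees in Euclidean spaces), which the sketch trades away for metric generality.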
The outline of the paper is as follows. In Section 2, we formally define $\eta$ and investigate its properties. Its estimator $\hat{\eta}$ is studied in Section 3, along with its basic characteristics. In Section 4, we provide the asymptotic distribution of $\hat{\eta}$ under the null in (1). A rigorous study of the power behavior of the test based on $\hat{\eta}$, and of its detection threshold, is given in Section 5. Simulations and real data experiments are provided in Section 6. All the proofs of our main results, further discussions, additional numerical experiments, and more results on the behavior of $\hat{\eta}$ under the alternative are given in Appendices A-D.
1.1 Related Works
The $M$-sample testing problem (1) has been extensively studied in the statistics literature, both in parametric and nonparametric regimes. Parametric tests (e.g., the $t$-test, MANOVA, likelihood ratio tests, Wald tests) are provably powerful when the underlying model assumptions hold true (see, e.g., [44] and the references therein), but can have poor performance when the model is misspecified. In comparison, nonparametric methods are usually powerful against general alternatives under much more relaxed assumptions on the data distributions; this is the framework adopted in this paper.
There is a long history of nonparametric two-sample tests. In the one-dimensional case, classical well-known distribution-free tests include the Kolmogorov-Smirnov test [66] and the Wald-Wolfowitz run test [71]. In the multivariate setting, Friedman and Rafsky [28] proposed generalizations of the Wald-Wolfowitz test based on the minimum spanning tree of the pooled sample; its theoretical properties were further analyzed by Henze and Penrose [41]. Multivariate two-sample tests based on nearest neighbor ideas were proposed in [63, 40, 36]. Chen and Friedman [17] extended a number of two-sample tests based on type coincidences in a geometric graph. Bhattacharya [10] proposed a general asymptotic framework for studying graph-based two-sample tests; see also [11]. Apart from the above tests based on geometric graphs, there are two-sample tests based on data depth [46, 73]. In the past decade, energy statistics [70] and tests based on the maximum mean discrepancy (MMD) [33] from the kernel literature have also drawn great attention in two-sample testing, and their equivalence has been established [64]. There are also methods that achieve finite-sample distribution-freeness by using minimum non-bipartite matching [62, 49], the shortest Hamiltonian path [13], or multivariate ranks defined via optimal transport [22, 37].
When moving to general $M$-sample testing, Petrie [57] generalized a number of graph-theoretic tests to the multi-sample scenario. Recently, Mukherjee et al. [49] generalized the test in [62] based on minimum non-bipartite matching, retaining the exact distribution-freeness in finite samples. The energy statistic has also been used in Rizzo and Székely [61] for $M$-sample testing, providing a nonparametric extension of ANOVA. Liu and Singh [46] proposed an index $Q \in [0, 1]$ measuring the dissimilarity between two distributions, based on data depth, with the null being achieved when $Q = \frac{1}{2}$; however, $Q$ does not satisfy properties (i)-(iii). Although some $f$-divergences, such as the Hellinger distance [38] or the total variation distance, satisfy (i)-(iii), they cannot be easily extended beyond $M = 2$.
2 KMD: The Population Version
In this section, we define the population version of our measure of dissimilarity $\eta$ between the $M$ distributions $P_1, \ldots, P_M$. Our definition of $\eta$ involves the use of a reproducing kernel Hilbert space (RKHS) over the finite discrete space $\mathcal{S} := \{1, \ldots, M\}$, and hence we call our proposal the kernel measure of multi-sample dissimilarity (KMD). Although the discrete kernel $K(x, y) := I(x = y)$, for $x, y \in \mathcal{S}$, seems to be the most natural choice of kernel over the discrete space $\mathcal{S}$, our results are applicable to other kernels as well.
Reproducing Kernel Hilbert Space (RKHS): While there is a general theory of RKHSs on arbitrary spaces, the RKHS over a finite space is much simpler, and is introduced in the following. For an introduction to the theory of RKHSs and their applications in statistics, we refer the reader to [8, 69]. By a kernel function $K : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ we mean a symmetric and nonnegative definite function, i.e., the matrix $[K(i, j)]_{i,j=1}^M$ is positive semi-definite. The kernel $K(\cdot, \cdot)$ is said to be characteristic if for any $(\alpha_1, \ldots, \alpha_M) \neq (0, \ldots, 0)$ with $\sum_{i=1}^M \alpha_i = 0$, we have $\sum_{i,j=1}^M \alpha_i \alpha_j K(i, j) > 0$. Note that the usual definition of a characteristic kernel is through the uniqueness of the kernel mean embedding [67], but it is equivalent to this simpler definition when the space $\mathcal{S}$ is finite; see, e.g., [67, Section 4.4].
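Since $\mathcal{S}$ is finite, this criterion reduces to a small linear-algebra check. The following Python sketch (ours, purely illustrative) projects the kernel matrix onto the sum-zero subspace and tests whether the restricted quadratic form is positive definite; it confirms that the discrete kernel $I(x = y)$ is characteristic while a constant kernel is not.

```python
import numpy as np

def is_characteristic(K, tol=1e-10):
    """Check that a symmetric psd matrix K = [K(i, j)] on S = {1, ..., M} is
    characteristic: a' K a > 0 for every nonzero a with sum(a) = 0."""
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M   # orthogonal projector onto {a : sum(a) = 0}
    eig = np.sort(np.linalg.eigvalsh(H @ K @ H))
    # one eigenvalue is ~0 (the constant direction); all others must be positive
    return eig[1] > tol

print(is_characteristic(np.eye(4)))        # discrete kernel: True
print(is_characteristic(np.ones((4, 4))))  # constant kernel: False
```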
2.1 Definition of $\eta$
Suppose that we have $n_i$ independent observations $X_{i1}, \ldots, X_{in_i}$ from the distribution $P_i$ taking values in $\mathcal{Z}$, for $i = 1, \ldots, M$. Denote the pooled sample $\{X_{11}, \ldots, X_{1n_1}, \ldots, X_{M1}, \ldots, X_{Mn_M}\}$ by $\{Z_1, \ldots, Z_n\}$ with $n = n_1 + \cdots + n_M$, and the corresponding labels by $\Delta_1, \ldots, \Delta_n$, i.e., $\Delta_j = i \in \{1, \ldots, M\}$ if $Z_j$ comes from distribution $P_i$. If $\frac{n_i}{n} \to \pi_i \in (0, 1)$ such that $\sum_{i=1}^M \pi_i = 1$, then $\{(\Delta_i, Z_i)\}_{i=1}^n$ can be "approximately" thought of as an i.i.d. sample of size $n$ from $(\tilde{\Delta}, \tilde{Z})$ whose distribution is specified as follows:
1. $\mathbb{P}(\tilde{\Delta} = i) = \pi_i$, for $i = 1, \ldots, M$.
2. Given $\tilde{\Delta} = i$, $\tilde{Z}$ is drawn from the distribution $P_i$.
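Lemma 1 below identifies the two extremes of this mixture representation. As a quick numerical illustration (with Gaussian $P_i$ chosen by us purely for illustration), one can simulate $\{(\Delta_i, Z_i)\}$ and evaluate the kmd_hat sketch from Section 1: when all $P_i$ coincide, the label carries no information about the position and the estimate is near 0; when the $P_i$ are essentially disjoint, the label is (nearly) a function of the position and the estimate is near 1.

```python
import numpy as np
rng = np.random.default_rng(0)

# P(Delta = i) = pi_i; given Delta = i, Z ~ P_i  (P_i Gaussian here, for illustration)
pi = np.array([0.3, 0.3, 0.4])
delta = rng.choice(3, size=600, p=pi)

for mu in (np.zeros(3), np.array([0.0, 20.0, 40.0])):  # identical vs. well-separated P_i
    z = rng.normal(loc=mu[delta], scale=1.0)
    D = np.abs(z[:, None] - z[None, :])                 # |z_i - z_j| on Z = R
    print(kmd_hat(D, delta, k=1))                       # ~ 0, then ~ 1
```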
The following lemma (proved in Appendix C.1) is a crucial observation that will allow us to formally define $\eta$.
Lemma 1. The $M$ distributions $P_1, \ldots, P_M$ are the same if and only if $\tilde{\Delta} \perp\!\!\!\perp \tilde{Z}$, and the $M$ distributions are mutually singular if and only if $\tilde{\Delta}$ is a function of $\tilde{Z}$.
The connection between $M$-sample testing and measures of association between $\tilde{\Delta}$ and $\tilde{Z}$ has been noted before; see, e.g., [29, 50]. The above lemma motivates the use of a certain measure of association [16, 4, 21, 42] between $\tilde{\Delta}$ and $\tilde{Z}$ to quantify the dissimilarity between the $M$ distributions. In particular, we adopt ideas from the kernel measure of association (KMAc) proposed in Deb et al. [21]. Let $\mu$ be the distribution of $(\tilde{Z}, \tilde{\Delta})$