A Kernel Measure of Dissimilarity between M
Distributions
Zhen Huang
Department of Statistics, Columbia University
e-mail: zh2395@columbia.edu
and
Bodhisattva Sen*
Department of Statistics, Columbia University
e-mail: bodhi@stat.columbia.edu
October 18, 2022
Abstract
Given $M \geq 2$ distributions defined on a general measurable space, we introduce a nonparametric (kernel) measure of multi-sample dissimilarity (KMD), a parameter that quantifies the difference between the $M$ distributions. The population KMD, which takes values between 0 and 1, is 0 if and only if all the $M$ distributions are the same, and 1 if and only if all the distributions are mutually singular. Moreover, KMD possesses many properties commonly associated with $f$-divergences, such as the data processing inequality and invariance under bijective transformations. The sample estimate of KMD, based on independent observations from the $M$ distributions, can be computed in near linear time (up to logarithmic factors) using $k$-nearest neighbor graphs (for fixed $k \geq 1$). We develop an easily implementable test for the equality of the $M$ distributions based on the sample KMD that is consistent against all alternatives where at least two distributions are not equal. We prove central limit theorems for the sample KMD, and provide a complete characterization of the asymptotic power of the test, as well as its detection threshold. The usefulness of our measure is demonstrated via real and synthetic data examples; our method is also implemented in an R package.
Keywords: Asymptotic power behavior, detection threshold, $k$-nearest neighbor graph, multi-distribution $f$-divergence, nonparametric test for equality of distributions
1 Introduction
Suppose that $\mathcal{Z}$ is a general measurable space, and we have $M$ distributions $P_1, \ldots, P_M$ on $\mathcal{Z}$ from which we observe independent samples. A natural statistical question to ask here is: "Are these $M$ distributions the same? If they are not the same, then how different are they?". In this paper, we answer these questions by proposing a nonparametric measure that quantifies the differences between the multiple samples. To rephrase this in statistical language, we define a measure $\eta \equiv \eta(P_1, \ldots, P_M)$ such that:
(i) $\eta$ is a deterministic number in $[0, 1]$;
*Supported by NSF grant DMS-2015376.
[Figure 1: four panels ("Female 1 speaking 1", "Female 2 speaking 1", "Male 1 speaking 1", "Male 1 speaking 2"), each plotting MFCC value against time, with one curve per coefficient 1-13.]
Figure 1: Four instances of spoken Arabic digits, each of which contains time series of 13 Mel Frequency Cepstrum Coefficients (MFCCs).
(ii) $\eta = 0$ if and only if $P_1 = \cdots = P_M$ (i.e., all the $M$ distributions are the same), and
(iii) $\eta = 1$ if and only if the $M$ distributions are mutually singular, i.e., there exist disjoint measurable sets $\{A_i\}_{i=1}^M$ such that $P_i(A_i) = 1$, for $i = 1, \ldots, M$. Thus, $\eta = 1$ indicates that the $M$ distributions are maximally different.
Moreover, any value of $\eta$ between 0 and 1 conveys an idea of how different these $M$ distributions are. We call our proposed $\eta$ a kernel measure of multi-sample dissimilarity (KMD), as its definition involves a positive semi-definite kernel matrix.
While there is a rich literature on the multi-sample testing problem
$$H_0: P_1 = \cdots = P_M \quad \text{against} \quad H_1: P_i \neq P_j \text{ for some } 1 \leq i < j \leq M, \tag{1}$$
(see Section 1.1 for a detailed review), most of these tests do not quantify to what extent the distributions differ when the null is violated. Moreover, many popular distances or similarities that quantify the difference between two distributions, such as the KL-divergence, the Hellinger distance, and other $f$-divergences [59, 38, 7], can be difficult to estimate when $\mathcal{Z}$ is not a Euclidean space. In modern statistical applications, it is quite often necessary to compare more than two distributions, and these distributions may be defined on a general measurable space. Two motivating applications in this direction are:
Example 1 (Multivariate functional data): In speech recognition, data may be viewed as functional inputs. For example, Figure 1 shows four instances of spoken Arabic digits from the ArabicDigits data set in Gorecki et al. [32], each of which contains time series of 13 Mel Frequency Cepstrum Coefficients (MFCCs). Hence, in this scenario, $\mathcal{Z}$ is the space of all 13-dimensional time series. A natural problem here is to distinguish the 10 spoken digits, which can be seen as $M = 10$ different distributions on $\mathcal{Z}$. Knowing how different these $M$ distributions are, on a scale from 0 to 1, provides a sense of how well any machine learning algorithm can perform in distinguishing the $M$ digits. One can also study how the spoken digits differ by gender. For example, how males speak the number 1 should differ from how males speak the number 2, but this difference should be smaller than that between how females speak 1 and how males speak 2. A statistical measure that quantifies the extent of these differences can be obtained by our procedure.
Example 2 (Distribution over documents): In sentiment analysis [48, 65], the goal is to identify the sentiment (e.g., positive, negative, neutral) of a document. These documents may each contain hundreds of words, and are typically of different lengths. For example, Maas et al. [48] considered 2000 movie reviews, each of which is either positive or negative. If there are in total $M$ sentiments, then there are naturally $M$ distributions on the space of documents, each distribution corresponding to a different sentiment. Let $\mathcal{V}$ be the vocabulary, i.e., the set of all possible words. Then, in this case, $\mathcal{Z} = \{(v_1, \ldots, v_n) : v_i \in \mathcal{V}, n \in \mathbb{N}\}$, with each element of $\mathcal{Z}$ denoting a generic document. Given a specific data set with $n_i$ documents conveying the $i$-th sentiment, for $i = 1, \ldots, M$, a data analyst may want to know how different the $M$ sentiments are in this data set. Our procedure can quantify this difference, and thus suggest whether it is possible to effectively classify the different sentiments.
Often in practice, we have a distance between two objects in $\mathcal{Z}$ that has already been demonstrated to be useful in the domain area. For the first example above, a useful metric is the dynamic time warping (DTW) distance between multi-dimensional time series [9, 31], a distance that can account for differences in speaking rates between speakers and can be computed even when the two time series are discretized into different lengths. For the second example above, one can use the Jaccard distance or other distances between documents [65]. In these situations, applying the usual Euclidean methods may require additional embedding efforts (see, e.g., [35] on document embedding). In contrast, it will be seen that our method directly provides an easy, practical, and interpretable way to quantify the difference between multiple distributions, as long as a metric is available on the space $\mathcal{Z}$.
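As a concrete illustration of such a domain metric, here is a minimal Python sketch of one common variant of the Jaccard distance, treating a document simply as its set of words (this representation, and the nonempty-document assumption, are ours purely for illustration). A pairwise distance matrix built from any such metric can be fed directly into the estimator sketched after the list of contributions below.

```python
def jaccard_distance(doc1, doc2):
    """Jaccard distance between two documents, each viewed as its set of words:
    d(A, B) = 1 - |A & B| / |A | B|  (assumes at least one document is nonempty)."""
    s1, s2 = set(doc1.split()), set(doc2.split())
    return 1.0 - len(s1 & s2) / len(s1 | s2)

# e.g., two short "reviews" sharing 2 of their 5 distinct words:
print(jaccard_distance("a great uplifting movie", "a dull movie"))  # 0.6
```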
In this paper, given $n_i$ independent observations from $P_i$, $i = 1, \ldots, M$, we propose and study an empirical estimator $\hat{\eta}$ of $\eta$, constructed using geometric graphs [21, 10] (e.g., the $k$-nearest neighbor ($k$-NN) graph for $k \geq 1$). The main contributions of the paper, and some important properties of $\eta$ and $\hat{\eta}$, are summarized below:
1. We propose a nonparametric measure of multi-sample dissimilarity $\eta = \eta(P_1, \ldots, P_M) \in [0, 1]$ that satisfies properties (i)-(iii) mentioned above (see Theorem 1). Moreover, any value of $\eta$ between 0 and 1 conveys an idea of how different these distributions are. For example, in a large class of location and scale families, $\eta$ increases as the "difference" between the parameters gets larger (see Proposition 1). Our $\eta$ also satisfies many desirable properties commonly associated with $f$-divergences [20], including the data processing inequality, invariance under bijective transformations (Proposition 2), and joint convexity (Proposition 3). Indeed, $\eta$ is a member of the class of multi-distribution $f$-divergences proposed in García-García and Williamson [30]; however, no estimation strategy was given in [30].
2. We develop an estimator $\hat{\eta}$ of $\eta$ which is consistent (Theorem 2), interpretable, easily implementable, and computationally efficient; a short illustrative sketch is given after this list. It can be computed in near linear time (up to logarithmic factors); this is an enormous reduction from the $O(n^2)$ complexity of energy and kernel based methods [70, 33], and other $O(n^3)$-complexity / NP-complete approaches [62, 13, 22, 49, 37]. We further show that $\hat{\eta}$ is a generalization of the two-sample statistic based on $k$-NN proposed in Schilling [63] and Henze [40] (see Remark 2). The multi-sample test of (1) based on $\hat{\eta}$ is also consistent against all alternatives for which $P_i \neq P_j$ for some $1 \leq i < j \leq M$ (Corollary 2).
3. The asymptotic distribution of $\hat{\eta}$ is Gaussian, which yields easy-to-use asymptotic tests. Under the null hypothesis in (1), the permutation distribution and the unconditional distribution of $\hat{\eta}$ are both asymptotically normal (Theorem 3). (Given the pooled sample without the information on the sample identities, any permutation of the sample identities is equally likely under the null hypothesis in (1); this conditional distribution is referred to as the permutation distribution in this paper.) Further, when $\mathcal{Z}$ is a Euclidean space, the asymptotic null distribution is distribution-free if the common distribution has a Lebesgue density (see Theorem 4 for a precise statement).
4. We further provide in Section 5 a complete characterization of the asymptotic power of the test of (1) based on $\hat{\eta}$ (see Theorem 5), and of its detection threshold, using the technique of Poissonization (cf. the recent paper Bhattacharya [11], where the detection threshold for a class of graph-based two-sample tests is studied). In particular, we show that, under both fixed alternatives and shrinking alternatives converging to the null, $\hat{\eta}$ has an asymptotically normal distribution after proper centering; see Appendix B (Theorems B.1 and B.2).
5. In Section 6, we demonstrate the usefulness of the proposed methodology via real and synthetic data examples. Our method is also implemented in an R package (see https://cran.r-project.org/package=KMD and https://github.com/zh2395/KMD).
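To make the estimator concrete, here is a minimal Python sketch of such a $k$-NN-based estimate (the authors' actual implementation is the R package cited in item 5; the precise definition of $\hat{\eta}$ appears in Section 3, so treat this only as an illustrative sketch). It assumes the discrete kernel $I(x = y)$ and a plug-in form patterned after the KMAc construction of Deb et al. [21]: it compares the fraction of $k$-NN pairs sharing a label against the same fraction for random pairs. It takes a precomputed distance matrix, so any metric on $\mathcal{Z}$ (e.g., DTW or Jaccard) can be used.

```python
import numpy as np

def kmd_hat(D, labels, k=1):
    """Illustrative plug-in KMD estimate with the discrete kernel I(x = y).

    D      : (n, n) matrix of pairwise distances on Z (any metric, e.g. DTW)
    labels : length-n integer array of sample identities in {0, ..., M-1}
    k      : number of nearest neighbors defining the geometric graph
    """
    labels = np.asarray(labels)
    n = labels.size
    D = np.array(D, dtype=float)                   # copy; caller's matrix untouched
    np.fill_diagonal(D, np.inf)                    # a point is never its own neighbor
    nbrs = np.argpartition(D, k, axis=1)[:, :k]    # indices of the k closest points
    # a: fraction of k-NN edges whose endpoints carry the same label
    a = np.mean(labels[nbrs] == labels[:, None])
    # b: probability that two distinct points from the pooled sample share a label
    counts = np.bincount(labels)
    b = np.sum(counts * (counts - 1)) / (n * (n - 1))
    # near 0 when labels are unrelated to positions; near 1 when every point's
    # neighbors share its label (mutually singular, well-separated samples)
    return (a - b) / (1 - b)
```

Note that a dense distance matrix makes this sketch $O(n^2)$; the near-linear running time mentioned above relies on fast $k$-NN search (e.g., $k$-d trees in Euclidean spaces), which the sketch trades away for metric generality.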
The outline of the paper is as follows. In Section 2, we formally define $\eta$ and investigate its properties. Its estimator $\hat{\eta}$ is studied in Section 3, along with its basic characteristics. In Section 4, we provide the asymptotic distribution of $\hat{\eta}$ under the null in (1). A rigorous study of the power behavior of the test based on $\hat{\eta}$, and of its detection threshold, is given in Section 5. Simulations and real data experiments are provided in Section 6. All the proofs of our main results, further discussions, additional numerical experiments, and more results on the behavior of $\hat{\eta}$ under the alternative are given in Appendices A-D.
1.1 Related Works
The $M$-sample testing problem (1) has been extensively studied in the statistics literature, both in parametric and nonparametric regimes. Parametric tests (e.g., the $t$-test, MANOVA, likelihood ratio tests, Wald tests) are provably powerful when the underlying model assumptions hold true (see, e.g., [44] and the references therein), but can have poor performance when the model is misspecified. In comparison, nonparametric methods are usually powerful against general alternatives under much more relaxed assumptions on the data distributions; this is the framework adopted in this paper.
There is a long history of nonparametric two-sample tests. In the one-dimensional case, classical well-known distribution-free tests include the Kolmogorov-Smirnov test [66] and the Wald-Wolfowitz run test [71]. In the multivariate setting, Friedman and Rafsky [28] proposed generalizations of the Wald-Wolfowitz test based on the minimum spanning tree of the pooled sample; its theoretical properties were further analyzed by Henze and Penrose [41]. Multivariate two-sample tests based on nearest neighbor ideas were proposed in [63, 40, 36]. Chen and Friedman [17] extended a number of two-sample tests based on type coincidences in a geometric graph. Bhattacharya [10] proposed a general asymptotic framework for studying graph-based two-sample tests; see also [11]. Apart from the above tests based on geometric graphs, there are two-sample tests based on data depth [46, 73]. In the past decade, energy statistics [70] and tests based on the maximum mean discrepancy (MMD) [33] from the kernel literature have also drawn great attention in two-sample testing, and their equivalence has been established [64]. There are also methods that achieve finite-sample distribution-freeness by using minimum non-bipartite matching [62, 49], the shortest Hamiltonian path [13], or multivariate ranks defined via optimal transport [22, 37].
When moving to general $M$-sample testing, Petrie [57] generalized a number of graph-theoretic tests to the multi-sample scenario. Recently, Mukherjee et al. [49] generalized the test in [62] based on minimum non-bipartite matching, retaining the exact distribution-freeness in finite samples. The energy statistic has also been used in Rizzo and Székely [61] for $M$-sample testing, providing a nonparametric extension of ANOVA. Liu and Singh [46] proposed an index $Q \in [0, 1]$ measuring the dissimilarity between two distributions, based on data depth, with the null being achieved when $Q = \frac{1}{2}$; however, $Q$ does not satisfy properties (i)-(iii). Although some $f$-divergences, such as the Hellinger distance [38] or the total variation distance, satisfy (i)-(iii), they cannot be easily extended beyond $M = 2$.
2 KMD: The Population Version
In this section, we define the population version of our measure of dissimilarity $\eta$ between the $M$ distributions $P_1, \ldots, P_M$. Our definition of $\eta$ involves the use of a reproducing kernel Hilbert space (RKHS) over the finite discrete space $\mathcal{S} := \{1, \ldots, M\}$, and hence we call our proposal the kernel measure of multi-sample dissimilarity (KMD). Although the discrete kernel $K(x, y) := I(x = y)$, for $x, y \in \mathcal{S}$, seems to be the most natural choice of kernel over the discrete space $\mathcal{S}$, our results are applicable to other kernels as well.
Reproducing Kernel Hilbert Space (RKHS): While there is a general theory of RKHSs on arbitrary spaces, the RKHS over a finite space is much simpler, and is introduced in the following. For an introduction to the theory of RKHSs and their applications in statistics, we refer the reader to [8, 69]. By a kernel function $K : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ we mean a symmetric and nonnegative definite function, i.e., the matrix $[K(i, j)]_{i,j=1}^M$ is positive semi-definite. The kernel $K(\cdot, \cdot)$ is said to be characteristic if for any $(\alpha_1, \ldots, \alpha_M) \neq (0, \ldots, 0)$ with $\sum_{i=1}^M \alpha_i = 0$, we have $\sum_{i,j=1}^M \alpha_i \alpha_j K(i, j) > 0$. Note that the usual definition of a characteristic kernel is through the uniqueness of the kernel mean embedding [67], but it is equivalent to this simpler definition when the space $\mathcal{S}$ is finite; see, e.g., [67, Section 4.4].
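Since $\mathcal{S}$ is finite, this criterion reduces to a small linear-algebra check. The following Python sketch (ours, purely illustrative) projects the kernel matrix onto the sum-zero subspace and tests whether the restricted quadratic form is positive definite; it confirms that the discrete kernel $I(x = y)$ is characteristic while a constant kernel is not.

```python
import numpy as np

def is_characteristic(K, tol=1e-10):
    """Check that a symmetric psd matrix K = [K(i, j)] on S = {1, ..., M} is
    characteristic: a' K a > 0 for every nonzero a with sum(a) = 0."""
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M   # orthogonal projector onto {a : sum(a) = 0}
    eig = np.sort(np.linalg.eigvalsh(H @ K @ H))
    # one eigenvalue is ~0 (the constant direction); all others must be positive
    return eig[1] > tol

print(is_characteristic(np.eye(4)))        # discrete kernel: True
print(is_characteristic(np.ones((4, 4))))  # constant kernel: False
```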
2.1 Definition of $\eta$
Suppose that we have $n_i$ independent observations $X_{i1}, \ldots, X_{in_i}$ from the distribution $P_i$ taking values in $\mathcal{Z}$, for $i = 1, \ldots, M$. Denote the pooled sample $\{X_{11}, \ldots, X_{1n_1}, \ldots, X_{M1}, \ldots, X_{Mn_M}\}$ by $\{Z_1, \ldots, Z_n\}$ with $n = n_1 + \cdots + n_M$, and the corresponding labels by $\Delta_1, \ldots, \Delta_n$, i.e., $\Delta_j = i \in \{1, \ldots, M\}$ if $Z_j$ comes from distribution $P_i$. If $\frac{n_i}{n} \to \pi_i \in (0, 1)$ such that $\sum_{i=1}^M \pi_i = 1$, then $\{(\Delta_i, Z_i)\}_{i=1}^n$ can be "approximately" thought of as an i.i.d. sample of size $n$ from $(\tilde{\Delta}, \tilde{Z})$ whose distribution is specified as follows:
1. $\mathbb{P}(\tilde{\Delta} = i) = \pi_i$, for $i = 1, \ldots, M$.
2. Given $\tilde{\Delta} = i$, $\tilde{Z}$ is drawn from the distribution $P_i$.
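Lemma 1 below identifies the two extremes of this mixture representation. As a quick numerical illustration (with Gaussian $P_i$ chosen by us purely for illustration), one can simulate $\{(\Delta_i, Z_i)\}$ and evaluate the kmd_hat sketch from Section 1: when all $P_i$ coincide, the label carries no information about the position and the estimate is near 0; when the $P_i$ are essentially disjoint, the label is (nearly) a function of the position and the estimate is near 1.

```python
import numpy as np
rng = np.random.default_rng(0)

# P(Delta = i) = pi_i; given Delta = i, Z ~ P_i  (P_i Gaussian here, for illustration)
pi = np.array([0.3, 0.3, 0.4])
delta = rng.choice(3, size=600, p=pi)

for mu in (np.zeros(3), np.array([0.0, 20.0, 40.0])):  # identical vs. well-separated P_i
    z = rng.normal(loc=mu[delta], scale=1.0)
    D = np.abs(z[:, None] - z[None, :])                 # |z_i - z_j| on Z = R
    print(kmd_hat(D, delta, k=1))                       # ~ 0, then ~ 1
```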
The following lemma (proved in Appendix C.1) is a crucial observation that will allow us to formally define $\eta$.
Lemma 1. The $M$ distributions $P_1, \ldots, P_M$ are the same if and only if $\tilde{\Delta} \perp\!\!\!\perp \tilde{Z}$, and the $M$ distributions are mutually singular if and only if $\tilde{\Delta}$ is a function of $\tilde{Z}$.
The connection between $M$-sample testing and measures of association between $\tilde{\Delta}$ and $\tilde{Z}$ has been noted before; see, e.g., [29, 50]. The above lemma motivates the use of a certain measure of association [16, 4, 21, 42] between $\tilde{\Delta}$ and $\tilde{Z}$ to quantify the dissimilarity between the $M$ distributions. In particular, we adopt ideas from the kernel measure of association (KMAc) proposed in Deb et al. [21]. Let $\mu$ be the distribution of $(\tilde{Z}, \tilde{\Delta})$