
If there are in total $M$ sentiments, then there are naturally $M$ distributions in the space
of documents, each distribution corresponding to a different sentiment. Let $V$ be the
vocabulary, i.e., the set of all possible words. Then, in this case, $Z = \{(v_1, \ldots, v_n) : v_i \in V,\, n \in \mathbb{N}\}$, with each element of $Z$ denoting a generic document. Given a specific data set
with $n_i$ documents conveying the $i$-th sentiment, for $i = 1, \ldots, M$, a data analyst may want
to know how different the $M$ sentiments are in this data set. Our procedure can quantify
this difference, and thereby suggest whether the different sentiments can be effectively
classified.
Often in practice, we have a distance between two objects in $Z$ which has already been
demonstrated to be useful in the domain area. For the first example above, a useful metric is
the dynamic time warping (DTW) distance between multi-dimensional time series [9,31], a
distance which can account for differences in speaking rates between speakers and can be
computed even when the two time series are discretized into different lengths. For the second
example above, one can use the Jaccard distance or other distances between documents [65].
In these situations, applying the usual Euclidean methods may require additional embedding
effort (see, e.g., [35] on document embedding). In contrast, it will be seen that our method
directly provides an easy, practical, and interpretable way to quantify the difference
between multiple distributions, as long as a metric is available on the space $Z$.
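To make these two metrics concrete, the following self-contained Python sketch implements a basic dynamic time warping distance (the standard quadratic-time dynamic program, without the refinements discussed in [9,31]) and the Jaccard distance between the word sets of two documents. The function names and toy inputs are ours, chosen purely for illustration; in practice one would typically rely on an optimized library implementation.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping (DTW) distance between two time series that
    may have different lengths; standard O(len(x) * len(y)) dynamic
    program with Euclidean ground cost between (multi-dimensional) samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # advance only in x
                                 cost[i, j - 1],      # advance only in y
                                 cost[i - 1, j - 1])  # advance in both
    return cost[n, m]


def jaccard_distance(doc_a, doc_b):
    """Jaccard distance between two documents given as word sequences."""
    a, b = set(doc_a), set(doc_b)
    if not a and not b:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(a & b) / len(a | b)


# A slower "speaker" repeating the same pattern still has DTW distance 0.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 1.0, 2.0, 2.0, 3.0]))   # 0.0
print(jaccard_distance("the movie was great".split(),
                       "the movie was terrible".split()))          # 0.4
```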
In this paper, given $n_i$ independent observations from $P_i$, $i = 1, \ldots, M$, we propose and
study an empirical estimator $\hat{\eta}$ of $\eta$, constructed using geometric graphs [21,10] (e.g., the
$k$-nearest neighbor ($k$-NN) graph for $k \geq 1$). The main contributions of the paper, and
some important properties of $\eta$ and $\hat{\eta}$, are summarized below:
1. We propose a nonparametric measure of multi-sample dissimilarity $\eta = \eta(P_1, \ldots, P_M) \in [0,1]$ that satisfies properties (i)–(iii) mentioned above (see Theorem 1). Moreover,
any value of $\eta$ between 0 and 1 conveys how different these distributions are. For
example, in a large class of location and scale families of distributions, $\eta$ increases
as the “difference” between the parameters gets larger (see Proposition 1). Our $\eta$
also satisfies many desirable properties commonly associated with $f$-divergences [20],
including the data processing inequality and invariance under bijective transformations
(Proposition 2), and joint convexity (Proposition 3). Indeed, $\eta$ is a member of
the class of multi-distribution $f$-divergences proposed by García-García and Williamson [30];
however, no estimation strategy was given in [30].
2. We develop an estimator $\hat{\eta}$ of $\eta$ which is consistent (Theorem 2), interpretable, easily
implementable, and computationally efficient. It can be computed in linear time
(up to logarithmic factors); this is an enormous reduction from the $O(n^2)$ complexity
of energy- and kernel-based methods [70,33], and from other approaches with $O(n^3)$ complexity
or NP-complete formulations [62,13,22,49,37]. We further show that $\hat{\eta}$ is a generalization
of the two-sample $k$-NN statistic proposed by Schilling [63] and Henze [40]
(see Remark 2). The multi-sample test (1) based on $\hat{\eta}$ is also consistent against all
alternatives for which $P_i \neq P_j$ for some $1 \leq i < j \leq M$ (Corollary 2).
3. The asymptotic distribution of $\hat{\eta}$ is Gaussian, which yields easy-to-use asymptotic
tests. Under the null hypothesis (1), the permutation distribution$^1$ and the unconditional
distribution of $\hat{\eta}$ are both asymptotically normal (Theorem 3). Further, when
$^1$Given the pooled sample without the information on their sample identities, any permutation of the
sample identities is equally likely under the null hypothesis (1). This conditional distribution is referred to
as the permutation distribution in this paper.