A uniform kernel trick for high and infinite-dimensional two-sample problems
Javier Cárcamo^{a,∗}, Antonio Cuevas^b, Luis-Alberto Rodríguez^c
^a Euskal Herriko Unibertsitatea-Universidad del País Vasco, Spain
^b Universidad Autónoma de Madrid and Instituto de Ciencias Matemáticas ICMAT (CSIC-UAM-UCM-UC3M), Spain
^c Georg-August-Universität Göttingen, Germany
∗ Corresponding author. Email address: javier.carcamo@ehu.eus
Abstract
We use a suitable version of the so-called “kernel trick” to devise two-sample tests, especially focussed on high-dimensional and functional data. Our proposal entails a simplification of the practical problem of selecting an appropriate kernel function. Specifically, we apply a uniform variant of the kernel trick which involves the supremum within a class of kernel-based distances. We obtain the asymptotic distribution of the test statistic under the null and alternative hypotheses. The proofs rely on empirical processes theory, combined with the delta method and Hadamard directional differentiability techniques, and functional Karhunen-Loève-type expansions of the underlying processes. This methodology has some advantages over other standard approaches in the literature. We also give some experimental insight into the performance of our proposal compared to other kernel-based approaches (the original proposal by [10] and some variants based on splitting methods) as well as tests based on energy distances [41].
Keywords: Homogeneity test, Reproducing kernel Hilbert spaces, Supremum-like distances.
2020 MSC: 62G10, 62G20
1. Introduction: an overview
In this section we provide an extended summary including not only the main ideas of this work but, especially, the general setting, motivation and related literature, as well as the technical tools we use.
The kernel trick and some potential kernel traps
We focus on statistical problems where, essentially, the aim is to properly separate data coming from two different populations; this is the case of binary supervised classification and two-sample testing problems. In such situations, the kernel trick is a common paradigm. In a few words, the standard multivariate version (i.e., with data in R^d) of the kernel trick lies in separating the data in both populations using a symmetric non-negative definite “kernel function”. The values of the kernel can be seen as the inner product of transformed versions of the original observations in a different (usually higher-dimensional) space. It is expected that the groups can be better distinguished in the new final space; see [43].
We are particularly interested in those situations in which the available data are high-dimensional or even functional (thus, infinite-dimensional). In such cases, the strategy of mapping the data into a higher-dimensional space does not seem to be so compelling. Still, the kernel trick remains meaningful in a sort of “second generation” version, whose point is to take the data to a more comfortable and flexible space. In this new space, the statistical methodology might be mathematically more tractable, and more easily implemented and interpreted. To be more precise, a probability distribution P on the sample space 𝒳 is replaced with a function

\mu_P(x) = \int_{\mathcal{X}} k(x, y) \, \mathrm{d}P(y), \qquad x \in \mathcal{X},    (1)

in an appropriate space of “nice functions” defined by means of the kernel k. In this way, the distance between two probability measures is computed in terms of the metric in the functional space. As a matter of fact, one of
the most appealing proposals in this direction relies on kernel-based distances, expressed in terms of the embedding transformation µ_P in (1); see [10].
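As a concrete illustration (ours, not part of the original paper), the next minimal sketch evaluates the empirical counterpart of the mean embedding (1) for a Gaussian kernel, that is, (1) with P replaced by the empirical measure of a sample; the sample, the bandwidth and the function names are illustrative choices.

```python
# Minimal sketch of the empirical mean embedding x -> (1/n) sum_i k(x, X_i),
# i.e. equation (1) with P replaced by the empirical measure of a sample.
import numpy as np

def gaussian_kernel(x, y, lam=1.0):
    """k_lambda(x, y) = exp(-lam * ||x - y||^2)."""
    return np.exp(-lam * np.sum((x - y) ** 2))

def empirical_mean_embedding(sample, lam=1.0):
    """Return the function mu_Pn: x -> mean_i k_lambda(x, X_i)."""
    def mu(x):
        return np.mean([gaussian_kernel(x, xi, lam) for xi in sample])
    return mu

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # illustrative sample from some P on R^2
mu_Pn = empirical_mean_embedding(X)
print(mu_Pn(np.zeros(2)))            # the smoothed "fingerprint" of the sample at x = 0
```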
The kernel k involved in this methodology depends, almost unavoidably, on some tuning parameter λ, typically a scale factor. Therefore, we actually have a family of kernels, k_λ, for λ ∈ Λ, where Λ is usually a subset of R^k (k ≥ 1). For instance, the popular family of Gaussian kernels with parameter λ ∈ (0, ∞) is defined by

k_\lambda(x, y) = \exp\left( -\lambda \, \| x - y \|^2 \right), \qquad \text{for } x, y \in \mathcal{X},    (2)
where ‖·‖ is a norm in 𝒳. Unfortunately, there is no general rule to know a priori which kernel works best with the available data. In other words, the choice of λ is, to some extent, arbitrary but not irrelevant, as it could remarkably affect the final output. For example, very small or very large choices of λ in (2) result in null discrepancies, which have no ability to distinguish distributions. The selection of λ is hence a delicate problem that has not been satisfactorily solved so far. This is what we call the kernel trap: a bad choice of the parameter leading to poor results. Although this problem was not explicitly considered in [10] and subsequent works on this topic, the authors were aware of this relevant question; in practice, they use a heuristic choice of λ.
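The following small numerical sketch illustrates this phenomenon; it is our own illustration (the samples, the grid of values of λ and the helper names are arbitrary choices), and it uses the plug-in version of the kernel distance d_k defined in (3) below. Very small and very large values of λ yield discrepancies far smaller than moderate values, even though the two samples differ markedly.

```python
# Illustrative sketch of the "kernel trap" (not from the paper): the squared
# empirical Gaussian-kernel distance between two clearly different samples,
# evaluated on a grid of bandwidths lambda.
import numpy as np

def sq_kernel_distance(X, Y, lam):
    """Plug-in estimate of d_k(P, Q)^2 for k(x, y) = exp(-lam * ||x - y||^2)."""
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-lam * d2)
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
Y = rng.normal(1.0, 1.0, size=(200, 2))   # sample from Q (shifted mean)
for lam in [1e-6, 1e-3, 1.0, 1e3, 1e6]:
    print(f"lambda = {lam:8.0e}   d_k^2 = {sq_kernel_distance(X, Y, lam):.6f}")
# Extreme bandwidths produce a much smaller discrepancy than a moderate one,
# hiding the (clear) difference between the two populations.
```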
Further, a parameter-dependent method might be an obstacle for practitioners who are often reluctant to use
procedures depending on auxiliary, hard-to-interpret parameters. We thus find here a particular instance of the trade-off between power and applicability: as stated in [47], the practical power of a statistical procedure is defined as “the product of the mathematical power by the probability that the procedure will be used” (Tukey credits Churchill Eisenhart with this idea). From this perspective, our proposal can be viewed as an attempt to make kernel-based
homogeneity tests more usable by getting rid of the tuning parameter(s). Roughly speaking, the idea that we propose
to avoid selecting a specific value of λ within the family {k_λ : λ ∈ Λ} is to take the supremum over the set of parameters Λ of the resulting family of kernel distances. We call this approach the uniform kernel trick, as we map the data into
many functional spaces at the same time and use, as test statistic, the supremum of the corresponding kernel distances.
We believe that this methodology could be successfully applied as well in supervised classification, though this topic
is not considered in this work.
The topic of this paper
Two-sample tests, also called homogeneity tests, aim to decide whether or not it can be accepted that two random elements have the same distribution, using the information provided by two independent samples. This problem is omnipresent in practice on account of its applicability to a great variety of situations, ranging from biomedicine to quality control. Since the classical Student's t-tests or rank-based (Mann-Whitney, Wilcoxon, . . . ) procedures, the subject has received almost permanent attention from the statistical community. In this work we focus on two-sample tests valid, under broad assumptions, for general settings in which the data are drawn from two random elements X and Y taking values in a general space 𝒳. The set 𝒳 is the “sample space” or “feature space” in the Machine Learning language. In the important particular case 𝒳 = L²([0,1]), X and Y are stochastic processes and the two-sample problem lies within the framework of Functional Data Analysis (FDA).
Many important statistical methods, including goodness of fit and homogeneity tests, are based on an appropriate metric (or discrepancy measure) that allows groups or distributions to be distinguished. Probability distances or semi-distances reveal to the practitioner the dissimilarity between two random quantities. Therefore, the estimation of a suitable distance helps detect significant differences between two populations. Some well-known, classic examples of such metrics are the Kolmogorov distance, that leads to the popular Kolmogorov-Smirnov statistic, and L²-based discrepancy measures, leading to Cramér-von Mises or Anderson-Darling statistics. These methods, based on cumulative distribution functions, are no longer useful with high-dimensional or non-Euclidean data, as in FDA problems. For this reason we follow a different strategy based on more adaptable metrics between general probability measures.
The energy distance (see the review by [41]) and the associated distance covariance, as well as kernel distances, represent a step forward in this direction since they can be calculated with relative ease for high-dimensional distributions. In [28] the relationships among these metrics in the context of hypothesis testing are discussed. In this paper we consider an extension, as well as an alternative mathematical approach, for the two-sample test in [10]. These authors show that kernel-based procedures perform better than other more classical approaches when dimension grows, although they are strongly dependent on the choice of the kernel parameter.
Three important auxiliary tools: RKHS, mean embeddings, and kernel distances
To present the contributions of this paper, we briefly refer to some important, mutually related, technical notions.
As emphasized in [6], Reproducing Kernel Hilbert Spaces (RKHS in short) provide an excellent environment to
construct helpful transformations in several statistical problems. Given a topological space 𝒳 (in many applications 𝒳 is a subset of a Hilbert space), a kernel k is a real, non-negative semidefinite, symmetric function on 𝒳 × 𝒳. The RKHS associated with k, denoted in the following by H_k, is the Hilbert space generated by finite linear combinations of the type ∑_j α_j k(x_j, ·); see Section 2 for additional details.
Let M_p(𝒳) be the set of (Borel) probability measures on 𝒳. Under mild assumptions on k, the functions in H_k are measurable and P-integrable, for each P ∈ M_p(𝒳). Moreover, it can be checked that the function µ_P in (1) belongs to H_k. The transformation P ↦ µ_P from M_p(𝒳) to H_k is called the (kernel) mean embedding; see [28] and [6, Chapter 4]. The mean embedding of P can be viewed as a smoothed version of the distribution P through the kernel k within the RKHS. This is evident when P is absolutely continuous with density f and k(x, y) = K(x − y), for some real function K. In this situation, µ_P is the convolution of f and K. On the other hand, mean embeddings appear, under the name of potential functions, in some other mathematical fields (such as functional analysis); see [20, p. 15].
The kernel distance between P and Q in M_p(𝒳) is

d_k(P, Q) = \| \mu_P - \mu_Q \|_{H_k} = \left( \int_{\mathcal{X}^2} k \, \mathrm{d}\,[(P - Q) \otimes (P - Q)] \right)^{1/2},    (3)

where ‖·‖_{H_k} stands for the norm in H_k and (P − Q) ⊗ (P − Q) denotes the product (signed) measure on 𝒳² = 𝒳 × 𝒳.
Therefore, d_k(P, Q) is the RKHS distance between the mean embeddings of the corresponding probability measures.
Kernel distances were popularized in machine learning as tools to tackle several relevant statistical problems, such as
homogeneity testing [10], independence testing [23], conditional independence testing [24], and density estimation [27]. The
key idea behind this methodology can be seen as a particular case of the fruitful kernel trick paradigm.
Our contributions: the uniform kernel trick
We consider a family of kernels {k_λ : λ ∈ Λ}, where Λ is a certain parameter space. For the Gaussian kernel in (2), Λ = (0, ∞), but in general λ could be a multidimensional parameter, as in the case of Matérn kernels or inverse quadratic kernels; see [46, p. 1846]. Each k_λ has an associated RKHS, H_{k_λ} (endowed with its intrinsic norm ‖·‖_{H_{k_λ}}), and the corresponding probability distance d_{k_λ}. For P, Q ∈ M_p(𝒳), we want to test H_0 : P = Q using the distances within the collection {d_{k_λ} : λ ∈ Λ}. The current theoretical framework does not support the automatic (data-driven) choice of λ ∈ Λ, since the asymptotic theory is mainly developed for a fixed kernel, corresponding to a specific value of λ. However, the choice of λ is a non-trivial and sensitive issue with no obvious best solution, and it might affect the test performance.
There are various interesting proposals to deal with this problem in practice: the median heuristic of [10]; sample-splitting and optimization methods in [3] and [32]; and aggregation methods such as [3]. In this paper we explore an alternative to avoid making a parametric decision or splitting the data set. Our proposal can be included within the aggregation methods: we combine the information provided by different kernels by taking the supremum over the induced kernel metrics. Specifically, we use the metric that “best separates” P and Q, that is, the supremum of all kernel distances given by
d_{k,\Lambda}(P, Q) = \sup_{\lambda \in \Lambda} d_{k_\lambda}(P, Q) = \sup_{\lambda \in \Lambda} \| \mu_P^\lambda - \mu_Q^\lambda \|_{H_{k_\lambda}}, \qquad P, Q \in M_p(\mathcal{X}),    (4)

where, for λ ∈ Λ, µ_P^λ and µ_Q^λ are the mean embeddings of P and Q, respectively, in H_{k_λ}. We call the quantity in (4)
the supremum (or uniform) kernel distance of {k_λ : λ ∈ Λ}. Also, the uniform kernel trick refers to the overall idea of using (4) to eliminate the parameter in kernel-based statistics. Observe that d_k in (3) is a particular case of d_{k,Λ} in (4) when Λ is a single-element set. Therefore, all the results in this work apply to usual kernel distances as well. In addition, in the family {k_λ : λ ∈ Λ} we can include kernels from different parametric families, which would generate more robust test statistics that might work well under many types of alternatives.
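To make this last remark concrete, a heterogeneous family {k_λ : λ ∈ Λ} can be represented in code simply as a collection of Gram-matrix callables. The sketch below is ours; it mixes Gaussian kernels (2) with inverse quadratic kernels, and the particular parametrisations and grids are illustrative assumptions, not choices taken from the paper or from [46].

```python
# A heterogeneous kernel family {k_lambda : lambda in Lambda}, mixing Gaussian
# and (illustratively parametrised) inverse quadratic kernels; each entry maps
# two samples to the corresponding Gram matrix.
import numpy as np

def sq_dists(X, Y):
    """Pairwise squared Euclidean distances between the rows of X and Y."""
    return np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T

def gaussian_gram(lam):
    # k_lambda(x, y) = exp(-lam * ||x - y||^2), as in (2)
    return lambda X, Y: np.exp(-lam * sq_dists(X, Y))

def inverse_quadratic_gram(c):
    # k_c(x, y) = 1 / (1 + c * ||x - y||^2)   (assumed parametrisation)
    return lambda X, Y: 1.0 / (1.0 + c * sq_dists(X, Y))

kernel_family = {
    **{("gauss", lam): gaussian_gram(lam) for lam in (0.1, 1.0, 10.0)},
    **{("invquad", c): inverse_quadratic_gram(c) for c in (0.1, 1.0, 10.0)},
}
print(list(kernel_family))   # the index set playing the role of Lambda
```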
The supremum kernel distance (4) entails several advantages and some mathematical challenges: First, the kernel
selection problem is considerably simplified and solved in a natural way. Additionally, the approach is general enough
to be applied in infinite-dimensional settings such as FDA. This is interesting since in FDA there are only a few homogeneity tests in the literature. Some of them have been developed in the setting of ANOVA models (involving several
samples) under homoscedasticity (equal covariance operators of the involved processes) and Gaussian assumptions.
Hence, the current methodologies amount to testing the null hypothesis of equal means in all the populations; see,
e.g., [16] for an early contribution and [52] for a broader perspective. Our proposal is therefore quite related to more
general approaches, not requiring any homoscedasticity assumption and still valid in an FDA framework. Examples
of such similar tests are [29] and [34], as well as the random projections-based methodology in [15].
The inclusion of the supremum in (4) represents an additional difficulty. The asymptotic properties of the test statistic based on (4) are derived by following a different strategy from that of [10] and later works. The methodology proposed here allows us to cope with the supremum and applies directly to the case of unequal sample sizes. In short, our approach can be described as follows: First, we consider plug-in estimators of the kernel distances, obtained by replacing the unknown distributions by their empirical counterparts. Then, we use the powerful theory of empirical processes together with some recent results on the differentiability of the supremum (see [14]) and functional Karhunen-Loève expansions of the underlying processes. These developments entail several technical difficulties from the mathematical point of view. However, they are worthwhile since they allow us to analyze the asymptotic behavior, under both the null and the alternative hypothesis, of the two-sample test based on (4).
The organization of this paper
In Section 2 we provide some preliminaries regarding RKHS basics and empirical processes. While most of this
background is well-known or can be found in the literature, it is included here to introduce the necessary notation
and make the paper as self-contained as possible. Section 3 contains the main theoretical contributions. First, we
obtain a Donsker property for unions of unit balls in RKHS that could be of independent interest. We establish the
asymptotic validity under the null hypothesis of the two-sample test based on the distance (4). The asymptotic statistical power (i.e., the behaviour under the alternative hypothesis of non-homogeneity) is also analysed. An empirical
study, comparing the uniform kernel test with some other competitors, is presented in Section 4. In the scenarios we have considered, the supremum kernel distance (SKD) test is competitive with the other kernel-based methods, especially in the case of heteroscedastic
populations. However, given the limited nature of the study, we cannot conclude that our proposal unequivocally
outperforms existing approaches.
2. Preliminaries
In this section we describe various tools that we use throughout this work.
Reproducing kernel Hilbert spaces (RKHS)
The theory of RKHS plays a relevant role in this paper. This is a classical and well-known topic; see [35, Appendix
F] for a brief account of the RKHS theory and [6] or [21] for a statistical perspective. Hence, we only mention what
is strictly necessary for later use. Let 𝒳 be a topological space and k : 𝒳 × 𝒳 → R a kernel, that is, a symmetric and positive semi-definite function. Let us consider H_k^0, the pre-Hilbert space of all finite linear combinations g(·) = ∑_{i=1}^n α_i k(x_i, ·) (with α_i ∈ R, n ∈ N and x_i ∈ 𝒳), endowed with the inner product

\left\langle \sum_{i=1}^{n} \alpha_i k(x_i, \cdot), \, \sum_{j=1}^{m} \beta_j k(x_j, \cdot) \right\rangle_{H_k} = \sum_{i,j} \alpha_i \beta_j \, k(x_i, x_j).    (5)
The RKHS H_k is defined as the completion of H_k^0; see [6, Chapter 1]. The inner product ⟨·, ·⟩_{H_k} in H_k is obtained through (5) in such a way that bilinearity is preserved. A key property of RKHS is the so-called reproducing property:

\langle f, k(x, \cdot) \rangle_{H_k} = f(x), \qquad \text{for all } f \in H_k, \ x \in \mathcal{X}.    (6)
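As a quick numerical sanity check (ours, not taken from the paper), the following sketch builds a function f in H_k^0 for a Gaussian kernel on the real line, computes its norm through (5), and verifies the reproducing property (6) as well as the resulting bound |f(x)| ≤ ‖f‖_{H_k} √k(x, x); all concrete values are arbitrary.

```python
# Numerical illustration of the inner product (5) and the reproducing property (6)
# for a Gaussian kernel on R; nodes, coefficients and evaluation point are arbitrary.
import numpy as np

def k(x, y, lam=1.0):
    return np.exp(-lam * np.subtract.outer(x, y) ** 2)

rng = np.random.default_rng(2)
nodes = rng.normal(size=6)                         # x_1, ..., x_n
alpha = rng.normal(size=6)                         # alpha_1, ..., alpha_n

f = lambda x: alpha @ k(nodes, x)                  # f = sum_i alpha_i k(x_i, .)
norm_f = np.sqrt(alpha @ k(nodes, nodes) @ alpha)  # ||f||_{H_k}, computed with (5)

x0 = 0.3
inner = alpha @ k(nodes, x0)                       # <f, k(x0, .)>_{H_k}, again via (5)
print(np.isclose(inner, f(x0)))                    # reproducing property (6): True
print(abs(f(x0)) <= norm_f * np.sqrt(k(x0, x0)))   # evaluation is bounded (Cauchy-Schwarz)
```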
Kernel distances as integral probability metrics
Each P ∈ M_p(𝒳) (Borel probability measure on 𝒳) can be seen as a linear functional on H_k via the mapping

f \in H_k \mapsto P(f) = \int_{\mathcal{X}} f \, \mathrm{d}P,    (7)
whenever H_k ⊂ L¹(P) (the set of integrable functions with respect to P). This condition is also equivalent to saying that the function x ↦ k(x, ·) is Pettis integrable (with respect to P) and to the existence of the mean embedding µ_P in (1) as an element of H_k fulfilling the Riesz representation condition

P(f) = \langle f, \mu_P \rangle_{H_k}, \qquad \text{for } f \in H_k.    (8)

Sufficient conditions guaranteeing the injectivity of the mean embedding transformation can be found in [27]. Note that in (7) (and what follows) we use the standard notation in empirical processes theory: P(f) (or simply P f) stands for the mathematical expectation of f with respect to P.
The existence of the mean embedding implies that the kernel distance in (3), as well as the supremum kernel distance in (4), is well-defined. Indeed, they are integral probability metrics; see [39]. To see this, let us consider the unit ball of H_k, that is,

F_k = \{ f \in H_k : \| f \|_{H_k} \le 1 \}.    (9)
We have that

\| \mu_P - \mu_Q \|_{H_k} = \sup_{f \in F_k} \langle f, \mu_P - \mu_Q \rangle_{H_k}
  \overset{(a)}{=} \sup_{f \in F_k} \Big\langle f, \int_{\mathcal{X}} k(\cdot, x) \, \mathrm{d}(P - Q)(x) \Big\rangle_{H_k}
  \overset{(b)}{=} \sup_{f \in F_k} \int_{\mathcal{X}} \langle f, k(\cdot, x) \rangle_{H_k} \, \mathrm{d}(P - Q)(x)
  \overset{(c)}{=} \sup_{f \in F_k} (P(f) - Q(f)),    (10)

where (a) follows from the definition of the mean embedding (1), (b) from Pettis integrability, and (c) from the reproducing property (6); see also [12, Lemma 4]. Thus, the kernel distance (3) is the integral probability metric generated
by the class F_k in (9). Therefore, the supremum kernel distance (4) admits the alternative representation

d_{k,\Lambda}(P, Q) = \sup_{f \in F_{k,\Lambda}} (P(f) - Q(f)), \qquad \text{with } F_{k,\Lambda} = \bigcup_{\lambda \in \Lambda} F_{k_\lambda},    (11)

where F_{k_λ} is the unit ball in the RKHS associated with k_λ. In other words, d_{k,Λ} is the integral probability metric defined through the union of the unit balls of the whole family of RKHS constructed with {k_λ : λ ∈ Λ}.
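The representation (10) can also be checked numerically. With empirical measures, the supremum over the unit ball F_k is attained at the normalized “witness” function f* = (µ_P − µ_Q)/‖µ_P − µ_Q‖_{H_k}. The sketch below is our own illustration (arbitrary samples, Gaussian kernel, illustrative names): it verifies that P_n(f*) − Q_m(f*) coincides with the closed-form distance from (3).

```python
# Check (ours) that the supremum in (10) is attained at the witness function
# f* = (mu_P - mu_Q) / ||mu_P - mu_Q||, for empirical measures and a Gaussian kernel.
import numpy as np

lam = 1.0
def gram(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-lam * d2)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(150, 2))    # sample from P
Y = rng.normal(1.0, 1.0, size=(100, 2))    # sample from Q

# closed-form empirical kernel distance, as in (3)
d = np.sqrt(gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean())

# witness f*(z) = (mu_Pn(z) - mu_Qm(z)) / d, written with the mean embeddings (1)
witness = lambda Z: (gram(Z, X).mean(axis=1) - gram(Z, Y).mean(axis=1)) / d

# P_n(f*) - Q_m(f*) equals the distance d, in agreement with (10)
print(witness(X).mean() - witness(Y).mean(), d)
```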
From the characterizations as integral probability metrics in (10) and (11), we conclude that d_k and d_{k,Λ} satisfy the properties of a pseudo-metric (non-negativity, symmetry, triangle inequality). However, to ensure the identifiability property of a metric d (i.e., d(P, Q) = 0 if and only if P = Q), additional conditions are needed. It can be checked that when 𝒳 = R^d, identifiability is satisfied for the usual kernels (such as the Gaussian kernel in (2)). However, when 𝒳 is infinite-dimensional, results of this type are more complicated; see [19] for a deep study of this topic for the Gaussian kernel (2). More details can also be found in [26] and [27].
Plug-in estimators, empirical processes and Donsker classes of functions
A simple and natural estimator of the supremum kernel distance (4) can be obtained by applying the plug-in
principle in (11). Given two independent samples X_1, ..., X_n and Y_1, ..., Y_m from P and Q, respectively, we replace the unknown underlying probability measures P and Q with the observed empirical counterparts,

P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}, \qquad Q_m = \frac{1}{m} \sum_{j=1}^{m} \delta_{Y_j},
δ_a being the unit point mass at a. This leads to the estimator of d_{k,Λ}(P, Q) in (11) given by

d_{k,\Lambda}(P_n, Q_m) = \sup_{f \in F_{k,\Lambda}} (P_n(f) - Q_m(f)) = \sup_{f \in F_{k,\Lambda}} \left( \frac{1}{n} \sum_{i=1}^{n} f(X_i) - \frac{1}{m} \sum_{j=1}^{m} f(Y_j) \right).    (12)
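Computationally, when the continuous parameter set Λ is approximated by a finite grid, the supremum in (12) reduces to a maximum of closed-form empirical kernel distances, one per value of λ. The sketch below is our own minimal illustration (not the authors' implementation); the Gaussian family, the grid and the sample sizes are arbitrary choices.

```python
# Minimal sketch of the plug-in supremum kernel distance (12) over a finite grid
# of Gaussian bandwidths: a maximum of closed-form empirical kernel distances.
import numpy as np

def sq_dists(X, Y):
    return np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T

def empirical_kernel_distance(X, Y, lam):
    """Plug-in version of d_{k_lambda}(P, Q), cf. (3) with P_n and Q_m."""
    kXX = np.exp(-lam * sq_dists(X, X))
    kYY = np.exp(-lam * sq_dists(Y, Y))
    kXY = np.exp(-lam * sq_dists(X, Y))
    return np.sqrt(max(kXX.mean() + kYY.mean() - 2 * kXY.mean(), 0.0))

def sup_kernel_distance(X, Y, lambdas):
    """Plug-in supremum kernel distance d_{k,Lambda}(P_n, Q_m), Lambda a finite grid."""
    return max(empirical_kernel_distance(X, Y, lam) for lam in lambdas)

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(200, 10))       # sample from P
Y = rng.normal(0.0, 1.3, size=(150, 10))       # sample from Q: same mean, larger scale
grid = np.logspace(-3, 3, 13)                  # illustrative finite Lambda
print(sup_kernel_distance(X, Y, grid))
```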
As a supremum over a class of functions is involved in (12), the theory of empirical processes comes into play
naturally. Given a collection of functions F, we recall that the F-indexed empirical process (associated to P) is G_n^P = √n (P_n − P). The class F is called P-Donsker if G_n^P ⇝ G_P in ℓ^∞(F), the space of bounded real functionals defined on F with the supremum norm; see [48]. Here, ‘⇝’ stands for weak convergence in ℓ^∞(F) and G_P is a P-Brownian bridge, that is, a zero-mean Gaussian process with covariance function

\mathrm{E}[G_P(f_1) \, G_P(f_2)] = P(f_1 f_2) - P(f_1) \, P(f_2), \qquad f_1, f_2 \in F.

Additionally, F is universal Donsker if it is P-Donsker, for every P ∈ M_p(𝒳).
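To get a feeling for this definition, the covariance structure of the limiting P-Brownian bridge can be reproduced by simulation. The following sketch is our own illustration (the functions f_1, f_2 and the choice P = N(0, 1) are arbitrary): the Monte Carlo covariance of the empirical process evaluated at f_1 and f_2 approaches P(f_1 f_2) − P(f_1) P(f_2), which equals E[X⁴] = 3 in this example.

```python
# Monte Carlo illustration (ours) of the P-Brownian bridge covariance for
# f1(x) = x, f2(x) = x^3 and P = N(0, 1): P(f1 f2) - P(f1) P(f2) = E[X^4] = 3.
import numpy as np

rng = np.random.default_rng(5)
f1 = lambda x: x
f2 = lambda x: x ** 3
n, reps = 2000, 4000

G1, G2 = [], []
for _ in range(reps):
    x = rng.normal(size=n)                     # sample from P = N(0, 1)
    # P(f1) = E[X] = 0 and P(f2) = E[X^3] = 0 for the standard normal
    G1.append(np.sqrt(n) * f1(x).mean())
    G2.append(np.sqrt(n) * f2(x).mean())

print(np.cov(G1, G2)[0, 1])                    # close to 3 = E[X^4]
```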