A uniform kernel trick for high and infinite-dimensional two-sample problems
Javier Cárcamo^{a,∗}, Antonio Cuevas^b, Luis-Alberto Rodríguez^c
^a Euskal Herriko Unibertsitatea-Universidad del País Vasco, Spain
^b Universidad Autónoma de Madrid and Instituto de Ciencias Matemáticas ICMAT (CSIC-UAM-UCM-UC3M), Spain
^c Georg-August-Universität Göttingen, Germany
∗ Corresponding author. Email address: javier.carcamo@ehu.eus
Abstract
We use a suitable version of the so-called “kernel trick” to devise two-sample tests, especially focussed on high-dimensional and functional data. Our proposal entails a simplification of the practical problem of selecting an appropriate kernel function. Specifically, we apply a uniform variant of the kernel trick which involves the supremum within a class of kernel-based distances. We obtain the asymptotic distribution of the test statistic under the null and alternative hypotheses. The proofs rely on empirical processes theory, combined with the delta method and Hadamard directional differentiability techniques, and functional Karhunen-Loève-type expansions of the underlying processes. This methodology has some advantages over other standard approaches in the literature. We also give some experimental insight into the performance of our proposal compared to other kernel-based approaches (the original proposal by [10] and some variants based on splitting methods) as well as tests based on energy distances [41].
Keywords: Homogeneity test, Reproducing kernel Hilbert spaces, Supremum-like distances.
2020 MSC: 62G10, 62G20
1. Introduction: an overview
In this section we provide an extended summary including not only the main ideas of this work but, especially, the general setting, motivation and related literature, as well as the technical tools we use.
The kernel trick and some potential kernel traps
We focus on statistical problems where, essentially, the aim is to properly separate data coming from two different populations; this is the case of binary supervised classification and two-sample testing problems. In such situations, the kernel trick is a common paradigm. In a few words, the standard multivariate version (i.e., with data in R^d) of the kernel trick lies in separating the data in both populations using a symmetric non-negative definite “kernel function”. The values of the kernel can be seen as the inner product of transformed versions of the original observations in a different (usually higher-dimensional) space. It is expected that the groups can be better distinguished in the new final space; see [43].
We are particularly interested in those situations in which the available data are high-dimensional or even functional (thus, infinite-dimensional). In such cases, the strategy of mapping the data into a higher-dimensional space does not seem to be so compelling. Still, the kernel trick remains meaningful in a sort of “second generation” version, whose point is to take the data to a more comfortable and flexible space. In this new space, the statistical methodology might be mathematically more tractable, and more easily implemented and interpreted. To be more precise, a probability distribution P on the sample space 𝒳 is replaced with a function

\mu_P(x) = \int_{\mathcal{X}} k(x, y) \, \mathrm{d}P(y), \qquad x \in \mathcal{X},    (1)

in an appropriate space of “nice functions” defined by means of the kernel k. In this way, the distance between two probability measures is computed in terms of the metric in the functional space. As a matter of fact, one of
the most appealing proposals in this direction relies on kernel-based distances, expressed in terms of the embedding transformation µ_P in (1); see [10].
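As a concrete illustration (ours, not part of the original paper), the next minimal sketch evaluates the empirical counterpart of the mean embedding (1) for a Gaussian kernel, that is, (1) with P replaced by the empirical measure of a sample; the sample, the bandwidth and the function names are illustrative choices.

```python
# Minimal sketch of the empirical mean embedding x -> (1/n) sum_i k(x, X_i),
# i.e. equation (1) with P replaced by the empirical measure of a sample.
import numpy as np

def gaussian_kernel(x, y, lam=1.0):
    """k_lambda(x, y) = exp(-lam * ||x - y||^2)."""
    return np.exp(-lam * np.sum((x - y) ** 2))

def empirical_mean_embedding(sample, lam=1.0):
    """Return the function mu_Pn: x -> mean_i k_lambda(x, X_i)."""
    def mu(x):
        return np.mean([gaussian_kernel(x, xi, lam) for xi in sample])
    return mu

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # illustrative sample from some P on R^2
mu_Pn = empirical_mean_embedding(X)
print(mu_Pn(np.zeros(2)))            # the smoothed "fingerprint" of the sample at x = 0
```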
The kernel k involved in this methodology depends, almost unavoidably, on some tuning parameter λ, typically a scale factor. Therefore, we actually have a family of kernels, k_λ, for λ ∈ Λ, where Λ is usually a subset of R^k (k ≥ 1). For instance, the popular family of Gaussian kernels with parameter λ ∈ (0, ∞) is defined by

k_\lambda(x, y) = \exp\left( -\lambda \, \| x - y \|^2 \right), \qquad \text{for } x, y \in \mathcal{X},    (2)
where ‖·‖ is a norm in 𝒳. Unfortunately, there is no general rule to know a priori which kernel works best with the available data. In other words, the choice of λ is, to some extent, arbitrary but not irrelevant, as it could remarkably affect the final output. For example, very small or very large choices of λ in (2) result in null discrepancies, which have no ability to distinguish distributions. The selection of λ is hence a delicate problem that has not been satisfactorily solved so far. This is what we call the kernel trap: a bad choice of the parameter leading to poor results. Although this problem was not explicitly considered in [10] and subsequent works on this topic, the authors were aware of this relevant question; in practice, they use a heuristic choice of λ.
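The following small numerical sketch illustrates this phenomenon; it is our own illustration (the samples, the grid of values of λ and the helper names are arbitrary choices), and it uses the plug-in version of the kernel distance d_k defined in (3) below. Very small and very large values of λ yield discrepancies far smaller than moderate values, even though the two samples differ markedly.

```python
# Illustrative sketch of the "kernel trap" (not from the paper): the squared
# empirical Gaussian-kernel distance between two clearly different samples,
# evaluated on a grid of bandwidths lambda.
import numpy as np

def sq_kernel_distance(X, Y, lam):
    """Plug-in estimate of d_k(P, Q)^2 for k(x, y) = exp(-lam * ||x - y||^2)."""
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-lam * d2)
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
Y = rng.normal(1.0, 1.0, size=(200, 2))   # sample from Q (shifted mean)
for lam in [1e-6, 1e-3, 1.0, 1e3, 1e6]:
    print(f"lambda = {lam:8.0e}   d_k^2 = {sq_kernel_distance(X, Y, lam):.6f}")
# Extreme bandwidths produce a much smaller discrepancy than a moderate one,
# hiding the (clear) difference between the two populations.
```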
Further, a parameter-dependent method might be an obstacle for practitioners who are often reluctant to use
procedures depending on auxiliary, hard-to-interpret parameters. We thus find here a particular instance of the trade-off between power and applicability: as stated in [47], the practical power of a statistical procedure is defined as “the product of the mathematical power by the probability that the procedure will be used” (Tukey credits Churchill Eisenhart with this idea). From this perspective, our proposal can be viewed as an attempt to make kernel-based
homogeneity tests more usable by getting rid of the tuning parameter(s). Roughly speaking, the idea that we propose
to avoid selecting a specific value of λ within the family {k_λ : λ ∈ Λ} is to take the supremum over the set of parameters Λ of the resulting family of kernel distances. We call this approach the uniform kernel trick, as we map the data into
many functional spaces at the same time and use, as test statistic, the supremum of the corresponding kernel distances.
We believe that this methodology could be successfully applied as well in supervised classification, though this topic
is not considered in this work.
The topic of this paper
Two-sample tests, also called homogeneity tests, aim to decide whether or not it can be accepted that two random elements have the same distribution, using the information provided by two independent samples. This problem is omnipresent in practice on account of its applicability to a great variety of situations, ranging from biomedicine to quality control. Since the classical Student's t-tests or rank-based (Mann-Whitney, Wilcoxon, . . . ) procedures, the subject has received almost permanent attention from the statistical community. In this work we focus on two-sample tests valid, under broad assumptions, for general settings in which the data are drawn from two random elements X and Y taking values in a general space 𝒳. The set 𝒳 is the “sample space” or “feature space” in the Machine Learning language. In the important particular case 𝒳 = L²([0,1]), X and Y are stochastic processes and the two-sample problem lies within the framework of Functional Data Analysis (FDA).
Many important statistical methods, including goodness of fit and homogeneity tests, are based on an appropriate metric (or discrepancy measure) that allows groups or distributions to be distinguished. Probability distances or semi-distances reveal to the practitioner the dissimilarity between two random quantities. Therefore, the estimation of a suitable distance helps detect significant differences between two populations. Some well-known, classic examples of such metrics are the Kolmogorov distance, that leads to the popular Kolmogorov-Smirnov statistic, and L²-based discrepancy measures, leading to Cramér-von Mises or Anderson-Darling statistics. These methods, based on cumulative distribution functions, are no longer useful with high-dimensional or non-Euclidean data, as in FDA problems. For this reason we follow a different strategy based on more adaptable metrics between general probability measures.
The energy distance (see the review by [41]) and the associated distance covariance, as well as kernel distances, represent a step forward in this direction since they can be calculated with relative ease for high-dimensional distributions. In [28] the relationships among these metrics in the context of hypothesis testing are discussed. In this paper we consider an extension, as well as an alternative mathematical approach, for the two-sample test in [10]. These authors show that kernel-based procedures perform better than other more classical approaches when dimension grows, although they are strongly dependent on the choice of the kernel parameter.
Three important auxiliary tools: RKHS, mean embeddings, and kernel distances
To present the contributions of this paper, we briefly refer to some important, mutually related, technical notions.
As emphasized in [6], Reproducing Kernel Hilbert Spaces (RKHS in short) provide an excellent environment to
construct helpful transformations in several statistical problems. Given a topological space 𝒳 (in many applications 𝒳 is a subset of a Hilbert space), a kernel k is a real, non-negative semidefinite, symmetric function on 𝒳 × 𝒳. The RKHS associated with k, denoted in the following by H_k, is the Hilbert space generated by finite linear combinations of the type ∑_j α_j k(x_j, ·); see Section 2 for additional details.
Let M_p(𝒳) be the set of (Borel) probability measures on 𝒳. Under mild assumptions on k, the functions in H_k are measurable and P-integrable, for each P ∈ M_p(𝒳). Moreover, it can be checked that the function µ_P in (1) belongs to H_k. The transformation P ↦ µ_P from M_p(𝒳) to H_k is called the (kernel) mean embedding; see [28] and [6, Chapter 4]. The mean embedding of P can be viewed as a smoothed version of the distribution P through the kernel k within the RKHS. This is evident when P is absolutely continuous with density f and k(x, y) = K(x − y), for some real function K. In this situation, µ_P is the convolution of f and K. On the other hand, mean embeddings appear, under the name of potential functions, in some other mathematical fields (such as functional analysis); see [20, p. 15].
The kernel distance between P and Q in M_p(𝒳) is

d_k(P, Q) = \| \mu_P - \mu_Q \|_{H_k} = \left( \int_{\mathcal{X}^2} k \, \mathrm{d}\,[(P - Q) \otimes (P - Q)] \right)^{1/2},    (3)

where ‖·‖_{H_k} stands for the norm in H_k and (P − Q) ⊗ (P − Q) denotes the product (signed) measure on 𝒳² = 𝒳 × 𝒳.
Therefore, d_k(P, Q) is the RKHS distance between the mean embeddings of the corresponding probability measures.
Kernel distances were popularized in machine learning as tools to tackle several relevant statistical problems, such as
homogeneity testing [10], independence testing [23], conditional independence testing [24], and density estimation [27]. The
key idea behind this methodology can be seen as a particular case of the fruitful kernel trick paradigm.
Our contributions: the uniform kernel trick
We consider a family of kernels {k_λ : λ ∈ Λ}, where Λ is a certain parameter space. For the Gaussian kernel in (2), Λ = (0, ∞), but in general λ could be a multidimensional parameter, as in the case of Matérn kernels or inverse quadratic kernels; see [46, p. 1846]. Each k_λ has an associated RKHS, H_{k_λ} (endowed with its intrinsic norm ‖·‖_{H_{k_λ}}), and the corresponding probability distance d_{k_λ}. For P, Q ∈ M_p(𝒳), we want to test H_0 : P = Q using the distances within the collection {d_{k_λ} : λ ∈ Λ}. The current theoretical framework does not support the automatic (data-driven) choice of λ ∈ Λ, since the asymptotic theory is mainly developed for a fixed kernel, corresponding to a specific value of λ. However, the choice of λ is a non-trivial and sensitive issue with no obvious best solution, and it might affect the test performance.
There are various interesting proposals to deal with this problem in practice: the median heuristic of [10]; sample-splitting and optimization methods in [3] and [32]; and aggregation methods such as [3]. In this paper we explore an alternative to avoid making a parametric decision or splitting the data set. Our proposal can be included within the aggregation methods: we combine the information provided by different kernels by taking the supremum over the induced kernel metrics. Specifically, we use the metric that “best separates” P and Q, that is, the supremum of all kernel distances given by
d_{k,\Lambda}(P, Q) = \sup_{\lambda \in \Lambda} d_{k_\lambda}(P, Q) = \sup_{\lambda \in \Lambda} \| \mu_P^\lambda - \mu_Q^\lambda \|_{H_{k_\lambda}}, \qquad P, Q \in M_p(\mathcal{X}),    (4)

where, for λ ∈ Λ, µ_P^λ and µ_Q^λ are the mean embeddings of P and Q, respectively, in H_{k_λ}. We call the quantity in (4)
the supremum (or uniform) kernel distance of {k_λ : λ ∈ Λ}. Also, the uniform kernel trick refers to the overall idea of using (4) to eliminate the parameter in kernel-based statistics. Observe that d_k in (3) is a particular case of d_{k,Λ} in (4) when Λ is a single-element set. Therefore, all the results in this work apply to usual kernel distances as well. In addition, in the family {k_λ : λ ∈ Λ} we can include kernels from different parametric families, which would generate more robust test statistics that might work well under many types of alternatives.
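To make this last remark concrete, a heterogeneous family {k_λ : λ ∈ Λ} can be represented in code simply as a collection of Gram-matrix callables. The sketch below is ours; it mixes Gaussian kernels (2) with inverse quadratic kernels, and the particular parametrisations and grids are illustrative assumptions, not choices taken from the paper or from [46].

```python
# A heterogeneous kernel family {k_lambda : lambda in Lambda}, mixing Gaussian
# and (illustratively parametrised) inverse quadratic kernels; each entry maps
# two samples to the corresponding Gram matrix.
import numpy as np

def sq_dists(X, Y):
    """Pairwise squared Euclidean distances between the rows of X and Y."""
    return np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T

def gaussian_gram(lam):
    # k_lambda(x, y) = exp(-lam * ||x - y||^2), as in (2)
    return lambda X, Y: np.exp(-lam * sq_dists(X, Y))

def inverse_quadratic_gram(c):
    # k_c(x, y) = 1 / (1 + c * ||x - y||^2)   (assumed parametrisation)
    return lambda X, Y: 1.0 / (1.0 + c * sq_dists(X, Y))

kernel_family = {
    **{("gauss", lam): gaussian_gram(lam) for lam in (0.1, 1.0, 10.0)},
    **{("invquad", c): inverse_quadratic_gram(c) for c in (0.1, 1.0, 10.0)},
}
print(list(kernel_family))   # the index set playing the role of Lambda
```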
The supremum kernel distance (4) entails several advantages and some mathematical challenges: First, the kernel
selection problem is considerably simplified and solved in a natural way. Additionally, the approach is general enough
to be applied in infinite-dimensional settings such as FDA. This is interesting since in FDA there are only a few homogeneity tests in the literature. Some of them have been developed in the setting of ANOVA models (involving several
samples) under homoscedasticity (equal covariance operators of the involved processes) and Gaussian assumptions.
Hence, the current methodologies amount to testing the null hypothesis of equal means in all the populations; see,
e.g., [16] for an early contribution and [52] for a broader perspective. Our proposal is therefore quite related to more
general approaches, not requiring any homoscedasticity assumption and still valid in an FDA framework. Examples
of such similar tests are [29] and [34], as well as the random projections-based methodology in [15].
The inclusion of the supremum in (4) represents an additional difficulty. The asymptotic properties of the test statistic based on (4) are derived by following a different strategy from that of [10] and later works. The methodology proposed here allows us to cope with the supremum and applies directly to the case of unequal sample sizes. In short, our approach can be described as follows: First, we consider plug-in estimators of the kernel distances, obtained by replacing the unknown distributions by their empirical counterparts. Then, we use the powerful theory of empirical processes together with some recent results on the differentiability of the supremum (see [14]) and functional Karhunen-Loève expansions of the underlying processes. These developments entail several technical difficulties from the mathematical point of view. However, they are worthwhile since they allow us to analyze the asymptotic behavior, under both the null and the alternative hypothesis, of the two-sample test based on (4).
The organization of this paper
In Section 2 we provide some preliminaries regarding RKHS basics and empirical processes. While most of this
background is well-known or can be found in the literature, it is included here to introduce the necessary notation
and make the paper as self-contained as possible. Section 3 contains the main theoretical contributions. First, we
obtain a Donsker property for unions of unit balls in RKHS that could be of independent interest. We establish the
asymptotic validity under the null hypothesis of the two-sample test based on the distance (4). The asymptotic statistical power (i.e., the behaviour under the alternative hypothesis of non-homogeneity) is also analysed. An empirical
study, comparing the uniform kernel test with some other competitors, is presented in Section 4. In the scenarios we have considered, the supremum kernel distance (SKD) test is competitive with the other kernel-based methods, especially in the case of heteroscedastic
populations. However, given the limited nature of the study, we cannot conclude that our proposal unequivocally
outperforms existing approaches.
2. Preliminaries
In this section we describe various tools that we use throughout this work.
Reproducing kernel Hilbert spaces (RKHS)
The theory of RKHS plays a relevant role in this paper. This is a classical and well-known topic; see [35, Appendix
F] for a brief account of the RKHS theory and [6] or [21] for a statistical perspective. Hence, we only mention what
is strictly necessary for later use. Let 𝒳 be a topological space and k : 𝒳 × 𝒳 → R a kernel, that is, a symmetric and positive semi-definite function. Let us consider H_k^0, the pre-Hilbert space of all finite linear combinations g(·) = ∑_{i=1}^n α_i k(x_i, ·) (with α_i ∈ R, n ∈ N and x_i ∈ 𝒳), endowed with the inner product

\left\langle \sum_{i=1}^{n} \alpha_i k(x_i, \cdot), \, \sum_{j=1}^{m} \beta_j k(x_j, \cdot) \right\rangle_{H_k} = \sum_{i,j} \alpha_i \beta_j \, k(x_i, x_j).    (5)
The RKHS H_k is defined as the completion of H_k^0; see [6, Chapter 1]. The inner product ⟨·, ·⟩_{H_k} in H_k is obtained through (5) in such a way that bilinearity is preserved. A key property of RKHS is the so-called reproducing property:

\langle f, k(x, \cdot) \rangle_{H_k} = f(x), \qquad \text{for all } f \in H_k, \ x \in \mathcal{X}.    (6)
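As a quick numerical sanity check (ours, not taken from the paper), the following sketch builds a function f in H_k^0 for a Gaussian kernel on the real line, computes its norm through (5), and verifies the reproducing property (6) as well as the resulting bound |f(x)| ≤ ‖f‖_{H_k} √k(x, x); all concrete values are arbitrary.

```python
# Numerical illustration of the inner product (5) and the reproducing property (6)
# for a Gaussian kernel on R; nodes, coefficients and evaluation point are arbitrary.
import numpy as np

def k(x, y, lam=1.0):
    return np.exp(-lam * np.subtract.outer(x, y) ** 2)

rng = np.random.default_rng(2)
nodes = rng.normal(size=6)                         # x_1, ..., x_n
alpha = rng.normal(size=6)                         # alpha_1, ..., alpha_n

f = lambda x: alpha @ k(nodes, x)                  # f = sum_i alpha_i k(x_i, .)
norm_f = np.sqrt(alpha @ k(nodes, nodes) @ alpha)  # ||f||_{H_k}, computed with (5)

x0 = 0.3
inner = alpha @ k(nodes, x0)                       # <f, k(x0, .)>_{H_k}, again via (5)
print(np.isclose(inner, f(x0)))                    # reproducing property (6): True
print(abs(f(x0)) <= norm_f * np.sqrt(k(x0, x0)))   # evaluation is bounded (Cauchy-Schwarz)
```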
Kernel distances as integral probability metrics
Each P ∈ M_p(𝒳) (Borel probability measure on 𝒳) can be seen as a linear functional on H_k via the mapping

f \in H_k \mapsto P(f) = \int_{\mathcal{X}} f \, \mathrm{d}P,    (7)
whenever H_k ⊂ L¹(P) (the set of integrable functions with respect to P). This condition is also equivalent to saying that the function x ↦ k(x, ·) is Pettis integrable (with respect to P) and to the existence of the mean embedding µ_P in (1) as an element of H_k fulfilling the Riesz representation condition

P(f) = \langle f, \mu_P \rangle_{H_k}, \qquad \text{for } f \in H_k.    (8)

Sufficient conditions guaranteeing the injectivity of the mean embedding transformation can be found in [27]. Note that in (7) (and what follows) we use the standard notation in empirical processes theory: P(f) (or simply P f) stands for the mathematical expectation of f with respect to P.
The existence of the mean embedding implies that the kernel distance in (3), as well as the supremum kernel distance in (4), is well-defined. Indeed, they are integral probability metrics; see [39]. To see this, let us consider the unit ball of H_k, that is,

F_k = \{ f \in H_k : \| f \|_{H_k} \le 1 \}.    (9)
We have that

\| \mu_P - \mu_Q \|_{H_k} = \sup_{f \in F_k} \langle f, \mu_P - \mu_Q \rangle_{H_k}
  \overset{(a)}{=} \sup_{f \in F_k} \Big\langle f, \int_{\mathcal{X}} k(\cdot, x) \, \mathrm{d}(P - Q)(x) \Big\rangle_{H_k}
  \overset{(b)}{=} \sup_{f \in F_k} \int_{\mathcal{X}} \langle f, k(\cdot, x) \rangle_{H_k} \, \mathrm{d}(P - Q)(x)
  \overset{(c)}{=} \sup_{f \in F_k} (P(f) - Q(f)),    (10)

where (a) follows from the definition of the mean embedding (1), (b) from Pettis integrability, and (c) from the reproducing property (6); see also [12, Lemma 4]. Thus, the kernel distance (3) is the integral probability metric generated
by the class F_k in (9). Therefore, the supremum kernel distance (4) admits the alternative representation

d_{k,\Lambda}(P, Q) = \sup_{f \in F_{k,\Lambda}} (P(f) - Q(f)), \qquad \text{with } F_{k,\Lambda} = \bigcup_{\lambda \in \Lambda} F_{k_\lambda},    (11)

where F_{k_λ} is the unit ball in the RKHS associated with k_λ. In other words, d_{k,Λ} is the integral probability metric defined through the union of the unit balls of the whole family of RKHS constructed with {k_λ : λ ∈ Λ}.
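The representation (10) can also be checked numerically. With empirical measures, the supremum over the unit ball F_k is attained at the normalized “witness” function f* = (µ_P − µ_Q)/‖µ_P − µ_Q‖_{H_k}. The sketch below is our own illustration (arbitrary samples, Gaussian kernel, illustrative names): it verifies that P_n(f*) − Q_m(f*) coincides with the closed-form distance from (3).

```python
# Check (ours) that the supremum in (10) is attained at the witness function
# f* = (mu_P - mu_Q) / ||mu_P - mu_Q||, for empirical measures and a Gaussian kernel.
import numpy as np

lam = 1.0
def gram(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-lam * d2)

rng = np.random.default_rng(3)
X = rng.normal(0.0, 1.0, size=(150, 2))    # sample from P
Y = rng.normal(1.0, 1.0, size=(100, 2))    # sample from Q

# closed-form empirical kernel distance, as in (3)
d = np.sqrt(gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean())

# witness f*(z) = (mu_Pn(z) - mu_Qm(z)) / d, written with the mean embeddings (1)
witness = lambda Z: (gram(Z, X).mean(axis=1) - gram(Z, Y).mean(axis=1)) / d

# P_n(f*) - Q_m(f*) equals the distance d, in agreement with (10)
print(witness(X).mean() - witness(Y).mean(), d)
```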
From the characterizations as integral probability metrics in (10) and (11), we conclude that d_k and d_{k,Λ} satisfy the properties of a pseudo-metric (non-negativity, symmetry, triangle inequality). However, to ensure the identifiability property of a metric d (i.e., d(P, Q) = 0 if and only if P = Q), additional conditions are needed. It can be checked that when 𝒳 = R^d, identifiability is satisfied for the usual kernels (such as the Gaussian kernel in (2)). However, when 𝒳 is infinite-dimensional, results of this type are more complicated; see [19] for a deep study of this topic for the Gaussian kernel (2). More details can also be found in [26] and [27].
Plug-in estimators, empirical processes and Donsker classes of functions
A simple and natural estimator of the supremum kernel distance (4) can be obtained by applying the plug-in
principle in (11). Given two independent samples X_1, ..., X_n and Y_1, ..., Y_m from P and Q, respectively, we replace the unknown underlying probability measures P and Q with the observed empirical counterparts,

P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}, \qquad Q_m = \frac{1}{m} \sum_{j=1}^{m} \delta_{Y_j},
δ_a being the unit point mass at a. This leads to the estimator of d_{k,Λ}(P, Q) in (11) given by

d_{k,\Lambda}(P_n, Q_m) = \sup_{f \in F_{k,\Lambda}} (P_n(f) - Q_m(f)) = \sup_{f \in F_{k,\Lambda}} \left( \frac{1}{n} \sum_{i=1}^{n} f(X_i) - \frac{1}{m} \sum_{j=1}^{m} f(Y_j) \right).    (12)
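Computationally, when the continuous parameter set Λ is approximated by a finite grid, the supremum in (12) reduces to a maximum of closed-form empirical kernel distances, one per value of λ. The sketch below is our own minimal illustration (not the authors' implementation); the Gaussian family, the grid and the sample sizes are arbitrary choices.

```python
# Minimal sketch of the plug-in supremum kernel distance (12) over a finite grid
# of Gaussian bandwidths: a maximum of closed-form empirical kernel distances.
import numpy as np

def sq_dists(X, Y):
    return np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T

def empirical_kernel_distance(X, Y, lam):
    """Plug-in version of d_{k_lambda}(P, Q), cf. (3) with P_n and Q_m."""
    kXX = np.exp(-lam * sq_dists(X, X))
    kYY = np.exp(-lam * sq_dists(Y, Y))
    kXY = np.exp(-lam * sq_dists(X, Y))
    return np.sqrt(max(kXX.mean() + kYY.mean() - 2 * kXY.mean(), 0.0))

def sup_kernel_distance(X, Y, lambdas):
    """Plug-in supremum kernel distance d_{k,Lambda}(P_n, Q_m), Lambda a finite grid."""
    return max(empirical_kernel_distance(X, Y, lam) for lam in lambdas)

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(200, 10))       # sample from P
Y = rng.normal(0.0, 1.3, size=(150, 10))       # sample from Q: same mean, larger scale
grid = np.logspace(-3, 3, 13)                  # illustrative finite Lambda
print(sup_kernel_distance(X, Y, grid))
```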
As a supremum over a class of functions is involved in (12), the theory of empirical processes comes into play
naturally. Given a collection of functions F, we recall that the F-indexed empirical process (associated to P) is G_n^P = √n (P_n − P). The class F is called P-Donsker if G_n^P ⇝ G_P in ℓ^∞(F), the space of bounded real functionals defined on F with the supremum norm; see [48]. Here, ‘⇝’ stands for weak convergence in ℓ^∞(F) and G_P is a P-Brownian bridge, that is, a zero-mean Gaussian process with covariance function

\mathrm{E}[G_P(f_1) \, G_P(f_2)] = P(f_1 f_2) - P(f_1) \, P(f_2), \qquad f_1, f_2 \in F.

Additionally, F is universal Donsker if it is P-Donsker, for every P ∈ M_p(𝒳).
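To get a feeling for this definition, the covariance structure of the limiting P-Brownian bridge can be reproduced by simulation. The following sketch is our own illustration (the functions f_1, f_2 and the choice P = N(0, 1) are arbitrary): the Monte Carlo covariance of the empirical process evaluated at f_1 and f_2 approaches P(f_1 f_2) − P(f_1) P(f_2), which equals E[X⁴] = 3 in this example.

```python
# Monte Carlo illustration (ours) of the P-Brownian bridge covariance for
# f1(x) = x, f2(x) = x^3 and P = N(0, 1): P(f1 f2) - P(f1) P(f2) = E[X^4] = 3.
import numpy as np

rng = np.random.default_rng(5)
f1 = lambda x: x
f2 = lambda x: x ** 3
n, reps = 2000, 4000

G1, G2 = [], []
for _ in range(reps):
    x = rng.normal(size=n)                     # sample from P = N(0, 1)
    # P(f1) = E[X] = 0 and P(f2) = E[X^3] = 0 for the standard normal
    G1.append(np.sqrt(n) * f1(x).mean())
    G2.append(np.sqrt(n) * f2(x).mean())

print(np.cov(G1, G2)[0, 1])                    # close to 3 = E[X^4]
```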