Variance-Aware Estimation of
Kernel Mean Embedding
Geoffrey Wolfer†1 and Pierre Alquier‡2
1Waseda University, Center for Data Science
2ESSEC Business School, Asia-Pacific campus
Abstract
An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.
Keywords— kernel mean embedding, maximum mean discrepancy, distribution learning, empirical Bernstein
Contents
1 Introduction  2
2 Variance-aware convergence rates  4
3 Convergence rates with empirical variance proxy  7
  3.1 Intuition in the hypercube  7
  3.2 Systematic approach  8
  3.3 Computability of the empirical variance proxy  9
4 Convergence rates for the difference of two means  9
  4.1 First approach: estimation by a U-statistic  9
  4.2 Variance-aware control of the fluctuations for each sample  10
5 Convergence rates with time-dependent data  11
  5.1 For ϕ-mixing processes  11
  5.2 For β-mixing processes  12
6 Applications  12
  6.1 Hypothesis testing  13
    6.1.1 Goodness of fit test  13
    6.1.2 Two-sample test  14
  6.2 Robust parametric estimation under Huber contamination  16
    6.2.1 Improved confidence bounds in the Huber contamination model  16
    6.2.2 Improved confidence bounds in the parameter space  17
†Email: geo.wolfer@aoni.waseda.jp. Part of this research was supported by the Special Postdoctoral Researcher (SPDR) program of RIKEN.
‡Email: alquier@essec.edu.
arXiv:2210.06672v2 [math.ST] 31 Oct 2024
7 Proofs  19
  7.1 Proof of Theorem 2.1  19
  7.2 Proof of Lemma 3.3  21
  7.3 Proof of Lemma 3.4  21
  7.4 Proof of Theorem 3.1  22
  7.5 Proof of Lemma 5.1  23
  7.6 Proof of Theorem 5.1  23
  7.7 Proof of Theorem 6.1  24
  7.8 Proof of Theorem 6.2  25
  7.9 Proof of Lemma 6.1  26
A Extension to non translation invariant kernels  28
B Additional proofs  29
  B.1 Proof of Lemma 3.1  29
  B.2 Proof of Lemma 3.2  30
  B.3 Proof of Theorem 5.2  30
  B.4 Proof of Lemma A.1  31
  B.5 Proof of Lemma A.2  32
  B.6 Proof of Theorem A.1  33
  B.7 Proof of Lemma B.1  33
C Tools from the literature  34
1 Introduction
Estimating a probability distribution $P$ over a space $\mathcal{X}$ from $n$ independent samples $X_1, \dots, X_n \sim P$ is a central problem in computer science and statistics [Kearns et al., 1994, Tsybakov, 2008]. To formalize the question, one selects a distance (or at least a contrast function) between distributions and oftentimes introduces assumptions on the underlying probability space (e.g. finitely supported, probability function absolutely continuous with respect to the Lebesgue measure, Hölder continuous, . . . ). Increasing the stringency of assumptions generally leads to substantially faster minimax rates. For instance, in the finitely supported case $|\mathcal{X}| < \infty$ and with respect to total variation, whilst the expectation risk evolves roughly as $\sqrt{|\mathcal{X}|/n}$, it is known that this rate can be sharpened by replacing $|\mathcal{X}|$ with a bound on the "half-norm"$^1$ $\|P\|_{1/2} := \big(\sum_{x \in \mathcal{X}} \sqrt{P(x)}\big)^2$ [Berend and Kontorovich, 2013], when
defined, and which corresponds to some measure of entropy of the underlying distribution. Furthermore, even
without prior knowledge of $\|P\|_{1/2}$, one can construct confidence intervals around $P$ that depend on $\|\widehat{P}_n\|_{1/2}$, the half-norm of the empirical distribution $\widehat{P}_n(x) = \frac{1}{n}\sum_{t=1}^{n} \delta[X_t = x]$ [Cohen et al., 2020, Theorem 2.1]. Namely, when $P$ is supported on $\mathbb{N}$, with probability at least $1 - \delta$, it holds that
$$\big\|\widehat{P}_n - P\big\|_{\mathrm{TV}} \leq \frac{1}{\sqrt{n}}\left(\sqrt{\big\|\widehat{P}_n\big\|_{1/2}} + 3\sqrt{\tfrac{1}{2}\log(2/\delta)}\right). \tag{1}$$
The advantage of the above expression is twofold: (a) we do not need to make assumptions on $\|P\|_{1/2}$, which could be non-convergent$^2$, and (b) in favorable cases where $\|\widehat{P}_n\|_{1/2}$ is small, the intervals will be narrower. In this paper, we set out to explore the question of whether analogues of (1) are possible for general probability spaces and with respect to the maximum mean discrepancy.
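For concreteness, the right-hand side of (1) is straightforward to evaluate from data. The following minimal Python sketch (ours; the function names are illustrative and are not taken from Cohen et al. [2020]) computes the fully data-driven radius for a discrete sample.

```python
import numpy as np

def empirical_half_norm(sample):
    """Half-norm of the empirical distribution: (sum_x sqrt(Phat_n(x)))^2."""
    _, counts = np.unique(sample, return_counts=True)
    p_hat = counts / counts.sum()
    return np.sum(np.sqrt(p_hat)) ** 2

def tv_confidence_radius(sample, delta=0.05):
    """Right-hand side of (1): a data-driven radius for ||Phat_n - P||_TV."""
    n = len(sample)
    return (np.sqrt(empirical_half_norm(sample))
            + 3.0 * np.sqrt(0.5 * np.log(2.0 / delta))) / np.sqrt(n)

# Example: a geometric distribution on N has a small half-norm, so the
# data-driven radius shrinks at the n^{-1/2} rate with a small leading constant.
rng = np.random.default_rng(0)
sample = rng.geometric(p=0.3, size=10_000)
print(tv_confidence_radius(sample, delta=0.05))
```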
Notation and background. The set of integers up to $n \in \mathbb{N} := \{1, 2, \dots\}$ is denoted by $[n]$. Let $\mathcal{X}$ be a separable topological space, and $\mathcal{P}(\mathcal{X})$ the set of all Borel probability measures over $\mathcal{X}$. For a bounded function $f \colon \mathcal{X} \to \mathbb{R}$, we define for convenience
$$\overline{f} := \sup_{x \in \mathcal{X}} f(x), \qquad \underline{f} := \inf_{x \in \mathcal{X}} f(x), \qquad \Delta f := \overline{f} - \underline{f}. \tag{2}$$
Let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous positive definite kernel and $\mathcal{H}_k$ be the associated reproducing kernel Hilbert space (RKHS) [Berlinet and Thomas-Agnan, 2011]. We assume that the kernel is bounded$^3$ in the sense that $\sup_{x \in \mathcal{X}} k(x, x) < \infty$. Letting $P \in \mathcal{P}(\mathcal{X})$, the kernel mean embedding (KME) of $P$ is defined as
$$\mu_P := \mathbb{E}_{X \sim P}[k(X, \cdot)] = \int_{\mathcal{X}} k(x, \cdot)\, \mathrm{d}P(x) \in \mathcal{H}_k,$$
$^1$The half-norm is not a norm in the proper sense as it violates the triangle inequality.
$^2$As opposed to the empirical proxy $\|\widehat{P}_n\|_{1/2}$, which is always a finite quantity.
$^3$Note that the supremum of the kernel is always reached on the diagonal, that is, $\sup_{(x, x') \in \mathcal{X}^2} k(x, x') = \sup_{x \in \mathcal{X}} k(x, x)$.
which is interpreted as a Bochner integral [Diestel and Uhl, 1977, Chapter 2]. Let $n \in \mathbb{N}$ and $X_1, \dots, X_n$ be a sequence of observations sampled independently$^4$ from $P$. We write
$$\widehat{\mu}_P(X_1, \dots, X_n) = \widehat{\mu}_P := \frac{1}{n} \sum_{t=1}^{n} k(X_t, \cdot) \in \mathcal{H}_k, \tag{3}$$
for the KME of the empirical distribution $\widehat{P}_n := \frac{1}{n} \sum_{t=1}^{n} \delta_{X_t}$, and call the distance $\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}$ the maximum mean discrepancy (MMD) between the true mean embedding and its empirical estimator. A kernel is called characteristic if the map $\mu \colon P \mapsto \mu_P$ is injective. This property ensures that $\|\mu_P - \mu_{P'}\|_{\mathcal{H}_k} = 0$ if and only if $P = P'$. A kernel is called translation invariant$^5$ (TI) when there exists a positive definite function $\psi \colon \mathcal{X} \to \mathbb{R}$ such that for any $x, x' \in \mathcal{X}$,
$$k(x, x') = \psi(x - x'). \tag{4}$$
In particular, $\overline{\psi} = \psi(0) = \overline{k}$. When $\mathcal{X} = \mathbb{R}^d$, a kernel $k$ is said to be a radial basis function (RBF) when for any $x, x' \in \mathcal{X}$, $k(x, x') = \varphi(\|x - x'\|_2)$ for some function $\varphi \colon \mathbb{R}_+ \to \mathbb{R}$. Noticeably, $k$ being positive definite does not preclude it from taking negative values. However, when $k$ is RBF, the following lower bound on $\varphi$ holds (see e.g. Genton [2001], Stein [1999]),
$$\underline{\varphi} \geq \overline{\varphi}\, \inf_{t \geq 0} \left\{ \left(\frac{2}{t}\right)^{(d-2)/2} \Gamma(d/2)\, J_{(d-2)/2}(t) \right\},$$
where $\Gamma$ is the Gamma function and $J_\beta$ is the Bessel function of the first kind of order $\beta$, showing that $|\underline{\varphi}|$ becomes evanescent as the dimension increases.
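To fix ideas, the empirical embedding (3) and RKHS distances between embeddings are directly computable through the kernel. The sketch below (ours; the Gaussian RBF choice and helper names are for illustration only) evaluates $\widehat{\mu}_P$ pointwise and the squared RKHS distance between the embeddings of two samples via the kernel trick.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix, k(x, y) = exp(-||x - y||^2 / (2 bandwidth^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def empirical_kme(X, bandwidth=1.0):
    """The function x -> mu_hat_P(x) = (1/n) sum_t k(X_t, x), cf. (3)."""
    return lambda x: gaussian_kernel(np.atleast_2d(x), X, bandwidth).mean(axis=1)

def squared_rkhs_distance(X, Y, bandwidth=1.0):
    """||mu_hat_P - mu_hat_Q||^2_{H_k} via the kernel trick (biased V-statistic)."""
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # sample from a shifted distribution Q
mu_hat = empirical_kme(X)
print(mu_hat(np.zeros(2)))          # evaluation of the empirical KME at a point
print(squared_rkhs_distance(X, Y))  # estimates MMD^2(P, Q) > 0: P != Q and the Gaussian kernel is characteristic
```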
Related work. From an asymptotic standpoint, the weak law of large numbers asserts that $\widehat{\mu}_P$ converges to the true $\mu_P$. Furthermore, $\sqrt{n}(\widehat{\mu}_P - \mu_P)$ converges in distribution to a zero mean Gaussian process on $\mathcal{H}_k$ [Berlinet and Thomas-Agnan, 2011, Section 9.1]. This work, however, is more concerned with the finite sample theory, and more specifically with the rate of convergence of $\widehat{\mu}_P$ towards $\mu_P$ with respect to the RKHS norm. Conveniently and perhaps surprisingly, it is possible to derive a rate that depends neither on the smoothness of the considered kernel $k$, nor on the properties of the true distribution. To obtain a distribution independent rate in $\mathcal{O}_P(n^{-1/2})$, a typical strategy (see e.g. Lopez-Paz et al. [2015, Section B.1]) consists in first expressing the dual relationship between the norm in the RKHS and the uniform norm of an empirical process,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} = \sup_{\substack{f \in \mathcal{H}_k \\ \|f\|_{\mathcal{H}_k} \leq 1}} \langle f, \widehat{\mu}_P - \mu_P \rangle_{\mathcal{H}_k} = \sup_{\substack{f \in \mathcal{H}_k \\ \|f\|_{\mathcal{H}_k} \leq 1}} \left( \frac{1}{n} \sum_{t=1}^{n} f(X_t) - \mathbb{E} f \right). \tag{5}$$
A classical symmetrization argument [Mohri et al., 2018] followed by an application of McDiarmid's inequality [McDiarmid, 1989] yields that with probability at least $1 - \delta$,
$$\sup_{\substack{f \in \mathcal{H}_k \\ \|f\|_{\mathcal{H}_k} \leq 1}} \left( \frac{1}{n} \sum_{t=1}^{n} f(X_t) - \mathbb{E} f \right) \leq 2 \mathcal{R}_n + \sqrt{\frac{2 \overline{k} \log(1/\delta)}{n}}, \tag{6}$$
where $\mathcal{R}_n$ is the Rademacher complexity [Mohri et al., 2018, Definition 3.1] of the class of unit-norm functions in the RKHS. The bound $\mathcal{R}_n^2 \leq \overline{k}/n$ [Bartlett and Mendelson, 2002] and an application of Jensen's inequality conclude the claim. With a more careful analysis, Tolstikhin et al. [2017, Proposition A.1] halved the constant of the first term in (6), hence showed that with probability at least $1 - \delta$,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{\overline{k}}{n}} + \sqrt{\frac{2 \overline{k} \log(1/\delta)}{n}}. \tag{7}$$
What is more, Tolstikhin et al. [2017, Theorems 1, 6, 8] provide corresponding lower bounds in $\Omega(n^{-1/2})$, showing that the embedding of the empirical measure achieves an unimprovable rate of convergence. Results similar to (7) (with worse constants) can be derived when the observations are not independent, see Chérief-Abdellatif and Alquier [2022, Lemma 7.2]. We note that other estimators have been proposed in the literature for the case where more information is available about the underlying distribution $P$. See for instance Muandet et al. [2014, 2016], who propose a shrinkage estimator inspired by the James-Stein estimator. In this work, we stress that our estimator will be the KME of the empirical distribution, defined in (3).
$^4$Unless otherwise specified, the data will be considered iid. We will address the case of dependent data in Section 5.
$^5$Also sometimes referred to as an anisotropic stationary kernel.
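As a small numerical companion (our own sketch, not code from the cited works), the distribution-free radii in (6), with $\mathcal{R}_n$ bounded by $\sqrt{\overline{k}/n}$, and in (7) can be evaluated from $n$, $\delta$ and the kernel bound alone; below $\overline{k} = 1$, as for a Gaussian kernel.

```python
import numpy as np

def radius_rademacher(n, delta, k_sup=1.0):
    """Radius from (6), using the bound R_n <= sqrt(k_sup / n)."""
    return 2.0 * np.sqrt(k_sup / n) + np.sqrt(2.0 * k_sup * np.log(1.0 / delta) / n)

def radius_tolstikhin(n, delta, k_sup=1.0):
    """Refined distribution-free radius (7) of Tolstikhin et al. [2017]."""
    return np.sqrt(k_sup / n) + np.sqrt(2.0 * k_sup * np.log(1.0 / delta) / n)

for n in (100, 1_000, 10_000):
    print(n, radius_rademacher(n, 0.05), radius_tolstikhin(n, 0.05))
```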
Main contributions. In this paragraph, we briefly summarize our findings. The tilde notation suppresses constants and logarithmic factors in $1/\delta$, and the reader is referred to Theorem 2.1 and Theorem 3.1 for the full expression of the second order terms. Our first contribution (Theorem 2.1) consists in deriving a bound on $\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}$ that can leverage additional information about the underlying distribution $P$ and the selected kernel $k$. Namely, we show that with probability at least $1 - \delta$,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{2 v_k(P) \log(2/\delta)}{n}} + \widetilde{\mathcal{O}}\left(\frac{1}{n}\right),$$
where $v_k(P)$—defined in (8)—corresponds to some notion of variance in the RKHS. Notably, it holds that $v_k(P) \leq \overline{k}$, hence our upper bound is superior to and able to recover known bounds in most settings (Remark 2.1).
Our chief technical contribution (Theorem 3.1) is in establishing that, at least when $k$ is translation invariant, the above confidence interval can be made independent of the underlying distribution by replacing the variance with an empirical proxy. Specifically, with probability at least $1 - \delta$,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{2 \widehat{v}_k(X_1, \dots, X_n) \log(4/\delta)}{n}} + \widetilde{\mathcal{O}}\left(\frac{1}{n}\right),$$
where $\widehat{v}_k$—defined in (12), (13)—is a quantity that only depends on the chosen kernel and the sample. We also extend the latter result to kernels without the translation invariance property in Theorem A.1.
Expanding beyond the iid setting, we obtain convergence rates for stationary mixing sequences. For ϕ-mixing processes, Theorem 5.1 establishes the following rate of convergence,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{v + \Sigma_n}{n}} + 4\sqrt{\frac{2 v \|\Gamma\|_2 \log(1/\delta)}{n}} + \widetilde{\mathcal{O}}\left(\frac{1}{n}\right),$$
where $\Sigma_n$ is a total measure of covariance between observations in the RKHS (refer to Lemma 5.1) and $\|\Gamma\|_2$ is the spectral norm of a coupling matrix $\Gamma$ defined in Eq. (17). We additionally analyze the broader class of β-mixing processes in Theorem 5.2.
Outline. In Section 2, for the problem of estimating a distribution with respect to maximum mean discrepancy, we give a convergence rate with a dominant term in $\mathcal{O}_P(n^{-1/2})$ involving a variance term $v$ which depends on both the chosen kernel $k$ and the underlying distribution $P$. We then illustrate how the rate can fall below minimax lower bounds when $v$ is favorable. In particular, we show that for large collections of kernels, and when $\mathcal{X} \subseteq \mathbb{R}^d$, the variance $v$ can be controlled by a quantity that decouples the influence of the kernel and a measure of total variance of $P$. In Section 3, we proceed to show that even if $v$ is unknown, it is possible to efficiently estimate it from the data with an "empirical Bernstein" approach whenever the kernel is translation invariant, or at least enjoys a mildly varying diagonal. In Section 4, we illustrate the benefits of the variance-aware bounds on the classical two-sample test problem. In Section 5, we establish convergence rates for stationary ϕ- and β-mixing sequences. In Section 6, we put our methods into practice, first in the context of hypothesis testing, and second by improving the results of Briol et al. [2019] and Chérief-Abdellatif and Alquier [2022] in the context of robust parametric maximum mean discrepancy estimation. All proofs are deferred to Section 7 for clarity of the exposition.
2 Variance-aware convergence rates
The central quantity we propose to consider is the following variance term in the RKHS,
$$v_k(P) := \mathbb{E}_{X \sim P} \left\| k(X, \cdot) - \mu_P \right\|_{\mathcal{H}_k}^2. \tag{8}$$
It is clear that $v_k(P)$ depends both on the choice of kernel $k$ and on the underlying distribution, and we will simply write $v = v_k(P)$ to avoid encumbering notation. Simple calculations (refer to Lemma 5.1) show that
$$\mathbb{E}\, \|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}^2 = \frac{v}{n},$$
where the expectation is taken with respect to an independent sample $X_1, \dots, X_n \sim P$, thus by applications of Jensen's and Chebyshev's inequalities, we readily obtain a rate on the expected risk in terms of $v$,
$$\mathbb{E}\, \|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{v}{n}},$$
and a deviation bound
$$\mathbb{P}\left( \|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} > (1 + \tau)\sqrt{\frac{v}{n}} \right) \leq \frac{1}{(1 + \tau)^2}, \qquad \tau \in (0, \infty).$$
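For the reader's convenience, the computation behind the identity above can be sketched in one line (our own expansion under the iid assumption; the general statement, which also covers dependent data, is Lemma 5.1):
$$\mathbb{E}\,\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}^2 = \frac{1}{n^2} \sum_{s, t = 1}^{n} \mathbb{E}\,\big\langle k(X_s, \cdot) - \mu_P,\; k(X_t, \cdot) - \mu_P \big\rangle_{\mathcal{H}_k} = \frac{1}{n^2} \sum_{t = 1}^{n} \mathbb{E}\,\big\| k(X_t, \cdot) - \mu_P \big\|_{\mathcal{H}_k}^2 = \frac{v}{n},$$
since the cross terms with $s \neq t$ vanish by independence and the fact that $\mathbb{E}[k(X_t, \cdot)] = \mu_P$.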
One basic feature of RKHSs is that new kernels can be constructed by linearly combining existing kernels—see for instance Gretton [2013]. It is therefore noteworthy that $v$ is linear with respect to the reproducing kernel. Indeed, rewriting
$$v = \mathbb{E}_{X \sim P}[k(X, X)] - \mathbb{E}_{X, X' \sim P}\left[k(X, X')\right],$$
and supposing that $k = \sum_{k_i \in \mathcal{K}} \alpha_i k_i$, where $\mathcal{K} = \{k_1, \dots, k_{|\mathcal{K}|}\}$ is a collection of reproducing kernels and $\alpha \in \mathbb{R}^{|\mathcal{K}|}$, it then holds that
$$v_k = \sum_{k_i \in \mathcal{K}} \alpha_i v_{k_i}.$$
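A simple plug-in estimate of $v$ (ours, and distinct from the empirical proxy $\widehat{v}_k$ constructed in Section 3) follows directly from the reformulation above; the snippet also checks the linearity numerically for a two-kernel mixture.

```python
import numpy as np

def gaussian_kernel(bandwidth):
    def k(X, Y):
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
        return np.exp(-sq / (2.0 * bandwidth**2))
    return k

def plug_in_variance(X, kernel):
    """Naive plug-in estimate of v = E[k(X, X)] - E[k(X, X')] (V-statistic form)."""
    K = kernel(X, X)
    return np.mean(np.diag(K)) - np.mean(K)

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 3))

k1, k2 = gaussian_kernel(0.5), gaussian_kernel(2.0)
alpha1, alpha2 = 0.3, 0.7
k_mix = lambda A, B: alpha1 * k1(A, B) + alpha2 * k2(A, B)

# Linearity: v_{a1 k1 + a2 k2} = a1 v_{k1} + a2 v_{k2}, up to floating point error.
print(plug_in_variance(X, k_mix))
print(alpha1 * plug_in_variance(X, k1) + alpha2 * plug_in_variance(X, k2))
```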
Moving on to high-probability confidence bounds, an application of Bernstein's inequality in Hilbert spaces [Pinelis and Sakhanenko, 1985, Yurinsky, 1995] not only recovers the rate of convergence, as alluded to by Tolstikhin et al. [2017], but in fact yields a maximal inequality.

Theorem 2.1 (Variance-aware confidence interval). Let $X_1, \dots, X_n \overset{\mathrm{iid}}{\sim} P$, let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel, and let
$$v := \mathbb{E}_{X \sim P} \left\| k(X, \cdot) - \mu_P \right\|_{\mathcal{H}_k}^2.$$
With probability at least $1 - \delta$, it holds that
$$\max_{1 \leq t \leq n} \left\{ t\, \|\widehat{\mu}_P(X_1, \dots, X_t) - \mu_P\|_{\mathcal{H}_k} \right\} \leq \sqrt{2 v n \log(2/\delta)} + \frac{4}{3} \sqrt{\overline{k}} \log(2/\delta),$$
where $\overline{k} = \sup_{x \in \mathcal{X}} k(x, x)$. In particular, with probability at least $1 - \delta$, it holds that
$$\|\widehat{\mu}_P(X_1, \dots, X_n) - \mu_P\|_{\mathcal{H}_k} \leq \mathcal{B}_{k, \delta}(P, n),$$
with
$$\mathcal{B}_{k, \delta}(P, n) = \mathcal{B}_\delta := \sqrt{\frac{2 v \log(2/\delta)}{n}} + \frac{4}{3} \cdot \frac{\sqrt{\overline{k}} \log(2/\delta)}{n}.$$
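The radius $\mathcal{B}_{k, \delta}(P, n)$ is immediate to compute once $v$ is available; the short sketch below (ours, taking $v$ as a known input rather than estimating it) compares it with the distribution-free radius (7).

```python
import numpy as np

def variance_aware_radius(n, delta, v, k_sup=1.0):
    """B_{k, delta}(P, n) from Theorem 2.1."""
    return (np.sqrt(2.0 * v * np.log(2.0 / delta) / n)
            + (4.0 / 3.0) * np.sqrt(k_sup) * np.log(2.0 / delta) / n)

def distribution_free_radius(n, delta, k_sup=1.0):
    """Radius (7), which uses no information about P."""
    return np.sqrt(k_sup / n) + np.sqrt(2.0 * k_sup * np.log(1.0 / delta) / n)

n, delta, k_sup = 1_000, 0.05, 1.0
for v in (1.0, 0.1, 0.01):   # v <= k_sup always holds; a small v gives a much tighter radius
    print(v, variance_aware_radius(n, delta, v, k_sup), distribution_free_radius(n, delta, k_sup))
```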
Remark 2.1. Since the reproducing kernel is bounded, it is always the case that
$$v \leq \mathbb{E}_{X \sim P} \left\| k(X, \cdot) \right\|_{\mathcal{H}_k}^2 = \mathbb{E}_{X \sim P}[k(X, X)] \leq \overline{k},$$
where the first inequality can be found e.g. in [Chérief-Abdellatif and Alquier, 2022, Lemma 7.1, Proof]. As a result, Theorem 2.1 strictly supersedes (7) at least when
$$n > \frac{16}{9} \left( \frac{\log(2/\delta)}{1 + \sqrt{2 \log(1/\delta)} - \sqrt{2 \log(2/\delta)}} \right)^{2}.$$
For instance, for $\delta = 0.05$, it suffices that $n \geq 46$.
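A quick numerical check of this threshold (based on our reading of the comparison, in which the worst case $v = \overline{k}$ is substituted into $\mathcal{B}_\delta$ and the result is compared with (7)):

```python
import numpy as np

def supersession_threshold(delta):
    """Sample size beyond which the worst-case (v = k_sup) radius of
    Theorem 2.1 is no larger than the distribution-free radius (7)."""
    num = np.log(2.0 / delta)
    den = 1.0 + np.sqrt(2.0 * np.log(1.0 / delta)) - np.sqrt(2.0 * np.log(2.0 / delta))
    return (16.0 / 9.0) * (num / den) ** 2

print(supersession_threshold(0.05))   # ~45.2, hence n >= 46 suffices for delta = 0.05
```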
Remark 2.2. The Bernstein approach, in contrast to bounded differences, has the additional advantage of yielding a maximal
inequality, further opening the door for early stopping methods.
Remark 2.3 (Sharpness of the constants). We note but do not pursue that there exist other families of concentration inequalities which are known to dominate Bennett–Bernstein-type inequalities. For instance, the class of inequalities pioneered by Bentkus et al. [2006], with empirical counterparts derived by Kuchibhotla and Zheng [2021], which is known to be nearly optimal for sample averages of independent observations from a log-concave probability distribution. We also mention the family of inequalities obtained by Waudby-Smith and Ramdas [2024] based on martingale methods. However, our problem requires a bound for norms of sums of vectors in Hilbert spaces, or alternatively for the supremum of empirical processes, which we could not locate in the aforementioned references. Additionally, even in the simpler case of sample averages, computation of the bound requires effort, thus a concentration bound in the more complicated Hilbert space setting could be challenging to compute. Finally, Bernstein-type bounds have known extensions to the case of time-dependent data, enabling us to analyze the estimation problem from stationary mixing sequences of observations (refer to Section 5).
We immediately observe that:
(O1) While the bound in (7) depends solely on chosen quantities and is computable without any knowledge of $P$, this is not the case for $\mathcal{B}_\delta$ in Theorem 2.1.
(O2) Perhaps even more concerning, it is a priori unclear how $v$ depends on properties of $k$ and $P$, thus making it difficult to convert assumptions on $P$ into a bound for $v$.
We defer (O1) to Section 3 and first address (O2) by pointing out that when $\mathcal{X} = \mathbb{R}^d$, and for numerous hyper-parametrized families of kernels, it is possible to promptly obtain upper bounds on $v$ that decouple the influence of the hyper-parameter and some measure of spread of the underlying distribution.