Variance-Aware Estimation of
Kernel Mean Embedding
Geoffrey Wolfer†1 and Pierre Alquier‡2
1Waseda University, Center for Data Science
2ESSEC Business School, Asia-Pacific campus
Abstract
An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed up convergence by leveraging variance information in the reproducing kernel Hilbert space. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution agnostic bound that enjoys acceleration in fortuitous settings. We further extend our results from independent data to stationary mixing sequences and illustrate our methods in the context of hypothesis testing and robust parametric estimation.
Keywords— kernel mean embedding, maximum mean discrepancy, distribution learning, empirical Bernstein
Contents
1 Introduction  2
2 Variance-aware convergence rates  4
3 Convergence rates with empirical variance proxy  7
  3.1 Intuition in the hypercube  7
  3.2 Systematic approach  8
  3.3 Computability of the empirical variance proxy  9
4 Convergence rates for the difference of two means  9
  4.1 First approach: estimation by a U-statistic  9
  4.2 Variance-aware control of the fluctuations for each sample  10
5 Convergence rates with time-dependent data  11
  5.1 For ϕ-mixing processes  11
  5.2 For β-mixing processes  12
6 Applications  12
  6.1 Hypothesis testing  13
    6.1.1 Goodness of fit test  13
    6.1.2 Two-sample test  14
  6.2 Robust parametric estimation under Huber contamination  16
    6.2.1 Improved confidence bounds in the Huber contamination model  16
    6.2.2 Improved confidence bounds in the parameter space  17
†Email: geo.wolfer@aoni.waseda.jp. Part of this research was supported by the Special Postdoctoral Researcher (SPDR) program of RIKEN.
‡Email: alquier@essec.edu.
arXiv:2210.06672v2 [math.ST] 31 Oct 2024
7 Proofs  19
  7.1 Proof of Theorem 2.1  19
  7.2 Proof of Lemma 3.3  21
  7.3 Proof of Lemma 3.4  21
  7.4 Proof of Theorem 3.1  22
  7.5 Proof of Lemma 5.1  23
  7.6 Proof of Theorem 5.1  23
  7.7 Proof of Theorem 6.1  24
  7.8 Proof of Theorem 6.2  25
  7.9 Proof of Lemma 6.1  26
A Extension to non translation invariant kernels  28
B Additional proofs  29
  B.1 Proof of Lemma 3.1  29
  B.2 Proof of Lemma 3.2  30
  B.3 Proof of Theorem 5.2  30
  B.4 Proof of Lemma A.1  31
  B.5 Proof of Lemma A.2  32
  B.6 Proof of Theorem A.1  33
  B.7 Proof of Lemma B.1  33
C Tools from the literature  34
1 Introduction
Estimating a probability distribution $P$ over a space $\mathcal{X}$ from $n$ independent samples $X_1, \dots, X_n \sim P$ is a central problem in computer science and statistics [Kearns et al., 1994, Tsybakov, 2008]. To formalize the question, one selects a distance (or at least a contrast function) between distributions and oftentimes introduces assumptions on the underlying probability space (e.g. finitely supported, probability function absolutely continuous with respect to the Lebesgue measure, Hölder continuous, . . . ). Increasing the stringency of assumptions generally leads to substantially faster minimax rates. For instance, in the finitely supported case $|\mathcal{X}| < \infty$ and with respect to total variation, whilst the expectation risk evolves roughly as $\sqrt{|\mathcal{X}|/n}$, it is known that this rate can be sharpened by replacing $|\mathcal{X}|$ with a bound on the "half-norm"$^1$ $\|P\|_{1/2} := \big(\sum_{x \in \mathcal{X}} \sqrt{P(x)}\big)^2$ [Berend and Kontorovich, 2013], when
defined, and which corresponds to some measure of entropy of the underlying distribution. Furthermore, even
without prior knowledge of $\|P\|_{1/2}$, one can construct confidence intervals around $P$ that depend on $\|\widehat{P}_n\|_{1/2}$, the half-norm of the empirical distribution $\widehat{P}_n(x) = \frac{1}{n}\sum_{t=1}^{n} \delta[X_t = x]$ [Cohen et al., 2020, Theorem 2.1]. Namely, when $P$ is supported on $\mathbb{N}$, with probability at least $1 - \delta$, it holds that
$$\big\|\widehat{P}_n - P\big\|_{\mathrm{TV}} \leq \frac{1}{\sqrt{n}}\left(\sqrt{\big\|\widehat{P}_n\big\|_{1/2}} + 3\sqrt{\tfrac{1}{2}\log(2/\delta)}\right). \tag{1}$$
The advantage of the above expression is twofold: (a) we do not need to make assumptions on $\|P\|_{1/2}$, which could be non-convergent$^2$, and (b) in favorable cases where $\|\widehat{P}_n\|_{1/2}$ is small, the intervals will be narrower. In this paper, we set out to explore the question of whether analogues of (1) are possible for general probability spaces and with respect to the maximum mean discrepancy.
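For concreteness, the right-hand side of (1) is straightforward to evaluate from data. The following minimal Python sketch (ours; the function names are illustrative and are not taken from Cohen et al. [2020]) computes the fully data-driven radius for a discrete sample.

```python
import numpy as np

def empirical_half_norm(sample):
    """Half-norm of the empirical distribution: (sum_x sqrt(Phat_n(x)))^2."""
    _, counts = np.unique(sample, return_counts=True)
    p_hat = counts / counts.sum()
    return np.sum(np.sqrt(p_hat)) ** 2

def tv_confidence_radius(sample, delta=0.05):
    """Right-hand side of (1): a data-driven radius for ||Phat_n - P||_TV."""
    n = len(sample)
    return (np.sqrt(empirical_half_norm(sample))
            + 3.0 * np.sqrt(0.5 * np.log(2.0 / delta))) / np.sqrt(n)

# Example: a geometric distribution on N has a small half-norm, so the
# data-driven radius shrinks at the n^{-1/2} rate with a small leading constant.
rng = np.random.default_rng(0)
sample = rng.geometric(p=0.3, size=10_000)
print(tv_confidence_radius(sample, delta=0.05))
```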
Notation and background. The set of integers up to $n \in \mathbb{N} := \{1, 2, \dots\}$ is denoted by $[n]$. Let $\mathcal{X}$ be a separable topological space, and $\mathcal{P}(\mathcal{X})$ the set of all Borel probability measures over $\mathcal{X}$. For a bounded function $f \colon \mathcal{X} \to \mathbb{R}$, we define for convenience
$$\overline{f} := \sup_{x \in \mathcal{X}} f(x), \qquad \underline{f} := \inf_{x \in \mathcal{X}} f(x), \qquad \Delta f := \overline{f} - \underline{f}. \tag{2}$$
Let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous positive definite kernel and $\mathcal{H}_k$ be the associated reproducing kernel Hilbert space (RKHS) [Berlinet and Thomas-Agnan, 2011]. We assume that the kernel is bounded$^3$ in the sense that $\sup_{x \in \mathcal{X}} k(x, x) < \infty$. Letting $P \in \mathcal{P}(\mathcal{X})$, the kernel mean embedding (KME) of $P$ is defined as
$$\mu_P := \mathbb{E}_{X \sim P}[k(X, \cdot)] = \int_{\mathcal{X}} k(x, \cdot)\, \mathrm{d}P(x) \in \mathcal{H}_k,$$
$^1$The half-norm is not a norm in the proper sense as it violates the triangle inequality.
$^2$As opposed to the empirical proxy $\|\widehat{P}_n\|_{1/2}$, which is always a finite quantity.
$^3$Note that the supremum of the kernel is always reached on the diagonal, that is, $\sup_{(x, x') \in \mathcal{X}^2} k(x, x') = \sup_{x \in \mathcal{X}} k(x, x)$.
which is interpreted as a Bochner integral [Diestel and Uhl, 1977, Chapter 2]. Let $n \in \mathbb{N}$ and $X_1, \dots, X_n$ be a sequence of observations sampled independently$^4$ from $P$. We write
$$\widehat{\mu}_P(X_1, \dots, X_n) = \widehat{\mu}_P := \frac{1}{n} \sum_{t=1}^{n} k(X_t, \cdot) \in \mathcal{H}_k, \tag{3}$$
for the KME of the empirical distribution $\widehat{P}_n := \frac{1}{n} \sum_{t=1}^{n} \delta_{X_t}$, and call the distance $\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}$ the maximum mean discrepancy (MMD) between the true mean embedding and its empirical estimator. A kernel is called characteristic if the map $\mu \colon P \mapsto \mu_P$ is injective. This property ensures that $\|\mu_P - \mu_{P'}\|_{\mathcal{H}_k} = 0$ if and only if $P = P'$. A kernel is called translation invariant$^5$ (TI) when there exists a positive definite function $\psi \colon \mathcal{X} \to \mathbb{R}$ such that for any $x, x' \in \mathcal{X}$,
$$k(x, x') = \psi(x - x'). \tag{4}$$
In particular, $\overline{\psi} = \psi(0) = \overline{k}$. When $\mathcal{X} = \mathbb{R}^d$, a kernel $k$ is said to be a radial basis function (RBF) when for any $x, x' \in \mathcal{X}$, $k(x, x') = \varphi(\|x - x'\|_2)$ for some function $\varphi \colon \mathbb{R}_+ \to \mathbb{R}$. Noticeably, $k$ being positive definite does not preclude it from taking negative values. However, when $k$ is RBF, the following lower bound on $\varphi$ holds (see e.g. Genton [2001], Stein [1999]),
$$\underline{\varphi} \geq \overline{\varphi}\, \inf_{t \geq 0} \left\{ \left(\frac{2}{t}\right)^{(d-2)/2} \Gamma(d/2)\, J_{(d-2)/2}(t) \right\},$$
where $\Gamma$ is the Gamma function and $J_\beta$ is the Bessel function of the first kind of order $\beta$, showing that $|\underline{\varphi}|$ becomes evanescent as the dimension increases.
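To fix ideas, the empirical embedding (3) and RKHS distances between embeddings are directly computable through the kernel. The sketch below (ours; the Gaussian RBF choice and helper names are for illustration only) evaluates $\widehat{\mu}_P$ pointwise and the squared RKHS distance between the embeddings of two samples via the kernel trick.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix, k(x, y) = exp(-||x - y||^2 / (2 bandwidth^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def empirical_kme(X, bandwidth=1.0):
    """The function x -> mu_hat_P(x) = (1/n) sum_t k(X_t, x), cf. (3)."""
    return lambda x: gaussian_kernel(np.atleast_2d(x), X, bandwidth).mean(axis=1)

def squared_rkhs_distance(X, Y, bandwidth=1.0):
    """||mu_hat_P - mu_hat_Q||^2_{H_k} via the kernel trick (biased V-statistic)."""
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # sample from a shifted distribution Q
mu_hat = empirical_kme(X)
print(mu_hat(np.zeros(2)))          # evaluation of the empirical KME at a point
print(squared_rkhs_distance(X, Y))  # estimates MMD^2(P, Q) > 0: P != Q and the Gaussian kernel is characteristic
```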
Related work. From an asymptotic standpoint, the weak law of large numbers asserts that $\widehat{\mu}_P$ converges to the true $\mu_P$. Furthermore, $\sqrt{n}(\widehat{\mu}_P - \mu_P)$ converges in distribution to a zero mean Gaussian process on $\mathcal{H}_k$ [Berlinet and Thomas-Agnan, 2011, Section 9.1]. This work, however, is more concerned with the finite sample theory, and more specifically with the rate of convergence of $\widehat{\mu}_P$ towards $\mu_P$ with respect to the RKHS norm. Conveniently and perhaps surprisingly, it is possible to derive a rate that depends neither on the smoothness of the considered kernel $k$, nor on the properties of the true distribution. To obtain a distribution independent rate in $\mathcal{O}_P(n^{-1/2})$, a typical strategy (see e.g. Lopez-Paz et al. [2015, Section B.1]) consists in first expressing the dual relationship between the norm in the RKHS and the uniform norm of an empirical process,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} = \sup_{\substack{f \in \mathcal{H}_k \\ \|f\|_{\mathcal{H}_k} \leq 1}} \langle f, \widehat{\mu}_P - \mu_P \rangle_{\mathcal{H}_k} = \sup_{\substack{f \in \mathcal{H}_k \\ \|f\|_{\mathcal{H}_k} \leq 1}} \left( \frac{1}{n} \sum_{t=1}^{n} f(X_t) - \mathbb{E} f \right). \tag{5}$$
A classical symmetrization argument [Mohri et al., 2018] followed by an application of McDiarmid's inequality [McDiarmid, 1989] yields that with probability at least $1 - \delta$,
$$\sup_{\substack{f \in \mathcal{H}_k \\ \|f\|_{\mathcal{H}_k} \leq 1}} \left( \frac{1}{n} \sum_{t=1}^{n} f(X_t) - \mathbb{E} f \right) \leq 2 \mathcal{R}_n + \sqrt{\frac{2 \overline{k} \log(1/\delta)}{n}}, \tag{6}$$
where $\mathcal{R}_n$ is the Rademacher complexity [Mohri et al., 2018, Definition 3.1] of the class of unit-norm functions in the RKHS. The bound $\mathcal{R}_n^2 \leq \overline{k}/n$ [Bartlett and Mendelson, 2002] and an application of Jensen's inequality conclude the claim. With a more careful analysis, Tolstikhin et al. [2017, Proposition A.1] halved the constant of the first term in (6), hence showed that with probability at least $1 - \delta$,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{\overline{k}}{n}} + \sqrt{\frac{2 \overline{k} \log(1/\delta)}{n}}. \tag{7}$$
What is more, Tolstikhin et al. [2017, Theorems 1, 6, 8] provide corresponding lower bounds in $\Omega(n^{-1/2})$, showing that the embedding of the empirical measure achieves an unimprovable rate of convergence. Results similar to (7) (with worse constants) can be derived when the observations are not independent, see Chérief-Abdellatif and Alquier [2022, Lemma 7.2]. We note that other estimators have been proposed in the literature for the case where more information is available about the underlying distribution $P$. See for instance Muandet et al. [2014, 2016], who propose a shrinkage estimator inspired by the James-Stein estimator. In this work, we stress that our estimator will be the KME of the empirical distribution, defined in (3).
$^4$Unless otherwise specified, the data will be considered iid. We will address the case of dependent data in Section 5.
$^5$Also sometimes referred to as an anisotropic stationary kernel.
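As a small numerical companion (our own sketch, not code from the cited works), the distribution-free radii in (6), with $\mathcal{R}_n$ bounded by $\sqrt{\overline{k}/n}$, and in (7) can be evaluated from $n$, $\delta$ and the kernel bound alone; below $\overline{k} = 1$, as for a Gaussian kernel.

```python
import numpy as np

def radius_rademacher(n, delta, k_sup=1.0):
    """Radius from (6), using the bound R_n <= sqrt(k_sup / n)."""
    return 2.0 * np.sqrt(k_sup / n) + np.sqrt(2.0 * k_sup * np.log(1.0 / delta) / n)

def radius_tolstikhin(n, delta, k_sup=1.0):
    """Refined distribution-free radius (7) of Tolstikhin et al. [2017]."""
    return np.sqrt(k_sup / n) + np.sqrt(2.0 * k_sup * np.log(1.0 / delta) / n)

for n in (100, 1_000, 10_000):
    print(n, radius_rademacher(n, 0.05), radius_tolstikhin(n, 0.05))
```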
Main contributions. In this paragraph, we briefly summarize our findings. The tilde notation suppresses constants and logarithmic factors in $1/\delta$, and the reader is referred to Theorem 2.1 and Theorem 3.1 for the full expression of the second order terms. Our first contribution (Theorem 2.1) consists in deriving a bound on $\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}$ that can leverage additional information about the underlying distribution $P$ and the selected kernel $k$. Namely, we show that with probability at least $1 - \delta$,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{2 v_k(P) \log(2/\delta)}{n}} + \widetilde{\mathcal{O}}\left(\frac{1}{n}\right),$$
where $v_k(P)$—defined in (8)—corresponds to some notion of variance in the RKHS. Notably, it holds that $v_k(P) \leq \overline{k}$, hence our upper bound is superior to and able to recover known bounds in most settings (Remark 2.1).
Our chief technical contribution (Theorem 3.1) is in establishing that, at least when $k$ is translation invariant, the above confidence interval can be made independent of the underlying distribution by replacing the variance with an empirical proxy. Specifically, with probability at least $1 - \delta$,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{2 \widehat{v}_k(X_1, \dots, X_n) \log(4/\delta)}{n}} + \widetilde{\mathcal{O}}\left(\frac{1}{n}\right),$$
where $\widehat{v}_k$—defined in (12), (13)—is a quantity that only depends on the chosen kernel and the sample. We also extend the latter result to kernels without the translation invariance property in Theorem A.1.
Expanding beyond the iid setting, we obtain convergence rates for stationary mixing sequences. For ϕ-mixing processes, Theorem 5.1 establishes the following rate of convergence,
$$\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{v + \Sigma_n}{n}} + 4\sqrt{\frac{2 v \|\Gamma\|_2 \log(1/\delta)}{n}} + \widetilde{\mathcal{O}}\left(\frac{1}{n}\right),$$
where $\Sigma_n$ is a total measure of covariance between observations in the RKHS (refer to Lemma 5.1) and $\|\Gamma\|_2$ is the spectral norm of a coupling matrix $\Gamma$ defined in Eq. (17). We additionally analyze the broader class of β-mixing processes in Theorem 5.2.
Outline. In Section 2, for the problem of estimating a distribution with respect to maximum mean discrepancy, we give a convergence rate with a dominant term in $\mathcal{O}_P(n^{-1/2})$ involving a variance term $v$ which depends on both the chosen kernel $k$ and the underlying distribution $P$. We then illustrate how the rate can fall below minimax lower bounds when $v$ is favorable. In particular, we show that for large collections of kernels, and when $\mathcal{X} \subseteq \mathbb{R}^d$, the variance $v$ can be controlled by a quantity that decouples the influence of the kernel and a measure of total variance of $P$. In Section 3, we proceed to show that even if $v$ is unknown, it is possible to efficiently estimate it from the data with an "empirical Bernstein" approach whenever the kernel is translation invariant, or at least enjoys a mildly varying diagonal. In Section 4, we illustrate the benefits of the variance-aware bounds on the classical two-sample test problem. In Section 5, we establish convergence rates for stationary ϕ- and β-mixing sequences. In Section 6, we put our methods into practice, first in the context of hypothesis testing, and second by improving the results of Briol et al. [2019] and Chérief-Abdellatif and Alquier [2022] in the context of robust parametric maximum mean discrepancy estimation. All proofs are deferred to Section 7 for clarity of the exposition.
2 Variance-aware convergence rates
The central quantity we propose to consider is the following variance term in the RKHS,
$$v_k(P) := \mathbb{E}_{X \sim P} \left\| k(X, \cdot) - \mu_P \right\|_{\mathcal{H}_k}^2. \tag{8}$$
It is clear that $v_k(P)$ depends both on the choice of kernel $k$ and on the underlying distribution, and we will simply write $v = v_k(P)$ to avoid encumbering notation. Simple calculations (refer to Lemma 5.1) show that
$$\mathbb{E}\, \|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}^2 = \frac{v}{n},$$
where the expectation is taken with respect to an independent sample $X_1, \dots, X_n \sim P$, thus by applications of Jensen's and Chebyshev's inequalities, we readily obtain a rate on the expected risk in terms of $v$,
$$\mathbb{E}\, \|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} \leq \sqrt{\frac{v}{n}},$$
and a deviation bound
$$\mathbb{P}\left( \|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k} > (1 + \tau)\sqrt{\frac{v}{n}} \right) \leq \frac{1}{(1 + \tau)^2}, \qquad \tau \in (0, \infty).$$
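For the reader's convenience, the computation behind the identity above can be sketched in one line (our own expansion under the iid assumption; the general statement, which also covers dependent data, is Lemma 5.1):
$$\mathbb{E}\,\|\widehat{\mu}_P - \mu_P\|_{\mathcal{H}_k}^2 = \frac{1}{n^2} \sum_{s, t = 1}^{n} \mathbb{E}\,\big\langle k(X_s, \cdot) - \mu_P,\; k(X_t, \cdot) - \mu_P \big\rangle_{\mathcal{H}_k} = \frac{1}{n^2} \sum_{t = 1}^{n} \mathbb{E}\,\big\| k(X_t, \cdot) - \mu_P \big\|_{\mathcal{H}_k}^2 = \frac{v}{n},$$
since the cross terms with $s \neq t$ vanish by independence and the fact that $\mathbb{E}[k(X_t, \cdot)] = \mu_P$.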
One basic feature of RKHSs is that new kernels can be constructed by linearly combining existing kernels—see for instance Gretton [2013]. It is therefore noteworthy that $v$ is linear with respect to the reproducing kernel. Indeed, rewriting
$$v = \mathbb{E}_{X \sim P}[k(X, X)] - \mathbb{E}_{X, X' \sim P}\left[k(X, X')\right],$$
and supposing that $k = \sum_{k_i \in \mathcal{K}} \alpha_i k_i$, where $\mathcal{K} = \{k_1, \dots, k_{|\mathcal{K}|}\}$ is a collection of reproducing kernels and $\alpha \in \mathbb{R}^{|\mathcal{K}|}$, it then holds that
$$v_k = \sum_{k_i \in \mathcal{K}} \alpha_i v_{k_i}.$$
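A simple plug-in estimate of $v$ (ours, and distinct from the empirical proxy $\widehat{v}_k$ constructed in Section 3) follows directly from the reformulation above; the snippet also checks the linearity numerically for a two-kernel mixture.

```python
import numpy as np

def gaussian_kernel(bandwidth):
    def k(X, Y):
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
        return np.exp(-sq / (2.0 * bandwidth**2))
    return k

def plug_in_variance(X, kernel):
    """Naive plug-in estimate of v = E[k(X, X)] - E[k(X, X')] (V-statistic form)."""
    K = kernel(X, X)
    return np.mean(np.diag(K)) - np.mean(K)

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 3))

k1, k2 = gaussian_kernel(0.5), gaussian_kernel(2.0)
alpha1, alpha2 = 0.3, 0.7
k_mix = lambda A, B: alpha1 * k1(A, B) + alpha2 * k2(A, B)

# Linearity: v_{a1 k1 + a2 k2} = a1 v_{k1} + a2 v_{k2}, up to floating point error.
print(plug_in_variance(X, k_mix))
print(alpha1 * plug_in_variance(X, k1) + alpha2 * plug_in_variance(X, k2))
```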
Moving on to high-probability confidence bounds, an application of Bernstein's inequality in Hilbert spaces [Pinelis and Sakhanenko, 1985, Yurinsky, 1995] not only recovers the rate of convergence, as alluded to by Tolstikhin et al. [2017], but in fact yields a maximal inequality.

Theorem 2.1 (Variance-aware confidence interval). Let $X_1, \dots, X_n \overset{\mathrm{iid}}{\sim} P$, let $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel, and let
$$v := \mathbb{E}_{X \sim P} \left\| k(X, \cdot) - \mu_P \right\|_{\mathcal{H}_k}^2.$$
With probability at least $1 - \delta$, it holds that
$$\max_{1 \leq t \leq n} \left\{ t\, \|\widehat{\mu}_P(X_1, \dots, X_t) - \mu_P\|_{\mathcal{H}_k} \right\} \leq \sqrt{2 v n \log(2/\delta)} + \frac{4}{3} \sqrt{\overline{k}} \log(2/\delta),$$
where $\overline{k} = \sup_{x \in \mathcal{X}} k(x, x)$. In particular, with probability at least $1 - \delta$, it holds that
$$\|\widehat{\mu}_P(X_1, \dots, X_n) - \mu_P\|_{\mathcal{H}_k} \leq \mathcal{B}_{k, \delta}(P, n),$$
with
$$\mathcal{B}_{k, \delta}(P, n) = \mathcal{B}_\delta := \sqrt{\frac{2 v \log(2/\delta)}{n}} + \frac{4}{3} \cdot \frac{\sqrt{\overline{k}} \log(2/\delta)}{n}.$$
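The radius $\mathcal{B}_{k, \delta}(P, n)$ is immediate to compute once $v$ is available; the short sketch below (ours, taking $v$ as a known input rather than estimating it) compares it with the distribution-free radius (7).

```python
import numpy as np

def variance_aware_radius(n, delta, v, k_sup=1.0):
    """B_{k, delta}(P, n) from Theorem 2.1."""
    return (np.sqrt(2.0 * v * np.log(2.0 / delta) / n)
            + (4.0 / 3.0) * np.sqrt(k_sup) * np.log(2.0 / delta) / n)

def distribution_free_radius(n, delta, k_sup=1.0):
    """Radius (7), which uses no information about P."""
    return np.sqrt(k_sup / n) + np.sqrt(2.0 * k_sup * np.log(1.0 / delta) / n)

n, delta, k_sup = 1_000, 0.05, 1.0
for v in (1.0, 0.1, 0.01):   # v <= k_sup always holds; a small v gives a much tighter radius
    print(v, variance_aware_radius(n, delta, v, k_sup), distribution_free_radius(n, delta, k_sup))
```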
Remark 2.1. Since the reproducing kernel is bounded, it is always the case that
$$v \leq \mathbb{E}_{X \sim P} \left\| k(X, \cdot) \right\|_{\mathcal{H}_k}^2 = \mathbb{E}_{X \sim P}[k(X, X)] \leq \overline{k},$$
where the first inequality can be found e.g. in [Chérief-Abdellatif and Alquier, 2022, Lemma 7.1, Proof]. As a result, Theorem 2.1 strictly supersedes (7) at least when
$$n > \frac{16}{9} \left( \frac{\log(2/\delta)}{1 + \sqrt{2 \log(1/\delta)} - \sqrt{2 \log(2/\delta)}} \right)^{2}.$$
For instance, for $\delta = 0.05$, it suffices that $n \geq 46$.
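A quick numerical check of this threshold (based on our reading of the comparison, in which the worst case $v = \overline{k}$ is substituted into $\mathcal{B}_\delta$ and the result is compared with (7)):

```python
import numpy as np

def supersession_threshold(delta):
    """Sample size beyond which the worst-case (v = k_sup) radius of
    Theorem 2.1 is no larger than the distribution-free radius (7)."""
    num = np.log(2.0 / delta)
    den = 1.0 + np.sqrt(2.0 * np.log(1.0 / delta)) - np.sqrt(2.0 * np.log(2.0 / delta))
    return (16.0 / 9.0) * (num / den) ** 2

print(supersession_threshold(0.05))   # ~45.2, hence n >= 46 suffices for delta = 0.05
```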
Remark 2.2. The Bernstein approach, in contrast to bounded differences, has the additional advantage of yielding a maximal
inequality, further opening the door for early stopping methods.
Remark 2.3 (Sharpness of the constants). We note but do not pursue that there exist other families of concentration inequalities which are known to dominate Bennett–Bernstein-type inequalities. For instance, the class of inequalities pioneered by Bentkus et al. [2006], with empirical counterparts derived by Kuchibhotla and Zheng [2021], which is known to be nearly optimal for sample averages of independent observations from a log-concave probability distribution. We also mention the family of inequalities obtained by Waudby-Smith and Ramdas [2024] based on martingale methods. However, our problem requires a bound for norms of sums of vectors in Hilbert spaces, or alternatively for the supremum of empirical processes, which we could not locate in the aforementioned references. Additionally, even in the simpler case of sample averages, computation of the bound requires effort, thus a concentration bound in the more complicated Hilbert space setting could be challenging to compute. Finally, Bernstein-type bounds have known extensions to the case of time-dependent data, enabling us to analyze the estimation problem from stationary mixing sequences of observations (refer to Section 5).
We immediately observe that:
(O1) While the bound in (7) depends solely on chosen quantities and is computable without any knowledge of $P$, this is not the case for $\mathcal{B}_\delta$ in Theorem 2.1.
(O2) Perhaps even more concerning, it is a priori unclear how $v$ depends on properties of $k$ and $P$, thus making it difficult to convert assumptions on $P$ into a bound for $v$.
We defer (O1) to Section 3 and first address (O2) by pointing out that when $\mathcal{X} = \mathbb{R}^d$, and for numerous hyper-parametrized families of kernels, it is possible to promptly obtain upper bounds on $v$ that decouple the influence of the hyper-parameter and some measure of spread of the underlying distribution.