
7 Proofs 19
7.1 Proof of Theorem 2.1 ............................................... 19
7.2 Proof of Lemma 3.3 ................................................ 21
7.3 Proof of Lemma 3.4 ................................................ 21
7.4 Proof of Theorem 3.1 ............................................... 22
7.5 Proof of Lemma 5.1 ................................................ 23
7.6 Proof of Theorem 5.1 ............................................... 23
7.7 Proof of Theorem 6.1 ............................................... 24
7.8 Proof of Theorem 6.2 ............................................... 25
7.9 Proof of Lemma 6.1 ................................................ 26
A Extension to non-translation-invariant kernels 28
B Additional proofs 29
B.1 Proof of Lemma 3.1 ................................................ 29
B.2 Proof of Lemma 3.2 ................................................ 30
B.3 Proof of Theorem 5.2 ............................................... 30
B.4 Proof of Lemma A.1 ................................................ 31
B.5 Proof of Lemma A.2 ................................................ 32
B.6 Proof of Theorem A.1 ............................................... 33
B.7 Proof of Lemma B.1 ................................................ 33
C Tools from the literature 34
1 Introduction
Estimating a probability distribution $P$ over a space $\mathcal{X}$ from $n$ independent samples $X_1, \ldots, X_n \sim P$ is a central
problem in computer science and statistics [Kearns et al., 1994, Tsybakov, 2008]. To formalize the question, one
selects a distance (or at least a contrast function) between distributions and oftentimes introduces assumptions on
the underlying probability space (e.g. finitely supported, absolutely continuous with respect to the Lebesgue measure,
Hölder continuous, . . . ). Increasing the stringency of assumptions generally leads to
substantially faster minimax rates. For instance, in the finite-support case $|\mathcal{X}| < \infty$ and with respect to total
variation, whilst the expected risk scales roughly as $\sqrt{|\mathcal{X}|/n}$, it is known that this rate can be sharpened by
replacing $|\mathcal{X}|$ with a bound on the ``half-norm''$^1$ $\|P\|_{1/2} \doteq \big(\sum_{x \in \mathcal{X}} \sqrt{P(x)}\big)^2$ [Berend and Kontorovich, 2013], when
defined, which corresponds to a measure of entropy of the underlying distribution. Furthermore, even
without prior knowledge of $\|P\|_{1/2}$, one can construct confidence intervals around $P$ that depend on $\|\widehat{P}_n\|_{1/2}$,
the half-norm of the empirical distribution $\widehat{P}_n(x) = \frac{1}{n}\sum_{t=1}^{n} \delta[X_t = x]$ [Cohen et al., 2020, Theorem 2.1]. Namely,
when $P$ is supported on $\mathbb{N}$, with probability at least $1 - \delta$, it holds that
\[
\big\|\widehat{P}_n - P\big\|_{\mathrm{TV}} \;\le\; \frac{1}{\sqrt{n}}\left(\|\widehat{P}_n\|_{1/2}^{1/2} + 3\sqrt{\tfrac{1}{2}\log(2/\delta)}\right). \tag{1}
\]
The advantage of the above expression is twofold: (a) we do not need to make assumptions on $\|P\|_{1/2}$, which could
be non-convergent$^2$, and (b) in favorable cases where $\|\widehat{P}_n\|_{1/2}$ is small, the intervals will be narrower. In this paper,
we set out to explore whether analogues of (1) are possible for general probability spaces and with
respect to the maximum mean discrepancy.
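To make the use of (1) concrete, the following Python sketch (ours, for illustration only; the function name tv_confidence_radius and the geometric example are not taken from the cited works) computes the empirical half-norm and the resulting total-variation confidence radius from a sample supported on the positive integers.
\begin{verbatim}
import numpy as np
from collections import Counter

def tv_confidence_radius(samples, delta=0.05):
    """Right-hand side of Eq. (1): TV radius around the empirical distribution."""
    n = len(samples)
    counts = Counter(samples)
    # Empirical distribution \hat{P}_n over the observed support.
    p_hat = np.array([c / n for c in counts.values()])
    # Empirical half-norm ||\hat{P}_n||_{1/2} = (sum_x sqrt(\hat{P}_n(x)))^2, always finite.
    half_norm = np.sqrt(p_hat).sum() ** 2
    return (np.sqrt(half_norm) + 3.0 * np.sqrt(0.5 * np.log(2.0 / delta))) / np.sqrt(n)

# Illustration on samples from a geometric distribution supported on {1, 2, ...}.
rng = np.random.default_rng(0)
samples = rng.geometric(p=0.3, size=10_000)
print(tv_confidence_radius(samples, delta=0.05))
\end{verbatim}
Note that the radius is driven by the empirical half-norm only, so no assumption on $\|P\|_{1/2}$ enters the computation.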
Notation and background. The set of integers up to $n \in \mathbb{N} \doteq \{1, 2, \ldots\}$ is denoted by $[n]$. Let $\mathcal{X}$ be a separable
topological space, and $\mathcal{P}(\mathcal{X})$ the set of all Borel probability measures over $\mathcal{X}$. For a bounded function $f : \mathcal{X} \to \mathbb{R}$,
we define for convenience
\[
\overline{f} \doteq \sup_{x \in \mathcal{X}} f(x), \qquad \underline{f} \doteq \inf_{x \in \mathcal{X}} f(x), \qquad \Delta_f \doteq \overline{f} - \underline{f}. \tag{2}
\]
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous positive definite kernel and $\mathcal{H}_k$ be the associated reproducing kernel Hilbert
space (RKHS) [Berlinet and Thomas-Agnan, 2011]. We assume that the kernel is bounded$^3$ in the sense that
$\sup_{x \in \mathcal{X}} k(x, x) < \infty$. Letting $P \in \mathcal{P}(\mathcal{X})$, the kernel mean embedding (KME) of $P$ is defined as
\[
\mu_P \doteq \mathbb{E}_{X \sim P}\big[k(X, \cdot)\big] = \int_{\mathcal{X}} k(x, \cdot) \, \mathrm{d}P(x) \in \mathcal{H}_k,
\]
$^1$ The half-norm is not a norm in the proper sense as it violates the triangle inequality.
$^2$ As opposed to the empirical proxy $\|\widehat{P}_n\|_{1/2}$, which is always a finite quantity.
$^3$ Note that the supremum of the kernel is always reached on the diagonal, that is, $\sup_{(x,x') \in \mathcal{X}^2} k(x, x') = \sup_{x \in \mathcal{X}} k(x, x)$.