Gaussian Processes on Distributions based on Regularized Optimal Transport

François Bachoc^{1,2}, Louis Béthune^{1,3}, Alberto Gonzalez-Sanz^{1,2}, Jean-Michel Loubes^{1,2}

^1 Université Paul Sabatier
^2 Institut de Mathématiques de Toulouse
^3 Institut de Recherche en Informatique de Toulouse
Abstract

We present a novel kernel over the space of probability measures, based on the dual formulation of regularized optimal transport. We propose a Hilbertian embedding of the space of probabilities using their Sinkhorn potentials, which are the solutions of the dual entropic relaxed optimal transport problem between the probabilities and a reference measure U. We prove that this construction yields a valid kernel, by using the Hilbert norm. We prove that the kernel enjoys theoretical properties such as universality and some invariances, while remaining computationally feasible. Moreover, we provide theoretical guarantees on the behaviour of a Gaussian process based on this kernel. Its empirical performance is compared with other traditional choices of kernels for processes indexed by distributions.
1 Introduction
Context: Gaussian processes and kernels indexed by distributions.
Gaussian process (GP) models are widely used in fields such as geostatistics, computer code experiments and machine learning. We refer to [Rasmussen and Williams, 2006] for general references. They consist in modeling an unknown function as a realization of a GP, and hence correspond to a functional Bayesian framework. For instance, in computer experiments, the input points of the function are simulation parameters and the output values are quantities of interest obtained from the simulations. GPs rely on the definition of a covariance function that characterises the correlations between values of the process at different observation points.
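To fix ideas (this example is not part of the paper), recall the standard GP regression formulas of [Rasmussen and Williams, 2006]: given noisy observations $y$ at inputs $X$, the posterior mean at new inputs is $k(X_*, X)(K + \sigma^2 I)^{-1} y$. Below is a minimal NumPy sketch; the squared-exponential kernel, the lengthscale and the noise level are purely illustrative choices.

```python
import numpy as np

def sq_exp_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential covariance k(x, y) = exp(-||x - y||^2 / (2 l^2)).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def gp_posterior(X, y, Xnew, noise=1e-2):
    # Standard GP regression: mean = k(X_*, X) (K + s^2 I)^{-1} y and
    # var  = k(x_*, x_*) - k(X_*, X) (K + s^2 I)^{-1} k(X, X_*).
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    Ks = sq_exp_kernel(X, Xnew)                   # cross-covariances, shape (n, m)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))                     # 20 training inputs in R^2
y = np.sin(X.sum(axis=1))                         # toy responses
mean, var = gp_posterior(X, y, rng.uniform(size=(5, 2)))
```

In this paper the inputs $x_i$ will be probability distributions rather than vectors, so the whole question is how to define $k$ on such inputs.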
In this paper we consider GPs indexed by distributions. Learning functions defined on distributions has gained special interest over the last decade in the machine learning literature, see for instance [Póczos et al., 2013]. Distribution-valued inputs are commonly used to describe complex objects such as images, shapes or media, as described for instance in [Glaunes et al., 2004], [Muandet et al., 2012], [Ginsbourger et al., 2016] or [Szabó et al., 2016]. The construction of a kernel for these inputs requires a notion of similarity between probability distributions. Many methods have been considered to provide kernels for distributions, from the mere extraction of parametric features, such as
the mean or higher moments, to the widely used Maximum Mean Discrepancy (MMD) method [Gretton et al., 2012].
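As a concrete illustration of the MMD route, the (biased) empirical estimator of [Gretton et al., 2012] compares two samples through mean kernel evaluations. A short sketch follows, where the Gaussian base kernel and its bandwidth `sigma` are illustrative choices, not prescribed by this paper:

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimator: mean k(X, X) - 2 mean k(X, Y) + mean k(Y, Y).
    return (gaussian_gram(X, X, sigma).mean()
            - 2.0 * gaussian_gram(X, Y, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
Y = rng.standard_normal((100, 2)) + 1.0           # shifted sample
print(mmd2(X, Y))                                 # near 0 only for equal distributions
```

One common construction then defines a kernel between distributions as, e.g., $\exp(-\mathrm{MMD}^2/(2h^2))$ for some bandwidth $h$.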
Context: optimal transport for kernels. On this topic, optimal trans-
port (OT) has imposed itself has a prominent method for comparing or analyzing
distributions. Previous works in this direction are, in one dimension, [Bachoc
et al., 2017] and [Thi Thien Trang et al., 2021], where a kernel directly based on
the quadratic difference between the quantiles, which yields (on the real line)
the quadratic Wasserstein distance, is proposed. In several dimensions, a quite
natural generalisation uses the quadratic norm between the multidimensional
transport maps between the probabilities and a reference measure, see [Bachoc
et al., 2020] and [Moosm¨uller and Cloninger, 2020]. Even though, with a good
choice of the reference measure, the generated kernels are translation invari-
ant (see [Hallin et al., 2021,del Barrio et al., 2020,del Barrio et al., 2022b]),
for machine learning purposes the computation of the transport map (between
continuous measures) is rather complicated and depends highly on the dimen-
sion, see [Peyr´e et al., 2019]. Moreover, even if the transport maps exist almost
surely (see [McCann, 1995]) for a suitable choice of the reference distribution,
their continuity—and therefore their approximations from empirical settings–
require at least upper-lower bounded densities and the convex supports of the
target measures, see [Figalli, 2017]. Regarding GPs based on OT, the high com-
plexity of the transport problem makes it so that their continuity properties are
not much studied especially in multi-dimension. As a simplification, [Kolouri
et al., 2018] proposed the use of a slice-Wasserstein kernel. The idea is to reduce
the problem by projecting on the different directions generated by a (uniform)
discretization of the unit sphere Sd1, and then integrating w.r.t. the uniform
measure on Sd1. This avoids completely the curse of the dimension, but this
does not discriminate quite well between non convex domains. The correspond-
ing universality properties are studied in the recent work of [Meunier et al.,
2022].
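A hedged sketch of the sliced construction (the number of projections and the Monte Carlo discretization of $S^{d-1}$ are illustrative choices): project both samples on random directions, compute the one-dimensional quadratic Wasserstein distance through sorted projections, and average.

```python
import numpy as np

def sliced_w2(X, Y, n_proj=50, seed=0):
    # Monte Carlo approximation of the sliced Wasserstein-2 distance between
    # two empirical measures of equal size on R^d.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)            # uniform direction on S^{d-1}
        # On the real line, W2 between equal-size empirical measures reduces
        # to the quadratic distance between sorted (projected) samples.
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
Y = 0.5 * rng.standard_normal((200, 3))
print(sliced_w2(X, Y))
```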
Contributions. While the initial formulations of OT raise computational challenges, subsequent regularized versions have provided valuable trade-offs between richness and tractability. In this work, we provide a kernel based on regularized OT. We prove that the norm between the potentials derived from the entropic relaxation of the Wasserstein distance, see [Cuturi, 2013], provides a natural embedding for the distributions and can be used to construct a valid kernel. Much work on the properties of the potentials has been carried out, in particular [Mena and Niles-Weed, 2019, del Barrio et al., 2022a, Gonzalez-Sanz et al., 2022], but few results exist that take advantage of the natural embedding the potentials provide. 1) Our first contribution is to propose a novel valid and universal kernel based on Sinkhorn's dual potentials, obtained by considering the regularized transport towards a reference measure. 2) We then provide statistical guarantees for this kernel by studying the properties of its empirical counterpart, as well as invariance properties w.r.t. the choice of the reference. 3) We study the theoretical properties of the corresponding GP, in particular the existence of a continuous version. Feasible computations through Sinkhorn's algorithm make it possible to study the prediction performance of the kernel. 4) We provide publicly available code, together with simulations and real datasets on which our kernel competes favorably with state-of-the-art methods: it yields very similar accuracy while providing a computational speed-up of up to a factor 100.
Outline. Section 2 is devoted to providing some definitions and notations related to Sinkhorn's transport methods. In Section 3 we define and study the kernel based on the potentials, while Section 4 studies the GP with this kernel as covariance operator. Implementation and experiments are discussed in Section 5. The proofs and complementary content are postponed to the Appendix.
2 Definitions and basic properties of the Sinkhorn distance
2.1 General definitions and notations
We let $\mathcal{P}(A)$ be the set of probability measures on a general set $A \subset \mathbb{R}^d$. When $A$ is compact and for $s > 0$, we let $\mathcal{C}^s(A)$ be the space of functions $f : A \to \mathbb{R}$ that are $\lfloor s \rfloor$ times differentiable, with $\lfloor \cdot \rfloor$ the integer part, and with $\|f\|_{\mathcal{C}^s(A)} < \infty$, where
$$\|f\|_{\mathcal{C}^s(A)} := \sum_{i=0}^{\lfloor s \rfloor} \sum_{|\alpha| = i} \|D^{\alpha} f\|_{\infty}. \qquad (1)$$
Above, $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$ with $\sum_{j=1}^{d} \alpha_j = i$ and $D^{\alpha} = \partial^{i} / \partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}$. The space $\mathcal{C}^s(A)$ is endowed with the norm $\|\cdot\|_{\mathcal{C}^s(A)}$. A probability $\mathrm{P} \in \mathcal{P}(A)$ also belongs to the topological dual space of $\mathcal{C}^s(A)$. A distance between two measures $\mathrm{P}, \mathrm{Q} \in \mathcal{P}(A)$ can be defined as
$$\|\mathrm{P} - \mathrm{Q}\|_{s} := \sup_{f \in \mathcal{C}^s(A),\ \|f\|_{\mathcal{C}^s(A)} \le 1} \int f(x) \, (d\mathrm{P}(x) - d\mathrm{Q}(x)). \qquad (2)$$
We let $\ell_d$ be the $d$-dimensional Lebesgue measure. For $p > 0$ and for $\mathrm{P} \in \mathcal{P}(A)$, we let $L^p(\mathrm{P})$ be the set of functions $f : A \to \mathbb{R}$ such that $\|f\|_{L^p(\mathrm{P})}^{p} := \int_A |f(x)|^p \, d\mathrm{P}(x) < \infty$.
We use the abbreviations "a.s." for "almost surely" and "a.e." for "almost everywhere". For a probability measure $\mathrm{P}$ on $A$, we let $\mathrm{supp}(\mathrm{P})$ be its topological support (the smallest closed set with $\mathrm{P}$-probability one). For two sets $A$ and $B$, for a probability measure $\mathrm{P}$ on $A$, and for $T : A \to B$, we let $T\sharp\mathrm{P}$ be the probability measure of $T(X)$, where $X$ is a random vector with law $\mathrm{P}$. For two probability distributions $\mathrm{P}$ and $\mathrm{Q}$ on $A$, we write $\mathrm{P} \ll \mathrm{Q}$ when $\mathrm{P}$ is absolutely continuous w.r.t. $\mathrm{Q}$, and in this case we write $d\mathrm{P}/d\mathrm{Q}$ for the density of $\mathrm{P}$ w.r.t. $\mathrm{Q}$. A random vector $V$ on $A \subset \mathbb{R}^d$ is said to be sub-Gaussian if there is $\sigma^2 < \infty$ such that $\mathbb{E}(\exp(s\, u^{\top} V)) \le \exp(\sigma^2 s^2 / 2)$ for any $u \in \mathbb{R}^d$ with $\|u\| = 1$ and any $s \in \mathbb{R}$. We let $\mathcal{P}_{SG}(A)$ be the set of sub-Gaussian probability measures on $A$. For $x \in A$ we let $\delta_x$ be the Dirac probability measure at $x$.
For a set $E$, a function $k : E \times E \to \mathbb{R}$ is said to be positive definite when, for any $x_1, \ldots, x_n \in E$ and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$, $\sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \ge 0$. The function is said to be strictly positive definite if, in addition, the sum is strictly positive whenever $x_1, \ldots, x_n$ are pairwise distinct and not all of $\alpha_1, \ldots, \alpha_n$ are zero.
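Positive definiteness can be probed numerically: the Gram matrix $(k(x_i, x_j))_{i,j}$ must have non-negative eigenvalues for every choice of points. A small sanity-check sketch (the Gaussian kernel and the point cloud are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 2))
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
gram = np.exp(-d2)                                # Gaussian-kernel Gram matrix
eigs = np.linalg.eigvalsh(gram)
print(eigs.min())                                 # >= 0 up to numerical error
```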
For $A \subset \mathbb{R}^d$ we let $\mathrm{diam}(A) = \sup\{\|x - y\| ;\ x, y \in A\}$. For $t \in \mathbb{R}$, we let $\lceil t \rceil$ be the smallest integer larger than or equal to $t$. For two column vectors $x, y$, we let $\langle x, y \rangle = x^{\top} y$ be their scalar product.
2.2 Regularized optimal transport
We consider an input space $\Omega \subset \mathbb{R}^d$ that is fixed throughout the paper. For some of the results of the paper, $\Omega$ will be assumed to be compact, while for others we can make the weaker assumption of considering sub-Gaussian measures on $\Omega$ (which then need not be bounded). Let $\mathrm{P}, \mathrm{Q}$ be probabilities on $\Omega$ and let $\Pi(\mathrm{P}, \mathrm{Q})$ be the set of probability measures $\pi \in \mathcal{P}(\Omega \times \Omega)$ with marginals $\mathrm{P}$ and $\mathrm{Q}$, i.e. for all measurable sets $A, B$,
$$\pi(A \times \Omega) = \mathrm{P}(A), \qquad \pi(\Omega \times B) = \mathrm{Q}(B). \qquad (3)$$
The OT problem amounts to solving the optimization problem (see [Kantorovich, 1942])
$$\mathcal{T}_c(\mathrm{P}, \mathrm{Q}) := \min_{\pi \in \Pi(\mathrm{P}, \mathrm{Q})} \int c(x, y) \, d\pi(x, y), \qquad (4)$$
with a continuous cost $c : \Omega \times \Omega \to [0, \infty)$. It is well known (see e.g. [Villani, 2003]) that $W_p(\mathrm{P}, \mathrm{Q}) := \mathcal{T}_{\|\cdot\|^p}(\mathrm{P}, \mathrm{Q})^{1/p}$ (the value of (4) for the particular cost $(x, y) \mapsto \|x - y\|^p$, with $p \ge 1$) defines a distance on the space of probabilities with finite moments of order $p$. This distance is called the Wasserstein distance.
In this paper we will consider the quadratic cost $c(x, y) = \|x - y\|^2$. In this setting, when at least one distribution $\mathrm{P}$ is absolutely continuous w.r.t. the Lebesgue measure, there exists a $\mathrm{P}$-a.e. unique map $T : \Omega \to \Omega$ such that $T\sharp\mathrm{P} = \mathrm{Q}$ and $W_2(\mathrm{P}, \mathrm{Q})^2 = \int \|T(x) - x\|^2 \, d\mathrm{P}(x)$. Moreover, there exists a lower semi-continuous convex function $\varphi$ such that $T = \nabla\varphi$ $\mathrm{P}$-a.e., with $\nabla$ the gradient operator, and $T$ is the only map of this type pushing $\mathrm{P}$ forward to $\mathrm{Q}$, up to a $\mathrm{P}$-negligible modification. This result is commonly referred to as Brenier's theorem [Brenier, 1991]. Note that a similar statement was established earlier and independently, in a probabilistic framework, in [Cuesta and Matrán, 1989].
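In dimension one, the Brenier map has the closed form $T = F_{\mathrm{Q}}^{-1} \circ F_{\mathrm{P}}$, the monotone rearrangement of $\mathrm{P}$ onto $\mathrm{Q}$. A small sketch checking this on a Gaussian example, where the specific laws $\mathrm{P} = N(0,1)$ and $\mathrm{Q} = N(2,4)$ are illustrative (for these laws the map is affine, $T(x) = 2 + 2x$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)               # sample from P = N(0, 1)
# Monotone rearrangement onto Q = N(2, 4): T = F_Q^{-1} o F_P, here T(x) = 2 + 2x.
T = norm.ppf(norm.cdf(x), loc=2.0, scale=2.0)
print(np.allclose(T, 2.0 + 2.0 * x))              # True: T is the (affine) Brenier map
print(np.mean((T - x) ** 2))                      # approximates W2(P, Q)^2 = 5
```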
Brenier's theorem makes it possible to define a natural Hilbertian embedding of the distributions in $\mathcal{P}(\Omega)$, by considering the distance between the transport maps towards a common reference distribution. This framework has been used in [Bachoc et al., 2020] to provide kernels on distributions. Yet such kernels have the drawback of being difficult to compute, preventing their use for large or high-dimensional data sets.
Indeed, computing the OT problem (4) turns out to be computationally difficult. In the discrete case, different algorithms have been proposed, such as the Hungarian algorithm [Kuhn, 1955], the simplex algorithm [Luenberger et al., 1984], or other versions using interior-point algorithms [Orlin, 1988]. The complexity of these methods is at worst of order $O(n^3 \log(n))$ for two discrete distributions of equal size $n$. Hence [Bachoc et al., 2020], and many statistical methods based on OT, suffer from this drawback.
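To make this concrete, between two empirical measures the problem (4) is a linear program over the transport plan $\pi$. A hedged sketch using scipy's generic LP solver, with illustrative sample sizes and uniform weights (dedicated OT solvers, such as those in the toolboxes cited below, scale far better):

```python
import numpy as np
from scipy.optimize import linprog

n, m = 8, 8
rng = np.random.default_rng(0)
xs, xt = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
a, b = np.full(n, 1 / n), np.full(m, 1 / m)           # uniform marginals
C = ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(-1)  # quadratic cost matrix

# Equality constraints encoding the two marginals of pi (flattened row-major).
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0                  # row sums equal a
for j in range(m):
    A_eq[n + j, j::m] = 1.0                           # column sums equal b
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
print(res.fun)                                        # T_c(P, Q) for the empirical measures
```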
To overcome this issue, regularization methods have been proposed that approximate the OT problem by adding a penalty. The seminal paper [Cuturi, 2013] describes the Sinkhorn algorithm, which regularizes OT using an entropy penalty.
The relative entropy between two probability measures $\alpha, \beta$ on $\Omega$ is defined as
$$H(\alpha \,|\, \beta) = \int_{\Omega} \log\!\left(\frac{d\alpha}{d\beta}(x)\right) d\alpha(x)$$
if $\alpha \ll \beta$ and $|\log(d\alpha/d\beta)| \in L^1(\beta)$, and $+\infty$ otherwise. Set $\epsilon > 0$. Then the entropy regularized version of the OT problem is defined as
$$S_{\epsilon}(\mathrm{P}, \mathrm{Q}) := \min_{\pi \in \Pi(\mathrm{P}, \mathrm{Q})} \int_{\Omega \times \Omega} \frac{1}{2}\|x - y\|^2 \, d\pi(x, y) + \epsilon\, H(\pi \,|\, \mathrm{P} \times \mathrm{Q}), \qquad (5)$$
with $\mathrm{P} \times \mathrm{Q}$ the product measure. The entropy term $H$ modifies the linear term of classical OT (the quadratic transportation cost) to produce a strictly convex functional. The parameter $\epsilon$ balances the trade-off between the classical OT problem ($\epsilon = 0$) and the influence of the regularizing penalty.
The minimization of (5) is achieved using the Sinkhorn algorithm; we refer to [Peyré et al., 2019] and references therein for more details. The introduction of the Sinkhorn divergence yields an $\epsilon$-approximation of the OT distance which can be computed, as pointed out in [Altschuler et al., 2017], with an algorithmic complexity of order $O(n^2/\epsilon^3)$, hence much faster than the original OT problem. Several toolboxes have been developed to compute regularized OT, among others [Flamary and Courty, 2017] for Python and [Klatt et al., 2017] for R.
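A minimal NumPy sketch of the Sinkhorn matrix-scaling iterations for the discrete version of (5): with the relative-entropy reference $\mathrm{P} \times \mathrm{Q}$, the optimal plan has the form $\pi_{ij} = a_i b_j u_i K_{ij} v_j$ with $K = \exp(-C/\epsilon)$. The sample sizes, $\epsilon$ and iteration count are illustrative, and a log-domain implementation is preferable for small $\epsilon$:

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    # Seeks the optimal plan of the discrete version of (5) in the form
    # pi_ij = a_i b_j u_i K_ij v_j with K = exp(-C/eps) (reference P x Q).
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = 1.0 / (K @ (b * v))                   # enforce the first marginal
        v = 1.0 / (K.T @ (a * u))                 # enforce the second marginal
    pi = (a * u)[:, None] * K * (b * v)[None, :]
    return pi, u, v

rng = np.random.default_rng(0)
xs, xt = rng.standard_normal((30, 2)), rng.standard_normal((30, 2))
a, b = np.full(30, 1 / 30), np.full(30, 1 / 30)
C = 0.5 * ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(-1)   # cost of (5)
pi, u, v = sinkhorn(a, b, C, eps=0.1)
print(np.abs(pi.sum(axis=1) - a).max())           # ~0: marginal constraints hold
```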
Contrary to (unregularized) OT, Sinkhorn OT does not provide transport maps, which would in turn provide a Hilbertian embedding. Hence, we consider the dual formulation of (5), pointed out in [Genevay, 2019]:
$$S_{\epsilon}(\mathrm{P}, \mathrm{Q}) = \sup_{f \in L^1(\mathrm{P}),\ g \in L^1(\mathrm{Q})} \int_{\Omega} f(x) \, d\mathrm{P}(x) + \int_{\Omega} g(y) \, d\mathrm{Q}(y) - \epsilon \int_{\Omega \times \Omega} e^{\frac{1}{\epsilon}\left(f(x) + g(y) - \frac{1}{2}\|x - y\|^2\right)} \, d\mathrm{P}(x) \, d\mathrm{Q}(y) + \epsilon. \qquad (6)$$
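The link with the Sinkhorn iterations is direct: with the parametrization of the previous sketch, $f = \epsilon \log u$ and $g = \epsilon \log v$ are the discrete dual potentials, since $u_i K_{ij} v_j = \exp((f_i + g_j - C_{ij})/\epsilon)$ matches the exponent in (6). A hedged continuation of the previous sketch (it assumes `a`, `b`, `C`, `u`, `v` and `eps = 0.1` from that block), evaluating the discrete dual objective:

```python
import numpy as np

# Continuation of the sinkhorn() sketch above: a, b, C, u, v as there, eps = 0.1.
eps = 0.1
f, g = eps * np.log(u), eps * np.log(v)           # discrete Sinkhorn potentials

# Discrete dual objective (6):
# <a, f> + <b, g> - eps * sum_ij a_i b_j exp((f_i + g_j - C_ij)/eps) + eps.
penalty = np.sum(a[:, None] * b[None, :]
                 * np.exp((f[:, None] + g[None, :] - C) / eps))
dual = a @ f + b @ g - eps * penalty + eps
print(dual)                                       # approximates S_eps(P, Q) at convergence
```

It is these potentials, computed against a fixed reference measure, that the kernel of Section 3 embeds into a Hilbert space.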