Gaussian Processes on Distributions based on Regularized Optimal Transport

François Bachoc^{1,2}, Louis Béthune^{1,3}, Alberto Gonzalez-Sanz^{1,2}, Jean-Michel Loubes^{1,2}

^1 Université Paul Sabatier
^2 Institut de Mathématiques de Toulouse
^3 Institut de Recherche en Informatique de Toulouse
Abstract

We present a novel kernel over the space of probability measures, based on the dual formulation of regularized optimal transport. We propose a Hilbertian embedding of the space of probabilities using their Sinkhorn potentials, which are the solutions of the dual entropic relaxed optimal transport problem between the probabilities and a reference measure U. We prove that this construction yields a valid kernel, by using the Hilbert norm. We prove that the kernel enjoys theoretical properties such as universality and some invariances, while remaining computationally feasible. Moreover, we provide theoretical guarantees on the behaviour of a Gaussian process based on this kernel. Its empirical performance is compared with other traditional choices of kernels for processes indexed by distributions.
1 Introduction
Context: Gaussian processes and kernels indexed by distributions.
Gaussian process (GP) models are widely used in fields such as geostatistics, computer code experiments and machine learning. We refer to [Rasmussen and Williams, 2006] for general references. They consist in modeling an unknown function as a realization of a GP, and hence correspond to a functional Bayesian framework. For instance, in computer experiments, the input points of the function are simulation parameters and the output values are quantities of interest obtained from the simulations. GPs rely on the definition of a covariance function that characterises the correlations between values of the process at different observation points.
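To fix ideas (this example is not part of the paper), recall the standard GP regression formulas of [Rasmussen and Williams, 2006]: given noisy observations $y$ at inputs $X$, the posterior mean at new inputs is $k(X_*, X)(K + \sigma^2 I)^{-1} y$. Below is a minimal NumPy sketch; the squared-exponential kernel, the lengthscale and the noise level are purely illustrative choices.

```python
import numpy as np

def sq_exp_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential covariance k(x, y) = exp(-||x - y||^2 / (2 l^2)).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def gp_posterior(X, y, Xnew, noise=1e-2):
    # Standard GP regression: mean = k(X_*, X) (K + s^2 I)^{-1} y and
    # var  = k(x_*, x_*) - k(X_*, X) (K + s^2 I)^{-1} k(X, X_*).
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    Ks = sq_exp_kernel(X, Xnew)                   # cross-covariances, shape (n, m)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))                     # 20 training inputs in R^2
y = np.sin(X.sum(axis=1))                         # toy responses
mean, var = gp_posterior(X, y, rng.uniform(size=(5, 2)))
```

In this paper the inputs $x_i$ will be probability distributions rather than vectors, so the whole question is how to define $k$ on such inputs.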
In this paper we consider GPs indexed by distributions. Learning functions defined on distributions has gained special interest over the last decade in the machine learning literature, see for instance [Póczos et al., 2013]. Distribution-valued inputs are commonly used to describe complex objects such as images, shapes or media, as described for instance in [Glaunes et al., 2004], [Muandet et al., 2012], [Ginsbourger et al., 2016] or [Szabó et al., 2016]. The construction of a kernel for these inputs requires a notion of similarity between probability distributions. Many methods have been considered to provide kernels for distributions, from the mere extraction of parametric features, such as
the mean or higher moments, to the widely used Maximum Mean Discrepancy (MMD) method [Gretton et al., 2012].
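As a concrete illustration of the MMD route, the (biased) empirical estimator of [Gretton et al., 2012] compares two samples through mean kernel evaluations. A short sketch follows, where the Gaussian base kernel and its bandwidth `sigma` are illustrative choices, not prescribed by this paper:

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    # Biased estimator: mean k(X, X) - 2 mean k(X, Y) + mean k(Y, Y).
    return (gaussian_gram(X, X, sigma).mean()
            - 2.0 * gaussian_gram(X, Y, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
Y = rng.standard_normal((100, 2)) + 1.0           # shifted sample
print(mmd2(X, Y))                                 # near 0 only for equal distributions
```

One common construction then defines a kernel between distributions as, e.g., $\exp(-\mathrm{MMD}^2/(2h^2))$ for some bandwidth $h$.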
Context: optimal transport for kernels. On this topic, optimal trans-
port (OT) has imposed itself has a prominent method for comparing or analyzing
distributions. Previous works in this direction are, in one dimension, [Bachoc
et al., 2017] and [Thi Thien Trang et al., 2021], where a kernel directly based on
the quadratic difference between the quantiles, which yields (on the real line)
the quadratic Wasserstein distance, is proposed. In several dimensions, a quite
natural generalisation uses the quadratic norm between the multidimensional
transport maps between the probabilities and a reference measure, see [Bachoc
et al., 2020] and [Moosm¨uller and Cloninger, 2020]. Even though, with a good
choice of the reference measure, the generated kernels are translation invari-
ant (see [Hallin et al., 2021,del Barrio et al., 2020,del Barrio et al., 2022b]),
for machine learning purposes the computation of the transport map (between
continuous measures) is rather complicated and depends highly on the dimen-
sion, see [Peyr´e et al., 2019]. Moreover, even if the transport maps exist almost
surely (see [McCann, 1995]) for a suitable choice of the reference distribution,
their continuity—and therefore their approximations from empirical settings–
require at least upper-lower bounded densities and the convex supports of the
target measures, see [Figalli, 2017]. Regarding GPs based on OT, the high com-
plexity of the transport problem makes it so that their continuity properties are
not much studied especially in multi-dimension. As a simplification, [Kolouri
et al., 2018] proposed the use of a slice-Wasserstein kernel. The idea is to reduce
the problem by projecting on the different directions generated by a (uniform)
discretization of the unit sphere Sd1, and then integrating w.r.t. the uniform
measure on Sd1. This avoids completely the curse of the dimension, but this
does not discriminate quite well between non convex domains. The correspond-
ing universality properties are studied in the recent work of [Meunier et al.,
2022].
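A hedged sketch of the sliced construction (the number of projections and the Monte Carlo discretization of $S^{d-1}$ are illustrative choices): project both samples on random directions, compute the one-dimensional quadratic Wasserstein distance through sorted projections, and average.

```python
import numpy as np

def sliced_w2(X, Y, n_proj=50, seed=0):
    # Monte Carlo approximation of the sliced Wasserstein-2 distance between
    # two empirical measures of equal size on R^d.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)            # uniform direction on S^{d-1}
        # On the real line, W2 between equal-size empirical measures reduces
        # to the quadratic distance between sorted (projected) samples.
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
Y = 0.5 * rng.standard_normal((200, 3))
print(sliced_w2(X, Y))
```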
Contributions. While the initial formulations of OT raise computational challenges, subsequent regularized versions have provided valuable trade-offs between richness and tractability. In this work, we provide a kernel based on regularized OT. We prove that the norm between the potentials derived from the entropic relaxation of the Wasserstein distance, see [Cuturi, 2013], provides a natural embedding for the distributions and can be used to construct a valid kernel. Much work on the properties of the potentials has been carried out, in particular [Mena and Niles-Weed, 2019, del Barrio et al., 2022a, Gonzalez-Sanz et al., 2022], but few results exist that take advantage of the natural embedding the potentials provide. 1) Our first contribution is to propose a novel valid and universal kernel based on Sinkhorn's dual potentials, obtained by considering the regularized transport towards a reference measure. 2) We then provide statistical guarantees for this kernel by studying the properties of its empirical counterpart, as well as invariance properties w.r.t. the choice of the reference. 3) We study the theoretical properties of the corresponding GP, in particular the existence of a continuous version. Feasible computations through Sinkhorn's algorithm make it possible to study the prediction performance of the kernel. 4) We provide publicly available code, together with simulations and real datasets on which our kernel competes favorably with state-of-the-art methods: it yields very similar accuracy while providing a computational speed-up of up to a factor 100.
Outline. Section 2 is devoted to providing some definitions and notations related to Sinkhorn's transport methods. In Section 3 we define and study the kernel based on the potentials, while Section 4 studies the GP with this kernel as covariance operator. Implementation and experiments are discussed in Section 5. The proofs and complementary content are postponed to the Appendix.
2 Definitions and basic properties of the Sinkhorn distance
2.1 General definitions and notations
We let $\mathcal{P}(A)$ be the set of probability measures on a general set $A \subset \mathbb{R}^d$. When $A$ is compact and for $s > 0$, we let $\mathcal{C}^s(A)$ be the space of functions $f : A \to \mathbb{R}$ that are $\lfloor s \rfloor$ times differentiable, with $\lfloor \cdot \rfloor$ the integer part, and with $\|f\|_{\mathcal{C}^s(A)} < \infty$, where
$$\|f\|_{\mathcal{C}^s(A)} := \sum_{i=0}^{\lfloor s \rfloor} \sum_{|\alpha| = i} \|D^{\alpha} f\|_{\infty}. \qquad (1)$$
Above, $\alpha = (\alpha_1, \ldots, \alpha_d) \in \mathbb{N}^d$ with $\sum_{j=1}^{d} \alpha_j = i$ and $D^{\alpha} = \partial^{i} / \partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}$. The space $\mathcal{C}^s(A)$ is endowed with the norm $\|\cdot\|_{\mathcal{C}^s(A)}$. A probability $\mathrm{P} \in \mathcal{P}(A)$ also belongs to the topological dual space of $\mathcal{C}^s(A)$. A distance between two measures $\mathrm{P}, \mathrm{Q} \in \mathcal{P}(A)$ can be defined as
$$\|\mathrm{P} - \mathrm{Q}\|_{s} := \sup_{f \in \mathcal{C}^s(A),\ \|f\|_{\mathcal{C}^s(A)} \le 1} \int f(x) \, (d\mathrm{P}(x) - d\mathrm{Q}(x)). \qquad (2)$$
We let $\ell_d$ be the $d$-dimensional Lebesgue measure. For $p > 0$ and for $\mathrm{P} \in \mathcal{P}(A)$, we let $L^p(\mathrm{P})$ be the set of functions $f : A \to \mathbb{R}$ such that $\|f\|_{L^p(\mathrm{P})}^{p} := \int_A |f(x)|^p \, d\mathrm{P}(x) < \infty$.
We use the abbreviations "a.s." for "almost surely" and "a.e." for "almost everywhere". For a probability measure $\mathrm{P}$ on $A$, we let $\mathrm{supp}(\mathrm{P})$ be its topological support (the smallest closed set with $\mathrm{P}$-probability one). For two sets $A$ and $B$, for a probability measure $\mathrm{P}$ on $A$, and for $T : A \to B$, we let $T\sharp\mathrm{P}$ be the probability measure of $T(X)$, where $X$ is a random vector with law $\mathrm{P}$. For two probability distributions $\mathrm{P}$ and $\mathrm{Q}$ on $A$, we write $\mathrm{P} \ll \mathrm{Q}$ when $\mathrm{P}$ is absolutely continuous w.r.t. $\mathrm{Q}$, and in this case we write $d\mathrm{P}/d\mathrm{Q}$ for the density of $\mathrm{P}$ w.r.t. $\mathrm{Q}$. A random vector $V$ on $A \subset \mathbb{R}^d$ is said to be sub-Gaussian if there is $\sigma^2 < \infty$ such that $\mathbb{E}(\exp(s\, u^{\top} V)) \le \exp(\sigma^2 s^2 / 2)$ for any $u \in \mathbb{R}^d$ with $\|u\| = 1$ and any $s \in \mathbb{R}$. We let $\mathcal{P}_{SG}(A)$ be the set of sub-Gaussian probability measures on $A$. For $x \in A$ we let $\delta_x$ be the Dirac probability measure at $x$.
For a set $E$, a function $k : E \times E \to \mathbb{R}$ is said to be positive definite when, for any $x_1, \ldots, x_n \in E$ and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$, $\sum_{i,j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) \ge 0$. The function is said to be strictly positive definite if, in addition, the sum is strictly positive whenever $x_1, \ldots, x_n$ are pairwise distinct and not all of $\alpha_1, \ldots, \alpha_n$ are zero.
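Positive definiteness can be probed numerically: the Gram matrix $(k(x_i, x_j))_{i,j}$ must have non-negative eigenvalues for every choice of points. A small sanity-check sketch (the Gaussian kernel and the point cloud are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 2))
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
gram = np.exp(-d2)                                # Gaussian-kernel Gram matrix
eigs = np.linalg.eigvalsh(gram)
print(eigs.min())                                 # >= 0 up to numerical error
```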
For $A \subset \mathbb{R}^d$ we let $\mathrm{diam}(A) = \sup\{\|x - y\| ;\ x, y \in A\}$. For $t \in \mathbb{R}$, we let $\lceil t \rceil$ be the smallest integer larger than or equal to $t$. For two column vectors $x, y$, we let $\langle x, y \rangle = x^{\top} y$ be their scalar product.
2.2 Regularized optimal transport
We consider an input space $\Omega \subset \mathbb{R}^d$ that is fixed throughout the paper. For some of the results of the paper, $\Omega$ will be assumed to be compact, while for others we can make the weaker assumption of considering sub-Gaussian measures on $\Omega$ (which then need not be bounded). Let $\mathrm{P}, \mathrm{Q}$ be probabilities on $\Omega$ and let $\Pi(\mathrm{P}, \mathrm{Q})$ be the set of probability measures $\pi \in \mathcal{P}(\Omega \times \Omega)$ with marginals $\mathrm{P}$ and $\mathrm{Q}$, i.e. for all measurable sets $A, B$,
$$\pi(A \times \Omega) = \mathrm{P}(A), \qquad \pi(\Omega \times B) = \mathrm{Q}(B). \qquad (3)$$
The OT problem amounts to solving the optimization problem (see [Kantorovich, 1942])
$$\mathcal{T}_c(\mathrm{P}, \mathrm{Q}) := \min_{\pi \in \Pi(\mathrm{P}, \mathrm{Q})} \int c(x, y) \, d\pi(x, y), \qquad (4)$$
with a continuous cost $c : \Omega \times \Omega \to [0, \infty)$. It is well known (see e.g. [Villani, 2003]) that $W_p(\mathrm{P}, \mathrm{Q}) := \mathcal{T}_{\|\cdot\|^p}(\mathrm{P}, \mathrm{Q})^{1/p}$ (the value of (4) for the particular cost $(x, y) \mapsto \|x - y\|^p$, with $p \ge 1$) defines a distance on the space of probabilities with finite moments of order $p$. This distance is called the Wasserstein distance.
In this paper we will consider the quadratic cost $c(x, y) = \|x - y\|^2$. In this setting, when at least one distribution $\mathrm{P}$ is absolutely continuous w.r.t. the Lebesgue measure, there exists a $\mathrm{P}$-a.e. unique map $T : \Omega \to \Omega$ such that $T\sharp\mathrm{P} = \mathrm{Q}$ and $W_2(\mathrm{P}, \mathrm{Q})^2 = \int \|T(x) - x\|^2 \, d\mathrm{P}(x)$. Moreover, there exists a lower semi-continuous convex function $\varphi$ such that $T = \nabla\varphi$ $\mathrm{P}$-a.e., with $\nabla$ the gradient operator, and $T$ is the only map of this type pushing $\mathrm{P}$ forward to $\mathrm{Q}$, up to a $\mathrm{P}$-negligible modification. This result is commonly referred to as Brenier's theorem [Brenier, 1991]. Note that a similar statement was established earlier and independently, in a probabilistic framework, in [Cuesta and Matrán, 1989].
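In dimension one, the Brenier map has the closed form $T = F_{\mathrm{Q}}^{-1} \circ F_{\mathrm{P}}$, the monotone rearrangement of $\mathrm{P}$ onto $\mathrm{Q}$. A small sketch checking this on a Gaussian example, where the specific laws $\mathrm{P} = N(0,1)$ and $\mathrm{Q} = N(2,4)$ are illustrative (for these laws the map is affine, $T(x) = 2 + 2x$):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)               # sample from P = N(0, 1)
# Monotone rearrangement onto Q = N(2, 4): T = F_Q^{-1} o F_P, here T(x) = 2 + 2x.
T = norm.ppf(norm.cdf(x), loc=2.0, scale=2.0)
print(np.allclose(T, 2.0 + 2.0 * x))              # True: T is the (affine) Brenier map
print(np.mean((T - x) ** 2))                      # approximates W2(P, Q)^2 = 5
```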
Brenier's theorem makes it possible to define a natural Hilbertian embedding of the distributions in $\mathcal{P}(\Omega)$, by considering the distance between the transport maps towards a common reference distribution. This framework has been used in [Bachoc et al., 2020] to provide kernels on distributions. Yet such kernels have the drawback of being difficult to compute, preventing their use for large or high-dimensional data sets.
Indeed, computing the OT problem (4) turns out to be computationally difficult. In the discrete case, different algorithms have been proposed, such as the Hungarian algorithm [Kuhn, 1955], the simplex algorithm [Luenberger et al., 1984], or other versions using interior-point algorithms [Orlin, 1988]. The complexity of these methods is at worst of order $O(n^3 \log(n))$ for two discrete distributions of equal size $n$. Hence [Bachoc et al., 2020], and many statistical methods based on OT, suffer from this drawback.
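To make this concrete, between two empirical measures the problem (4) is a linear program over the transport plan $\pi$. A hedged sketch using scipy's generic LP solver, with illustrative sample sizes and uniform weights (dedicated OT solvers, such as those in the toolboxes cited below, scale far better):

```python
import numpy as np
from scipy.optimize import linprog

n, m = 8, 8
rng = np.random.default_rng(0)
xs, xt = rng.normal(size=(n, 2)), rng.normal(size=(m, 2))
a, b = np.full(n, 1 / n), np.full(m, 1 / m)           # uniform marginals
C = ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(-1)  # quadratic cost matrix

# Equality constraints encoding the two marginals of pi (flattened row-major).
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0                  # row sums equal a
for j in range(m):
    A_eq[n + j, j::m] = 1.0                           # column sums equal b
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
print(res.fun)                                        # T_c(P, Q) for the empirical measures
```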
To overcome this issue, regularization methods have been proposed that approximate the OT problem by adding a penalty. The seminal paper [Cuturi, 2013] describes the Sinkhorn algorithm, which regularizes OT using an entropy penalty.
The relative entropy between two probability measures $\alpha, \beta$ on $\Omega$ is defined as
$$H(\alpha \,|\, \beta) = \int_{\Omega} \log\!\left(\frac{d\alpha}{d\beta}(x)\right) d\alpha(x)$$
if $\alpha \ll \beta$ and $|\log(d\alpha/d\beta)| \in L^1(\beta)$, and $+\infty$ otherwise. Set $\epsilon > 0$. Then the entropy regularized version of the OT problem is defined as
$$S_{\epsilon}(\mathrm{P}, \mathrm{Q}) := \min_{\pi \in \Pi(\mathrm{P}, \mathrm{Q})} \int_{\Omega \times \Omega} \frac{1}{2}\|x - y\|^2 \, d\pi(x, y) + \epsilon\, H(\pi \,|\, \mathrm{P} \times \mathrm{Q}), \qquad (5)$$
with $\mathrm{P} \times \mathrm{Q}$ the product measure. The entropy term $H$ modifies the linear term of classical OT (the quadratic transportation cost) to produce a strictly convex functional. The parameter $\epsilon$ balances the trade-off between the classical OT problem ($\epsilon = 0$) and the influence of the regularizing penalty.
The minimization of (5) is achieved using the Sinkhorn algorithm; we refer to [Peyré et al., 2019] and references therein for more details. The introduction of the Sinkhorn divergence yields an $\epsilon$-approximation of the OT distance which can be computed, as pointed out in [Altschuler et al., 2017], with an algorithmic complexity of order $O(n^2/\epsilon^3)$, hence much faster than the original OT problem. Several toolboxes have been developed to compute regularized OT, among others [Flamary and Courty, 2017] for Python and [Klatt et al., 2017] for R.
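A minimal NumPy sketch of the Sinkhorn matrix-scaling iterations for the discrete version of (5): with the relative-entropy reference $\mathrm{P} \times \mathrm{Q}$, the optimal plan has the form $\pi_{ij} = a_i b_j u_i K_{ij} v_j$ with $K = \exp(-C/\epsilon)$. The sample sizes, $\epsilon$ and iteration count are illustrative, and a log-domain implementation is preferable for small $\epsilon$:

```python
import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    # Seeks the optimal plan of the discrete version of (5) in the form
    # pi_ij = a_i b_j u_i K_ij v_j with K = exp(-C/eps) (reference P x Q).
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = 1.0 / (K @ (b * v))                   # enforce the first marginal
        v = 1.0 / (K.T @ (a * u))                 # enforce the second marginal
    pi = (a * u)[:, None] * K * (b * v)[None, :]
    return pi, u, v

rng = np.random.default_rng(0)
xs, xt = rng.standard_normal((30, 2)), rng.standard_normal((30, 2))
a, b = np.full(30, 1 / 30), np.full(30, 1 / 30)
C = 0.5 * ((xs[:, None, :] - xt[None, :, :]) ** 2).sum(-1)   # cost of (5)
pi, u, v = sinkhorn(a, b, C, eps=0.1)
print(np.abs(pi.sum(axis=1) - a).max())           # ~0: marginal constraints hold
```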
Contrary to (unregularized) OT, Sinkhorn OT does not provide transport maps, which would in turn provide a Hilbertian embedding. Hence, we consider the dual formulation of (5), pointed out in [Genevay, 2019]:
$$S_{\epsilon}(\mathrm{P}, \mathrm{Q}) = \sup_{f \in L^1(\mathrm{P}),\ g \in L^1(\mathrm{Q})} \int_{\Omega} f(x) \, d\mathrm{P}(x) + \int_{\Omega} g(y) \, d\mathrm{Q}(y) - \epsilon \int_{\Omega \times \Omega} e^{\frac{1}{\epsilon}\left(f(x) + g(y) - \frac{1}{2}\|x - y\|^2\right)} \, d\mathrm{P}(x) \, d\mathrm{Q}(y) + \epsilon. \qquad (6)$$
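The link with the Sinkhorn iterations is direct: with the parametrization of the previous sketch, $f = \epsilon \log u$ and $g = \epsilon \log v$ are the discrete dual potentials, since $u_i K_{ij} v_j = \exp((f_i + g_j - C_{ij})/\epsilon)$ matches the exponent in (6). A hedged continuation of the previous sketch (it assumes `a`, `b`, `C`, `u`, `v` and `eps = 0.1` from that block), evaluating the discrete dual objective:

```python
import numpy as np

# Continuation of the sinkhorn() sketch above: a, b, C, u, v as there, eps = 0.1.
eps = 0.1
f, g = eps * np.log(u), eps * np.log(v)           # discrete Sinkhorn potentials

# Discrete dual objective (6):
# <a, f> + <b, g> - eps * sum_ij a_i b_j exp((f_i + g_j - C_ij)/eps) + eps.
penalty = np.sum(a[:, None] * b[None, :]
                 * np.exp((f[:, None] + g[None, :] - C) / eps))
dual = a @ f + b @ g - eps * penalty + eps
print(dual)                                       # approximates S_eps(P, Q) at convergence
```

It is these potentials, computed against a fixed reference measure, that the kernel of Section 3 embeds into a Hilbert space.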