InfoOT: Information Maximizing Optimal Transport
Ching-Yao Chuang^1  Stefanie Jegelka^1  David Alvarez-Melis^2
Abstract
Optimal transport aligns samples across distribu-
tions by minimizing the transportation cost be-
tween them, e.g., the geometric distances. Yet,
it ignores coherence structure in the data such as
clusters, does not handle outliers well, and can-
not integrate new data points. To address these
drawbacks, we propose InfoOT, an information-
theoretic extension of optimal transport that max-
imizes the mutual information between domains
while minimizing geometric distances. The re-
sulting objective can still be formulated as a (gen-
eralized) optimal transport problem, and can be
efficiently solved by projected gradient descent.
This formulation yields a new projection method
that is robust to outliers and generalizes to unseen
samples. Empirically, InfoOT improves the qual-
ity of alignments across benchmarks in domain
adaptation, cross-domain retrieval, and single-
cell alignment. The code is available at https://github.com/chingyaoc/InfoOT.
1. Introduction
Optimal Transport (OT) provides a general framework with
a strong theoretical foundation to compare probability dis-
tributions based on the geometry of their underlying spaces
(Villani,2009). Besides its fundamental role in mathematics,
OT has increasingly received attention in machine learning
due to its wide range of applications in domain adapta-
tion (Courty et al.,2017;Redko et al.,2019;Xu et al.,
2020), generative modeling (Arjovsky et al.,2017;Bous-
quet et al.,2017), representation learning (Ozair et al.,2019;
Chuang et al.,2022), and generalization bounds (Chuang
et al., 2021). The development of efficient algorithms (Cuturi, 2013; Peyré et al., 2016) has significantly accelerated the adoption of optimal transport in these applications.
^1 MIT CSAIL, Cambridge, MA, USA. ^2 Microsoft Research, Cambridge, MA, USA. Correspondence to: Ching-Yao Chuang <cychuang@mit.edu>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Computationally, the discrete formulation of OT seeks a
matrix, also called transportation plan, that minimizes the
total geometric transportation cost between two sets of sam-
ples drawn from the source and target distributions. The
transportation plan implicitly defines (soft) correspondences
across these samples, but provides no mechanism to relate
newly-drawn data points. Aligning these requires solving a
new OT problem from scratch. This limits the applicability
of OT, e.g., to streaming settings where the samples arrive in
sequence, or very large datasets where we can only solve OT
on a subset. In this case, the current solution cannot be used
on future data. To overcome this fundamental constraint,
a line of work proposes to directly estimate a mapping,
the pushforward from source to target, that minimizes the
transportation cost (Perrot et al.,2016;Seguy et al.,2017).
Nevertheless, the resulting mapping is highly dependent
on the complexity of the mapping function (Galanti et al.,
2021).
OT could also yield alignments that ignore the intrinsic
coherence structure of the data. In particular, by relying
exclusively on pairwise geometric distances, two nearby
source samples could be mapped to disparate target samples,
as in Figure 1, which is undesirable in some settings. For
instance, when applying OT for domain adaptation, source
samples with the same class should ideally be mapped to
similar target samples. To mitigate this, prior work has
sought to impose structural priors on the OT objective, e.g.,
via submodular cost functions (Alvarez-Melis et al.,2018)
or a Gromov-Wasserstein regularizer (Vayer et al.,2018b;a).
However, these methods still suffer from sensitivity to out-
liers (Mukherjee et al.,2021) and imbalanced data (Hsu
et al.,2015;Tan et al.,2020).
This work presents Information Maximization Optimal
Transport (InfoOT), an information-theoretic extension of
the optimal transport problem that generalizes the usual
formulation by infusing it with global structure in form
of mutual information. In particular, InfoOT seeks align-
ments that maximize mutual information, an information-
theoretic measure of dependence, between domains. To do
so, we treat the pairs selected by the transportation plan as
samples drawn from the joint distribution and estimate the
mutual information with kernel density estimation based
on the paired samples (Moon et al.,1995). Interestingly,
this results in an OT problem where the cost is the log ratio between the estimated joint and marginal densities, $\log\big(f_{XY}(x,y)/(f_X(x) f_Y(y))\big)$. Empirically, we show that using a cost combining mutual information with geometric distances yields better alignments across different applications.

arXiv:2210.03164v2 [cs.LG] 29 May 2023

Figure 1: Illustration of InfoOT on a 2D point cloud (panels: OT, InfoOT, InfoOT w/ Barycentric, InfoOT w/ Conditional). Compared to classic OT, InfoOT preserves the cluster structure: source points from the same cluster are mapped to the same target cluster. For projection estimation (dashed lines), the new conditional projection improves over the barycentric projection with better outlier robustness and out-of-sample generalization.
Moreover, akin to Gromov-Wasserstein (Mémoli, 2011), the mutual information estimator relies only on intra-domain distances, which, unlike the standard OT formulation, makes it suitable for aligning distributions whose supports lie in different metric spaces, e.g., supports with different modalities or dimensionality (Alvarez-Melis & Fusi, 2020; Demetci et al., 2020).
By estimating a joint density, InfoOT naturally yields a
novel method for out-of-sample transportation by taking an
expectation over the estimated densities conditioned on the
source samples, which we refer to as conditional projection.
Typically, samples are mapped via a barycentric projection
(Ferradans et al.,2014;Flamary et al.,2016), which corre-
sponds to the weighted average of target samples, where
the weights are determined by the transportation plan. The
barycentric projection inherits the disadvantages of standard
OT: sensitivity to outliers and failing to generalize to new
samples. In contrast, our proposed conditional projection is robust to outliers and cross-domain class-imbalanced data (Figures 1 and 3): it averages over samples with importance weights, where the weight is, again, the ratio between the estimated joint and marginal densities. Furthermore, this
projection is well-defined even for unseen samples, which
widens the applicability of OT in streaming or large-scale
settings where solving OT for the complete dataset is pro-
hibitive.
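The contrast between the two projections is easiest to see in code. Below is a minimal numpy sketch of the standard barycentric projection only (the conditional projection depends on the density estimates introduced later); the function name and toy data are our own illustration:

```python
import numpy as np

def barycentric_projection(gamma, Y):
    """Map each source point to the Gamma-weighted average of target samples.

    gamma: (n, m) transportation plan; Y: (m, d) target samples.
    """
    # Normalize each row so the weights over target samples sum to one.
    weights = gamma / gamma.sum(axis=1, keepdims=True)
    return weights @ Y  # (n, d) projected source samples

# Toy check: a hard one-to-one plan maps each source point
# exactly onto its matched target.
Y = np.array([[0.0, 0.0], [10.0, 10.0]])
gamma = np.array([[0.5, 0.0], [0.0, 0.5]])  # uniform marginals, n = m = 2
projected = barycentric_projection(gamma, Y)
```

For a soft plan, each projected point instead lands between the targets it is coupled to, which is exactly why outliers in the plan can drag the projection off course.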
In short, this work makes the following contributions:
• We propose InfoOT, an information-theoretic extension of optimal transport that regularizes alignments by maximizing mutual information;
• We develop conditional projection, a new projection method for OT that is robust to outliers and class imbalance in data, and generalizes to new samples;
• We evaluate our approach via experiments in domain adaptation, cross-domain retrieval, and single-cell alignment.
2. Related Work
Optimal Transport Optimal transport provides an ele-
gant framework to compare and align distributions. The
discrete formulation, also called Earth Mover’s Distance
(EMD), finds an optimal coupling between empirical sam-
ples by solving a linear programming problem (Bonneel
et al.,2011). To speed up the computation, Cuturi (2013)
propose the Sinkhorn distance, an entropic regularized ver-
sion of EMD that can be solved more efficiently via the
Sinkhorn-Knopp algorithm (Knight,2008). Compared to
EMD, this regularized formulation typically yields denser
transportation plans, where samples can be associated with
multiple target points. Various extensions of OT have been
proposed to impose stronger priors, e.g., Alvarez-Melis et al.
(2018) incorporate additional structure by leveraging a sub-
modular transportation cost, while Flamary et al. (2016)
induce class coherence through a group-sparsity regularizer.
The Gromov-Wasserstein (GW) distance (Mémoli, 2011) is
a variant of OT in which the transportation cost is defined
upon intra-domain pairwise distances. Therefore, GW has
been adopted to align ‘incomparable spaces’ (Alvarez-Melis
& Jaakkola,2018;Demetci et al.,2020) as the source and tar-
get domains do not need to lie in the same space. Since the
GW objective is no longer a linear program, it is typically
optimized using projected gradient descent (Peyré et al.,
2016;Solomon et al.,2016). The Fused-GW, which com-
bines the OT and GW objectives, was proposed by Vayer
et al. (2018a) to measure graph distances.
Mutual Information and OT The proposed InfoOT ex-
tends the standard OT formulation by maximizing a ker-
nel density estimated mutual information. Recent works
(Bai et al.,2020;Khan & Zhang,2022) also explore the
connection between OT and information theory. Liu et al.
(2021) consider a semi-supervised setting for estimating a
variant of mutual information, where the unpaired samples
are leveraged to minimize the estimation error. Ozair et al.
(2019) replace the KL divergence in mutual information
with Wasserstein distance and develop a loss function for
representation learning. In comparison, the objective of
InfoOT is to seek alignments that maximize the mutual in-
formation while being fully unsupervised by parameterizing
the joint densities with the transportation plan. Another line
of work also combines OT with kernel density estimation
(Canas & Rosasco,2012;Mokrov et al.,2021), but focuses
on different applications.
3. Background on OT and KDE
Optimal Transport  Let $\{x_i\}_{i=1}^n \subset \mathcal{X}^n$ and $\{y_j\}_{j=1}^m \subset \mathcal{Y}^m$ be the empirical samples and $C \in \mathbb{R}^{n \times m}$ be the transportation cost for each pair, e.g., the Euclidean cost $C_{ij} = \|x_i - y_j\|$. Given two sets of weights over samples $p \in \mathbb{R}_+^n$ and $q \in \mathbb{R}_+^m$ where $\sum_{i=1}^n p_i = \sum_{j=1}^m q_j = 1$, and a cost matrix $C$, Kantorovich's formulation of optimal transport solves
$$\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle, \quad \text{where} \quad \Pi(p,q) = \{\Gamma \in \mathbb{R}_+^{n \times m} \mid \Gamma 1_m = p,\ \Gamma^\top 1_n = q\}.$$
Here $\Pi(p,q)$ is the set of transportation plans that satisfy the flow constraints. In practice, the Sinkhorn distance (Cuturi, 2013), an entropic regularized version of OT, can be solved more efficiently via the Sinkhorn-Knopp algorithm. In particular, the Sinkhorn distance solves $\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle - \epsilon H(\Gamma)$, where $H(\Gamma) = -\sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$ is the entropic regularizer that smooths the transportation plan.
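The Sinkhorn-Knopp algorithm alternately rescales the rows and columns of a Gibbs kernel until both marginal constraints hold. A minimal numpy sketch (our own illustration, not the authors' code; the regularization strength and toy cost matrix are arbitrary choices):

```python
import numpy as np

def sinkhorn(C, p, q, eps=0.1, n_iter=500):
    """Sinkhorn-Knopp iterations for min <Gamma, C> - eps * H(Gamma)."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)                 # rescale to match column marginals
        u = p / (K @ v)                   # rescale to match row marginals
    return u[:, None] * K * v[None, :]    # Gamma = diag(u) K diag(v)

# Toy problem: the plan concentrates mass on the cheap diagonal pairs.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
p = q = np.array([0.5, 0.5])
G = sinkhorn(C, p, q)
```

Smaller `eps` yields sparser, EMD-like plans at the price of slower convergence; larger `eps` gives the denser plans mentioned in Section 2.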
Kernel Density Estimation  Kernel Density Estimation (KDE) is a non-parametric density estimation method based on kernel smoothing (Parzen, 1962; Rosenblatt, 1956). Here, we consider a generalized KDE for metric spaces $(\mathcal{X}, d_\mathcal{X})$ and $(\mathcal{Y}, d_\mathcal{Y})$ (Li et al., 2020; Pelletier, 2005). In particular, given a paired dataset $\{(x_i, y_i)\}_{i=1}^n \subset (\mathcal{X}^n, \mathcal{Y}^n)$ sampled i.i.d. from an unknown joint density $f_{XY}$ and a kernel function $K: \mathbb{R} \to \mathbb{R}$, KDE estimates the marginals and the joint density as
$$\hat{f}_X(x) = \frac{1}{n} \sum_i K_{h_1}(d_\mathcal{X}(x, x_i)); \qquad (1)$$
$$\hat{f}_{XY}(x, y) = \frac{1}{n} \sum_i K_{h_1}(d_\mathcal{X}(x, x_i))\, K_{h_2}(d_\mathcal{Y}(y, y_i)),$$
where $K_h(t) = K(t/h)/Z_h$ and the normalizing constant $Z_h$ makes equation (1) integrate to one. The bandwidth parameter $h$ controls the smoothness of the estimated densities. In this work, we do not need to estimate the normalizing constant, since only the ratio between joint and marginal densities $\hat{f}_{XY}(x, y)/(\hat{f}_X(x)\hat{f}_Y(y))$ is needed to estimate the mutual information. For all the presented experiments, we adopt the Gaussian kernel:
$$K_h(d_\mathcal{X}(x, x')) = \frac{1}{Z_h} \exp\left(-\frac{d_\mathcal{X}(x, x')^2}{2h^2\sigma^2}\right),$$
where $\sigma^2$ controls the variance. The Gaussian kernel has been successfully adopted for KDE even in non-Euclidean spaces (Li et al., 2020; Said et al., 2017), and we found it to work well in our experiments. For simplicity, we also set $h_1 = h_2 = h$ for all experiments. Importantly, as opposed to neural density estimators such as that of Belghazi et al. (2018b), KDE yields a convenient closed-form algorithm, as we show next. We also found that KDE works well empirically in high-dimensional settings when combined with OT.
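Because the normalizing constants cancel, the density ratio can be computed directly from unnormalized kernel values. A minimal numpy sketch with Euclidean distances, absorbing $\sigma$ into the bandwidth $h$ (the helper name, bandwidth, and toy data are our own choices):

```python
import numpy as np

def kde_ratio(x, y, X, Y, h=1.0):
    """Estimate f_XY(x, y) / (f_X(x) f_Y(y)) from paired samples (X, Y).

    The normalizing constants Z_h cancel in the ratio, so the Gaussian
    kernel values can be used unnormalized.
    """
    kx = np.exp(-np.linalg.norm(X - x, axis=1) ** 2 / (2 * h ** 2))
    ky = np.exp(-np.linalg.norm(Y - y, axis=1) ** 2 / (2 * h ** 2))
    return np.mean(kx * ky) / (np.mean(kx) * np.mean(ky))

# Two well-separated clusters, paired identically: the ratio is large
# for a matched pair and near zero for a mismatched one.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
Y = X.copy()
r_match = kde_ratio(np.array([0.05]), np.array([0.05]), X, Y, h=0.5)
r_mismatch = kde_ratio(np.array([0.05]), np.array([5.05]), X, Y, h=0.5)
```

The ratio exceeds one exactly where the pairing preserves neighborhood structure, which is the signal InfoOT exploits in its cost.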
4. Information Maximizing OT
Optimal transport captures the geometry of the underlying
space through the ground metric in its objective. Additional
information that is not directly captured in this metric, such as coherence structure, will therefore be ignored when solving the problem.
where this additional structure matters, for instance in do-
main adaptation, where class coherence is crucial. As a
concrete example, the cluster structure of the dataset in Figure 1 is ignored by classic OT. Intuitively, the reason for this issue is that classic OT is too local: the transportation
cost considers each sample separately, without respecting
coherence across close-by samples. Next, we show that mu-
tual information estimated with KDE can introduce global
structure into OT maps.
4.1. Measuring Global Structure with Mutual Information
Formally, mutual information measures the statistical dependence of two random variables $X, Y$:
$$I(X, Y) = \int\!\!\int_{\mathcal{X} \times \mathcal{Y}} f_{XY}(x, y) \log \frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\, dx\, dy, \qquad (2)$$
where $f_{XY}$ is the joint density and $f_X, f_Y$ are the marginal probability density functions. For paired datasets, various mutual information estimators have been defined (Belghazi et al., 2018a; Moon et al., 1995; Poole et al., 2019). In contrast,
we are interested in the inverse: given unpaired samples $\{x_i\}_{i=1}^n, \{y_j\}_{j=1}^m$, can we find alignments that maximize the mutual information?
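For intuition, Equation (2) reduces to a finite sum when the joint distribution is a table. A small numpy check of the two extreme cases (an illustration of the definition, not part of the InfoOT algorithm):

```python
import numpy as np

def mutual_information(P):
    """MI of a discrete joint distribution P (nonnegative matrix summing to one)."""
    px = P.sum(axis=1, keepdims=True)   # marginal of X
    py = P.sum(axis=0, keepdims=True)   # marginal of Y
    mask = P > 0                        # 0 * log 0 = 0 by convention
    return float(np.sum(P[mask] * np.log(P[mask] / (px @ py)[mask])))

P_indep = np.full((2, 2), 0.25)    # independent variables: MI = 0
P_coupled = np.diag([0.5, 0.5])    # deterministic coupling: MI = log 2
```

Maximizing the dependence of the aligned pairs thus means pushing the (estimated) joint away from the product of its marginals.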
Discrete vs. Continuous.  An immediate idea is to treat the discrete transportation plan $\Gamma$ as the joint distribution between $X$ and $Y$, and write the mutual information as $\sum_{i,j} \Gamma_{ij} \log(nm\,\Gamma_{ij}) = \log(nm) - H(\Gamma)$. In this case, maximizing mutual information would be equivalent to minimizing the entropic regularizer $H(\Gamma)$ introduced by Cuturi (2013). For a finite set of samples, this mutual information estimator is trivially maximized for any one-to-one mapping, as then $H(\Gamma) = 0$. Figure 2(a) illustrates two one-to-one mappings $\Gamma_a$ and $\Gamma_b$ between points sampled from multi-