InfoOT: Information Maximizing Optimal Transport
Ching-Yao Chuang^1  Stefanie Jegelka^1  David Alvarez-Melis^2
Abstract
Optimal transport aligns samples across distribu-
tions by minimizing the transportation cost be-
tween them, e.g., the geometric distances. Yet,
it ignores coherence structure in the data such as
clusters, does not handle outliers well, and can-
not integrate new data points. To address these
drawbacks, we propose InfoOT, an information-
theoretic extension of optimal transport that max-
imizes the mutual information between domains
while minimizing geometric distances. The re-
sulting objective can still be formulated as a (gen-
eralized) optimal transport problem, and can be
efficiently solved by projected gradient descent.
This formulation yields a new projection method
that is robust to outliers and generalizes to unseen
samples. Empirically, InfoOT improves the qual-
ity of alignments across benchmarks in domain
adaptation, cross-domain retrieval, and single-
cell alignment. The code is available at https://github.com/chingyaoc/InfoOT.
1. Introduction
Optimal Transport (OT) provides a general framework with
a strong theoretical foundation to compare probability dis-
tributions based on the geometry of their underlying spaces
(Villani,2009). Besides its fundamental role in mathematics,
OT has increasingly received attention in machine learning
due to its wide range of applications in domain adapta-
tion (Courty et al.,2017;Redko et al.,2019;Xu et al.,
2020), generative modeling (Arjovsky et al.,2017;Bous-
quet et al.,2017), representation learning (Ozair et al.,2019;
Chuang et al.,2022), and generalization bounds (Chuang
et al., 2021). The development of efficient algorithms (Cuturi, 2013; Peyré et al., 2016) has significantly accelerated the adoption of optimal transport in these applications.
^1 MIT CSAIL, Cambridge, MA, USA. ^2 Microsoft Research, Cambridge, MA, USA. Correspondence to: Ching-Yao Chuang <cychuang@mit.edu>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Computationally, the discrete formulation of OT seeks a
matrix, also called transportation plan, that minimizes the
total geometric transportation cost between two sets of sam-
ples drawn from the source and target distributions. The
transportation plan implicitly defines (soft) correspondences
across these samples, but provides no mechanism to relate
newly-drawn data points. Aligning these requires solving a
new OT problem from scratch. This limits the applicability
of OT, e.g., to streaming settings where the samples arrive in
sequence, or very large datasets where we can only solve OT
on a subset. In this case, the current solution cannot be used
on future data. To overcome this fundamental constraint,
a line of work proposes to directly estimate a mapping,
the pushforward from source to target, that minimizes the
transportation cost (Perrot et al.,2016;Seguy et al.,2017).
Nevertheless, the resulting mapping is highly dependent
on the complexity of the mapping function (Galanti et al.,
2021).
OT could also yield alignments that ignore the intrinsic
coherence structure of the data. In particular, by relying
exclusively on pairwise geometric distances, two nearby
source samples could be mapped to disparate target samples,
as in Figure 1, which is undesirable in some settings. For
instance, when applying OT for domain adaptation, source
samples with the same class should ideally be mapped to
similar target samples. To mitigate this, prior work has
sought to impose structural priors on the OT objective, e.g.,
via submodular cost functions (Alvarez-Melis et al.,2018)
or a Gromov-Wasserstein regularizer (Vayer et al.,2018b;a).
However, these methods still suffer from sensitivity to out-
liers (Mukherjee et al.,2021) and imbalanced data (Hsu
et al.,2015;Tan et al.,2020).
This work presents Information Maximization Optimal
Transport (InfoOT), an information-theoretic extension of
the optimal transport problem that generalizes the usual
formulation by infusing it with global structure in form
of mutual information. In particular, InfoOT seeks align-
ments that maximize mutual information, an information-
theoretic measure of dependence, between domains. To do
so, we treat the pairs selected by the transportation plan as
samples drawn from the joint distribution and estimate the
mutual information with kernel density estimation based
on the paired samples (Moon et al.,1995). Interestingly,
this results in an OT problem where the cost is the log ratio between the estimated joint and marginal densities, $\log\big(f_{XY}(x,y)/(f_X(x) f_Y(y))\big)$. Empirically, we show that using a cost combining mutual information with geometric distances yields better alignments across different applications.

arXiv:2210.03164v2 [cs.LG] 29 May 2023

Figure 1: Illustration of InfoOT on a 2D point cloud (panels: OT, InfoOT, InfoOT w/ Barycentric, InfoOT w/ Conditional). Compared to classic OT, InfoOT preserves the cluster structure: source points from the same cluster are mapped to the same target cluster. For projection estimation (dashed lines), the new conditional projection improves over the barycentric projection with better outlier robustness and out-of-sample generalization.
Moreover, akin to Gromov-Wasserstein (Mémoli, 2011), the mutual information estimator relies only on intra-domain distances, which, unlike the standard OT formulation, makes it suitable for aligning distributions whose supports lie in different metric spaces, e.g., supports with different modalities or dimensionality (Alvarez-Melis & Fusi, 2020; Demetci et al., 2020).
By estimating a joint density, InfoOT naturally yields a
novel method for out-of-sample transportation by taking an
expectation over the estimated densities conditioned on the
source samples, which we refer to as conditional projection.
Typically, samples are mapped via a barycentric projection
(Ferradans et al.,2014;Flamary et al.,2016), which corre-
sponds to the weighted average of target samples, where
the weights are determined by the transportation plan. The
barycentric projection inherits the disadvantages of standard
OT: sensitivity to outliers and failing to generalize to new
samples. In contrast, our proposed conditional projection is robust to outliers and cross-domain class-imbalanced data (Figures 1 and 3): it averages over samples with importance weights, where the weight is, again, the ratio between the estimated joint and marginal densities. Furthermore, this
projection is well-defined even for unseen samples, which
widens the applicability of OT in streaming or large-scale
settings where solving OT for the complete dataset is pro-
hibitive.
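The contrast between the two projections is easiest to see in code. Below is a minimal numpy sketch of the standard barycentric projection only (the conditional projection depends on the density estimates introduced later); the function name and toy data are our own illustration:

```python
import numpy as np

def barycentric_projection(gamma, Y):
    """Map each source point to the Gamma-weighted average of target samples.

    gamma: (n, m) transportation plan; Y: (m, d) target samples.
    """
    # Normalize each row so the weights over target samples sum to one.
    weights = gamma / gamma.sum(axis=1, keepdims=True)
    return weights @ Y  # (n, d) projected source samples

# Toy check: a hard one-to-one plan maps each source point
# exactly onto its matched target.
Y = np.array([[0.0, 0.0], [10.0, 10.0]])
gamma = np.array([[0.5, 0.0], [0.0, 0.5]])  # uniform marginals, n = m = 2
projected = barycentric_projection(gamma, Y)
```

For a soft plan, each projected point instead lands between the targets it is coupled to, which is exactly why outliers in the plan can drag the projection off course.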
In short, this work makes the following contributions:
• We propose InfoOT, an information-theoretic extension of optimal transport that regularizes alignments by maximizing mutual information;
• We develop conditional projection, a new projection method for OT that is robust to outliers and class imbalance in data, and generalizes to new samples;
• We evaluate our approach via experiments in domain adaptation, cross-domain retrieval, and single-cell alignment.
2. Related Work
Optimal Transport Optimal transport provides an ele-
gant framework to compare and align distributions. The
discrete formulation, also called Earth Mover’s Distance
(EMD), finds an optimal coupling between empirical sam-
ples by solving a linear programming problem (Bonneel
et al.,2011). To speed up the computation, Cuturi (2013)
propose the Sinkhorn distance, an entropic regularized ver-
sion of EMD that can be solved more efficiently via the
Sinkhorn-Knopp algorithm (Knight,2008). Compared to
EMD, this regularized formulation typically yields denser
transportation plans, where samples can be associated with
multiple target points. Various extensions of OT have been
proposed to impose stronger priors, e.g., Alvarez-Melis et al.
(2018) incorporate additional structure by leveraging a sub-
modular transportation cost, while Flamary et al. (2016)
induce class coherence through a group-sparsity regularizer.
The Gromov-Wasserstein (GW) distance (Mémoli, 2011) is
a variant of OT in which the transportation cost is defined
upon intra-domain pairwise distances. Therefore, GW has
been adopted to align ‘incomparable spaces’ (Alvarez-Melis
& Jaakkola,2018;Demetci et al.,2020) as the source and tar-
get domains do not need to lie in the same space. Since the
GW objective is no longer a linear program, it is typically
optimized using projected gradient descent (Peyré et al.,
2016;Solomon et al.,2016). The Fused-GW, which com-
bines the OT and GW objectives, was proposed by Vayer
et al. (2018a) to measure graph distances.
Mutual Information and OT The proposed InfoOT ex-
tends the standard OT formulation by maximizing a ker-
nel density estimated mutual information. Recent works
(Bai et al.,2020;Khan & Zhang,2022) also explore the
connection between OT and information theory. Liu et al.
(2021) consider a semi-supervised setting for estimating a
variant of mutual information, where the unpaired samples
are leveraged to minimize the estimation error. Ozair et al.
(2019) replace the KL divergence in mutual information
with Wasserstein distance and develop a loss function for
representation learning. In comparison, the objective of
InfoOT is to seek alignments that maximize the mutual in-
formation while being fully unsupervised by parameterizing
the joint densities with the transportation plan. Another line
of work also combines OT with kernel density estimation
(Canas & Rosasco,2012;Mokrov et al.,2021), but focuses
on different applications.
3. Background on OT and KDE
Optimal Transport  Let $\{x_i\}_{i=1}^n \subset \mathcal{X}^n$ and $\{y_j\}_{j=1}^m \subset \mathcal{Y}^m$ be the empirical samples and $C \in \mathbb{R}^{n \times m}$ be the transportation cost for each pair, e.g., the Euclidean cost $C_{ij} = \|x_i - y_j\|$. Given two sets of weights over samples $p \in \mathbb{R}_+^n$ and $q \in \mathbb{R}_+^m$ where $\sum_{i=1}^n p_i = \sum_{j=1}^m q_j = 1$, and a cost matrix $C$, Kantorovich's formulation of optimal transport solves
$$\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle, \quad \text{where} \quad \Pi(p,q) = \{\Gamma \in \mathbb{R}_+^{n \times m} \mid \Gamma 1_m = p,\ \Gamma^\top 1_n = q\}.$$
Here $\Pi(p,q)$ is the set of transportation plans that satisfy the flow constraints. In practice, the Sinkhorn distance (Cuturi, 2013), an entropic regularized version of OT, can be solved more efficiently via the Sinkhorn-Knopp algorithm. In particular, the Sinkhorn distance solves $\min_{\Gamma \in \Pi(p,q)} \langle \Gamma, C \rangle - \epsilon H(\Gamma)$, where $H(\Gamma) = -\sum_{i,j} \Gamma_{ij} \log \Gamma_{ij}$ is the entropic regularizer that smooths the transportation plan.
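The Sinkhorn-Knopp algorithm alternately rescales the rows and columns of a Gibbs kernel until both marginal constraints hold. A minimal numpy sketch (our own illustration, not the authors' code; the regularization strength and toy cost matrix are arbitrary choices):

```python
import numpy as np

def sinkhorn(C, p, q, eps=0.1, n_iter=500):
    """Sinkhorn-Knopp iterations for min <Gamma, C> - eps * H(Gamma)."""
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)                 # rescale to match column marginals
        u = p / (K @ v)                   # rescale to match row marginals
    return u[:, None] * K * v[None, :]    # Gamma = diag(u) K diag(v)

# Toy problem: the plan concentrates mass on the cheap diagonal pairs.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
p = q = np.array([0.5, 0.5])
G = sinkhorn(C, p, q)
```

Smaller `eps` yields sparser, EMD-like plans at the price of slower convergence; larger `eps` gives the denser plans mentioned in Section 2.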
Kernel Density Estimation  Kernel Density Estimation (KDE) is a non-parametric density estimation method based on kernel smoothing (Parzen, 1962; Rosenblatt, 1956). Here, we consider a generalized KDE for metric spaces $(\mathcal{X}, d_\mathcal{X})$ and $(\mathcal{Y}, d_\mathcal{Y})$ (Li et al., 2020; Pelletier, 2005). In particular, given a paired dataset $\{(x_i, y_i)\}_{i=1}^n \subset (\mathcal{X}^n, \mathcal{Y}^n)$ sampled i.i.d. from an unknown joint density $f_{XY}$ and a kernel function $K: \mathbb{R} \to \mathbb{R}$, KDE estimates the marginals and the joint density as
$$\hat{f}_X(x) = \frac{1}{n} \sum_i K_{h_1}(d_\mathcal{X}(x, x_i)); \qquad (1)$$
$$\hat{f}_{XY}(x, y) = \frac{1}{n} \sum_i K_{h_1}(d_\mathcal{X}(x, x_i))\, K_{h_2}(d_\mathcal{Y}(y, y_i)),$$
where $K_h(t) = K(t/h)/Z_h$ and the normalizing constant $Z_h$ makes equation (1) integrate to one. The bandwidth parameter $h$ controls the smoothness of the estimated densities. In this work, we do not need to estimate the normalizing constant, since only the ratio between joint and marginal densities $\hat{f}_{XY}(x, y)/(\hat{f}_X(x)\hat{f}_Y(y))$ is needed to estimate the mutual information. For all the presented experiments, we adopt the Gaussian kernel:
$$K_h(d_\mathcal{X}(x, x')) = \frac{1}{Z_h} \exp\left(-\frac{d_\mathcal{X}(x, x')^2}{2h^2\sigma^2}\right),$$
where $\sigma^2$ controls the variance. The Gaussian kernel has been successfully adopted for KDE even in non-Euclidean spaces (Li et al., 2020; Said et al., 2017), and we found it to work well in our experiments. For simplicity, we also set $h_1 = h_2 = h$ for all experiments. Importantly, as opposed to neural density estimators such as that of Belghazi et al. (2018b), KDE yields a convenient closed-form algorithm, as we show next. We also found that KDE works well empirically in high-dimensional settings when combined with OT.
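Because the normalizing constants cancel, the density ratio can be computed directly from unnormalized kernel values. A minimal numpy sketch with Euclidean distances, absorbing $\sigma$ into the bandwidth $h$ (the helper name, bandwidth, and toy data are our own choices):

```python
import numpy as np

def kde_ratio(x, y, X, Y, h=1.0):
    """Estimate f_XY(x, y) / (f_X(x) f_Y(y)) from paired samples (X, Y).

    The normalizing constants Z_h cancel in the ratio, so the Gaussian
    kernel values can be used unnormalized.
    """
    kx = np.exp(-np.linalg.norm(X - x, axis=1) ** 2 / (2 * h ** 2))
    ky = np.exp(-np.linalg.norm(Y - y, axis=1) ** 2 / (2 * h ** 2))
    return np.mean(kx * ky) / (np.mean(kx) * np.mean(ky))

# Two well-separated clusters, paired identically: the ratio is large
# for a matched pair and near zero for a mismatched one.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
Y = X.copy()
r_match = kde_ratio(np.array([0.05]), np.array([0.05]), X, Y, h=0.5)
r_mismatch = kde_ratio(np.array([0.05]), np.array([5.05]), X, Y, h=0.5)
```

The ratio exceeds one exactly where the pairing preserves neighborhood structure, which is the signal InfoOT exploits in its cost.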
4. Information Maximizing OT
Optimal transport captures the geometry of the underlying
space through the ground metric in its objective. Additional
information that is not directly captured in this metric, such as coherence structure, will therefore be ignored when solving the problem.
where this additional structure matters, for instance in do-
main adaptation, where class coherence is crucial. As a
concrete example, the cluster structure of the dataset in Figure 1 is ignored by classic OT. Intuitively, the reason for this issue is that classic OT is too local: the transportation
cost considers each sample separately, without respecting
coherence across close-by samples. Next, we show that mu-
tual information estimated with KDE can introduce global
structure into OT maps.
4.1. Measuring Global Structure with Mutual Information
Formally, mutual information measures the statistical dependence of two random variables $X, Y$:
$$I(X, Y) = \int\!\!\int_{\mathcal{X} \times \mathcal{Y}} f_{XY}(x, y) \log \frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\, dx\, dy, \qquad (2)$$
where $f_{XY}$ is the joint density and $f_X, f_Y$ are the marginal probability density functions. For paired datasets, various mutual information estimators have been defined (Belghazi et al., 2018a; Moon et al., 1995; Poole et al., 2019). In contrast,
we are interested in the inverse: given unpaired samples $\{x_i\}_{i=1}^n, \{y_j\}_{j=1}^m$, can we find alignments that maximize the mutual information?
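For intuition, Equation (2) reduces to a finite sum when the joint distribution is a table. A small numpy check of the two extreme cases (an illustration of the definition, not part of the InfoOT algorithm):

```python
import numpy as np

def mutual_information(P):
    """MI of a discrete joint distribution P (nonnegative matrix summing to one)."""
    px = P.sum(axis=1, keepdims=True)   # marginal of X
    py = P.sum(axis=0, keepdims=True)   # marginal of Y
    mask = P > 0                        # 0 * log 0 = 0 by convention
    return float(np.sum(P[mask] * np.log(P[mask] / (px @ py)[mask])))

P_indep = np.full((2, 2), 0.25)    # independent variables: MI = 0
P_coupled = np.diag([0.5, 0.5])    # deterministic coupling: MI = log 2
```

Maximizing the dependence of the aligned pairs thus means pushing the (estimated) joint away from the product of its marginals.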
Discrete vs. Continuous.  An immediate idea is to treat the discrete transportation plan $\Gamma$ as the joint distribution between $X$ and $Y$, and write the mutual information as $\sum_{i,j} \Gamma_{ij} \log(nm\,\Gamma_{ij}) = \log(nm) - H(\Gamma)$. In this case, maximizing mutual information would be equivalent to minimizing the entropic regularizer $H(\Gamma)$ introduced by Cuturi (2013). For a finite set of samples, this mutual information estimator is trivially maximized for any one-to-one mapping, as then $H(\Gamma) = 0$. Figure 2(a) illustrates two one-to-one mappings $\Gamma_a$ and $\Gamma_b$ between points sampled from multi-