Geometric Sparse Coding in Wasserstein Space

Marshall Mueller, Shuchin Aeron, James M. Murphy, Abiy Tasissa

October 24, 2022

arXiv:2210.12135v1 [cs.LG] 21 Oct 2022

Department of Mathematics, Tufts University, Medford, MA 02155, USA (marshall.mueller@tufts.edu, jm.murphy@tufts.edu, abiy.tasissa@tufts.edu)
Department of Electrical and Computer Engineering, Tufts University, Medford, MA 02155, USA (shuchin@ece.tufts.edu)
Corresponding authors, listed alphabetically
Abstract

Wasserstein dictionary learning is an unsupervised approach to learning a collection of probability distributions that generate observed distributions as Wasserstein barycentric combinations. Existing methods for Wasserstein dictionary learning optimize an objective that seeks a dictionary with sufficient representation capacity via barycentric interpolation to approximate the observed training data, but without imposing additional structural properties on the coefficients associated to the dictionary. This leads to dictionaries that densely represent the observed data, which makes interpretation of the coefficients challenging and may also lead to poor empirical performance when using the learned coefficients in downstream tasks. In contrast and motivated by sparse dictionary learning in Euclidean spaces, we propose a geometrically sparse regularizer for Wasserstein space that promotes representations of a data point using only nearby dictionary elements. We show this approach leads to sparse representations in Wasserstein space and addresses the problem of non-uniqueness of barycentric representation. Moreover, when data is generated as Wasserstein barycenters of fixed distributions, this regularizer facilitates the recovery of the generating distributions in cases that are ill-posed for unregularized Wasserstein dictionary learning. Through experimentation on synthetic and real data, we show that our geometrically regularized approach yields sparser and more interpretable dictionaries in Wasserstein space, which perform better in downstream applications.
1 Introduction
A central goal of statistical signal processing is the discovery of latent structures in complex data. Indeed, although data often resides in a high-dimensional ambient space, the
classical manifold hypothesis posits that in fact, the data can be well approximated by low-
dimensional manifolds or mixtures thereof, which circumvents the curse of dimensionality
that plagues high-dimensional statistics. Linear dimensionality reduction methods such
as principal component analysis (PCA) (Hotelling, 1933) and non-linear manifold learning
approaches that exploit local connectivity structure in the data (Schölkopf et al., 1997;
Tenenbaum et al., 2000; Roweis and Saul, 2000; Belkin and Niyogi, 2003; Coifman and
Lafon, 2006) rely on this assumption of intrinsically low-dimensional structure in high-
dimensional data to glean insights. These techniques typically output a low-dimensional
representation preserving local geometric structures such as pairwise distances or geodesic
distances with respect to a Riemannian metric.
An alternative perspective for efficiently representing complex data is the sparse coding
and dictionary learning paradigm (Olshausen and Field, 1996, 1997; Barlow et al., 1961;
Hromádka et al., 2008). In the simplest setting when data is considered as elements of $\mathbb{R}^d$ (or more generally a normed vector space), the aim of sparse coding is to represent data $\{\mathbf{y}_i\}_{i=1}^n \subset \mathbb{R}^d$, stacked as rows in the matrix $\mathbf{Y} \in \mathbb{R}^{n \times d}$, as a linear combination of vectors $\{\mathbf{d}_j\}_{j=1}^m$, stacked as a dictionary matrix $\mathbf{D} \in \mathbb{R}^{m \times d}$, such that $\mathbf{Y} \approx \mathbf{\Lambda}\mathbf{D}$ for some coefficients $\mathbf{\Lambda} \in \mathbb{R}^{n \times m}$, perhaps subject to constraints on $\mathbf{\Lambda}$. When the dictionary $\mathbf{D}$ is fixed, this reduces to an optimization over $\mathbf{\Lambda}$ (Mallat, 1999; Engan et al., 2000). More generally, $\mathbf{D}$ and $\mathbf{\Lambda}$ can be learned simultaneously with some additional constraints on the dictionary or coefficients (Lee and Seung, 1999; Aharon et al., 2006). This is typically formulated as an optimization problem
$$\arg\min_{\mathbf{D},\,\mathbf{\Lambda}} \; \mathcal{L}(\mathbf{Y}, \mathbf{\Lambda}\mathbf{D}) + \rho\, \mathcal{R}(\mathbf{D}, \mathbf{\Lambda}), \qquad (1)$$
for some loss function $\mathcal{L}$ (e.g., $\mathcal{L}(\mathbf{Y}, \mathbf{\Lambda}\mathbf{D}) = \|\mathbf{Y} - \mathbf{\Lambda}\mathbf{D}\|_F$) and regularization function $\mathcal{R}$ (e.g., $\mathcal{R}(\mathbf{D}, \mathbf{\Lambda}) = \|\mathbf{\Lambda}\|_1$) balanced by a parameter $\rho > 0$. The regularizers ensure well-posedness of the problem and improve interpretability and robustness. The problem (1) is the dictionary learning problem in $\mathbb{R}^d$.
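To fix ideas, the following minimal numpy sketch (our own illustration, not from any referenced implementation) sets up the matrices in this formulation and evaluates the objective in (1) with the Frobenius loss and an $\ell_1$ penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 100, 5, 20                 # n data points in R^d, m dictionary atoms
D = rng.standard_normal((m, d))      # dictionary matrix, atoms stored as rows
Lam = rng.random((n, m))             # coefficient matrix
Y = Lam @ D                          # data generated as linear combinations of the atoms

def objective(Y, D, Lam, rho):
    """Frobenius reconstruction loss plus an l1 penalty on the coefficients, as in (1)."""
    return np.linalg.norm(Y - Lam @ D, "fro") + rho * np.abs(Lam).sum()

print(objective(Y, D, Lam, rho=0.1))  # here the reconstruction term is exactly zero
```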
The imposed Euclidean structure is convenient computationally but limiting in practice, as many real data are better modeled as living in spaces with non-Euclidean geometry where instead $\mathbf{Y} \approx \mathcal{F}(\mathbf{D}, \mathbf{\Lambda})$ for some nonlinear reconstruction function $\mathcal{F}$ (Tuzel et al., 2006, 2007; Li et al., 2008; Guo et al., 2010; Harandi et al., 2013, 2015; Cherian and Sra, 2016; Yin et al., 2016; Maggioni et al., 2016; Liu et al., 2018; Schmitz et al., 2018; Tankala et al., 2020). Important questions in this setting are what notion of reconstruction should take the place of linear combination (i.e., $\mathcal{F}$), how reconstruction quality is assessed without the use of a global norm (i.e., $\mathcal{L}$), and what constraints are natural on the coefficients in the nonlinear space (i.e., $\mathcal{R}$).
This paper focuses on nonlinear sparse coding and dictionary learning for data that are modeled as probability distributions in Wasserstein space. This basic framework was pioneered by Schmitz et al. (2018), where the authors leverage the theory and algorithms of optimal transport to propose the Wasserstein dictionary learning (WDL) algorithm, whereby a data point (interpreted as a probability distribution or histogram in $\mathbb{R}^d$) is approximated as a Wasserstein barycenter (Agueh and Carlier, 2011) of the learned dictionary atoms. The resulting framework is focused on learning a dictionary that reconstructs well, but neglects other desirable aspects of a dictionary such as sparsity of the induced coefficients. Moreover, this classical scheme is ill-posed in two senses: for a fixed dictionary, unique coefficients are not assured, and there may be multiple dictionaries that enable perfect reconstruction of the observed data.
Summary of Contributions: We generalize the classical WDL algorithm (Schmitz et al., 2018) by incorporating a novel Wasserstein geometric sparse regularizer. Our regularizer encourages an observed data point to be reconstructed as a barycentric representation from nearby (in the sense of Wasserstein distances) dictionary atoms. As we vary the balance parameter for this regularizer, the proposed method interpolates between classical WDL (no regularization) and Wasserstein K-means (strong regularization). Unlike the original formulation, the proposed regularizer learns dictionary atoms with geometric similarity to the training data. Theoretically, we demonstrate the ability of the model to learn sparse coefficients and to overcome issues of non-uniqueness both at the level of learning coefficients for a fixed dictionary, and for the general WDL problem on an idealized generative model in Wasserstein space. Empirically, we provide evidence that our regularized dictionary learning scheme yields more interpretable and useful coefficients for downstream classification tasks; the code to reproduce all experiments in this paper will be released on the authors' GitHub.
Notations and Preliminaries: Lowercase and uppercase boldface letters denote (column) vectors and matrices, respectively. We generally use Greek letters to denote measures, with the exception that $\mathcal{D} = \{D_j\}_{j=1}^m$ denotes the dictionary when its elements are measures. We denote the Euclidean norm of a vector $\mathbf{x}$ as $\|\mathbf{x}\|_2$. Let
$$\Delta_m = \Big\{\mathbf{x} \in \mathbb{R}^m \;\Big|\; \sum_{i=1}^m x_i = 1,\ x_i \geq 0,\ i = 1, \dots, m\Big\}$$
denote the discrete probability simplex of dimension $m$. Softmax, as a change of variables, is defined as $\mathrm{CoV}(\mathbf{x}) := \exp(\mathbf{x})/\exp(\mathbf{x})^T \mathbb{1}_N$. Here we take the exponential to be an elementwise operation on the vector and use $\mathbb{1}_N$ to denote the ones vector of size $N$. When we write $\mathrm{CoV}(\mathbf{X})$ for some matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, we take it to mean applying the change of variables to each row.
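For concreteness, a minimal numpy sketch of this row-wise change of variables (our own illustration, not taken from the paper's released code) is:

```python
import numpy as np

def cov(X):
    """Row-wise softmax change of variables: maps each row of X onto the simplex."""
    # Subtracting the row-wise max improves numerical stability and does not change the output.
    Z = np.exp(X - X.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# Unconstrained parameters for n = 3 data points and m = 4 atoms.
X = np.random.randn(3, 4)
Lambda = cov(X)
print(Lambda.sum(axis=1))  # each row sums to 1 and is entrywise nonnegative
```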
2 Background and Related Work
Classical Dictionary Learning: In equation (1), using $\mathcal{L}(\mathbf{Y}, \mathbf{\Lambda}\mathbf{D}) = \|\mathbf{Y} - \mathbf{\Lambda}\mathbf{D}\|_F$ and $\rho = 0$ yields an optimization problem with optimal dictionary and coefficients given by the $m$ singular components with largest singular values (Eckart and Young, 1936). To promote interpretable, sparse coefficients that still realize $\mathbf{Y} \approx \mathbf{\Lambda}\mathbf{D}$, the prototypical regularized dictionary learning problem is $\min_{\mathbf{D}, \mathbf{\Lambda}} \|\mathbf{Y} - \mathbf{\Lambda}\mathbf{D}\|_F^2 + \rho \|\mathbf{\Lambda}\|_1$, where $\|\mathbf{\Lambda}\|_1 = \sum_{i=1}^n \sum_{j=1}^m |\Lambda_{ij}|$ is a sparsity-promoting regularizer (Donoho, 2006; Elad, 2010).
Beyond enhancing interpretability, efficiency, and uniqueness of representation, sparse
representations improve generalization of efficient supervised learning (Mehta and Gray,
2013; Mairal et al., 2011). In the non-negative matrix factorization (NMF) paradigm,
non-negativity constraints are imposed on the atoms and coefficients, which increases
their interpretability and effectiveness in downstream applications (Lee and Seung, 1999,
2000; Berry et al., 2007).
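As a minimal sketch of this prototype (ours, not tied to any particular library), the coefficients for a fixed dictionary can be computed by iterative soft-thresholding (ISTA) applied to the $\ell_1$-regularized least-squares problem:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Entrywise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def sparse_code_ista(Y, D, rho, n_iter=200):
    """Minimize ||Y - Lam @ D||_F^2 + rho * ||Lam||_1 over Lam, with D fixed (ISTA)."""
    n, m = Y.shape[0], D.shape[0]
    Lam = np.zeros((n, m))
    # Step size from the Lipschitz constant of the smooth part, 2 * sigma_max(D)^2.
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * (Lam @ D - Y) @ D.T      # gradient of the squared Frobenius term
        Lam = soft_threshold(Lam - step * grad, step * rho)
    return Lam
```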
Optimal Transport: We provide basic background on optimal transport; for more general treatments and theory, see (Ambrosio et al., 2005; Villani, 2021; Santambrogio, 2015; Peyré et al., 2019). Let $\mathcal{P}(\mathbb{R}^d)$ be the space of probability measures in $\mathbb{R}^d$. Let $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$. Let
$$\Pi(\mu, \nu) = \{\gamma : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} \mid \text{for all } A, B \text{ Borel},\ \gamma(A \times \mathbb{R}^d) = \mu(A),\ \gamma(\mathbb{R}^d \times B) = \nu(B)\}$$
be the set of joint distributions with marginals $\mu$ and $\nu$. The squared Wasserstein-2 distance is defined as:
$$W_2^2(\mu, \nu) := \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|\mathbf{x} - \mathbf{y}\|_2^2 \, d\pi(\mathbf{x}, \mathbf{y}). \qquad (2)$$
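For discrete measures supported on finitely many points, (2) is a linear program. The following minimal sketch (ours) assumes the POT (Python Optimal Transport) package's `ot.dist` and `ot.emd2` interface; any exact OT solver could be substituted.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 2))   # support points of mu
y = rng.standard_normal((60, 2))   # support points of nu
a = np.full(50, 1 / 50)            # uniform weights of mu
b = np.full(60, 1 / 60)            # uniform weights of nu

M = ot.dist(x, y)                  # pairwise squared Euclidean costs ||x - y||_2^2
w2_squared = ot.emd2(a, b, M)      # optimal transport cost, i.e. W_2^2(mu, nu)
print(w2_squared)
```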
Given measures $\{D_j\}_{j=1}^m \subset \mathcal{P}(\mathbb{R}^d)$ that have finite second moments, along with a vector $\boldsymbol{\lambda} \in \Delta_m$, the Wasserstein-2 barycenter (Agueh and Carlier, 2011) is defined as:
$$\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda}) := \arg\min_{\mu \in \mathcal{P}(\mathbb{R}^d)} \sum_{j=1}^m \lambda_j W_2^2(D_j, \mu). \qquad (3)$$
The measure $\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda})$ can be interpreted as a weighted average of the $\{D_j\}_{j=1}^m$, with the impact of $D_j$ proportional to $\lambda_j$. Wasserstein barycenters have proven useful in a range of applications, and are in a precise sense the "correct" way to combine measures, in that $\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda})$ preserves the geometric properties of $\mathcal{D} = \{D_j\}_{j=1}^m$ in a way that linear mixtures do not (Agueh and Carlier, 2011; Rabin et al., 2011; Cuturi and Doucet, 2014; Bonneel et al., 2016). Wasserstein barycenters are intimately connected to geodesics in Wasserstein space, in the following sense. For $\pi$ optimizing (2), the McCann interpolation of $\mu, \nu$ is $(P_t)_\# \pi$, where $P_t(\mathbf{x}, \mathbf{y}) = (1 - t)\mathbf{x} + t\mathbf{y}$ for $t \in [0, 1]$ and where $(P_t)_\#$ denotes the pushforward by $P_t$. The McCann interpolation is the constant-speed geodesic between $\mu, \nu$ in the Wasserstein-2 space (McCann, 1997; Ambrosio et al., 2005) and coincides with the Wasserstein barycenter with weight $\boldsymbol{\lambda} = (1 - t, t)$ on $\mu, \nu$.
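As a concrete illustration of (3) for histograms on a shared grid (our own sketch, assuming POT's `ot.bregman.barycenter` interface), an entropy-regularized barycenter can be computed as follows; for weights $(1 - t, t)$ on two atoms this approximates the McCann interpolation between them.

```python
import numpy as np
import ot

# Two 1-D histograms (atoms) on a shared grid of 100 bins.
grid = np.arange(100, dtype=float)
D1 = np.exp(-0.5 * ((grid - 25) / 5) ** 2); D1 /= D1.sum()
D2 = np.exp(-0.5 * ((grid - 75) / 5) ** 2); D2 /= D2.sum()
A = np.vstack([D1, D2]).T            # histograms stacked as columns, shape (100, 2)

M = ot.dist(grid.reshape(-1, 1), grid.reshape(-1, 1))  # squared Euclidean ground cost
M /= M.max()                          # rescale for numerical stability of Sinkhorn

lam = np.array([0.5, 0.5])            # barycentric weights on the simplex
bary = ot.bregman.barycenter(A, M, reg=1e-2, weights=lam)
# bary is a bump centered near the midpoint of the grid, unlike the bimodal linear mixture.
```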
3 Geometric Sparse Regularization for Wasserstein Dictionary Learning

WDL (Schmitz et al., 2018) aims to find a dictionary of probability distributions $\mathcal{D} = \{D_j\}_{j=1}^m \subset \mathcal{P}(\mathbb{R}^d)$ such that observed data $\{\mu_i\}_{i=1}^n \subset \mathcal{P}(\mathbb{R}^d)$ can be represented as Wasserstein barycenters of the collection $\mathcal{D}$. The precise optimization problem is
$$\arg\min_{\substack{\mathcal{D} \in \mathcal{P}(\mathbb{R}^d)^m \\ \mathbf{\Lambda} \in (\Delta_m)^n}} \sum_{i=1}^n \mathcal{L}(\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda}_i), \mu_i), \qquad (4)$$
where the loss function $\mathcal{L}$ is typically taken to be $W_2^2$ and $\boldsymbol{\lambda}_i \in \Delta_m$ is a (column) vector of size $m$, corresponding to a row of $\mathbf{\Lambda}$. In other words, solving this problem finds the dictionary of probability distributions whose barycentric combinations best approximate each data point $\mu_i$. WDL was proposed in part as an alternative to geodesic principal component analysis in Wasserstein space (Boissard et al., 2015; Seguy and Cuturi, 2015; Bigot et al., 2017), and has proven highly effective in terms of producing meaningful atoms for representing probability distributions. However, it may yield non-sparse coefficients. Moreover, as we will establish, it is ill-posed both at the level of having non-unique coefficients for a fixed dictionary $\mathcal{D}$ and at the level of having multiple dictionaries that can reconstruct the data perfectly.
Thus, one might consider adding the $\ell_1$ regularizer to the WDL objective to induce desirable sparsity of the representation. However, this fails because all coefficients of a Wasserstein barycenter lie on $\Delta_m$, so that $\|\boldsymbol{\lambda}\|_1 = 1$ identically and the penalty is constant. Sparsity of coefficients on the simplex can be promoted with entropy, projections, and suitable use of the $\ell_2$ norm (Donoho et al., 1992; Shashanka et al., 2007; Larsson and Ugander, 2011; Kyrillidis et al., 2013; Li et al., 2020), but we focus instead on geometric regularization. An analogous problem has been studied in the linear setting in a situation where the weights are on $\Delta_m$ (Tankala et al., 2020). In that context, the geometric sparse regularizer
$$\sum_{j=1}^m \lambda_j \|\mathbf{y} - \mathbf{d}_j\|_2^2$$
(for an individual data point $\mathbf{y}$ with representation coefficient $\boldsymbol{\lambda}$) has been proven to promote sparsity by favoring local representations, namely reconstructing using nearby (with respect to Euclidean distances) dictionary atoms.
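To see why this favors local representations, the following small sketch (ours, purely illustrative) evaluates the Euclidean regularizer for a point near one atom: a one-hot coefficient vector on the nearby atom incurs a much smaller penalty than a dense coefficient vector of the same mass.

```python
import numpy as np

def euclidean_geometric_regularizer(y, D, lam):
    """sum_j lam_j * ||y - d_j||_2^2, with atoms d_j stored as rows of D."""
    sq_dists = np.sum((D - y) ** 2, axis=1)   # ||y - d_j||_2^2 for each atom
    return float(lam @ sq_dists)

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 10))              # m = 4 atoms in R^10
y = D[0] + 0.01 * rng.standard_normal(10)     # a data point close to the first atom
lam_local = np.array([1.0, 0.0, 0.0, 0.0])    # sparse, local representation
lam_dense = np.full(4, 0.25)                  # dense representation
print(euclidean_geometric_regularizer(y, D, lam_local))  # small penalty
print(euclidean_geometric_regularizer(y, D, lam_dense))  # much larger penalty
```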
We propose to regularize (4) with a novel Wasserstein geometric sparse regularizer:
$$\mathcal{R}(\mathcal{D}, \mathbf{\Lambda}) := \sum_{i=1}^n \sum_{j=1}^m (\boldsymbol{\lambda}_i)_j W_2^2(D_j, \mu_i). \qquad (5)$$
This yields a new, regularized objective:
$$\min_{\substack{\mathcal{D} \in \mathcal{P}(\mathbb{R}^d)^m \\ \mathbf{\Lambda} \in (\Delta_m)^n}} F(\mathcal{D}, \mathbf{\Lambda}, \{\mu_i\}_{i=1}^n) := \sum_{i=1}^n W_2^2(\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda}_i), \mu_i) + \rho \sum_{i=1}^n \sum_{j=1}^m (\boldsymbol{\lambda}_i)_j W_2^2(D_j, \mu_i), \qquad (6)$$
where ρ > 0 is a tuneable balance parameter.
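To make the two terms in (6) concrete, the following sketch (ours; it relies on the POT package's `ot.bregman.barycenter` and `ot.sinkhorn2` as entropic surrogates for the barycenter and for $W_2^2$, an implementation choice rather than the paper's released code) evaluates the regularized objective for fixed atoms, weights, and data. In practice the objective is minimized by differentiating through such Sinkhorn-based computations, as in Schmitz et al. (2018); that optimization loop is not shown here.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def regularized_wdl_objective(mus, atoms, Lam, M, rho, reg=1e-2):
    """Evaluate the objective in (6) for histograms on a common grid.

    mus   : (n, dim) array of observed histograms, one per row
    atoms : (m, dim) array of dictionary atoms, one per row
    Lam   : (n, m) array of barycentric weights, each row on the simplex
    M     : (dim, dim) ground cost matrix (rescaled, e.g. M / M.max(), for stability)
    rho   : balance parameter for the geometric sparse regularizer
    reg   : entropic regularization; Sinkhorn costs stand in for W_2^2
    """
    A = atoms.T                                     # POT expects histograms as columns
    total = 0.0
    for i, mu in enumerate(mus):
        bary = ot.bregman.barycenter(A, M, reg, weights=Lam[i])
        total += ot.sinkhorn2(bary, mu, M, reg)     # reconstruction term W_2^2(Bary, mu_i)
        for j in range(atoms.shape[0]):
            # geometric sparse regularizer: (lambda_i)_j * W_2^2(D_j, mu_i)
            total += rho * Lam[i, j] * ot.sinkhorn2(atoms[j], mu, M, reg)
    return float(total)
```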
Note that the geometric sparse regularizer (5) resembles the objective in the definition of the Wasserstein barycenter in (3). Crucially, in the unregularized formulation of the WDL problem, the distance from the atoms to the data points is only indirectly minimized, because the WDL formulation focuses solely on the reconstruction accuracy of the generated barycenters and not directly on how similar those atoms are to the data points. As such, the atoms may be arbitrarily far from the data provided that the reconstruction error remains low. As discussed below, this may lead to arbitrarily many solutions to the WDL problem (4), which in general is ill-posed.
Interpretations of $\mathcal{R}(\mathcal{D}, \mathbf{\Lambda})$: The regularization term $\mathcal{R}(\mathcal{D}, \mathbf{\Lambda})$ is analogous to Laplacian smoothing in Euclidean space (Cai et al., 2010; Dornaika and Weng, 2019) and can be interpreted as non-linear archetypal learning (Cutler and Breiman, 1994) in Wasserstein space. As previously mentioned, there is no notion of sparse coding in the unregularized WDL formulation. The geometric sparse regularizer promotes sparsity by penalizing the use of atoms that are far from the data to be represented and thus acts as a weighted $\ell_1$ norm on the coefficients (Tasissa et al., 2021).
Connection with Wasserstein K-means: In this problem (Domazakis et al., 2019; Verdinelli and Wasserman, 2019; Zhuang et al., 2022), given observed measures $\{\mu_i\}_{i=1}^n \subset \mathcal{P}(\mathbb{R}^d)$ we want to find "centers" $\mathcal{D} = \{D_j\}_{j=1}^m \subset \mathcal{P}(\mathbb{R}^d)$ solving the optimization problem
$$\min_{\mathbf{C}, \mathcal{D}} \sum_{i=1}^n \sum_{j=1}^m C_{ij} W_2^2(D_j, \mu_i),$$
where $\mathbf{C} \in \mathbb{R}^{n \times m}$ is such that $C_{ij} \in \{0, 1\}$ is a binary assignment variable satisfying $\sum_{j=1}^m C_{ij} = 1$ for all $i = 1, \dots, n$. Suppose $\mathcal{D} \in \mathcal{P}(\mathbb{R}^d)^m$ and $\mathbf{\Lambda} \in \mathbb{R}^{n \times m}$ are the