Geometric Sparse Coding in Wasserstein Space

Marshall Mueller, Shuchin Aeron, James M. Murphy, Abiy Tasissa

October 24, 2022

arXiv:2210.12135v1 [cs.LG] 21 Oct 2022

Department of Mathematics, Tufts University, Medford, MA 02155, USA (marshall.mueller@tufts.edu, jm.murphy@tufts.edu, abiy.tasissa@tufts.edu)
Department of Electrical and Computer Engineering, Tufts University, Medford, MA 02155, USA (shuchin@ece.tufts.edu)
Corresponding authors, listed alphabetically
Abstract

Wasserstein dictionary learning is an unsupervised approach to learning a collection of probability distributions that generate observed distributions as Wasserstein barycentric combinations. Existing methods for Wasserstein dictionary learning optimize an objective that seeks a dictionary with sufficient representation capacity via barycentric interpolation to approximate the observed training data, but without imposing additional structural properties on the coefficients associated to the dictionary. This leads to dictionaries that densely represent the observed data, which makes interpretation of the coefficients challenging and may also lead to poor empirical performance when using the learned coefficients in downstream tasks. In contrast and motivated by sparse dictionary learning in Euclidean spaces, we propose a geometrically sparse regularizer for Wasserstein space that promotes representations of a data point using only nearby dictionary elements. We show this approach leads to sparse representations in Wasserstein space and addresses the problem of non-uniqueness of barycentric representation. Moreover, when data is generated as Wasserstein barycenters of fixed distributions, this regularizer facilitates the recovery of the generating distributions in cases that are ill-posed for unregularized Wasserstein dictionary learning. Through experimentation on synthetic and real data, we show that our geometrically regularized approach yields sparser and more interpretable dictionaries in Wasserstein space, which perform better in downstream applications.
1 Introduction
A central goal of statistical signal processing is the discovery of latent structures in complex data. Indeed, although data often resides in a high-dimensional ambient space, the
classical manifold hypothesis posits that in fact, the data can be well approximated by low-
dimensional manifolds or mixtures thereof, which circumvents the curse of dimensionality
that plagues high-dimensional statistics. Linear dimensionality reduction methods such
as principal component analysis (PCA) (Hotelling, 1933) and non-linear manifold learning
approaches that exploit local connectivity structure in the data (Schölkopf et al., 1997;
Tenenbaum et al., 2000; Roweis and Saul, 2000; Belkin and Niyogi, 2003; Coifman and
Lafon, 2006) rely on this assumption of intrinsically low-dimensional structure in high-
dimensional data to glean insights. These techniques typically output a low-dimensional
representation preserving local geometric structures such as pairwise distances or geodesic
distances with respect to a Riemannian metric.
An alternative perspective for efficiently representing complex data is the sparse coding
and dictionary learning paradigm (Olshausen and Field, 1996, 1997; Barlow et al., 1961;
Hromádka et al., 2008). In the simplest setting when data is considered as elements of $\mathbb{R}^d$ (or more generally a normed vector space), the aim of sparse coding is to represent data $\{\mathbf{y}_i\}_{i=1}^n \subset \mathbb{R}^d$, stacked as rows in the matrix $\mathbf{Y} \in \mathbb{R}^{n \times d}$, as a linear combination of vectors $\{\mathbf{d}_j\}_{j=1}^m$, stacked as a dictionary matrix $\mathbf{D} \in \mathbb{R}^{m \times d}$, such that $\mathbf{Y} \approx \mathbf{\Lambda}\mathbf{D}$ for some coefficients $\mathbf{\Lambda} \in \mathbb{R}^{n \times m}$, perhaps subject to constraints on $\mathbf{\Lambda}$. When the dictionary $\mathbf{D}$ is fixed, this reduces to an optimization over $\mathbf{\Lambda}$ (Mallat, 1999; Engan et al., 2000). More generally, $\mathbf{D}$ and $\mathbf{\Lambda}$ can be learned simultaneously with some additional constraints on the dictionary or coefficients (Lee and Seung, 1999; Aharon et al., 2006). This is typically formulated as an optimization problem
$$\arg\min_{\mathbf{D},\,\mathbf{\Lambda}} \; \mathcal{L}(\mathbf{Y}, \mathbf{\Lambda}\mathbf{D}) + \rho\, \mathcal{R}(\mathbf{D}, \mathbf{\Lambda}), \qquad (1)$$
for some loss function $\mathcal{L}$ (e.g., $\mathcal{L}(\mathbf{Y}, \mathbf{\Lambda}\mathbf{D}) = \|\mathbf{Y} - \mathbf{\Lambda}\mathbf{D}\|_F$) and regularization function $\mathcal{R}$ (e.g., $\mathcal{R}(\mathbf{D}, \mathbf{\Lambda}) = \|\mathbf{\Lambda}\|_1$) balanced by a parameter $\rho > 0$. The regularizers ensure well-posedness of the problem and improve interpretability and robustness. The problem (1) is the dictionary learning problem in $\mathbb{R}^d$.
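To fix ideas, the following minimal numpy sketch (our own illustration, not from any referenced implementation) sets up the matrices in this formulation and evaluates the objective in (1) with the Frobenius loss and an $\ell_1$ penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 100, 5, 20                 # n data points in R^d, m dictionary atoms
D = rng.standard_normal((m, d))      # dictionary matrix, atoms stored as rows
Lam = rng.random((n, m))             # coefficient matrix
Y = Lam @ D                          # data generated as linear combinations of the atoms

def objective(Y, D, Lam, rho):
    """Frobenius reconstruction loss plus an l1 penalty on the coefficients, as in (1)."""
    return np.linalg.norm(Y - Lam @ D, "fro") + rho * np.abs(Lam).sum()

print(objective(Y, D, Lam, rho=0.1))  # here the reconstruction term is exactly zero
```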
The imposed Euclidean structure is convenient computationally but limiting in practice, as many real data are better modeled as living in spaces with non-Euclidean geometry where instead $\mathbf{Y} \approx \mathcal{F}(\mathbf{D}, \mathbf{\Lambda})$ for some nonlinear reconstruction function $\mathcal{F}$ (Tuzel et al., 2006, 2007; Li et al., 2008; Guo et al., 2010; Harandi et al., 2013, 2015; Cherian and Sra, 2016; Yin et al., 2016; Maggioni et al., 2016; Liu et al., 2018; Schmitz et al., 2018; Tankala et al., 2020). Important questions in this setting are what notion of reconstruction should take the place of linear combination (i.e., $\mathcal{F}$), how reconstruction quality is assessed without the use of a global norm (i.e., $\mathcal{L}$), and what constraints are natural on the coefficients in the nonlinear space (i.e., $\mathcal{R}$).
This paper focuses on nonlinear sparse coding and dictionary learning for data that are modeled as probability distributions in Wasserstein space. This basic framework was pioneered by Schmitz et al. (2018), where the authors leverage the theory and algorithms of optimal transport to propose the Wasserstein dictionary learning (WDL) algorithm, whereby a data point (interpreted as a probability distribution or histogram in $\mathbb{R}^d$) is approximated as a Wasserstein barycenter (Agueh and Carlier, 2011) of the learned dictionary atoms. The resulting framework is focused on learning a dictionary that reconstructs well, but neglects other desirable aspects of a dictionary such as sparsity of the induced coefficients. Moreover, this classical scheme is ill-posed in two senses: for a fixed dictionary, unique coefficients are not assured, and there may be multiple dictionaries that enable perfect reconstruction of the observed data.
Summary of Contributions: We generalize the classical WDL algorithm (Schmitz et al., 2018) by incorporating a novel Wasserstein geometric sparse regularizer. Our regularizer encourages an observed data point to be reconstructed as a barycentric representation from nearby (in the sense of Wasserstein distances) dictionary atoms. As we vary the balance parameter for this regularizer, the proposed method interpolates between classical WDL (no regularization) and Wasserstein K-means (strong regularization). Unlike the original formulation, the proposed regularizer learns dictionary atoms with geometric similarity to the training data. Theoretically, we demonstrate the ability of the model to learn sparse coefficients and to overcome issues of non-uniqueness both at the level of learning coefficients for a fixed dictionary, and for the general WDL problem on an idealized generative model in Wasserstein space. Empirically, we provide evidence that our regularized dictionary learning scheme yields more interpretable and useful coefficients for downstream classification tasks; the code to reproduce all experiments in this paper will be released on the authors' GitHub.
Notations and Preliminaries: Lowercase and uppercase boldface letters denote (column) vectors and matrices, respectively. We generally use Greek letters to denote measures, with the exception that $\mathcal{D} = \{D_j\}_{j=1}^m$ denotes the dictionary when its elements are measures. We denote the Euclidean norm of a vector $\mathbf{x}$ as $\|\mathbf{x}\|_2$. Let
$$\Delta_m = \Big\{\mathbf{x} \in \mathbb{R}^m \;\Big|\; \sum_{i=1}^m x_i = 1,\ x_i \geq 0,\ i = 1, \dots, m\Big\}$$
denote the discrete probability simplex of dimension $m$. Softmax, as a change of variables, is defined as $\mathrm{CoV}(\mathbf{x}) := \exp(\mathbf{x})/\exp(\mathbf{x})^T \mathbb{1}_N$. Here we take the exponential to be an elementwise operation on the vector and use $\mathbb{1}_N$ to denote the ones vector of size $N$. When we write $\mathrm{CoV}(\mathbf{X})$ for some matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$, we take it to mean applying the change of variables to each row.
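For concreteness, a minimal numpy sketch of this row-wise change of variables (our own illustration, not taken from the paper's released code) is:

```python
import numpy as np

def cov(X):
    """Row-wise softmax change of variables: maps each row of X onto the simplex."""
    # Subtracting the row-wise max improves numerical stability and does not change the output.
    Z = np.exp(X - X.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# Unconstrained parameters for n = 3 data points and m = 4 atoms.
X = np.random.randn(3, 4)
Lambda = cov(X)
print(Lambda.sum(axis=1))  # each row sums to 1 and is entrywise nonnegative
```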
2 Background and Related Work
Classical Dictionary Learning: In equation (1), using $\mathcal{L}(\mathbf{Y}, \mathbf{\Lambda}\mathbf{D}) = \|\mathbf{Y} - \mathbf{\Lambda}\mathbf{D}\|_F$ and $\rho = 0$ yields an optimization problem with optimal dictionary and coefficients given by the $m$ singular components with largest singular values (Eckart and Young, 1936). To promote interpretable, sparse coefficients that still realize $\mathbf{Y} \approx \mathbf{\Lambda}\mathbf{D}$, the prototypical regularized dictionary learning problem is $\min_{\mathbf{D}, \mathbf{\Lambda}} \|\mathbf{Y} - \mathbf{\Lambda}\mathbf{D}\|_F^2 + \rho \|\mathbf{\Lambda}\|_1$, where $\|\mathbf{\Lambda}\|_1 = \sum_{i=1}^n \sum_{j=1}^m |\Lambda_{ij}|$ is a sparsity-promoting regularizer (Donoho, 2006; Elad, 2010).
Beyond enhancing interpretability, efficiency, and uniqueness of representation, sparse
representations improve generalization of efficient supervised learning (Mehta and Gray,
2013; Mairal et al., 2011). In the non-negative matrix factorization (NMF) paradigm,
non-negativity constraints are imposed on the atoms and coefficients, which increases
their interpretability and effectiveness in downstream applications (Lee and Seung, 1999,
2000; Berry et al., 2007).
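As a minimal sketch of this prototype (ours, not tied to any particular library), the coefficients for a fixed dictionary can be computed by iterative soft-thresholding (ISTA) applied to the $\ell_1$-regularized least-squares problem:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Entrywise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def sparse_code_ista(Y, D, rho, n_iter=200):
    """Minimize ||Y - Lam @ D||_F^2 + rho * ||Lam||_1 over Lam, with D fixed (ISTA)."""
    n, m = Y.shape[0], D.shape[0]
    Lam = np.zeros((n, m))
    # Step size from the Lipschitz constant of the smooth part, 2 * sigma_max(D)^2.
    step = 1.0 / (2.0 * np.linalg.norm(D, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * (Lam @ D - Y) @ D.T      # gradient of the squared Frobenius term
        Lam = soft_threshold(Lam - step * grad, step * rho)
    return Lam
```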
Optimal Transport: We provide basic background on optimal transport; for more general treatments and theory, see (Ambrosio et al., 2005; Villani, 2021; Santambrogio, 2015; Peyré et al., 2019). Let $\mathcal{P}(\mathbb{R}^d)$ be the space of probability measures in $\mathbb{R}^d$. Let $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$. Let
$$\Pi(\mu, \nu) = \{\gamma : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R} \mid \text{for all } A, B \text{ Borel},\ \gamma(A \times \mathbb{R}^d) = \mu(A),\ \gamma(\mathbb{R}^d \times B) = \nu(B)\}$$
be the set of joint distributions with marginals $\mu$ and $\nu$. The squared Wasserstein-2 distance is defined as:
$$W_2^2(\mu, \nu) := \min_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|\mathbf{x} - \mathbf{y}\|_2^2 \, d\pi(\mathbf{x}, \mathbf{y}). \qquad (2)$$
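For discrete measures supported on finitely many points, (2) is a linear program. The following minimal sketch (ours) assumes the POT (Python Optimal Transport) package's `ot.dist` and `ot.emd2` interface; any exact OT solver could be substituted.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 2))   # support points of mu
y = rng.standard_normal((60, 2))   # support points of nu
a = np.full(50, 1 / 50)            # uniform weights of mu
b = np.full(60, 1 / 60)            # uniform weights of nu

M = ot.dist(x, y)                  # pairwise squared Euclidean costs ||x - y||_2^2
w2_squared = ot.emd2(a, b, M)      # optimal transport cost, i.e. W_2^2(mu, nu)
print(w2_squared)
```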
Given measures $\{D_j\}_{j=1}^m \subset \mathcal{P}(\mathbb{R}^d)$ that have finite second moments, along with a vector $\boldsymbol{\lambda} \in \Delta_m$, the Wasserstein-2 barycenter (Agueh and Carlier, 2011) is defined as:
$$\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda}) := \arg\min_{\mu \in \mathcal{P}(\mathbb{R}^d)} \sum_{j=1}^m \lambda_j W_2^2(D_j, \mu). \qquad (3)$$
The measure $\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda})$ can be interpreted as a weighted average of the $\{D_j\}_{j=1}^m$, with the impact of $D_j$ proportional to $\lambda_j$. Wasserstein barycenters have proven useful in a range of applications, and are in a precise sense the "correct" way to combine measures, in that $\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda})$ preserves the geometric properties of $\mathcal{D} = \{D_j\}_{j=1}^m$ in a way that linear mixtures do not (Agueh and Carlier, 2011; Rabin et al., 2011; Cuturi and Doucet, 2014; Bonneel et al., 2016). Wasserstein barycenters are intimately connected to geodesics in Wasserstein space, in the following sense. For $\pi$ optimizing (2), the McCann interpolation of $\mu, \nu$ is $(P_t)_\# \pi$, where $P_t(\mathbf{x}, \mathbf{y}) = (1 - t)\mathbf{x} + t\mathbf{y}$ for $t \in [0, 1]$ and where $(P_t)_\#$ denotes the pushforward by $P_t$. The McCann interpolation is the constant-speed geodesic between $\mu, \nu$ in the Wasserstein-2 space (McCann, 1997; Ambrosio et al., 2005) and coincides with the Wasserstein barycenter with weight $\boldsymbol{\lambda} = (1 - t, t)$ on $\mu, \nu$.
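As a concrete illustration of (3) for histograms on a shared grid (our own sketch, assuming POT's `ot.bregman.barycenter` interface), an entropy-regularized barycenter can be computed as follows; for weights $(1 - t, t)$ on two atoms this approximates the McCann interpolation between them.

```python
import numpy as np
import ot

# Two 1-D histograms (atoms) on a shared grid of 100 bins.
grid = np.arange(100, dtype=float)
D1 = np.exp(-0.5 * ((grid - 25) / 5) ** 2); D1 /= D1.sum()
D2 = np.exp(-0.5 * ((grid - 75) / 5) ** 2); D2 /= D2.sum()
A = np.vstack([D1, D2]).T            # histograms stacked as columns, shape (100, 2)

M = ot.dist(grid.reshape(-1, 1), grid.reshape(-1, 1))  # squared Euclidean ground cost
M /= M.max()                          # rescale for numerical stability of Sinkhorn

lam = np.array([0.5, 0.5])            # barycentric weights on the simplex
bary = ot.bregman.barycenter(A, M, reg=1e-2, weights=lam)
# bary is a bump centered near the midpoint of the grid, unlike the bimodal linear mixture.
```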
3 Geometric Sparse Regularization for Wasserstein Dictionary Learning

WDL (Schmitz et al., 2018) aims to find a dictionary of probability distributions $\mathcal{D} = \{D_j\}_{j=1}^m \subset \mathcal{P}(\mathbb{R}^d)$ such that observed data $\{\mu_i\}_{i=1}^n \subset \mathcal{P}(\mathbb{R}^d)$ can be represented as Wasserstein barycenters of the collection $\mathcal{D}$. The precise optimization problem is
$$\arg\min_{\substack{\mathcal{D} \in \mathcal{P}(\mathbb{R}^d)^m \\ \mathbf{\Lambda} \in (\Delta_m)^n}} \sum_{i=1}^n \mathcal{L}(\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda}_i), \mu_i), \qquad (4)$$
where the loss function $\mathcal{L}$ is typically taken to be $W_2^2$ and $\boldsymbol{\lambda}_i \in \Delta_m$ is a (column) vector of size $m$, corresponding to a row of $\mathbf{\Lambda}$. In other words, solving this problem finds the dictionary of probability distributions whose barycentric combinations best approximate each data point $\mu_i$. WDL was proposed in part as an alternative to geodesic principal component analysis in Wasserstein space (Boissard et al., 2015; Seguy and Cuturi, 2015; Bigot et al., 2017), and has proven highly effective in terms of producing meaningful atoms for representing probability distributions. However, it may yield non-sparse coefficients. Moreover, as we will establish, it is ill-posed both at the level of having non-unique coefficients for a fixed dictionary $\mathcal{D}$ and at the level of having multiple dictionaries that can reconstruct the data perfectly.
Thus, one might consider adding the $\ell_1$ regularizer to the WDL objective to induce desirable sparsity of the representation. However, this fails because all coefficients of a Wasserstein barycenter lie on $\Delta_m$, so that $\|\boldsymbol{\lambda}\|_1 = 1$ identically and the penalty is constant. Sparsity of coefficients on the simplex can be promoted with entropy, projections, and suitable use of the $\ell_2$ norm (Donoho et al., 1992; Shashanka et al., 2007; Larsson and Ugander, 2011; Kyrillidis et al., 2013; Li et al., 2020), but we focus instead on geometric regularization. An analogous problem has been studied in the linear setting in a situation where the weights are on $\Delta_m$ (Tankala et al., 2020). In that context, the geometric sparse regularizer
$$\sum_{j=1}^m \lambda_j \|\mathbf{y} - \mathbf{d}_j\|_2^2$$
(for an individual data point $\mathbf{y}$ with representation coefficient $\boldsymbol{\lambda}$) has been proven to promote sparsity by favoring local representations, namely reconstructing using nearby (with respect to Euclidean distances) dictionary atoms.
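To see why this favors local representations, the following small sketch (ours, purely illustrative) evaluates the Euclidean regularizer for a point near one atom: a one-hot coefficient vector on the nearby atom incurs a much smaller penalty than a dense coefficient vector of the same mass.

```python
import numpy as np

def euclidean_geometric_regularizer(y, D, lam):
    """sum_j lam_j * ||y - d_j||_2^2, with atoms d_j stored as rows of D."""
    sq_dists = np.sum((D - y) ** 2, axis=1)   # ||y - d_j||_2^2 for each atom
    return float(lam @ sq_dists)

rng = np.random.default_rng(0)
D = rng.standard_normal((4, 10))              # m = 4 atoms in R^10
y = D[0] + 0.01 * rng.standard_normal(10)     # a data point close to the first atom
lam_local = np.array([1.0, 0.0, 0.0, 0.0])    # sparse, local representation
lam_dense = np.full(4, 0.25)                  # dense representation
print(euclidean_geometric_regularizer(y, D, lam_local))  # small penalty
print(euclidean_geometric_regularizer(y, D, lam_dense))  # much larger penalty
```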
We propose to regularize (4) with a novel Wasserstein geometric sparse regularizer:
$$\mathcal{R}(\mathcal{D}, \mathbf{\Lambda}) := \sum_{i=1}^n \sum_{j=1}^m (\boldsymbol{\lambda}_i)_j W_2^2(D_j, \mu_i). \qquad (5)$$
This yields a new, regularized objective:
$$\min_{\substack{\mathcal{D} \in \mathcal{P}(\mathbb{R}^d)^m \\ \mathbf{\Lambda} \in (\Delta_m)^n}} F(\mathcal{D}, \mathbf{\Lambda}, \{\mu_i\}_{i=1}^n) := \sum_{i=1}^n W_2^2(\mathrm{Bary}(\mathcal{D}, \boldsymbol{\lambda}_i), \mu_i) + \rho \sum_{i=1}^n \sum_{j=1}^m (\boldsymbol{\lambda}_i)_j W_2^2(D_j, \mu_i), \qquad (6)$$
where ρ > 0 is a tuneable balance parameter.
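To make the two terms in (6) concrete, the following sketch (ours; it relies on the POT package's `ot.bregman.barycenter` and `ot.sinkhorn2` as entropic surrogates for the barycenter and for $W_2^2$, an implementation choice rather than the paper's released code) evaluates the regularized objective for fixed atoms, weights, and data. In practice the objective is minimized by differentiating through such Sinkhorn-based computations, as in Schmitz et al. (2018); that optimization loop is not shown here.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def regularized_wdl_objective(mus, atoms, Lam, M, rho, reg=1e-2):
    """Evaluate the objective in (6) for histograms on a common grid.

    mus   : (n, dim) array of observed histograms, one per row
    atoms : (m, dim) array of dictionary atoms, one per row
    Lam   : (n, m) array of barycentric weights, each row on the simplex
    M     : (dim, dim) ground cost matrix (rescaled, e.g. M / M.max(), for stability)
    rho   : balance parameter for the geometric sparse regularizer
    reg   : entropic regularization; Sinkhorn costs stand in for W_2^2
    """
    A = atoms.T                                     # POT expects histograms as columns
    total = 0.0
    for i, mu in enumerate(mus):
        bary = ot.bregman.barycenter(A, M, reg, weights=Lam[i])
        total += ot.sinkhorn2(bary, mu, M, reg)     # reconstruction term W_2^2(Bary, mu_i)
        for j in range(atoms.shape[0]):
            # geometric sparse regularizer: (lambda_i)_j * W_2^2(D_j, mu_i)
            total += rho * Lam[i, j] * ot.sinkhorn2(atoms[j], mu, M, reg)
    return float(total)
```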
Note that the geometric sparse regularizer (5) resembles the objective in the definition of the Wasserstein barycenter in (3). Crucially, in the unregularized formulation of the WDL problem, the distance from the atoms to the data points is only indirectly minimized, because the WDL formulation focuses solely on the reconstruction accuracy of the generated barycenters and not directly on how similar those atoms are to the data points. As such, the atoms may be arbitrarily far from the data provided that the reconstruction error remains low. As discussed below, this may lead to arbitrarily many solutions to the WDL problem (4), which in general is ill-posed.
Interpretations of $\mathcal{R}(\mathcal{D}, \mathbf{\Lambda})$: The regularization term $\mathcal{R}(\mathcal{D}, \mathbf{\Lambda})$ is analogous to Laplacian smoothing in Euclidean space (Cai et al., 2010; Dornaika and Weng, 2019) and can be interpreted as non-linear archetypal learning (Cutler and Breiman, 1994) in Wasserstein space. As previously mentioned, there is no notion of sparse coding in the unregularized WDL formulation. The geometric sparse regularizer promotes sparsity by penalizing the use of atoms that are far from the data to be represented and thus acts as a weighted $\ell_1$ norm on the coefficients (Tasissa et al., 2021).
Connection with Wasserstein K-means: In this problem (Domazakis et al., 2019; Verdinelli and Wasserman, 2019; Zhuang et al., 2022), given observed measures $\{\mu_i\}_{i=1}^n \subset \mathcal{P}(\mathbb{R}^d)$ we want to find "centers" $\mathcal{D} = \{D_j\}_{j=1}^m \subset \mathcal{P}(\mathbb{R}^d)$ solving the optimization problem
$$\min_{\mathbf{C}, \mathcal{D}} \sum_{i=1}^n \sum_{j=1}^m C_{ij} W_2^2(D_j, \mu_i),$$
where $\mathbf{C} \in \mathbb{R}^{n \times m}$ is such that $C_{ij} \in \{0, 1\}$ is a binary assignment variable satisfying $\sum_{j=1}^m C_{ij} = 1$ for all $i = 1, \dots, n$. Suppose $\mathcal{D} \in \mathcal{P}(\mathbb{R}^d)^m$ and $\mathbf{\Lambda} \in \mathbb{R}^{n \times m}$ are the