WASSERSTEIN ARCHETYPAL ANALYSIS
KATY CRAIG, BRAXTON OSTING, DONG WANG, AND YIMING XU
Abstract. Archetypal analysis is an unsupervised machine learning method that summarizes data
using a convex polytope. In its original formulation, for fixed k, the method finds a convex polytope
with k vertices, called archetype points, such that the polytope is contained in the convex hull of
the data and the mean squared Euclidean distance between the data and the polytope is minimal.
In the present work, we consider an alternative formulation of archetypal analysis based on
the Wasserstein metric, which we call Wasserstein archetypal analysis (WAA). In one dimension,
there exists a unique solution of WAA [7] and, in two dimensions, we prove existence of a solution,
as long as the data distribution is absolutely continuous with respect to Lebesgue measure. We
discuss obstacles to extending our result to higher dimensions and general data distributions. We
then introduce an appropriate regularization of the problem, via a Rényi entropy, which allows us to
obtain existence of solutions of the regularized problem for general data distributions, in arbitrary
dimensions. We prove a consistency result for the regularized problem, ensuring that if the data are
iid samples from a probability measure, then as the number of samples is increased, a subsequence
of the archetype points converges to the archetype points for the limiting data distribution, almost
surely. Finally, we develop and implement a gradient-based computational approach for the two-
dimensional problem, based on the semi-discrete formulation of the Wasserstein metric. Our analysis
is supported by detailed computational experiments.
1. Introduction
Given a probability measure µ ∈ P(R^d), archetypal analysis (AA) aims to find the convex
polytope Ω ⊂ R^d with k vertices that best approximates µ. As originally introduced by Cutler
and Breiman in 1994 [8], given data X = {x_i}_{i=1}^N ⊂ R^d and k ≥ d + 1, AA finds k vertices,
A = {a_ℓ}_{ℓ=1}^k ⊂ R^d, that belong to the convex hull of the data, for which the convex hull co(A)
explains the most variation of the dataset. In particular, AA can be framed in terms of the following
constrained optimization problem:
\[
\min_{A = \{a_\ell\}_{\ell=1}^k \subset \mathbb{R}^d} \left\{ \frac{1}{N} \sum_{i=1}^N d^2\big(x_i, \mathrm{co}(A)\big) \,:\, A \subset \mathrm{co}(X) \right\}, \tag{1.1}
\]
Department of Mathematics, University of California, Santa Barbara
Department of Mathematics, University of Utah, Salt Lake City
School of Science and Engineering & Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen, Guangdong, China
Corporate Model Risk, Wells Fargo
E-mail addresses: kcraig@math.ucsb.edu, osting@math.utah.edu, wangdong@cuhk.edu.cn,
yiming.xu@wellsfargo.com.
Date: October 27, 2022.
2020 Mathematics Subject Classification. 62H12, 62G07, 65K10, 49Q22.
Key words and phrases. Archetypal analysis; optimal transport; Wasserstein metric; unsupervised learning; multivariate data summarization.
K. Craig acknowledges partial support from NSF DMS 1811012, NSF DMS 2145900, and a Hellman Faculty
Fellowship. K. Craig also acknowledges the support from the Simons Center for Theory of Computing, at which part
of this work was completed. B. Osting acknowledges partial support from NSF DMS 17-52202. D. Wang acknowledges
partial support from NSFC 12101524 and the University Development Fund from The Chinese University of Hong
Kong, Shenzhen (UDF01001803).
arXiv:2210.14298v1 [stat.ML] 25 Oct 2022
where d²(·,·) denotes the squared Euclidean distance. The archetypes, A = {a_ℓ}_{ℓ=1}^k ⊂ R^d, may be
interpreted as exemplars of extreme points of the dataset, mixtures of which explain the general
characteristics of the associated distribution; see [5, 19, 24] for applications of AA in astronomy,
biology, and other fields.
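To make the optimization in (1.1) concrete, the following sketch solves a discretized version by alternating projected gradient descent. This is not the original Cutler–Breiman algorithm, and the function names (`archetypal_analysis`, `project_simplex`), the fixed step size, and the iteration count are our own illustrative choices: archetypes are parameterized as Z = BX with the rows of B in the probability simplex, which enforces A ⊂ co(X) by construction.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex
    (sort-based algorithm)."""
    u = np.sort(v, axis=1)[:, ::-1]              # rows sorted descending
    css = np.cumsum(u, axis=1) - 1.0
    ind = np.arange(1, v.shape[1] + 1)
    rho = (u - css / ind > 0).sum(axis=1)        # support size per row
    theta = css[np.arange(len(v)), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)

def archetypal_analysis(X, k, n_iter=500, lr=1e-2, seed=0):
    """Crude projected-gradient solver for a discretized (1.1): archetypes
    Z = B X (rows of B in the simplex, so Z lies in co(X)) and weights A
    (rows in the simplex) with X approximately equal to A Z."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    A = project_simplex(rng.random((N, k)))
    B = project_simplex(rng.random((k, N)))
    for _ in range(n_iter):
        Z = B @ X
        R = A @ Z - X                            # reconstruction residual
        A = project_simplex(A - lr * R @ Z.T)    # gradient step in A
        B = project_simplex(B - lr * A.T @ R @ X.T)  # gradient step in B
    return B @ X                                 # the k archetypes

# Toy data: corners and center of the unit square; archetypes stay in co(X).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
Z = archetypal_analysis(X, 3)
```

Because the rows of B are simplex-constrained, every returned archetype is a convex combination of data points, matching the constraint A ⊂ co(X) in (1.1).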
AA is closely related to other unsupervised learning methods, such as k-means, principal compo-
nent analysis, and nonnegative matrix factorization [19]. Under bounded support assumptions, the
consistency and convergence of AA were recently established in [20], laying the foundation for
applying AA to large-scale datasets [15]. In practice, however, it is often more appropriate to assume
that the distribution generating the data has finite moments but unbounded support, in which case
AA is not well-posed [20]. Also, due to the definition of the loss, AA is sensitive to outliers [10]. To
address both issues, the present work considers a different formulation of the AA problem, based
on the Wasserstein metric.
1.1. Main results. Let P_2(R^d) denote the space of Borel probability measures on R^d with finite
second moment, M_2(µ) := ∫_{R^d} |x|² dµ(x) < +∞. Given µ ∈ P_2(R^d) and a number of vertices
k ≥ d + 1, we seek to find the (nondegenerate) convex k-gon Ω ⊂ R^d that is closest to µ, in the
sense that the uniform probability measure on Ω is as close as possible to µ in the 2-Wasserstein
metric:
\[
\min_{\Omega \in \mathcal{S}_k} W_2(\mu, 1_\Omega/|\Omega|), \qquad \mathcal{S}_k = \{\Omega \subset \mathbb{R}^d : \Omega \text{ is a convex } k\text{-gon with nonempty interior}\}, \tag{WAA}
\]
where
\[
1_\Omega(x) = \begin{cases} 1 & \text{if } x \in \Omega, \\ 0 & \text{otherwise.} \end{cases}
\]
Here, we use the term k-gon to mean a bounded polytope with k vertices.
Note that we make a mild abuse of notation in the above problem formulation and throughout
this manuscript: if a measure ν ∈ P_2(R^d) is absolutely continuous with respect to d-dimensional
Lebesgue measure L^d, we will write ν ∈ P_{2,ac}(R^d) and use the same symbol to denote both the
measure and its density, dν(x) = ν(x) dL^d(x). In this way, we will use 1_Ω/|Ω| to denote the uniform
probability measure on Ω.
Informally, the 2-Wasserstein metric measures the distance between probability measures in
terms of the amount of effort it takes to rearrange one to look like the other. More precisely, given
µ, ν ∈ P_2(R^d), the 2-Wasserstein metric is defined by
\[
W_2^2(\mu, \nu) = \inf_{X \sim \mu,\, Y \sim \nu} \mathbb{E}\big[|X - Y|^2\big], \tag{1.2}
\]
where the expectation is taken with respect to a rich enough underlying probability space and the
infimum is taken over all couplings (X, Y) with marginals µ and ν. If µ ∈ P_{2,ac}(R^d), then there is a
unique optimal coupling and, furthermore, the coupling is deterministic: there exists a measurable
function T, which is unique µ-a.e., so that
\[
Y = T(X). \tag{1.3}
\]
(See work by Gigli for sharp conditions guaranteeing the existence of deterministic couplings [14].)
For further background on optimal transport and the 2-Wasserstein metric, we refer the reader to
one of the many excellent textbooks on the subject [1, 2, 11, 21, 25, 26].
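For intuition, in one dimension the optimal coupling in (1.2) is the monotone rearrangement: order statistics are paired with order statistics. The following sketch (the function name `w2_empirical_1d` is ours) computes W_2 between two equal-size empirical measures this way.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """2-Wasserstein distance between two equal-size empirical measures on R.

    In one dimension the optimal coupling is monotone, so W2^2 equals the
    mean squared difference of matched order statistics."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape
    return float(np.sqrt(np.mean((x - y) ** 2)))

# A translated copy of a sample: the monotone map is T(x) = x + 3,
# so W2 equals the translation distance.
x = np.array([0.0, 1.0, 2.0])
print(w2_empirical_1d(x, x + 3.0))  # → 3.0
```

The deterministic coupling Y = T(X) of (1.3) is realized here by the sort: the i-th smallest point of one sample is transported to the i-th smallest point of the other.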
Alternative generalizations of the classical AA problem, based on probabilistic interpretations,
can be found in [22], where data are assumed to be generated from a parametric model and the
corresponding archetypes are found using maximum likelihood estimation, resulting in a formulation
similar to (1.1). Nevertheless, the statistical assumptions in such approaches are often hard to
verify and may only be appropriate for certain datasets. For nonparametric approaches, aside from
the Euclidean and 2-Wasserstein metrics, other discrepancy measures such as the Kullback–Leibler
(KL) divergence may also be used. However, a KL divergence would treat outliers of a given mass
in the same way regardless of their spatial closeness, whereas a Wasserstein approach places a
larger focus on outliers that are farther from the bulk of the dataset. More generally, one could
also consider p-Wasserstein metrics, where larger choices of p would further penalize the distance
from the dataset. In this way, the choice of metric or statistical divergence is problem dependent
and encodes modeling assumptions about the structure of the dataset.
When d = 1, there is a closed-form solution for the 2-Wasserstein metric, and it is straightforward
to directly compute the unique minimizer of (WAA), as previously done in related work by Cuesta-
Albertos, Matrán, and Rodríguez-Rodríguez [7]. (See below for a discussion of this and other
related previous works.) Our first main result is that, when d = 2, a solution of (WAA) exists,
provided that µ is absolutely continuous. We summarize the results in the one- and two-dimensional
settings in the following theorem.
Theorem 1.1 (Existence of minimizer in one and two dimensions). Let d = 1 and k ≥ 2. Then
for all µ ∈ P_2(R), there exists a unique solution of (WAA), and this solution may be expressed
explicitly as Ω = [a, b] for
\[
a = 4C_0 - 6C_1, \qquad b = 6C_1 - 2C_0, \qquad C_0 = \int_{\mathbb{R}} x \, d\mu(x), \qquad C_1 = \int_{\mathbb{R}} x F(x) \, d\mu(x), \tag{1.4}
\]
where F is the cumulative distribution function (CDF) of µ.
Let d = 2 and k ≥ 3. Then for all µ ∈ P_{2,ac}(R^2), there exists a solution of (WAA).
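The closed-form solution (1.4) is easy to evaluate numerically. The sketch below (the function name `waa_interval_1d` is ours) computes [a, b] for an empirical measure, approximating F by the empirical CDF F(x_(i)) = i/n at the order statistics; as a sanity check, for the uniform measure on [0, 1] we have C_0 = 1/2 and C_1 = 1/3, so (1.4) recovers [a, b] = [0, 1].

```python
import numpy as np

def waa_interval_1d(samples):
    """Closed-form 1D solution of (WAA): the interval [a, b] whose uniform
    measure is W2-closest to the empirical measure of `samples`.

    Uses a = 4*C0 - 6*C1 and b = 6*C1 - 2*C0 from (1.4), with C0 = E[X]
    and C1 = E[X F(X)], where F is the empirical CDF."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    F = np.arange(1, n + 1) / n      # empirical CDF at the order statistics
    C0 = x.mean()
    C1 = np.mean(x * F)
    return 4 * C0 - 6 * C1, 6 * C1 - 2 * C0

# Dense grid approximating the uniform measure on [0, 1]:
# the output should be close to (0, 1), up to discretization error.
a, b = waa_interval_1d((np.arange(1000) + 0.5) / 1000)
```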
In the two-dimensional case, our proof of Theorem 1.1 relies on an explicit characterization of
the closure of {1_Ω/|Ω| : Ω ∈ S_k} in the narrow topology; see Proposition 3.1. The key difficulty
in proving a minimizer exists arises when considering the possibility that the minimizing sequence
converges to a limit point with lower-dimensional support. Note, in particular, that limit points
with lower-dimensional support are in general not uniformly distributed on their support; see
Figure 1. For this reason, not all limit points have a valid interpretation in terms of archetypal
analysis, in which the k vertices or archetypes of their support completely describe the measure via
their convex combinations. It is therefore essential that we rule out the possibility that a
limit point of this type achieves the infimum of W_2(µ, 1_Ω/|Ω|) over Ω ∈ S_k. We succeed in doing
this when µ ∈ P_{2,ac}(R^2) by adapting a perturbation argument of Cuesta-Albertos et al. [6] to
the case of convex polygons. In particular, we construct simple perturbations around degenerate
limit points that are both feasible for our constraint set and improve the value of the objective
function. It remains an open question how to extend this technique to µ which are not absolutely
continuous with respect to L^2; see Remark 3.2. Furthermore, our proof that the objective value
improves strongly leverages the structure of the 2-Wasserstein metric, and a
different approach would be needed to extend the result to p-Wasserstein metrics for p ≠ 2.
While this perturbative approach allows us to overcome the difficulty of degenerate limit points
in two dimensions, our approach fails in dimensions d ≥ 3, since in the higher-dimensional setting,
edges of polygons can cross in the limit, creating artificial vertices in the perturbations we consider,
so that they no longer belong to the constraint set; see Remark 3.3 and Figure 2. For this reason,
it remains an open problem whether minimizers of (WAA) exist in dimensions d ≥ 3.
Finally, in terms of uniqueness, we observe that, in dimensions higher than one, the minimizer
is clearly not unique: for example, if µ is the uniform distribution on the unit ball, any rotation of
an optimal polytope would also be optimal. Understanding whether uniqueness holds up to such
invariances remains an interesting open question.
As described above, our approach to proving existence of minimizers of (WAA) in two dimensions
proceeds by adapting a perturbation argument of Cuesta-Albertos et al. [6], which considered a
related problem: given µ ∈ P_{2,ac}(R^d), find the convex set Ω that minimizes W_2(µ, 1_Ω/|Ω|). This
built on previous work by the same authors, which also examined optimal W_2 approximation of
measures µ ∈ P(R) in one dimension by empirical measures, uniform measures on intervals, and ellipsoids [7]. The
Figure 1. In this figure, we illustrate how limit points of {1_Ω/|Ω| : Ω ∈ S_k} in the narrow topology
are not, in general, given by uniform measures on their support. For example, the top row of this
figure illustrates the limit of a sequence of triangles Ω_δ, where the base of the triangle remains fixed,
while the height δ of the triangle converges to zero. The second row of this figure shows the
limit of the corresponding probability measures. We see that the limiting measure ν_0 is supported
on the one-dimensional interval prescribed by the base of the triangle, but its density with respect
to L^1 is not constant but piecewise linear, concentrating more mass in the interior of the interval,
where the third vertex converged. In particular, we see that lim_{δ→0} 1_{Ω_δ}/|Ω_δ| = ν_0 ≠ 1_{Ω_0}/|Ω_0|.
motivation of that work was to describe the shape and flatness of a measure µ. Subsequent work
by Belili and Heinich [3] considered W_2 approximation of a measure µ over more general classes of
measures, including uniform measures on convex sets and uniform measures on sets of the form
Ω_1 \ Ω_2, where Ω_1, Ω_2 are convex. Furthermore, Belili and Heinich succeeded in proving existence of a
closest approximation under more general hypotheses on µ: instead of requiring µ ∈ P_{2,ac}(R^d), they
require that the affine hull of the support of µ is d-dimensional. However, an essential hypothesis in
the work of Belili and Heinich is that the class of approximating measures satisfies certain symmetry
properties [3, condition (3)]. For example, if ν_0 were the limiting measure shown in Figure 1, there
would have to exist a triangle Ω so that the first marginal of 1_Ω/|Ω| is ν_0 and its second marginal
is symmetric about the origin. Unfortunately, no such triangle exists. For this reason, while our
study of existence of solutions to (WAA) is closely related to the aforementioned works, the fact
that (WAA) constrains Ω to be a k-gon requires the development of new techniques.
Motivated by the fact that the key challenge in proving existence of minimizers to (WAA) arises
when the support of the minimizing sequence collapses to a lower-dimensional set, we now introduce
a regularized version of the problem that prevents this degeneracy. For m ≥ 1, the m-Rényi entropy
is given by
\[
U_m(\nu) = \begin{cases} \frac{1}{m-1} \int \nu(x)^m \, dx & \text{if } m > 1, \\[2pt] \int \nu(x) \log \nu(x) \, dx & \text{if } m = 1, \end{cases}
\]
when ν ≪ L^d with dν(x) = ν(x) dx, and U_m(ν) = +∞ otherwise. In particular, if ν = 1_Ω/|Ω| for Ω ∈ S_k, we have
\[
U_m(1_\Omega/|\Omega|) = \begin{cases} \frac{1}{(m-1)|\Omega|^{m-1}} & \text{if } m > 1, \\[2pt] -\log(|\Omega|) & \text{if } m = 1. \end{cases} \tag{1.5}
\]
For fixed ε > 0, k ≥ d + 1, m ≥ 1, and µ ∈ P_2(R^d), we consider the regularized problem:
\[
\min_{\Omega \in \mathcal{S}_k} W_2(\mu, 1_\Omega/|\Omega|) + \varepsilon \, U_m(1_\Omega/|\Omega|). \tag{\mathrm{WAA}_\varepsilon}
\]
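To illustrate the structure of the regularized objective, the sketch below evaluates it for a triangle (k = 3, d = 2) by crude Monte Carlo: the uniform measure on Ω is replaced by as many uniform samples as there are data points, and W_2 is estimated by an optimal assignment on squared distances. This is not the paper's semi-discrete method; the function names (`sample_triangle`, `waa_eps_objective`) and sample sizes are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_triangle(verts, n, rng):
    """Draw n points uniformly from the triangle whose rows are `verts`,
    via the standard square-root trick."""
    a, b, c = verts
    r1, r2 = rng.random(n), rng.random(n)
    s = np.sqrt(r1)
    return (1 - s)[:, None] * a + (s * (1 - r2))[:, None] * b + (s * r2)[:, None] * c

def waa_eps_objective(data, verts, eps, m, seed=0):
    """Monte-Carlo estimate of W2(mu_n, 1_Omega/|Omega|) + eps * U_m(1_Omega/|Omega|)
    for a triangle Omega, matching equal-size point clouds by optimal assignment."""
    rng = np.random.default_rng(seed)
    verts = np.asarray(verts, float)
    pts = sample_triangle(verts, len(data), rng)
    cost = ((data[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)        # optimal matching
    w2 = np.sqrt(cost[rows, cols].mean())           # empirical W2 estimate
    a, b, c = verts
    area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    entropy = 1.0 / ((m - 1) * area ** (m - 1)) if m > 1 else -np.log(area)  # (1.5)
    return w2 + eps * entropy
```

Shrinking the triangle can lower the W_2 term when the data are concentrated, but it blows up the entropy term; this trade-off is the degeneracy-prevention mechanism behind the regularization.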
Due to the fact that the regularization prevents the minimizer from collapsing to a lower dimensional
set, we are able to obtain existence of solutions in all dimensions, for general µ∈ P2(Rd).
Theorem 1.2 (Existence of minimizer for (WAAε), in all dimensions). Fix ε > 0, k ≥ d + 1,
m ≥ 1, and µ ∈ P_2(R^d). Then there exists a solution of (WAAε).
The proof of Theorem 1.2 follows via Prokhorov's theorem and the lower semicontinuity of the
Wasserstein metric and Rényi entropy in the narrow topology. Uniqueness again fails in dimensions
higher than one, due to the rotational invariance of the Rényi entropy.
Remark 1.3 (Limit as ε → 0). If d = 1 or 2 and µ ∈ P_{2,ac}(R^d), arguments similar to those in the proof of
Theorem 1.1 can be used to show that, for any ε_n → 0 as n → +∞, the solutions to (WAAε) with
ε = ε_n form a minimizing sequence for (WAA) and converge (up to a subsequence) to a solution
of (WAA). This shows consistency of the regularized problem with the original problem as the
regularization is removed.
Our final results consider the practical application of Wasserstein archetypal analysis to data. In
particular, we seek to understand how solutions of the regularized problem (WAAε) behave when
a measure µ ∈ P_2(R^d) is approximated by a sequence of empirical measures µ_n := (1/n) Σ_{i=1}^n δ_{X_i},
with X_i ∼ µ i.i.d. We show that, almost surely and up to a subsequence, optimizers Ω_n for the empirical
measure µ_n converge as n → ∞ to an optimizer Ω for µ. We also provide a convergence rate in terms
While more sophisticated rates can be obtained using sharper moment estimates or estimates on
the dimensionality of the support of µ [13, 27], we use the classical estimate for simplicity and to
preserve the focus on the archetypal analysis problem.
Theorem 1.4 (Consistency and convergence for (WAAε)). Fix ε > 0, k ≥ d + 1, m ≥ 1, and
µ ∈ P_2(R^d). Suppose
\[
\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}, \qquad X_i \overset{\mathrm{iid}}{\sim} \mu.
\]
For each n ∈ N, let Ω_n be a minimizer of (WAAε) for µ_n. Then the following hold:
(i) Almost surely, there exists Ω ∈ S_k so that, up to a subsequence,
\[
1_{\Omega_n}/|\Omega_n| \to 1_\Omega/|\Omega| \text{ narrowly}, \tag{1.6}
\]
and Ω solves (WAAε) for µ;
(ii) If β := ∫_{R^d} |x|^{d+5} dµ(x) < +∞, then
\[
\mathbb{E}\Big[ W_2\big(\mu, 1_{\Omega_n}/|\Omega_n|\big) + \varepsilon U_m(1_{\Omega_n}/|\Omega_n|) \Big] - \min_{\Omega \in \mathcal{S}_k} \Big\{ W_2\big(\mu, 1_\Omega/|\Omega|\big) + \varepsilon U_m(1_\Omega/|\Omega|) \Big\} \leq \frac{C(\beta, d)}{n^{1/(d+4)}},
\]
where C(β, d) is some constant depending on β and d.
In addition to ensuring consistency of WAA under empirical approximation, the previous theorem
is also relevant from the perspective of numerical methods. Our final main result is the
development of a numerical method for solving (WAA) and (WAAε), based on approximating a
given measure µ ∈ P_2(R^d) by a sequence of finitely supported measures µ_n = Σ_{i=1}^n m_i δ_{x_i} and
approximating the solution of (WAA) or (WAAε) for µ_n. Such an approximation greatly simplifies
the problem numerically, since the Wasserstein distance W_2(µ_n, 1_Ω/|Ω|) can now be computed using
the semidiscrete method introduced by Mérigot [18]. Based on this perspective, we introduce an
alternating gradient-based method to approximate the optimizer; see Section 5 and Algorithm 1.