WASSERSTEIN ARCHETYPAL ANALYSIS
KATY CRAIG, BRAXTON OSTING, DONG WANG, AND YIMING XU
Abstract. Archetypal analysis is an unsupervised machine learning method that summarizes data
using a convex polytope. In its original formulation, for fixed k, the method finds a convex polytope
with k vertices, called archetype points, such that the polytope is contained in the convex hull of
the data and the mean squared Euclidean distance between the data and the polytope is minimal.
In the present work, we consider an alternative formulation of archetypal analysis based on
the Wasserstein metric, which we call Wasserstein archetypal analysis (WAA). In one dimension,
there exists a unique solution of WAA [7] and, in two dimensions, we prove existence of a solution,
as long as the data distribution is absolutely continuous with respect to Lebesgue measure. We
discuss obstacles to extending our result to higher dimensions and general data distributions. We
then introduce an appropriate regularization of the problem, via a Rényi entropy, which allows us to
obtain existence of solutions of the regularized problem for general data distributions, in arbitrary
dimensions. We prove a consistency result for the regularized problem, ensuring that if the data are
iid samples from a probability measure, then as the number of samples is increased, a subsequence
of the archetype points converges to the archetype points for the limiting data distribution, almost
surely. Finally, we develop and implement a gradient-based computational approach for the two-
dimensional problem, based on the semi-discrete formulation of the Wasserstein metric. Our analysis
is supported by detailed computational experiments.
1. Introduction
Given a probability measure µ ∈ P(R^d), archetypal analysis (AA) aims to find the convex
polytope Ω ⊂ R^d with k vertices that best approximates µ. As originally introduced by Cutler
and Breiman in 1994 [8], given data X = {x_i}_{i=1}^N ⊂ R^d and k ≥ d + 1, AA finds k vertices,
A = {a_ℓ}_{ℓ=1}^k ⊂ R^d, that belong to the convex hull of the data, for which the convex hull co(A)
explains the most variation of the dataset. In particular, AA can be framed in terms of the following
constrained optimization problem:
\[
\min_{A = \{a_\ell\}_{\ell=1}^k \subset \mathbb{R}^d} \left\{ \frac{1}{N} \sum_{i=1}^N d^2\big(x_i, \mathrm{co}(A)\big) \,:\, A \subset \mathrm{co}(X) \right\}, \tag{1.1}
\]
Department of Mathematics, University of California, Santa Barbara
Department of Mathematics, University of Utah, Salt Lake City
School of Science and Engineering & Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen, Guangdong, China
Corporate Model Risk, Wells Fargo
E-mail addresses: kcraig@math.ucsb.edu, osting@math.utah.edu, wangdong@cuhk.edu.cn,
yiming.xu@wellsfargo.com.
Date: October 27, 2022.
2020 Mathematics Subject Classification. 62H12, 62G07, 65K10, 49Q22.
Key words and phrases. Archetypal analysis; optimal transport; Wasserstein metric; unsupervised learning; multivariate data summarization.
K. Craig acknowledges partial support from NSF DMS 1811012, NSF DMS 2145900, and a Hellman Faculty
Fellowship. K. Craig also acknowledges the support from the Simons Center for Theory of Computing, at which part
of this work was completed. B. Osting acknowledges partial support from NSF DMS 17-52202. D. Wang acknowledges
partial support from NSFC 12101524 and the University Development Fund from The Chinese University of Hong
Kong, Shenzhen (UDF01001803).
arXiv:2210.14298v1 [stat.ML] 25 Oct 2022
where d²(·,·) denotes the squared Euclidean distance. The archetypes, A = {a_ℓ}_{ℓ=1}^k ⊂ R^d, may be
interpreted as exemplars of extreme points of the dataset, mixtures of which explain the general
characteristics of the associated distribution; see [5, 19, 24] for applications of AA in astronomy,
biology, and other fields.
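To make the optimization in (1.1) concrete, the following sketch solves a discretized version by alternating projected gradient descent. This is not the original Cutler–Breiman algorithm, and the function names (`archetypal_analysis`, `project_simplex`), the fixed step size, and the iteration count are our own illustrative choices: archetypes are parameterized as Z = BX with the rows of B in the probability simplex, which enforces A ⊂ co(X) by construction.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex
    (sort-based algorithm)."""
    u = np.sort(v, axis=1)[:, ::-1]              # rows sorted descending
    css = np.cumsum(u, axis=1) - 1.0
    ind = np.arange(1, v.shape[1] + 1)
    rho = (u - css / ind > 0).sum(axis=1)        # support size per row
    theta = css[np.arange(len(v)), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)

def archetypal_analysis(X, k, n_iter=500, lr=1e-2, seed=0):
    """Crude projected-gradient solver for a discretized (1.1): archetypes
    Z = B X (rows of B in the simplex, so Z lies in co(X)) and weights A
    (rows in the simplex) with X approximately equal to A Z."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    A = project_simplex(rng.random((N, k)))
    B = project_simplex(rng.random((k, N)))
    for _ in range(n_iter):
        Z = B @ X
        R = A @ Z - X                            # reconstruction residual
        A = project_simplex(A - lr * R @ Z.T)    # gradient step in A
        B = project_simplex(B - lr * A.T @ R @ X.T)  # gradient step in B
    return B @ X                                 # the k archetypes

# Toy data: corners and center of the unit square; archetypes stay in co(X).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
Z = archetypal_analysis(X, 3)
```

Because the rows of B are simplex-constrained, every returned archetype is a convex combination of data points, matching the constraint A ⊂ co(X) in (1.1).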
AA is closely related to other unsupervised learning methods, such as k-means, principal compo-
nent analysis, and nonnegative matrix factorization [19]. Under bounded support assumptions, the
consistency and convergence of AA were recently established in [20], laying the foundation for
applying AA to large-scale datasets [15]. In practice, however, it is often more appropriate to assume
that the distribution generating the data has finite moments but unbounded support, in which case
AA is not well-posed [20]. Also, due to the definition of the loss, AA is sensitive to outliers [10]. To
address both issues, the present work considers a different formulation of the AA problem, based
on the Wasserstein metric.
1.1. Main results. Let P_2(R^d) denote the space of Borel probability measures on R^d with finite
second moment, M_2(µ) := ∫_{R^d} |x|² dµ(x) < +∞. Given µ ∈ P_2(R^d) and a number of vertices
k ≥ d + 1, we seek to find the (nondegenerate) convex k-gon Ω ⊂ R^d that is closest to µ, in the
sense that the uniform probability measure on Ω is as close as possible to µ in the 2-Wasserstein
metric:
\[
\min_{\Omega \in \mathcal{S}_k} W_2(\mu, 1_\Omega/|\Omega|), \qquad \mathcal{S}_k = \{\Omega \subset \mathbb{R}^d : \Omega \text{ is a convex } k\text{-gon with nonempty interior}\}, \tag{WAA}
\]
where
\[
1_\Omega(x) = \begin{cases} 1 & \text{if } x \in \Omega, \\ 0 & \text{otherwise.} \end{cases}
\]
Here, we use the term k-gon to mean a bounded polytope with k vertices.
Note that we make a mild abuse of notation in the above problem formulation and throughout
this manuscript: if a measure ν ∈ P_2(R^d) is absolutely continuous with respect to d-dimensional
Lebesgue measure L^d, we will write ν ∈ P_{2,ac}(R^d) and use the same symbol to denote both the
measure and its density, dν(x) = ν(x) dL^d(x). In this way, we will use 1_Ω/|Ω| to denote the uniform
probability measure on Ω.
Informally, the 2-Wasserstein metric measures the distance between probability measures in
terms of the amount of effort it takes to rearrange one to look like the other. More precisely, given
µ, ν ∈ P_2(R^d), the 2-Wasserstein metric is defined by
\[
W_2^2(\mu, \nu) = \inf_{X \sim \mu,\, Y \sim \nu} \mathbb{E}\big[|X - Y|^2\big], \tag{1.2}
\]
where the expectation is taken with respect to a rich enough underlying probability space and the
infimum is taken over all couplings (X, Y) with marginals µ and ν. If µ ∈ P_{2,ac}(R^d), then there is a
unique optimal coupling and, furthermore, the coupling is deterministic: there exists a measurable
function T, which is unique µ-a.e., so that
\[
Y = T(X). \tag{1.3}
\]
(See work by Gigli for sharp conditions guaranteeing the existence of deterministic couplings [14].)
For further background on optimal transport and the 2-Wasserstein metric, we refer the reader to
one of the many excellent textbooks on the subject [1, 2, 11, 21, 25, 26].
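For intuition, in one dimension the optimal coupling in (1.2) is the monotone rearrangement: order statistics are paired with order statistics. The following sketch (the function name `w2_empirical_1d` is ours) computes W_2 between two equal-size empirical measures this way.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """2-Wasserstein distance between two equal-size empirical measures on R.

    In one dimension the optimal coupling is monotone, so W2^2 equals the
    mean squared difference of matched order statistics."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape
    return float(np.sqrt(np.mean((x - y) ** 2)))

# A translated copy of a sample: the monotone map is T(x) = x + 3,
# so W2 equals the translation distance.
x = np.array([0.0, 1.0, 2.0])
print(w2_empirical_1d(x, x + 3.0))  # → 3.0
```

The deterministic coupling Y = T(X) of (1.3) is realized here by the sort: the i-th smallest point of one sample is transported to the i-th smallest point of the other.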
Alternative generalizations of the classical AA problem, based on probabilistic interpretations,
can be found in [22], where data are assumed to be generated from a parametric model and the
corresponding archetypes are found using maximum likelihood estimation, resulting in a formulation
similar to (1.1). Nevertheless, the statistical assumptions in such approaches are often hard to
verify and may only be appropriate for certain datasets. For nonparametric approaches, aside from
the Euclidean and 2-Wasserstein metrics, other discrepancy measures such as the Kullback–Leibler
(KL) divergence may also be used. However, a KL divergence would treat outliers of a given mass
in the same way regardless of their spatial closeness, whereas a Wasserstein approach places a
larger focus on outliers that are farther from the bulk of the dataset. More generally, one could
also consider p-Wasserstein metrics, where larger choices of p would further penalize the distance
from the dataset. In this way, the choice of metric or statistical divergence is problem dependent
and encodes modeling assumptions about the structure of the dataset.
When d = 1, there is a closed-form solution for the 2-Wasserstein metric, and it is straightforward
to directly compute the unique minimizer of (WAA), as previously done in related work by Cuesta-
Albertos, Matrán, and Rodríguez-Rodríguez [7]. (See below for a discussion of this and other
related previous works.) Our first main result is that, when d = 2, a solution of (WAA) exists,
provided that µ is absolutely continuous. We summarize the results in the one- and two-dimensional
settings in the following theorem.
Theorem 1.1 (Existence of minimizer in one and two dimensions). Let d = 1 and k ≥ 2. Then
for all µ ∈ P_2(R), there exists a unique solution of (WAA), and this solution may be expressed
explicitly as Ω = [a, b] for
\[
a = 4C_0 - 6C_1, \qquad b = 6C_1 - 2C_0, \qquad C_0 = \int_{\mathbb{R}} x \, d\mu(x), \qquad C_1 = \int_{\mathbb{R}} x F(x) \, d\mu(x), \tag{1.4}
\]
where F is the cumulative distribution function (CDF) of µ.
Let d = 2 and k ≥ 3. Then for all µ ∈ P_{2,ac}(R^2), there exists a solution of (WAA).
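The closed-form solution (1.4) is easy to evaluate numerically. The sketch below (the function name `waa_interval_1d` is ours) computes [a, b] for an empirical measure, approximating F by the empirical CDF F(x_(i)) = i/n at the order statistics; as a sanity check, for the uniform measure on [0, 1] we have C_0 = 1/2 and C_1 = 1/3, so (1.4) recovers [a, b] = [0, 1].

```python
import numpy as np

def waa_interval_1d(samples):
    """Closed-form 1D solution of (WAA): the interval [a, b] whose uniform
    measure is W2-closest to the empirical measure of `samples`.

    Uses a = 4*C0 - 6*C1 and b = 6*C1 - 2*C0 from (1.4), with C0 = E[X]
    and C1 = E[X F(X)], where F is the empirical CDF."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    F = np.arange(1, n + 1) / n      # empirical CDF at the order statistics
    C0 = x.mean()
    C1 = np.mean(x * F)
    return 4 * C0 - 6 * C1, 6 * C1 - 2 * C0

# Dense grid approximating the uniform measure on [0, 1]:
# the output should be close to (0, 1), up to discretization error.
a, b = waa_interval_1d((np.arange(1000) + 0.5) / 1000)
```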
In the two-dimensional case, our proof of Theorem 1.1 relies on an explicit characterization of
the closure of {1_Ω/|Ω| : Ω ∈ S_k} in the narrow topology; see Proposition 3.1. The key difficulty
in proving a minimizer exists arises when considering the possibility that the minimizing sequence
converges to a limit point with lower-dimensional support. Note, in particular, that limit points
with lower-dimensional support are in general not uniformly distributed on their support; see
Figure 1. For this reason, not all limit points have a valid interpretation in terms of archetypal
analysis, in which the k vertices or archetypes of their support completely describe the measure via
their convex combinations. It is therefore essential that we rule out the possibility that a
limit point of this type achieves the infimum of W_2(µ, 1_Ω/|Ω|) over Ω ∈ S_k. We succeed in doing
this when µ ∈ P_{2,ac}(R^2) by adapting a perturbation argument of Cuesta-Albertos et al. [6] to
the case of convex polygons. In particular, we construct simple perturbations around degenerate
limit points that are both feasible for our constraint set and improve the value of the objective
function. It remains an open question how to extend this technique to µ which are not absolutely
continuous with respect to L^2; see Remark 3.2. Furthermore, our proof that the objective value
improves strongly leverages the structure of the 2-Wasserstein metric, and a
different approach would be needed to extend the result to p-Wasserstein metrics for p ≠ 2.
While this perturbative approach allows us to overcome the difficulty of degenerate limit points
in two dimensions, our approach fails in dimensions d ≥ 3, since in the higher-dimensional setting,
edges of polygons can cross in the limit, creating artificial vertices in the perturbations we consider,
so that they no longer belong to the constraint set; see Remark 3.3 and Figure 2. For this reason,
it remains an open problem whether minimizers of (WAA) exist in dimensions d ≥ 3.
Finally, in terms of uniqueness, we observe that, in dimensions higher than one, the minimizer
is clearly not unique: for example, if µ is the uniform distribution on the unit ball, any rotation of
an optimal polytope would also be optimal. Understanding whether uniqueness holds up to such
invariances remains an interesting open question.
As described above, our approach to proving existence of minimizers of (WAA) in two dimensions
proceeds by adapting a perturbation argument of Cuesta-Albertos et al. [6], which considered a
related problem: given µ ∈ P_{2,ac}(R^d), find the convex set Ω that minimizes W_2(µ, 1_Ω/|Ω|). This
built on previous work by the same authors, which also examined optimal W_2 approximation of
measures µ ∈ P(R) in one dimension by empirical measures, uniform measures on intervals, and ellipsoids [7]. The
Figure 1. In this figure, we illustrate how limit points of {1_Ω/|Ω| : Ω ∈ S_k} in the narrow topology
are not, in general, given by uniform measures on their support. For example, the top row of this
figure illustrates the limit of a sequence of triangles Ω_δ, where the base of the triangle remains fixed,
while the height δ of the triangle converges to zero. The second row of this figure shows the
limit of the corresponding probability measures. We see that the limiting measure ν_0 is supported
on the one-dimensional interval prescribed by the base of the triangle, but its density with respect
to L^1 is not constant but piecewise linear, concentrating more mass in the interior of the interval,
where the third vertex converged. In particular, we see that lim_{δ→0} 1_{Ω_δ}/|Ω_δ| = ν_0 ≠ 1_{Ω_0}/|Ω_0|.
motivation of that work was to describe the shape and flatness of a measure µ. Subsequent work
by Belili and Heinich [3] considered W_2 approximation of a measure µ over more general classes of
measures, including uniform measures on convex sets and uniform measures on sets of the form
Ω_1 \ Ω_2, where Ω_1, Ω_2 are convex. Furthermore, Belili and Heinich succeeded in proving existence of a
closest approximation under more general hypotheses on µ: instead of requiring µ ∈ P_{2,ac}(R^d), they
require that the affine hull of the support of µ is d-dimensional. However, an essential hypothesis in
the work of Belili and Heinich is that the class of approximating measures satisfies certain symmetry
properties [3, condition (3)]. For example, if ν_0 were the limiting measure shown in Figure 1, there
would have to exist a triangle Ω so that the first marginal of 1_Ω/|Ω| is ν_0 and its second marginal
is symmetric about the origin. Unfortunately, no such triangle exists. For this reason, while our
study of existence of solutions to (WAA) is closely related to the aforementioned works, the fact
that (WAA) constrains Ω to be a k-gon requires the development of new techniques.
Motivated by the fact that the key challenge in proving existence of minimizers to (WAA) arises
when the support of the minimizing sequence collapses to a lower-dimensional set, we now introduce
a regularized version of the problem that prevents this degeneracy. For m ≥ 1, the m-Rényi entropy
is given by
\[
U_m(\nu) = \begin{cases} \frac{1}{m-1} \int \nu(x)^m \, dx & \text{if } m > 1, \\[2pt] \int \nu(x) \log \nu(x) \, dx & \text{if } m = 1, \end{cases}
\]
when ν ≪ L^d with dν(x) = ν(x) dx, and U_m(ν) = +∞ otherwise. In particular, if ν = 1_Ω/|Ω| for Ω ∈ S_k, we have
\[
U_m(1_\Omega/|\Omega|) = \begin{cases} \frac{1}{(m-1)|\Omega|^{m-1}} & \text{if } m > 1, \\[2pt] -\log(|\Omega|) & \text{if } m = 1. \end{cases} \tag{1.5}
\]
For fixed ε > 0, k ≥ d + 1, m ≥ 1, and µ ∈ P_2(R^d), we consider the regularized problem:
\[
\min_{\Omega \in \mathcal{S}_k} W_2(\mu, 1_\Omega/|\Omega|) + \varepsilon \, U_m(1_\Omega/|\Omega|). \tag{\mathrm{WAA}_\varepsilon}
\]
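To illustrate the structure of the regularized objective, the sketch below evaluates it for a triangle (k = 3, d = 2) by crude Monte Carlo: the uniform measure on Ω is replaced by as many uniform samples as there are data points, and W_2 is estimated by an optimal assignment on squared distances. This is not the paper's semi-discrete method; the function names (`sample_triangle`, `waa_eps_objective`) and sample sizes are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_triangle(verts, n, rng):
    """Draw n points uniformly from the triangle whose rows are `verts`,
    via the standard square-root trick."""
    a, b, c = verts
    r1, r2 = rng.random(n), rng.random(n)
    s = np.sqrt(r1)
    return (1 - s)[:, None] * a + (s * (1 - r2))[:, None] * b + (s * r2)[:, None] * c

def waa_eps_objective(data, verts, eps, m, seed=0):
    """Monte-Carlo estimate of W2(mu_n, 1_Omega/|Omega|) + eps * U_m(1_Omega/|Omega|)
    for a triangle Omega, matching equal-size point clouds by optimal assignment."""
    rng = np.random.default_rng(seed)
    verts = np.asarray(verts, float)
    pts = sample_triangle(verts, len(data), rng)
    cost = ((data[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)        # optimal matching
    w2 = np.sqrt(cost[rows, cols].mean())           # empirical W2 estimate
    a, b, c = verts
    area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    entropy = 1.0 / ((m - 1) * area ** (m - 1)) if m > 1 else -np.log(area)  # (1.5)
    return w2 + eps * entropy
```

Shrinking the triangle can lower the W_2 term when the data are concentrated, but it blows up the entropy term; this trade-off is the degeneracy-prevention mechanism behind the regularization.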
Due to the fact that the regularization prevents the minimizer from collapsing to a lower dimensional
set, we are able to obtain existence of solutions in all dimensions, for general µ∈ P2(Rd).
Theorem 1.2 (Existence of minimizer for (WAAε), in all dimensions). Fix ε > 0, k ≥ d + 1,
m ≥ 1, and µ ∈ P_2(R^d). Then there exists a solution of (WAAε).
The proof of Theorem 1.2 follows via Prokhorov's theorem and the lower semicontinuity of the
Wasserstein metric and Rényi entropy in the narrow topology. Uniqueness again fails in dimensions
higher than one, due to the rotational invariance of the Rényi entropy.
Remark 1.3 (Limit as ε → 0). If d = 1 or 2 and µ ∈ P_{2,ac}(R^d), arguments similar to those in the proof of
Theorem 1.1 can be used to show that, for any ε_n → 0 as n → +∞, the solutions to (WAAε) with
ε = ε_n form a minimizing sequence for (WAA) and converge (up to a subsequence) to a solution
of (WAA). This shows consistency of the regularized problem with the original problem as the
regularization is removed.
Our final results consider the practical application of Wasserstein archetypal analysis to data. In
particular, we seek to understand how solutions of the regularized problem (WAAε) behave when
a measure µ ∈ P_2(R^d) is approximated by a sequence of empirical measures µ_n := (1/n) Σ_{i=1}^n δ_{X_i},
with X_i ∼ µ i.i.d. We show that, almost surely and up to a subsequence, optimizers Ω_n for the empirical
measure µ_n converge as n → ∞ to an optimizer Ω for µ. We also provide a convergence rate in terms
While more sophisticated rates can be obtained using sharper moment estimates or estimates on
the dimensionality of the support of µ [13, 27], we use the classical estimate for simplicity and to
preserve the focus on the archetypal analysis problem.
Theorem 1.4 (Consistency and convergence for (WAAε)). Fix ε > 0, k ≥ d + 1, m ≥ 1, and
µ ∈ P_2(R^d). Suppose
\[
\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}, \qquad X_i \overset{\mathrm{iid}}{\sim} \mu.
\]
For each n ∈ N, let Ω_n be a minimizer of (WAAε) for µ_n. Then the following hold:
(i) Almost surely, there exists Ω ∈ S_k so that, up to a subsequence,
\[
1_{\Omega_n}/|\Omega_n| \to 1_\Omega/|\Omega| \text{ narrowly}, \tag{1.6}
\]
and Ω solves (WAAε) for µ;
(ii) If β := ∫_{R^d} |x|^{d+5} dµ(x) < +∞, then
\[
\mathbb{E}\Big[ W_2\big(\mu, 1_{\Omega_n}/|\Omega_n|\big) + \varepsilon U_m(1_{\Omega_n}/|\Omega_n|) \Big] - \min_{\Omega \in \mathcal{S}_k} \Big\{ W_2\big(\mu, 1_\Omega/|\Omega|\big) + \varepsilon U_m(1_\Omega/|\Omega|) \Big\} \leq \frac{C(\beta, d)}{n^{1/(d+4)}},
\]
where C(β, d) is some constant depending on β and d.
In addition to ensuring consistency of WAA under empirical approximation, the previous theorem
is also relevant from the perspective of numerical methods. Our final main result is the
development of a numerical method for solving (WAA) and (WAAε), based on approximating a
given measure µ ∈ P_2(R^d) by a sequence of finitely supported measures µ_n = Σ_{i=1}^n m_i δ_{x_i} and
approximating the solution of (WAA) or (WAAε) for µ_n. Such an approximation greatly simplifies
the problem numerically, since the Wasserstein distance W_2(µ_n, 1_Ω/|Ω|) can now be computed using
the semidiscrete method introduced by Mérigot [18]. Based on this perspective, we introduce an
alternating gradient-based method to approximate the optimizer; see Section 5 and Algorithm 1.