A Fourier Approach to Mixture Learning
Mingda Qiao*1, Guru Guruganesh2, Ankit Singh Rawat2, Avinava Dubey2, and
Manzil Zaheer3
1Stanford University, mqiao@stanford.edu
2Google Research, {gurug,ankitsrawat,avinavadubey}@google.com
3Google DeepMind, manzilzaheer@google.com
Abstract
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from the mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $\Delta$, defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan [RV17] showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$.
In this work, we give an algorithm that efficiently learns the means in $d = O(\log k / \log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of [RV17], our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in the parameters $d$, $\Delta$, and $\epsilon$.
Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
1 Introduction
Gaussian mixture models (GMMs) are one of the most well studied models for a population with different components. A GMM defines a distribution over the $d$-dimensional Euclidean space as the weighted sum of normal distributions $\sum_{i=1}^{k} w_i \cdot \mathcal{N}(\mu_i, \Sigma_i)$, which is specified by the following quantities: the number of components $k \in \mathbb{N}$, the component means $\mu_i \in \mathbb{R}^d$, the component covariances $\Sigma_i \in \mathbb{R}^{d\times d}$, which are positive definite matrices, and the weights $w_i \ge 0$ that sum up to 1. In this work, we consider the uniform spherical case, where the weights are uniform ($w_i = \frac{1}{k}$) and the covariance matrix $\Sigma_i = I_d$ is the identity matrix. The central problem in this setup is to efficiently estimate the means $\mu_1, \ldots, \mu_k$. To avoid degenerate cases, such as when some of the means are the same, it is common to parameterize the problem by the separation $\Delta$ of the means, which guarantees that $\|\mu_i - \mu_j\|_2 \ge \Delta$ for all $i \ne j$.
*Part of this work was done while working as an intern at Google Research.
More precisely, the problem is to estimate the means $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$ up to an error $\epsilon$ with runtime that is $\mathrm{poly}(k, \frac{1}{\epsilon}, d)$, with as small a separation $\Delta$ as possible. There has been a long line of work on this problem, which we survey in Section 1.3.
Recently, Regev and Vijayaraghavan [RV17] showed that a separation of $\Delta = \Omega(\sqrt{\log k})$ is strictly necessary when the dimension $d = \Omega(\log k)$. Two natural questions arise immediately. First, whether $\Delta = \sqrt{\log k}$ is sufficient when $d = \Omega(\log k)$. Although the original work of [RV17] showed that this is information-theoretically possible, an actual efficient algorithm was only recently developed by [LL22] (who show nearly tight results). The second main question is determining the optimal separation in low dimensions, when $d = o(\log k)$. Previously, even in $O(1)$ dimensions, the exact separation necessary was unknown. In this paper, we settle the second question and give nearly optimal bounds on the separation necessary in low dimensions (see Figure 1 for more details).
1.1 Overview of Results
We begin with a few definitions. A point set $\{x_1, x_2, \ldots, x_k\}$ is called $\Delta$-separated if $\|x_j - x_{j'}\|_2 \ge \Delta$ for any $j \ne j'$. We say that a Gaussian mixture is $\Delta$-separated (or has separation $\Delta$) if the means of its components are $\Delta$-separated. Two point sets $\{u_1, \ldots, u_k\}$ and $\{v_1, \ldots, v_k\}$ are $\epsilon$-close if, for some permutation $\sigma$ over $[k]$, $\|u_j - v_{\sigma(j)}\|_2 \le \epsilon$ holds for every $j$.
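As a purely illustrative rendering of these two definitions (not part of the paper), the Python sketch below checks $\Delta$-separation and $\epsilon$-closeness for small point sets; the brute-force search over permutations simply mirrors the definition and is exponential in $k$, so a bipartite-matching routine would be the natural replacement for larger inputs.

```python
import numpy as np
from itertools import permutations

def is_separated(points, delta):
    """Return True if every pair of points is at Euclidean distance >= delta."""
    pts = np.asarray(points, dtype=float)
    k = len(pts)
    return all(np.linalg.norm(pts[i] - pts[j]) >= delta
               for i in range(k) for j in range(i + 1, k))

def is_close(us, vs, eps):
    """Return True if some permutation sigma puts every u_j within eps of v_sigma(j)."""
    us, vs = np.asarray(us, dtype=float), np.asarray(vs, dtype=float)
    k = len(us)
    return any(all(np.linalg.norm(us[j] - vs[sigma[j]]) <= eps for j in range(k))
               for sigma in permutations(range(k)))
```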
Our main result is an algorithm that efficiently learns the parameters of a mixture of $k$ spherical Gaussians under separation $\Delta \ge \frac{d}{\sqrt{\log k}}\cdot\mathrm{poly}(\log\log k)$ in $d = O(\log k/\log\log k)$ dimensions. In the low-dimensional regime, this separation is strictly smaller than $\min\{\sqrt{\log k}, \sqrt{d}\}$, the smallest separation under which previous algorithms could provably learn the parameters in $\mathrm{poly}(k)$ time.
Theorem 1.1 (Upper bound, informal). Let $P$ be a uniform mixture of $k$ spherical Gaussians in $d = O\!\left(\frac{\log k}{\log\log k}\right)$ dimensions with separation $\Delta \ge d\sqrt{\frac{\log((\log k)/d)}{\log k}}$. There is a $\mathrm{poly}(k)$-time algorithm that, given samples from $P$, outputs $k$ vectors that are w.h.p. $\epsilon$-close to the true means for $\epsilon = O(\Delta/\sqrt{d})$.
See Theorem 2.1 and Remark 2.2 for a more formal statement of our algorithmic result, which holds for a wider range of the separation $\Delta$ and the accuracy parameter $\epsilon$. Our learning algorithm is provably correct for arbitrarily small $\Delta, \epsilon > 0$ (possibly with a longer runtime), whereas for most of the previous algorithms, there is a breakdown point $\Delta^*$ such that the algorithm is not known to work when $\Delta < \Delta^*$. Two exceptions are the algorithms of [MV10, BS10], both of which allow an arbitrarily small separation but run in $e^{\Omega(k)}$ time. We also remark that the runtime of our algorithm scales as $\tilde{O}(k^3)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in the parameters $d$, $\Delta$, and $\epsilon$.^1
We complement Theorem 1.1 with an almost-matching lower bound, showing that the $d/\sqrt{\log k}$ separation is necessary for efficient parameter learning in low dimensions.
Theorem 1.2 (Lower bound, informal). For $d = O\!\left(\frac{\log k}{\log\log k}\right)$ and $\Delta = o\!\left(\frac{d}{\sqrt{\log k}}\right)$, there are two mixtures of $k$ spherical Gaussians in $\mathbb{R}^d$ such that: (1) both have separation $\Delta$; (2) their means are not $(\Delta/2)$-close; and (3) the total variation (TV) distance between them is $k^{-\omega(1)}$.
^1 While we focus on the uniform-weight case for simplicity, Theorem 1.1 can be easily extended to the setting where each weight is in $[1/(Ck), C/k]$ for some constant $C > 1$.
[Figure 1: separation $\Delta$ (vertical axis, with ticks at $1/\sqrt{\log k}$, $1$, and $\sqrt{\log k}$) plotted against dimension $d$ (horizontal axis, with ticks at $1$ and $\log k$). Regions I–IV and the areas covered by Theorems 1 and 2 are marked as either "Learnable with poly($k$) samples" or "Not poly($k$)-sample learnable".]
Figure 1: Region I is a direct corollary of [MV10, Proposition 15]. Regions II, III, and IV are shown by [RV17, Theorems 1.2, 1.3, and 1.4], respectively. The upper boundary of Region IV is the curve $\Delta = \sqrt{d}$. Theorems 1.1 and 1.2 settle the learnability in the remaining area, by proving that the line $\Delta = d/\sqrt{\log k}$ is the boundary between polynomial and super-polynomial sample complexities (up to a doubly-logarithmic factor).
See Theorem D.1 for a more formal version of the lower bound.
Theorem 1.1 and Theorem 1.2 together nearly settle the polynomial learnability of spherical Gaussian mixtures in the low-dimensional regime. Up to a doubly logarithmic factor, the "critical separation" where efficient learning becomes possible is $d/\sqrt{\log k}$. To the best of our knowledge, this was previously unknown even for $d = O(1)$.^2
See Figure 1 for a plot of our results in the context of prior work. The green regions cover the parameters $(\Delta, d)$ such that mixtures of $k$ spherical Gaussians in $d$ dimensions with separation $\Delta$ are learnable (up to $O(\Delta)$ error) using $\mathrm{poly}(k)$ samples.^3 The red regions contain the parameters under which polynomial-sample learning is provably impossible.
The algorithm that underlies Theorem 1.1 can be easily extended beyond the spherical Gaussian case. The following more general result states that for any distribution $\mathcal{D}$ whose Fourier transform does not decay too fast, we can efficiently learn the parameters of a mixture of $k$ translated copies of $\mathcal{D}$. In the following, let $\mathcal{D}_\mu$ denote the distribution of $X + \mu$ when $X$ is drawn from $\mathcal{D}$.
Theorem 1.3 (Learning more general mixtures, informal). Let $P = \frac{1}{k}\sum_{j=1}^{k}\mathcal{D}_{\mu_j}$ for $\Delta$-separated $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$. There is an algorithm that, given $\epsilon > 0$ and samples from $P$, runs in time
$$\mathrm{poly}\!\left(k,\ \Delta/\epsilon,\ 1/\delta,\ \max_{\|\xi\|_2 \le M}\left|\mathbb{E}_{X\sim\mathcal{D}}\left[e^{i\xi^\top X}\right]\right|^{-1}\right)$$
for some $\delta = \delta(\mathcal{D}, \Delta)$ and $M = M(k, d, \Delta, \epsilon)$, and outputs $\hat{\mu}_1, \ldots, \hat{\mu}_k$ that are w.h.p. $\epsilon$-close to the true parameters.
^2 An exception is the $d = 1$ case: a result of [Moi15] implies that $\Delta = \Omega(1/\sqrt{\log k})$ suffices (see Section 1.3), and a matching lower bound was given by [MV10].
^3 In terms of the computational complexity, all the green regions (except a small portion of Region III) admit efficient algorithms. The algorithm of [LL22] is efficient when $\Delta = \Omega((\log k)^{1/2+c})$ for any $c > 0$ and thus almost covers Region III. For Region IV, [RV17] gave an efficient algorithm only for $d = O(1)$, whereas Theorem 1.1 covers the entire Region IV.
See Theorem E.1 and Corollary E.2 for a more formal statement of the runtime. Theorem 1.3 applies to many well-known distribution families that are specified by either a "location parameter" or a "scale parameter". Table 1 gives a few examples of applying Theorem 1.3 to mixtures of single-parameter univariate distributions; see Appendix E.2 and Appendix E.3 for more details.
Distribution | Parameter | Density Function | Runtime
Cauchy | $\mu$ | $\frac{1}{\pi(1+(x-\mu)^2)}$ | $O(k^3)\cdot e^{O(\sqrt{\log k}/\Delta)}$
Logistic | $\mu$ | $\frac{e^{-(x-\mu)}}{(1+e^{-(x-\mu)})^2}$ | $O(k^3)\cdot e^{O(\sqrt{\log k}/\Delta)}$
Laplace | $\mu$ | $\frac{1}{2}e^{-|x-\mu|}$ | $\tilde{O}(k^3/\Delta^5)$
Exponential | $\ln\lambda$ | $\lambda e^{-\lambda x}\cdot\mathbb{1}[x\ge 0]$ | $O(k^3)\cdot e^{O(\sqrt{\log k}/\Delta)}$

Table 1: Implication of Theorem 1.3 for learning various families of mixtures of univariate distributions beyond Gaussians, assuming that the parameters of different components are $\Delta$-separated. The algorithm outputs $k$ parameters that are $O(\Delta)$-close to the true parameters.
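To make the Fourier-transform condition in Theorem 1.3 concrete, the Python sketch below (illustrative only; the means, frequency, and sample size are arbitrary choices, not values from the paper) samples from a mixture of translated Laplace components and checks that the empirical characteristic function is close to the component characteristic function $\frac{1}{1+\xi^2}$ times the average phase over the location parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 200_000
mus = np.array([-6.0, -2.0, 2.0, 6.0])      # Delta-separated location parameters

# Draw n samples from the uniform mixture of translated Laplace(mu_j, 1) components.
comps = rng.integers(0, k, size=n)
X = rng.laplace(loc=mus[comps], scale=1.0)

xi = 1.3                                    # a test frequency
emp_cf = np.mean(np.exp(1j * xi * X))       # empirical characteristic function of P

# Prediction from the mixture structure: CF of Laplace(0,1) times the average phase.
laplace_cf = 1.0 / (1.0 + xi ** 2)
avg_phase = np.mean(np.exp(1j * xi * mus))
print(abs(emp_cf - laplace_cf * avg_phase)) # small, on the order of 1/sqrt(n)
```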
Limitations of our work. The main limitation is that our positive results only apply to the regime where the dimension $d$ is logarithmic in the number of clusters, and where all clusters are translated copies of the same distribution. A concrete future direction would be to extend our results to learning mixtures of general Gaussians, even in one dimension.
1.2 Proof Overview
For simplicity, we focus on the one-dimensional case where $P = \frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, 1)$ for $\Delta$-separated means $\mu_1, \ldots, \mu_k \in \mathbb{R}$. We index the components such that $|\mu_1| \le |\mu_2| \le \cdots \le |\mu_k|$, and focus on the following testing problem: Given $\epsilon > 0$ and samples from $P$, determine whether $\mu_1 = 0$ or $|\mu_1| \ge \epsilon$, assuming that one of them is true. Note that this testing problem is not harder than estimating $\mu_1, \ldots, \mu_k$ up to error $\epsilon/3$: in the former case that $\mu_1 = 0$, one of the mean estimates would fall into $[-\epsilon/3, \epsilon/3]$, whereas all of them must be outside $(-2\epsilon/3, 2\epsilon/3)$ in the latter case. Conversely, as we will prove in Section 2, an algorithm for the testing version can be used for recovering the means as well.
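The easy direction of this reduction is a one-line check; the Python sketch below (illustrative only) assumes we are handed mean estimates that are accurate to within $\epsilon/3$.

```python
import numpy as np

def test_from_estimates(mean_estimates, eps):
    """Decide between 'mu_1 = 0' and '|mu_1| >= eps' from estimates accurate to eps/3.

    If mu_1 = 0, its estimate lies in [-eps/3, eps/3]; if every |mu_j| >= eps,
    all estimates lie outside (-2*eps/3, 2*eps/3), so this threshold separates the cases.
    """
    return bool(np.min(np.abs(np.asarray(mean_estimates))) <= eps / 3)
```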
Examine the Fourier spectrum. We start by examining the Fourier transform of $P$, $(\mathcal{F}P)(\xi) := \mathbb{E}_{X\sim P}\left[e^{i\xi X}\right]$, more commonly known in the literature as the characteristic function. Since the Fourier transform of a Gaussian is still a Gaussian, and a translation in the time domain shifts the phase in the frequency domain, we have
$$\mathbb{E}_{X\sim P}\left[e^{i\xi X}\right] = \frac{1}{k}\sum_{j=1}^{k}\mathbb{E}_{X\sim\mathcal{N}(\mu_j,1)}\left[e^{i\xi X}\right] = \frac{e^{-\xi^2/2}}{k}\sum_{j=1}^{k}e^{i\mu_j\xi}. \tag{1}$$
We will focus on the quantity $A_\mu(\xi) := \sum_{j=1}^{k}e^{i\mu_j\xi}$, which is the "total phase" over the $k$ components of $P$. Equation (1) essentially states that $A_\mu(\xi)$ can be estimated by averaging $e^{i\xi X}$ over samples $X\sim P$ (and rescaling by $k\,e^{\xi^2/2}$).
The key observation is that each term $e^{i\mu_j\xi}$ of $A_\mu(\xi)$ behaves quite differently depending on the magnitude of $\mu_j$: if $\mu_j = 0$, $e^{i\mu_j\xi} = 1$ is a constant, whereas $e^{i\mu_j\xi}$ is a high-frequency wave when $|\mu_j|$ is large. This suggests that the two cases ($\mu_1 = 0$ and $|\mu_1| \ge \epsilon$) can be distinguished by estimating $A_\mu(\xi)$ at different frequencies. The cost of estimating $A_\mu(\xi)$, however, depends heavily on the frequency $\xi$: Equation (1) together with a simple concentration bound shows that $O(k^2)\cdot e^{O(\xi^2)}$ samples are sufficient for estimating $A_\mu(\xi)$ up to a constant error.
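The following sketch (illustrative; the means, frequency, and sample size are arbitrary) estimates $A_\mu(\xi)$ exactly as Equation (1) suggests: average $e^{i\xi X}$ over samples and rescale by $k\,e^{\xi^2/2}$. The $e^{\xi^2/2}$ factor is what makes large frequencies expensive.

```python
import numpy as np

def estimate_total_phase(samples, xi, k):
    """Estimate A_mu(xi) = sum_j exp(i * mu_j * xi) from samples of the 1D mixture.

    By Equation (1), E[exp(i*xi*X)] = exp(-xi^2/2)/k * A_mu(xi), so we rescale the
    empirical characteristic function by k * exp(xi^2/2).
    """
    emp_cf = np.mean(np.exp(1j * xi * np.asarray(samples)))
    return k * np.exp(xi ** 2 / 2) * emp_cf

# Sanity check on synthetic data.
rng = np.random.default_rng(1)
mus = np.array([0.0, 3.0, -5.0, 9.0])
X = rng.normal(loc=rng.choice(mus, size=400_000), scale=1.0)
print(estimate_total_phase(X, xi=0.8, k=len(mus)))  # compare to np.sum(np.exp(1j * 0.8 * mus))
```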
Therefore, the crux of this approach is to find the minimum $M > 0$ such that the two cases can be distinguished by estimating $A_\mu(\xi)$ over $\xi\in[-M, M]$. (This is known as the super-resolution problem, which we discuss in Section 1.3.) The sample complexity of the testing problem can then be roughly bounded by $O(k^2)\cdot e^{O(M^2)}$. In the following, we explore different ways of picking $\xi$ from the range $[-M, M]$.
Choosing $\xi$ randomly. Our overall strategy is to draw $\xi$ randomly from some distribution $\mathcal{D}_\xi$ over the interval $[-M, M]$ and evaluate $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}[A_\mu(\xi)]$. We will argue that this expectation is very close to its first term $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_1\xi}\right]$, which takes different values depending on whether $\mu_1 = 0$ or $|\mu_1| \ge \epsilon$. Then, $\mathcal{D}_\xi$ needs to be chosen such that: (1) there is a gap in the value of $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_1\xi}\right]$ between the two cases; (2) $\sum_{j=2}^{k}\left|\mathbb{E}_{\xi}\left[e^{i\mu_j\xi}\right]\right|$ is small enough for the gap in the first term to be easily identified.
As a warmup, let $\mathcal{D}_\xi$ be the uniform distribution over $[0, M]$. A simple calculation shows that the gap in the first term $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_1\xi}\right]$ between the two cases ($\mu_1 = 0$ or $|\mu_1| \ge \epsilon$) is lower bounded by $\Omega(\min\{M\epsilon, 1\})$, whereas the contribution from the $j$-th component satisfies $\left|\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_j\xi}\right]\right| = O\!\left(\frac{1}{M|\mu_j|}\right)$. Furthermore, the $\Delta$-separation between the means implies $|\mu_j| = \Omega(\Delta j)$. Thus,
$$\sum_{j=2}^{k}\left|\mathbb{E}_{\xi}\left[e^{i\mu_j\xi}\right]\right| \lesssim \sum_{j=2}^{k}\frac{1}{M|\mu_j|} \lesssim \frac{1}{M\Delta}\sum_{j=2}^{k}\frac{1}{j} \lesssim \frac{\log k}{M\Delta}.$$
The above is much smaller than $\min\{M\epsilon, 1\}$ if we set $M \gtrsim \max\left\{\frac{\log k}{\Delta}, \sqrt{\frac{\log k}{\Delta\epsilon}}\right\}$. Unfortunately, even when $\Delta$ and $\epsilon$ are constants, we have $O(k^2)\cdot e^{O(M^2)} = k^{O(\log k)}$ and the resulting sample complexity is already super-polynomial in $k$.
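For completeness, the $O\!\left(\frac{1}{M|\mu_j|}\right)$ bound used in the warmup follows from a one-line computation (a sketch; constants are not optimized):
$$\mathbb{E}_{\xi\sim\mathrm{Unif}[0,M]}\left[e^{i\mu\xi}\right] = \frac{1}{M}\int_{0}^{M}e^{i\mu\xi}\,\mathrm{d}\xi = \frac{e^{i\mu M}-1}{i\mu M}, \qquad\text{so}\qquad \left|\mathbb{E}_{\xi\sim\mathrm{Unif}[0,M]}\left[e^{i\mu\xi}\right]\right| \le \frac{2}{M|\mu|}.$$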
A better choice of $\mathcal{D}_\xi$. It turns out that choosing $\mathcal{D}_\xi$ to be a truncated Gaussian leads to a much lower sample complexity. For some $\sigma \le M$, we draw $\xi\sim\mathcal{N}(0, \sigma^2)$ and then truncate it to $[-M, M]$. Without the truncation, the expectation of $e^{i\mu_j\xi}$ has a nice closed form:
$$\mathbb{E}_{\xi\sim\mathcal{N}(0,\sigma^2)}\left[e^{i\mu_j\xi}\right] = e^{-\sigma^2\mu_j^2/2},$$
which is exactly the Fourier weight of $\mathcal{N}(0, \sigma^2)$ at frequency $\mu_j$. Note that this decreases very fast as $|\mu_j|$ grows, compared to the previous rate of $\frac{1}{M|\mu_j|}$ when $\mathcal{D}_\xi$ is uniform.
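This closed form is the standard characteristic function of $\mathcal{N}(0,\sigma^2)$ evaluated at $\mu_j$; as a quick sketch, it follows by completing the square in the exponent:
$$\mathbb{E}_{\xi\sim\mathcal{N}(0,\sigma^2)}\left[e^{i\mu\xi}\right] = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\xi^2/(2\sigma^2)+i\mu\xi}\,\mathrm{d}\xi = e^{-\sigma^2\mu^2/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(\xi-i\sigma^2\mu)^2/(2\sigma^2)}\,\mathrm{d}\xi = e^{-\sigma^2\mu^2/2},$$
where the last integral equals $1$ by a contour shift.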
It again follows from a simple calculation that: (1) depending on whether $\mu_1 = 0$ or $|\mu_1| \ge \epsilon$, the gap in $\mathbb{E}_{\xi}\left[e^{i\mu_1\xi}\right]$ between the two cases is $\Omega(\min\{\sigma^2\epsilon^2, 1\})$; (2) the total contribution from $j = 2, 3, \ldots, k$ is