A Fourier Approach to Mixture Learning
Mingda Qiao*1, Guru Guruganesh2, Ankit Singh Rawat2, Avinava Dubey2, and
Manzil Zaheer3
1Stanford University, mqiao@stanford.edu
2Google Research, {gurug,ankitsrawat,avinavadubey}@google.com
3Google DeepMind, manzilzaheer@google.com
Abstract
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from the mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $\Delta$, defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan [RV17] showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$.
In this work, we give an algorithm that efficiently learns the means in $d = O(\log k / \log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of [RV17], our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in the parameters $d$, $\Delta$, and $\epsilon$.
Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
1 Introduction
Gaussian mixture models (GMMs) are one of the most well studied models for a population with different components. A GMM defines a distribution over the $d$-dimensional Euclidean space as the weighted sum of normal distributions $\sum_{i=1}^{k} w_i \cdot \mathcal{N}(\mu_i, \Sigma_i)$, which is specified by the following quantities: the number of components $k \in \mathbb{N}$, the component means $\mu_i \in \mathbb{R}^d$, the component covariances $\Sigma_i \in \mathbb{R}^{d\times d}$, which are positive definite matrices, and the weights $w_i \ge 0$ that sum up to 1. In this work, we consider the uniform spherical case, where the weights are uniform ($w_i = \frac{1}{k}$) and the covariance matrix $\Sigma_i = I_d$ is the identity matrix. The central problem in this setup is to efficiently estimate the means $\mu_1, \ldots, \mu_k$. To avoid degenerate cases, such as when some of the means are the same, it is common to parameterize the problem by the separation $\Delta$ of the means, which guarantees that $\|\mu_i - \mu_j\|_2 \ge \Delta$ for all $i \ne j$.
*Part of this work was done while working as an intern at Google Research.
More precisely, the problem is to estimate the means $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$ up to an error $\epsilon$ with runtime that is $\mathrm{poly}(k, \frac{1}{\epsilon}, d)$, with as small a separation $\Delta$ as possible. There has been a long line of work on this problem, which we survey in Section 1.3.
Recently, Regev and Vijayaraghavan [RV17] showed that a separation of $\Delta = \Omega(\sqrt{\log k})$ is strictly necessary when the dimension $d = \Omega(\log k)$. Two natural questions arise immediately. First, whether $\Delta = \sqrt{\log k}$ is sufficient when $d = \Omega(\log k)$. Although the original work of [RV17] showed that this is information-theoretically possible, an actual efficient algorithm was only recently developed by [LL22] (who show nearly tight results). The second main question is determining the optimal separation in low dimensions, when $d = o(\log k)$. Previously, even in $O(1)$ dimensions, the exact separation necessary was unknown. In this paper, we settle the second question and give nearly optimal bounds on the separation necessary in low dimensions (see Figure 1 for more details).
1.1 Overview of Results
We begin with a few definitions. A point set $\{x_1, x_2, \ldots, x_k\}$ is called $\Delta$-separated if $\|x_j - x_{j'}\|_2 \ge \Delta$ for any $j \ne j'$. We say that a Gaussian mixture is $\Delta$-separated (or has separation $\Delta$) if the means of its components are $\Delta$-separated. Two point sets $\{u_1, \ldots, u_k\}$ and $\{v_1, \ldots, v_k\}$ are $\epsilon$-close if, for some permutation $\sigma$ over $[k]$, $\|u_j - v_{\sigma(j)}\|_2 \le \epsilon$ holds for every $j$.
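As a purely illustrative rendering of these two definitions (not part of the paper), the Python sketch below checks $\Delta$-separation and $\epsilon$-closeness for small point sets; the brute-force search over permutations simply mirrors the definition and is exponential in $k$, so a bipartite-matching routine would be the natural replacement for larger inputs.

```python
import numpy as np
from itertools import permutations

def is_separated(points, delta):
    """Return True if every pair of points is at Euclidean distance >= delta."""
    pts = np.asarray(points, dtype=float)
    k = len(pts)
    return all(np.linalg.norm(pts[i] - pts[j]) >= delta
               for i in range(k) for j in range(i + 1, k))

def is_close(us, vs, eps):
    """Return True if some permutation sigma puts every u_j within eps of v_sigma(j)."""
    us, vs = np.asarray(us, dtype=float), np.asarray(vs, dtype=float)
    k = len(us)
    return any(all(np.linalg.norm(us[j] - vs[sigma[j]]) <= eps for j in range(k))
               for sigma in permutations(range(k)))
```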
Our main result is an algorithm that efficiently learns the parameters of a mixture of $k$ spherical Gaussians under separation $\Delta \ge \frac{d}{\sqrt{\log k}}\cdot\mathrm{poly}(\log\log k)$ in $d = O(\log k/\log\log k)$ dimensions. In the low-dimensional regime, this separation is strictly smaller than $\min\{\sqrt{\log k}, \sqrt{d}\}$, the smallest separation under which previous algorithms could provably learn the parameters in $\mathrm{poly}(k)$ time.
Theorem 1.1 (Upper bound, informal). Let $P$ be a uniform mixture of $k$ spherical Gaussians in $d = O\!\left(\frac{\log k}{\log\log k}\right)$ dimensions with separation $\Delta \ge d\sqrt{\frac{\log((\log k)/d)}{\log k}}$. There is a $\mathrm{poly}(k)$-time algorithm that, given samples from $P$, outputs $k$ vectors that are w.h.p. $\epsilon$-close to the true means for $\epsilon = O(\Delta/\sqrt{d})$.
See Theorem 2.1 and Remark 2.2 for a more formal statement of our algorithmic result, which holds for a wider range of the separation $\Delta$ and the accuracy parameter $\epsilon$. Our learning algorithm is provably correct for arbitrarily small $\Delta, \epsilon > 0$ (possibly with a longer runtime), whereas for most of the previous algorithms, there is a breakdown point $\Delta^*$ such that the algorithm is not known to work when $\Delta < \Delta^*$. Two exceptions are the algorithms of [MV10, BS10], both of which allow an arbitrarily small separation but run in $e^{\Omega(k)}$ time. We also remark that the runtime of our algorithm scales as $\tilde{O}(k^3)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in the parameters $d$, $\Delta$, and $\epsilon$.^1
We complement Theorem 1.1 with an almost-matching lower bound, showing that the $d/\sqrt{\log k}$ separation is necessary for efficient parameter learning in low dimensions.
Theorem 1.2 (Lower bound, informal). For $d = O\!\left(\frac{\log k}{\log\log k}\right)$ and $\Delta = o\!\left(\frac{d}{\sqrt{\log k}}\right)$, there are two mixtures of $k$ spherical Gaussians in $\mathbb{R}^d$ such that: (1) both have separation $\Delta$; (2) their means are not $(\Delta/2)$-close; and (3) the total variation (TV) distance between them is $k^{-\omega(1)}$.
^1 While we focus on the uniform-weight case for simplicity, Theorem 1.1 can be easily extended to the setting where each weight is in $[1/(Ck), C/k]$ for some constant $C > 1$.
[Figure 1: separation $\Delta$ (vertical axis, with ticks at $1/\sqrt{\log k}$, $1$, and $\sqrt{\log k}$) plotted against dimension $d$ (horizontal axis, with ticks at $1$ and $\log k$). Regions I–IV and the areas covered by Theorems 1 and 2 are marked as either "Learnable with poly($k$) samples" or "Not poly($k$)-sample learnable".]
Figure 1: Region I is a direct corollary of [MV10, Proposition 15]. Regions II, III, and IV are shown by [RV17, Theorems 1.2, 1.3, and 1.4], respectively. The upper boundary of Region IV is the curve $\Delta = \sqrt{d}$. Theorems 1.1 and 1.2 settle the learnability in the remaining area, by proving that the line $\Delta = d/\sqrt{\log k}$ is the boundary between polynomial and super-polynomial sample complexities (up to a doubly-logarithmic factor).
See Theorem D.1 for a more formal version of the lower bound.
Theorem 1.1 and Theorem 1.2 together nearly settle the polynomial learnability of spherical Gaussian mixtures in the low-dimensional regime. Up to a doubly logarithmic factor, the "critical separation" where efficient learning becomes possible is $d/\sqrt{\log k}$. To the best of our knowledge, this was previously unknown even for $d = O(1)$.^2
See Figure 1 for a plot of our results in the context of prior work. The green regions cover the parameters $(\Delta, d)$ such that mixtures of $k$ spherical Gaussians in $d$ dimensions with separation $\Delta$ are learnable (up to $O(\Delta)$ error) using $\mathrm{poly}(k)$ samples.^3 The red regions contain the parameters under which polynomial-sample learning is provably impossible.
The algorithm that underlies Theorem 1.1 can be easily extended beyond the spherical Gaussian case. The following more general result states that for any distribution $\mathcal{D}$ whose Fourier transform does not decay too fast, we can efficiently learn the parameters of a mixture of $k$ translated copies of $\mathcal{D}$. In the following, let $\mathcal{D}_\mu$ denote the distribution of $X + \mu$ when $X$ is drawn from $\mathcal{D}$.
Theorem 1.3 (Learning more general mixtures, informal). Let $P = \frac{1}{k}\sum_{j=1}^{k}\mathcal{D}_{\mu_j}$ for $\Delta$-separated $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$. There is an algorithm that, given $\epsilon > 0$ and samples from $P$, runs in time
$$\mathrm{poly}\!\left(k,\ \Delta/\epsilon,\ 1/\delta,\ \max_{\|\xi\|_2 \le M}\left|\mathbb{E}_{X\sim\mathcal{D}}\left[e^{i\xi^\top X}\right]\right|^{-1}\right)$$
for some $\delta = \delta(\mathcal{D}, \Delta)$ and $M = M(k, d, \Delta, \epsilon)$, and outputs $\hat{\mu}_1, \ldots, \hat{\mu}_k$ that are w.h.p. $\epsilon$-close to the true parameters.
^2 An exception is the $d = 1$ case: a result of [Moi15] implies that $\Delta = \Omega(1/\sqrt{\log k})$ suffices (see Section 1.3), and a matching lower bound was given by [MV10].
^3 In terms of the computational complexity, all the green regions (except a small portion of Region III) admit efficient algorithms. The algorithm of [LL22] is efficient when $\Delta = \Omega((\log k)^{1/2+c})$ for any $c > 0$ and thus almost covers Region III. For Region IV, [RV17] gave an efficient algorithm only for $d = O(1)$, whereas Theorem 1.1 covers the entire Region IV.
See Theorem E.1 and Corollary E.2 for a more formal statement of the runtime. Theorem 1.3 applies to many well-known distribution families that are specified by either a "location parameter" or a "scale parameter". Table 1 gives a few examples of applying Theorem 1.3 to mixtures of single-parameter univariate distributions; see Appendix E.2 and Appendix E.3 for more details.
Distribution | Parameter | Density Function | Runtime
Cauchy | $\mu$ | $\frac{1}{\pi(1+(x-\mu)^2)}$ | $O(k^3)\cdot e^{O(\sqrt{\log k}/\Delta)}$
Logistic | $\mu$ | $\frac{e^{-(x-\mu)}}{(1+e^{-(x-\mu)})^2}$ | $O(k^3)\cdot e^{O(\sqrt{\log k}/\Delta)}$
Laplace | $\mu$ | $\frac{1}{2}e^{-|x-\mu|}$ | $\tilde{O}(k^3/\Delta^5)$
Exponential | $\ln\lambda$ | $\lambda e^{-\lambda x}\cdot\mathbb{1}[x\ge 0]$ | $O(k^3)\cdot e^{O(\sqrt{\log k}/\Delta)}$

Table 1: Implication of Theorem 1.3 for learning various families of mixtures of univariate distributions beyond Gaussians, assuming that the parameters of different components are $\Delta$-separated. The algorithm outputs $k$ parameters that are $O(\Delta)$-close to the true parameters.
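To make the Fourier-transform condition in Theorem 1.3 concrete, the Python sketch below (illustrative only; the means, frequency, and sample size are arbitrary choices, not values from the paper) samples from a mixture of translated Laplace components and checks that the empirical characteristic function is close to the component characteristic function $\frac{1}{1+\xi^2}$ times the average phase over the location parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 200_000
mus = np.array([-6.0, -2.0, 2.0, 6.0])      # Delta-separated location parameters

# Draw n samples from the uniform mixture of translated Laplace(mu_j, 1) components.
comps = rng.integers(0, k, size=n)
X = rng.laplace(loc=mus[comps], scale=1.0)

xi = 1.3                                    # a test frequency
emp_cf = np.mean(np.exp(1j * xi * X))       # empirical characteristic function of P

# Prediction from the mixture structure: CF of Laplace(0,1) times the average phase.
laplace_cf = 1.0 / (1.0 + xi ** 2)
avg_phase = np.mean(np.exp(1j * xi * mus))
print(abs(emp_cf - laplace_cf * avg_phase)) # small, on the order of 1/sqrt(n)
```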
Limitations of our work. The main limitation is that our positive results only apply to the regime where the dimension $d$ is logarithmic in the number of clusters, and where all clusters are translated copies of the same distribution. A concrete future direction would be to extend our results to learning mixtures of general Gaussians, even in one dimension.
1.2 Proof Overview
For simplicity, we focus on the one-dimensional case where $P = \frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, 1)$ for $\Delta$-separated means $\mu_1, \ldots, \mu_k \in \mathbb{R}$. We index the components such that $|\mu_1| \le |\mu_2| \le \cdots \le |\mu_k|$, and focus on the following testing problem: Given $\epsilon > 0$ and samples from $P$, determine whether $\mu_1 = 0$ or $|\mu_1| \ge \epsilon$, assuming that one of them is true. Note that this testing problem is not harder than estimating $\mu_1, \ldots, \mu_k$ up to error $\epsilon/3$: in the former case that $\mu_1 = 0$, one of the mean estimates would fall into $[-\epsilon/3, \epsilon/3]$, whereas all of them must be outside $(-2\epsilon/3, 2\epsilon/3)$ in the latter case. Conversely, as we will prove in Section 2, an algorithm for the testing version can be used for recovering the means as well.
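The easy direction of this reduction is a one-line check; the Python sketch below (illustrative only) assumes we are handed mean estimates that are accurate to within $\epsilon/3$.

```python
import numpy as np

def test_from_estimates(mean_estimates, eps):
    """Decide between 'mu_1 = 0' and '|mu_1| >= eps' from estimates accurate to eps/3.

    If mu_1 = 0, its estimate lies in [-eps/3, eps/3]; if every |mu_j| >= eps,
    all estimates lie outside (-2*eps/3, 2*eps/3), so this threshold separates the cases.
    """
    return bool(np.min(np.abs(np.asarray(mean_estimates))) <= eps / 3)
```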
Examine the Fourier spectrum. We start by examining the Fourier transform of $P$, $(\mathcal{F}P)(\xi) := \mathbb{E}_{X\sim P}\left[e^{i\xi X}\right]$, more commonly known in the literature as the characteristic function. Since the Fourier transform of a Gaussian is still a Gaussian, and a translation in the time domain shifts the phase in the frequency domain, we have
$$\mathbb{E}_{X\sim P}\left[e^{i\xi X}\right] = \frac{1}{k}\sum_{j=1}^{k}\mathbb{E}_{X\sim\mathcal{N}(\mu_j,1)}\left[e^{i\xi X}\right] = \frac{e^{-\xi^2/2}}{k}\sum_{j=1}^{k}e^{i\mu_j\xi}. \tag{1}$$
We will focus on the quantity $A_\mu(\xi) := \sum_{j=1}^{k}e^{i\mu_j\xi}$, which is the "total phase" over the $k$ components of $P$. Equation (1) essentially states that $A_\mu(\xi)$ can be estimated by averaging $e^{i\xi X}$ over samples $X\sim P$ (and rescaling by $k\,e^{\xi^2/2}$).
The key observation is that each term $e^{i\mu_j\xi}$ of $A_\mu(\xi)$ behaves quite differently depending on the magnitude of $\mu_j$: if $\mu_j = 0$, $e^{i\mu_j\xi} = 1$ is a constant, whereas $e^{i\mu_j\xi}$ is a high-frequency wave when $|\mu_j|$ is large. This suggests that the two cases ($\mu_1 = 0$ and $|\mu_1| \ge \epsilon$) can be distinguished by estimating $A_\mu(\xi)$ at different frequencies. The cost of estimating $A_\mu(\xi)$, however, depends heavily on the frequency $\xi$: Equation (1) together with a simple concentration bound shows that $O(k^2)\cdot e^{O(\xi^2)}$ samples are sufficient for estimating $A_\mu(\xi)$ up to a constant error.
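The following sketch (illustrative; the means, frequency, and sample size are arbitrary) estimates $A_\mu(\xi)$ exactly as Equation (1) suggests: average $e^{i\xi X}$ over samples and rescale by $k\,e^{\xi^2/2}$. The $e^{\xi^2/2}$ factor is what makes large frequencies expensive.

```python
import numpy as np

def estimate_total_phase(samples, xi, k):
    """Estimate A_mu(xi) = sum_j exp(i * mu_j * xi) from samples of the 1D mixture.

    By Equation (1), E[exp(i*xi*X)] = exp(-xi^2/2)/k * A_mu(xi), so we rescale the
    empirical characteristic function by k * exp(xi^2/2).
    """
    emp_cf = np.mean(np.exp(1j * xi * np.asarray(samples)))
    return k * np.exp(xi ** 2 / 2) * emp_cf

# Sanity check on synthetic data.
rng = np.random.default_rng(1)
mus = np.array([0.0, 3.0, -5.0, 9.0])
X = rng.normal(loc=rng.choice(mus, size=400_000), scale=1.0)
print(estimate_total_phase(X, xi=0.8, k=len(mus)))  # compare to np.sum(np.exp(1j * 0.8 * mus))
```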
Therefore, the crux of this approach is to find the minimum $M > 0$ such that the two cases can be distinguished by estimating $A_\mu(\xi)$ over $\xi\in[-M, M]$. (This is known as the super-resolution problem, which we discuss in Section 1.3.) The sample complexity of the testing problem can then be roughly bounded by $O(k^2)\cdot e^{O(M^2)}$. In the following, we explore different ways of picking $\xi$ from the range $[-M, M]$.
Choosing $\xi$ randomly. Our overall strategy is to draw $\xi$ randomly from some distribution $\mathcal{D}_\xi$ over the interval $[-M, M]$ and evaluate $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}[A_\mu(\xi)]$. We will argue that this expectation is very close to its first term $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_1\xi}\right]$, which takes different values depending on whether $\mu_1 = 0$ or $|\mu_1| \ge \epsilon$. Then, $\mathcal{D}_\xi$ needs to be chosen such that: (1) there is a gap in the value of $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_1\xi}\right]$ between the two cases; (2) $\sum_{j=2}^{k}\left|\mathbb{E}_{\xi}\left[e^{i\mu_j\xi}\right]\right|$ is small enough for the gap in the first term to be easily identified.
As a warmup, let $\mathcal{D}_\xi$ be the uniform distribution over $[0, M]$. A simple calculation shows that the gap in the first term $\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_1\xi}\right]$ between the two cases ($\mu_1 = 0$ or $|\mu_1| \ge \epsilon$) is lower bounded by $\Omega(\min\{M\epsilon, 1\})$, whereas the contribution from the $j$-th component satisfies $\left|\mathbb{E}_{\xi\sim\mathcal{D}_\xi}\left[e^{i\mu_j\xi}\right]\right| = O\!\left(\frac{1}{M|\mu_j|}\right)$. Furthermore, the $\Delta$-separation between the means implies $|\mu_j| = \Omega(\Delta j)$. Thus,
$$\sum_{j=2}^{k}\left|\mathbb{E}_{\xi}\left[e^{i\mu_j\xi}\right]\right| \lesssim \sum_{j=2}^{k}\frac{1}{M|\mu_j|} \lesssim \frac{1}{M\Delta}\sum_{j=2}^{k}\frac{1}{j} \lesssim \frac{\log k}{M\Delta}.$$
The above is much smaller than $\min\{M\epsilon, 1\}$ if we set $M \gtrsim \max\left\{\frac{\log k}{\Delta}, \sqrt{\frac{\log k}{\Delta\epsilon}}\right\}$. Unfortunately, even when $\Delta$ and $\epsilon$ are constants, we have $O(k^2)\cdot e^{O(M^2)} = k^{O(\log k)}$ and the resulting sample complexity is already super-polynomial in $k$.
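For completeness, the $O\!\left(\frac{1}{M|\mu_j|}\right)$ bound used in the warmup follows from a one-line computation (a sketch; constants are not optimized):
$$\mathbb{E}_{\xi\sim\mathrm{Unif}[0,M]}\left[e^{i\mu\xi}\right] = \frac{1}{M}\int_{0}^{M}e^{i\mu\xi}\,\mathrm{d}\xi = \frac{e^{i\mu M}-1}{i\mu M}, \qquad\text{so}\qquad \left|\mathbb{E}_{\xi\sim\mathrm{Unif}[0,M]}\left[e^{i\mu\xi}\right]\right| \le \frac{2}{M|\mu|}.$$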
A better choice of $\mathcal{D}_\xi$. It turns out that choosing $\mathcal{D}_\xi$ to be a truncated Gaussian leads to a much lower sample complexity. For some $\sigma \le M$, we draw $\xi\sim\mathcal{N}(0, \sigma^2)$ and then truncate it to $[-M, M]$. Without the truncation, the expectation of $e^{i\mu_j\xi}$ has a nice closed form:
$$\mathbb{E}_{\xi\sim\mathcal{N}(0,\sigma^2)}\left[e^{i\mu_j\xi}\right] = e^{-\sigma^2\mu_j^2/2},$$
which is exactly the Fourier weight of $\mathcal{N}(0, \sigma^2)$ at frequency $\mu_j$. Note that this decreases very fast as $|\mu_j|$ grows, compared to the previous rate of $\frac{1}{M|\mu_j|}$ when $\mathcal{D}_\xi$ is uniform.
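This closed form is the standard characteristic function of $\mathcal{N}(0,\sigma^2)$ evaluated at $\mu_j$; as a quick sketch, it follows by completing the square in the exponent:
$$\mathbb{E}_{\xi\sim\mathcal{N}(0,\sigma^2)}\left[e^{i\mu\xi}\right] = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\xi^2/(2\sigma^2)+i\mu\xi}\,\mathrm{d}\xi = e^{-\sigma^2\mu^2/2}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(\xi-i\sigma^2\mu)^2/(2\sigma^2)}\,\mathrm{d}\xi = e^{-\sigma^2\mu^2/2},$$
where the last integral equals $1$ by a contour shift.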
It again follows from a simple calculation that: (1) depending on whether $\mu_1 = 0$ or $|\mu_1| \ge \epsilon$, the gap in $\mathbb{E}_{\xi}\left[e^{i\mu_1\xi}\right]$ between the two cases is $\Omega(\min\{\sigma^2\epsilon^2, 1\})$; (2) the total contribution from $j = 2, 3, \ldots, k$ is