Moment Estimation of Nonparametric Mixture Models Through
Implicit Tensor Decomposition∗

Yifan Zhang†   Joe Kileel‡

arXiv:2210.14386v3 [math.NA] 7 Aug 2023

∗Both authors were supported in part by start-up grants to JK provided by the College of Natural Sciences and Oden Institute at UT Austin. We thank Jonathan Niles-Weed for helpful conversations.
†Oden Institute, University of Texas at Austin, yf.zhang@utexas.edu
‡Department of Mathematics and Oden Institute, University of Texas at Austin, jkileel@math.utexas.edu
Abstract
We present an alternating least squares type numerical optimization scheme to estimate conditionally-independent mixture models in $\mathbb{R}^n$, without parameterizing the distributions. Following the method of moments, we tackle an incomplete tensor decomposition problem to learn the mixing weights and componentwise means. Then we compute the cumulative distribution functions, higher moments and other statistics of the component distributions through linear solves. Crucially for computations in high dimensions, the steep costs associated with high-order tensors are evaded, via the development of efficient tensor-free operations. Numerical experiments demonstrate the competitive performance of the algorithm, and its applicability to many models and applications. Furthermore, we provide theoretical analyses, establishing identifiability from low-order moments of the mixture and guaranteeing local linear convergence of the ALS algorithm.
1 Introduction
Mixture models have been intensively studied for their strong expressive power [37,12,40].
They are a common choice in data science problems when data is believed to come from distinct
subgroups, or when simple distributions such as one Gaussian or one Poisson do not offer a good
fit on their own.
In this paper, we develop new numerical tools for the nonparametric estimation of (finite)
conditionally-independent mixture models in $\mathbb{R}^n$, also known as mixtures of product distributions:
$$\mathcal{D} = \sum_{j=1}^{r} w_j \mathcal{D}_j \quad \text{where} \quad \mathcal{D}_j = \bigotimes_{i=1}^{n} \mathcal{D}_{ij}. \tag{1}$$
Here $r$ is the number of components, and each $\mathcal{D}_j$ is a product of $n$ distributions $\mathcal{D}_{ij}$ ($i = 1, \ldots, n$) on the real line. No parametric assumptions are placed on $\mathcal{D}_{ij}$ except that their moments should
exist. Our new methods are based on tensor decomposition. Our approach is particularly well-
suited for computations in high dimensions.
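To make the product structure concrete, here is a minimal sketch (ours, not from the paper's code) of sampling from a small instance of (1) in which different coordinates follow different one-dimensional families; the families, weights and parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 3, 2, 10_000              # dimension, number of components, sample size (arbitrary)
w = np.array([0.4, 0.6])            # mixing weights w_1, ..., w_r

def sample_component(j, size):
    """Draw `size` samples from D_j = D_1j x D_2j x D_3j; each coordinate has its own 1-D law."""
    x1 = rng.normal(loc=[0.0, 3.0][j], scale=1.0, size=size)   # Gaussian coordinate
    x2 = rng.exponential(scale=[1.0, 0.5][j], size=size)       # exponential coordinate
    x3 = rng.poisson(lam=[2.0, 6.0][j], size=size)             # Poisson coordinate
    return np.stack([x1, x2, x3])                              # shape (n, size)

labels = rng.choice(r, size=p, p=w)           # latent component label of each draw
V = np.empty((n, p))
for j in range(r):
    cols = np.flatnonzero(labels == j)
    V[:, cols] = sample_component(j, cols.size)   # data matrix, one column per sample
```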
1.1 Why This Model?
The nonparametric conditionally-independent mixture model (1) was brought to broader at-
tention by Hall and Zhou [20]. Their motivation came from clinical applications, modeling
heterogeneity of repeated test data from patient groups with different diseases. Later the model
was used to describe heterogeneity in econometrics by Kasahara and Shimotsu [32], who con-
sidered dynamic discrete choice models. By modeling a choice policy as a mixture of Markovian
policies, they reduced their problem to an instance of (1). Heterogeneity analysis using (1) also
appears in auction analysis [25,26] and incomplete information games [51,2]; see the survey
[15]. In statistics, Bonhomme and coworkers [13] considered repeatedly measuring a real-valued
mixture and reduced that situation to an instance of (1). Extending their framework, conditionally-independent mixtures can model repeated measurements of $r$ deterministic objects described by $n$ features, when there is independent measurement noise for each feature. This setting approximates the class averaging problem of cryo-electron microscopy [11]. In computer science, Jain and Oh [29] studied (1) when the $\mathcal{D}_{ij}$ are finite discrete distributions, motivated by connections to crowdsourcing, genetics, and recommendation systems. Chen and Moitra [16] studied (1) to learn stochastic decision trees, where each $\mathcal{D}_{ij}$ is a Bernoulli distribution.
Conditional independence can also be viewed as a reduced model or regularization of the full model. On the one hand, (1) is a universal approximator of densities on $\mathbb{R}^n$; on the other, it reduces the number of parameters and prevents possible over-fitting. For instance, experiments in [30] show that
for many real datasets, diagonal Gaussian mixtures give a comparable or better classification
accuracy compared to full Gaussian mixtures.
1.2 Existing Methods
Many approaches for estimating model (1) have been studied. Hall and Zhou [20] took a nonpara-
metric maximum-likelihood-based approach to learn the component distributions and weights
when $r = 2$. Subsequently in [21], Hall et al. provided another consistent density estimator for (1) when $r = 2$, with errors bounded by $O(p^{-1/2})$. A semi-parametric expectation maximization framework has been developed in a series of works [8,36,14], where $\mathcal{D}_j$ is a superposition of parametrized kernel functions. Bonhomme and coauthors [13] pursued a different approach that applies to repeated 1-D measurements. They learned the map $\phi \mapsto \mathbb{E}_{X \sim \mathcal{D}_j}[\phi(X)]$, instead of the distributions $\mathcal{D}_j$ directly.
Tensor decomposition and method of moments (MoM) based algorithms for (1) have also
received attention. Jain and Oh [29] as well as Chen and Moitra [16] developed moment-
based approaches to learn discrete mixtures. [5] is another example of a MoM approach for
(1) in the discrete case. For continuous distributions and MoM, most attention was devoted
to solving diagonal Gaussian mixtures [24,19,34]. For nonparametric estimation, the works
[30] and [52] used tensor techniques to learn discretizations of Dj. In terms of computational
advances, Sherman and Kolda [46] proposed implicit tensor decomposition to avoid expensive
tensor formation when applying MoM. Subsequently, Pereira and coauthors [42] developed an
implicit MoM algorithm for diagonal and full Gaussian mixture models.
Regarding theory, identifiability has been investigated. This includes identifiability of the
number of components (r), mixing weights, and component distributions. Allman and coauthors
[4] proved that mixing weights and component distributions can be identified from the joint
distribution, assuming $r$ is known. In the large sample limit, this implies identifiability from
data. For learning $r$, recent works are [31,35].
1.3 Our Contributions
We learn the nonparametric mixture model (1) through the method of moments. The number of components $r$ is assumed to be known, or already estimated by an existing method. Given a dataset $V = \{v_1, \ldots, v_p\}$ of size $p$ in $\mathbb{R}^n$ sampled independently from $\mathcal{D}$, we learn the weights $w = \{w_1, \ldots, w_r\}$ and a functional similar to the one in Bonhomme et al.'s paper [13]:
$$g \mapsto \mathbb{E}_{X \sim \mathcal{D}_j}[g(X)], \quad \text{where } g(x) = (g_1(x_1), \ldots, g_n(x_n)),{}^{1} \tag{2}$$
for all $j \in [r]$. We call the output of the functional a general mean. In particular, this includes the component distributions ($g_i(x_i) = 1_{x_i \leq t_i}$), component moments ($g_i(x_i) = x_i^{m_i}$), moment generating functions ($g_i(x_i) = e^{t_i x_i}$), etc.

${}^{1}$The functional applies to functions $g : \mathbb{R}^n \to \mathbb{R}^n$ that are Cartesian products of $g_i : \mathbb{R} \to \mathbb{R}$.
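As a hedged illustration of how such functions $g$ might be supplied in code (the helper names are ours, not the paper's API), each choice below is a Cartesian product of coordinatewise maps $g_i$, recovering respectively component CDF values, componentwise moments, and MGF values.

```python
import numpy as np

def g_cdf(t):
    """g_i(x_i) = 1_{x_i <= t_i}: the general mean is the vector of component CDF values."""
    return lambda x: (np.asarray(x) <= t).astype(float)

def g_moment(m):
    """g_i(x_i) = x_i^{m_i}: the general mean is the vector of componentwise moments."""
    return lambda x: np.asarray(x, dtype=float) ** m

def g_mgf(t):
    """g_i(x_i) = exp(t_i x_i): the general mean is the vector of componentwise MGF values."""
    return lambda x: np.exp(t * np.asarray(x))

# e.g. g = g_cdf(np.zeros(5)) evaluates all five indicators 1_{x_i <= 0} on a vector x in R^5
```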
We propose a novel two-step approach. First (step 1), we learn the mixing weights and component means using tensor decomposition and an alternating least squares (ALS) type algorithm (Algorithms 1 and 6). This exploits special algebraic structure in the conditionally-independent model. Once the weights and means are learned, (step 2) we make an important observation that for any $g$ in (2), the general means $\mathbb{E}_{X \sim \mathcal{D}_j}[g(X)]$ ($j = 1, \ldots, r$) are computed by a linear solve (Algorithm 5).
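The following sketch illustrates one linear relation of this kind that follows from conditional independence: for $i \neq k$, $\mathbb{E}_{\mathcal{D}}[g_i(X_i) X_k] = \sum_j w_j\, \mathbb{E}_{\mathcal{D}_j}[g_i(X_i)]\, (m_j)_k$, so each row of the general-mean matrix solves an overdetermined linear system once the weights and means are known. It is our minimal illustration of the idea, not a transcription of Algorithm 5.

```python
import numpy as np

def general_means_sketch(V, w, M, g):
    """Estimate Y[i, j] ~ E_{X ~ D_j}[g_i(X_i)] by row-wise least squares.

    Uses the conditional-independence relation, for i != k,
        E_D[g_i(X_i) X_k] = sum_j w_j Y[i, j] M[k, j],
    assuming the (n-1) x r coefficient matrices have full column rank.
    V: (n, p) data, w: (r,) weights, M: (n, r) componentwise means, g: map R^n -> R^n.
    """
    n, p = V.shape
    G = np.apply_along_axis(g, 0, V)            # G[i, a] = g_i(V[i, a])
    B = (G @ V.T) / p                           # B[i, k] ~ E_D[g_i(X_i) X_k]
    Y = np.empty((n, M.shape[1]))
    for i in range(n):
        rows = [k for k in range(n) if k != i]  # use only off-diagonal statistics
        C = M[rows, :] * w                      # C[k, j] = w_j M[k, j]
        Y[i], *_ = np.linalg.lstsq(C, B[i, rows], rcond=None)
    return Y

# tiny demo on a diagonal Gaussian instance with the true weights/means plugged in
rng = np.random.default_rng(0)
n, r, p = 6, 2, 100_000
w = np.array([0.4, 0.6]); M = rng.normal(size=(n, r)); S = 0.5 * np.ones((n, r))
lab = rng.choice(r, size=p, p=w)
V = M[:, lab] + S[:, lab] * rng.normal(size=(n, p))
Y_hat = general_means_sketch(V, w, M, lambda x: np.exp(0.3 * x))   # MGF-type choice of g
Y_true = np.exp(0.3 * M + 0.5 * (0.3 * S) ** 2)                    # Gaussian MGF, coordinatewise
print(np.abs(Y_hat - Y_true).max())                                # small; shrinks like p**-0.5
```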
We pay great attention to the efficiency and scalability of our MoM based algorithm, particularly for high dimensions. Our MoM algorithm is implicit in the sense of [46], meaning that no moment tensor is explicitly computed during the estimation. For model (1) in $\mathbb{R}^n$ with sample size $p$, this allows us to achieve an $O(npr + nr^3)$ computational complexity to update the means and weights in ALS in step 1, similar to the per-iteration cost of the EM algorithms. The storage cost is only $O(n(r+p))$ – the same order needed to store the data and computed means. This is a great improvement compared to explicitly computing moment tensors, which requires $O(pn^d)$ computation and $O(n^d)$ storage for the $d$th moment. The cost to solve for the general means (step 2) is $O(npr + nr^3)$ time.
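To make these orders of magnitude concrete (the specific values below are our illustrative choices, not from the paper): with $n = 100$, $d = 3$, $r = 10$ and $p = 10^5$, an explicit third moment tensor has $n^d = 10^6$ entries and costs on the order of $pn^d = 10^{11}$ operations to form, whereas one implicit ALS update costs on the order of $npr + nr^3 = 10^8 + 10^5 \approx 10^8$ operations, with $O(n(r+p)) \approx 10^7$ numbers of storage.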
Compared to other recent tensor-based methods [52,30], it may seem indirect to learn the functional (2) rather than learning the component distributions directly. However, our approach has major advantages for large $n$. In [30], the authors need CP decompositions of 3-D histograms for all marginalizations of the data onto 3 features. This scales cubically in $n$. Additionally, the needed resolution of the histograms can be hard to estimate a priori. In [52], the authors learn the densities by running kernel density estimation. Then they compute a CP decomposition of an explicit $r^n$ tensor, which is exponential in $n$. By contrast, our algorithm has much lower complexity. Moreover, the functional-based approach enables adaptive localization of the component distributions by estimating e.g. their means $a_j$ and standard deviations $\sigma_j$ for $j = 1, \ldots, r$. We can then evaluate the cdf of each component at (say) $a_j \pm k\sigma_j$ for prescribed values of $k$. See a numerical illustration in Section 6.1.
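As a small illustration of this localization step (the helper below is ours), given estimated componentwise means and standard deviations one can lay out the evaluation points $a_j + k\sigma_j$ at which CDF-type choices of $g$ are then evaluated:

```python
import numpy as np

def cdf_grid(a, sigma, ks=(-2, -1, 0, 1, 2)):
    """Evaluation points a_j + k * sigma_j for each coordinate i, component j, offset k.

    a, sigma: (n, r) arrays of estimated componentwise means and standard deviations.
    Returns T of shape (len(ks), n, r); the slice T[m, :, j] gives the thresholds at
    which the CDF-type choice g_i(x_i) = 1_{x_i <= t_i} localizes component j.
    """
    ks = np.asarray(ks, dtype=float)
    return a[None, :, :] + ks[:, None, None] * sigma[None, :, :]
```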
We also have significant theoretical contributions. In Theorem 2.3, we derive a novel system
of coupled CP decompositions, relating the moment tensors of $\mathcal{D}$ to the mixing weights and moments of $\mathcal{D}_j$. This has already seen interest in computational algebraic geometry [3]. In Theorems 2.6 and 2.8, we prove for the model (1) in $\mathbb{R}^n$ that if $d \geq 3$ and $r = O(n^{d/2})$, then the mixing weights and means of $\mathcal{D}_j$ are identifiable from the first $d$ moments of $\mathcal{D}$. In Theorem 2.10, we prove that the general means (2) are identifiable from the weights and means and certain expectations over $\mathcal{D}$. These guarantees are new, and more relevant for MoM than
prior identifiability results which assumed access to the joint distribution (1) [4]. In Theorem 3.9,
we prove a guarantee for our algorithm, namely local convergence of ALS (step 1).
Our last contribution is extensive numerical experiments presented in Section 6. We test
our procedures on various simulated estimation problems as well as real data sets. The tests
demonstrate the scalability of our methods to high dimensions n, and their competitiveness
with leading alternatives. The experiments also showcase the applicability of the model (1) to
different problems in data science. A Python implementation of our algorithms is available at
https://github.com/yifan8/moment_estimation_incomplete_tensor_decomposition.
1.4 Notation
Vectors, matrices and tensors are denoted by lower-case, capital and calligraphic letters respectively, for example by $a$, $A$, $\mathcal{A}$. The 2-norm of a vector and the Frobenius norm of a matrix or tensor are denoted by $\|\cdot\|$. The Euclidean inner product between vectors and the Frobenius inner product between matrices or tensors are denoted by $\langle \cdot, \cdot \rangle$. The operators $\circ$ and $/$ denote entrywise multiplication (or exponentiation) and division respectively. We write the tensor product as $\otimes$, and $\mathbb{R}^n \otimes \cdots \otimes \mathbb{R}^n = \mathbb{R}^{n^d}$. The subspace of order-$d$ symmetric tensors is denoted $S^d(\mathbb{R}^n)$. For a matrix $A$, its $i$th column and $i$th row are $a_i$ and $a^i$, respectively. The probability simplex is denoted $\Delta^{r-1} = \{x \in \mathbb{R}^r_{\geq 0} : \sum_i x_i = 1\}$. The all-ones vector or matrix is $\mathbf{1}$. An
integer partition of $d \in \mathbb{N}$ is a tuple of non-increasing positive integers $\lambda = (\lambda_1, \lambda_2, \ldots)$ such that $\lambda_1 + \lambda_2 + \cdots = d$, denoted $\lambda \vdash d$. We write $\lambda = 1^{b_1} 2^{b_2} \cdots$ if $b_1$ elements are $1$, etc. The length of $\lambda$ is $\ell(\lambda) = b_1 + b_2 + \cdots$.
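For example, the partition $\lambda = (2, 1, 1) \vdash 4$ is written $1^2 2^1$ in this notation, and its length is $\ell(\lambda) = 3$.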
2 Method of Moments and Identifiability
In this section we formulate the MoM framework for solving model (1). We derive moment
equations that we will solve and present important theory on the uniqueness of the equations’
solution.
2.1 Setup
Let $\mathcal{D}$ be a conditionally-independent mixture of $r$ distributions in $\mathbb{R}^n$ as in (1). We assume $r$ is known (or has been estimated). For data we assume we are given a matrix of $p$ independent draws from $\mathcal{D}$:
$$V = (v_1 \,|\, \ldots \,|\, v_p) \in \mathbb{R}^{n \times p}. \tag{3}$$
For the nonparametric estimation of (1) by MoM, our goal is to evaluate the functional
$$g \mapsto \mathbb{E}_{X \sim \mathcal{D}_j}[g(X)], \quad \text{where } g(x) = (g_1(x_1), \ldots, g_n(x_n)), \tag{4}$$
which acts on functions $g$ that are Cartesian products of functions $g_i$. We do this using appropriate expectations over $\mathcal{D}$ approximated by sample averages over $V$.
To this end we use the $d$th moment tensor of $\mathcal{D}$, defined as
$$\mathcal{M}_d = \mathbb{E}_{X \sim \mathcal{D}}[X^{\otimes d}] \in S^d(\mathbb{R}^n). \tag{5}$$
It is approximated by the $d$th sample moment computed from the data matrix $V$:
$$\widehat{\mathcal{M}}_d = \frac{1}{p} \sum_{i=1}^{p} v_i^{\otimes d} \in S^d(\mathbb{R}^n). \tag{6}$$
For functions $g : \mathbb{R}^n \to \mathbb{R}^n$ as in (4), we will also use the expectations
$$\mathbb{E}_{X \sim \mathcal{D}}[g(X) \otimes X^{\otimes (d-1)}] \in \mathbb{R}^{n^d}. \tag{7}$$
These are approximated by the sample averages
$$\frac{1}{p} \sum_{i=1}^{p} g(v_i) \otimes v_i^{\otimes (d-1)} \in \mathbb{R}^{n^d}. \tag{8}$$
In terms of the variables we wish to solve for, let the $d$th moment of component $\mathcal{D}_j$ be
$$m_j^d = \mathbb{E}_{X \sim \mathcal{D}_j}[X^{\circ d}] \in \mathbb{R}^n. \tag{9}$$
For the "general means" in (4), let
$$y_j = \mathbb{E}_{X \sim \mathcal{D}_j}[g(X)] \in \mathbb{R}^n \tag{10}$$
and $Y = (y_1 \,|\, \ldots \,|\, y_r) \in \mathbb{R}^{n \times r}$, when $g$ is fixed and integrable with respect to $\mathcal{D}_j$. We want to solve for (9) (including the componentwise means) from (6), and (10) from (8).
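For small $n$, the sample quantities above can be formed directly; the sketch below (our code, not the paper's) computes the $d = 2$ cases of (6) and (8) as matrix products, and optionally the explicit $d = 3$ case of (6), which the paper's algorithms deliberately avoid in high dimensions.

```python
import numpy as np

def sample_statistics(V, g, with_third_moment=False):
    """Sample versions of (6) and (8) for d = 2 (and optionally d = 3), for small n.

    V: (n, p) data matrix as in (3); g: map R^n -> R^n applied to each column.
    Returns M2_hat ~ (6) with d = 2 and B2_hat ~ (8) with d = 2; optionally also
    M3_hat ~ (6) with d = 3. Forming M3_hat explicitly costs O(p n^3), so this
    explicit route is for illustration only.
    """
    n, p = V.shape
    G = np.apply_along_axis(g, 0, V)        # G[i, a] = g_i(V[i, a])
    M2_hat = (V @ V.T) / p                  # (1/p) sum_i v_i (x) v_i
    B2_hat = (G @ V.T) / p                  # (1/p) sum_i g(v_i) (x) v_i
    if with_third_moment:
        M3_hat = np.einsum('ia,ja,ka->ijk', V, V, V) / p
        return M2_hat, B2_hat, M3_hat
    return M2_hat, B2_hat
```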
Remark 2.1. To achieve scalability of MoM to high dimensions, it is mandatory that we never explicitly form high-dimensional high-order tensors. The costs of forming $n^d$ tensors are prohibitive. We evade them by developing new implicit tensor decomposition methods, motivated by [46,42] and kernel methods. See Sections 3 and 4.
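For intuition, the identity $\langle \widehat{\mathcal{M}}_d, a_1 \otimes \cdots \otimes a_d \rangle = \frac{1}{p}\sum_{i=1}^{p} \prod_{k=1}^{d} \langle v_i, a_k \rangle$ shows why such contractions never require forming (6) explicitly. The sketch below (ours, illustrating the principle rather than the paper's kernel-based routines) checks it against an explicit order-3 tensor on a tiny instance.

```python
import numpy as np

def implicit_contract(V, factors):
    """<M_hat_d, a_1 (x) ... (x) a_d> in O(npd) time, never forming the n^d tensor (6)."""
    p = V.shape[1]
    prods = np.ones(p)
    for a in factors:          # accumulate prod_k <v_i, a_k> for every sample i
        prods *= a @ V         # a @ V has shape (p,): the inner products <v_i, a_k>
    return prods.sum() / p

# sanity check against an explicitly formed order-3 sample moment on a tiny instance
rng = np.random.default_rng(1)
V = rng.normal(size=(4, 1000))
a, b, c = rng.normal(size=(3, 4))
M3 = np.einsum('ia,ja,ka->ijk', V, V, V) / V.shape[1]
explicit = np.einsum('ijk,i,j,k->', M3, a, b, c)
print(np.isclose(explicit, implicit_contract(V, [a, b, c])))   # True
```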
2.2 Systems of Equations
We derive the relationship between the moments $\mathcal{M}_d$ and $\widehat{\mathcal{M}}_d$ of the mixture $\mathcal{D}$ and the moments $m_j^d$ and general means $y_j$ of the components $\mathcal{D}_j$. The relationship is interesting algebraically [3], and forms the basis of our algorithm development.
Definition 2.2. Let $\lambda \vdash d$ be an integer partition. Define a projection operator $P_\lambda : \mathbb{R}^{n^d} \to \mathbb{R}^{n^{\ell(\lambda)}}$, acting on order-$d$ tensors and outputting order-$\ell(\lambda)$ tensors (see Section 1.4 for the definition of $\ell$), via
$$P_\lambda(\mathcal{A})_{i_1, \ldots, i_{\ell(\lambda)}} = \begin{cases} 0 & \text{if } i_1, \ldots, i_{\ell(\lambda)} \text{ are not distinct,} \\ \mathcal{A}_{i_1, \ldots, i_1\, (\lambda_1 \text{ times}),\, i_2, \ldots, i_2\, (\lambda_2 \text{ times}),\, \ldots} & \text{otherwise.} \end{cases}$$
We write $P := P_{(1,\ldots,1)}$ when $\lambda$ consists of all $1$'s.
In particular, $P$ acts by off-diagonal projection, zeroing out all entries with non-distinct indices. Figure 1 illustrates the other projections $P_\lambda$ when $d = 3$.

Figure 1: Illustration of Definition 2.2 when $d = 3$. Left: the entries selected by $P_{(2,1)}$ (green), with indices $iij$, $i \neq j$, resulting in a matrix indexed by $(i, j)$. Right: the entries selected by $P_{(3)}$, with indices $iii$, resulting in a vector indexed by $i$.
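For $d = 3$, Definition 2.2 can be transcribed directly into code; the helper names below are ours.

```python
import numpy as np

def P_111(A):
    """P_(1,1,1): zero out every entry of an order-3 tensor whose indices are not all distinct."""
    n = A.shape[0]
    i, j, k = np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing='ij')
    return np.where((i != j) & (j != k) & (i != k), A, 0.0)

def P_21(A):
    """P_(2,1): the matrix with entries A[i, i, k] for i != k, and zeros on the diagonal."""
    n = A.shape[0]
    idx = np.arange(n)
    out = A[idx, idx, :].copy()     # out[i, k] = A[i, i, k]
    out[idx, idx] = 0.0
    return out

def P_3(A):
    """P_(3): the vector of super-diagonal entries A[i, i, i]."""
    idx = np.arange(A.shape[0])
    return A[idx, idx, idx]
```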
Theorem 2.3. Let $\mathcal{D}$ be a conditionally-independent mixture, and $\mathcal{M}_d$ its $d$th population moment tensor (5). Then for each partition $\lambda = (\lambda_1, \ldots, \lambda_{\ell(\lambda)})$ of $d$,
$$P_\lambda(\mathcal{M}_d) = P\left( \sum_{j=1}^{r} w_j\, m_j^{\lambda_1} \otimes \cdots \otimes m_j^{\lambda_{\ell(\lambda)}} \right). \tag{11}$$
Proof. From (1), (5) and conditioning on the class label $j$,
$$\mathcal{M}_d = \mathbb{E}_{X \sim \mathcal{D}}[X^{\otimes d}] = \sum_{j=1}^{r} w_j\, \mathbb{E}_{X \sim \mathcal{D}_j}[X^{\otimes d}]. \tag{12}$$
Fix the partition $\lambda \vdash d$ and write $\ell = \ell(\lambda)$. We consider an index
$$i = (\underbrace{i_1, \ldots, i_1}_{\lambda_1 \text{ times}}, \ldots, \underbrace{i_\ell, \ldots, i_\ell}_{\lambda_\ell \text{ times}})$$
such that $i_1, i_2, \ldots, i_\ell \in [n]$ are distinct. By coordinatewise independence in $\mathcal{D}_j$,
$$\mathbb{E}_{X \sim \mathcal{D}_j}[X^{\otimes d}]_i = \mathbb{E}_{X \sim \mathcal{D}_j}\big[X_{i_1}^{\lambda_1} \cdots X_{i_\ell}^{\lambda_\ell}\big] = \mathbb{E}_{X \sim \mathcal{D}_j}\big[X_{i_1}^{\lambda_1}\big] \cdots \mathbb{E}_{X \sim \mathcal{D}_j}\big[X_{i_\ell}^{\lambda_\ell}\big] = (m_j^{\lambda_1})_{i_1} \cdots (m_j^{\lambda_\ell})_{i_\ell} = \big(m_j^{\lambda_1} \otimes \cdots \otimes m_j^{\lambda_\ell}\big)_{i_1, i_2, \ldots, i_\ell},$$
where we used the definition (9). Ranging over such indices $i$ gives
$$P_\lambda\big(\mathbb{E}_{X \sim \mathcal{D}_j}[X^{\otimes d}]\big) = P\big(m_j^{\lambda_1} \otimes \cdots \otimes m_j^{\lambda_\ell}\big). \tag{13}$$
Finally we put (12) together with (13) to deduce (11).
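As a Monte Carlo sanity check of (11) (the diagonal Gaussian instance and parameters below are our illustrative choices), the off-diagonal parts of the third sample moment should match the corresponding combinations of componentwise moments up to error decaying like $p^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 5, 2, 200_000
w = np.array([0.3, 0.7])
A = rng.normal(size=(n, r))                  # component means a_j (columns)
S = rng.uniform(0.5, 1.5, size=(n, r))       # componentwise standard deviations

labels = rng.choice(r, size=p, p=w)
V = A[:, labels] + S[:, labels] * rng.normal(size=(n, p))

M3 = np.einsum('ia,ja,ka->ijk', V, V, V) / p     # explicit 3rd sample moment (small n only)
m1 = A                                            # m_j^1 for a Gaussian component
m2 = A**2 + S**2                                  # m_j^2 = a_j^2 + sigma_j^2, entrywise

i, j, k = np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing='ij')
distinct = (i != j) & (j != k) & (i != k)

# lambda = (1,1,1): compare off-diagonal of M3 with sum_j w_j m_j^1 (x) m_j^1 (x) m_j^1
T111 = np.einsum('j,aj,bj,cj->abc', w, m1, m1, m1)
print(np.abs(M3 - T111)[distinct].max())          # small (Monte Carlo error ~ p**-0.5)

# lambda = (2,1): compare entries M3[i,i,k], i != k, with sum_j w_j m_j^2 (x) m_j^1
idx = np.arange(n)
T21 = np.einsum('j,aj,bj->ab', w, m2, m1)
print(np.abs(M3[idx, idx, :] - T21)[~np.eye(n, dtype=bool)].max())
```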