Moment Estimation of Nonparametric Mixture Models Through
Implicit Tensor Decomposition∗

Yifan Zhang†   Joe Kileel‡

arXiv:2210.14386v3 [math.NA] 7 Aug 2023

∗Both authors were supported in part by start-up grants to JK provided by the College of Natural Sciences and Oden Institute at UT Austin. We thank Jonathan Niles-Weed for helpful conversations.
†Oden Institute, University of Texas at Austin, yf.zhang@utexas.edu
‡Department of Mathematics and Oden Institute, University of Texas at Austin, jkileel@math.utexas.edu
Abstract
We present an alternating least squares type numerical optimization scheme to estimate conditionally-independent mixture models in $\mathbb{R}^n$, without parameterizing the distributions. Following the method of moments, we tackle an incomplete tensor decomposition problem to learn the mixing weights and componentwise means. Then we compute the cumulative distribution functions, higher moments and other statistics of the component distributions through linear solves. Crucially for computations in high dimensions, the steep costs associated with high-order tensors are evaded, via the development of efficient tensor-free operations. Numerical experiments demonstrate the competitive performance of the algorithm, and its applicability to many models and applications. Furthermore, we provide theoretical analyses, establishing identifiability from low-order moments of the mixture and guaranteeing local linear convergence of the ALS algorithm.
1 Introduction
Mixture models have been intensively studied for their strong expressive power [37,12,40].
They are a common choice in data science problems when data is believed to come from distinct
subgroups, or when simple distributions such as one Gaussian or one Poisson do not offer a good
fit on their own.
In this paper, we develop new numerical tools for the nonparametric estimation of (finite)
conditionally-independent mixture models in $\mathbb{R}^n$, also known as mixtures of product distributions:
$$\mathcal{D} = \sum_{j=1}^{r} w_j \mathcal{D}_j \quad \text{where} \quad \mathcal{D}_j = \bigotimes_{i=1}^{n} \mathcal{D}_{ij}. \tag{1}$$
Here $r$ is the number of components, and each $\mathcal{D}_j$ is a product of $n$ distributions $\mathcal{D}_{ij}$ ($i = 1, \ldots, n$) on the real line. No parametric assumptions are placed on $\mathcal{D}_{ij}$ except that their moments should
exist. Our new methods are based on tensor decomposition. Our approach is particularly well-
suited for computations in high dimensions.
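To make the product structure concrete, here is a minimal sketch (ours, not from the paper's code) of sampling from a small instance of (1) in which different coordinates follow different one-dimensional families; the families, weights and parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 3, 2, 10_000              # dimension, number of components, sample size (arbitrary)
w = np.array([0.4, 0.6])            # mixing weights w_1, ..., w_r

def sample_component(j, size):
    """Draw `size` samples from D_j = D_1j x D_2j x D_3j; each coordinate has its own 1-D law."""
    x1 = rng.normal(loc=[0.0, 3.0][j], scale=1.0, size=size)   # Gaussian coordinate
    x2 = rng.exponential(scale=[1.0, 0.5][j], size=size)       # exponential coordinate
    x3 = rng.poisson(lam=[2.0, 6.0][j], size=size)             # Poisson coordinate
    return np.stack([x1, x2, x3])                              # shape (n, size)

labels = rng.choice(r, size=p, p=w)           # latent component label of each draw
V = np.empty((n, p))
for j in range(r):
    cols = np.flatnonzero(labels == j)
    V[:, cols] = sample_component(j, cols.size)   # data matrix, one column per sample
```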
1.1 Why This Model?
The nonparametric conditionally-independent mixture model (1) was brought to broader at-
tention by Hall and Zhou [20]. Their motivation came from clinical applications, modeling
heterogeneity of repeated test data from patient groups with different diseases. Later the model
was used to describe heterogeneity in econometrics by Kasahara and Shimotsu [32], who con-
sidered dynamic discrete choice models. By modeling a choice policy as a mixture of Markovian
policies, they reduced their problem to an instance of (1). Heterogeneity analysis using (1) also
appears in auction analysis [25,26] and incomplete information games [51,2]; see the survey
[15]. In statistics, Bonhomme and coworkers [13] considered repeatedly measuring a real-valued
mixture and reduced that situation to an instance of (1). Extending their framework, conditionally-independent mixtures can model repeated measurements of $r$ deterministic objects described by $n$ features, when there is independent measurement noise for each feature. This setting approximates the class averaging problem of cryo-electron microscopy [11]. In computer science, Jain and Oh [29] studied (1) when the $\mathcal{D}_{ij}$ are finite discrete distributions, motivated by connections to crowdsourcing, genetics, and recommendation systems. Chen and Moitra [16] studied (1) to learn stochastic decision trees, where each $\mathcal{D}_{ij}$ is a Bernoulli distribution.
Conditional independence can also be viewed as a reduced model or regularization of the full model. On the one hand, (1) is a universal approximator of densities on $\mathbb{R}^n$; on the other, it reduces the number of parameters and prevents possible over-fitting. For instance, experiments in [30] show that
for many real datasets, diagonal Gaussian mixtures give a comparable or better classification
accuracy compared to full Gaussian mixtures.
1.2 Existing Methods
Many approaches for estimating model (1) have been studied. Hall and Zhou [20] took a nonpara-
metric maximum-likelihood-based approach to learn the component distributions and weights
when $r = 2$. Subsequently in [21], Hall et al. provided another consistent density estimator for (1) when $r = 2$, with errors bounded by $O(p^{-1/2})$. A semi-parametric expectation maximization framework has been developed in a series of works [8,36,14], where $\mathcal{D}_j$ is a superposition of parametrized kernel functions. Bonhomme and coauthors [13] pursued a different approach that applies to repeated 1-D measurements. They learned the map $\phi \mapsto \mathbb{E}_{X \sim \mathcal{D}_j}[\phi(X)]$, instead of the distributions $\mathcal{D}_j$ directly.
Tensor decomposition and method of moments (MoM) based algorithms for (1) have also
received attention. Jain and Oh [29] as well as Chen and Moitra [16] developed moment-
based approaches to learn discrete mixtures. [5] is another example of a MoM approach for
(1) in the discrete case. For continuous distributions and MoM, most attention was devoted
to solving diagonal Gaussian mixtures [24,19,34]. For nonparametric estimation, the works
[30] and [52] used tensor techniques to learn discretizations of Dj. In terms of computational
advances, Sherman and Kolda [46] proposed implicit tensor decomposition to avoid expensive
tensor formation when applying MoM. Subsequently, Pereira and coauthors [42] developed an
implicit MoM algorithm for diagonal and full Gaussian mixture models.
Regarding theory, identifiability has been investigated. This includes identifiability of the
number of components (r), mixing weights, and component distributions. Allman and coauthors
[4] proved that mixing weights and component distributions can be identified from the joint
distribution, assuming $r$ is known. In the large sample limit, this implies identifiability from
data. For learning $r$, recent works are [31,35].
1.3 Our Contributions
We learn the nonparametric mixture model (1) through the method of moments. The number of components $r$ is assumed to be known, or already estimated by an existing method. Given a dataset $V = \{v_1, \ldots, v_p\}$ of size $p$ in $\mathbb{R}^n$ sampled independently from $\mathcal{D}$, we learn the weights $w = \{w_1, \ldots, w_r\}$ and a functional similar to the one in Bonhomme et al.'s paper [13]:
$$g \mapsto \mathbb{E}_{X \sim \mathcal{D}_j}[g(X)], \quad \text{where } g(x) = (g_1(x_1), \ldots, g_n(x_n)),{}^{1} \tag{2}$$
for all $j \in [r]$. We call the output of the functional a general mean. In particular, this includes the component distributions ($g_i(x_i) = 1_{x_i \leq t_i}$), component moments ($g_i(x_i) = x_i^{m_i}$), moment generating functions ($g_i(x_i) = e^{t_i x_i}$), etc.

${}^{1}$The functional applies to functions $g : \mathbb{R}^n \to \mathbb{R}^n$ that are Cartesian products of $g_i : \mathbb{R} \to \mathbb{R}$.
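As a hedged illustration of how such functions $g$ might be supplied in code (the helper names are ours, not the paper's API), each choice below is a Cartesian product of coordinatewise maps $g_i$, recovering respectively component CDF values, componentwise moments, and MGF values.

```python
import numpy as np

def g_cdf(t):
    """g_i(x_i) = 1_{x_i <= t_i}: the general mean is the vector of component CDF values."""
    return lambda x: (np.asarray(x) <= t).astype(float)

def g_moment(m):
    """g_i(x_i) = x_i^{m_i}: the general mean is the vector of componentwise moments."""
    return lambda x: np.asarray(x, dtype=float) ** m

def g_mgf(t):
    """g_i(x_i) = exp(t_i x_i): the general mean is the vector of componentwise MGF values."""
    return lambda x: np.exp(t * np.asarray(x))

# e.g. g = g_cdf(np.zeros(5)) evaluates all five indicators 1_{x_i <= 0} on a vector x in R^5
```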
We propose a novel two-step approach. First (step 1), we learn the mixing weights and component means using tensor decomposition and an alternating least squares (ALS) type algorithm (Algorithms 1 and 6). This exploits special algebraic structure in the conditionally-independent model. Once the weights and means are learned, (step 2) we make an important observation that for any $g$ in (2), the general means $\mathbb{E}_{X \sim \mathcal{D}_j}[g(X)]$ ($j = 1, \ldots, r$) are computed by a linear solve (Algorithm 5).
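The following sketch illustrates one linear relation of this kind that follows from conditional independence: for $i \neq k$, $\mathbb{E}_{\mathcal{D}}[g_i(X_i) X_k] = \sum_j w_j\, \mathbb{E}_{\mathcal{D}_j}[g_i(X_i)]\, (m_j)_k$, so each row of the general-mean matrix solves an overdetermined linear system once the weights and means are known. It is our minimal illustration of the idea, not a transcription of Algorithm 5.

```python
import numpy as np

def general_means_sketch(V, w, M, g):
    """Estimate Y[i, j] ~ E_{X ~ D_j}[g_i(X_i)] by row-wise least squares.

    Uses the conditional-independence relation, for i != k,
        E_D[g_i(X_i) X_k] = sum_j w_j Y[i, j] M[k, j],
    assuming the (n-1) x r coefficient matrices have full column rank.
    V: (n, p) data, w: (r,) weights, M: (n, r) componentwise means, g: map R^n -> R^n.
    """
    n, p = V.shape
    G = np.apply_along_axis(g, 0, V)            # G[i, a] = g_i(V[i, a])
    B = (G @ V.T) / p                           # B[i, k] ~ E_D[g_i(X_i) X_k]
    Y = np.empty((n, M.shape[1]))
    for i in range(n):
        rows = [k for k in range(n) if k != i]  # use only off-diagonal statistics
        C = M[rows, :] * w                      # C[k, j] = w_j M[k, j]
        Y[i], *_ = np.linalg.lstsq(C, B[i, rows], rcond=None)
    return Y

# tiny demo on a diagonal Gaussian instance with the true weights/means plugged in
rng = np.random.default_rng(0)
n, r, p = 6, 2, 100_000
w = np.array([0.4, 0.6]); M = rng.normal(size=(n, r)); S = 0.5 * np.ones((n, r))
lab = rng.choice(r, size=p, p=w)
V = M[:, lab] + S[:, lab] * rng.normal(size=(n, p))
Y_hat = general_means_sketch(V, w, M, lambda x: np.exp(0.3 * x))   # MGF-type choice of g
Y_true = np.exp(0.3 * M + 0.5 * (0.3 * S) ** 2)                    # Gaussian MGF, coordinatewise
print(np.abs(Y_hat - Y_true).max())                                # small; shrinks like p**-0.5
```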
We pay great attention to the efficiency and scalability of our MoM based algorithm, particularly for high dimensions. Our MoM algorithm is implicit in the sense of [46], meaning that no moment tensor is explicitly computed during the estimation. For model (1) in $\mathbb{R}^n$ with sample size $p$, this allows us to achieve an $O(npr + nr^3)$ computational complexity to update the means and weights in ALS in step 1, similar to the per-iteration cost of the EM algorithms. The storage cost is only $O(n(r+p))$ – the same order needed to store the data and computed means. This is a great improvement compared to explicitly computing moment tensors, which requires $O(pn^d)$ computation and $O(n^d)$ storage for the $d$th moment. The cost to solve for the general means (step 2) is $O(npr + nr^3)$ time.
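To make these orders of magnitude concrete (the specific values below are our illustrative choices, not from the paper): with $n = 100$, $d = 3$, $r = 10$ and $p = 10^5$, an explicit third moment tensor has $n^d = 10^6$ entries and costs on the order of $pn^d = 10^{11}$ operations to form, whereas one implicit ALS update costs on the order of $npr + nr^3 = 10^8 + 10^5 \approx 10^8$ operations, with $O(n(r+p)) \approx 10^7$ numbers of storage.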
Compared to other recent tensor-based methods [52,30], it may seem indirect to learn the functional (2) rather than learning the component distributions directly. However, our approach has major advantages for large $n$. In [30], the authors need CP decompositions of 3-D histograms for all marginalizations of the data onto 3 features. This scales cubically in $n$. Additionally, the needed resolution of the histograms can be hard to estimate a priori. In [52], the authors learn the densities by running kernel density estimation. Then they compute a CP decomposition of an explicit $r^n$ tensor, which is exponential in $n$. By contrast, our algorithm has much lower complexity. Moreover, the functional-based approach enables adaptive localization of the component distributions by estimating e.g. their means $a_j$ and standard deviations $\sigma_j$ for $j = 1, \ldots, r$. We can then evaluate the cdf of each component at (say) $a_j \pm k\sigma_j$ for prescribed values of $k$. See a numerical illustration in Section 6.1.
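As a small illustration of this localization step (the helper below is ours), given estimated componentwise means and standard deviations one can lay out the evaluation points $a_j + k\sigma_j$ at which CDF-type choices of $g$ are then evaluated:

```python
import numpy as np

def cdf_grid(a, sigma, ks=(-2, -1, 0, 1, 2)):
    """Evaluation points a_j + k * sigma_j for each coordinate i, component j, offset k.

    a, sigma: (n, r) arrays of estimated componentwise means and standard deviations.
    Returns T of shape (len(ks), n, r); the slice T[m, :, j] gives the thresholds at
    which the CDF-type choice g_i(x_i) = 1_{x_i <= t_i} localizes component j.
    """
    ks = np.asarray(ks, dtype=float)
    return a[None, :, :] + ks[:, None, None] * sigma[None, :, :]
```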
We also have significant theoretical contributions. In Theorem 2.3, we derive a novel system
of coupled CP decompositions, relating the moment tensors of $\mathcal{D}$ to the mixing weights and moments of $\mathcal{D}_j$. This has already seen interest in computational algebraic geometry [3]. In Theorems 2.6 and 2.8, we prove for the model (1) in $\mathbb{R}^n$ that if $d \geq 3$ and $r = O(n^{d/2})$, then the mixing weights and means of $\mathcal{D}_j$ are identifiable from the first $d$ moments of $\mathcal{D}$. In Theorem 2.10, we prove that the general means (2) are identifiable from the weights and means and certain expectations over $\mathcal{D}$. These guarantees are new, and more relevant for MoM than
prior identifiability results which assumed access to the joint distribution (1) [4]. In Theorem 3.9,
we prove a guarantee for our algorithm, namely local convergence of ALS (step 1).
Our last contribution is extensive numerical experiments presented in Section 6. We test
our procedures on various simulated estimation problems as well as real data sets. The tests
demonstrate the scalability of our methods to high dimensions n, and their competitiveness
with leading alternatives. The experiments also showcase the applicability of the model (1) to
different problems in data science. A Python implementation of our algorithms is available at
https://github.com/yifan8/moment_estimation_incomplete_tensor_decomposition.
1.4 Notation
Vectors, matrices and tensors are denoted by lower-case, capital and calligraphic letters respectively, for example by $a$, $A$, $\mathcal{A}$. The 2-norm of a vector and the Frobenius norm of a matrix or tensor are denoted by $\|\cdot\|$. The Euclidean inner product between vectors and the Frobenius inner product between matrices or tensors are denoted by $\langle \cdot, \cdot \rangle$. The operators $\circ$ and $/$ denote entrywise multiplication (or exponentiation) and division respectively. We write the tensor product as $\otimes$, and $\mathbb{R}^n \otimes \cdots \otimes \mathbb{R}^n = \mathbb{R}^{n^d}$. The subspace of order-$d$ symmetric tensors is denoted $S^d(\mathbb{R}^n)$. For a matrix $A$, its $i$th column and $i$th row are $a_i$ and $a^i$, respectively. The probability simplex is denoted $\Delta^{r-1} = \{x \in \mathbb{R}^r_{\geq 0} : \sum_i x_i = 1\}$. The all-ones vector or matrix is $\mathbf{1}$. An
integer partition of $d \in \mathbb{N}$ is a tuple of non-increasing positive integers $\lambda = (\lambda_1, \lambda_2, \ldots)$ such that $\lambda_1 + \lambda_2 + \cdots = d$, denoted $\lambda \vdash d$. We write $\lambda = 1^{b_1} 2^{b_2} \cdots$ if $b_1$ elements are $1$, etc. The length of $\lambda$ is $\ell(\lambda) = b_1 + b_2 + \cdots$.
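For example, the partition $\lambda = (2, 1, 1) \vdash 4$ is written $1^2 2^1$ in this notation, and its length is $\ell(\lambda) = 3$.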
2 Method of Moments and Identifiability
In this section we formulate the MoM framework for solving model (1). We derive moment
equations that we will solve and present important theory on the uniqueness of the equations’
solution.
2.1 Setup
Let $\mathcal{D}$ be a conditionally-independent mixture of $r$ distributions in $\mathbb{R}^n$ as in (1). We assume $r$ is known (or has been estimated). For data we assume we are given a matrix of $p$ independent draws from $\mathcal{D}$:
$$V = (v_1 \,|\, \ldots \,|\, v_p) \in \mathbb{R}^{n \times p}. \tag{3}$$
For the nonparametric estimation of (1) by MoM, our goal is to evaluate the functional
$$g \mapsto \mathbb{E}_{X \sim \mathcal{D}_j}[g(X)], \quad \text{where } g(x) = (g_1(x_1), \ldots, g_n(x_n)), \tag{4}$$
which acts on functions $g$ that are Cartesian products of functions $g_i$. We do this using appropriate expectations over $\mathcal{D}$ approximated by sample averages over $V$.
To this end we use the $d$th moment tensor of $\mathcal{D}$, defined as
$$\mathcal{M}_d = \mathbb{E}_{X \sim \mathcal{D}}[X^{\otimes d}] \in S^d(\mathbb{R}^n). \tag{5}$$
It is approximated by the $d$th sample moment computed from the data matrix $V$:
$$\widehat{\mathcal{M}}_d = \frac{1}{p} \sum_{i=1}^{p} v_i^{\otimes d} \in S^d(\mathbb{R}^n). \tag{6}$$
For functions $g : \mathbb{R}^n \to \mathbb{R}^n$ as in (4), we will also use the expectations
$$\mathbb{E}_{X \sim \mathcal{D}}[g(X) \otimes X^{\otimes (d-1)}] \in \mathbb{R}^{n^d}. \tag{7}$$
These are approximated by the sample averages
$$\frac{1}{p} \sum_{i=1}^{p} g(v_i) \otimes v_i^{\otimes (d-1)} \in \mathbb{R}^{n^d}. \tag{8}$$
In terms of the variables we wish to solve for, let the $d$th moment of component $\mathcal{D}_j$ be
$$m_j^d = \mathbb{E}_{X \sim \mathcal{D}_j}[X^{\circ d}] \in \mathbb{R}^n. \tag{9}$$
For the "general means" in (4), let
$$y_j = \mathbb{E}_{X \sim \mathcal{D}_j}[g(X)] \in \mathbb{R}^n \tag{10}$$
and $Y = (y_1 \,|\, \ldots \,|\, y_r) \in \mathbb{R}^{n \times r}$, when $g$ is fixed and integrable with respect to $\mathcal{D}_j$. We want to solve for (9) (including the componentwise means) from (6), and (10) from (8).
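For small $n$, the sample quantities above can be formed directly; the sketch below (our code, not the paper's) computes the $d = 2$ cases of (6) and (8) as matrix products, and optionally the explicit $d = 3$ case of (6), which the paper's algorithms deliberately avoid in high dimensions.

```python
import numpy as np

def sample_statistics(V, g, with_third_moment=False):
    """Sample versions of (6) and (8) for d = 2 (and optionally d = 3), for small n.

    V: (n, p) data matrix as in (3); g: map R^n -> R^n applied to each column.
    Returns M2_hat ~ (6) with d = 2 and B2_hat ~ (8) with d = 2; optionally also
    M3_hat ~ (6) with d = 3. Forming M3_hat explicitly costs O(p n^3), so this
    explicit route is for illustration only.
    """
    n, p = V.shape
    G = np.apply_along_axis(g, 0, V)        # G[i, a] = g_i(V[i, a])
    M2_hat = (V @ V.T) / p                  # (1/p) sum_i v_i (x) v_i
    B2_hat = (G @ V.T) / p                  # (1/p) sum_i g(v_i) (x) v_i
    if with_third_moment:
        M3_hat = np.einsum('ia,ja,ka->ijk', V, V, V) / p
        return M2_hat, B2_hat, M3_hat
    return M2_hat, B2_hat
```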
Remark 2.1. To achieve scalability of MoM to high dimensions, it is mandatory that we never explicitly form high-dimensional high-order tensors. The costs of forming $n^d$ tensors are prohibitive. We evade them by developing new implicit tensor decomposition methods, motivated by [46,42] and kernel methods. See Sections 3 and 4.
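For intuition, the identity $\langle \widehat{\mathcal{M}}_d, a_1 \otimes \cdots \otimes a_d \rangle = \frac{1}{p}\sum_{i=1}^{p} \prod_{k=1}^{d} \langle v_i, a_k \rangle$ shows why such contractions never require forming (6) explicitly. The sketch below (ours, illustrating the principle rather than the paper's kernel-based routines) checks it against an explicit order-3 tensor on a tiny instance.

```python
import numpy as np

def implicit_contract(V, factors):
    """<M_hat_d, a_1 (x) ... (x) a_d> in O(npd) time, never forming the n^d tensor (6)."""
    p = V.shape[1]
    prods = np.ones(p)
    for a in factors:          # accumulate prod_k <v_i, a_k> for every sample i
        prods *= a @ V         # a @ V has shape (p,): the inner products <v_i, a_k>
    return prods.sum() / p

# sanity check against an explicitly formed order-3 sample moment on a tiny instance
rng = np.random.default_rng(1)
V = rng.normal(size=(4, 1000))
a, b, c = rng.normal(size=(3, 4))
M3 = np.einsum('ia,ja,ka->ijk', V, V, V) / V.shape[1]
explicit = np.einsum('ijk,i,j,k->', M3, a, b, c)
print(np.isclose(explicit, implicit_contract(V, [a, b, c])))   # True
```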
2.2 Systems of Equations
We derive the relationship between the moments $\mathcal{M}_d$ and $\widehat{\mathcal{M}}_d$ of the mixture $\mathcal{D}$ and the moments $m_j^d$ and general means $y_j$ of the components $\mathcal{D}_j$. The relationship is interesting algebraically [3], and forms the basis of our algorithm development.
Definition 2.2. Let $\lambda \vdash d$ be an integer partition. Define a projection operator $P_\lambda : \mathbb{R}^{n^d} \to \mathbb{R}^{n^{\ell(\lambda)}}$, acting on order-$d$ tensors and outputting order-$\ell(\lambda)$ tensors (see Section 1.4 for the definition of $\ell$), via
$$P_\lambda(\mathcal{A})_{i_1, \ldots, i_{\ell(\lambda)}} = \begin{cases} 0 & \text{if } i_1, \ldots, i_{\ell(\lambda)} \text{ are not distinct,} \\ \mathcal{A}_{i_1, \ldots, i_1\, (\lambda_1 \text{ times}),\, i_2, \ldots, i_2\, (\lambda_2 \text{ times}),\, \ldots} & \text{otherwise.} \end{cases}$$
We write $P := P_{(1,\ldots,1)}$ when $\lambda$ consists of all $1$'s.
In particular, $P$ acts by off-diagonal projection, zeroing out all entries with non-distinct indices. Figure 1 illustrates the other projections $P_\lambda$ when $d = 3$.

Figure 1: Illustration of Definition 2.2 when $d = 3$. Left: the entries selected by $P_{(2,1)}$ (green), with indices $iij$, $i \neq j$, resulting in a matrix indexed by $(i, j)$. Right: the entries selected by $P_{(3)}$, with indices $iii$, resulting in a vector indexed by $i$.
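For $d = 3$, Definition 2.2 can be transcribed directly into code; the helper names below are ours.

```python
import numpy as np

def P_111(A):
    """P_(1,1,1): zero out every entry of an order-3 tensor whose indices are not all distinct."""
    n = A.shape[0]
    i, j, k = np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing='ij')
    return np.where((i != j) & (j != k) & (i != k), A, 0.0)

def P_21(A):
    """P_(2,1): the matrix with entries A[i, i, k] for i != k, and zeros on the diagonal."""
    n = A.shape[0]
    idx = np.arange(n)
    out = A[idx, idx, :].copy()     # out[i, k] = A[i, i, k]
    out[idx, idx] = 0.0
    return out

def P_3(A):
    """P_(3): the vector of super-diagonal entries A[i, i, i]."""
    idx = np.arange(A.shape[0])
    return A[idx, idx, idx]
```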
Theorem 2.3. Let $\mathcal{D}$ be a conditionally-independent mixture, and $\mathcal{M}_d$ its $d$th population moment tensor (5). Then for each partition $\lambda = (\lambda_1, \ldots, \lambda_{\ell(\lambda)})$ of $d$,
$$P_\lambda(\mathcal{M}_d) = P\left( \sum_{j=1}^{r} w_j\, m_j^{\lambda_1} \otimes \cdots \otimes m_j^{\lambda_{\ell(\lambda)}} \right). \tag{11}$$
Proof. From (1), (5) and conditioning on the class label $j$,
$$\mathcal{M}_d = \mathbb{E}_{X \sim \mathcal{D}}[X^{\otimes d}] = \sum_{j=1}^{r} w_j\, \mathbb{E}_{X \sim \mathcal{D}_j}[X^{\otimes d}]. \tag{12}$$
Fix the partition $\lambda \vdash d$ and write $\ell = \ell(\lambda)$. We consider an index
$$i = (\underbrace{i_1, \ldots, i_1}_{\lambda_1 \text{ times}}, \ldots, \underbrace{i_\ell, \ldots, i_\ell}_{\lambda_\ell \text{ times}})$$
such that $i_1, i_2, \ldots, i_\ell \in [n]$ are distinct. By coordinatewise independence in $\mathcal{D}_j$,
$$\mathbb{E}_{X \sim \mathcal{D}_j}[X^{\otimes d}]_i = \mathbb{E}_{X \sim \mathcal{D}_j}\big[X_{i_1}^{\lambda_1} \cdots X_{i_\ell}^{\lambda_\ell}\big] = \mathbb{E}_{X \sim \mathcal{D}_j}\big[X_{i_1}^{\lambda_1}\big] \cdots \mathbb{E}_{X \sim \mathcal{D}_j}\big[X_{i_\ell}^{\lambda_\ell}\big] = (m_j^{\lambda_1})_{i_1} \cdots (m_j^{\lambda_\ell})_{i_\ell} = \big(m_j^{\lambda_1} \otimes \cdots \otimes m_j^{\lambda_\ell}\big)_{i_1, i_2, \ldots, i_\ell},$$
where we used the definition (9). Ranging over such indices $i$ gives
$$P_\lambda\big(\mathbb{E}_{X \sim \mathcal{D}_j}[X^{\otimes d}]\big) = P\big(m_j^{\lambda_1} \otimes \cdots \otimes m_j^{\lambda_\ell}\big). \tag{13}$$
Finally we put (12) together with (13) to deduce (11).
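As a Monte Carlo sanity check of (11) (the diagonal Gaussian instance and parameters below are our illustrative choices), the off-diagonal parts of the third sample moment should match the corresponding combinations of componentwise moments up to error decaying like $p^{-1/2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, p = 5, 2, 200_000
w = np.array([0.3, 0.7])
A = rng.normal(size=(n, r))                  # component means a_j (columns)
S = rng.uniform(0.5, 1.5, size=(n, r))       # componentwise standard deviations

labels = rng.choice(r, size=p, p=w)
V = A[:, labels] + S[:, labels] * rng.normal(size=(n, p))

M3 = np.einsum('ia,ja,ka->ijk', V, V, V) / p     # explicit 3rd sample moment (small n only)
m1 = A                                            # m_j^1 for a Gaussian component
m2 = A**2 + S**2                                  # m_j^2 = a_j^2 + sigma_j^2, entrywise

i, j, k = np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing='ij')
distinct = (i != j) & (j != k) & (i != k)

# lambda = (1,1,1): compare off-diagonal of M3 with sum_j w_j m_j^1 (x) m_j^1 (x) m_j^1
T111 = np.einsum('j,aj,bj,cj->abc', w, m1, m1, m1)
print(np.abs(M3 - T111)[distinct].max())          # small (Monte Carlo error ~ p**-0.5)

# lambda = (2,1): compare entries M3[i,i,k], i != k, with sum_j w_j m_j^2 (x) m_j^1
idx = np.arange(n)
T21 = np.einsum('j,aj,bj->ab', w, m2, m1)
print(np.abs(M3[idx, idx, :] - T21)[~np.eye(n, dtype=bool)].max())
```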