We propose a novel two-step approach. First (step 1), we learn the mixing weights and component means using tensor decomposition and an alternating least squares (ALS) type algorithm (Algorithms 1 and 6). This exploits special algebraic structure in the conditionally independent model. Then (step 2), once the weights and means are learned, we observe that for any $g$ in (2), the general means $\mathbb{E}_{X\sim\mathcal{D}_j}[g(X)]$ ($j = 1, \ldots, r$) can be computed by a linear solve (Algorithm 5).
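To convey the idea behind this linear solve, here is an illustrative special case (not the precise system used in Algorithm 5), writing $w_j$ for the mixing weights and $a_j$ for the component means: suppose $g$ depends only on the $k$th coordinate of $X$. By the conditional independence of the coordinates within each component, for every $k' \neq k$,
\[
\mathbb{E}_{X\sim\mathcal{D}}\big[g(X)\,X_{k'}\big]
= \sum_{j=1}^{r} w_j\,\mathbb{E}_{X\sim\mathcal{D}_j}[g(X)]\,(a_j)_{k'}.
\]
Stacking these equations over $k'$, together with $\mathbb{E}_{X\sim\mathcal{D}}[g(X)] = \sum_{j} w_j\,\mathbb{E}_{X\sim\mathcal{D}_j}[g(X)]$, gives a linear system in the $r$ unknowns $\mathbb{E}_{X\sim\mathcal{D}_j}[g(X)]$ whose coefficient matrix is determined by the learned weights and means, and whose left-hand sides are estimated by sample averages.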
We pay great attention to the efficiency and scalability of our MoM-based algorithm, particularly in high dimensions. Our MoM algorithm is implicit in the sense of [46], meaning that no moment tensor is explicitly computed during the estimation. For model (1) in $\mathbb{R}^n$ with sample size $p$, this allows us to achieve an $O(npr + nr^3)$ computational complexity to update the means and weights in ALS in step 1, similar to the per-iteration cost of the EM algorithms. The storage cost is only $O(n(r+p))$, the same order needed to store the data and the computed means. This is a substantial improvement over explicitly computing moment tensors, which requires $O(pn^d)$ computation and $O(n^d)$ storage for the $d$th moment. Solving for the general means (step 2) also takes $O(npr + nr^3)$ time.
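As a hedged illustration of what "implicit" means here (a minimal sketch with our own naming, not the released implementation), the contraction of the empirical $d$th moment tensor against rank-one tensors built from candidate means can be evaluated directly from the data, without ever forming an $n^d$ array:
\begin{verbatim}
import numpy as np

def implicit_moment_contraction(X, A, d):
    """Contract the empirical d-th moment tensor
    M_d = (1/p) sum_i (x_i tensor^d) with (a_j tensor^d) for each column
    a_j of A, without forming M_d.  Uses the identity
    <x tensor^d, a tensor^d> = <x, a>^d.

    X : (p, n) data matrix;  A : (n, r) candidate means;  d : moment order.
    Cost: O(npr) for X @ A plus O(pr) for the entrywise power and mean.
    Returns a length-r vector of contraction values.
    """
    return ((X @ A) ** d).mean(axis=0)
\end{verbatim}
Contractions of this type are the basic building blocks of ALS-type updates against moment tensors, which is why the per-iteration cost scales like $O(npr)$ rather than $O(pn^d)$; the exact residuals used in Algorithms 1 and 6 may differ from this sketch.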
Compared to other recent tensor-based methods [52, 30], it may seem indirect to learn the functional (2) rather than learning the component distributions directly. However, our approach has major advantages for large $n$. In [30], the authors need CP decompositions of 3-D histograms for all marginalizations of the data onto 3 features. This scales cubically in $n$. Additionally, the needed resolution of the histograms can be hard to estimate a priori. In [52], the authors learn the densities by running kernel density estimation. Then they compute a CP decomposition of an explicit $r^n$ tensor, which is exponential in $n$. By contrast, our algorithm has much lower complexity. Moreover, the functional-based approach enables adaptive localization of the component distributions by estimating, e.g., their means $a_j$ and standard deviations $\sigma_j$ for $j = 1, \ldots, r$. We can then evaluate the cdf of each component at (say) $a_j \pm k\sigma_j$ for prescribed values of $k$. See a numerical illustration in Section 6.1.
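For concreteness, here is a minimal sketch of this localization step (the function and variable names are ours, not those of the released code; the per-component second moments are assumed to have been obtained via step 2 as general means with $g(x) = x^2$ applied coordinate-wise):
\begin{verbatim}
import numpy as np

def localization_grid(means, second_moments, ks=(1.0, 2.0, 3.0)):
    """Build per-component evaluation points a_j +/- k*sigma_j.

    means          : (n, r) array, column j is the estimated mean a_j.
    second_moments : (n, r) array of estimated E_{D_j}[X_k^2] per coordinate.
    ks             : multiples of sigma at which to place grid points.
    Returns an array of shape (len(ks), 2, n, r) holding a_j - k*sigma_j
    and a_j + k*sigma_j for every k, coordinate and component.
    """
    sigma = np.sqrt(np.maximum(second_moments - means**2, 0.0))
    ks = np.asarray(ks).reshape(-1, 1, 1)
    return np.stack([means - ks * sigma, means + ks * sigma], axis=1)
\end{verbatim}
Evaluating the component cdfs at these points then amounts to further applications of step 2, with $g$ taken to be indicator functions.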
We also have significant theoretical contributions. In Theorem 2.3, we derive a novel system of coupled CP decompositions relating the moment tensors of $\mathcal{D}$ to the mixing weights and moments of the $\mathcal{D}_j$. This has already seen interest in computational algebraic geometry [3]. In Theorems 2.6 and 2.8, we prove for the model (1) in $\mathbb{R}^n$ that if $d \geq 3$ and $r = O(n^{\lfloor d/2 \rfloor})$, then the mixing weights and means of the $\mathcal{D}_j$ are identifiable from the first $d$ moments of $\mathcal{D}$. In Theorem 2.10, we prove that the general means (2) are identifiable from the weights and means and certain expectations over $\mathcal{D}$. These guarantees are new, and they are more relevant for MoM than prior identifiability results, which assumed access to the joint distribution (1) [4]. In Theorem 3.9, we prove a guarantee for our algorithm, namely local convergence of ALS (step 1).
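To indicate the flavor of the coupling in Theorem 2.3, here is a simplified consequence of the conditional-independence structure (not the full statement of the theorem): for pairwise distinct indices $i_1, \ldots, i_d$,
\[
\big(\mathbb{E}_{X\sim\mathcal{D}}[X^{\otimes d}]\big)_{i_1 \cdots i_d}
= \sum_{j=1}^{r} w_j\,(a_j)_{i_1}\cdots(a_j)_{i_d},
\]
so the off-diagonal part of the $d$th moment tensor admits a rank-$r$ symmetric CP decomposition in the weights and means, while entries with repeated indices also involve higher marginal moments of the $\mathcal{D}_j$, coupling the decompositions across different orders.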
Our last contribution is extensive numerical experiments presented in Section 6. We test
our procedures on various simulated estimation problems as well as real data sets. The tests
demonstrate the scalability of our methods to high dimensions n, and their competitiveness
with leading alternatives. The experiments also showcase the applicability of the model (1) to
different problems in data science. A Python implementation of our algorithms is available at
https://github.com/yifan8/moment_estimation_incomplete_tensor_decomposition.
1.4 Notation
Vectors, matrices and tensors are denoted by lower-case, capital and calligraphic letters respectively, for example by $a$, $A$, $\mathcal{A}$. The 2-norm of a vector and the Frobenius norm of a matrix or tensor are denoted by $\|\cdot\|$. The Euclidean inner product between vectors and the Frobenius inner product between matrices or tensors are denoted by $\langle\cdot,\cdot\rangle$. The operators $*$ and $/$ denote entry-wise multiplication (or exponentiation) and division, respectively. We write the tensor product as $\otimes$, and $\mathbb{R}^n \otimes \cdots \otimes \mathbb{R}^n = \mathbb{R}^{n^d}$ ($d$ factors). The subspace of order-$d$ symmetric tensors is denoted $S^d(\mathbb{R}^n)$. For a matrix $A$, its $i$th column and $i$th row are $a_i$ and $a^i$, respectively. The probability simplex is denoted $\Delta^{r-1} = \{x \in \mathbb{R}^r_{\geq 0} : \sum_i x_i = 1\}$. The all-ones vector or matrix is $\mathbf{1}$. An