Distributed-Memory Randomized Algorithms for Sparse Tensor
CP Decomposition
Vivek Bharadwaj
University of California, Berkeley
Berkeley, CA, USA
vivek_bharadwaj@berkeley.edu
Osman Asif Malik∗
Encube Technologies
Stockholm, Sweden
osman@getencube.com
Riley Murray†
Sandia National Laboratories
Albuquerque, NM, USA
rjmurr@sandia.gov
Aydın Buluç
Lawrence Berkeley National
Laboratory
Berkeley, CA, USA
abuluc@lbl.gov
James Demmel
University of California, Berkeley
Berkeley, CA, USA
demmel@berkeley.edu
ABSTRACT
Candecomp / PARAFAC (CP) decomposition, a generalization of the matrix singular value decomposition to higher-dimensional tensors, is a popular tool for analyzing multidimensional sparse data. On tensors with billions of nonzero entries, computing a CP decomposition is a computationally intensive task. We propose the first distributed-memory implementations of two randomized CP decomposition algorithms, CP-ARLS-LEV and STS-CP, that offer nearly an order-of-magnitude speedup at high decomposition ranks over well-tuned non-randomized decomposition packages. Both algorithms rely on leverage score sampling and enjoy strong theoretical guarantees, each with varying time and accuracy tradeoffs. We tailor the communication schedule for our random sampling algorithms, eliminating expensive reduction collectives and forcing communication costs to scale with the random sample count. Finally, we optimize the local storage format for our methods, switching between analogues of compressed sparse column and compressed sparse row formats. Experiments show that our methods are fast and scalable, producing an 11x speedup over SPLATT by decomposing the billion-scale Reddit tensor on 512 CPU cores in under two minutes.
CCS CONCEPTS
• Theory of computation → Massively parallel algorithms; Random projections and metric embeddings.
KEYWORDS
Sparse Tensors, CP Decomposition, Randomized Linear Algebra,
Leverage Score Sampling
∗ Work completed while author was affiliated with Lawrence Berkeley National Laboratory.
† Work completed while author was affiliated with Lawrence Berkeley National Laboratory, International Computer Science Institute, and University of California, Berkeley.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SPAA ’24, June 17–21, 2024, Nantes, France
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0416-1/24/06.
https://doi.org/10.1145/3626183.3659980
ACM Reference Format:
Vivek Bharadwaj, Osman Asif Malik, Riley Murray, Aydın Buluç, and James Demmel. 2024. Distributed-Memory Randomized Algorithms for Sparse Tensor CP Decomposition. In Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA ’24), June 17–21, 2024, Nantes, France. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3626183.3659980
1 INTRODUCTION
Figure 1: A subset of entries from the 3D Amazon Review sparse tensor [1] and its illustrated CP decomposition. The sparse tensor is approximated by a sum of 3D outer products of vectors, which are columns of factor matrices $U_1$, $U_2$, and $U_3$. Outer products are scaled by elements of $\sigma$.
Randomized algorithms for numerical linear algebra have become increasingly popular in the past decade, but their distributed-memory communication characteristics and scaling properties have received less attention. In this work, we examine randomized algorithms to compute the Candecomp / PARAFAC (CP) decomposition, a generalization of the matrix singular-value decomposition to a number of modes $N > 2$. Given a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a target rank $R$, the goal of CP decomposition (illustrated in Figure 1) is to find a set of factor matrices $U_1, \ldots, U_N$, $U_j \in \mathbb{R}^{I_j \times R}$, with unit-norm columns and a nonnegative vector $\sigma \in \mathbb{R}^R$ satisfying

$$\mathcal{T}[i_1, \ldots, i_N] \approx \sum_{r=1}^{R} \sigma[r]\, U_1[i_1, r] \cdots U_N[i_N, r]. \qquad (1)$$
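As a minimal illustration of Equation (1) (our own variable names, not the paper's code), the snippet below evaluates the right-hand side at a single index tuple of a small 3D example with unit-norm factor columns.

import numpy as np

# Toy dimensions and rank for a 3D tensor.
I1, I2, I3, R = 4, 5, 6, 3
rng = np.random.default_rng(0)

# Factor matrices with unit-norm columns and a nonnegative scaling vector.
U = [rng.standard_normal((I, R)) for I in (I1, I2, I3)]
U = [M / np.linalg.norm(M, axis=0) for M in U]
sigma = rng.random(R)

def cp_entry(i1, i2, i3):
    """Right-hand side of Equation (1) at index (i1, i2, i3)."""
    return float(np.sum(sigma * U[0][i1] * U[1][i2] * U[2][i3]))

print(cp_entry(1, 2, 3))   # approximation of T[1, 2, 3]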
Figure 2: Running maximum accuracy over time for SPLATT, a state-of-the-art distributed CP decomposition software package, and our randomized algorithms on the Reddit tensor, target rank $R = 100$, on 512 CPU cores. Curves are averages of 5 trials, 80 ALS rounds.
We consider real sparse tensors $\mathcal{T}$ with $N \geq 3$, all entries known, and billions of nonzero entries. Sparse tensors are a flexible abstraction for a variety of data, such as network traffic logs [2], text corpora [1], and knowledge graphs [3].
1.1 Motivation
Why is a low-rank approximation of a sparse tensor useful? We can view the sparse CP decomposition as an extension of well-studied sparse matrix factorization methods, which can mine patterns from large datasets [4]. Each row of the CP factors is a dense embedding vector for an index $i_j \in [I_j]$, $1 \leq j \leq N$. Because each embedding is a small dense vector while the input tensor is sparse, sparse tensor CP decomposition may incur high relative error with respect to the input and rarely captures the tensor sparsity structure exactly. Nevertheless, the learned embeddings contain valuable information. CP factor matrices have been successfully used to identify patterns in social networks [5, 6], detect anomalies in packet traces [2], and monitor trends in internal network traffic [7]. As we discuss below, a wealth of software packages exist to meet the demand for sparse tensor decomposition.
One of the most popular methods for computing a sparse CP decomposition, the Alternating-Least-Squares (ALS) algorithm, involves repeatedly solving large, overdetermined linear least-squares problems with structured design matrices [8]. High-performance libraries DFacTo [9], SPLATT [10], HyperTensor [11], and BigTensor [12] distribute these expensive computations to a cluster of processors that communicate through an interconnect. Separately, several works use randomized sampling methods to accelerate the least-squares solves, with prototypes implemented in a shared-memory setting [6, 13–15]. These randomized algorithms have strong theoretical guarantees and offer significant asymptotic advantages over non-randomized ALS. Unfortunately, prototypes of these methods require hours to run [6, 15] and are neither competitive nor scalable compared to existing libraries with distributed-memory parallelism.
1.2 Our Contributions
We propose the first distributed-memory parallel formulations of two randomized algorithms, CP-ARLS-LEV [6] and STS-CP [15], with accuracy identical to their shared-memory prototypes. We then provide implementations of these methods that scale to thousands of CPU cores. We face dual technical challenges to parallel scaling. First, sparse tensor decomposition generally has lower arithmetic intensity (FLOPs / data word communicated between processors) than dense tensor decomposition, since computation scales linearly with the tensor nonzero count. Some sparse tensors exhibit nonzero fractions as low as $4 \times 10^{-10}$ (see Table 4), while the worst-case communication costs for sparse CP decomposition remain identical to the dense tensor case [16]. Second, randomized algorithms can save an order of magnitude in computation over their non-randomized counterparts [17–19], but their inter-processor communication costs remain unaltered unless carefully optimized. Despite these compounding factors that reduce arithmetic intensity, we achieve both speedup and scaling through several key innovations, three of which we highlight:
Novel Distributed-Memory Sampling Procedures. Random sample selection is challenging to implement when the CP factor matrices and sparse tensor are divided among $P$ processors. We introduce two distinct communication-avoiding algorithms for randomized sample selection from the Khatri-Rao product. First, we show how to implement the CP-ARLS-LEV algorithm by computing an independent probability distribution on the factor block row owned by each processor. The resulting distributed algorithm has minimal compute / communication overhead compared to the other phases of CP decomposition. The second algorithm, STS-CP, requires higher sampling time, but achieves lower error by performing random walks on a binary tree for each sample. By distributing leaf nodes uniquely to processors and replicating internal nodes, we give a sampling algorithm with per-processor communication bandwidth scaling as $O(\log P / P)$ (see Table 2).
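The distributed procedures above address how to draw these samples when the factors are partitioned; the underlying serial primitive is leverage-score row sampling from the Khatri-Rao product. The sketch below (our own helper names, not the paper's code) illustrates that primitive on a small problem by forming the Khatri-Rao product and its exact leverage scores explicitly, which is precisely the cost CP-ARLS-LEV and STS-CP are designed to avoid.

import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: row (i * J + j) equals A[i, :] * B[j, :].
    (I, R), (J, _) = A.shape, B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def leverage_scores(M):
    # Leverage score of row i: squared norm of row i of an orthonormal basis
    # for the column space of M. Scores sum to the rank of M.
    Q, _ = np.linalg.qr(M)
    return np.sum(Q * Q, axis=1)

rng = np.random.default_rng(0)
U1 = rng.standard_normal((50, 8))
U3 = rng.standard_normal((60, 8))

K = khatri_rao(U3, U1)                    # design matrix for the U2 update
p = leverage_scores(K)
p /= p.sum()                              # row-sampling distribution
J_samples = 500
rows = rng.choice(K.shape[0], size=J_samples, replace=True, p=p)
SK = K[rows] / np.sqrt(J_samples * p[rows])[:, None]   # rescaled sampled rows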
Communication-Optimized MTTKRP. We show that communication-optimal schedules for non-randomized ALS may exhibit disproportionately high communication costs for randomized algorithms. To combat this, we use an "accumulator-stationary" schedule that eliminates expensive Reduce-scatter collectives, causing all communication costs to scale with the number of random samples taken. This alternate schedule significantly reduces communication on tensors with large dimensions (Figure 8) and empirically improves the computational load balance (Figure 11).
Local Tensor Storage Format. Existing storage formats developed for sparse CP decomposition [10, 20] are not optimized for random access into the sparse tensor, which our algorithms require. In response, we use a modified compressed-sparse-column format to store each matricization of our tensor, allowing efficient selection of nonzero entries by our random sampling algorithms. We then transform the selected nonzero entries into compressed sparse row format, which eliminates shared-memory data races in the subsequent sparse-dense matrix multiplication. The cost of the transposition is justified and provides a roughly 1.7x speedup over using atomics in a hybrid OpenMP / MPI implementation.
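As a rough shared-memory analogue of this format switch (using SciPy sparse matrices as stand-ins for the paper's custom data structures, with names of our choosing), the sketch below selects sampled columns from a CSC matricization and converts the selection to CSR before the sparse-dense multiply, so that threads splitting the output by rows would not contend on the same entries.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
# Stand-in for one matricization of a sparse tensor (tiny sizes for illustration):
# rows are mode-j indices, columns enumerate the remaining modes.
mat_T = sp.random(1000, 5000, density=1e-3, format="csc", random_state=0)

sampled_cols = rng.choice(mat_T.shape[1], size=64, replace=False)
selected = mat_T[:, sampled_cols]        # column selection is cheap in CSC
selected_csr = selected.tocsr()          # row-major layout for the multiply:
                                         # each output row is owned by one thread
sampled_KR_rows = rng.standard_normal((64, 25))   # sampled Khatri-Rao rows (dense)
mttkrp_sample = selected_csr @ sampled_KR_rows    # sparse-dense matrix product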
Our distributed-memory randomized algorithms have significant advantages over existing libraries while preserving the accuracy of the final approximation. As Figure 2 shows, our method d-STS-CP computes a rank-100 decomposition of the Reddit tensor (4.7 billion nonzero entries) with an 11x speedup over SPLATT, a state-of-the-art distributed-memory CP decomposition package. The reported speedup was achieved on 512 CPU cores, with a final fit within 0.8% of non-randomized ALS for the same iteration count. While the distributed algorithm d-CP-ARLS-LEV achieves a lower final accuracy, it makes progress faster than SPLATT and spends less time on sampling (completing 80 rounds in an average of 81 seconds). We demonstrate that it is well-suited to smaller tensors and lower target ranks.

Figure 3: Top: the linear least-squares problem to optimize factor matrix $U_2$ during the ALS algorithm for a 3D tensor (column dimension of $\mathrm{mat}(\mathcal{T}, 2)$ not to scale), $\min_{U_2} \lVert (U_3 \odot U_1)\, U_2^\top - \mathrm{mat}(\mathcal{T}, 2)^\top \rVert_F$. Middle: the exact solution to the problem using the Matricized Tensor Times Khatri-Rao Product (MTTKRP), $\hat{U}_2 := \mathrm{mat}(\mathcal{T}, 2) \cdot (U_3 \odot U_1) \cdot G^{+}$. Shaded columns of $\mathrm{mat}(\mathcal{T}, 2)$ and rows of $(U_3 \odot U_1)$ are selected by our random sampling algorithm. Bottom: the downsampled linear least-squares problem after applying random sampling matrix $S$, solved with a sparse-dense matrix multiplication: $\tilde{U}_2 := \mathrm{mat}(\mathcal{T}, 2)\, S^\top \cdot S(U_3 \odot U_1) \cdot \tilde{G}^{+}$.
Symbol                Description
$\mathcal{T}$         Sparse tensor of dimensions $I_1 \times \cdots \times I_N$
$R$                   Target rank of CP decomposition
$U_1, \ldots, U_N$    Dense factor matrices, $U_j \in \mathbb{R}^{I_j \times R}$
$\sigma$              Vector of scaling factors, $\sigma \in \mathbb{R}^R$
$J$                   Sample count for randomized ALS
$\cdot$               Matrix multiplication
$*$                   Elementwise multiplication
$\otimes$             Kronecker product
$\odot$               Khatri-Rao product
$P$                   Total processor count
$P_1, \ldots, P_N$    Dimensions of processor grid, $\prod_i P_i = P$
$U_i^{(p_j)}$         Block row of $U_i$ owned by processor $p_j$
Table 1: Symbol Definitions
2 NOTATION AND PRELIMINARIES
Table 1 summarizes our notation. We use script characters (e.g. $\mathcal{T}$) to denote tensors with at least three modes, capital letters for matrices, and lowercase letters for vectors. Bracketed tuples following any of these objects, e.g. $A[i, j]$, represent indexes into each object, and the symbol ":" in place of any index indicates a slicing of a tensor. We use $\odot$ to denote the Khatri-Rao product, which is a column-wise Kronecker product of a pair of matrices with the same number of columns. For $A \in \mathbb{R}^{I \times R}$, $B \in \mathbb{R}^{J \times R}$, $A \odot B$ produces a matrix of dimensions $(IJ) \times R$ such that for $1 \leq j \leq R$,

$$(A \odot B)[:, j] = A[:, j] \otimes B[:, j].$$
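A two-line check of this definition (illustrative names only): build $A \odot B$ column by column from Kronecker products of the corresponding columns.

import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((5, 4))

# Khatri-Rao product assembled column by column, per the definition above.
KR = np.column_stack([np.kron(A[:, j], B[:, j]) for j in range(A.shape[1])])
assert KR.shape == (A.shape[0] * B.shape[0], A.shape[1])   # (I*J) x R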
Let $\mathcal{T}$ be an $N$-dimensional tensor indexed by tuples $(i_1, \ldots, i_N) \in [I_1] \times \cdots \times [I_N]$, with $\mathrm{nnz}(\mathcal{T})$ as the number of nonzero entries. In this work, sparse tensors are always represented as a collection of $(N+1)$-tuples, with the first $N$ elements giving the indices of a nonzero element and the last element giving the value. We seek a low-rank approximation of $\mathcal{T}$ given by Equation (1), the right-hand side of which we abbreviate as $[\sigma; U_1, \ldots, U_N]$. By convention, each column of $U_1, \ldots, U_N$ has unit norm. Our goal is to minimize the sum of squared differences between our approximation and the provided tensor:

$$\operatorname*{argmin}_{\sigma, U_1, \ldots, U_N} \big\lVert [\sigma; U_1, \ldots, U_N] - \mathcal{T} \big\rVert_F^2. \qquad (2)$$
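For illustration only, the sketch below evaluates objective (2) on a tiny synthetic example, storing the sparse tensor as the $(N+1)$-tuples described above and forming the approximation densely (feasible only at toy sizes); all names are ours.

import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, R = 6, 7, 8, 4
U = [rng.standard_normal((I, R)) for I in (I1, I2, I3)]
U = [M / np.linalg.norm(M, axis=0) for M in U]     # unit-norm columns
sigma = rng.random(R)

# Sparse tensor stored as (N + 1)-tuples: (i1, i2, i3, value).
nnz = 30
coords = np.stack([rng.integers(0, I, nnz) for I in (I1, I2, I3)], axis=1)
vals = rng.standard_normal(nnz)

# Objective (2): squared Frobenius error between [sigma; U1, U2, U3] and T.
approx = np.einsum('r,ir,jr,kr->ijk', sigma, U[0], U[1], U[2])
T_dense = np.zeros((I1, I2, I3))
T_dense[coords[:, 0], coords[:, 1], coords[:, 2]] = vals
squared_error = np.linalg.norm(approx - T_dense) ** 2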
2.1 Non-Randomized ALS CP Decomposition
Minimizing Equation (2) jointly over $U_1, \ldots, U_N$ is still a non-convex problem (the vector $\sigma$ can be computed directly from the factor matrices by renormalizing each column). Alternating least squares is a popular heuristic algorithm that iteratively drives down the approximation error. The algorithm begins with a set of random factor matrices and optimizes the approximation in rounds, each involving $N$ subproblems. The $j$-th subproblem in a round holds all factor matrices but $U_j$ constant and solves for a new matrix $\hat{U}_j$ minimizing the squared Frobenius norm error [8]. The updated matrix $\hat{U}_j$ is the solution to the overdetermined linear least-squares problem

$$\hat{U}_j := \operatorname*{argmin}_X \big\lVert U_{\neq j} \cdot X^\top - \mathrm{mat}(\mathcal{T}, j)^\top \big\rVert_F, \qquad (3)$$

where $U_{\neq j}$ denotes the Khatri-Rao product of all factor matrices except $U_j$ and $\mathrm{mat}(\mathcal{T}, j)$ is the mode-$j$ matricization of $\mathcal{T}$.
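To make the subproblem concrete, here is a minimal, non-distributed NumPy sketch (our own variable names, with a small dense tensor standing in for $\mathcal{T}$) that solves Equation (3) for $U_2$ of a 3D tensor in two equivalent ways: with a generic least-squares solver, and via the MTTKRP followed by the Gram-matrix pseudoinverse, as in the middle panel of Figure 3.

import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, R = 5, 6, 7, 3
T = rng.standard_normal((I1, I2, I3))    # toy dense stand-in for the tensor
U1, U3 = rng.standard_normal((I1, R)), rng.standard_normal((I3, R))

# Mode-2 unfolding; the column order (i3 slow, i1 fast) is chosen here to
# match the row order of the Khatri-Rao product built below.
mat_T2 = np.transpose(T, (1, 2, 0)).reshape(I2, I3 * I1)

# Khatri-Rao design matrix U3 (kr) U1, rows indexed by (i3, i1).
KR = (U3[:, None, :] * U1[None, :, :]).reshape(I3 * I1, R)

# Generic solve of Equation (3).
U2_lstsq = np.linalg.lstsq(KR, mat_T2.T, rcond=None)[0].T

# Equivalent solve via the MTTKRP and the Gram-matrix pseudoinverse.
mttkrp = mat_T2 @ KR                      # I2 x R
G = (U1.T @ U1) * (U3.T @ U3)             # Gram matrix of KR, computed cheaply
U2_exact = mttkrp @ np.linalg.pinv(G)

assert np.allclose(U2_lstsq, U2_exact)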