
Figure 2: Running maximum accuracy over time for SPLATT, a state-of-the-art distributed CP decomposition software package, and our randomized algorithms on the Reddit tensor, target rank 𝑅 = 100, on 512 CPU cores. Curves are averages of 5 trials, 80 ALS rounds.
We consider real sparse tensors T with 𝑁 ≥ 3, all entries known, and billions of nonzero entries. Sparse tensors are a flexible abstraction for a variety of data, such as network traffic logs [2], text corpora [1], and knowledge graphs [3].
1.1 Motivation
Why is a low-rank approximation of a sparse tensor useful? We can view the sparse CP decomposition as an extension of well-studied sparse matrix factorization methods, which can mine patterns from large datasets [4]. Each row of the CP factors is a dense embedding vector for an index 𝑖𝑗 ∈ 𝐼𝑗, 1 ≤ 𝑗 ≤ 𝑁. Because each embedding is a small dense vector while the input tensor is sparse, sparse tensor CP decomposition may incur high relative error with respect to the input and rarely captures the tensor sparsity structure exactly. Nevertheless, the learned embeddings contain valuable information. CP factor matrices have been successfully used to identify patterns in social networks [5, 6], detect anomalies in packet traces [2], and monitor trends in internal network traffic [7]. As we discuss below, a wealth of software packages exists to meet the demand for sparse tensor decomposition.
One of the most popular methods for computing a sparse CP decomposition, the Alternating-Least-Squares (ALS) algorithm, involves repeatedly solving large, overdetermined linear least-squares problems with structured design matrices [8]. High-performance libraries DFacto [9], SPLATT [10], HyperTensor [11], and BigTensor [12] distribute these expensive computations to a cluster of processors that communicate through an interconnect. Separately, several works use randomized sampling methods to accelerate the least-squares solves, with prototypes implemented in a shared-memory setting [6, 13–15]. These randomized algorithms have strong theoretical guarantees and offer significant asymptotic advantages over non-randomized ALS. Unfortunately, prototypes of these methods require hours to run [6, 15] and are neither competitive nor scalable compared to existing libraries with distributed-memory parallelism.
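To make the structure of these least-squares problems concrete, the following minimal NumPy sketch (an illustration of ours, not code from any of the cited libraries) refits one factor of a 3-way CP model against the Khatri-Rao product of the other two, first exactly and then with a crude uniform row sample standing in for the leverage-score sampling that the randomized methods actually use.

# Minimal sketch of the structured least-squares solve inside one ALS
# update for a 3-way CP model: factor U1 is refit against the Khatri-Rao
# product of U2 and U3. Shapes and the uniform sampler are illustrative.
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: an (I2*I3) x R design matrix."""
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

rng = np.random.default_rng(0)
I1, I2, I3, R = 50, 40, 30, 8
U2, U3 = rng.standard_normal((I2, R)), rng.standard_normal((I3, R))
T_unfold = rng.standard_normal((I1, I2 * I3))   # mode-1 unfolding of a dense toy tensor

# Non-randomized ALS update: overdetermined least squares whose design
# matrix is the Khatri-Rao product of the other factors.
K = khatri_rao(U2, U3)                          # (I2*I3) x R
U1_new = np.linalg.lstsq(K, T_unfold.T, rcond=None)[0].T

# Randomized variant: solve on a subset of rows. Uniform sampling is shown
# only for illustration; CP-ARLS-LEV and STS-CP instead sample according to
# (approximate) leverage scores with appropriate reweighting.
S = rng.choice(I2 * I3, size=2000, replace=True)
U1_rand = np.linalg.lstsq(K[S], T_unfold[:, S].T, rcond=None)[0].T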
1.2 Our Contributions
We propose the first distributed-memory parallel formulations of two randomized algorithms, CP-ARLS-LEV [6] and STS-CP [15], with accuracy identical to their shared-memory prototypes. We then provide implementations of these methods that scale to thousands of CPU cores. We face dual technical challenges to parallel scaling. First, sparse tensor decomposition generally has lower arithmetic intensity (FLOPs / data word communicated between processors) than dense tensor decomposition, since computation scales linearly with the tensor nonzero count. Some sparse tensors exhibit nonzero fractions as low as 4 × 10⁻¹⁰ (see Table 4), while the worst-case communication costs for sparse CP decomposition remain identical to the dense tensor case [16]. Second, randomized algorithms can save an order of magnitude in computation over their non-randomized counterparts [17–19], but their inter-processor communication costs remain unaltered unless carefully optimized. Despite these compounding factors that reduce arithmetic intensity, we achieve both speedup and scaling through several key innovations, three of which we highlight:
Novel Distributed-Memory Sampling Procedures. Random sample selection is challenging to implement when the CP factor matrices and sparse tensor are divided among 𝑃 processors. We introduce two distinct communication-avoiding algorithms for randomized sample selection from the Khatri-Rao product. First, we show how to implement the CP-ARLS-LEV algorithm by computing an independent probability distribution on the factor block row owned by each processor. The resulting distributed algorithm has minimal compute / communication overhead compared to the other phases of CP decomposition. The second algorithm, STS-CP, requires higher sampling time, but achieves lower error by performing random walks on a binary tree for each sample. By distributing leaf nodes uniquely to processors and replicating internal nodes, we give a sampling algorithm with per-processor communication bandwidth scaling as 𝑂(log 𝑃/𝑃) (see Table 2).
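The single-process Python sketch below conveys the flavor of the first idea under simplifying assumptions: a list of blocks stands in for the 𝑃 ranks, leverage scores are computed locally from a replicated Gram matrix, and only per-block score totals would need to be exchanged. It is a schematic illustration, not the MPI implementation described later.

# Schematic sketch of per-processor leverage-score sampling in the spirit
# of CP-ARLS-LEV. Each "processor" owns a block of factor rows, scores them
# locally, and only one scalar per block (the score total) is shared.
import numpy as np

def local_leverage_scores(U_block, gram_inv):
    # Leverage score of row u is u @ (U^T U)^{-1} @ u^T.
    return np.einsum("ij,jk,ik->i", U_block, gram_inv, U_block)

rng = np.random.default_rng(1)
P, rows_per_proc, R, n_samples = 4, 100, 8, 32

# Factor matrix distributed as P row blocks; the small R x R Gram matrix
# U^T U can be replicated on every processor (an Allreduce in a real run).
blocks = [rng.standard_normal((rows_per_proc, R)) for _ in range(P)]
gram = sum(B.T @ B for B in blocks)
gram_inv = np.linalg.inv(gram)

scores = [local_leverage_scores(B, gram_inv) for B in blocks]
totals = np.array([s.sum() for s in scores])    # one scalar per processor

# Stage 1: split the sample budget across processors in proportion to the
# per-block totals. Stage 2: each processor draws its own rows from its
# local distribution with no further communication.
counts = rng.multinomial(n_samples, totals / totals.sum())
samples = [
    (p, rng.choice(rows_per_proc, size=c, p=scores[p] / scores[p].sum()))
    for p, c in enumerate(counts) if c > 0
]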
Communication-Optimized MTTKRP. We show that communication-optimal schedules for non-randomized ALS may exhibit disproportionately high communication costs for randomized algorithms. To combat this, we use an “accumulator-stationary” schedule that eliminates expensive Reduce-scatter collectives, causing all communication costs to scale with the number of random samples taken. This alternate schedule significantly reduces communication on tensors with large dimensions (Figure 8) and empirically improves the computational load balance (Figure 11).
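As a rough structural illustration (a single-process NumPy stand-in, not the actual MPI schedule or data layout), the sketch below keeps each rank's block of the output factor stationary and accumulates sampled contributions into it locally. Because every output row has exactly one owner, no Reduce-scatter of the output is needed, and the data that would move in a real run is proportional to the number of sampled rows rather than to the tensor dimension.

# Structural sketch of an "accumulator-stationary" sampled MTTKRP.
# One process simulates P ranks; sizes are toy values for illustration.
import numpy as np

rng = np.random.default_rng(2)
P, I, R, J = 4, 1000, 8, 64            # ranks, mode size, CP rank, samples
row_owner = np.arange(I) % P           # block-cyclic owner of each output row

# Sampled design-matrix (Khatri-Rao) rows and a toy batch of sampled tensor
# nonzeros as (output_row, sample_id, value) triples; in a real run each
# rank would receive only the samples touching its own output rows.
K_sampled = rng.standard_normal((J, R))
nnz = [(rng.integers(I), rng.integers(J), rng.standard_normal())
       for _ in range(500)]

# Each rank owns a block of the accumulator and updates it purely locally.
accum = {p: np.zeros((I // P, R)) for p in range(P)}
for i, s, v in nnz:
    p, local = row_owner[i], i // P
    accum[p][local] += v * K_sampled[s]   # local accumulation, no reduction

# The full sampled MTTKRP result is the union of disjoint owned blocks.
mttkrp = np.zeros((I, R))
for p in range(P):
    mttkrp[p::P] = accum[p]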
Local Tensor Storage Format. Existing storage formats developed for sparse CP decomposition [10, 20] are not optimized for random access into the sparse tensor, which our algorithms require. In response, we use a modified compressed-sparse-column format to store each matricization of our tensor, allowing efficient selection of nonzero entries by our random sampling algorithms. We then transform the selected nonzero entries into compressed sparse row format, which eliminates shared-memory data races in the subsequent sparse-dense matrix multiplication. The cost of this transposition is justified: it provides a roughly 1.7× speedup over using atomics in a hybrid OpenMP / MPI implementation.
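The short SciPy sketch below illustrates the access pattern with library formats standing in for our custom layout (it is not our storage code): a column-oriented matricization supports cheap selection of the sampled columns, and converting just those selected nonzeros to CSR lets the subsequent sparse-dense multiply be parallelized over output rows without atomics.

# Illustration with SciPy formats (a stand-in for the modified CSC layout):
# select sampled columns from a column-oriented matricization, convert the
# selection to CSR once, then multiply. In CSR each output row is written
# by a single thread, so the multiply needs no atomic updates.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
I, J, R, nnz_count = 10_000, 2_000, 64, 50_000

# Column-oriented matricized tensor (fast random access to sampled columns).
T_csc = sp.random(I, J, density=nnz_count / (I * J), format="csc",
                  random_state=3)

# Select the sampled Khatri-Rao rows (columns of the matricization), then
# transpose the selected nonzeros into CSR for a race-free row-parallel multiply.
samples = rng.choice(J, size=256, replace=False)
T_sel_csr = T_csc[:, samples].tocsr()

K_sampled = rng.standard_normal((len(samples), R))
mttkrp_contrib = T_sel_csr @ K_sampled     # dense (I x R) contribution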
Our distributed-memory randomized algorithms have significant advantages over existing libraries while preserving the accuracy of the final approximation. As Figure 2 shows, our method d-STS-CP computes a rank 100 decomposition of the Reddit tensor (∼4.7