
Figure 2: Running maximum accuracy over time for SPLATT, a state-of-the-art distributed CP decomposition software package, and our randomized algorithms on the Reddit tensor, target rank 𝑅 = 100, on 512 CPU cores. Curves are averages of 5 trials, 80 ALS rounds.
We consider real sparse tensors T with 𝑁 ≥ 3, all entries known, and billions of nonzero entries. Sparse tensors are a flexible abstraction for a variety of data, such as network traffic logs [2], text corpora [1], and knowledge graphs [3].
1.1 Motivation
Why is a low-rank approximation of a sparse tensor useful? We can view the sparse CP decomposition as an extension of well-studied sparse matrix factorization methods, which can mine patterns from large datasets [4]. Each row of the CP factors is a dense embedding vector for an index 𝑖𝑗 ∈ 𝐼𝑗, 1 ≤ 𝑗 ≤ 𝑁. Because each embedding is a small dense vector while the input tensor is sparse, sparse tensor CP decomposition may incur high relative error with respect to the input and rarely captures the tensor sparsity structure exactly. Nevertheless, the learned embeddings contain valuable information. CP factor matrices have been successfully used to identify patterns in social networks [5, 6], detect anomalies in packet traces [2], and monitor trends in internal network traffic [7]. As we discuss below, a wealth of software packages exists to meet the demand for sparse tensor decomposition.
One of the most popular methods for computing a sparse CP decomposition, the Alternating-Least-Squares (ALS) algorithm, involves repeatedly solving large, overdetermined linear least-squares problems with structured design matrices [8]. High-performance libraries DFacto [9], SPLATT [10], HyperTensor [11], and BigTensor [12] distribute these expensive computations to a cluster of processors that communicate through an interconnect. Separately, several works use randomized sampling methods to accelerate the least-squares solves, with prototypes implemented in a shared-memory setting [6, 13–15]. These randomized algorithms have strong theoretical guarantees and offer significant asymptotic advantages over non-randomized ALS. Unfortunately, prototypes of these methods require hours to run [6, 15] and are neither competitive nor scalable compared to existing libraries with distributed-memory parallelism.
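To make the structure of these least-squares problems concrete, the following minimal NumPy sketch (an illustration of ours, not code from any of the cited libraries) refits one factor of a 3-way CP model against the Khatri-Rao product of the other two, first exactly and then with a crude uniform row sample standing in for the leverage-score sampling that the randomized methods actually use.

# Minimal sketch of the structured least-squares solve inside one ALS
# update for a 3-way CP model: factor U1 is refit against the Khatri-Rao
# product of U2 and U3. Shapes and the uniform sampler are illustrative.
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: an (I2*I3) x R design matrix."""
    R = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, R)

rng = np.random.default_rng(0)
I1, I2, I3, R = 50, 40, 30, 8
U2, U3 = rng.standard_normal((I2, R)), rng.standard_normal((I3, R))
T_unfold = rng.standard_normal((I1, I2 * I3))   # mode-1 unfolding of a dense toy tensor

# Non-randomized ALS update: overdetermined least squares whose design
# matrix is the Khatri-Rao product of the other factors.
K = khatri_rao(U2, U3)                          # (I2*I3) x R
U1_new = np.linalg.lstsq(K, T_unfold.T, rcond=None)[0].T

# Randomized variant: solve on a subset of rows. Uniform sampling is shown
# only for illustration; CP-ARLS-LEV and STS-CP instead sample according to
# (approximate) leverage scores with appropriate reweighting.
S = rng.choice(I2 * I3, size=2000, replace=True)
U1_rand = np.linalg.lstsq(K[S], T_unfold[:, S].T, rcond=None)[0].T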
1.2 Our Contributions
We propose the first distributed-memory parallel formulations of two randomized algorithms, CP-ARLS-LEV [6] and STS-CP [15], with accuracy identical to their shared-memory prototypes. We then provide implementations of these methods that scale to thousands of CPU cores. We face dual technical challenges to parallel scaling. First, sparse tensor decomposition generally has lower arithmetic intensity (FLOPs / data word communicated between processors) than dense tensor decomposition, since computation scales linearly with the tensor nonzero count. Some sparse tensors exhibit nonzero fractions as low as 4 × 10⁻¹⁰ (see Table 4), while the worst-case communication costs for sparse CP decomposition remain identical to the dense tensor case [16]. Second, randomized algorithms can save an order of magnitude in computation over their non-randomized counterparts [17–19], but their inter-processor communication costs remain unaltered unless carefully optimized. Despite these compounding factors that reduce arithmetic intensity, we achieve both speedup and scaling through several key innovations, three of which we highlight:
Novel Distributed-Memory Sampling Procedures. Random sample selection is challenging to implement when the CP factor matrices and sparse tensor are divided among 𝑃 processors. We introduce two distinct communication-avoiding algorithms for randomized sample selection from the Khatri-Rao product. First, we show how to implement the CP-ARLS-LEV algorithm by computing an independent probability distribution on the factor block row owned by each processor. The resulting distributed algorithm has minimal compute / communication overhead compared to the other phases of CP decomposition. The second algorithm, STS-CP, requires higher sampling time, but achieves lower error by performing random walks on a binary tree for each sample. By distributing leaf nodes uniquely to processors and replicating internal nodes, we give a sampling algorithm with per-processor communication bandwidth scaling as 𝑂(log 𝑃/𝑃) (see Table 2).
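The single-process Python sketch below conveys the flavor of the first idea under simplifying assumptions: a list of blocks stands in for the 𝑃 ranks, leverage scores are computed locally from a replicated Gram matrix, and only per-block score totals would need to be exchanged. It is a schematic illustration, not the MPI implementation described later.

# Schematic sketch of per-processor leverage-score sampling in the spirit
# of CP-ARLS-LEV. Each "processor" owns a block of factor rows, scores them
# locally, and only one scalar per block (the score total) is shared.
import numpy as np

def local_leverage_scores(U_block, gram_inv):
    # Leverage score of row u is u @ (U^T U)^{-1} @ u^T.
    return np.einsum("ij,jk,ik->i", U_block, gram_inv, U_block)

rng = np.random.default_rng(1)
P, rows_per_proc, R, n_samples = 4, 100, 8, 32

# Factor matrix distributed as P row blocks; the small R x R Gram matrix
# U^T U can be replicated on every processor (an Allreduce in a real run).
blocks = [rng.standard_normal((rows_per_proc, R)) for _ in range(P)]
gram = sum(B.T @ B for B in blocks)
gram_inv = np.linalg.inv(gram)

scores = [local_leverage_scores(B, gram_inv) for B in blocks]
totals = np.array([s.sum() for s in scores])    # one scalar per processor

# Stage 1: split the sample budget across processors in proportion to the
# per-block totals. Stage 2: each processor draws its own rows from its
# local distribution with no further communication.
counts = rng.multinomial(n_samples, totals / totals.sum())
samples = [
    (p, rng.choice(rows_per_proc, size=c, p=scores[p] / scores[p].sum()))
    for p, c in enumerate(counts) if c > 0
]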
Communication-Optimized MTTKRP. We show that communication-optimal schedules for non-randomized ALS may exhibit disproportionately high communication costs for randomized algorithms. To combat this, we use an “accumulator-stationary” schedule that eliminates expensive Reduce-scatter collectives, causing all communication costs to scale with the number of random samples taken. This alternate schedule significantly reduces communication on tensors with large dimensions (Figure 8) and empirically improves the computational load balance (Figure 11).
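As a rough structural illustration (a single-process NumPy stand-in, not the actual MPI schedule or data layout), the sketch below keeps each rank's block of the output factor stationary and accumulates sampled contributions into it locally. Because every output row has exactly one owner, no Reduce-scatter of the output is needed, and the data that would move in a real run is proportional to the number of sampled rows rather than to the tensor dimension.

# Structural sketch of an "accumulator-stationary" sampled MTTKRP.
# One process simulates P ranks; sizes are toy values for illustration.
import numpy as np

rng = np.random.default_rng(2)
P, I, R, J = 4, 1000, 8, 64            # ranks, mode size, CP rank, samples
row_owner = np.arange(I) % P           # block-cyclic owner of each output row

# Sampled design-matrix (Khatri-Rao) rows and a toy batch of sampled tensor
# nonzeros as (output_row, sample_id, value) triples; in a real run each
# rank would receive only the samples touching its own output rows.
K_sampled = rng.standard_normal((J, R))
nnz = [(rng.integers(I), rng.integers(J), rng.standard_normal())
       for _ in range(500)]

# Each rank owns a block of the accumulator and updates it purely locally.
accum = {p: np.zeros((I // P, R)) for p in range(P)}
for i, s, v in nnz:
    p, local = row_owner[i], i // P
    accum[p][local] += v * K_sampled[s]   # local accumulation, no reduction

# The full sampled MTTKRP result is the union of disjoint owned blocks.
mttkrp = np.zeros((I, R))
for p in range(P):
    mttkrp[p::P] = accum[p]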
Local Tensor Storage Format. Existing storage formats developed for sparse CP decomposition [10, 20] are not optimized for random access into the sparse tensor, which our algorithms require. In response, we use a modified compressed-sparse-column format to store each matricization of our tensor, allowing efficient selection of nonzero entries by our random sampling algorithms. We then transform the selected nonzero entries into compressed sparse row format, which eliminates shared-memory data races in the subsequent sparse-dense matrix multiplication. The cost of this transposition is justified: it provides a roughly 1.7× speedup over using atomics in a hybrid OpenMP / MPI implementation.
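The short SciPy sketch below illustrates the access pattern with library formats standing in for our custom layout (it is not our storage code): a column-oriented matricization supports cheap selection of the sampled columns, and converting just those selected nonzeros to CSR lets the subsequent sparse-dense multiply be parallelized over output rows without atomics.

# Illustration with SciPy formats (a stand-in for the modified CSC layout):
# select sampled columns from a column-oriented matricization, convert the
# selection to CSR once, then multiply. In CSR each output row is written
# by a single thread, so the multiply needs no atomic updates.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
I, J, R, nnz_count = 10_000, 2_000, 64, 50_000

# Column-oriented matricized tensor (fast random access to sampled columns).
T_csc = sp.random(I, J, density=nnz_count / (I * J), format="csc",
                  random_state=3)

# Select the sampled Khatri-Rao rows (columns of the matricization), then
# transpose the selected nonzeros into CSR for a race-free row-parallel multiply.
samples = rng.choice(J, size=256, replace=False)
T_sel_csr = T_csc[:, samples].tocsr()

K_sampled = rng.standard_normal((len(samples), R))
mttkrp_contrib = T_sel_csr @ K_sampled     # dense (I x R) contribution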
Our distributed-memory randomized algorithms have significant advantages over existing libraries while preserving the accuracy of the final approximation. As Figure 2 shows, our method d-STS-CP computes a rank 100 decomposition of the Reddit tensor (∼4.7