
first cost-based optimizer for a declarative tensor language. We show in Sec. 6 that, due to the rewrite rules, STOREL significantly outperforms both Taco [27] (a tensor algebra compiler) and DuckDB [37] (an optimized relational engine) for several tensor programs, even though their physical execution engines are as good as, or even better than, ours.
Fig. 1 illustrates how STOREL processes the matricized tensor times Khatri-Rao product, MTTKRP [27], $A(i,j) = \sum_{k,l} B(i,k,l) \cdot C(k,j) \cdot D(l,j)$. Fig. 1 (a) shows the tensor program written in our declarative tensor language SDQLite (described in Sec. 3), while (b) shows the Compressed Sparse Row (CSR) memory layout of matrix $C$, which is one of several formats described in [14, 27] (reviewed in Sec. 2). For each matrix or tensor, the user describes its memory layout by writing a storage mapping, also in SDQLite; the storage mapping for the matrix $C$ is shown in Fig. 1 (c), and similar storage mappings need to be defined for $B$ and $D$. To execute the program, the system composes the tensor program with the storage mappings, then chooses an optimal plan using a cost-based optimizer; the optimal plan is shown in Fig. 1 (d). While the plan could be further optimized with a sophisticated schedule (as done by Halide, TVM, and Taco), we currently do not support schedules and simply run the optimal plan directly in Julia.
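To make the running example concrete, the following Python sketch (our illustration, not STOREL's generated Julia code) spells out the MTTKRP computation, first over dense arrays and then iterating over the non-zeros of $C$ stored in CSR form; the array names pos, crd, and val are ours.

```python
import numpy as np

def mttkrp_dense(B, C, D):
    """A(i,j) = sum_{k,l} B(i,k,l) * C(k,j) * D(l,j), written as the
    nested for-loops that a physical plan consists of."""
    m, K, L = B.shape
    n = C.shape[1]
    A = np.zeros((m, n))
    for i in range(m):
        for k in range(K):
            for l in range(L):
                for j in range(n):
                    A[i, j] += B[i, k, l] * C[k, j] * D[l, j]
    return A

def mttkrp_csr(B, pos, crd, val, D, n):
    """Same computation, but C is given in CSR form: the non-zeros of
    row k sit at positions pos[k]..pos[k+1]-1, with their column ids
    in crd and their values in val, so the j loop touches only the
    non-zero entries of row k."""
    m, K, L = B.shape
    A = np.zeros((m, n))
    for i in range(m):
        for k in range(K):
            for p in range(pos[k], pos[k + 1]):
                j = crd[p]
                for l in range(L):
                    A[i, j] += B[i, k, l] * val[p] * D[l, j]
    return A
```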
The main challenge in developing the cost-based optimizer is the right choice of tensor processing language. All query optimizers use the relational algebra as intermediate language. However, we found that a calculus, rather than an algebra, is better suited for optimizing tensors; here calculus refers to a language with explicit use of variables, while algebra refers to a variable-free language. There are two reasons for that. First, the physical plan of a tensor program consists of for-loops with explicit variables: it looks like for i=1:m do for j=1:n do ... rather than $A \bowtie (B \bowtie \cdots)$, and optimizing expressions with variables directly simplifies the generation of the physical plan. Second, the intermediate language for tensor programs needs to support nested collections, which occur in sparse formats like CSR, CSC, and CSF, while standard relational algebra, as well as some recent extensions to linear algebra [23], supports only flat collections. Algebras for nested collections exist, but they tend to be much harder to read than a calculus, making it harder to design and debug optimization rules; e.g., compare the rules in Table 2 of [55] to those in Fig. 3 in our paper. For these reasons, we opted for a calculus-based intermediate language. We are not the first to use a calculus-based intermediate language for query optimization: Worst Case Optimal Join algorithms are also described as nested loops, in effect using a calculus as intermediate language [18, 35, 51].
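To illustrate this last point, here is a textbook generic-join-style sketch (ours, not code from [18, 35, 51]) of a worst-case optimal plan for the triangle query $Q(a,b,c) = R(a,b) \wedge S(b,c) \wedge T(a,c)$: the plan is naturally a nest of loops, one per variable, exactly the shape a calculus expresses directly.

```python
def triangles(R, S, T):
    """Q(a,b,c) :- R(a,b), S(b,c), T(a,c), evaluated generic-join style.
    R, S, T map their first attribute to the set of matching second
    attributes; each quantified variable becomes one loop level."""
    out = []
    for a in R.keys() & T.keys():      # a must have tuples in both R and T
        for b in R[a] & S.keys():      # b must extend R(a,b) and start S(b,c)
            for c in S[b] & T[a]:      # c must close both S(b,c) and T(a,c)
                out.append((a, b, c))
    return out

# e.g. the single triangle 1-2-3:
assert triangles({1: {2}}, {2: {3}}, {1: {3}}) == [(1, 2, 3)]
```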
We describe in this paper a language, called SDQLite, used both for writing tensor programs and for performing optimizations. SDQLite has a syntax that is reminiscent of Unions of Conjunctive Queries, but where $\wedge, \vee, \exists$ are replaced by $*$, $+$, and sum, to which we add let-bindings, and nested dictionaries as data model; our dictionaries are similar to those in SDQL (Semiring-Dictionary Query Language) [42], hence we call our language SDQLite. Our language can express tensor programs in a notation close to mathematics, and can express complex storage mappings corresponding to sophisticated tensor memory layouts, including those described in [14, 27]. Any SDQLite query can easily be converted directly to a physical, nested for-loop plan, because each quantified variable $i, j, \ldots$ becomes directly a for loop over that variable. However, it is more difficult to design an optimizer. For example, Selinger's dynamic programming algorithm for join re-ordering [5, 32] no longer applies, because in a calculus there is no explicit binary join. Instead, our system is entirely rule-based, and the rules must be designed for a calculus rather than an algebra. We designed a suite of 44 SDQLite rewrite rules, and use the equality saturation system Egg [54] as rewrite engine. Egg uses a compact data structure called an e-graph to represent a large number of equivalent expressions as a graph. However, like most term rewriting systems, Egg does not understand variables in rules. For our optimizer, we developed a variant of the De Bruijn index that removes the need for explicit variable representation.
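To sketch the idea (this is the standard De Bruijn encoding; the variant used in our optimizer differs in details not reproduced here), every bound variable is replaced by the number of binders between its occurrence and the binder that introduced it, so alpha-equivalent terms become syntactically identical and rewrite rules can be matched without tracking names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:            # De Bruijn index: 0 = innermost enclosing binder
    index: int

@dataclass(frozen=True)
class Sum:            # sum(x <- domain) body: binds one anonymous variable
    domain: object
    body: object

@dataclass(frozen=True)
class Mul:
    left: object
    right: object

# sum(i <- B) sum(j <- C) i * j   and   sum(k <- B) sum(l <- C) k * l
# encode to the identical nameless term, so an e-graph can merge them:
t1 = Sum("B", Sum("C", Mul(Var(1), Var(0))))
t2 = Sum("B", Sum("C", Mul(Var(1), Var(0))))
assert t1 == t2       # structural equality; no alpha-renaming needed
```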
One major motivation for our work is that most existing tensor and linear algebra systems in the compilers and HPC communities focus on dense data; in contrast, our focus in this work is on sparse data. The reason for the traditional focus on dense data is that Linear Algebra packages were originally developed for use in Physics and Engineering, where tensors are dense, and they support highly optimized kernels for specific operations on dense data. Support in these packages for sparse data is rare.² TACO [27] was the first system to recognize the need to optimize tensor programs over sparse data; our work falls into the same category.
In summary, we make the following contributions in this paper:
• We describe the architecture of STOREL, where tensor programs and tensor storage mappings are defined in a common language, and optimized jointly (Sec. 3).
• We describe a declarative tensor calculus, SDQLite, for both tensor programs and storage mappings, and show that it can express a rich variety of previously proposed storage formats, and beyond (Sec. 4).
• We describe a cost-based optimizer for the tensor calculus, which supports a rich suite of optimizations, such as factorization, loop fusion, and join reordering (Sec. 5).
• Finally, we conduct an experimental evaluation showing that STOREL can significantly outperform other tensor processing systems, by using a cost-based optimizer to choose the best plan for the given storage representation (Sec. 6).
2 BACKGROUND
Tensors
Given a number $n$, we denote by $[n) \stackrel{\mathrm{def}}{=} \{0, 1, 2, \ldots, n-1\}$. Let $d \geq 1$ and let $n_1, n_2, \ldots, n_d$ be natural numbers. A tensor with $d$ dimensions is an element $\boldsymbol{A} \in \mathbb{R}^{[n_1) \times \cdots \times [n_d)}$. A scalar, a vector, and a matrix are tensors with 0, 1, and 2 dimensions respectively. Given $d$ indices, $i_1 \in [n_1), \ldots, i_d \in [n_d)$, we denote by $\boldsymbol{A}(i_1, \ldots, i_d)$ the value of the tensor at those positions; we call each $i_j$ a dimension.
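As a toy illustration of the notation (our example), a matrix with $n_1 = 2$, $n_2 = 3$ is an element of $\mathbb{R}^{[2) \times [3)}$, and $\boldsymbol{A}(i_1, i_2)$ selects a single value:

```python
import numpy as np

# d = 2, n1 = 2, n2 = 3: a matrix, i.e. an element of R^{[2) x [3)}
A = np.array([[9., 0., 7.],
              [5., 0., 0.]])
assert A[0, 2] == 7.0    # A(i1, i2) with i1 in [2) and i2 in [3)
```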
Tensor formats
We briefly review some popular tensor formats, following [14, 27]. A dense representation of a tensor consists of a memory array with $n_1 n_2 \cdots n_d$ elements. The coordinate format, COO, stores only the non-zero elements in an array, and their coordinates in $d$ separate arrays. For example, the dense and COO representations of the vector $\boldsymbol{v} = (9, 0, 7, 5)$ are:
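Spelling the example out (array names are ours), the dense layout keeps all four entries, while COO keeps one values array plus one coordinate array per dimension, here $d = 1$:

```python
dense_v = [9, 0, 7, 5]   # dense: n1 = 4 entries, zeros stored explicitly

coo_crd = [0, 2, 3]      # COO: coordinates of the non-zero entries
coo_val = [9, 7, 5]      #      and the non-zero values themselves
```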
² Cf. GitHub issues #43497 for TensorFlow, #72065 for PyTorch, #4332 for TVM.