Optimizing Tensor Programs on Flexible Storage

Maximilian Schleich, RelationalAI, USA
Amir Shaikhha, University of Edinburgh, United Kingdom
Dan Suciu, University of Washington, USA
ABSTRACT

Tensor programs often need to process large tensors (vectors, matrices, or higher-order tensors) that require a specialized storage format for their memory layout. Several such layouts have been proposed in the literature, such as the Coordinate Format, the Compressed Sparse Row format, and many others, that were especially designed to optimally store tensors with specific sparsity properties. However, existing tensor processing systems require specialized extensions in order to take advantage of every new storage format. In this paper we describe a system that allows users to define flexible storage formats in a declarative tensor query language, similar to the language used by the tensor program. The programmer only needs to write storage mappings, which describe, in a declarative way, how the tensors are laid out in main memory. Then, we describe a cost-based optimizer that optimizes the tensor program for the specific memory layout. We demonstrate empirically significant performance improvements compared to state-of-the-art tensor processing systems.
1 INTRODUCTION

Linear algebra and, more generally, tensor algebra is used in a wide variety of domains, such as science, engineering, machine learning, and data analysis. Tensors are natural generalizations of vectors and matrices from 1 and 2 dimensions to arbitrary dimensions, and highly optimized implementations of tensor algebra operations are available today in several popular libraries, such as SciPy, PyTorch, Julia, TensorFlow, or Matlab. While these libraries are highly optimized for individual operations, compound operations require users to create temporary tensors, which often destroys locality and may even lead to out-of-memory errors when the intermediate results are too large. Such operations are frequently encountered in complex tensor programs, or tensor kernels, terms that we will use interchangeably in this paper.
Several domain-specific languages have been proposed for expressing and optimizing entire tensor programs. Examples include SystemML [10], TVM [12], Halide [39], Taco [27], and TASO [24]. The compiler community has addressed one challenge of the optimization problem, namely the separation of the algorithm from the schedule. This idea was introduced by the Halide language, which was designed for high-performance code generation for image processing pipelines [38, 39]. The programmer writes the algorithm in an imperative, high-level language, and writes separately a schedule, which specifies low-level optimizations, such as tiling, vectorization, or loop unrolling. TVM [12] extends this principle from image processing to tensor processing for general-purpose ML applications.
In this work we are not concerned with schedules, but with a different challenge in tensor processing: optimizing the query plan based on how the tensors are stored in memory. Storage refers in this paper to memory layout, and not to persistent representation. When a tensor is sparse, the programmer has many choices for representing it in main memory, and the best plan for a tensor program varies dramatically depending on what storage was chosen, the statistics of the data (e.g. how sparse or dense), and the particular tensor program. For example, a vector $a(i)$ can be stored as a dense array, or as a hash table indexed by $i$, or as two parallel arrays storing the indices $i$ and the values $a(i)$. When the tensors are dense, then the best plan may be to use a linear algebra library [31, 48], while for sparse tensors a better plan may be to use relational query operators, e.g. hash joins. Tensors can be easily represented as relations, and tensor programs can be expressed as SQL queries [9], but relational engines are not designed to support storage formats specifically optimized for sparse tensors (e.g. the CSR format discussed later).
To address the storage problem, Taco [27] separates the tensor storage format from the tensor program. In the storage format the user can specify separately for each dimension whether it is dense or sparse, and can also order the dimensions, leading to $d! \cdot 2^d$ possible formats for a $d$-dimensional tensor.¹ Given a tensor program, the Taco compiler generates code that optimizes the access to the storage formats. Taco does not perform cost-based optimizations, which means that the programmer still needs to be aware of the storage specification. For example, if the vector $a$ is sparse while $b, c$ are dense, then $a(i) * (b(i) + c(i))$ is best rewritten as $a(i) * b(i) + a(i) * c(i)$, because computing $b(i) + c(i)$ results in a large, dense vector, while $a(i) * b(i)$ is a small, sparse vector, no larger than $a$, and similarly for $a(i) * c(i)$. However, the task of rewriting the expression is left to the programmer.
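To make the cost difference concrete, here is a minimal Python sketch (our illustration, not STOREL code) contrasting the two plans: the unfactorized plan materializes the dense intermediate $b + c$, while the factorized plan touches only the non-zeros of $a$.

# A minimal sketch of why factoring out a sparse vector pays off.
# Vectors b, c are dense Python lists; the sparse vector a is a dict
# from index to value, so iterating over it touches only non-zeros.

def eval_unfactorized(a, b, c, n):
    # a(i) * (b(i) + c(i)): materializes the dense sum b + c first.
    tmp = [b[i] + c[i] for i in range(n)]          # O(n) work and memory
    return {i: v * tmp[i] for i, v in a.items()}

def eval_factorized(a, b, c, n):
    # a(i)*b(i) + a(i)*c(i): only positions where a is non-zero.
    return {i: v * b[i] + v * c[i] for i, v in a.items()}   # O(nnz(a))

n = 8
a = {1: 2.0, 5: 3.0}                  # sparse: 2 non-zeros out of 8
b = [float(i) for i in range(n)]      # dense
c = [1.0] * n                         # dense
assert eval_unfactorized(a, b, c, n) == eval_factorized(a, b, c, n)

Both plans return the same sparse result; only the amount of work and the size of the intermediates differ, which is exactly what a cost-based optimizer must weigh.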
In this paper we propose a rule-based approach to optimizing tensor programs over flexible tensor storage, using a cost-based optimizer. The main novelty in our approach is that the storage descriptors themselves are also defined in the same declarative language as the tensor program. To specify how a tensor is stored, the user writes a storage mapping from the physical data structures (arrays and/or hash tables) to the logical tensor. Our system evaluates the tensor program by first composing it with the storage mappings, then optimizing it using rewrite rules. This improves in two ways over previous systems. First, the storage formats are no longer hard-coded, but the user is free to define their own. For example, users may describe one of the popular storage formats COO, CSR, etc., or define a format optimized for upper-triangular matrices, or for band matrices, or a space-filling curve, etc. There is no bound on the number of storage representations; the only limit is the expressivity of the query language and the power of the optimizer. Second, the optimizer is now able to perform a rich collection of high-level optimizations, such as factorization, loop fusion, or join reordering, and optimize the tensor program specifically for the given storage. For example, the optimizer may consider both expressions $a(i) * (b(i) + c(i))$ and $a(i) * b(i) + a(i) * c(i)$, and choose the optimal one based on their physical storage and data statistics.

To the best of our knowledge, our system, called STOREL, is the first cost-based optimizer for a declarative tensor language. We show in Sec. 6 that, due to the rewrite rules, STOREL significantly outperforms both Taco [27] (a tensor algebra compiler) and DuckDB [37] (an optimized relational engine) for several tensor programs, although their physical execution engines are as good as, or even better than, ours.

¹Taco was later extended to support 6 formats per dimension [14].
Fig. 1 illustrates how STOREL processes the matricized tensor times Khatri-Rao product, MTTKRP [27], $A(i,j) = \sum_{k,l} B(i,k,l) \cdot C(k,j) \cdot D(l,j)$. Fig. 1 (a) shows the tensor program written in our declarative tensor language SDQLite (described in Sec. 3), while (b) shows the Compressed Sparse Row (CSR) memory layout of matrix $C$, which is one of several formats described in [14, 27] (reviewed in Sec. 2). For each matrix or tensor, the user describes its memory layout by writing a storage mapping, also in SDQLite; the storage mapping for the matrix $C$ is shown in Fig. 1 (c), and similar storage mappings need to be defined for $B$ and $D$. To execute the program, the system composes the tensor program with the storage mappings, then chooses an optimal plan using a cost-based optimizer; the optimal plan is shown in Fig. 1 (d). While the plan could be further optimized for some sophisticated schedule (as done by Halide, TVM, and Taco), we currently do not support schedules and simply run the optimal plan directly in Julia.
The main challenge in developing the cost-based optimizer is the right choice of tensor processing language. All query optimizers use the relational algebra as intermediate language. However, we found that a calculus, rather than an algebra, is better suited for optimizing tensors; here calculus refers to a language with explicit use of variables, while algebra refers to a variable-free language. There are two reasons for that. First, the physical plan of a tensor program consists of for-loops with explicit variables. They look like this: for i=1:m do for j=1:n do ... instead of this: $A \bowtie (B \bowtie \cdots)$, and optimizing expressions with variables directly simplifies the generation of the physical plan. Second, the intermediate language for tensor programs needs to support nested collections, which occur in sparse formats like CSR, CSC, and CSF, while standard relational algebra, as well as some recent extensions to linear algebra [23], supports only flat collections. Algebras for nested collections exist, but they tend to be much harder to read than a calculus, making it harder to design and debug optimization rules; e.g. compare the rules in Table 2 of [55] to those in Fig. 3 in our paper. For these reasons, we opted for a calculus-based intermediate language. We are not the first to use a calculus-based intermediate language for query optimization: Worst Case Optimal Join algorithms are also described as nested loops, in effect using a calculus as intermediate language [18, 35, 51].
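To illustrate the calculus-to-loop correspondence, here is a small Python sketch (our illustration, not generated SDQLite output): the expression $y(i) = \sum_j A(i,j) \cdot x(j)$, with explicit variables $i$ and $j$, translates directly into two nested for-loops over a nested dictionary, the same shape of nested collection used by CSR-like formats.

# The quantified variables i and j of the calculus expression become
# the loop variables of the physical plan. A is a nested dictionary
# {i -> {j -> value}} holding only non-zeros.

def matvec(A, x):
    y = {}
    for i, row in A.items():          # variable i becomes a for-loop
        acc = 0.0
        for j, a_ij in row.items():   # variable j becomes a for-loop
            acc += a_ij * x[j]
        y[i] = acc
    return y

A = {0: {0: 6.0, 2: 9.0, 3: 8.0}, 2: {0: 5.0, 3: 7.0}}  # matrix C of Fig. 1(b)
x = [1.0, 1.0, 1.0, 1.0]
print(matvec(A, x))   # {0: 23.0, 2: 12.0}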
We describe in this paper a language, called SDQLite, used both for writing tensor programs and for performing optimizations. SDQLite has a syntax that is reminiscent of Unions of Conjunctive Queries, but where ∧, ∨, ∃ are replaced by *, +, sum, to which we add let-bindings, and nested dictionaries as data model; our dictionaries are similar to those in SDQL (Semiring-Dictionary Query Language) [42], hence we call our language SDQLite. Our language can express tensor programs in a notation close to mathematics, and can express complex storage mappings corresponding to sophisticated tensor memory layouts, including those described in [14, 27]. Any SDQLite query can easily be converted directly to a physical, nested for-loop plan, because each quantified variable $i, j, \ldots$ becomes directly a for loop over that variable. However, it is more difficult to design an optimizer. For example, Selinger's dynamic programming algorithm for join re-ordering [5, 32] no longer applies, because in a calculus there is no explicit binary join. Instead, our system is entirely rule-based, and the rules must be designed for a calculus rather than an algebra. We designed a suite of 44 SDQLite rewrite rules, and use the equality saturation system Egg [54] as rewrite engine. Egg uses a compact data structure called an e-graph to represent a large number of equivalent expressions as a graph. However, like most term rewriting systems, Egg does not understand variables in rules. For our optimizer, we developed a variant of the De Bruijn index that removes the need for explicit variable representation.
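As an illustration of the idea (a toy sketch under our own assumptions; STOREL's actual encoding may differ), the following Python function replaces named variables with De Bruijn indices, so that alpha-equivalent terms become syntactically identical and can be matched by a rewrite engine without any special treatment of binders.

# Terms: ("sum", var, body) binds var in body; ("mul", s, t); ("var", name).
# A variable is replaced by its distance to the binder that introduced it.

def to_debruijn(term, env=()):
    kind = term[0]
    if kind == "var":
        return ("idx", env.index(term[1]))    # distance to the binder
    if kind == "sum":
        _, var, body = term
        return ("sum", to_debruijn(body, (var,) + env))
    if kind == "mul":
        return ("mul", to_debruijn(term[1], env), to_debruijn(term[2], env))
    raise ValueError(kind)

# sum_i sum_j i * j, written with two different variable namings:
t1 = ("sum", "i", ("sum", "j", ("mul", ("var", "i"), ("var", "j"))))
t2 = ("sum", "x", ("sum", "y", ("mul", ("var", "x"), ("var", "y"))))
assert to_debruijn(t1) == to_debruijn(t2)   # alpha-equivalent => identical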
One major motivation for our work is that most existing tensor and linear algebra systems in the compilers and HPC communities focus on dense data; in contrast, our focus in this work is on sparse data. The reason for the traditional focus on dense data is that linear algebra packages were originally developed for use in physics and engineering, where tensors are dense, and they support highly optimized kernels for specific operations on dense data. Support in these packages for sparse data is rare.² Taco [27] was the first to recognize the need to optimize tensor programs over sparse data; our work falls into the latter category.
In summary, we make the following contributions in this paper:

• We describe the architecture of STOREL, where tensor programs and tensor storage mappings are defined in a common language, and optimized jointly (Sec. 3).
• We describe a declarative tensor calculus, SDQLite, for both tensor programs and storage mappings, and show that it can express a rich variety of previously proposed storage formats, and beyond (Sec. 4).
• We describe a cost-based optimizer for the tensor calculus, which supports a rich suite of optimizations, such as factorization, loop fusion, and join reordering (Sec. 5).
• Finally, we conduct an experimental evaluation showing that STOREL can significantly outperform other tensor processing systems, by using a cost-based optimizer to choose the best plan for the given storage representation (Sec. 6).
2 BACKGROUND

Tensors
Given a number $n$, we denote by $[n) \stackrel{\text{def}}{=} \{0, 1, 2, \ldots, n-1\}$. Let $d \geq 1$ and let $n_1, n_2, \ldots, n_d$ be natural numbers. A tensor with $d$ dimensions is an element $\boldsymbol{A} \in \mathbb{R}^{[n_1) \times \cdots \times [n_d)}$. A scalar, a vector, and a matrix are tensors with 0, 1, and 2 dimensions respectively. Given $d$ indices, $i_1 \in [n_1), \ldots, i_d \in [n_d)$, we denote by $\boldsymbol{A}(i_1, \ldots, i_d)$ the value of the tensor at those positions; we call each $i_j$ a dimension.
Tensor formats
We briefly review some popular tensor formats following [14, 27]. A dense representation of a tensor consists of a memory array with $n_1 n_2 \cdots n_d$ elements. The coordinate format, COO, stores only the non-zero elements in an array, and their coordinates in $d$ separate arrays. For example, the dense and COO representations of the vector $\boldsymbol{v} = (9, 0, 7, 5)$ are:

DENSE:
v_len: 4
v_val: 9 0 7 5

COO:
v_pos: 0 3
v_idx: 0 2 3
v_val: 9 7 5

²Cf. GitHub issues #43497 for TensorFlow, #72065 for PyTorch, #4332 for TVM.
CREATE TENSOR A AS
sum(<(i,k,l), B_v> in B, <(k,j), C_v> in C, <(j,l), D_v> in D)
{ (i, j) -> B_v * C_v * D_v }

(a) MTTKRP kernel $A(i,j) = \sum_{k,l} B(i,k,l) \cdot C(k,j) \cdot D(l,j)$ in SDQLite.
C = ( 6 0 9 8
      0 0 0 0
      5 0 0 7 )

C_len1: 3
C_pos2: 0 3 3 5
C_idx2: 0 2 3 0 3
C_val:  6 9 8 5 7

(b) Matrix $C$ and its CSR format, from [27, Fig. 5(f)].
CREATE TENSOR C AS sum(<row,_> in 0:C_len1)
{ row ->
  sum(<off,col> in C_idx2( C_pos2(row):C_pos2(row+1) ))
  { col -> C_val(off) }
}

(c) The SDQLite storage mapping for CSR.
sum(<i_pos, i> in B_idx1)
{ i ->
  sum(<k_pos, k> in B_idx2( B_pos2(i_pos):B_pos2(i_pos+1) ))
  sum(<j_pos, j> in C_idx( C_pos(k):C_pos(k+1) ))
  { j ->
    C_val(j_pos) * (
      merge(<l_posB, l_posD, l> in
            <B_idx3( B_pos3(k_pos):B_pos3(k_pos+1) ),
             D_idx( D_pos(j):D_pos(j+1) )>)
      B_val(l_posB) * D_val(l_posD)
    )
  }
}

(d) Optimized MTTKRP in SDQLite.
Figure 1: Illustration of STOREL. (a) MTTKRP kernel, (b) CSR memory layout, (c) CSR storage mapping, (d) optimized plan.
To access $v(i)$ using the COO representation one has to first find $i$ in v_idx; in other words, one has to find $p$ such that v_idx($p$) = $i$, then return v_val($p$); the role of v_pos will become clear shortly. The COO representation of a matrix has two index arrays, v_idx1 and v_idx2, storing the rows and columns of the non-zero elements respectively. The COO representation is compact, but no longer enables constant-time lookup. A hash-map representation of the matrix is a hash-map where the keys are pairs $(i, j)$. It is compact and allows access in time $O(1)$, but no longer supports a scan in either row-major or column-major order.
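The following Python sketch (ours, for illustration) contrasts the two lookups for $\boldsymbol{v} = (9, 0, 7, 5)$: dense lookup is a single array access, while COO lookup must first search v_idx for the coordinate (here a linear scan; a binary search would work since v_idx is sorted).

# The dense and COO arrays for v = (9, 0, 7, 5) shown above.
v_val_dense = [9, 0, 7, 5]
v_idx = [0, 2, 3]        # coordinates of the non-zeros
v_val = [9, 7, 5]        # their values

def lookup_dense(i):
    return v_val_dense[i]                 # O(1)

def lookup_coo(i):
    for p, idx in enumerate(v_idx):       # find p with v_idx[p] == i
        if idx == i:
            return v_val[p]
    return 0                              # missing coordinate => zero

assert all(lookup_dense(i) == lookup_coo(i) for i in range(4))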
The Taco system [27] describes a general scheme for storage formats where the user can choose an order of the $d$ dimensions, and specify, independently for each dimension, whether it is dense or sparse. This allows for $d! \cdot 2^d$ formats. The storage uses segmented arrays, which consist of the concatenation of several sub-arrays stored in a single array, with their starting positions stored in a separate array. For example, the sparse-sparse representation of the matrix $\boldsymbol{C}$ in Fig. 1 (b) is the following (taken from [27, Fig. 5(g)]):

C_pos1: 0 2
C_idx1: 0 2
C_pos2: 0 3 5
C_idx2: 0 2 3 0 3
C_val:  6 9 8 5 7
The arrays C_idx2 and C_val contain two segments each: the first segment represents row 0 of the matrix $\boldsymbol{C}$, $(6, 0, 9, 8)$; the second segment represents row 2, $(5, 0, 0, 7)$. The segments are delimited by C_pos2, which indicates their starting points. The row number of each segment is stored in C_idx1: only the values $i = 0$ and $i = 2$ occur here because row 1 is empty. Alternatively, the dense-sparse representation, shown in Fig. 1 (b), stores every row, including row 1, and for that reason there is no need to store the vector C_idx1 (since this vector would be $(0, 1, 2)$); we only store its length, C_len1 = 3. The dense-sparse representation is called compressed sparse row, or CSR, and the sparse-sparse representation is called doubly compressed sparse row, or DCSR. In a later reference [14] the authors extended the number of choices available at each dimension from 2 to 6.
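As a sanity check on the format, here is a short Python sketch (our illustration) that enumerates the non-zeros encoded by the CSR arrays of Fig. 1 (b); it performs the same traversal that the storage mapping of Fig. 1 (c) expresses declaratively.

# Row r's non-zeros live in the half-open segment
# C_pos2[r] : C_pos2[r+1] of C_idx2 / C_val.
C_len1 = 3
C_pos2 = [0, 3, 3, 5]
C_idx2 = [0, 2, 3, 0, 3]
C_val  = [6, 9, 8, 5, 7]

def csr_nonzeros():
    for row in range(C_len1):
        for off in range(C_pos2[row], C_pos2[row + 1]):
            yield row, C_idx2[off], C_val[off]

print(list(csr_nonzeros()))
# [(0, 0, 6), (0, 2, 9), (0, 3, 8), (2, 0, 5), (2, 3, 7)]

Note that row 1 contributes nothing: its segment C_pos2[1]:C_pos2[2] is empty (3:3).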
Tensors as Relations
Any $d$-dimensional tensor can be represented as a relation with $d+1$ attributes. For example, a matrix $A$ can be represented as a relation $R(i, j, v)$, where $(i, j)$ is the primary key, and $v$ the value of $A(i, j)$. A clustered index on $(i, j, v)$ corresponds roughly to a row-major ordering of the matrix; a hash index corresponds to a hash-map representation; while a column-oriented storage [2] corresponds to a COO representation. However, since relations are unordered, it is not possible to use some of the other formats, like CSR or CSC.
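To make the relational view concrete, the following Python sketch (ours) stores the non-zeros of the matrix $C$ from Fig. 1 (b) as a relation $R(i, j, v)$ and computes a matrix-vector product as the join-aggregate SELECT i, SUM(v * x[j]) FROM R GROUP BY i.

# The matrix as a relation over its non-zeros: tuples (i, j, v).
R = [(0, 0, 6), (0, 2, 9), (0, 3, 8), (2, 0, 5), (2, 3, 7)]
x = [1, 2, 3, 4]

y = {}
for i, j, v in R:                     # join R with x on j, aggregate by i
    y[i] = y.get(i, 0) + v * x[j]
print(y)   # {0: 6*1 + 9*3 + 8*4 = 65, 2: 5*1 + 7*4 = 33}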
Semiring Dictionary
A semiring is a quintuple $(S, +, *, 0, 1)$, where $S$ is a set, the operations $+, *$ are associative with identities 0 and 1 respectively, $+$ is commutative, $*$ distributes over $+$, and $0 * x = x * 0 = 0$. For example, the real numbers form a semiring, $(\mathbb{R}, +, *, 0, 1)$. A semiring dictionary, or simply a dictionary, is a mapping $K \to S$ from a finite set of keys $K$ to values in some semiring $S$ [42]. If $k_1, \ldots, k_m$ are distinct keys, then $\{k_1 \to v_1, \ldots, k_m \to v_m\}$ denotes the dictionary that maps each key $k_i$ to the value $v_i$, and maps each key $k \notin \{k_1, \ldots, k_m\}$ to 0. In other words, missing keys default to 0; it follows that $\{k_1 \to 0, k_2 \to 0, \ldots\} = \{\}$, in other words a dictionary containing only 0 values is the same as the empty dictionary. In this paper the key space is always of the form $K \stackrel{\text{def}}{=} [n_1) \times \cdots \times [n_d)$, and we view interchangeably a tensor $\boldsymbol{A} \in \mathbb{R}^K$ as a dictionary $\boldsymbol{A} : K \to \mathbb{R}$. When $d = 0$, the dictionary is of the form $\{() \to v\}$, which we identify, with some abuse, with the scalar value $v$.

It was observed in [42] that semiring dictionaries generalize K-relations [20]; the set of semiring dictionaries over a fixed key-set $K$ forms another semiring, where the plus and multiplication are defined element-wise. For example, if $\boldsymbol{A}, \boldsymbol{B}$ are two $m \times n$ matrices, then they both can be viewed as dictionaries $[m) \times [n) \to \mathbb{R}$, and $\boldsymbol{A} + \boldsymbol{B}$, $\boldsymbol{A} * \boldsymbol{B}$ denote their element-wise sum and product respectively. One consequence is that we obtain the following rule:

$$\{k_1 \to v_1\} + \{k_2 \to v_2\} = \begin{cases} \{k \to v_1 + v_2\} & \text{if } k_1 = k_2 = k \\ \{k_1 \to v_1, k_2 \to v_2\} & \text{if } k_1 \neq k_2 \end{cases}$$
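The following Python sketch (our illustration of the semantics, not SDQL's implementation) applies this rule: addition merges keys with missing keys defaulting to 0, multiplication is element-wise, and keys mapped to 0 are dropped, so a dictionary containing only 0 values equals the empty dictionary.

def dict_add(a, b):
    keys = set(a) | set(b)                     # missing keys default to 0
    out = {k: a.get(k, 0) + b.get(k, 0) for k in keys}
    return {k: v for k, v in out.items() if v != 0}

def dict_mul(a, b):
    keys = set(a) & set(b)                     # 0 * x = 0 drops the rest
    out = {k: a[k] * b[k] for k in keys}
    return {k: v for k, v in out.items() if v != 0}

assert dict_add({1: 9}, {1: -9}) == {}         # all-zero dictionary is empty
assert dict_add({1: 9}, {2: 7}) == {1: 9, 2: 7}
assert dict_mul({1: 9, 2: 3}, {2: 2, 5: 4}) == {2: 6}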