Optimizing Tensor Programs on Flexible Storage

Maximilian Schleich, RelationalAI, USA
Amir Shaikhha, University of Edinburgh, United Kingdom
Dan Suciu, University of Washington, USA
ABSTRACT

Tensor programs often need to process large tensors (vectors, matrices, or higher-order tensors) that require a specialized storage format for their memory layout. Several such layouts have been proposed in the literature, such as the Coordinate Format, the Compressed Sparse Row format, and many others, that were especially designed to optimally store tensors with specific sparsity properties. However, existing tensor processing systems require specialized extensions in order to take advantage of every new storage format. In this paper we describe a system that allows users to define flexible storage formats in a declarative tensor query language, similar to the language used by the tensor program. The programmer only needs to write storage mappings, which describe, in a declarative way, how the tensors are laid out in main memory. Then, we describe a cost-based optimizer that optimizes the tensor program for the specific memory layout. We demonstrate empirically significant performance improvements compared to state-of-the-art tensor processing systems.
1 INTRODUCTION

Linear algebra and, more generally, tensor algebra is used in a wide variety of domains, such as science, engineering, machine learning, and data analysis. Tensors are natural generalizations of vectors and matrices from 1 and 2 dimensions to arbitrary dimensions, and highly optimized implementations of tensor algebra operations are available today in several popular libraries, such as SciPy, PyTorch, Julia, TensorFlow, or Matlab. While these libraries are highly optimized for individual operations, compound operations require users to create temporary tensors, which often destroys locality and may even lead to out-of-memory errors when the intermediate results are too large. Such operations are frequently encountered in complex tensor programs, or tensor kernels, terms that we will use interchangeably in this paper.
Several domain-specific languages have been proposed for expressing and optimizing entire tensor programs. Examples include SystemML [10], TVM [12], Halide [39], Taco [27], and TASO [24]. The compiler community has addressed one challenge of the optimization problem, namely the separation of the algorithm from the schedule. This idea was introduced by the Halide language, which was designed for high-performance code generation for image processing pipelines [38, 39]. The programmer writes the algorithm in an imperative, high-level language, and writes separately a schedule, which specifies low-level optimizations, such as tiling, vectorization, or loop unrolling. TVM [12] extends this principle from image processing to tensor processing for general-purpose ML applications.
In this work we are not concerned with schedules, but with a different challenge in tensor processing: optimizing the query plan based on how the tensors are stored in memory. Storage refers in this paper to memory layout, and not to persistent representation. When a tensor is sparse, the programmer has many choices for representing it in main memory, and the best plan for a tensor program varies dramatically depending on what storage was chosen, the statistics of the data (e.g. how sparse or dense), and the particular tensor program. For example, a vector $a(i)$ can be stored as a dense array, or as a hash table indexed by $i$, or as two parallel arrays storing the indices $i$ and the values $a(i)$. When the tensors are dense, then the best plan may be to use a linear algebra library [31, 48], while for sparse tensors a better plan may be to use relational query operators, e.g. hash joins. Tensors can be easily represented as relations, and tensor programs can be expressed as SQL queries [9], but relational engines are not designed to support storage formats specifically optimized for sparse tensors (e.g. the CSR format discussed later).
To address the storage problem, Taco [27] separates the tensor storage format from the tensor program. In the storage format the user can specify separately for each dimension whether it is dense or sparse, and can also order the dimensions, leading to $d! \cdot 2^d$ possible formats for a $d$-dimensional tensor.¹ Given a tensor program, the Taco compiler generates code that optimizes the access to the storage formats. Taco does not perform cost-based optimizations, which means that the programmer still needs to be aware of the storage specification. For example, if the vector $a$ is sparse while $b, c$ are dense, then $a(i) * (b(i) + c(i))$ is best rewritten as $a(i) * b(i) + a(i) * c(i)$, because computing $b(i) + c(i)$ results in a large, dense vector, while $a(i) * b(i)$ is a small, sparse vector, no larger than $a$, and similarly for $a(i) * c(i)$. However, the task of rewriting the expression is left to the programmer.
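To make the cost difference concrete, here is a minimal Python sketch (our illustration, not STOREL code) contrasting the two plans: the unfactorized plan materializes the dense intermediate $b + c$, while the factorized plan touches only the non-zeros of $a$.

# A minimal sketch of why factoring out a sparse vector pays off.
# Vectors b, c are dense Python lists; the sparse vector a is a dict
# from index to value, so iterating over it touches only non-zeros.

def eval_unfactorized(a, b, c, n):
    # a(i) * (b(i) + c(i)): materializes the dense sum b + c first.
    tmp = [b[i] + c[i] for i in range(n)]          # O(n) work and memory
    return {i: v * tmp[i] for i, v in a.items()}

def eval_factorized(a, b, c, n):
    # a(i)*b(i) + a(i)*c(i): only positions where a is non-zero.
    return {i: v * b[i] + v * c[i] for i, v in a.items()}   # O(nnz(a))

n = 8
a = {1: 2.0, 5: 3.0}                  # sparse: 2 non-zeros out of 8
b = [float(i) for i in range(n)]      # dense
c = [1.0] * n                         # dense
assert eval_unfactorized(a, b, c, n) == eval_factorized(a, b, c, n)

Both plans return the same sparse result; only the amount of work and the size of the intermediates differ, which is exactly what a cost-based optimizer must weigh.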
In this paper we propose a rule-based approach to optimizing tensor programs over flexible tensor storage, using a cost-based optimizer. The main novelty in our approach is that the storage descriptors themselves are also defined in the same declarative language as the tensor program. To specify how a tensor is stored, the user writes a storage mapping from the physical data structures (arrays and/or hash tables) to the logical tensor. Our system evaluates the tensor program by first composing it with the storage mappings, then optimizing it using rewrite rules. This improves in two ways over previous systems. First, the storage formats are no longer hard-coded, but the user is free to define their own. For example, users may describe one of the popular storage formats COO, CSR, etc., or define a format optimized for upper-triangular matrices, or for band matrices, or a space-filling curve, etc. There is no bound on the number of storage representations; the only limit is the expressivity of the query language and the power of the optimizer. Second, the optimizer is now able to perform a rich collection of high-level optimizations, such as factorization, loop fusion, or join reordering, and optimize the tensor program specifically for the given storage. For example, the optimizer may consider both expressions $a(i) * (b(i) + c(i))$ and $a(i) * b(i) + a(i) * c(i)$, and choose the optimal one based on their physical storage and data statistics.

To the best of our knowledge, our system, called STOREL, is the first cost-based optimizer for a declarative tensor language. We show in Sec. 6 that, due to the rewrite rules, STOREL significantly outperforms both Taco [27] (a tensor algebra compiler) and DuckDB [37] (an optimized relational engine) for several tensor programs, although their physical execution engines are as good as, or even better than, ours.

¹Taco was later extended to support 6 formats per dimension [14].
Fig. 1 illustrates how STOREL processes the matricized tensor times Khatri-Rao product, MTTKRP [27], $A(i,j) = \sum_{k,l} B(i,k,l) \cdot C(k,j) \cdot D(l,j)$. Fig. 1 (a) shows the tensor program written in our declarative tensor language SDQLite (described in Sec. 3), while (b) shows the Compressed Sparse Row (CSR) memory layout of matrix $C$, which is one of several formats described in [14, 27] (reviewed in Sec. 2). For each matrix or tensor, the user describes its memory layout by writing a storage mapping, also in SDQLite; the storage mapping for the matrix $C$ is shown in Fig. 1 (c), and similar storage mappings need to be defined for $B$ and $D$. To execute the program, the system composes the tensor program with the storage mappings, then chooses an optimal plan using a cost-based optimizer; the optimal plan is shown in Fig. 1 (d). While the plan could be further optimized for some sophisticated schedule (as done by Halide, TVM, and Taco), we currently do not support schedules and simply run the optimal plan directly in Julia.
The main challenge in developing the cost-based optimizer is the right choice of tensor processing language. All query optimizers use the relational algebra as intermediate language. However, we found that a calculus, rather than an algebra, is better suited for optimizing tensors; here calculus refers to a language with explicit use of variables, while algebra refers to a variable-free language. There are two reasons for that. First, the physical plan of a tensor program consists of for-loops with explicit variables. They look like this: for i=1:m do for j=1:n do ... instead of this: $A \bowtie (B \bowtie \cdots)$, and optimizing expressions with variables directly simplifies the generation of the physical plan. Second, the intermediate language for tensor programs needs to support nested collections, which occur in sparse formats like CSR, CSC, and CSF, while standard relational algebra, as well as some recent extensions to linear algebra [23], supports only flat collections. Algebras for nested collections exist, but they tend to be much harder to read than a calculus, making it harder to design and debug optimization rules; e.g. compare the rules in Table 2 of [55] to those in Fig. 3 in our paper. For these reasons, we opted for a calculus-based intermediate language. We are not the first to use a calculus-based intermediate language for query optimization: Worst Case Optimal Join algorithms are also described as nested loops, in effect using a calculus as intermediate language [18, 35, 51].
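To illustrate the calculus-to-loop correspondence, here is a small Python sketch (our illustration, not generated SDQLite output): the expression $y(i) = \sum_j A(i,j) \cdot x(j)$, with explicit variables $i$ and $j$, translates directly into two nested for-loops over a nested dictionary, the same shape of nested collection used by CSR-like formats.

# The quantified variables i and j of the calculus expression become
# the loop variables of the physical plan. A is a nested dictionary
# {i -> {j -> value}} holding only non-zeros.

def matvec(A, x):
    y = {}
    for i, row in A.items():          # variable i becomes a for-loop
        acc = 0.0
        for j, a_ij in row.items():   # variable j becomes a for-loop
            acc += a_ij * x[j]
        y[i] = acc
    return y

A = {0: {0: 6.0, 2: 9.0, 3: 8.0}, 2: {0: 5.0, 3: 7.0}}  # matrix C of Fig. 1(b)
x = [1.0, 1.0, 1.0, 1.0]
print(matvec(A, x))   # {0: 23.0, 2: 12.0}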
We describe in this paper a language, called SDQLite, used both for writing tensor programs and for performing optimizations. SDQLite has a syntax that is reminiscent of Unions of Conjunctive Queries, but where ∧, ∨, ∃ are replaced by *, +, sum, to which we add let-bindings, and nested dictionaries as data model; our dictionaries are similar to those in SDQL (Semiring-Dictionary Query Language) [42], hence we call our language SDQLite. Our language can express tensor programs in a notation close to mathematics, and can express complex storage mappings corresponding to sophisticated tensor memory layouts, including those described in [14, 27]. Any SDQLite query can easily be converted directly to a physical, nested for-loop plan, because each quantified variable $i, j, \ldots$ becomes directly a for loop over that variable. However, it is more difficult to design an optimizer. For example, Selinger's dynamic programming algorithm for join re-ordering [5, 32] no longer applies, because in a calculus there is no explicit binary join. Instead, our system is entirely rule-based, and the rules must be designed for a calculus rather than an algebra. We designed a suite of 44 SDQLite rewrite rules, and use the equality saturation system Egg [54] as rewrite engine. Egg uses a compact data structure called an e-graph to represent a large number of equivalent expressions as a graph. However, like most term rewriting systems, Egg does not understand variables in rules. For our optimizer, we developed a variant of the De Bruijn index that removes the need for explicit variable representation.
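As an illustration of the idea (a toy sketch under our own assumptions; STOREL's actual encoding may differ), the following Python function replaces named variables with De Bruijn indices, so that alpha-equivalent terms become syntactically identical and can be matched by a rewrite engine without any special treatment of binders.

# Terms: ("sum", var, body) binds var in body; ("mul", s, t); ("var", name).
# A variable is replaced by its distance to the binder that introduced it.

def to_debruijn(term, env=()):
    kind = term[0]
    if kind == "var":
        return ("idx", env.index(term[1]))    # distance to the binder
    if kind == "sum":
        _, var, body = term
        return ("sum", to_debruijn(body, (var,) + env))
    if kind == "mul":
        return ("mul", to_debruijn(term[1], env), to_debruijn(term[2], env))
    raise ValueError(kind)

# sum_i sum_j i * j, written with two different variable namings:
t1 = ("sum", "i", ("sum", "j", ("mul", ("var", "i"), ("var", "j"))))
t2 = ("sum", "x", ("sum", "y", ("mul", ("var", "x"), ("var", "y"))))
assert to_debruijn(t1) == to_debruijn(t2)   # alpha-equivalent => identical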
One major motivation for our work is that most existing tensor and linear algebra systems in the compilers and HPC communities focus on dense data; in contrast, our focus in this work is on sparse data. The reason for the traditional focus on dense data is that linear algebra packages were originally developed for use in physics and engineering, where tensors are dense, and they support highly optimized kernels for specific operations on dense data. Support in these packages for sparse data is rare.² Taco [27] was the first to recognize the need to optimize tensor programs over sparse data; our work falls into the latter category.
In summary, we make the following contributions in this paper:

• We describe the architecture of STOREL, where tensor programs and tensor storage mappings are defined in a common language, and optimized jointly (Sec. 3).
• We describe a declarative tensor calculus, SDQLite, for both tensor programs and storage mappings, and show that it can express a rich variety of previously proposed storage formats, and beyond (Sec. 4).
• We describe a cost-based optimizer for the tensor calculus, which supports a rich suite of optimizations, such as factorization, loop fusion, and join reordering (Sec. 5).
• Finally, we conduct an experimental evaluation showing that STOREL can significantly outperform other tensor processing systems, by using a cost-based optimizer to choose the best plan for the given storage representation (Sec. 6).
2 BACKGROUND

Tensors
Given a number $n$, we denote by $[n) \stackrel{\text{def}}{=} \{0, 1, 2, \ldots, n-1\}$. Let $d \geq 1$ and let $n_1, n_2, \ldots, n_d$ be natural numbers. A tensor with $d$ dimensions is an element $\boldsymbol{A} \in \mathbb{R}^{[n_1) \times \cdots \times [n_d)}$. A scalar, a vector, and a matrix are tensors with 0, 1, and 2 dimensions respectively. Given $d$ indices, $i_1 \in [n_1), \ldots, i_d \in [n_d)$, we denote by $\boldsymbol{A}(i_1, \ldots, i_d)$ the value of the tensor at those positions; we call each $i_j$ a dimension.
Tensor formats
We briefly review some popular tensor formats following [14, 27]. A dense representation of a tensor consists of a memory array with $n_1 n_2 \cdots n_d$ elements. The coordinate format, COO, stores only the non-zero elements in an array, and their coordinates in $d$ separate arrays. For example, the dense and COO representations of the vector $\boldsymbol{v} = (9, 0, 7, 5)$ are:

DENSE:
v_len: 4
v_val: 9 0 7 5

COO:
v_pos: 0 3
v_idx: 0 2 3
v_val: 9 7 5

²Cf. GitHub issues #43497 for TensorFlow, #72065 for PyTorch, #4332 for TVM.
CREATE TENSOR A AS
sum(<(i,k,l), B_v> in B, <(k,j), C_v> in C, <(j,l), D_v> in D)
{ (i, j) -> B_v * C_v * D_v }

(a) MTTKRP kernel $A(i,j) = \sum_{k,l} B(i,k,l) \cdot C(k,j) \cdot D(l,j)$ in SDQLite.
C = ( 6 0 9 8
      0 0 0 0
      5 0 0 7 )

C_len1: 3
C_pos2: 0 3 3 5
C_idx2: 0 2 3 0 3
C_val:  6 9 8 5 7

(b) Matrix $C$ and its CSR format, from [27, Fig. 5(f)].
CREATE TENSOR C AS sum(<row,_> in 0:C_len1)
{ row ->
  sum(<off,col> in C_idx2( C_pos2(row):C_pos2(row+1) ))
  { col -> C_val(off) }
}

(c) The SDQLite storage mapping for CSR.
sum(<i_pos, i> in B_idx1)
{ i ->
  sum(<k_pos, k> in B_idx2( B_pos2(i_pos):B_pos2(i_pos+1) ))
  sum(<j_pos, j> in C_idx( C_pos(k):C_pos(k+1) ))
  { j ->
    C_val(j_pos) * (
      merge(<l_posB, l_posD, l> in
            <B_idx3( B_pos3(k_pos):B_pos3(k_pos+1) ),
             D_idx( D_pos(j):D_pos(j+1) )>)
      B_val(l_posB) * D_val(l_posD)
    )
  }
}

(d) Optimized MTTKRP in SDQLite.
Figure 1: Illustration of STOREL. (a) MTTKRP kernel, (b) CSR memory layout, (c) CSR storage mapping, (d) optimized plan.
To access $v(i)$ using the COO representation one has to first find $i$ in v_idx; in other words, one has to find $p$ such that v_idx($p$) = $i$, then return v_val($p$); the role of v_pos will become clear shortly. The COO representation of a matrix has two index arrays, v_idx1 and v_idx2, storing the rows and columns of the non-zero elements respectively. The COO representation is compact, but no longer enables constant-time lookup. A hash-map representation of the matrix is a hash-map where the keys are pairs $(i, j)$. It is compact and allows access in time $O(1)$, but no longer supports a scan in either row-major or column-major order.
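The following Python sketch (ours, for illustration) contrasts the two lookups for $\boldsymbol{v} = (9, 0, 7, 5)$: dense lookup is a single array access, while COO lookup must first search v_idx for the coordinate (here a linear scan; a binary search would work since v_idx is sorted).

# The dense and COO arrays for v = (9, 0, 7, 5) shown above.
v_val_dense = [9, 0, 7, 5]
v_idx = [0, 2, 3]        # coordinates of the non-zeros
v_val = [9, 7, 5]        # their values

def lookup_dense(i):
    return v_val_dense[i]                 # O(1)

def lookup_coo(i):
    for p, idx in enumerate(v_idx):       # find p with v_idx[p] == i
        if idx == i:
            return v_val[p]
    return 0                              # missing coordinate => zero

assert all(lookup_dense(i) == lookup_coo(i) for i in range(4))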
The Taco system [27] describes a general scheme for storage formats where the user can choose an order of the $d$ dimensions, and specify, independently for each dimension, whether it is dense or sparse. This allows for $d! \cdot 2^d$ formats. The storage uses segmented arrays, which consist of the concatenation of several sub-arrays stored in a single array, with their starting positions stored in a separate array. For example, the sparse-sparse representation of the matrix $\boldsymbol{C}$ in Fig. 1 (b) is the following (taken from [27, Fig. 5(g)]):

C_pos1: 0 2
C_idx1: 0 2
C_pos2: 0 3 5
C_idx2: 0 2 3 0 3
C_val:  6 9 8 5 7
The arrays C_idx2 and C_val contain two segments each: the first segment represents row 0 of the matrix $\boldsymbol{C}$, $(6, 0, 9, 8)$; the second segment represents row 2, $(5, 0, 0, 7)$. The segments are delimited by C_pos2, which indicates their starting points. The row number of each segment is stored in C_idx1: only the values $i = 0$ and $i = 2$ occur here because row 1 is empty. Alternatively, the dense-sparse representation, shown in Fig. 1 (b), stores every row, including row 1, and for that reason there is no need to store the vector C_idx1 (since this vector would be $(0, 1, 2)$); we only store its length, C_len1 = 3. The dense-sparse representation is called compressed sparse row, or CSR, and the sparse-sparse representation is called doubly compressed sparse row, or DCSR. In a later reference [14] the authors extended the number of choices available at each dimension from 2 to 6.
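As a sanity check on the format, here is a short Python sketch (our illustration) that enumerates the non-zeros encoded by the CSR arrays of Fig. 1 (b); it performs the same traversal that the storage mapping of Fig. 1 (c) expresses declaratively.

# Row r's non-zeros live in the half-open segment
# C_pos2[r] : C_pos2[r+1] of C_idx2 / C_val.
C_len1 = 3
C_pos2 = [0, 3, 3, 5]
C_idx2 = [0, 2, 3, 0, 3]
C_val  = [6, 9, 8, 5, 7]

def csr_nonzeros():
    for row in range(C_len1):
        for off in range(C_pos2[row], C_pos2[row + 1]):
            yield row, C_idx2[off], C_val[off]

print(list(csr_nonzeros()))
# [(0, 0, 6), (0, 2, 9), (0, 3, 8), (2, 0, 5), (2, 3, 7)]

Note that row 1 contributes nothing: its segment C_pos2[1]:C_pos2[2] is empty (3:3).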
Tensors as Relations
Any $d$-dimensional tensor can be represented as a relation with $d+1$ attributes. For example, a matrix $A$ can be represented as a relation $R(i, j, v)$, where $(i, j)$ is the primary key, and $v$ the value of $A(i, j)$. A clustered index on $(i, j, v)$ corresponds roughly to a row-major ordering of the matrix; a hash index corresponds to a hash-map representation; while a column-oriented storage [2] corresponds to a COO representation. However, since relations are unordered, it is not possible to use some of the other formats, like CSR or CSC.
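To make the relational view concrete, the following Python sketch (ours) stores the non-zeros of the matrix $C$ from Fig. 1 (b) as a relation $R(i, j, v)$ and computes a matrix-vector product as the join-aggregate SELECT i, SUM(v * x[j]) FROM R GROUP BY i.

# The matrix as a relation over its non-zeros: tuples (i, j, v).
R = [(0, 0, 6), (0, 2, 9), (0, 3, 8), (2, 0, 5), (2, 3, 7)]
x = [1, 2, 3, 4]

y = {}
for i, j, v in R:                     # join R with x on j, aggregate by i
    y[i] = y.get(i, 0) + v * x[j]
print(y)   # {0: 6*1 + 9*3 + 8*4 = 65, 2: 5*1 + 7*4 = 33}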
Semiring Dictionary
A semiring is a quintuple $(S, +, *, 0, 1)$, where $S$ is a set, the operations $+, *$ are associative with identities 0 and 1 respectively, $+$ is commutative, $*$ distributes over $+$, and $0 * x = x * 0 = 0$. For example, the real numbers form a semiring, $(\mathbb{R}, +, *, 0, 1)$. A semiring dictionary, or simply a dictionary, is a mapping $K \to S$ from a finite set of keys $K$ to values in some semiring $S$ [42]. If $k_1, \ldots, k_m$ are distinct keys, then $\{k_1 \to v_1, \ldots, k_m \to v_m\}$ denotes the dictionary that maps each key $k_i$ to the value $v_i$, and maps each key $k \notin \{k_1, \ldots, k_m\}$ to 0. In other words, missing keys default to 0; it follows that $\{k_1 \to 0, k_2 \to 0, \ldots\} = \{\}$, in other words a dictionary containing only 0 values is the same as the empty dictionary. In this paper the key space is always of the form $K \stackrel{\text{def}}{=} [n_1) \times \cdots \times [n_d)$, and we view interchangeably a tensor $\boldsymbol{A} \in \mathbb{R}^K$ as a dictionary $\boldsymbol{A} : K \to \mathbb{R}$. When $d = 0$, the dictionary is of the form $\{() \to v\}$, which we identify, with some abuse, with the scalar value $v$.

It was observed in [42] that semiring dictionaries generalize K-relations [20]; the set of semiring dictionaries over a fixed key-set $K$ forms another semiring, where the plus and multiplication are defined element-wise. For example, if $\boldsymbol{A}, \boldsymbol{B}$ are two $m \times n$ matrices, then they both can be viewed as dictionaries $[m) \times [n) \to \mathbb{R}$, and $\boldsymbol{A} + \boldsymbol{B}$, $\boldsymbol{A} * \boldsymbol{B}$ denote their element-wise sum and product respectively. One consequence is that we obtain the following rule:

$$\{k_1 \to v_1\} + \{k_2 \to v_2\} = \begin{cases} \{k \to v_1 + v_2\} & \text{if } k_1 = k_2 = k \\ \{k_1 \to v_1, k_2 \to v_2\} & \text{if } k_1 \neq k_2 \end{cases}$$
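The following Python sketch (our illustration of the semantics, not SDQL's implementation) applies this rule: addition merges keys with missing keys defaulting to 0, multiplication is element-wise, and keys mapped to 0 are dropped, so a dictionary containing only 0 values equals the empty dictionary.

def dict_add(a, b):
    keys = set(a) | set(b)                     # missing keys default to 0
    out = {k: a.get(k, 0) + b.get(k, 0) for k in keys}
    return {k: v for k, v in out.items() if v != 0}

def dict_mul(a, b):
    keys = set(a) & set(b)                     # 0 * x = 0 drops the rest
    out = {k: a[k] * b[k] for k in keys}
    return {k: v for k, v in out.items() if v != 0}

assert dict_add({1: 9}, {1: -9}) == {}         # all-zero dictionary is empty
assert dict_add({1: 9}, {2: 7}) == {1: 9, 2: 7}
assert dict_mul({1: 9, 2: 3}, {2: 2, 5: 4}) == {2: 6}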