ALT: Boosting Deep Learning Performance by Breaking the Wall
between Graph and Operator Level Optimizations
Zhiying Xu
zyxu@smail.nju.edu.cn
Nanjing University
China
Jiafan Xu
mf20330099@smail.nju.edu.cn
Nanjing University
China
Hongding Peng
mg20330044@smail.nju.edu.cn
Nanjing University
China
Wei Wang
ww@nju.edu.cn
Nanjing University
China
Xiaoliang Wang
waxili@nju.edu.cn
Nanjing University
China
Haoran Wan
wanhr@smail.nju.edu.cn
Nanjing University
China
Haipeng Dai
haipengdai@nju.edu.cn
Nanjing University
China
Yixu Xu
xuyixu@huawei.com
Huawei Technologies
China
Hao Cheng
chenghao49@hisilicon.com
Huawei Technologies
China
Kun Wang
kun.wang1981@gmail.com
The Hong Kong Polytechnic
University
China
Guihai Chen
gchen@nju.edu.cn
Nanjing University
China
ABSTRACT
Deep learning models rely on highly optimized tensor libraries for efficient inference on heterogeneous hardware. Current deep compilers typically predetermine layouts of tensors and then optimize loops of operators. However, such a unidirectional and one-off workflow strictly separates graph-level optimization and operator-level optimization into different system layers, missing opportunities for unified tuning.
This paper proposes ALT, a compiler that performs joint graph- and operator-level optimizations for deep models. ALT provides a generic transformation module to manipulate layouts and loops with easy-to-use primitive functions. ALT further integrates an auto-tuning module that jointly optimizes graph-level data layouts and operator-level loops while guaranteeing efficiency. Experimental results show that ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single operator performance (e.g., 1.5× speedup on average) and end-to-end inference performance (e.g., 1.4× speedup on average).
1 INTRODUCTION
Deep learning has become one of the essential building blocks for emerging applications, such as machine translation and autonomous driving systems. To provide ubiquitous services, developers craft high-performance programs supporting various tensor operators (e.g., 2-D convolution and matrix multiplication) on different hardware platforms (e.g., NVIDIA GPU and ARM CPU). However, current vendor libraries (e.g., MKL-DNN [33] and cuDNN [12]) typically demand significant engineering effort on manual optimization. Moreover, the hand-tuning approach can hardly catch up with the fast evolution of deep learning techniques that constantly introduce new tensor operators [32] and new hardware (e.g., neural processing units). Therefore, researchers develop deep compilers [6, 10, 39, 70, 81] to achieve automatic performance optimization via auto-tuning and code generation techniques.
Two key categories of optimizations during compilation are graph-level optimization and operator-level optimization. Graph-level optimization represents operators as nodes and tensors as edges in a computational graph, and rewrites nodes or edges to obtain a more efficient graph for inference. For instance, data layout optimization rewrites the tensor storage format to improve memory accessing performance [4, 14, 36, 56, 61, 63, 69]. Constant folding [8, 49, 57] and common subexpression elimination [49, 57] remove redundant nodes. Operator-level optimization, which mainly involves loop optimization, transforms the nested loops in the source code of each operator to schedule the execution of instructions [7, 10, 26, 28, 55]. In this work, we focus on data layout optimization and loop optimization because they yield significant performance improvements and their tuning strongly correlates with operator and hardware characteristics.
Unfortunately, existing deep compilers (e.g., TVM [10], Tensor Comprehension [70], Tiramisu [6], AKG [81]) and auto-tuning techniques (e.g., AutoTVM [11], NeoCPU [44], FlexTensor [89], and Ansor [83]) fail to combine data layout and loop optimizations effectively. These systems first predetermine tensor layouts, either manually or by setting a hyper-parameter from a predefined template, and then perform loop optimization based on these layouts. There are three major limitations in this unidirectional and one-off workflow. First, manual layout selection implies that only a limited number of layout choices can be explored, and is hence prone to be suboptimal. Second, altering the tensor layout demands the time-consuming re-implementation of operators that access the tensor. Third, layout optimization and loop optimization are separated into different system layers. Such a strict boundary seriously compromises the performance of the generated tensor programs. For instance, we observe that optimizing loops based on the best of three candidate layouts for 2-D convolutional operators can improve the performance by 55.9% on average on the Intel CPU. Moreover, the performance of a specific data layout is sensitive to operator configurations (e.g., tensor shapes) and hardware. Thus it is hard to determine data layouts for each workload without feedback from loop optimization.

arXiv:2210.12415v4 [cs.LG] 29 Oct 2022
This paper proposes ALT, a deep compiler that jointly performs graph-level and operator-level optimizations for deep models. The design of ALT originates from the following insight: graph-level data layout optimization and operator-level loop optimization could benefit from each other. Meanwhile, the root cause of the inability to perform cross-layer joint tuning is the coupling between data storage and operator implementation in prior arts, such that altering the data layout requires re-implementing operators. This high cost of changing layouts further leads to the unidirectional and one-off optimization flow. Therefore, ALT abstracts layout manipulation into easy-to-use primitive functions, such that the task of re-implementing operators can be delegated to a compilation pass without human interference. After reducing this cost, ALT further incorporates layout and loop optimizations into a unified auto-tuning framework, which breaks the boundary between graph- and operator-level optimizations to open new opportunities.
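The decoupling idea can be illustrated with a small sketch. This is not ALT's actual API: `Tensor`, `permute`, and `physical_offset` are hypothetical names, and a real compilation pass would rewrite access functions in the IR rather than compute offsets at runtime. The point is that once a layout is just an index mapping, "re-implementing" an operator for a new layout reduces to routing its accesses through that mapping.

```python
class Tensor:
    """A tensor whose logical indexing is decoupled from its physical layout."""
    def __init__(self, shape, layout=None):
        self.shape = tuple(shape)                          # logical shape, e.g. (N, O, H, W)
        self.layout = layout or tuple(range(len(shape)))   # permutation of dimensions

def permute(t, order):
    """Layout primitive: reorder storage dimensions; operators are untouched."""
    return Tensor(t.shape, tuple(t.layout[i] for i in order))

def physical_offset(t, logical_idx):
    """The access function a compilation pass would synthesize: map logical
    indices through the layout to a linear offset in physical storage."""
    phys_shape = [t.shape[d] for d in t.layout]
    phys_idx = [logical_idx[d] for d in t.layout]
    strides = [1] * len(phys_shape)                        # row-major strides
    for i in range(len(phys_shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * phys_shape[i + 1]
    return sum(i * s for i, s in zip(phys_idx, strides))

# An NOHW tensor of shape (2, 4, 8, 8); switch its storage to NHWO.
x = Tensor((2, 4, 8, 8))            # default storage order: N, O, H, W
y = permute(x, (0, 2, 3, 1))        # storage order becomes N, H, W, O
# Same logical element (n=1, o=3, h=2, w=5), different physical offsets:
print(physical_offset(x, (1, 3, 2, 5)))   # 469
print(physical_offset(y, (1, 3, 2, 5)))   # 343
```

Operators keep indexing logically; only the synthesized access function changes when a layout primitive is applied.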
Achieving these goals is not trivial; we need to address the following challenges.
Challenge 1: How can we eliminate the overhead of layout transformation? We discover two types of potential overhead when altering tensor layouts: layout-conversion overhead and fusion-conflict overhead. First, operators along the data stream may require different tensor layouts to reach their optimal performance. To transform the layouts of tensors produced by other operators at runtime, directly inserting conversion operators will incur extra overhead on data movements. Second, altering the output tensor layout of an operator requires the reconstruction of its loop nest. Such reconstruction may inhibit the operator from being fused with its consumer; operator fusion is an important loop optimization technique for improving inter-operator data locality.
Challenge 2: How can we prevent inefficiency due to search space reconstruction during joint tuning? Changing the output layout of an operator induces loop nest reconstruction, which in turn changes the loop tuning space. For joint tuning, such space variation prohibits a direct iterative exploration: points searched in the last iteration may be invalid in the changed space. This leads to inefficient tuning for most search methods, including genetic and learning-based algorithms, since the accumulated knowledge of the search space structure cannot be further exploited in the newly reconstructed space.
Challenge 3: How can we improve efficiency given the search space explosion from combining layout and loop tuning? With joint tuning, the combined search space becomes tremendously large, and hence inefficient to explore directly. For example, in a typical 2-D convolutional operator, the loop transformation space can contain up to O(10^7) points for its seven nested loops. After combining the layout transformation, the joint search space can be at a scale of O(10^19), considering three tensors, each of which further involves four dimensions. Moreover, end-to-end optimization is more challenging due to the inter-dependency of many operators and tensors.
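A back-of-the-envelope calculation shows how these magnitudes arise. The candidate counts below are illustrative assumptions for the sketch, not figures taken from ALT's implementation: roughly 10 tiling choices per loop, and per 4-D tensor, all 4! dimension orders times an assumed few hundred tiling/packing variants.

```python
import math

# Loop space: 7 nested loops, ~10 candidate tile sizes each (assumed).
loop_space = 10 ** 7

# Layout space per 4-D tensor: 4! dimension orders times an assumed
# ~400 tiling/packing variants per order -> roughly 10^4 per tensor.
per_tensor_layouts = math.factorial(4) * 400

# Three tensors (e.g., input, weight, output) combined with the loop space.
joint_space = loop_space * per_tensor_layouts ** 3
print(f"{joint_space:.1e}")   # on the order of 10^19
```

Under these assumptions the joint space lands near the O(10^19) scale quoted above, which is why pruning and a staged exploration are needed.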
To eliminate the two types of overhead brought by layout transformation, we propose a layout propagation mechanism. Suppose we have chosen a different layout for the input tensor of an operator. We let the upstream operator, which is the producer of this tensor, directly yield elements based on the new layout, so no conversion operator is required. To promote operator fusion, we propagate the new layout downstream along the computational graph to let the consumer operator trigger the same loop reconstruction, which helps to align the loop nests of multiple operators for fusion. As such, we can safely transform data layouts with minimal overhead, and without sabotaging loop optimization.
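The mechanism can be sketched on a toy chain-structured graph. The `Node` structure and `propagate` function here are hypothetical simplifications (a real IR would track multiple inputs, fusion groups, and per-edge layouts); the sketch only shows the two moves described above: the producer yields the new layout directly, and the layout is pushed to downstream consumers so their loop nests stay aligned.

```python
class Node:
    """Toy graph node: produces one tensor consumed by `consumers`."""
    def __init__(self, name):
        self.name = name
        self.out_layout = "NOHW"   # layout of the tensor this node produces
        self.consumers = []

def propagate(node, new_layout):
    """Adopt `new_layout` for node's output: instead of inserting a layout
    conversion operator, the producer yields the new layout directly, and
    the layout is propagated downstream so fusible consumers stay aligned."""
    node.out_layout = new_layout
    for c in node.consumers:
        propagate(c, new_layout)

conv = Node("conv2d")
relu = Node("relu")
add = Node("add")
conv.consumers = [relu]
relu.consumers = [add]

propagate(conv, "NHWO")
print([n.out_layout for n in (conv, relu, add)])   # all aligned, no converts
```

Because every node along the chain ends up with the same layout, no data-movement operator is inserted and conv2d can still be fused with relu.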
To alleviate the search space reconstruction issue in co-tuning, our solution is twofold. First, we divide the co-tuning into two stages: a joint stage and a loop-only stage. The joint stage searches for optimal tensor layouts, while the loop-only stage only performs loop tuning, with the searched layouts remaining unchanged. Second, we propose a cross-exploration architecture for the joint stage, rather than direct exploration. For each new feasible layout, we reconstruct the loop space and then perform multiple rounds of loop optimization to assess the layout. This design avoids inefficient loop space reconstruction, since the loop-only stage keeps layouts unchanged. It also achieves the expected bidirectional and unified tuning flow in the joint stage, because each candidate layout is evaluated based on feedback from loop optimization through our novel tuning architecture.
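The two-stage flow can be sketched as follows. Everything here is a stand-in: `measure` is a synthetic cost model (real tuning measures on hardware), and the candidate layouts and tiling range are arbitrary. What the sketch preserves is the structure: the joint stage scores each layout by running a few rounds of loop tuning inside that layout's reconstructed loop space, then the loop-only stage tunes loops thoroughly with the winning layout frozen.

```python
import random
random.seed(0)

def measure(layout, tiling):
    """Stand-in for on-device measurement: a synthetic cost whose optimum
    depends on the layout, so choosing a layout needs loop-tuning feedback."""
    target = {"NOHW": 4, "NHWO": 7, "HWON": 2}[layout]
    base = {"NOHW": 1.0, "NHWO": 0.2, "HWON": 0.8}[layout]
    return abs(tiling - target) + base

def tune_loops(layout, rounds):
    """Loop-only tuning inside a fixed layout's (reconstructed) loop space."""
    return min((measure(layout, t), t)
               for t in random.sample(range(1, 9), rounds))   # (cost, tiling)

# Joint stage: score each candidate layout via a few rounds of loop tuning.
candidates = ["NOHW", "NHWO", "HWON"]
_, best_layout = min((tune_loops(l, rounds=4)[0], l) for l in candidates)

# Loop-only stage: keep the chosen layout fixed; tune loops thoroughly.
final_cost, final_tiling = tune_loops(best_layout, rounds=8)
print(best_layout, final_cost)
```

Only the joint stage ever rebuilds a loop space; the loop-only stage explores a stable space, so search knowledge accumulated there remains valid.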
To avoid the search space explosion due to the combination of layout and loop tuning, we prune the space at two levels. First, we only build layout transformation spaces for tensors accessed by complex operators. In this work, we take convolutions and general matrix multiplication as complex operators, whose performance is layout sensitive. For other tensors, we further exploit the layout propagation mechanism to propagate the searched layouts onto them without additional searching. Second, we identify a promising subspace by tailoring a tuning template for each tensor accessed by complex operators. These templates are constructed based on our analysis of layout optimization, considering both operator and hardware characteristics.

By addressing these challenges, ALT automatically achieves joint and efficient graph-level data layout optimization and operator-level loop optimization.
We comprehensively evaluate ALT on Intel CPU, NVIDIA GPU, and ARM CPU. Compared with state-of-the-art vendor libraries (e.g., MKL-DNN [33], cuDNN [12], and XNNPACK [27]) and auto-tuning frameworks (e.g., Ansor [83]), ALT achieves an average of 1.5× speedup in terms of single operator performance, and 1.4× speedup in terms of end-to-end inference performance. Our evaluation also shows that ALT can find data layouts that are not explored in prior arts. Additionally, we have deployed ALT in production environments for four months, boosting a broad spectrum of real workloads (e.g., speech recognition and super resolution).
In summary, we make the following contributions:
Figure 1: C2D latency with different data layouts on different hardware platforms. (Three panels: (a) C2D on Intel CPU, (b) C2D on NVIDIA GPU, (c) C2D on ARM CPU; each plots latency in ms on a log scale against operator configurations, comparing the NOHW, NHWO, and HWON layouts.)
- We reveal the necessity of joint graph- and operator-level optimizations for deep learning compilation, and show that the root cause of the inefficient unidirectional and one-off optimization flow in prior arts lies in the high cost of layout manipulation.
- We design an easy-to-use generic infrastructure that covers a rich layout transformation space. It allows users to manipulate layouts without manual re-implementation, and, via the layout propagation mechanism, without extra overhead during end-to-end optimization.
- We devise a joint layout and loop auto-tuning framework. Via effective space pruning and a judicious exploration design, it not only achieves a bidirectional and unified optimization flow but also guarantees tuning efficiency.
- Our extensive evaluation shows that, without human interference, ALT improves performance significantly over state-of-the-art baselines, which also verifies the effectiveness of the proposed techniques.
2 BACKGROUND AND MOTIVATION
A deep compiler typically compiles a neural network with multi-stage lowering and optimization. The compiler takes as input a model that can be generated by other frameworks (e.g., TensorFlow [1]). It then resolves the model into a computational graph, where operators and tensors are represented as nodes and edges, respectively.
Data layout optimization [4, 14, 36, 56, 61, 63, 69] rewrites the tensor storage format (i.e., the attributes of an edge) to alleviate memory accessing overhead for operators that access the tensor. Thus, data layout optimization is often classified as graph-level optimization. The storage format refers to the arrangement of tensor dimensions. Take the 2-D convolution (C2D) operator as an example. Popular data layouts for the output tensor of C2D include NOHW, NHWO, and HWON, where N, O, H, W represent the batch size, the number of output channels, the output tensor height, and the output tensor width, respectively. NOHW is widely used on GPU [53], NHWO is the default layout on CPU in TensorFlow [1], and HWON is used in digital signal processing.
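To make the storage-format difference concrete, the following pure-Python sketch (hypothetical helper `flatten`, tiny example sizes) lays out the same logical 4-D tensor linearly in NOHW versus NHWO order. The logical values are identical; only their order in memory changes, which is exactly what layout optimization manipulates.

```python
# Same logical tensor, two storage formats. Each element encodes its own
# logical index (n, o, h, w) as n*1000 + o*100 + h*10 + w for readability.
N, O, H, W = 1, 2, 2, 2
logical = {(n, o, h, w): n * 1000 + o * 100 + h * 10 + w
           for n in range(N) for o in range(O)
           for h in range(H) for w in range(W)}

def flatten(order_dims):
    """Lay the tensor out linearly following the given dimension order."""
    sizes = {"N": N, "O": O, "H": H, "W": W}
    pos = {"N": 0, "O": 1, "H": 2, "W": 3}   # position in the logical index
    out = []
    for a in range(sizes[order_dims[0]]):
        for b in range(sizes[order_dims[1]]):
            for c in range(sizes[order_dims[2]]):
                for d in range(sizes[order_dims[3]]):
                    idx = [0] * 4
                    for dim, v in zip(order_dims, (a, b, c, d)):
                        idx[pos[dim]] = v
                    out.append(logical[tuple(idx)])
    return out

print(flatten("NOHW"))   # channels outer: [0, 1, 10, 11, 100, 101, 110, 111]
print(flatten("NHWO"))   # channels inner: [0, 100, 1, 101, 10, 110, 11, 111]
```

An operator that sweeps channels innermost gets contiguous accesses under NHWO but strided accesses under NOHW, which is why the best choice depends on the operator and the hardware.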
After graph-level optimization, the compiler lowers each node in the computational graph to an operator-level representation. An operator can typically be represented as deeply nested loops. As the major part of operator-level optimization, loop optimization (e.g., loop tiling, vectorization, etc.) [7, 10, 26, 28, 55] transforms the loop nest to schedule the execution of the statements of each operator.
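As a minimal example of such a transformation, the sketch below applies loop tiling to a 2-D reduction (sizes are arbitrary illustrations, not taken from the paper): the tiled nest computes the same result as the naive nest but visits data in fixed-size blocks, the usual way tiling improves cache locality.

```python
# Loop tiling: split each spatial loop into an outer tile loop and an
# inner intra-tile loop. H and W are assumed divisible by the tile size T.
H, W, T = 8, 8, 4
data = [[h * W + w for w in range(W)] for h in range(H)]

# Naive nest: one pass over all (h, w).
naive = sum(data[h][w] for h in range(H) for w in range(W))

# Tiled nest: same iterations, reordered into T x T blocks.
tiled = 0
for h0 in range(0, H, T):              # outer tile loops
    for w0 in range(0, W, T):
        for h in range(h0, h0 + T):    # inner intra-tile loops
            for w in range(w0, w0 + T):
                tiled += data[h][w]

print(tiled == naive)   # True: tiling only reorders the schedule
```

Schedules like this reorder execution without changing the computed result, which is what makes them safe for an auto-tuner to search over.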
The motivation for this work is as follows.

Observation 1: It is beneficial to jointly perform data layout optimization and loop optimization. We illustrate the benefits by an experiment that optimizes loops of C2D based on the NOHW, NHWO, and HWON layouts, respectively. Our platforms include a 32-core Intel Xeon Silver 4110 CPU@2.1GHz, an NVIDIA RTX 2080Ti GPU, and a Kirin 990 ARM SoC. We report the performance in Fig. 1, where the latency axis is in log scale, and each hardware platform involves multiple operator configurations (different numbers of channels, convolutional strides, etc.) to cover rich workloads. We observe that the best layout could improve the performance of loop optimization by 55.9%, 87.2%, and 48.8% on average on the Intel CPU, NVIDIA GPU, and ARM CPU, respectively. Conversely, making a choice among different layouts is not easy when there is no feedback from loop optimization, due to the highly divergent performance with regard to operator configurations and platforms. For example, although NHWO often outperforms NOHW and HWON on CPUs, especially when the number of input channels is small, there is still no clear rule that fits all configurations.

Figure 2: Layout with overlapped tiling. (The figure shows an input tensor and an output tensor tiled along the spatial dimensions; input tiles of size H + (KH - 1) by W + (KW - 1) overlap by KH - 1, with tile stride H.)
Observation 2: Existing solutions cannot effectively perform joint tuning due to the high cost of layout manipulation. Existing systems [6, 10, 70] typically couple the tensor storage with the implementation of operators, so changing layouts requires re-implementation. Such a high cost of layout manipulation limits the number of layout choices that can be explored, and further leads to the unidirectional optimization flow. While there are works using special layouts to improve versatility, e.g., NO_ot HWo_t, where o_t is a tiling parameter that can be changed without re-implementation [44], they still only cover a small layout optimization space. Moreover, switching to another category of layouts still requires re-implementing operators and even rewriting loop-tuning templates.
We use a more versatile layout as a motivating example. This layout is outside the tuning space of NO_ot HWo_t and is hard to discover manually or without joint tuning. It achieves a performance improvement of 32.4% over NO_ot HWo_t. Besides tiling the channel dimension, this layout further tiles the spatial dimensions (the height and the width) of the output tensor into four blocks.
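Overlapped spatial tiling of this kind (cf. Figure 2) can be sketched in one dimension. The sizes below are illustrative assumptions: a spatial axis of length H is packed into tiles of stride TH, and each packed tile carries KH - 1 extra halo elements so that a convolution with kernel height KH can compute an entire output tile from a single contiguous block.

```python
# Overlapped tiling along one spatial axis: tiles of stride TH, each
# extended by the convolution halo KH - 1, so adjacent packed tiles
# share KH - 1 input elements. H is assumed divisible by TH.
H, KH, TH = 8, 3, 4
x = list(range(H + KH - 1))   # input axis, already padded for the kernel

tiles = [x[h0 : h0 + TH + KH - 1] for h0 in range(0, H, TH)]
print(tiles)   # [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
```

The duplication ([4, 5] appears in both tiles) is the "tile overlap" in Figure 2; it trades a small amount of extra storage for strictly local, contiguous reads per output block.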