ALT: Boosting Deep Learning Performance by Breaking the Wall
between Graph and Operator Level Optimizations
Zhiying Xu
zyxu@smail.nju.edu.cn
Nanjing University
China
Jiafan Xu
mf20330099@smail.nju.edu.cn
Nanjing University
China
Hongding Peng
mg20330044@smail.nju.edu.cn
Nanjing University
China
Wei Wang
ww@nju.edu.cn
Nanjing University
China
Xiaoliang Wang
waxili@nju.edu.cn
Nanjing University
China
Haoran Wan
wanhr@smail.nju.edu.cn
Nanjing University
China
Haipeng Dai
haipengdai@nju.edu.cn
Nanjing University
China
Yixu Xu
xuyixu@huawei.com
Huawei Technologies
China
Hao Cheng
chenghao49@hisilicon.com
Huawei Technologies
China
Kun Wang
kun.wang1981@gmail.com
The Hong Kong Polytechnic
University
China
Guihai Chen
gchen@nju.edu.cn
Nanjing University
China
ABSTRACT
Deep learning models rely on highly optimized tensor libraries for efficient inference on heterogeneous hardware. Current deep compilers typically predetermine layouts of tensors and then optimize loops of operators. However, such a unidirectional and one-off workflow strictly separates graph-level optimization and operator-level optimization into different system layers, missing opportunities for unified tuning.
This paper proposes ALT, a compiler that performs joint graph- and operator-level optimizations for deep models. ALT provides a generic transformation module to manipulate layouts and loops with easy-to-use primitive functions. ALT further integrates an auto-tuning module that jointly optimizes graph-level data layouts and operator-level loops while guaranteeing efficiency. Experimental results show that ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single operator performance (e.g., 1.5× speedup on average) and end-to-end inference performance (e.g., 1.4× speedup on average).
1 INTRODUCTION
Deep learning has become one of the essential building blocks for emerging applications, such as machine translation and autonomous driving systems. To provide ubiquitous services, developers craft high-performance programs supporting various tensor operators (e.g., 2-D convolution and matrix multiplication) on different hardware platforms (e.g., NVIDIA GPU and ARM CPU). However, current vendor libraries (e.g., MKL-DNN [33] and cuDNN [12]) typically demand significant engineering effort on manual optimization. Moreover, the hand-tuning approach can hardly catch up with the fast evolution of deep learning techniques that constantly introduce new tensor operators [32] and new hardware (e.g., neural processing units). Therefore, researchers develop deep compilers [6, 10, 39, 70, 81] to achieve automatic performance optimization via auto-tuning and code generation techniques.
Two key categories of optimizations during compilation are graph-level optimization and operator-level optimization. Graph-level optimization represents operators as nodes and tensors as edges in a computational graph, and rewrites nodes or edges to obtain a more efficient graph for inference. For instance, data layout optimization rewrites the tensor storage format to improve memory accessing performance [4, 14, 36, 56, 61, 63, 69]. Constant folding [8, 49, 57] and common subexpression elimination [49, 57] remove redundant nodes. Operator-level optimization, which mainly involves loop optimization, transforms the nested loops in the source code of each operator to schedule the execution of instructions [7, 10, 26, 28, 55]. In this work, we focus on data layout optimization and loop optimization because they yield significant performance improvements and their tuning strongly correlates with operator and hardware characteristics.
Unfortunately, existing deep compilers (e.g., TVM [10], Tensor Comprehension [70], Tiramisu [6], AKG [81]) and auto-tuning techniques (e.g., AutoTVM [11], NeoCPU [44], FlexTensor [89], and Ansor [83]) fail to combine data layout and loop optimizations effectively. These systems first predetermine tensor layouts, either manually or by setting a hyper-parameter from a predefined template, and then perform loop optimization based on these layouts. There are three major limitations in this unidirectional and one-off workflow. First, manual layout selection implies that only a limited number of layout choices can be explored, and is hence prone to be suboptimal. Second, altering the tensor layout demands the time-consuming re-implementation of operators that access the tensor. Third, layout optimization and loop optimization are separated into different system layers. Such a strict boundary seriously compromises the performance of the generated tensor programs. For instance, we observe that optimizing loops based on the best of three candidate layouts for 2-D convolutional operators can improve the performance by 55.9% on average on the Intel CPU. Moreover, the performance of a specific data layout is sensitive to operator configurations (e.g., tensor shapes) and hardware. Thus it is hard to determine data layouts for each workload without feedback from loop optimization.

arXiv:2210.12415v4 [cs.LG] 29 Oct 2022
This paper proposes ALT, a deep compiler that jointly performs graph-level and operator-level optimizations for deep models. The design of ALT originates from the following insight: graph-level data layout optimization and operator-level loop optimization could benefit from each other. Meanwhile, the root cause of the inability to perform cross-layer joint tuning is the coupling between data storage and operator implementation in prior arts, such that altering the data layout requires re-implementing operators. This high cost of changing layouts further leads to the unidirectional and one-off optimization flow. Therefore, ALT abstracts layout manipulation into easy-to-use primitive functions, such that the task of re-implementing operators can be delegated to a compilation pass without human interference. After reducing this cost, ALT further incorporates layout and loop optimizations into a unified auto-tuning framework, which breaks the boundary between graph- and operator-level optimizations to open new opportunities.
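The decoupling idea can be illustrated with a small sketch. This is not ALT's actual API: `Tensor`, `permute`, and `physical_offset` are hypothetical names, and a real compilation pass would rewrite access functions in the IR rather than compute offsets at runtime. The point is that once a layout is just an index mapping, "re-implementing" an operator for a new layout reduces to routing its accesses through that mapping.

```python
class Tensor:
    """A tensor whose logical indexing is decoupled from its physical layout."""
    def __init__(self, shape, layout=None):
        self.shape = tuple(shape)                          # logical shape, e.g. (N, O, H, W)
        self.layout = layout or tuple(range(len(shape)))   # permutation of dimensions

def permute(t, order):
    """Layout primitive: reorder storage dimensions; operators are untouched."""
    return Tensor(t.shape, tuple(t.layout[i] for i in order))

def physical_offset(t, logical_idx):
    """The access function a compilation pass would synthesize: map logical
    indices through the layout to a linear offset in physical storage."""
    phys_shape = [t.shape[d] for d in t.layout]
    phys_idx = [logical_idx[d] for d in t.layout]
    strides = [1] * len(phys_shape)                        # row-major strides
    for i in range(len(phys_shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * phys_shape[i + 1]
    return sum(i * s for i, s in zip(phys_idx, strides))

# An NOHW tensor of shape (2, 4, 8, 8); switch its storage to NHWO.
x = Tensor((2, 4, 8, 8))            # default storage order: N, O, H, W
y = permute(x, (0, 2, 3, 1))        # storage order becomes N, H, W, O
# Same logical element (n=1, o=3, h=2, w=5), different physical offsets:
print(physical_offset(x, (1, 3, 2, 5)))   # 469
print(physical_offset(y, (1, 3, 2, 5)))   # 343
```

Operators keep indexing logically; only the synthesized access function changes when a layout primitive is applied.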
Achieving these goals is not trivial; we need to address the following challenges.
Challenge 1: How can we eliminate the overhead of layout transformation? We discover two types of potential overhead when altering tensor layouts: layout-conversion overhead and fusion-conflict overhead. First, operators along the data stream may require different tensor layouts to reach their optimal performance. To transform the layouts of tensors produced by other operators at runtime, directly inserting conversion operators will incur extra overhead on data movements. Second, altering the output tensor layout of an operator requires the reconstruction of its loop nest. Such reconstruction may inhibit the operator from being fused with its consumer; operator fusion is an important loop optimization technique for improving inter-operator data locality.
Challenge 2: How can we prevent inefficiency due to search space reconstruction during joint tuning? Changing the output layout of an operator induces loop nest reconstruction, which in turn changes the loop tuning space. For joint tuning, such space variation prohibits a direct iterative exploration: points searched in the last iteration may be invalid in the changed space. This leads to inefficient tuning for most search methods, including genetic and learning-based algorithms, since the accumulated knowledge of the search space structure cannot be further exploited in the newly reconstructed space.
Challenge 3: How can we improve efficiency given the search space explosion from combining layout and loop tuning? With joint tuning, the combined search space becomes tremendously large, and hence inefficient to explore directly. For example, in a typical 2-D convolutional operator, the loop transformation space can contain up to O(10^7) points for its seven nested loops. After combining the layout transformation, the joint search space can be at a scale of O(10^19), considering three tensors, each of which further involves four dimensions. Moreover, end-to-end optimization is more challenging due to the inter-dependency of many operators and tensors.
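A back-of-the-envelope calculation shows how these magnitudes arise. The candidate counts below are illustrative assumptions for the sketch, not figures taken from ALT's implementation: roughly 10 tiling choices per loop, and per 4-D tensor, all 4! dimension orders times an assumed few hundred tiling/packing variants.

```python
import math

# Loop space: 7 nested loops, ~10 candidate tile sizes each (assumed).
loop_space = 10 ** 7

# Layout space per 4-D tensor: 4! dimension orders times an assumed
# ~400 tiling/packing variants per order -> roughly 10^4 per tensor.
per_tensor_layouts = math.factorial(4) * 400

# Three tensors (e.g., input, weight, output) combined with the loop space.
joint_space = loop_space * per_tensor_layouts ** 3
print(f"{joint_space:.1e}")   # on the order of 10^19
```

Under these assumptions the joint space lands near the O(10^19) scale quoted above, which is why pruning and a staged exploration are needed.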
To eliminate the two types of overhead brought by layout transformation, we propose a layout propagation mechanism. Suppose we have chosen a different layout for the input tensor of an operator. We let the upstream operator, which is the producer of this tensor, directly yield elements based on the new layout, so no conversion operator is required. To promote operator fusion, we propagate the new layout downstream along the computational graph to let the consumer operator trigger the same loop reconstruction, which helps to align the loop nests of multiple operators for fusion. As such, we can safely transform data layouts with minimal overhead, and without sabotaging loop optimization.
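The mechanism can be sketched on a toy chain-structured graph. The `Node` structure and `propagate` function here are hypothetical simplifications (a real IR would track multiple inputs, fusion groups, and per-edge layouts); the sketch only shows the two moves described above: the producer yields the new layout directly, and the layout is pushed to downstream consumers so their loop nests stay aligned.

```python
class Node:
    """Toy graph node: produces one tensor consumed by `consumers`."""
    def __init__(self, name):
        self.name = name
        self.out_layout = "NOHW"   # layout of the tensor this node produces
        self.consumers = []

def propagate(node, new_layout):
    """Adopt `new_layout` for node's output: instead of inserting a layout
    conversion operator, the producer yields the new layout directly, and
    the layout is propagated downstream so fusible consumers stay aligned."""
    node.out_layout = new_layout
    for c in node.consumers:
        propagate(c, new_layout)

conv = Node("conv2d")
relu = Node("relu")
add = Node("add")
conv.consumers = [relu]
relu.consumers = [add]

propagate(conv, "NHWO")
print([n.out_layout for n in (conv, relu, add)])   # all aligned, no converts
```

Because every node along the chain ends up with the same layout, no data-movement operator is inserted and conv2d can still be fused with relu.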
To alleviate the search space reconstruction issue in co-tuning, our solution is twofold. First, we divide the co-tuning into two stages: a joint stage and a loop-only stage. The joint stage searches for optimal tensor layouts, while the loop-only stage only performs loop tuning, with the searched layouts remaining unchanged. Second, we propose a cross-exploration architecture for the joint stage, rather than direct exploration. For each new feasible layout, we reconstruct the loop space and then perform multiple rounds of loop optimization to assess the layout. This design avoids inefficient loop space reconstruction, since the loop-only stage keeps layouts unchanged. It also achieves the expected bidirectional and unified tuning flow in the joint stage, because each candidate layout is evaluated based on feedback from loop optimization through our novel tuning architecture.
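The two-stage flow can be sketched as follows. Everything here is a stand-in: `measure` is a synthetic cost model (real tuning measures on hardware), and the candidate layouts and tiling range are arbitrary. What the sketch preserves is the structure: the joint stage scores each layout by running a few rounds of loop tuning inside that layout's reconstructed loop space, then the loop-only stage tunes loops thoroughly with the winning layout frozen.

```python
import random
random.seed(0)

def measure(layout, tiling):
    """Stand-in for on-device measurement: a synthetic cost whose optimum
    depends on the layout, so choosing a layout needs loop-tuning feedback."""
    target = {"NOHW": 4, "NHWO": 7, "HWON": 2}[layout]
    base = {"NOHW": 1.0, "NHWO": 0.2, "HWON": 0.8}[layout]
    return abs(tiling - target) + base

def tune_loops(layout, rounds):
    """Loop-only tuning inside a fixed layout's (reconstructed) loop space."""
    return min((measure(layout, t), t)
               for t in random.sample(range(1, 9), rounds))   # (cost, tiling)

# Joint stage: score each candidate layout via a few rounds of loop tuning.
candidates = ["NOHW", "NHWO", "HWON"]
_, best_layout = min((tune_loops(l, rounds=4)[0], l) for l in candidates)

# Loop-only stage: keep the chosen layout fixed; tune loops thoroughly.
final_cost, final_tiling = tune_loops(best_layout, rounds=8)
print(best_layout, final_cost)
```

Only the joint stage ever rebuilds a loop space; the loop-only stage explores a stable space, so search knowledge accumulated there remains valid.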
To avoid the search space explosion due to the combination of layout and loop tuning, we prune the space at two levels. First, we only build layout transformation spaces for tensors accessed by complex operators. In this work, we take convolutions and general matrix multiplication as complex operators, whose performance is layout sensitive. For other tensors, we further exploit the layout propagation mechanism to propagate the searched layouts onto them without additional searching. Second, we identify a promising subspace by tailoring a tuning template for each tensor accessed by complex operators. These templates are constructed based on our analysis of layout optimization, considering both operator and hardware characteristics.

By addressing these challenges, ALT automatically achieves joint and efficient graph-level data layout optimization and operator-level loop optimization.
We comprehensively evaluate ALT on Intel CPU, NVIDIA GPU, and ARM CPU. Compared with state-of-the-art vendor libraries (e.g., MKL-DNN [33], cuDNN [12], and XNNPACK [27]) and auto-tuning frameworks (e.g., Ansor [83]), ALT achieves an average of 1.5× speedup in terms of single operator performance, and 1.4× speedup in terms of end-to-end inference performance. Our evaluation also shows that ALT can find data layouts that are not explored in prior arts. Additionally, we have deployed ALT in production environments for four months, boosting a broad spectrum of real workloads (e.g., speech recognition and super resolution).
In summary, we make the following contributions:
Figure 1: C2D latency with different data layouts on different hardware platforms. (Three panels: (a) C2D on Intel CPU, (b) C2D on NVIDIA GPU, (c) C2D on ARM CPU; each plots latency in ms on a log scale against operator configurations, comparing the NOHW, NHWO, and HWON layouts.)
- We reveal the necessity of joint graph- and operator-level optimizations for deep learning compilation, and show that the root cause of the inefficient unidirectional and one-off optimization flow in prior arts lies in the high cost of layout manipulation.
- We design an easy-to-use generic infrastructure that covers a rich layout transformation space. It allows users to manipulate layouts without manual re-implementation, and, via the layout propagation mechanism, without extra overhead during end-to-end optimization.
- We devise a joint layout and loop auto-tuning framework. Via effective space pruning and a judicious exploration design, it not only achieves a bidirectional and unified optimization flow but also guarantees tuning efficiency.
- Our extensive evaluation shows that, without human interference, ALT improves performance significantly over state-of-the-art baselines, which also verifies the effectiveness of the proposed techniques.
2 BACKGROUND AND MOTIVATION
A deep compiler typically compiles a neural network with multi-stage lowering and optimization. The compiler takes as input a model that can be generated by other frameworks (e.g., TensorFlow [1]). It then resolves the model into a computational graph, where operators and tensors are represented as nodes and edges, respectively.
Data layout optimization [4, 14, 36, 56, 61, 63, 69] rewrites the tensor storage format (i.e., the attributes of an edge) to alleviate memory accessing overhead for operators that access the tensor. Thus, data layout optimization is often classified as graph-level optimization. The storage format refers to the arrangement of tensor dimensions. Take the 2-D convolution (C2D) operator as an example. Popular data layouts for the output tensor of C2D include NOHW, NHWO, and HWON, where N, O, H, W represent the batch size, the number of output channels, the output tensor height, and the output tensor width, respectively. NOHW is widely used on GPU [53], NHWO is the default layout on CPU in TensorFlow [1], and HWON is used in digital signal processing.
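To make the storage-format difference concrete, the following pure-Python sketch (hypothetical helper `flatten`, tiny example sizes) lays out the same logical 4-D tensor linearly in NOHW versus NHWO order. The logical values are identical; only their order in memory changes, which is exactly what layout optimization manipulates.

```python
# Same logical tensor, two storage formats. Each element encodes its own
# logical index (n, o, h, w) as n*1000 + o*100 + h*10 + w for readability.
N, O, H, W = 1, 2, 2, 2
logical = {(n, o, h, w): n * 1000 + o * 100 + h * 10 + w
           for n in range(N) for o in range(O)
           for h in range(H) for w in range(W)}

def flatten(order_dims):
    """Lay the tensor out linearly following the given dimension order."""
    sizes = {"N": N, "O": O, "H": H, "W": W}
    pos = {"N": 0, "O": 1, "H": 2, "W": 3}   # position in the logical index
    out = []
    for a in range(sizes[order_dims[0]]):
        for b in range(sizes[order_dims[1]]):
            for c in range(sizes[order_dims[2]]):
                for d in range(sizes[order_dims[3]]):
                    idx = [0] * 4
                    for dim, v in zip(order_dims, (a, b, c, d)):
                        idx[pos[dim]] = v
                    out.append(logical[tuple(idx)])
    return out

print(flatten("NOHW"))   # channels outer: [0, 1, 10, 11, 100, 101, 110, 111]
print(flatten("NHWO"))   # channels inner: [0, 100, 1, 101, 10, 110, 11, 111]
```

An operator that sweeps channels innermost gets contiguous accesses under NHWO but strided accesses under NOHW, which is why the best choice depends on the operator and the hardware.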
After graph-level optimization, the compiler lowers each node in the computational graph to an operator-level representation. An operator can typically be represented as deeply nested loops. As the major part of operator-level optimization, loop optimization (e.g., loop tiling, vectorization, etc.) [7, 10, 26, 28, 55] transforms the loop nest to schedule the execution of the statements of each operator.
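As a minimal example of such a transformation, the sketch below applies loop tiling to a 2-D reduction (sizes are arbitrary illustrations, not taken from the paper): the tiled nest computes the same result as the naive nest but visits data in fixed-size blocks, the usual way tiling improves cache locality.

```python
# Loop tiling: split each spatial loop into an outer tile loop and an
# inner intra-tile loop. H and W are assumed divisible by the tile size T.
H, W, T = 8, 8, 4
data = [[h * W + w for w in range(W)] for h in range(H)]

# Naive nest: one pass over all (h, w).
naive = sum(data[h][w] for h in range(H) for w in range(W))

# Tiled nest: same iterations, reordered into T x T blocks.
tiled = 0
for h0 in range(0, H, T):              # outer tile loops
    for w0 in range(0, W, T):
        for h in range(h0, h0 + T):    # inner intra-tile loops
            for w in range(w0, w0 + T):
                tiled += data[h][w]

print(tiled == naive)   # True: tiling only reorders the schedule
```

Schedules like this reorder execution without changing the computed result, which is what makes them safe for an auto-tuner to search over.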
The motivation for this work is as follows.

Observation 1: It is beneficial to jointly perform data layout optimization and loop optimization. We illustrate the benefits by an experiment that optimizes loops of C2D based on the NOHW, NHWO, and HWON layouts, respectively. Our platforms include a 32-core Intel Xeon Silver 4110 CPU@2.1GHz, an NVIDIA RTX 2080Ti GPU, and a Kirin 990 ARM SoC. We report the performance in Fig. 1, where the latency axis is in log scale, and each hardware platform involves multiple operator configurations (different numbers of channels, convolutional strides, etc.) to cover rich workloads. We observe that the best layout could improve the performance of loop optimization by 55.9%, 87.2%, and 48.8% on average on the Intel CPU, NVIDIA GPU, and ARM CPU, respectively. Conversely, making a choice among different layouts is not easy when there is no feedback from loop optimization, due to the highly divergent performance with regard to operator configurations and platforms. For example, although NHWO often outperforms NOHW and HWON on CPUs, especially when the number of input channels is small, there is still no clear rule that fits all configurations.

Figure 2: Layout with overlapped tiling. (The figure shows an input tensor and an output tensor tiled along the spatial dimensions; input tiles of size H + (KH - 1) by W + (KW - 1) overlap by KH - 1, with tile stride H.)
Observation 2: Existing solutions cannot effectively perform joint tuning due to the high cost of layout manipulation. Existing systems [6, 10, 70] typically couple the tensor storage with the implementation of operators, so changing layouts requires re-implementation. Such a high cost of layout manipulation limits the number of layout choices that can be explored, and further leads to the unidirectional optimization flow. While there are works using special layouts to improve versatility, e.g., NO_ot HWo_t, where o_t is a tiling parameter that can be changed without re-implementation [44], they still only cover a small layout optimization space. Moreover, switching to another category of layouts still requires re-implementing operators and even rewriting loop-tuning templates.
We use a more versatile layout as a motivating example. This layout is outside the tuning space of NO_ot HWo_t and is hard to discover manually or without joint tuning. It achieves a performance improvement of 32.4% over NO_ot HWo_t. Besides tiling the channel dimension, this layout further tiles the spatial dimensions (the height and the width) of the output tensor into four blocks.
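Overlapped spatial tiling of this kind (cf. Figure 2) can be sketched in one dimension. The sizes below are illustrative assumptions: a spatial axis of length H is packed into tiles of stride TH, and each packed tile carries KH - 1 extra halo elements so that a convolution with kernel height KH can compute an entire output tile from a single contiguous block.

```python
# Overlapped tiling along one spatial axis: tiles of stride TH, each
# extended by the convolution halo KH - 1, so adjacent packed tiles
# share KH - 1 input elements. H is assumed divisible by TH.
H, KH, TH = 8, 3, 4
x = list(range(H + KH - 1))   # input axis, already padded for the kernel

tiles = [x[h0 : h0 + TH + KH - 1] for h0 in range(0, H, TH)]
print(tiles)   # [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
```

The duplication ([4, 5] appears in both tiles) is the "tile overlap" in Figure 2; it trades a small amount of extra storage for strictly local, contiguous reads per output block.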