tensor. Third, layout optimization and loop optimization are separated into different system layers. Such a strict boundary seriously compromises the performance of the generated tensor programs. For instance, we observe that optimizing loops based on the best of three candidate layouts for 2-D convolutional operators can improve performance by 55.9% on average on the Intel CPU. Moreover, the performance of a specific data layout is sensitive to operator configurations (e.g., tensor shapes) and hardware. Thus it is hard to determine data layouts for each workload without feedback from loop optimization.
This paper proposes ALT, a deep compiler that jointly performs graph-level and operator-level optimizations for deep models. The design of ALT originates from the following insight: graph-level data layout optimization and operator-level loop optimization could benefit from each other. Meanwhile, the root cause of the inability to perform cross-layer joint tuning is the coupling between data storage and operator implementation in prior art, such that altering the data layout requires re-implementing operators. Such a high cost of changing layouts further leads to the unidirectional and one-off optimization flow. Therefore, ALT abstracts layout manipulation into easy-to-use primitive functions, such that the task of re-implementing operators can be delegated to a compilation pass without human interference. After reducing this cost, ALT further incorporates layout and loop optimizations into a unified auto-tuning framework, which breaks the boundary between graph- and operator-level optimizations to open new opportunities.
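To make this abstraction concrete, the following Python sketch illustrates what such layout-manipulation primitives could look like; the class and method names (TensorLayout, split_dim, reorder_dims) are our own illustration rather than ALT's actual API.

    # Hedged sketch: hypothetical layout-manipulation primitives in the spirit
    # of the design described above; names are illustrative, not ALT's API.

    class TensorLayout:
        """Describes a tensor's storage as an ordered list of dimension names."""
        def __init__(self, dims):
            self.dims = list(dims)          # e.g. ["N", "C", "H", "W"]

        def split_dim(self, dim, factor):
            """Split one logical dimension into an outer and a packed inner part,
            e.g. C -> (C_out, c_in) to obtain an NCHWc-style packed layout."""
            i = self.dims.index(dim)
            new = TensorLayout(self.dims)
            new.dims[i:i + 1] = [f"{dim}_out", f"{dim.lower()}_in{factor}"]
            return new

        def reorder_dims(self, order):
            """Permute the storage order of the (possibly split) dimensions."""
            assert sorted(order) == sorted(self.dims)
            return TensorLayout(order)

    # The user (or the auto-tuner) only declares the desired layout; a
    # compilation pass would rewrite every operator that reads or writes the
    # tensor so its access functions match -- no manual re-implementation.
    nchw = TensorLayout(["N", "C", "H", "W"])
    packed = nchw.split_dim("C", 8).reorder_dims(["N", "C_out", "H", "W", "c_in8"])
    print(packed.dims)   # ['N', 'C_out', 'H', 'W', 'c_in8']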
Achieving these goals is not trivial. We need to address the following challenges.
Challenge 1: How can we eliminate the overhead of layout transformation? We discover two types of potential overhead when altering tensor layouts: layout-conversion overhead and fusion-conflict overhead. First, operators along the data stream may require different tensor layouts to reach their optimal performance. Directly inserting conversion operators to transform, at runtime, the layouts of tensors produced by other operators incurs extra data-movement overhead. Second, altering the output tensor layout of an operator requires reconstructing its loop nest. Such reconstruction may inhibit the operator from being fused with its consumer, and operator fusion is an important loop optimization technique for improving inter-operator data locality.
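As a concrete illustration of the first type of overhead, the NumPy sketch below (our own example, not ALT code) shows the extra repacking pass a naive compiler would insert when a convolution prefers a packed NCHW8c input while its producer emits plain NCHW; the fusion-conflict overhead arises similarly, because the operator's reconstructed loop nest may no longer line up with its consumer's.

    # Hedged illustration of layout-conversion overhead (not ALT code).
    import numpy as np

    def convert_nchw_to_nchw8c(x):
        # x: (N, C, H, W) with C divisible by 8 -> (N, C//8, H, W, 8)
        n, c, h, w = x.shape
        return x.reshape(n, c // 8, 8, h, w).transpose(0, 1, 3, 4, 2).copy()

    relu_out = np.random.rand(1, 64, 56, 56).astype(np.float32)  # producer output (NCHW)
    conv_in = convert_nchw_to_nchw8c(relu_out)  # extra data movement on every inference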
Challenge 2: How can we prevent inefficiency due to search space reconstruction during joint tuning? Changing the output layout of an operator induces loop nest reconstruction, which in turn causes the loop tuning space to vary. For joint tuning, such space variation prohibits direct iterative exploration: points searched in previous iterations may become invalid in the changed space. This leads to inefficient tuning for most search methods, including genetic and learning-based algorithms, since the accumulated knowledge of the search space structure cannot be exploited further in the newly reconstructed space.
Challenge 3: How can we improve efficiency given the search space explosion caused by combining layout and loop tuning? With joint tuning, the combined search space becomes tremendously large, and hence inefficient to explore directly. For example, for a typical 2-D convolutional operator, the loop transformation space can contain up to $O(10^7)$ points for its seven nested loops. After combining the layout transformation, the joint search space can reach a scale of $O(10^{19})$, considering three tensors, each of which further involves four dimensions. Moreover, end-to-end optimization is even more challenging due to the inter-dependency of many operators and tensors.
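To make these magnitudes tangible, here is a hedged back-of-envelope count (our own illustration; the paper's exact enumeration may differ) of how spaces of this scale can arise.

    # Hedged back-of-envelope count, assuming ~10 tiling/unrolling choices per
    # loop and a permutation-plus-packing model of per-tensor layouts.
    import math

    loop_space = 10 ** 7                              # 7 nested loops, ~10 choices each

    # For each of 3 tensors with 4 dimensions: permute the storage order (4!)
    # and pick one of ~5 packing factors per dimension.
    per_tensor_layouts = math.factorial(4) * 5 ** 4   # 24 * 625 = 15,000
    layout_space = per_tensor_layouts ** 3            # ~3.4e12 across 3 tensors

    print(f"joint space ~ {loop_space * layout_space:.1e}")  # ~3.4e+19, i.e. O(10^19)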
To eliminate the two types of overhead brought by layout transformation, we propose a layout propagation mechanism. Suppose we have chosen a different layout for the input tensor of an operator. We let the upstream operator, i.e., the producer of this tensor, directly yield elements in the new layout, so no conversion operator is required. To promote operator fusion, we propagate the new layout downstream along the computational graph to let the consumer operator trigger the same loop reconstruction, which helps align the loop nests of multiple operators for fusion. As such, we can safely transform data layouts with minimal overhead and without sabotaging loop optimization.
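The toy Python sketch below conveys the propagation idea on a small producer-consumer chain; the Op class and propagation rules are simplified illustrations, not ALT's implementation (e.g., upstream propagation here only goes one level).

    # Hedged sketch of layout propagation on a toy computational graph.
    class Op:
        def __init__(self, name, elementwise=False):
            self.name = name
            self.elementwise = elementwise
            self.producers, self.consumers = [], []
            self.out_layout = "NCHW"

    def connect(producer, consumer):
        producer.consumers.append(consumer)
        consumer.producers.append(producer)

    def propagate_layout(op, new_layout):
        """Let op's producers yield the new layout directly (no conversion op)
        and push the layout through element-wise consumers so their rebuilt
        loop nests still line up with op's for fusion."""
        op.out_layout = new_layout
        for p in op.producers:                       # upstream propagation
            p.out_layout = new_layout
        for c in op.consumers:                       # downstream propagation
            if c.elementwise and c.out_layout != new_layout:
                propagate_layout(c, new_layout)

    # Toy pipeline relu -> conv -> bias_add: choosing a packed layout for the
    # conv touches the whole chain without inserting any converter.
    relu, conv, bias = Op("relu", True), Op("conv"), Op("bias_add", True)
    connect(relu, conv)
    connect(conv, bias)
    propagate_layout(conv, "NCHW8c")
    print([op.out_layout for op in (relu, conv, bias)])  # all 'NCHW8c'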
To alleviate the search space reconstruction issue in the co-tuning, our solution is two-fold. First, we divide the co-tuning into two stages: a joint stage and a loop-only stage. The joint stage searches for optimal tensor layouts, while the loop-only stage performs only loop tuning, with the searched layouts remaining unchanged. Second, we propose a cross-exploration architecture for the joint stage, rather than direct exploration. For each new feasible layout, we reconstruct the loop space and then perform multiple rounds of loop optimization to assess the layout. This design avoids inefficient loop space reconstruction since the loop-only stage keeps layouts unchanged. It also achieves the expected bidirectional and unified tuning flow in the joint stage, because each candidate layout is evaluated based on feedback from loop optimization through our novel tuning architecture.
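A hedged sketch of this two-stage flow is shown below; layout sampling, loop-space construction, and measurement are toy stand-ins, not ALT's actual interfaces.

    # Hedged sketch of the two-stage co-tuning flow with toy stand-ins.
    import random

    def sample_layouts(n):                      # toy: layouts are just labels
        return [f"layout_{i}" for i in range(n)]

    def build_loop_space(layout):               # toy: a loop space is a seeded RNG
        return random.Random(hash(layout) & 0xffffffff)

    def tune_loops(space, rounds, best=float("inf")):
        """Toy loop tuner: each round 'measures' one random schedule (lower is faster)."""
        for _ in range(rounds):
            best = min(best, space.random())
        return best

    def co_tune(joint_trials=20, rounds_per_layout=4, loop_only_rounds=200):
        best_layout, best_cost = None, float("inf")
        # Stage 1 (joint): each candidate layout rebuilds its own loop space and
        # is scored by a few rounds of loop tuning inside it (cross-exploration).
        for layout in sample_layouts(joint_trials):
            cost = tune_loops(build_loop_space(layout), rounds_per_layout)
            if cost < best_cost:
                best_layout, best_cost = layout, cost
        # Stage 2 (loop-only): freeze the chosen layout so the loop space stays
        # fixed, and keep tuning loops at length within it.
        best_cost = tune_loops(build_loop_space(best_layout), loop_only_rounds, best_cost)
        return best_layout, best_cost

    print(co_tune())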
To avoid the search space explosion caused by combining layout and loop tuning, we prune the space at two levels. First, we only build layout transformation spaces for tensors accessed by complex operators. In this work, we take convolutions and general matrix multiplication as complex operators, whose performance is layout-sensitive. For other tensors, we exploit the layout propagation mechanism to propagate the searched layouts onto them without further searching. Second, we identify a promising subspace by tailoring a tuning template for each tensor accessed by complex operators. These templates are constructed based on our analysis of layout optimization, considering both operator and hardware characteristics.
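As an illustration of such a template, the sketch below restricts a convolution's weight tensor (OIHW) to packing only the channel dimensions with divisor factors; the template form and factor choices are our assumptions, not the paper's exact templates.

    # Hedged sketch of a pruned layout template for a conv weight tensor (OIHW).
    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    def weight_layout_candidates(O, I, max_pack=64):
        """Instead of enumerating every permutation and split of all four
        dimensions, only pack the reuse- and vectorization-critical channel
        dimensions with hardware-friendly factors (OIHW{i}i{o}o-style)."""
        cands = []
        for o_factor in divisors(O):
            for i_factor in divisors(I):
                if o_factor <= max_pack and i_factor <= max_pack:
                    cands.append(("O_out", "I_out", "H", "W",
                                  f"i{i_factor}", f"o{o_factor}"))
        return cands

    # 49 candidates for O=I=64, versus tens of thousands of unconstrained layouts.
    print(len(weight_layout_candidates(64, 64)))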
By addressing these challenges, ALT automatically achieves joint and efficient graph-level data layout optimization and operator-level loop optimization.
We comprehensively evaluate ALT on Intel CPU, NVIDIA GPU, and ARM CPU. Compared with state-of-the-art vendor libraries (e.g., MKL-DNN [33], cuDNN [12], and XNNPACK [27]) and auto-tuning frameworks (e.g., Ansor [83]), ALT achieves an average of 1.5× speedup in terms of single-operator performance and 1.4× speedup in terms of end-to-end inference performance. Our evaluation also shows that ALT can find data layouts that are not explored in prior art. Additionally, we have deployed ALT in production environments for four months, boosting a broad spectrum of real workloads (e.g., speech recognition and super resolution).
In summary, we make the following contributions: