
near-peak performance on widely used input sizes, as they are able to implement a large spectrum of optimizations in low-level languages (e.g., CUDA C/C++ and assembly code). However, manually tweaking a kernel to optimize for performance is laborious, error-prone, and requires expertise in writing low-level language code. Thus, it is difficult to generalize to other input shapes, new operators, and kernel fusion patterns. In addition, template-based libraries such as CUTLASS [32] employ C++ templates to generate tensor programs for different input shapes on the fly. Although template-based libraries can achieve competitive performance on many input shapes by dynamically tuning the optimization hyper-parameters, they do not reduce the complexity of writing tensor programs for new operators and only provide limited fusion capability (e.g., only a small number of predefined operators can be fused with matrix multiplication).
Alternatively, DL compilers [4, 9, 43, 44, 67] are proposed to compile deep learning networks into tensor programs automatically. Existing state-of-the-art DL compilers adopt the idea of decoupling computation definition and scheduling, originally proposed by Halide [43] and TVM [9]. The computation definition of an operator only defines how each element of the output tensor is computed mathematically, and the schedule defines the way the execution is performed, such as the loop order and thread binding [9, 43]. Compilers leverage schedulers like AutoTVM [11] and Ansor [65] to tune the hyper-parameters of the schedule to optimize operator performance for each input shape. Unlike kernel libraries and templates that target a fixed set of operators and limited fusion patterns, compilers are capable of supporting more operators and more flexible fusion patterns automatically.
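To make the decoupling concrete, the following CUDA-style sketch (an illustration of the effect of scheduling on the generated code, not the scheduling API of Halide or TVM; kernel names are hypothetical) shows the same matrix-multiplication definition realized first as a plain loop nest and then after a schedule has split the i and j loops and bound them to thread blocks and threads:

// Computation definition: C[i][j] = sum_k A[i][k] * B[k][j].
// A direct realization is a plain loop nest (launched with a single thread;
// clearly inefficient, but it states only *what* is computed).
__global__ void matmul_definition(const float *A, const float *B, float *C,
                                  int M, int N, int K) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

// A loop-oriented schedule (split, reorder, bind) decides *how* it is executed:
// here the i and j loops are split and bound to blockIdx/threadIdx, so each
// thread produces one output element. The arithmetic itself is unchanged.
__global__ void matmul_scheduled(const float *A, const float *B, float *C,
                                 int M, int N, int K) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // i split and bound to (block, thread).y
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // j split and bound to (block, thread).x
    if (i < M && j < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; k++)
            acc += A[i * K + k] * B[k * N + j];
        C[i * N + j] = acc;
    }
}

Schedulers such as AutoTVM and Ansor search over choices like the split factors (i.e., the block dimensions) in transformations of this kind.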
However, existing state-of-the-art compilers are mostly based on loop-oriented scheduling primitives, which manipulate the loop structure of a tensor program in a declarative manner (e.g., loop split and reorder). Although loop-oriented scheduling primitives have achieved great success in simplifying tensor program writing [9, 11, 65], certain key optimizations (e.g., double buffering [32]) are hard to implement. Specifically, loop-oriented scheduling primitives cannot express the fine-grained tensor program transformations required by the key optimizations discussed in Section 3.1. Besides, loop-oriented scheduling also suffers from long kernel tuning time due to the rarity of efficient schedules in the tremendous tuning spaces. For instance, AutoTVM [11] takes 15 hours to tune a single CNN model, Inception V3 [50], on a modern GPU.
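To make the expressiveness gap concrete, consider double buffering: the copy of the next data tile from global to shared memory is overlapped with computation on the current tile by rotating between two shared-memory buffers. The transformation changes the loop prologue, the loop body, and the placement of synchronization all at once, which is difficult to state as a sequence of declarative loop rewrites. The following simplified CUDA sketch (a single thread block summing tiles of an array; names and the tile size are chosen only for illustration) shows the pattern:

#define TILE 256   // tile size in elements, assumed equal to blockDim.x

__global__ void sum_tiles_double_buffered(const float *gmem, float *out, int num_tiles) {
    __shared__ float smem[2][TILE];     // two buffers: one consumed, one being filled
    int tid = threadIdx.x;

    smem[0][tid] = gmem[tid];           // prologue: load the first tile
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < num_tiles; t++) {
        int cur = t % 2, nxt = (t + 1) % 2;
        if (t + 1 < num_tiles)
            smem[nxt][tid] = gmem[(t + 1) * TILE + tid];   // prefetch the next tile
        acc += smem[cur][tid];                             // consume the current tile
        __syncthreads();   // prefetch and consumption must both finish before buffers rotate
    }
    out[tid] = acc;
}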
In this work, we propose a new paradigm for writing efficient tensor programs: the task-mapping-oriented programming paradigm. In this paradigm, we define the parallelizable computations in an operator as tasks, and the process of assigning and ordering the tasks onto parallel processing units (e.g., threads) as scheduling. Developers can directly define the scheduling in the tensor program through task mappings¹. This paradigm simplifies the development of tensor programs without sacrificing the ability to express optimizations requiring fine-grained program manipulation. With the in-program style of scheduling, this paradigm also allows us to search the tensor program in an efficient hardware-centric schedule space that is agnostic to input size to dramatically reduce the tuning time. We also propose post-scheduling fusion to fuse the scheduled operator with surrounding operators automatically, so developers do not need to worry about fusion when writing schedule templates.

¹ The name task mapping comes from the abstraction where a scheduling process can be considered as one that maps tasks to processing units in both spatial and temporal dimensions.
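As an illustration of the idea (this is not Hidet's actual syntax, and the numbers are arbitrary), a task mapping can be read directly as index arithmetic inside the kernel: it tells each worker which tasks it owns and in which order it visits them. In the sketch below, a 4x64 grid of tasks is assigned to 128 threads by composing a temporal mapping (each thread handles two rounds) with a spatial mapping (each round covers a 2x64 block of tasks), so every task is claimed by exactly one (thread, round) pair:

// Hypothetical example: 4 x 64 = 256 tasks mapped onto 128 threads.
// Each "task" here simply scales one element of a 4 x 64 array.
__global__ void apply_task_mapping(float *data) {
    int w = threadIdx.x;                       // worker id, 0..127
    for (int r = 0; r < 2; r++) {              // temporal dimension: 2 rounds per worker
        int task_i = r * 2 + w / 64;           // row of the task owned in this round
        int task_j = w % 64;                   // column of the task owned in this round
        data[task_i * 64 + task_j] *= 2.0f;    // perform the task
    }
}

Because the assignment and ordering of tasks is written inside the program rather than derived from loop rewrites outside it, fine-grained transformations (such as the double buffering above) remain directly expressible.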
We implement a new DL compiler called Hidet based on the proposed ideas. In this work, we mainly focus on optimizing DNN inference on GPUs, as the GPU is the most commonly used DNN accelerator; the proposed ideas also apply to other accelerators such as CPUs and TPUs [31]. Extensive experiments on modern convolutional and transformer models show that Hidet outperforms state-of-the-art DL inference frameworks and schedulers, AutoTVM [11] and Ansor [65], by up to 1.48× (1.22× on average) while reducing the tuning time of the two schedulers by 20× and 11×, respectively.
We summarize our contributions as follows:
• We identify and present the limited expressiveness of loop-oriented scheduling adopted by state-of-the-art DL compilers as their fundamental limitation in efficiently compiling complex tensor programs (e.g., matrix multiplication).
• We introduce the task-mapping-oriented programming paradigm to simplify tensor program development without sacrificing the expressiveness of optimizations compared with hand-crafted implementations. Based on this paradigm, we propose post-scheduling fusion to fuse the scheduled program with surrounding operators. The paradigm also allows us to search in the hardware-centric schedule space to reduce the tuning time significantly.
• We implement a new DL compiler, named Hidet, based on the proposed ideas. Extensive experiments show that Hidet outperforms state-of-the-art DL frameworks and compilers by up to 1.48× and reduces tuning time by 11×. We have open-sourced Hidet.
2 BACKGROUND
2.1 CUDA Programming Model
The CUDA programming platform [40] is widely used by deep learning systems on NVIDIA GPUs. In this section, we briefly introduce the CUDA programming model on modern GPUs.
[Figure 1: An overview of the CUDA programming model.]
Kernel, Thread Block, and Thread. When running a workload on the GPU, thousands of threads are executed. Each thread executes the same piece of code, called the kernel code. When launching a kernel, a grid of thread blocks is dispatched onto the GPU, as shown in Figure 1. Each grid usually comprises tens to thousands of thread blocks, while each thread block comprises tens to hundreds of threads. In the kernel code, the predefined variables threadIdx and blockIdx, together with the suffixes x, y, and z, are used to access the 3-dimensional indices of a thread within its block and of a block within the grid.
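As a minimal example of these concepts, the kernel below adds two vectors: each thread uses blockIdx, blockDim, and threadIdx to derive the single element it is responsible for, and the launch configuration selects the grid and block sizes:

#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard threads past the end of the vectors
        c[i] = a[i] + b[i];
}

// Launch a grid of ceil(n / 256) thread blocks with 256 threads each:
//   vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);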