Hidet: Task-Mapping Programming Paradigm for Deep Learning
Tensor Programs
Yaoyao Ding
University of Toronto
Toronto, Canada
yaoyao@cs.toronto.edu
Cody Hao Yu
Amazon Web Services
Santa Clara, USA
hyuz@amazon.com
Bojian Zheng
University of Toronto
Toronto, Canada
bojian@cs.toronto.edu
Yizhi Liu
Amazon Web Services
Santa Clara, USA
yizhiliu@amazon.com
Yida Wang
Amazon Web Services
Santa Clara, USA
wangyida@amazon.com
Gennady Pekhimenko
University of Toronto
Toronto, Canada
pekhimenko@cs.toronto.edu
ABSTRACT
As deep learning models nowadays are widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inference becomes crucial to provide efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the rapidly growing number of operators.

Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations (e.g., double buffering). In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering directly in the tensor programs. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity (e.g., allowing program-statement-level optimizations). We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that allows developers to focus on scheduling every single operator and automates the fusion after scheduling. It greatly reduces the engineering efforts for operator fusion. Our proposed paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces the tuning time.
With the proposed paradigm, we implement a deep learning compiler, Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms the state-of-the-art DNN inference framework ONNX Runtime and the compiler TVM equipped with the schedulers AutoTVM and Ansor by up to 1.48x (1.22x on average). It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. We have open-sourced Hidet at https://www.github.com/hidet-org/hidet.

Part of the work was done while interning at Amazon.
Also with the Vector Institute.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASPLOS '23, March 25-29, 2023, Vancouver, BC, Canada
(c) 2023 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9916-6/23/03.
https://doi.org/10.1145/3575693.3575702
CCS CONCEPTS
• Computing methodologies → Parallel programming languages; Machine learning; Artificial intelligence.
KEYWORDS
deep learning systems, systems for machine learning, programming
models, compilation, tensor computation
ACM Reference Format:
Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '23), March 25-29, 2023, Vancouver, BC, Canada. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3575693.3575702
1 INTRODUCTION
Deep neural networks (DNNs) [35] have achieved state-of-the-art (SOTA) results in various tasks such as image recognition [25, 34, 49, 50], natural language translation [16, 36, 48], and autonomous driving [14]. In deployment environments, these models are repeatedly executed to serve continuous user requests, a workflow known as model serving. Thus, it is crucial to reduce the latency and maximize the throughput of model execution to ensure safety, save energy, and improve the user experience.
There are two major ways to execute a DNN model. (1) Deep learning (DL) frameworks such as TensorFlow [1], PyTorch [41], and ONNX Runtime [15] dispatch operators to kernel libraries such as cuDNN [12], cuBLAS [26], and CUTLASS [32] during execution. (2) DL compilers such as TensorFlow XLA [44] and TVM [9] automatically generate kernels for the given operators through a compilation process. Schedulers such as Ansor [65] and AutoTVM [11] are used to schedule the kernels during compilation to achieve high performance.

Kernel libraries (e.g., cuDNN [12] and cuBLAS [26]) provide a collection of highly optimized hand-crafted kernels (e.g., convolutions and matrix multiplications). These libraries typically achieve
near-peak performance on widely used input sizes, as they are able to implement a large spectrum of optimizations in low-level languages (e.g., CUDA C/C++ and assembly code). However, manually tweaking a kernel for performance is laborious, error-prone, and requires expertise in writing low-level code, so it is difficult to generalize to other input shapes, new operators, and kernel fusion patterns. In addition, template-based libraries such as CUTLASS [32] employ C++ templates to generate tensor programs for different input shapes on the fly. Although template-based libraries can achieve competitive performance on many input shapes by dynamically tuning the optimization hyper-parameters, they do not reduce the complexity of writing tensor programs for new operators and provide only limited fusion capability (e.g., only a small number of predefined operators can be fused with matrix multiplication).
Alternatively, DL compilers [4, 9, 43, 44, 67] have been proposed to compile deep learning networks into tensor programs automatically. Existing state-of-the-art DL compilers adopt the idea of decoupling the computation definition from the schedule, originally proposed by Halide [43] and TVM [9]. The computation definition of an operator only defines how each element of the output tensor is computed mathematically, while the schedule defines how the execution is performed, such as the loop order and thread binding [9, 43]. Compilers leverage schedulers like AutoTVM [11] and Ansor [65] to tune the hyper-parameters of the schedule to optimize operator performance for each input shape. Unlike kernel libraries and templates that target a fixed set of operators and limited fusion patterns, compilers can support more operators and more flexible fusion patterns automatically.
However, existing state-of-the-art compilers are mostly based on loop-oriented scheduling primitives, which manipulate the loop structure of a tensor program in a declarative manner (e.g., loop split and reorder). Although loop-oriented scheduling primitives have achieved great success in simplifying tensor program writing [9, 11, 65], certain key optimizations (e.g., double buffering [32]) are hard to implement. Specifically, loop-oriented scheduling primitives cannot express the fine-grained tensor program transformations required by the key optimizations discussed in Section 3.1. Besides, loop-oriented scheduling also suffers from long kernel tuning time due to the rarity of efficient schedules in the tremendous tuning spaces. For instance, AutoTVM [11] takes 15 hours to tune a single CNN model, Inception V3 [50], on a modern GPU.
In this work, we propose a new paradigm for writing efficient tensor programs: the task-mapping-oriented programming paradigm. In this paradigm, we define the parallelizable computations in an operator as tasks, and the process of assigning and ordering the tasks to parallel processing units (e.g., threads) as scheduling. Developers can directly define the scheduling in the tensor program through task mappings.¹ This paradigm simplifies the development of tensor programs without sacrificing the ability to express optimizations requiring fine-grained program manipulation. With the in-program style of scheduling, this paradigm also allows us to search the tensor program in an efficient hardware-centric schedule space that is agnostic to the input size, dramatically reducing the tuning time. We also propose post-scheduling fusion to fuse the scheduled operator with surrounding operators automatically, so developers do not need to worry about fusion when writing schedule templates.

We implement a new DL compiler called Hidet based on the proposed ideas. In this work, we mainly focus on optimizing DNN inference on GPUs, as the GPU is the most commonly used DNN accelerator; the proposed ideas also apply to other accelerators such as CPUs and TPUs [31]. Extensive experiments on modern convolutional and transformer models show that Hidet outperforms state-of-the-art DL inference frameworks and schedulers, AutoTVM [11] and Ansor [65], by up to 1.48x (1.22x on average) while reducing the tuning time of the two schedulers by 20x and 11x, respectively.

¹ The name task mapping comes from the abstraction in which a scheduling process can be considered as one that maps tasks to processing units in both spatial and temporal dimensions.
We summarize our contributions as follows:
• We identify and present the limited expressiveness of loop-oriented scheduling adopted by state-of-the-art DL compilers as their fundamental limitation in efficiently compiling complex tensor programs (e.g., matrix multiplication).
• We introduce the task-mapping-oriented programming paradigm to simplify tensor program development without sacrificing the expressiveness of optimizations compared with hand-crafted implementations. Based on this paradigm, we propose post-scheduling fusion to fuse the scheduled program with surrounding operators. The paradigm also allows us to search in the hardware-centric schedule space to reduce the tuning time significantly.
• We implement a new DL compiler, named Hidet, based on the proposed ideas. Extensive experiments show that Hidet outperforms state-of-the-art DL frameworks and compilers by up to 1.48x and reduces the tuning time by 11x. We have open-sourced Hidet at https://www.github.com/hidet-org/hidet.
2 BACKGROUND
2.1 CUDA Programming Model
The CUDA programming platform [40] is widely used by deep learning systems on NVIDIA GPUs. In this section, we briefly introduce the CUDA programming model on modern GPUs.
Figure 1: An overview of the CUDA programming model. (A launched kernel dispatches a grid of thread blocks onto the GPU's streaming multiprocessors (SMs); each thread block consists of warps of 32 threads and consumes per-SM shared memory and registers.)
Kernel, Thread Block, and Thread. When running a workload on the GPU, thousands of threads are executed. Each thread executes the same piece of code, called the kernel code. When launching a kernel, a grid of thread blocks is dispatched onto the GPU, as shown in Figure 1. Each grid usually comprises tens to thousands of thread blocks, while each thread block comprises tens to hundreds of threads. In the kernel code, the pre-defined variables threadIdx and blockIdx, together with the suffixes x, y, and z, are used to access the 3-dimensional index of a thread within its thread block and of a thread block within the grid of blocks, respectively.
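To make the indexing concrete, the following minimal CUDA kernel (a sketch of our own, not taken from the paper; the kernel name and sizes are arbitrary) uses these built-in variables to compute the global index of each thread:

// Each thread handles one element of an element-wise addition. Its global
// index is derived from the built-in blockIdx, blockDim, and threadIdx variables.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index (x dimension)
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch with a 1-D grid of 1-D blocks, e.g.: add_kernel<<<(n + 255) / 256, 256>>>(a, b, c, n);

Here only the x dimension is used; kernels that tile 2-D or 3-D data additionally use the y and z components in the same way.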
Hardware Implementation. Each modern GPU has tens to hundreds of streaming multiprocessors (SMs). Each SM supports scheduling up to thousands of concurrent threads [40]. Threads in a thread block are partitioned into warps, and each warp contains 32 consecutive threads executing the same instructions. There are two kinds of programmable on-chip memory: shared memory and registers. Registers are privately allocated to each thread, while shared memory is allocated to each thread block, and only threads in that thread block can access it. When launching a kernel, the thread blocks are dispatched to the SMs wave by wave [23]. Each thread block is dispatched to a single SM, while each SM may hold multiple thread blocks. The maximum number of resident thread blocks per SM is limited by the size of the shared memory, the register file, and the warp scheduling units.
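The kernel fragment below is a hypothetical sketch (not from the paper) that makes these resources visible in code: it declares a per-block shared-memory buffer, derives the warp and lane indices of each thread, and uses a block-wide barrier before reading data written by other threads.

// Assumes the kernel is launched with 256 threads per block, e.g. <<<grid, 256>>>.
__global__ void reverse_block(const float* in, float* out) {
    __shared__ float tile[256];        // shared memory: one buffer per thread block

    int tid  = threadIdx.x;            // index of this thread within its block
    int warp = tid / 32;               // warp index within the block (32 threads per warp)
    int lane = tid % 32;               // lane index within the warp
    (void)warp; (void)lane;            // computed only to illustrate the warp structure

    int base = blockIdx.x * blockDim.x;
    tile[tid] = in[base + tid];        // each thread stages one element (tid, base live in registers)
    __syncthreads();                   // barrier: all loads by the block must finish first

    // Reading another thread's element is possible only because shared memory
    // is visible to the whole thread block.
    out[base + tid] = tile[blockDim.x - 1 - tid];
}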
Operators in a deep neural network are implemented as GPU kernels. When running a neural network, we launch these kernels in an order that satisfies the operator dependencies. Among these operators, matrix multiplication (also known as a linear or dense layer) is one of the most important. We next present an efficient implementation of matrix multiplication using CUDA and take it as an example throughout the paper.
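As a small, self-contained illustration of this execution model (our own sketch; the two "operators" here are toy kernels, not ones used by Hidet), two dependent kernels launched on the same CUDA stream run in launch order, which is how the operator dependency is respected:

#include <cuda_runtime.h>

__global__ void scale_kernel(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                        // producer: scales the tensor in place
}

__global__ void relu_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);         // consumer: element-wise ReLU
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    // Kernels enqueued on the same (default) stream execute in launch order,
    // so relu_kernel reads what scale_kernel produced.
    scale_kernel<<<(n + 255) / 256, 256>>>(x, -2.0f, n);
    relu_kernel<<<(n + 255) / 256, 256>>>(x, n);

    cudaDeviceSynchronize();                     // wait for both kernels to complete
    cudaFree(x);
    return 0;
}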
2.2 Efficient Matrix Multiplication
This section illustrates an efficient implementation of matrix multiplication C = AB (all matrices are 1024 x 1024) on modern NVIDIA GPUs via Tensor Cores [13]. Figure 2 shows the desired workflow. In step ①, we decompose the matrix multiplication into independent subtasks by tiling the M and N dimensions. After tiling, there are (M / M tile size) x (N / N tile size) independent subtasks, where each subtask is a matrix multiplication of size (M tile size) x (N tile size) x K. Each subtask is assigned to a thread block. Inside each thread block, the K dimension is further tiled into K / (K tile size) tiles, and the thread block applies steps ② and ③ to each K tile. In step ②, threads in the thread block load fragments of matrices A and B from global memory to shared memory collectively (i.e., different threads load different parts of the fragments). All threads in the thread block are synchronized to make sure the data loading is finished before proceeding to the next step. In step ③, the 4 warps in the thread block work on 4 x 4 = 16 matrix multiply accumulates (MMAs), each of which is an operation C_{16x16} = A_{16x8} B_{8x16} + C_{16x16}. Each warp conducts 4 MMAs using Tensor Cores [13] over 4 sequential iterations. Once the results have been accumulated for each K tile, we ④ store the results from the accumulator registers to global memory. Figure 3 gives the pseudo-code of the 4 steps.
There are two ways to implement the kernel: (1) directly write the CUDA C code as in kernel libraries [12, 26, 32], or (2) use declarative loop-oriented scheduling primitives. In the next subsection, we give a brief introduction to the second method.
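To illustrate option (1), the kernel below hand-writes a tiled matrix multiplication in CUDA C. It is a simplified sketch of our own, not the kernel used by cuBLAS, CUTLASS, or Hidet: it follows the four-step structure of Figure 2 but uses ordinary scalar fused multiply-adds instead of Tensor Core MMA instructions, and it assumes 32x32 thread-block tiles and matrix sizes divisible by the tile size.

#define TILE 32  // thread-block tile size along M, N, and K (an assumption of this sketch)

// C = A * B for n x n row-major matrices; one thread computes one element of C.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float smemA[TILE][TILE];          // staging buffers in shared memory
    __shared__ float smemB[TILE][TILE];

    // Step 1: determine which C element this thread (and which C tile this block) owns.
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                            // accumulator kept in a register

    for (int k0 = 0; k0 < n; k0 += TILE) {       // iterate over the K tiles
        // Step 2: cooperatively load one tile of A and one tile of B into shared memory.
        smemA[threadIdx.y][threadIdx.x] = A[row * n + (k0 + threadIdx.x)];
        smemB[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * n + col];
        __syncthreads();                         // all loads must finish before they are used

        // Step 3: multiply-accumulate over this K tile (scalar FMAs, not Tensor Cores).
        for (int k = 0; k < TILE; ++k)
            acc += smemA[threadIdx.y][k] * smemB[k][threadIdx.x];
        __syncthreads();                         // the buffers are overwritten next iteration
    }

    // Step 4: write the accumulated result from the register back to global memory.
    C[row * n + col] = acc;
}

// Launch (n divisible by TILE assumed):
//   dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
//   sgemm_tiled<<<grid, block>>>(A, B, C, n);

Even this simplified version has to coordinate cooperative loads, barriers, and register accumulation by hand, which is the burden that the declarative scheduling approach introduced next, and the task-mapping paradigm proposed in this paper, aim to reduce.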
2.3 Declarative Loop-Oriented Scheduling
To simplify tensor program optimization, Halide [43] proposes a programming paradigm for tensor programs in which the computation definition and the scheduling of the computation are decoupled.
Figure 2: Efficient Matrix Multiplication on the CUDA Platform. (The 1024x1024x1024 multiplication is decomposed into independent sub-tasks run by thread blocks with M/N tile size 64 and K tile size 8; inside a block, 4 warps each perform Tensor Core MMAs of shape 16x16x8 over 4 iterations on A and B fragments cooperatively loaded from global into shared memory.)
def matmul(A: fp32[M, K], B: fp32[K, N], C: fp32[M, N]):
    SmemA, SmemB = shared fp32[64, 8], fp32[8, 64]
    RegsC = local fp32[...]
    ...                        # Step ①: calculate the sub-task offset
    for k0 in range(128):      # iterate over each K tile
        # Step ②: load A and B fragments into SmemA and SmemB
        SmemA, SmemB = cooperative_load(A, B, k0)
        sync_threads()
        # Step ③: block MMA (RegsC = SmemA * SmemB + RegsC)
        RegsC = block_mma(SmemA, SmemB, RegsC)
        sync_threads()
    ...                        # Step ④: write back C (RegsC => C)

Figure 3: Pseudo-code of Matrix Multiplication.
This programming paradigm is adopted by the state-of-the-art DNN compiler TVM [9] and schedulers (e.g., AutoTVM [11] and Ansor [65]). Since this paradigm offers a set of declarative scheduling primitives to manipulate the loop structure of tensor programs, we name it declarative loop-oriented scheduling.
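For intuition (this is not TVM syntax, just a picture of the effect of the primitives on a trivial kernel), splitting a 1024-iteration loop into an outer factor of 4 and an inner factor of 256 and then binding the inner loop to threadIdx.x yields a loop nest like the following:

// Before scheduling, conceptually: for (int i = 0; i < 1024; i++) c[i] = a[i] + b[i];
// After split (i -> i_outer * 256 + i_inner) and bind (i_inner -> threadIdx.x):
__global__ void add_scheduled(const float* a, const float* b, float* c) {
    for (int i_outer = 0; i_outer < 4; ++i_outer) {   // remaining serial loop per thread
        int i = i_outer * 256 + threadIdx.x;          // i_inner is bound to the thread index
        c[i] = a[i] + b[i];                           // arrays of length 1024 assumed
    }
}
// Launched with a single block of 256 threads: add_scheduled<<<1, 256>>>(a, b, c);

Loop-oriented primitives let developers request such transformations declaratively instead of rewriting the loop nest by hand.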
Figure 4 shows the workflow of loop-oriented scheduling. Developers first provide a mathematical computation of the operator that defines how each element of the tensor is computed. The example gives the definition of matrix multiplication, where the (i, j)-th element of the output is a sum reduction.
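Written out (this is simply the standard matrix-multiplication formula, restated here rather than copied from Figure 4):

\[
C[i, j] = \sum_{k=0}^{K-1} A[i, k] \cdot B[k, j], \qquad 0 \le i < M,\; 0 \le j < N.
\]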
denition, the schedulers rst
1
generate a default tensor program
3