Hidet: Task-Mapping Programming Paradigm for Deep Learning
Tensor Programs
Yaoyao Ding
University of Toronto
Toronto, Canada
yaoyao@cs.toronto.edu
Cody Hao Yu
Amazon Web Services
Santa Clara, USA
hyuz@amazon.com
Bojian Zheng
University of Toronto
Toronto, Canada
bojian@cs.toronto.edu
Yizhi Liu
Amazon Web Services
Santa Clara, USA
yizhiliu@amazon.com
Yida Wang
Amazon Web Services
Santa Clara, USA
wangyida@amazon.com
Gennady Pekhimenko
University of Toronto
Toronto, Canada
pekhimenko@cs.toronto.edu
ABSTRACT
As deep learning models nowadays are widely adopted by both cloud services and edge devices, reducing the latency of deep learning model inference becomes crucial to provide efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the rapidly growing number of operators.

Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations (e.g., double buffering). In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering directly in the tensor programs. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity (e.g., allowing program-statement-level optimizations). We call the proposed method the task-mapping programming paradigm. In addition, we propose a new post-scheduling fusion optimization that allows developers to focus on scheduling every single operator and automates the fusion after scheduling. It greatly reduces the engineering efforts for operator fusion. Our proposed paradigm also constructs an efficient hardware-centric schedule space, which is agnostic to the program input size and greatly reduces the tuning time.
With the proposed paradigm, we implement a deep learning compiler, Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms the state-of-the-art DNN inference framework ONNX Runtime and the compiler TVM equipped with the schedulers AutoTVM and Ansor by up to 1.48x (1.22x on average). It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively. We have open-sourced Hidet at https://www.github.com/hidet-org/hidet.

Part of the work was done while interning at Amazon.
Also with the Vector Institute.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ASPLOS '23, March 25-29, 2023, Vancouver, BC, Canada
(c) 2023 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9916-6/23/03.
https://doi.org/10.1145/3575693.3575702
CCS CONCEPTS
• Computing methodologies → Parallel programming languages; Machine learning; Artificial intelligence.
KEYWORDS
deep learning systems, systems for machine learning, programming
models, compilation, tensor computation
ACM Reference Format:
Yaoyao Ding, Cody Hao Yu, Bojian Zheng, Yizhi Liu, Yida Wang, and Gennady Pekhimenko. 2023. Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '23), March 25-29, 2023, Vancouver, BC, Canada. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3575693.3575702
1 INTRODUCTION
Deep neural networks (DNNs) [35] have achieved state-of-the-art (SOTA) results in various tasks such as image recognition [25, 34, 49, 50], natural language translation [16, 36, 48], and autonomous driving [14]. In deployment environments, these models are repeatedly executed to serve continuous user requests, a workflow known as model serving. Thus, it is crucial to reduce the latency and maximize the throughput of model execution to ensure safety, save energy, and improve the user experience.
There are two major ways to execute a DNN model. (1) Deep learning (DL) frameworks such as TensorFlow [1], PyTorch [41], and ONNX Runtime [15] dispatch operators to kernel libraries such as cuDNN [12], cuBLAS [26], and CUTLASS [32] during execution. (2) DL compilers such as TensorFlow XLA [44] and TVM [9] automatically generate kernels for the given operators through a compilation process. Schedulers such as Ansor [65] and AutoTVM [11] are used to schedule the kernels during compilation to achieve high performance.

Kernel libraries (e.g., cuDNN [12] and cuBLAS [26]) provide a collection of highly optimized hand-crafted kernels (e.g., convolutions and matrix multiplications). These libraries typically achieve
near-peak performance on widely used input sizes, as they are able to implement a large spectrum of optimizations in low-level languages (e.g., CUDA C/C++ and assembly code). However, manually tweaking a kernel for performance is laborious, error-prone, and requires expertise in writing low-level code, so it is difficult to generalize to other input shapes, new operators, and kernel fusion patterns. In addition, template-based libraries such as CUTLASS [32] employ C++ templates to generate tensor programs for different input shapes on the fly. Although template-based libraries can achieve competitive performance on many input shapes by dynamically tuning the optimization hyper-parameters, they do not reduce the complexity of writing tensor programs for new operators and provide only limited fusion capability (e.g., only a small number of predefined operators can be fused with matrix multiplication).
Alternatively, DL compilers [4, 9, 43, 44, 67] have been proposed to compile deep learning networks into tensor programs automatically. Existing state-of-the-art DL compilers adopt the idea of decoupling the computation definition from the schedule, originally proposed by Halide [43] and TVM [9]. The computation definition of an operator only defines how each element of the output tensor is computed mathematically, while the schedule defines how the execution is performed, such as the loop order and thread binding [9, 43]. Compilers leverage schedulers like AutoTVM [11] and Ansor [65] to tune the hyper-parameters of the schedule to optimize operator performance for each input shape. Unlike kernel libraries and templates that target a fixed set of operators and limited fusion patterns, compilers can support more operators and more flexible fusion patterns automatically.
However, existing state-of-the-art compilers are mostly based on loop-oriented scheduling primitives, which manipulate the loop structure of a tensor program in a declarative manner (e.g., loop split and reorder). Although loop-oriented scheduling primitives have achieved great success in simplifying tensor program writing [9, 11, 65], certain key optimizations (e.g., double buffering [32]) are hard to implement. Specifically, loop-oriented scheduling primitives cannot express the fine-grained tensor program transformations required by the key optimizations discussed in Section 3.1. Besides, loop-oriented scheduling also suffers from long kernel tuning time due to the rarity of efficient schedules in the tremendous tuning spaces. For instance, AutoTVM [11] takes 15 hours to tune a single CNN model, Inception V3 [50], on a modern GPU.
In this work, we propose a new paradigm for writing efficient tensor programs: the task-mapping-oriented programming paradigm. In this paradigm, we define the parallelizable computations in an operator as tasks, and the process of assigning and ordering the tasks to parallel processing units (e.g., threads) as scheduling. Developers can directly define the scheduling in the tensor program through task mappings.¹ This paradigm simplifies the development of tensor programs without sacrificing the ability to express optimizations requiring fine-grained program manipulation. With the in-program style of scheduling, this paradigm also allows us to search the tensor program in an efficient hardware-centric schedule space that is agnostic to the input size, dramatically reducing the tuning time. We also propose post-scheduling fusion to fuse the scheduled operator with surrounding operators automatically, so developers do not need to worry about fusion when writing schedule templates.

We implement a new DL compiler called Hidet based on the proposed ideas. In this work, we mainly focus on optimizing DNN inference on GPUs, as the GPU is the most commonly used DNN accelerator; the proposed ideas also apply to other accelerators such as CPUs and TPUs [31]. Extensive experiments on modern convolutional and transformer models show that Hidet outperforms state-of-the-art DL inference frameworks and schedulers, AutoTVM [11] and Ansor [65], by up to 1.48x (1.22x on average) while reducing the tuning time of the two schedulers by 20x and 11x, respectively.

¹ The name task mapping comes from the abstraction in which a scheduling process can be considered as one that maps tasks to processing units in both spatial and temporal dimensions.
We summarize our contributions as follows:
• We identify and present the limited expressiveness of loop-oriented scheduling adopted by state-of-the-art DL compilers as their fundamental limitation in efficiently compiling complex tensor programs (e.g., matrix multiplication).
• We introduce the task-mapping-oriented programming paradigm to simplify tensor program development without sacrificing the expressiveness of optimizations compared with hand-crafted implementations. Based on this paradigm, we propose post-scheduling fusion to fuse the scheduled program with surrounding operators. The paradigm also allows us to search in the hardware-centric schedule space to reduce the tuning time significantly.
• We implement a new DL compiler, named Hidet, based on the proposed ideas. Extensive experiments show that Hidet outperforms state-of-the-art DL frameworks and compilers by up to 1.48x and reduces the tuning time by 11x. We have open-sourced Hidet at https://www.github.com/hidet-org/hidet.
2 BACKGROUND
2.1 CUDA Programming Model
The CUDA programming platform [40] is widely used by deep learning systems on NVIDIA GPUs. In this section, we briefly introduce the CUDA programming model on modern GPUs.
Figure 1: An overview of the CUDA programming model. (A launched kernel dispatches a grid of thread blocks onto the GPU's streaming multiprocessors (SMs); each thread block consists of warps of 32 threads and consumes per-SM shared memory and registers.)
Kernel, Thread Block, and Thread. When running a workload on the GPU, thousands of threads are executed. Each thread executes the same piece of code, called the kernel code. When launching a kernel, a grid of thread blocks is dispatched onto the GPU, as shown in Figure 1. Each grid usually comprises tens to thousands of thread blocks, while each thread block comprises tens to hundreds of threads. In the kernel code, the pre-defined variables threadIdx and blockIdx, together with the suffixes x, y, and z, are used to access the 3-dimensional index of a thread within its thread block and of a thread block within the grid of blocks, respectively.
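To make the indexing concrete, the following minimal CUDA kernel (a sketch of our own, not taken from the paper; the kernel name and sizes are arbitrary) uses these built-in variables to compute the global index of each thread:

// Each thread handles one element of an element-wise addition. Its global
// index is derived from the built-in blockIdx, blockDim, and threadIdx variables.
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index (x dimension)
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch with a 1-D grid of 1-D blocks, e.g.: add_kernel<<<(n + 255) / 256, 256>>>(a, b, c, n);

Here only the x dimension is used; kernels that tile 2-D or 3-D data additionally use the y and z components in the same way.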
Hardware Implementation. Each modern GPU has tens to hundreds of streaming multiprocessors (SMs). Each SM supports scheduling up to thousands of concurrent threads [40]. Threads in a thread block are partitioned into warps, and each warp contains 32 consecutive threads executing the same instructions. There are two kinds of programmable on-chip memory: shared memory and registers. Registers are privately allocated to each thread, while shared memory is allocated to each thread block, and only threads in that thread block can access it. When launching a kernel, the thread blocks are dispatched to the SMs wave by wave [23]. Each thread block is dispatched to a single SM, while each SM may hold multiple thread blocks. The maximum number of resident thread blocks per SM is limited by the size of the shared memory, the register file, and the warp scheduling units.
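The kernel fragment below is a hypothetical sketch (not from the paper) that makes these resources visible in code: it declares a per-block shared-memory buffer, derives the warp and lane indices of each thread, and uses a block-wide barrier before reading data written by other threads.

// Assumes the kernel is launched with 256 threads per block, e.g. <<<grid, 256>>>.
__global__ void reverse_block(const float* in, float* out) {
    __shared__ float tile[256];        // shared memory: one buffer per thread block

    int tid  = threadIdx.x;            // index of this thread within its block
    int warp = tid / 32;               // warp index within the block (32 threads per warp)
    int lane = tid % 32;               // lane index within the warp
    (void)warp; (void)lane;            // computed only to illustrate the warp structure

    int base = blockIdx.x * blockDim.x;
    tile[tid] = in[base + tid];        // each thread stages one element (tid, base live in registers)
    __syncthreads();                   // barrier: all loads by the block must finish first

    // Reading another thread's element is possible only because shared memory
    // is visible to the whole thread block.
    out[base + tid] = tile[blockDim.x - 1 - tid];
}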
Operators in a deep neural network are implemented as GPU kernels. When running a neural network, we launch these kernels in an order that satisfies the operator dependencies. Among these operators, matrix multiplication (also known as a linear or dense layer) is one of the most important. We next present an efficient implementation of matrix multiplication using CUDA and take it as an example throughout the paper.
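As a small, self-contained illustration of this execution model (our own sketch; the two "operators" here are toy kernels, not ones used by Hidet), two dependent kernels launched on the same CUDA stream run in launch order, which is how the operator dependency is respected:

#include <cuda_runtime.h>

__global__ void scale_kernel(float* x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                        // producer: scales the tensor in place
}

__global__ void relu_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);         // consumer: element-wise ReLU
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    // Kernels enqueued on the same (default) stream execute in launch order,
    // so relu_kernel reads what scale_kernel produced.
    scale_kernel<<<(n + 255) / 256, 256>>>(x, -2.0f, n);
    relu_kernel<<<(n + 255) / 256, 256>>>(x, n);

    cudaDeviceSynchronize();                     // wait for both kernels to complete
    cudaFree(x);
    return 0;
}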
2.2 Efficient Matrix Multiplication
This section illustrates an efficient implementation of matrix multiplication C = AB (all matrices are 1024 x 1024) on modern NVIDIA GPUs via Tensor Cores [13]. Figure 2 shows the desired workflow. In step ①, we decompose the matrix multiplication into independent subtasks by tiling the M and N dimensions. After tiling, there are (M / M tile size) x (N / N tile size) independent subtasks, where each subtask is a matrix multiplication of size (M tile size) x (N tile size) x K. Each subtask is assigned to a thread block. Inside each thread block, the K dimension is further tiled into K / (K tile size) tiles, and the thread block applies steps ② and ③ to each K tile. In step ②, threads in the thread block load fragments of matrices A and B from global memory to shared memory collectively (i.e., different threads load different parts of the fragments). All threads in the thread block are synchronized to make sure the data loading is finished before proceeding to the next step. In step ③, the 4 warps in the thread block work on 4 x 4 = 16 matrix multiply accumulates (MMAs), each of which is an operation C_{16x16} = A_{16x8} B_{8x16} + C_{16x16}. Each warp conducts 4 MMAs using Tensor Cores [13] over 4 sequential iterations. Once the results have been accumulated for each K tile, we ④ store the results from the accumulator registers to global memory. Figure 3 gives the pseudo-code of the 4 steps.
There are two ways to implement the kernel: (1) directly write the CUDA C code as in kernel libraries [12, 26, 32], or (2) use declarative loop-oriented scheduling primitives. In the next subsection, we give a brief introduction to the second method.
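To illustrate option (1), the kernel below hand-writes a tiled matrix multiplication in CUDA C. It is a simplified sketch of our own, not the kernel used by cuBLAS, CUTLASS, or Hidet: it follows the four-step structure of Figure 2 but uses ordinary scalar fused multiply-adds instead of Tensor Core MMA instructions, and it assumes 32x32 thread-block tiles and matrix sizes divisible by the tile size.

#define TILE 32  // thread-block tile size along M, N, and K (an assumption of this sketch)

// C = A * B for n x n row-major matrices; one thread computes one element of C.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float smemA[TILE][TILE];          // staging buffers in shared memory
    __shared__ float smemB[TILE][TILE];

    // Step 1: determine which C element this thread (and which C tile this block) owns.
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                            // accumulator kept in a register

    for (int k0 = 0; k0 < n; k0 += TILE) {       // iterate over the K tiles
        // Step 2: cooperatively load one tile of A and one tile of B into shared memory.
        smemA[threadIdx.y][threadIdx.x] = A[row * n + (k0 + threadIdx.x)];
        smemB[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * n + col];
        __syncthreads();                         // all loads must finish before they are used

        // Step 3: multiply-accumulate over this K tile (scalar FMAs, not Tensor Cores).
        for (int k = 0; k < TILE; ++k)
            acc += smemA[threadIdx.y][k] * smemB[k][threadIdx.x];
        __syncthreads();                         // the buffers are overwritten next iteration
    }

    // Step 4: write the accumulated result from the register back to global memory.
    C[row * n + col] = acc;
}

// Launch (n divisible by TILE assumed):
//   dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
//   sgemm_tiled<<<grid, block>>>(A, B, C, n);

Even this simplified version has to coordinate cooperative loads, barriers, and register accumulation by hand, which is the burden that the declarative scheduling approach introduced next, and the task-mapping paradigm proposed in this paper, aim to reduce.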
2.3 Declarative Loop-Oriented Scheduling
To simplify tensor program optimization, Halide [43] proposes a programming paradigm for tensor programs in which the computation definition and the scheduling of the computation are decoupled.
Figure 2: Efficient Matrix Multiplication on the CUDA Platform. (The 1024x1024x1024 multiplication is decomposed into independent sub-tasks run by thread blocks with M/N tile size 64 and K tile size 8; inside a block, 4 warps each perform Tensor Core MMAs of shape 16x16x8 over 4 iterations on A and B fragments cooperatively loaded from global into shared memory.)
def matmul(A: fp32[M, K], B: fp32[K, N], C: fp32[M, N]):
    SmemA, SmemB = shared fp32[64, 8], fp32[8, 64]
    RegsC = local fp32[...]
    ...                        # Step ①: calculate the sub-task offset
    for k0 in range(128):      # iterate over each K tile
        # Step ②: load A and B fragments into SmemA and SmemB
        SmemA, SmemB = cooperative_load(A, B, k0)
        sync_threads()
        # Step ③: block MMA (RegsC = SmemA * SmemB + RegsC)
        RegsC = block_mma(SmemA, SmemB, RegsC)
        sync_threads()
    ...                        # Step ④: write back C (RegsC => C)

Figure 3: Pseudo-code of Matrix Multiplication.
This programming paradigm is adopted by the state-of-the-art DNN compiler TVM [9] and schedulers (e.g., AutoTVM [11] and Ansor [65]). Since this paradigm offers a set of declarative scheduling primitives to manipulate the loop structure of tensor programs, we name it declarative loop-oriented scheduling.
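For intuition (this is not TVM syntax, just a picture of the effect of the primitives on a trivial kernel), splitting a 1024-iteration loop into an outer factor of 4 and an inner factor of 256 and then binding the inner loop to threadIdx.x yields a loop nest like the following:

// Before scheduling, conceptually: for (int i = 0; i < 1024; i++) c[i] = a[i] + b[i];
// After split (i -> i_outer * 256 + i_inner) and bind (i_inner -> threadIdx.x):
__global__ void add_scheduled(const float* a, const float* b, float* c) {
    for (int i_outer = 0; i_outer < 4; ++i_outer) {   // remaining serial loop per thread
        int i = i_outer * 256 + threadIdx.x;          // i_inner is bound to the thread index
        c[i] = a[i] + b[i];                           // arrays of length 1024 assumed
    }
}
// Launched with a single block of 256 threads: add_scheduled<<<1, 256>>>(a, b, c);

Loop-oriented primitives let developers request such transformations declaratively instead of rewriting the loop nest by hand.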
Figure 4 shows the workflow of loop-oriented scheduling. Developers first provide a mathematical computation of the operator that defines how each element of the tensor is computed. The example gives the definition of matrix multiplication, where the (i, j)-th element of the output is a sum reduction.
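Written out (this is simply the standard matrix-multiplication formula, restated here rather than copied from Figure 4):

\[
C[i, j] = \sum_{k=0}^{K-1} A[i, k] \cdot B[k, j], \qquad 0 \le i < M,\; 0 \le j < N.
\]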
denition, the schedulers rst
1
generate a default tensor program
3