Demystifying Map Space Exploration for NPUs
Sheng-Chun Kao¹, Angshuman Parashar², Po-An Tsai², and Tushar Krishna¹
¹Georgia Institute of Technology
²Nvidia
¹skao6@gatech.edu, tushar@ece.gatech.edu
²{aparashar, poant}@nvidia.com
Abstract
Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable. However, while there are dozens of mappers out there (all empirically claiming to find better mappings than others), the research community lacks systematic insights on how different search techniques navigate the map space and how different mapping axes contribute to the accelerator's performance and efficiency. Such insights are crucial to developing mapping frameworks for emerging DNNs that are increasingly irregular (due to neural architecture search) and sparse, making the corresponding map spaces much more complex. In this work, rather than proposing yet another mapper, we do a first-of-its-kind apples-to-apples comparison of the search techniques leveraged by different mappers. Next, we extract the learnings from our study and propose two new techniques, warm-start and sparsity-aware search, that can augment existing mappers and demonstrate speedups, scalability, and robustness across diverse DNN models¹.
1. Introduction
Deep Neural Networks (DNNs) have become an indispensable tool in the solution toolbox for a variety of complex problems such as object detection, machine translation, language understanding, autonomous driving, and so on. There is growing demand for specialized DNN accelerators (also called Neural Processing Units or NPUs)² that pursue high performance with high energy, power, and area efficiency. The performance and energy efficiency of an NPU depend on how a DNN is mapped over the accelerator's hardware (compute and memory) resources [35,44]. Specifically, a mapping (aka schedule) includes the computation order, parallelization strategy, and tile sizes [35,44], as shown in Fig. 1. In order to achieve high efficiency across a wide range of DNNs that include diverse layer shapes and sizes, state-of-the-art DNN accelerators are often designed with flexibility to support different mapping strategies [9,36,48]. This flexibility imposes a unique challenge for deployment: finding a high-quality mapping between a DNN and the flexible accelerator from the space of all legal mappings (i.e., the map space) at compile time. This is crucial to unlock the full potential of the DNN accelerator.

1. Code available at https://github.com/maestro-project/gamma-timeloop.
2. In this paper, we use the terms DNN Accelerator and NPU interchangeably.
As a result, prior work has clearly defined map space exploration (MSE) [19,23,28,44] as a critical problem for NPU design and/or deployment, cleanly separating it from the hardware architecture design space exploration (DSE) problem. DSE includes identifying the right compute and memory configurations for the NPU within constraints such as total FLOPS, area, and power. MSE, meanwhile, takes the hardware configuration and DNN workload as input and finds optimized mappings for some objective (e.g., latency or energy efficiency). To perform MSE, various search algorithms (i.e., mappers) have been proposed within the past few years [2,3,7,12–15,23,25,41,44,49,50,54,55,57–60,63,64,66,67,70,73,75,76,79].
Despite the success achieved by these prior efforts, MSE remains a computationally challenging problem. This is because the space of legal mappings for even a single layer of a modern DNN (e.g., ResNet-50) on a typical edge-class accelerator [9] is O(10^24) [19,28], which would require more time than the age of the Earth to search exhaustively (assuming 1 ms to evaluate each mapping sample). This gets exacerbated as newer and ever-larger DNN models are being created with increasing frequency, especially thanks to the success of neural architecture search techniques [4,5,39,47,61]. Furthermore, the advent of compressed-sparse DNNs [16,38,40,51,68,69,80], whose mappings are not performance-portable across sparsity levels (a key finding in this paper), further increases the MSE burden.
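As a back-of-the-envelope check of this claim, using the 1 ms-per-sample assumption above and the commonly cited ~4.5-billion-year age of the Earth:

    10^24 mappings × 10^-3 s/mapping = 10^21 s ≈ 3.2 × 10^13 years ≫ 4.5 × 10^9 years (age of the Earth)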
Research into more sophisticated, scalable, and sparsity-aware MSE techniques is at least partially hampered by the fact that, even though prior approaches have empirically shown that their techniques work, none of them demonstrates why it works or offers insight into the optimization techniques behind it.
It is these very insights that we wish to extract in this
paper, and in the process demystify MSE as a problem.
We cover both heuristics and learning-based optimization
approaches, analyze their behavior, and learn from their best
traits. We then use these learnings to scale MSE to more
complex workloads.
[Fig. 1: The overview of DNN Workload, Accelerator, and a (NVDLA-like [1]) Mapping. The figure depicts the workload (CONV2D tensor dimensions B, K, C, Y, X, R, S and GEMM dimensions M, N, K), the accelerator (DRAM, Global Buffer (L2), and a PE array with private L1 buffers and ALUs), the mapping axes (tile sizing, ordering/tile scheduling, and parallelism/spatial partitioning), and the loop-nest representation of a mapping with DRAM, L2, and L1 tiling levels, each with its own tile sizes, loop order, and parallelization.]
Specifically, our contributions are two-fold.
(1) This is the first work, to the best of our knowledge, to quantitatively compare three broad categories of mappers: random-based [44] (i.e., heuristic pruning), feedback-based [28] (i.e., black-box optimization and reinforcement learning), and gradient-based [19] (i.e., surrogate models), and to analyze their trade-offs. We conduct a sensitivity analysis of the different mapping axes to understand the contribution of each axis. We then perform case studies that reveal distinguishing characteristics of good and bad mappings. Our analysis reveals that: (i) random search is inefficient, (ii) gradient-based search converges fast but requires prior knowledge of the accelerator architecture, and (iii) feedback-based search is more adaptable and sample-efficient, but requires a higher cost to acquire each sample. Our analysis also shows that the optimality of a dense DNN mapping does not port over to a sparse DNN.
(2) Based on our findings, we propose two novel heuristic techniques to advance the state-of-the-art in MSE: (i) We propose a warm-start technique that initializes MSE with prior optimal solutions from previous layers, stored in a replay buffer and selected by a similarity metric, enabling the mapper to start at a better point and converge faster. In our evaluations, we find that warm-start helps the mapper converge to a similar performance point 3.3x-7.3x faster. (ii) We also propose a sparsity-aware technique to search for a mapping that performs well across a range of target activation sparsities. A fixed mapping found by our sparsity-aware approach achieves 99.7% of the performance of the mappings specifically tailored to the various density levels.
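The warm-start technique itself is detailed later in the paper. Purely as an illustration of the idea of seeding a search from a similar, previously solved layer, the following is a minimal, hypothetical sketch of a replay buffer keyed by a layer-similarity metric; all names (LayerShape, similarity, ReplayBuffer) and the metric are illustrative, not the paper's actual implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LayerShape:
    """Tensor dimensions of a CONV2D layer (see Fig. 1)."""
    B: int
    K: int
    C: int
    Y: int
    X: int
    R: int
    S: int

def similarity(a: LayerShape, b: LayerShape) -> float:
    """Toy similarity metric: inverse of the summed log-ratio of dimensions."""
    dims = ("B", "K", "C", "Y", "X", "R", "S")
    dist = sum(abs(math.log(getattr(a, d) / getattr(b, d))) for d in dims)
    return 1.0 / (1.0 + dist)

@dataclass
class ReplayBuffer:
    """Stores the best mapping found for each previously searched layer."""
    entries: list = field(default_factory=list)  # (LayerShape, mapping) pairs

    def record(self, shape: LayerShape, best_mapping) -> None:
        self.entries.append((shape, best_mapping))

    def warm_start(self, new_shape: LayerShape):
        """Return the stored mapping of the most similar prior layer, if any."""
        if not self.entries:
            return None
        return max(self.entries, key=lambda e: similarity(e[0], new_shape))[1]
```

A mapper would use the returned mapping to seed its starting point (or initial population) instead of a random initialization.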
2. Background: DNN Accelerators
2.1. DNN Workloads
In this work, we use individual DNN layers/operators
as our target workload. The workloads vary across different
DNN models because of different types of operations such as
CONV2D, Depth-wise CONV, Point-wise CONV, Attention,
Fully-Connected (FC), and so on, and different tensor shapes
for the layers (i.e., batch, input, weight kernel sizes), as
shown in Fig. 1. All these operations can be represented with
a loop-nest of computations. For example, a CONV2D can
be represented as 7 for-loops, and GEMM can be represented
as 3 for-loops.
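To make the loop-nest view concrete, the sketch below writes a plain (untiled, unmapped) CONV2D as the seven for-loops referenced above, using the dimension names from Fig. 1 and assuming unit stride and no padding for simplicity.

```python
import numpy as np

def conv2d_loops(inp, wgt):
    """Naive CONV2D as 7 for-loops. inp: [B, C, Y, X], wgt: [K, C, R, S]."""
    B, C, Y, X = inp.shape
    K, _, R, S = wgt.shape
    Yo, Xo = Y - R + 1, X - S + 1              # output height/width (stride 1, no padding)
    out = np.zeros((B, K, Yo, Xo))
    for b in range(B):                         # batch
        for k in range(K):                     # output channels
            for c in range(C):                 # input channels
                for y in range(Yo):            # output rows
                    for x in range(Xo):        # output columns
                        for r in range(R):     # filter rows
                            for s in range(S): # filter columns
                                out[b, k, y, x] += inp[b, c, y + r, x + s] * wgt[k, c, r, s]
    return out
```

A GEMM follows the same pattern with only three loops (M, N, K).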
2.2. Accelerator Hardware Configuration
A canonical NPU often houses a spatial array of Process-
ing Elements (PEs), as shown in Fig. 1. Each PE has one to
several ALU units to compute partial sums, and private local
(aka “L1”) buffers to store weights, input activations and
partial sums. The accelerator also houses a global shared
(aka “L2”) buffer to prefetch activations and weights from
DRAM for the next tile of computation that will be mapped
over the PEs and L1 buffers. Networks-on-Chip are used
to distribute operands from the global L2 buffer to the L1
buffers in the PEs, collect the partial or full outputs, and
write them back to the L2 buffer.
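For reference, the hardware parameters that MSE later takes as input (§3) can be summarized compactly. The following is a minimal sketch with illustrative field names and example values that are not taken from any specific NPU.

```python
from dataclasses import dataclass

@dataclass
class NPUConfig:
    """Hardware parameters an MSE tool typically treats as fixed inputs."""
    num_pes: int                 # number of Processing Elements in the spatial array
    alus_per_pe: int             # ALU/MAC units inside each PE
    l1_bytes: int                # private (per-PE) buffer capacity
    l2_bytes: int                # global shared buffer capacity
    noc_bw_bytes_per_cyc: int    # NoC bandwidth from L2 to the PE array
    dram_bw_bytes_per_cyc: int   # off-chip bandwidth

# Example with placeholder values only (not a real accelerator spec):
edge_npu = NPUConfig(num_pes=256, alus_per_pe=1, l1_bytes=512,
                     l2_bytes=128 * 1024, noc_bw_bytes_per_cyc=32,
                     dram_bw_bytes_per_cyc=16)
```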
2.3. Accelerator Map-Space
Given a DNN workload, there exist several choices for mapping it onto the accelerator's PEs and buffer hierarchy over space and time. The mapping includes the following components [34,44], shown in Fig. 1:
(1) Tile sizes: the ability to change the bounds and aspect ratios of data tiles from one or more operand tensors per level of the buffer hierarchy [46].
(2) Loop order: the ability to change the loop orders iterated per tiling level.
(3) Loop parallelization: the ability to change which tensor dimensions are parallelized per tiling level. This represents the spatial partitioning of data (i.e., across PEs).
Fig. 1 shows an example of the mapping used by the NVDLA [1] accelerator. Choices for (2) and (3) together are often referred to as dataflow [34], which has been informally classified by prior work into weight-stationary, output-stationary, and input-stationary [8]. The design space of all possible mappings (i.e., dataflows + tile sizes) that an accelerator can support is called its Map-Space [44].

[Fig. 2: A canonical Map Space Exploration framework. Inputs: the objective (e.g., EDP, latency), the workload (e.g., ResNet) and its features (e.g., sparsity), and the accelerator configuration (memory: buffer levels, buffer type, sizes, banks, widths, read/write bandwidth; compute: number of PEs, number of ALUs, bit-width). Components: the map-space representation, the exploration method (mapper), and the evaluation cost model (e.g., Timeloop, MAESTRO). Mapper categories: random-based (e.g., Random-Pruned; the fastest, with a lightweight exploration algorithm, but poor sampling efficiency), gradient-based (e.g., Mind Mappings; fast updates by gradient that avoid interaction with the cost model, but the surrogate model is tied to one or a few accelerator configurations), feedback-based black-box/RL (e.g., Gamma; high sampling efficiency, but slow due to a heavy exploration algorithm and frequent interaction with the cost model), and other methods (e.g., CoSA; polyhedral, MIP, MCMC).]
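Putting the three axes together, a single mapping for one layer can be captured per tiling level (DRAM, L2, L1 in Fig. 1) by its tile sizes, loop order, and parallelized dimensions. The sketch below is one minimal way to represent this; the field names are illustrative and are not Timeloop's actual syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

DIMS = ("B", "K", "C", "Y", "X", "R", "S")  # CONV2D dimensions from Fig. 1

@dataclass
class LevelMapping:
    """Mapping choices at one tiling level (e.g., DRAM, L2, or L1)."""
    tile: Dict[str, int]      # tile size for each dimension at this level
    order: Tuple[str, ...]    # loop order, outermost to innermost
    parallel: List[str]       # dimensions spatially partitioned across PEs

@dataclass
class Mapping:
    levels: Dict[str, LevelMapping]  # keyed by "DRAM", "L2", "L1"

# Example loosely in the spirit of the NVDLA-like mapping of Fig. 1:
# parallelize K and C across the PE array at the L2 level.
example = Mapping(levels={
    "L2": LevelMapping(
        tile={"B": 1, "K": 16, "C": 64, "Y": 4, "X": 4, "R": 3, "S": 3},
        order=("B", "K", "C", "R", "S", "Y", "X"),
        parallel=["K", "C"]),
})
```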
Flexible DNN accelerators [9,36] allow a mapping optimizer within a compiler to explore tile sizes, loop orders, and parallelization independently for each layer. This mapping flexibility is crucial for accelerators to adapt to the growing diversity in DNNs [34]. The overall runtime and energy efficiency of an accelerator depend on both the hardware configuration and the mapping, making it crucial to find an optimized mapping³ [34,44,75], as we discuss next.
3. Map Space Exploration (MSE)
A canonical MSE framework is shown in Fig. 2. MSE takes the NPU's HW configuration (§2.2) and target DNN workloads (size, shape, and additional features such as the sparsity levels of weights and/or activations) as input and finds optimized mappings for a given objective (e.g., latency, throughput, energy, energy-delay product (EDP), and so on). MSE may be run at compile time within a mapping optimizer [6] after the NPU is deployed, or at design time in conjunction with DSE for co-optimizing the mapping and HW configuration [31,73].

The MSE process often includes three parts: the representation of the search space, the evaluation method, and the exploration method. The representation defines the scope of the search problem and the size of the search space. An optimization loop that includes exploration and evaluation performs the actual search. The optimization continues until the MSE converges, or until it reaches a given sampling budget or wall-clock run-time budget.
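This explore-evaluate loop can be written down in a few lines. The sketch below is generic pseudo-Python in which `mapper.propose`, `cost_model.evaluate`, and `mapper.update` are illustrative stand-ins for whichever mapper and cost model (e.g., Timeloop) are plugged in.

```python
def explore_map_space(mapper, cost_model, workload, hw_config,
                      objective="EDP", budget=10_000):
    """Generic MSE loop: explore -> evaluate -> update, until the budget is spent."""
    best_mapping, best_cost = None, float("inf")
    for _ in range(budget):                  # sampling budget (or a wall-clock budget)
        mapping = mapper.propose()           # exploration: draw a candidate mapping
        cost = cost_model.evaluate(mapping, workload, hw_config, objective)
        mapper.update(mapping, cost)         # feedback (a no-op for pure random search)
        if cost < best_cost:
            best_mapping, best_cost = mapping, cost
    return best_mapping, best_cost
```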
3.1. Representation of Map Space
While recent work has proposed various representations (MAESTRO [35], UNION [24], and Ruby [22]) to increase mapping diversity in the map space, in this work we leverage the canonical Timeloop representation, which uses loop-nests to represent each tiling level (e.g., the NVDLA-like mapping in Fig. 1). We ensure that all the candidate mappings generated by the various mappers during MSE are legal.

3. In this paper, we focus on finding optimized mappings for individual DNN layers/operators, which has been the target of most Map Space Exploration tools. We leave inter-layer mappings via operator fusion as future work.
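A mapping is legal only if, at every level, the tiles cover the full layer dimensions and the live data fits in the corresponding buffer. The following is a deliberately simplified sketch of such a check, reusing the illustrative Mapping representation sketched in §2.3; real tools such as Timeloop enforce many more constraints.

```python
def is_legal(mapping, layer_dims, l1_bytes, l2_bytes, bytes_per_elem=1):
    """Simplified legality check: tile bounds evenly cover the layer dimensions
    and the per-level data footprint fits in the corresponding buffer."""
    for level_name, level in mapping.levels.items():
        for dim, size in level.tile.items():
            if layer_dims[dim] % size != 0:      # tiles must evenly cover the dimension
                return False
        # Crude footprint estimate: weight + input + output tile volumes at this level.
        t = lambda d: level.tile.get(d, 1)
        weights = t("K") * t("C") * t("R") * t("S")
        inputs  = t("B") * t("C") * (t("Y") + t("R") - 1) * (t("X") + t("S") - 1)
        outputs = t("B") * t("K") * t("Y") * t("X")
        footprint = (weights + inputs + outputs) * bytes_per_elem
        cap = {"L1": l1_bytes, "L2": l2_bytes}.get(level_name)
        if cap is not None and footprint > cap:
            return False
    return True
```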
3.2. Evaluation Method (Cost Model)
MSE relies on a DNN accelerator cost model to estimate the performance of a certain mapping on a given accelerator for a given workload. These cost models are typically analytical, enabling rapid evaluation of different design points in a matter of milliseconds. Some widely used cost models include Timeloop [44], MAESTRO [34], dMazeRunner [12], Interstellar [75], SCALE-sim [52], and others [32,42]. These cost models can model different kinds of accelerators (systolic arrays [52], flexible spatial arrays [12,34,44], sparse accelerators [71], and so on) and capture each accelerator's map space in different formats. In this work, we use Timeloop [44] as our cost model⁴, which is validated against real chips [10,54].
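To illustrate why analytical cost models are so fast, the following is an intentionally oversimplified, roofline-style estimate of latency and energy for a CONV2D layer. It is not Timeloop's or MAESTRO's model; it only sketches the style of closed-form arithmetic such models perform, and all constants are placeholders.

```python
def estimate_cost(layer_dims, num_pes, dram_bw_bytes_per_cyc,
                  macs_per_pe_per_cyc=1, e_mac_pj=1.0, e_dram_pj_per_byte=100.0):
    """Closed-form (analytical) estimate: no simulation, just arithmetic."""
    d = layer_dims
    macs = d["B"] * d["K"] * d["C"] * d["Y"] * d["X"] * d["R"] * d["S"]
    # Lower bound on data moved from DRAM: each tensor read or written once.
    bytes_moved = (d["K"] * d["C"] * d["R"] * d["S"]       # weights
                   + d["B"] * d["C"] * d["Y"] * d["X"]     # inputs (ignoring halos)
                   + d["B"] * d["K"] * d["Y"] * d["X"])    # outputs
    compute_cycles = macs / (num_pes * macs_per_pe_per_cyc)
    memory_cycles = bytes_moved / dram_bw_bytes_per_cyc
    latency = max(compute_cycles, memory_cycles)           # roofline-style bound
    energy_pj = macs * e_mac_pj + bytes_moved * e_dram_pj_per_byte
    return latency, energy_pj
```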
3.3. Exploration Method (Mapper)
The exploration algorithm in MSE (Fig. 2) is called a mapper. Dozens of different DNN mappers have been proposed, which we categorize into random-search-based [12,44,54,63,75], feedback-based (including reinforcement learning and black-box optimization) [7,25,27,28,73,79], gradient-based [19], and others (including mathematical optimization, MCMC, polyhedral transformations, and heuristics) [3,15,23,25,49,64] (Fig. 2). Random-search-based mappers either apply random sampling over the search space or apply pruned random search [6,44], which prunes away redundant parts of the search space to increase sampling efficiency. Feedback-based mappers use a learning algorithm that interacts with the cost model and keeps improving its solution. The run time of both random-search-based and feedback-based mappers depends heavily on the run time of the cost model, which can become the bottleneck of the overall MSE run time. Gradient-based methods use a differentiable surrogate model, which eliminates this bottleneck and can update the solution directly from the gradient of the loss. We take a deeper dive into these three types in §4.3.
4. Timeloop includes both a cost model and mappers. Throughout this paper, we refer to the former as Timeloop and to the latter as Timeloop-mapper. Timeloop-mapper itself supports a variety of search heuristics, the default being Random-Pruned, which we use. We also run other mappers using Timeloop as the cost model.
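As an illustration of the first category, a pruned random search is only a few lines on top of the generic MSE loop sketched in §3: sample tile sizes from the divisors of each dimension (a simple pruning step, since non-divisor tiles are frequently illegal), pick a random loop order, and keep the best candidate seen. This is a sketch in the spirit of pruned random search, not the actual Random-Pruned implementation in Timeloop-mapper; `cost_fn` stands in for a call into a cost model such as Timeloop.

```python
import random

def divisors(n):
    """All positive divisors of n; sampling tiles from these prunes many illegal points."""
    return [d for d in range(1, n + 1) if n % d == 0]

def random_pruned_search(layer_dims, cost_fn, num_samples=1000, seed=0):
    """Toy pruned random search over tile sizes and loop order for one tiling level."""
    rng = random.Random(seed)
    dims = list(layer_dims)
    best_map, best_cost = None, float("inf")
    for _ in range(num_samples):
        tile = {d: rng.choice(divisors(layer_dims[d])) for d in dims}
        order = dims[:]
        rng.shuffle(order)                 # random loop order
        candidate = {"tile": tile, "order": tuple(order)}
        cost = cost_fn(candidate)          # e.g., wraps Timeloop or another cost model
        if cost < best_cost:
            best_map, best_cost = candidate, cost
    return best_map, best_cost
```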