Demystifying Map Space Exploration for NPUs
Sheng-Chun Kao¹, Angshuman Parashar², Po-An Tsai², and Tushar Krishna¹
¹Georgia Institute of Technology
²Nvidia
¹skao6@gatech.edu, tushar@ece.gatech.edu
²{aparashar, poant}@nvidia.com
Abstract
Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable. However, while there are dozens of mappers out there (all empirically claiming to find better mappings than others), the research community lacks systematic insights on how different search techniques navigate the map space and how different mapping axes contribute to the accelerator's performance and efficiency. Such insights are crucial to developing mapping frameworks for emerging DNNs that are increasingly irregular (due to neural architecture search) and sparse, making the corresponding map spaces much more complex. In this work, rather than proposing yet another mapper, we do a first-of-its-kind apples-to-apples comparison of the search techniques leveraged by different mappers. Next, we extract the learnings from our study and propose two new techniques, warm-start and sparsity-aware search, that can augment existing mappers and demonstrate speedups, scalability, and robustness across diverse DNN models¹.
1. Introduction
Deep Neural Networks (DNNs) have become an indispensable tool in the solution toolbox for a variety of complex problems such as object detection, machine translation, language understanding, autonomous driving, and so on. There is growing demand for specialized DNN accelerators (also called Neural Processing Units or NPUs)² that pursue high performance with high energy, power, and area efficiency. The performance and energy efficiency of an NPU depend on how a DNN is mapped over the accelerator's hardware (compute and memory) resources [35,44]. Specifically, a mapping (aka schedule) includes the computation order, parallelization strategy, and tile sizes [35,44], as shown in Fig. 1. In order to achieve high efficiency across a wide range of DNNs that include diverse layer shapes and sizes, state-of-the-art DNN accelerators are often designed with flexibility to support different mapping strategies [9,36,48]. This flexibility imposes a unique challenge for deployment: finding a high-quality mapping between a DNN and the flexible accelerator from the space of all legal mappings (i.e., the map space) at compile time. This is crucial to unlock the full potential of the DNN accelerator.

1. Code available at https://github.com/maestro-project/gamma-timeloop.
2. In this paper, we use the terms DNN Accelerator and NPU interchangeably.
As a result, prior work has clearly defined map space exploration (MSE) [19,23,28,44] as a critical problem for NPU design and/or deployment, cleanly separating it from the hardware architecture design space exploration (DSE) problem. DSE includes identifying the right compute and memory configurations for the NPU within constraints such as total FLOPS, area, and power. MSE, meanwhile, takes the hardware configuration and DNN workload as input and finds optimized mappings for some objective (e.g., latency or energy efficiency). To perform MSE, various search algorithms (i.e., mappers) have been proposed within the past few years [2,3,7,12–15,23,25,41,44,49,50,54,55,57–60,63,64,66,67,70,73,75,76,79].
Despite the success achieved by these prior efforts, MSE remains a computationally challenging problem. This is because the space of legal mappings for even a single layer of a modern DNN (e.g., ResNet-50) on a typical edge-class accelerator [9] is O(10^24) [19,28], which would require more time than the age of the Earth to search exhaustively (assuming 1 ms to evaluate each mapping sample). This gets exacerbated as newer and ever-larger DNN models are being created with increasing frequency, especially thanks to the success of neural architecture search techniques [4,5,39,47,61]. Furthermore, the advent of compressed-sparse DNNs [16,38,40,51,68,69,80], whose mappings are not performance-portable across sparsity levels (a key finding in this paper), further increases the MSE burden.
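As a back-of-the-envelope check of this claim, using the 1 ms-per-sample assumption above and the commonly cited ~4.5-billion-year age of the Earth:

    10^24 mappings × 10^-3 s/mapping = 10^21 s ≈ 3.2 × 10^13 years ≫ 4.5 × 10^9 years (age of the Earth)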
Research into more sophisticated, scalable, and sparsity-aware MSE techniques is at least partially hampered by the fact that, even though prior approaches have empirically shown that their techniques work, none of them demonstrates why it works or offers insight into the optimization techniques behind it.
It is these very insights that we wish to extract in this
paper, and in the process demystify MSE as a problem.
We cover both heuristics and learning-based optimization
approaches, analyze their behavior, and learn from their best
traits. We then use these learnings to scale MSE to more
complex workloads.
[Fig. 1: The overview of DNN Workload, Accelerator, and a (NVDLA-like [1]) Mapping. The figure depicts the workload (CONV2D tensor dimensions B, K, C, Y, X, R, S and GEMM dimensions M, N, K), the accelerator (DRAM, Global Buffer (L2), and a PE array with private L1 buffers and ALUs), the mapping axes (tile sizing, ordering/tile scheduling, and parallelism/spatial partitioning), and the loop-nest representation of a mapping with DRAM, L2, and L1 tiling levels, each with its own tile sizes, loop order, and parallelization.]
Specifically, our contributions are two-fold.
(1) This is the first work, to the best of our knowledge, to quantitatively compare three broad categories of mappers: random-based [44] (i.e., heuristic pruning), feedback-based [28] (i.e., black-box optimization and reinforcement learning), and gradient-based [19] (i.e., surrogate models), and to analyze their trade-offs. We conduct a sensitivity analysis of the different mapping axes to understand the contribution of each axis. We then perform case studies that reveal distinguishing characteristics of good and bad mappings. Our analysis reveals that: (i) random search is inefficient, (ii) gradient-based search converges fast but requires prior knowledge of the accelerator architecture, and (iii) feedback-based search is more adaptable and sample-efficient, but requires a higher cost to acquire each sample. Our analysis also shows that the optimality of a dense DNN mapping does not port over to a sparse DNN.
(2) Based on our findings, we propose two novel heuristic techniques to advance the state-of-the-art in MSE: (i) We propose a warm-start technique that initializes MSE with prior optimal solutions from previous layers, stored in a replay buffer and selected by a similarity metric, enabling the mapper to start at a better point and converge faster. In our evaluations, we find that warm-start helps the mapper converge to a similar performance point 3.3x-7.3x faster. (ii) We also propose a sparsity-aware technique to search for a mapping that performs well across a range of target activation sparsities. A fixed mapping found by our sparsity-aware approach achieves 99.7% of the performance of the mappings specifically tailored to the various density levels.
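The warm-start technique itself is detailed later in the paper. Purely as an illustration of the idea of seeding a search from a similar, previously solved layer, the following is a minimal, hypothetical sketch of a replay buffer keyed by a layer-similarity metric; all names (LayerShape, similarity, ReplayBuffer) and the metric are illustrative, not the paper's actual implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LayerShape:
    """Tensor dimensions of a CONV2D layer (see Fig. 1)."""
    B: int
    K: int
    C: int
    Y: int
    X: int
    R: int
    S: int

def similarity(a: LayerShape, b: LayerShape) -> float:
    """Toy similarity metric: inverse of the summed log-ratio of dimensions."""
    dims = ("B", "K", "C", "Y", "X", "R", "S")
    dist = sum(abs(math.log(getattr(a, d) / getattr(b, d))) for d in dims)
    return 1.0 / (1.0 + dist)

@dataclass
class ReplayBuffer:
    """Stores the best mapping found for each previously searched layer."""
    entries: list = field(default_factory=list)  # (LayerShape, mapping) pairs

    def record(self, shape: LayerShape, best_mapping) -> None:
        self.entries.append((shape, best_mapping))

    def warm_start(self, new_shape: LayerShape):
        """Return the stored mapping of the most similar prior layer, if any."""
        if not self.entries:
            return None
        return max(self.entries, key=lambda e: similarity(e[0], new_shape))[1]
```

A mapper would use the returned mapping to seed its starting point (or initial population) instead of a random initialization.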
2. Background: DNN Accelerators
2.1. DNN Workloads
In this work, we use individual DNN layers/operators
as our target workload. The workloads vary across different
DNN models because of different types of operations such as
CONV2D, Depth-wise CONV, Point-wise CONV, Attention,
Fully-Connected (FC), and so on, and different tensor shapes
for the layers (i.e., batch, input, weight kernel sizes), as
shown in Fig. 1. All these operations can be represented with
a loop-nest of computations. For example, a CONV2D can
be represented as 7 for-loops, and GEMM can be represented
as 3 for-loops.
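To make the loop-nest view concrete, the sketch below writes a plain (untiled, unmapped) CONV2D as the seven for-loops referenced above, using the dimension names from Fig. 1 and assuming unit stride and no padding for simplicity.

```python
import numpy as np

def conv2d_loops(inp, wgt):
    """Naive CONV2D as 7 for-loops. inp: [B, C, Y, X], wgt: [K, C, R, S]."""
    B, C, Y, X = inp.shape
    K, _, R, S = wgt.shape
    Yo, Xo = Y - R + 1, X - S + 1              # output height/width (stride 1, no padding)
    out = np.zeros((B, K, Yo, Xo))
    for b in range(B):                         # batch
        for k in range(K):                     # output channels
            for c in range(C):                 # input channels
                for y in range(Yo):            # output rows
                    for x in range(Xo):        # output columns
                        for r in range(R):     # filter rows
                            for s in range(S): # filter columns
                                out[b, k, y, x] += inp[b, c, y + r, x + s] * wgt[k, c, r, s]
    return out
```

A GEMM follows the same pattern with only three loops (M, N, K).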
2.2. Accelerator Hardware Configuration
A canonical NPU often houses a spatial array of Process-
ing Elements (PEs), as shown in Fig. 1. Each PE has one to
several ALU units to compute partial sums, and private local
(aka “L1”) buffers to store weights, input activations and
partial sums. The accelerator also houses a global shared
(aka “L2”) buffer to prefetch activations and weights from
DRAM for the next tile of computation that will be mapped
over the PEs and L1 buffers. Networks-on-Chip are used
to distribute operands from the global L2 buffer to the L1
buffers in the PEs, collect the partial or full outputs, and
write them back to the L2 buffer.
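For reference, the hardware parameters that MSE later takes as input (§3) can be summarized compactly. The following is a minimal sketch with illustrative field names and example values that are not taken from any specific NPU.

```python
from dataclasses import dataclass

@dataclass
class NPUConfig:
    """Hardware parameters an MSE tool typically treats as fixed inputs."""
    num_pes: int                 # number of Processing Elements in the spatial array
    alus_per_pe: int             # ALU/MAC units inside each PE
    l1_bytes: int                # private (per-PE) buffer capacity
    l2_bytes: int                # global shared buffer capacity
    noc_bw_bytes_per_cyc: int    # NoC bandwidth from L2 to the PE array
    dram_bw_bytes_per_cyc: int   # off-chip bandwidth

# Example with placeholder values only (not a real accelerator spec):
edge_npu = NPUConfig(num_pes=256, alus_per_pe=1, l1_bytes=512,
                     l2_bytes=128 * 1024, noc_bw_bytes_per_cyc=32,
                     dram_bw_bytes_per_cyc=16)
```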
2.3. Accelerator Map-Space
Given a DNN workload, there exist several choices for mapping it onto the accelerator's PEs and buffer hierarchy over space and time. The mapping includes the following components [34,44], shown in Fig. 1:
(1) Tile sizes: the ability to change the bounds and aspect ratios of data tiles from one or more operand tensors per level of the buffer hierarchy [46].
(2) Loop order: the ability to change the loop orders iterated per tiling level.
(3) Loop parallelization: the ability to change which tensor dimensions are parallelized per tiling level. This represents the spatial partitioning of data (i.e., across PEs).
Fig. 1 shows an example of the mapping used by the NVDLA [1] accelerator. Choices for (2) and (3) together are often referred to as dataflow [34], which has been informally classified by prior work into weight-stationary, output-stationary, and input-stationary [8]. The design space of all possible mappings (i.e., dataflows + tile sizes) that an accelerator can support is called its Map-Space [44].

[Fig. 2: A canonical Map Space Exploration framework. Inputs: the objective (e.g., EDP, latency), the workload (e.g., ResNet) and its features (e.g., sparsity), and the accelerator configuration (memory: buffer levels, buffer type, sizes, banks, widths, read/write bandwidth; compute: number of PEs, number of ALUs, bit-width). Components: the map-space representation, the exploration method (mapper), and the evaluation cost model (e.g., Timeloop, MAESTRO). Mapper categories: random-based (e.g., Random-Pruned; the fastest, with a lightweight exploration algorithm, but poor sampling efficiency), gradient-based (e.g., Mind Mappings; fast updates by gradient that avoid interaction with the cost model, but the surrogate model is tied to one or a few accelerator configurations), feedback-based black-box/RL (e.g., Gamma; high sampling efficiency, but slow due to a heavy exploration algorithm and frequent interaction with the cost model), and other methods (e.g., CoSA; polyhedral, MIP, MCMC).]
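Putting the three axes together, a single mapping for one layer can be captured per tiling level (DRAM, L2, L1 in Fig. 1) by its tile sizes, loop order, and parallelized dimensions. The sketch below is one minimal way to represent this; the field names are illustrative and are not Timeloop's actual syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

DIMS = ("B", "K", "C", "Y", "X", "R", "S")  # CONV2D dimensions from Fig. 1

@dataclass
class LevelMapping:
    """Mapping choices at one tiling level (e.g., DRAM, L2, or L1)."""
    tile: Dict[str, int]      # tile size for each dimension at this level
    order: Tuple[str, ...]    # loop order, outermost to innermost
    parallel: List[str]       # dimensions spatially partitioned across PEs

@dataclass
class Mapping:
    levels: Dict[str, LevelMapping]  # keyed by "DRAM", "L2", "L1"

# Example loosely in the spirit of the NVDLA-like mapping of Fig. 1:
# parallelize K and C across the PE array at the L2 level.
example = Mapping(levels={
    "L2": LevelMapping(
        tile={"B": 1, "K": 16, "C": 64, "Y": 4, "X": 4, "R": 3, "S": 3},
        order=("B", "K", "C", "R", "S", "Y", "X"),
        parallel=["K", "C"]),
})
```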
Flexible DNN accelerators [9,36] allow a mapping optimizer within a compiler to explore tile sizes, loop orders, and parallelization independently for each layer. This mapping flexibility is crucial for accelerators to adapt to the growing diversity in DNNs [34]. The overall runtime and energy efficiency of an accelerator depend on both the hardware configuration and the mapping, making it crucial to find an optimized mapping³ [34,44,75], as we discuss next.
3. Map Space Exploration (MSE)
A canonical MSE framework is shown in Fig. 2. MSE takes the NPU's HW configuration (§2.2) and target DNN workloads (size, shape, and additional features such as the sparsity levels of weights and/or activations) as input and finds optimized mappings for a given objective (e.g., latency, throughput, energy, energy-delay product (EDP), and so on). MSE may be run at compile time within a mapping optimizer [6] after the NPU is deployed, or at design time in conjunction with DSE for co-optimizing the mapping and HW configuration [31,73].

The MSE process often includes three parts: the representation of the search space, the evaluation method, and the exploration method. The representation defines the scope of the search problem and the size of the search space. An optimization loop that includes exploration and evaluation performs the actual search. The optimization continues until the MSE converges, or until it reaches a given sampling budget or wall-clock run-time budget.
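This explore-evaluate loop can be written down in a few lines. The sketch below is generic pseudo-Python in which `mapper.propose`, `cost_model.evaluate`, and `mapper.update` are illustrative stand-ins for whichever mapper and cost model (e.g., Timeloop) are plugged in.

```python
def explore_map_space(mapper, cost_model, workload, hw_config,
                      objective="EDP", budget=10_000):
    """Generic MSE loop: explore -> evaluate -> update, until the budget is spent."""
    best_mapping, best_cost = None, float("inf")
    for _ in range(budget):                  # sampling budget (or a wall-clock budget)
        mapping = mapper.propose()           # exploration: draw a candidate mapping
        cost = cost_model.evaluate(mapping, workload, hw_config, objective)
        mapper.update(mapping, cost)         # feedback (a no-op for pure random search)
        if cost < best_cost:
            best_mapping, best_cost = mapping, cost
    return best_mapping, best_cost
```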
3.1. Representation of Map Space
While recent work has proposed various representations (MAESTRO [35], UNION [24], and Ruby [22]) to increase mapping diversity in the map space, in this work we leverage the canonical Timeloop representation, which uses loop-nests to represent each tiling level (e.g., the NVDLA-like mapping in Fig. 1). We ensure that all the candidate mappings generated by the various mappers during MSE are legal.

3. In this paper, we focus on finding optimized mappings for individual DNN layers/operators, which has been the target of most Map Space Exploration tools. We leave inter-layer mappings via operator fusion as future work.
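A mapping is legal only if, at every level, the tiles cover the full layer dimensions and the live data fits in the corresponding buffer. The following is a deliberately simplified sketch of such a check, reusing the illustrative Mapping representation sketched in §2.3; real tools such as Timeloop enforce many more constraints.

```python
def is_legal(mapping, layer_dims, l1_bytes, l2_bytes, bytes_per_elem=1):
    """Simplified legality check: tile bounds evenly cover the layer dimensions
    and the per-level data footprint fits in the corresponding buffer."""
    for level_name, level in mapping.levels.items():
        for dim, size in level.tile.items():
            if layer_dims[dim] % size != 0:      # tiles must evenly cover the dimension
                return False
        # Crude footprint estimate: weight + input + output tile volumes at this level.
        t = lambda d: level.tile.get(d, 1)
        weights = t("K") * t("C") * t("R") * t("S")
        inputs  = t("B") * t("C") * (t("Y") + t("R") - 1) * (t("X") + t("S") - 1)
        outputs = t("B") * t("K") * t("Y") * t("X")
        footprint = (weights + inputs + outputs) * bytes_per_elem
        cap = {"L1": l1_bytes, "L2": l2_bytes}.get(level_name)
        if cap is not None and footprint > cap:
            return False
    return True
```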
3.2. Evaluation Method (Cost Model)
MSE relies on a DNN accelerator cost model to estimate the performance of a certain mapping on a given accelerator for a given workload. These cost models are typically analytical, enabling rapid evaluation of different design points in a matter of milliseconds. Some widely used cost models include Timeloop [44], MAESTRO [34], dMazeRunner [12], Interstellar [75], SCALE-sim [52], and others [32,42]. These cost models can model different kinds of accelerators (systolic arrays [52], flexible spatial arrays [12,34,44], sparse accelerators [71], and so on) and capture each accelerator's map space in different formats. In this work, we use Timeloop [44] as our cost model⁴, which is validated against real chips [10,54].
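To illustrate why analytical cost models are so fast, the following is an intentionally oversimplified, roofline-style estimate of latency and energy for a CONV2D layer. It is not Timeloop's or MAESTRO's model; it only sketches the style of closed-form arithmetic such models perform, and all constants are placeholders.

```python
def estimate_cost(layer_dims, num_pes, dram_bw_bytes_per_cyc,
                  macs_per_pe_per_cyc=1, e_mac_pj=1.0, e_dram_pj_per_byte=100.0):
    """Closed-form (analytical) estimate: no simulation, just arithmetic."""
    d = layer_dims
    macs = d["B"] * d["K"] * d["C"] * d["Y"] * d["X"] * d["R"] * d["S"]
    # Lower bound on data moved from DRAM: each tensor read or written once.
    bytes_moved = (d["K"] * d["C"] * d["R"] * d["S"]       # weights
                   + d["B"] * d["C"] * d["Y"] * d["X"]     # inputs (ignoring halos)
                   + d["B"] * d["K"] * d["Y"] * d["X"])    # outputs
    compute_cycles = macs / (num_pes * macs_per_pe_per_cyc)
    memory_cycles = bytes_moved / dram_bw_bytes_per_cyc
    latency = max(compute_cycles, memory_cycles)           # roofline-style bound
    energy_pj = macs * e_mac_pj + bytes_moved * e_dram_pj_per_byte
    return latency, energy_pj
```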
3.3. Exploration Method (Mapper)
The exploration algorithm in MSE (Fig. 2) is called a mapper. Dozens of different DNN mappers have been proposed, which we categorize into random-search-based [12,44,54,63,75], feedback-based (including reinforcement learning and black-box optimization) [7,25,27,28,73,79], gradient-based [19], and others (including mathematical optimization, MCMC, polyhedral transformations, and heuristics) [3,15,23,25,49,64] (Fig. 2). Random-search-based mappers either apply random sampling over the search space or apply pruned random search [6,44], which prunes away redundant parts of the search space to increase sampling efficiency. Feedback-based mappers use a learning algorithm that interacts with the cost model and keeps improving its solution. The run time of both random-search-based and feedback-based mappers depends heavily on the run time of the cost model, which can become the bottleneck of the overall MSE run time. Gradient-based methods use a differentiable surrogate model, which eliminates this bottleneck and can update the solution directly from the gradient of the loss. We take a deeper dive into these three types in §4.3.
4. Timeloop includes both a cost model and mappers. Throughout this paper, we refer to the former as Timeloop and to the latter as Timeloop-mapper. Timeloop-mapper itself supports a variety of search heuristics, the default being Random-Pruned, which we use. We also run other mappers using Timeloop as the cost model.
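As an illustration of the first category, a pruned random search is only a few lines on top of the generic MSE loop sketched in §3: sample tile sizes from the divisors of each dimension (a simple pruning step, since non-divisor tiles are frequently illegal), pick a random loop order, and keep the best candidate seen. This is a sketch in the spirit of pruned random search, not the actual Random-Pruned implementation in Timeloop-mapper; `cost_fn` stands in for a call into a cost model such as Timeloop.

```python
import random

def divisors(n):
    """All positive divisors of n; sampling tiles from these prunes many illegal points."""
    return [d for d in range(1, n + 1) if n % d == 0]

def random_pruned_search(layer_dims, cost_fn, num_samples=1000, seed=0):
    """Toy pruned random search over tile sizes and loop order for one tiling level."""
    rng = random.Random(seed)
    dims = list(layer_dims)
    best_map, best_cost = None, float("inf")
    for _ in range(num_samples):
        tile = {d: rng.choice(divisors(layer_dims[d])) for d in dims}
        order = dims[:]
        rng.shuffle(order)                 # random loop order
        candidate = {"tile": tile, "order": tuple(order)}
        cost = cost_fn(candidate)          # e.g., wraps Timeloop or another cost model
        if cost < best_cost:
            best_map, best_cost = candidate, cost
    return best_map, best_cost
```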