Demystifying Map Space Exploration for NPUs
Sheng-Chun Kao1, Angshuman Parashar2, Po-An Tsai2, and Tushar Krishna1
1Georgia Institute of Technology
2Nvidia
1skao6@gatech.edu, tushar@ece.gatech.edu
2{aparashar, poant}@nvidia.com
Abstract
Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable. However, while there are dozens of mappers out there (all empirically claiming to find better mappings than others), the research community lacks systematic insights on how different search techniques navigate the map space and how different mapping axes contribute to the accelerator’s performance and efficiency. Such insights are crucial to developing mapping frameworks for emerging DNNs that are increasingly irregular (due to neural architecture search) and sparse, making the corresponding map spaces much more complex. In this work, rather than proposing yet another mapper, we do a first-of-its-kind apples-to-apples comparison of the search techniques leveraged by different mappers. Next, we extract the learnings from our study and propose two new techniques that can augment existing mappers — warm-start and sparsity-aware — that demonstrate speedups, scalability, and robustness across diverse DNN models.¹
1. Introduction
Deep Neural Networks (DNNs) have become an indispensable tool in the solution toolbox for a variety of complex problems such as object detection, machine translation, language understanding, and autonomous driving. There is growing demand for specialized DNN accelerators (also called Neural Processing Units or NPUs)² that pursue high performance with high energy, power, and area efficiency.
The performance and energy-efficiency of an NPU depend on how a DNN is mapped over the accelerator’s hardware (compute and memory) resources [35,44]. Specifically, a mapping (aka schedule) specifies the computation order, parallelization strategy, and tile sizes [35,44], as shown in Fig. 1. In order to achieve high efficiency across a wide
range of DNNs that include diverse layer shapes and sizes,
state-of-the-art DNN accelerators are often designed with flexibility to support different mapping strategies [9,36,48].
This flexibility imposes a unique challenge for deployment:
finding a high-quality mapping between a DNN and the
flexible accelerator from the space of all legal mappings
(i.e., the map space) at compile time. This is crucial to
unlock the full potential of the DNN accelerator.
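To make this concrete, the sketch below shows one way the three mapping axes named above (computation order, parallelization strategy, and tile sizes) could be encoded for a single convolution layer. This is a minimal illustration in Python; the class and field names are our own, not Timeloop’s or any specific mapper’s actual format.

# A minimal, illustrative encoding of one mapping point for a conv layer.
# Names and structure are hypothetical, not a real mapper's format.
from dataclasses import dataclass, field

@dataclass
class Mapping:
    # Computation order: loop nest over the conv dimensions, outermost first
    # (N=batch, K=output channels, C=input channels, Y/X=output rows/cols,
    # R/S=filter rows/cols).
    loop_order: tuple = ("N", "K", "C", "Y", "X", "R", "S")
    # Tile sizes: how much of each dimension is staged at each memory level.
    tile_sizes: dict = field(default_factory=lambda: {
        "L2_buffer": {"K": 64, "C": 64, "Y": 7, "X": 7},
        "L1_buffer": {"K": 16, "C": 16, "Y": 1, "X": 1},
    })
    # Parallelization strategy: dimensions unrolled spatially across PEs.
    parallel_dims: dict = field(default_factory=lambda: {"K": 16, "C": 4})

Every legal assignment to these three axes is one point in the map space; the cross-product of choices across axes, dimensions, and memory levels is what makes the space so large.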
As a result, prior work has clearly defined map space exploration (MSE) [19,23,28,44] as a critical problem for NPU design and/or deployment, cleanly separating it from the hardware architecture design space exploration (DSE) problem. DSE involves identifying the right compute and memory configurations for the NPU within constraints such as total FLOPS, area, and power. MSE, meanwhile, takes the hardware configuration and DNN workload as inputs and searches for mappings that optimize some objective (e.g., latency or energy efficiency). To perform MSE, various search algorithms (i.e., mappers) have been proposed within the past few years [2,3,7,12–15,23,25,41,44,49,50,54,55,57–60,63,64,66,67,70,73,75,76,79].
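Abstracting away their differences, all of these mappers instantiate the same black-box search loop: sample a legal mapping, score it with a cost model, and keep the best point found. The sketch below shows the simplest such mapper (pure random search); map_space.sample() and cost_model.latency() are hypothetical stand-ins for a legality-aware sampler and an analytical cost model such as Timeloop, not actual APIs.

def random_search_mse(map_space, cost_model, budget=10_000):
    """Simplest possible mapper: random search over the map space."""
    best_mapping, best_latency = None, float("inf")
    for _ in range(budget):
        mapping = map_space.sample()           # draw one legal mapping
        latency = cost_model.latency(mapping)  # evaluate the objective
        if latency < best_latency:             # keep the best point so far
            best_mapping, best_latency = mapping, latency
    return best_mapping, best_latency

Heuristic and learning-based mappers differ mainly in how the sampling step is steered by previously evaluated points rather than drawn blindly.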
Despite the success achieved by these prior efforts, MSE
remains a computationally challenging problem. This is
because the search space of legal mappings for even a single layer of a modern DNN (e.g., ResNet-50) on a typical edge-class accelerator [9] is $\sim O(10^{24})$ [19,28], which would require more time than the age of the earth to search exhaustively (assuming 1 ms to evaluate each mapping sample). This gets exacerbated as newer and ever-larger DNN models are being created with increasing frequency, especially thanks to the success of neural architecture search techniques [4,5,39,47,61]. Furthermore, the advent of compressed-sparse DNNs [16,38,40,51,68,69,80], whose mappings are not performance-portable across sparsity levels (a key finding in this paper), further increases the MSE burden.
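The exhaustive-search claim above is simple arithmetic; the check below uses the 1 ms-per-sample assumption stated in the text.

# Back-of-envelope check of the exhaustive-search claim.
num_mappings = 1e24   # map-space size for one layer [19,28]
eval_time_s = 1e-3    # assumed 1 ms per mapping evaluation
seconds_per_year = 3600 * 24 * 365

total_years = num_mappings * eval_time_s / seconds_per_year
print(f"exhaustive search: {total_years:.1e} years")        # ~3.2e13 years
print(f"~{total_years / 4.5e9:.0f}x the age of the earth")  # ~7000x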
Research into more sophisticated, scalable, and sparsity-aware MSE techniques is at least partially hampered by the fact that, even though prior approaches have empirically shown that their techniques work, none of them explains why they work or the insight behind their optimization techniques.
It is these very insights that we wish to extract in this
paper, and in the process demystify MSE as a problem.
We cover both heuristic and learning-based optimization
approaches, analyze their behavior, and learn from their best
traits. We then use these learnings to scale MSE to more
complex workloads.
1. Code available at https://github.com/maestro-project/gamma-timeloop.
2. In this paper, we use the terms DNN Accelerator and NPU interchangeably.