DPU-v2: Energy-efficient execution of irregular directed acyclic graphs
Nimish Shah, Wannes Meert, and Marian Verhelst
Department of Electrical Engineering - MICAS, KU Leuven, Belgium
Department of Computer Science - DTAI, KU Leuven, Belgium
Emails: nshah@esat.kuleuven.be, wannes.meert@kuleuven.be, marian.verhelst@kuleuven.be
Abstract—A growing number of applications like probabilis-
tic machine learning, sparse linear algebra, robotic navigation,
etc., exhibit irregular data flow computation that can be
modeled with directed acyclic graphs (DAGs). The irregularity
arises from the seemingly random connections of nodes, which
makes the DAG structure unsuitable for vectorization on
CPU or GPU. Moreover, the nodes usually represent a small
number of arithmetic operations that cannot amortize the
overhead of launching tasks/kernels for each node, further
posing challenges for parallel execution.
To enable energy-efficient execution, this work proposes
DAG processing unit (DPU) version 2, a specialized processor
architecture optimized for irregular DAGs with static connec-
tivity. It consists of a tree-structured datapath for efficient
data reuse, a customized banked register file, and interconnects
tuned to support irregular register accesses. DPU-v2 is utilized
effectively through a targeted compiler that systematically
maps operations to the datapath, minimizes register bank
conflicts, and avoids pipeline hazards. Finally, a design space
exploration identifies the optimal architecture configuration
that minimizes the energy-delay product. This hardware-
software co-optimization approach results in a speedup of
1.4×, 3.5×, and 14× over a state-of-the-art DAG processor
ASIP, a CPU, and a GPU, respectively, while also achieving a
lower energy-delay product. In this way, this work takes an
important step towards enabling an embedded execution of
emerging DAG workloads.
Keywords-Graphs, irregular computation graphs, parallel
processor, hardware-software codesign, DAG processing unit,
probabilistic circuits, sparse matrix triangular solve, spatial
datapath, interconnection network, bank conflicts, design space
exploration
I. INTRODUCTION
Emerging machine learning models like probabilistic cir-
cuits (PC, also called sum-product networks) [4], [41] are
increasingly used for energy-constrained applications like
robotic navigation [59], robust image classification [49],
human activity recognition [17], etc. PCs exhibit an ir-
regular flow of data that can be represented with directed
acyclic graphs (DAGs), as shown in fig. 1(a) for a human
activity recognition model. Similarly, sparse linear algebra
operations like sparse triangular solve (SpTRSV) used for
embedded applications like robotic localization and mapping
[40], wireless communication [54], cryptography [15], etc.,
can also be represented with irregular DAGs. Fig. 1(b) shows
an example of a DAG for a matrix from the SuiteSparse
collection [13]. In these DAGs, nodes represent arithmetic
Figure 1. Example applications with irregular computation DAGs. (a)
Probabilistic circuits (PC) used in machine learning, and (b) Sparse matrix
triangular solves (SpTRSV) used in linear algebra. (c) CPUs and GPUs achieve
throughput significantly below their peaks for such DAGs.
operations and edges represent the dependencies among
these operations.
Unfortunately, these irregular unstructured DAGs pose
challenges to general-purpose computing architectures like
GPUs and single-instruction-multiple-data (SIMD) vector-
ization in CPUs. Fig. 1(c) shows the throughputs of an Intel
Xeon CPU and an Nvidia RTX GPU for these applications. The CPU performs much below the peak achievable
throughput (3.4 TOPS in this case), and the GPU even
underperforms compared to the CPU despite having highly
parallel hardware (especially for DAGs with fewer than
100K nodes). This underperformance is caused by the fol-
lowing two reasons, preventing the platforms from utilizing
the sparsity in the workloads:
• The seemingly-random edge connectivity of an unstructured DAG results in irregular memory accesses with poor spatial locality. Hence, the large granularity of the cache-line size leads to severe under-utilization of caches and memory bandwidth, because only 4B might get consumed out of the 32/64B of data fetched to a cache line. Furthermore, these irregular accesses also prevent memory coalescing, which is crucial for high GPU performance [52].
• Due to the irregular structure, parallelizing different parts of DAGs across multiple units (like CPU cores, GPU streaming multiprocessors, etc.) incurs high communication and synchronization overhead, limiting the parallelization benefits [44].

arXiv:2210.13184v1 [cs.AR] 20 Oct 2022
To address these issues, this paper proposes the DAG processing unit (DPU-v2), a processing architecture¹ optimized for irregular DAGs, along with a custom-designed targeted compiler. The major contributions of this work are as follows:
1) Processor: A parameterized architecture template is de-
signed with a tree-structured datapath for immediate
reuse of data, a banked register file with a novel address-
ing scheme, and high-bandwidth flexible interconnects.
The architecture is made programmable with a custom
instruction set with long words.
2) Compiler: A specialized compiler is designed that sched-
ules DAG nodes such that the tree-based datapath is
maximally utilized. The irregularity of DAGs leads to
frequent register bank conflicts, which are reduced with
a conflict-aware register allocation while taking into
account the interconnect constraints.
3) Design-space exploration: Since there are multiple de-
sign parameters in the architecture template (e.g. the
number of register banks, computational parallelism,
etc.), a design space exploration is performed to arrive at
the architecture with the minimum energy-delay product
in a 28nm CMOS technology, and compared with state-
of-the-art implementations.
This work focuses on DAGs that remain static across
multiple executions, which allows amortization of the com-
pilation overhead, enabling highly targeted (but relatively
slow) DAG-specific compilation. This assumption of DAGs
being static is valid for our target workloads: (1) the structure
of a trained PC remains static during inference, and (2) in
many applications using SpTRSV like robotic localization
and mapping [40], wireless communication [54], cryptogra-
phy [15], etc., the sparsity pattern of a matrix remains static
while the numerical values or the right-hand side vector
change across executions, which effectively only changes
the inputs of the DAG. Furthermore, DAGs are assumed to
contain only arithmetic nodes (and no conditional nodes)
that are executed in a predefined order based on edge-
induced dependencies. This assumption is also valid for the
target workloads in this paper, but excludes operations like
breadth-first search (even on a static DAG) from the scope
of this work.
The paper is organized as follows. §II discusses this
work’s approach for efficient parallel execution of DAGs.
§III describes the processor architecture template of DPU-v2
followed by §IV explaining the compiler. Next, §V presents
¹Available at https://github.com/nimish15shah/DAG_Processor.
Figure 2. (a) A common approach for executing DAGs in parallel, in which
the unpredictable irregular accesses happen to the scratchpad/memory. (b) In
DPU-v2, by pushing the interconnect closer to the datapath, the compiler
is equipped to predict the irregular accesses and prevent bank conflicts.
the design-space exploration to find the design with the
minimum energy-delay product, and compares it with state-
of-the-art work. Finally, §VI discusses related works and
§VII concludes the paper.
II. CONCEPTS TO TAME IRREGULARITY
In this paper, a DAG encodes the computations to be
performed for an application and its data dependencies. A
DAG node represents arithmetic operations (e.g. additions,
multiplications) to be performed on the node’s inputs. This
work focuses on DAGs with fine-grained nodes with low
arithmetic intensity. In other words, the nodes represent a
scalar or a small-sized vector operation but not large tensor
operations, in contrast to DAGs of deep neural networks, for
example, which have compute-intensive tensor operations as
nodes. The output result of a node serves as an input to
(possibly multiple) nodes connected to the outgoing edges.
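To make this concrete, the following is an illustrative sketch (not from the paper) of how such a fine-grained DAG arises for SpTRSV: solving row i of a lower-triangular system L·x = b requires x[j] for every off-diagonal nonzero L[i][j], so each such nonzero contributes one dependency edge. The `sptrsv_dag` helper and the 4×4 sparsity pattern are assumptions for illustration only.

```python
# Sketch (illustrative): derive the SpTRSV dependency DAG from a
# lower-triangular sparsity pattern. Row i can only be solved once x[j]
# is known for every off-diagonal nonzero L[i][j] with j < i.
def sptrsv_dag(rows):
    """rows[i] = column indices of nonzeros in row i (including i itself)."""
    edges = []
    for i, cols in enumerate(rows):
        for j in cols:
            if j < i:              # off-diagonal nonzero -> dependency edge
                edges.append((j, i))
    return edges

# 4x4 example: off-diagonal nonzeros at (1,0), (3,1), (3,2).
rows = [[0], [0, 1], [2], [1, 2, 3]]
print(sptrsv_dag(rows))   # -> [(0, 1), (1, 3), (2, 3)]
```

Note that each node here is a single scalar divide-and-accumulate, matching the low arithmetic intensity described above.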
Execution order: Executing a DAG means evaluating all
the nodes in a valid order imposed by the directed edges
representing data dependencies. For all the edges, the source
node should be executed before the destination node. Put
differently, a node can be executed only after all the prede-
cessor nodes have finished execution. Note that the DAGs
targeted in this work do not contain conditional branch
nodes, and hence all the nodes are supposed to be executed.
What can be executed in parallel? At a given moment, all
the nodes whose predecessor nodes have finished execution
can be executed in parallel. The execution begins with the
source nodes of the DAG, as they do not have any incoming
edges. Subsequently, different nodes become ready as the
execution progresses. DAGs from real-world applications
typically have ample parallelism, which can be utilized
through a parallel processing architecture.
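The notion of "ready" nodes can be sketched as a level (wavefront) computation; the helper below is an illustration of the concept, not the paper's scheduler, and assumes the node list is already topologically sorted.

```python
# Sketch (illustrative, not DPU-v2's scheduler): group DAG nodes into
# "levels" that may execute in parallel. A node's level is one more than
# the deepest of its predecessors; level 0 holds the source nodes.
from collections import defaultdict

def parallel_levels(nodes, edges):
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    level = {}
    for v in nodes:                      # nodes assumed topologically sorted
        level[v] = 1 + max((level[u] for u in preds[v]), default=-1)
    waves = defaultdict(list)
    for v, l in level.items():
        waves[l].append(v)
    return [waves[l] for l in sorted(waves)]

edges = [("a", "c"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")]
print(parallel_levels(["a", "b", "c", "d", "e"], edges))
# -> [['a', 'b'], ['c', 'd'], ['e']]
```

The size of each level is an upper bound on the parallelism a processor can exploit in that wave.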
The rest of this section discusses the challenges for
parallel processing of irregular DAGs and the techniques
introduced in this work to address them.
A. Making irregular data accesses predictable
Fig. 2(a) shows a common approach for accelerating
DAGs, in which a DAG is partitioned into smaller sub-
graphs that can be executed on parallel cores, e.g., in a
multicore CPU. This approach is used in works like [39],
[44], [46]. The cores have to synchronize occasionally for
race-free communication that happens through a shared
memory structure like an L3 cache in CPUs or a shared
scratchpad in [46]. Due to the DAG irregularity, the ac-
cesses to this shared memory structure usually happen to
random addresses, inducing frequent bank conflicts in the
typically-used banked memory structures. These conflicts
could become a severe bottleneck. For instance, in [46],
43% of the load requests result in bank conflicts. Work in
[11], [43] alleviates the impact of these conflicts through
aggressive reordering of requests, but incurs the overhead
of a reordering buffer. Yet, none of these works attempt to
avoid the conflicts altogether. Given that the DAG structure
is known at compile time, it could be possible to perform
the memory allocation such that the simultaneous requests
do not access the same bank. The main challenge for such
conflict avoidance is that it is not straightforward to predict
which requests happen simultaneously. The parallel cores
normally execute asynchronously, i.e., a stall in one core (e.g.,
to wait for the next instruction) does not stall the others,
preventing a compiler from predicting conflicts.
Fig. 2(b) shows an alternative approach that makes the
simultaneous irregular accesses predictable by pushing the
interconnect closer to the datapath and forcing the PE to
operate in a synchronous way. This limits the irregular
accesses to the register file, while the memory/scratchpad
can be accessed in a regular pattern. In every cycle, the
compiler knows which registers are accessed and can thereby
also predict register bank conflicts. The datapath with par-
allel PEs, however, cannot be programmed with SIMD-
like instructions because the PEs will gather data from
different addresses in the register banks owing to DAG
irregularity. Instead, VLIW (very long instruction word)-like
instructions can be used. Yet, unlike conventional VLIW
with a partitioned register file [2], the datapath does not
have to be limited to an array of PEs, but complex dataflow
patterns can be utilized.
This work hypothesizes that the architecture in fig. 2(b)
can improve throughput by avoiding bank conflicts and by
using a datapath tuned for the target workloads.
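When the per-cycle schedule is fixed, this conflict avoidance amounts to coloring a conflict graph of variables that are read together. The `assign_banks` helper below is a simplified greedy illustration of that idea under these assumptions, not DPU-v2's actual allocator.

```python
# Sketch of the idea behind fig. 2(b) (a simplification, not the DPU-v2
# compiler): with a fixed per-cycle schedule, the compiler knows which
# variables are read in the same cycle and can greedily place each variable
# in a bank not used by any variable it co-occurs with.
def assign_banks(cycles, n_banks):
    """cycles: list of variable groups read in the same cycle."""
    bank = {}
    for group in cycles:
        for var in group:
            if var in bank:
                continue
            # banks already taken by variables sharing a cycle with var
            taken = {bank[v] for g in cycles if var in g for v in g if v in bank}
            free = [b for b in range(n_banks) if b not in taken]
            if not free:
                raise ValueError(f"greedy pass found no conflict-free bank for {var}")
            bank[var] = free[0]
    return bank

# Two cycles, each reading two variables, mapped onto a 2-bank file.
print(assign_banks([["x", "y"], ["x", "z"]], n_banks=2))
# -> {'x': 0, 'y': 1, 'z': 1}
```

Greedy coloring can fail where an optimal assignment exists; §IV describes how the actual compiler handles residual conflicts.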
B. Reducing data accesses
To ease the problem of bank conflicts further and to reduce
energy consumption, the intermediate results in the com-
putation could be immediately consumed in the datapath,
avoiding reads/writes from a register file. A commonly used
topology for such data reuse is a systolic array of PEs [27]
(fig. 3(a)), yet it is unsuitable for sparse DAGs due to severe
hardware under-utilization as described below.
[Figure 3 panels: (a) a systolic array with n inputs and n²/4 PEs; (b) a tree with n inputs and n−1 PEs; (c) peak utilization (%) of systolic vs. tree datapaths for 2, 4, 8, and 16 inputs.]
Figure 3. Systolic arrays are under-utilized by irregular DAGs, while a
tree-shaped datapath is a promising alternative.
Peak utilization: To check the suitability of the systolic
array, the spatial datapath mapper from [34] is used to find
the largest subgraph in our target DAGs (described in §V-A)
that can be mapped to the array. This essentially is the peak
utilization possible for the datapath. Fig. 3(c) shows that the
peak utilization of the systolic array drops rapidly with the
increasing number of inputs, and even a 4×4 array with 8
inputs cannot be properly utilized.
Tree of PEs: The tree-structured datapath (fig. 3(b)) is a
promising alternative candidate, as multi-input DAG nodes
can be spatially unrolled as binary trees. As shown in
fig. 3(c), the spatial mapper confirms that there exist sub-
graphs in DAGs that can fully utilize the datapath.
The spatial mapper used here can quantify the peak
utilization, but the constrained-optimization-based approach
used in the mapper is too slow to fully map large DAGs with
tens of thousands of nodes. For a scalable solution, this work
presents a heuristic algorithm to decompose irregular DAGs
into tree-like subgraphs (§IV-A).
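As a rough illustration of what such a decomposition does (far simpler than the §IV-A heuristic itself), one can cut every edge leaving a multi-fanout node; in the remaining graph every node feeds exactly one consumer, so each connected component is a tree.

```python
# Simplified illustration (not the paper's §IV-A algorithm): cutting every
# edge that leaves a multi-fanout node yields a forest, because each
# surviving node has at most one outgoing edge.
from collections import defaultdict

def tree_edges(edges):
    fanout = defaultdict(int)
    for u, _ in edges:
        fanout[u] += 1
    kept = [(u, v) for u, v in edges if fanout[u] == 1]  # tree edges
    cut = [(u, v) for u, v in edges if fanout[u] > 1]    # go via register file
    return kept, cut

edges = [("a", "e"), ("b", "e"), ("b", "f"), ("e", "g"), ("f", "g")]
kept, cut = tree_edges(edges)
print(kept)  # -> [('a', 'e'), ('e', 'g'), ('f', 'g')]
print(cut)   # -> [('b', 'e'), ('b', 'f')]
```

Values on cut edges must round-trip through the register file, so a good heuristic also tries to minimize the number of cuts.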
C. Supporting irregular data accesses
Section §II-A noted that DAG irregularity causes irreg-
ular register accesses. Fig. 4 illustrates this with a concrete
example. Fig. 4(a) shows a sample DAG and a target
datapath consisting of a PE tree and a banked register file for
inputs/outputs. Fig. 4(b) shows an example decomposition
of the DAG in tree-shaped structures that can be sequentially
mapped to the datapath. The outputs of some of the nodes
(like node a) are required to be stored in the register file
for later consumption, while others (like node d) are fully
consumed within the datapath. Fig. 4(c) shows the temporal
execution with the evolving state of the register file. The
DAG irregularity manifests in the form of irregular accesses
to register banks as explained next.
Irregular bank accesses: The inputs and outputs of the PE
tree access different banks over the cycles. For example,
PE2 reads banks 2 and 3 at T0, and banks 1 and 3 at
T1. The access pattern cannot be regularized by simply
remapping the variables to different banks or remapping
nodes to different PEs. Consider the input variables of node
b, which are also consumed by c and d. At T0, they are
consumed by one PE, while at T1, by two PEs, one each. It is
[Figure 4 panels: (a) a sample DAG to be mapped on a tree of PEs (PE0–PE2) with a banked register file (banks 0–3); (b) a possible decomposition of the DAG along with the input/output variables; (c) the sequence of execution over cycles T0–T3 along with the state of the register file.]
Figure 4. Example of a DAG execution on a tree of PEs, causing irregular
register accesses.
not possible to map the variables among banks such that the
PEs read the same banks at both T0 and T1. One alternative
could be to execute the nodes c and d in different cycles, but
that would lead to idle PEs. Hence, for high PE utilization,
some form of flexibility is required in interconnecting the
inputs/outputs of PE trees to the register banks.
Irregular addresses: PEs read different addresses from
different banks in a cycle. This is because input variables
consumed simultaneously may have been generated in
different cycles (e.g., the inputs of T3). Similarly, variables
that are generated together can be consumed in different
cycles (e.g., the outputs of T1). Thus, unlike
SIMD execution, all the register banks do not use the same
read/write addresses, requiring flexible addressing hardware.
Replication is not a solution: It may appear that replicating
a variable in multiple banks and multiple addresses can solve
both of the above problems for register reads. For such
replications, a variable would be written as well as read
multiple times. However, there are three problems:
• The register file can be severely underutilized due to replicated copies of the same data.
• The energy consumption increases due to more writes compared to fig. 4's approach of single writes.
• The replication alleviates the irregular reads but exacerbates the irregularity of writes. Hence, it simply shifts the problem from reads to writes.
Hence, this work aims to use flexible interconnects and
independent register addresses to ease the gathering of data
from irregular locations without relying on data replication,
enabling high-throughput execution.
III. DPU-V2 ARCHITECTURE TEMPLATE
This section describes the proposed processor architecture
template (fig. 5(a)) designed for unstructured DAGs.
A. Parallel tree of PEs
As shown in fig. 5(a), the datapath consists of many
parallel processing elements (PEs) to execute the DAG
nodes.
PE: Each PE can be configured to perform a basic
arithmetic operation (+ and ×). Additionally, a PE can
bypass one of its inputs to its output.
Trees of PEs: PEs are interconnected in T parallel
structures with a tree topology containing D layers, where
outputs of a PE layer are connected to the inputs of
the next PE layer. This enables immediate reuse of in-
termediate results, avoiding register file accesses and the
associated energy consumption.
Pipeline stages: The PE outputs are registered, resulting
in D pipeline stages in the trees.
B. Shared register file
The PE trees read and write data from a shared register
file containing B parallel banks with R registers per bank
(fig. 5(a)). The number of banks B is chosen such that
there is one register bank for each input of the trees (i.e.,
B = T × 2^D). As such, the register file has enough
bandwidth to feed the data into the trees in every cycle.
However, due to DAG irregularity, the parallel inputs needed
in a clock cycle typically do not reside at the same address
in the different banks, as discussed in §II-C. To handle
this irregularity, the banks are made independent in terms
of read/write addressing. Each bank can read/write from/to
any location, independent of the other banks. This flexibility
comes at the overhead of additional instruction bits to encode
these addresses. While this overhead is present for reads, it
is alleviated for writes as discussed below.
Automatic write address generation: To reduce the instruc-
tion size, a novel register-writing policy is used, alleviating
the need to encode write addresses in instructions. The
policy is to always write data to the empty location with
the lowest address within a register bank. Thus, instructions
do not control where a write happens, but the register bank
itself chooses the appropriate location for the incoming data.
Why does the policy work? To perform correct execution
with such an automatic write policy, the compiler should be
able to predict at compile time the write addresses that will
be chosen at run time. Since our target DAGs are known
at compile time and the PEs execute synchronously, the
instruction execution sequence is deterministic. This enables
the compiler to predict write addresses that will be chosen
in every cycle, given that the execution begins with a known
state of the register banks (e.g., all the banks are empty).
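The policy and its compile-time prediction can be modeled in a few lines. The `Bank` class below is an illustrative sketch of a single bank of R entries, not the actual RTL: running the same model inside the compiler reproduces the addresses the hardware will choose.

```python
# Sketch of the "lowest empty address" write policy for one register bank.
# Because instruction execution is deterministic, the same model run at
# compile time predicts the addresses chosen by the hardware at run time.
class Bank:
    def __init__(self, R):
        self.regs = [None] * R       # None marks an empty location

    def write(self, value):
        addr = self.regs.index(None)  # lowest empty location
        self.regs[addr] = value
        return addr

    def free(self, addr):
        self.regs[addr] = None        # value is dead; location reusable

bank = Bank(R=4)
print(bank.write("x"))   # -> 0
print(bank.write("y"))   # -> 1
bank.free(0)             # "x" is no longer needed
print(bank.write("z"))   # -> 0  (reuses the lowest empty address)
```

The compiler must also insert the corresponding frees when a variable's last consumer executes, so that both models agree on which locations are empty.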