DPU-v2: Energy-efficient execution of irregular directed acyclic graphs
Nimish Shah, Wannes Meert, and Marian Verhelst
Department of Electrical Engineering - MICAS, KU Leuven, Belgium
Department of Computer Science - DTAI, KU Leuven, Belgium
Emails: nshah@esat.kuleuven.be, wannes.meert@kuleuven.be, marian.verhelst@kuleuven.be
Abstract—A growing number of applications like probabilis-
tic machine learning, sparse linear algebra, robotic navigation,
etc., exhibit irregular data flow computation that can be
modeled with directed acyclic graphs (DAGs). The irregularity
arises from the seemingly random connections of nodes, which
makes the DAG structure unsuitable for vectorization on
CPU or GPU. Moreover, the nodes usually represent a small
number of arithmetic operations that cannot amortize the
overhead of launching tasks/kernels for each node, further
posing challenges for parallel execution.
To enable energy-efficient execution, this work proposes
DAG processing unit (DPU) version 2, a specialized processor
architecture optimized for irregular DAGs with static connec-
tivity. It consists of a tree-structured datapath for efficient
data reuse, a customized banked register file, and interconnects
tuned to support irregular register accesses. DPU-v2 is utilized
effectively through a targeted compiler that systematically
maps operations to the datapath, minimizes register bank
conflicts, and avoids pipeline hazards. Finally, a design space
exploration identifies the optimal architecture configuration
that minimizes the energy-delay product. This hardware-
software co-optimization approach results in a speedup of
1.4×, 3.5×, and 14× over a state-of-the-art DAG processor
ASIP, a CPU, and a GPU, respectively, while also achieving a
lower energy-delay product. In this way, this work takes an
important step towards enabling an embedded execution of
emerging DAG workloads.
Keywords-Graphs, irregular computation graphs, parallel
processor, hardware-software codesign, DAG processing unit,
probabilistic circuits, sparse matrix triangular solve, spatial
datapath, interconnection network, bank conflicts, design space
exploration
I. INTRODUCTION
Emerging machine learning models like probabilistic cir-
cuits (PC, also called sum-product networks) [4], [41] are
increasingly used for energy-constrained applications like
robotic navigation [59], robust image classification [49],
human activity recognition [17], etc. PCs exhibit an ir-
regular flow of data that can be represented with directed
acyclic graphs (DAGs), as shown in fig. 1(a) for a human
activity recognition model. Similarly, sparse linear algebra
operations like sparse triangular solve (SpTRSV) used for
embedded applications like robotic localization and mapping
[40], wireless communication [54], cryptography [15], etc.,
can also be represented with irregular DAGs. Fig. 1(b) shows
an example of a DAG for a matrix from the SuiteSparse
collection [13]. In these DAGs, nodes represent arithmetic
Figure 1. Example applications with irregular computation DAGs. (a)
Probabilistic circuits (PC) used in machine learning, and (b) Sparse matrix
triangular solves (SpTRSV) used in linear algebra. (c) CPUs and GPUs achieve
throughput significantly below their peaks for such DAGs.
operations and edges represent the dependencies among
these operations.
Unfortunately, these irregular unstructured DAGs pose
challenges to general-purpose computing architectures like
GPUs and single-instruction-multiple-data (SIMD) vector-
ization in CPUs. Fig. 1(c) shows the throughputs of an Intel
Xeon CPU and an Nvidia RTX GPU for these applications. The CPU performs much below the peak achievable
throughput (3.4 TOPS in this case), and the GPU even
underperforms compared to the CPU despite having highly
parallel hardware (especially for DAGs with fewer than
100K nodes). This underperformance is caused by the fol-
lowing two reasons, preventing the platforms from utilizing
the sparsity in the workloads:
• The seemingly-random edge connectivity of an unstructured DAG results in irregular memory accesses with poor spatial locality. Hence, the large granularity of the cache-line size leads to severe under-utilization of caches and memory bandwidth, because only 4B might get consumed out of the 32/64B of data fetched to a cache line. Furthermore, these irregular accesses also prevent memory coalescing, which is crucial for high GPU performance [52].
• Due to the irregular structure, parallelizing different parts of DAGs across multiple units (like CPU cores, GPU streaming multiprocessors, etc.) incurs high communication and synchronization overhead, limiting the parallelization benefits [44].

arXiv:2210.13184v1 [cs.AR] 20 Oct 2022
To address these issues, this paper proposes the DAG processing unit (DPU-v2), a processing architecture¹ optimized for irregular DAGs, along with a custom-designed targeted compiler. The major contributions of this work are as follows:
1) Processor: A parameterized architecture template is de-
signed with a tree-structured datapath for immediate
reuse of data, a banked register file with a novel address-
ing scheme, and high-bandwidth flexible interconnects.
The architecture is made programmable with a custom
instruction set with long words.
2) Compiler: A specialized compiler is designed that sched-
ules DAG nodes such that the tree-based datapath is
maximally utilized. The irregularity of DAGs leads to
frequent register bank conflicts, which are reduced with
a conflict-aware register allocation while taking into
account the interconnect constraints.
3) Design-space exploration: Since there are multiple de-
sign parameters in the architecture template (e.g. the
number of register banks, computational parallelism,
etc.), a design space exploration is performed to arrive at
the architecture with the minimum energy-delay product
in a 28nm CMOS technology, and compared with state-
of-the-art implementations.
This work focuses on DAGs that remain static across
multiple executions, which allows amortization of the com-
pilation overhead, enabling highly targeted (but relatively
slow) DAG-specific compilation. This assumption of DAGs
being static is valid for our target workloads: (1) the structure
of a trained PC remains static during inference, and (2) in
many applications using SpTRSV like robotic localization
and mapping [40], wireless communication [54], cryptogra-
phy [15], etc., the sparsity pattern of a matrix remains static
while the numerical values or the right-hand side vector
change across executions, which effectively only changes
the inputs of the DAG. Furthermore, DAGs are assumed to
contain only arithmetic nodes (and no conditional nodes)
that are executed in a predefined order based on edge-
induced dependencies. This assumption is also valid for the
target workloads in this paper, but excludes operations like
breadth-first search (even on a static DAG) from the scope
of this work.
The paper is organized as follows. §II discusses this
work’s approach for efficient parallel execution of DAGs.
§III describes the processor architecture template of DPU-v2
followed by §IV explaining the compiler. Next, §V presents
¹Available at https://github.com/nimish15shah/DAG_Processor.
Figure 2. (a) A common approach for executing DAGs in parallel, in which
the unpredictable irregular accesses happen to the scratchpad/memory. (b) In
DPU-v2, by pushing the interconnect closer to the datapath, the compiler
is equipped to predict the irregular accesses and prevent bank conflicts.
the design-space exploration to find the design with the
minimum energy-delay product, and compares it with state-
of-the-art work. Finally, §VI discusses related works and
§VII concludes the paper.
II. CONCEPTS TO TAME IRREGULARITY
In this paper, a DAG encodes the computations to be
performed for an application and its data dependencies. A
DAG node represents arithmetic operations (e.g. additions,
multiplications) to be performed on the node’s inputs. This
work focuses on DAGs with fine-grained nodes with low
arithmetic intensity. In other words, the nodes represent a
scalar or a small-sized vector operation but not large tensor
operations, in contrast to DAGs of deep neural networks, for
example, which have compute-intensive tensor operations as
nodes. The output result of a node serves as an input to
(possibly multiple) nodes connected to the outgoing edges.
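To make this concrete, the following is an illustrative sketch (not from the paper) of how such a fine-grained DAG arises for SpTRSV: solving row i of a lower-triangular system L·x = b requires x[j] for every off-diagonal nonzero L[i][j], so each such nonzero contributes one dependency edge. The `sptrsv_dag` helper and the 4×4 sparsity pattern are assumptions for illustration only.

```python
# Sketch (illustrative): derive the SpTRSV dependency DAG from a
# lower-triangular sparsity pattern. Row i can only be solved once x[j]
# is known for every off-diagonal nonzero L[i][j] with j < i.
def sptrsv_dag(rows):
    """rows[i] = column indices of nonzeros in row i (including i itself)."""
    edges = []
    for i, cols in enumerate(rows):
        for j in cols:
            if j < i:              # off-diagonal nonzero -> dependency edge
                edges.append((j, i))
    return edges

# 4x4 example: off-diagonal nonzeros at (1,0), (3,1), (3,2).
rows = [[0], [0, 1], [2], [1, 2, 3]]
print(sptrsv_dag(rows))   # -> [(0, 1), (1, 3), (2, 3)]
```

Note that each node here is a single scalar divide-and-accumulate, matching the low arithmetic intensity described above.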
Execution order: Executing a DAG means evaluating all
the nodes in a valid order imposed by the directed edges
representing data dependencies. For all the edges, the source
node should be executed before the destination node. Put
differently, a node can be executed only after all the prede-
cessor nodes have finished execution. Note that the DAGs
targeted in this work do not contain conditional branch
nodes, and hence all the nodes are supposed to be executed.
What can be executed in parallel? At a given moment, all
the nodes whose predecessor nodes have finished execution
can be executed in parallel. The execution begins with the
source nodes of the DAG, as they do not have any incoming
edges. Subsequently, different nodes become ready as the
execution progresses. DAGs from real-world applications
typically have ample parallelism, which can be utilized
through a parallel processing architecture.
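The notion of "ready" nodes can be sketched as a level (wavefront) computation; the helper below is an illustration of the concept, not the paper's scheduler, and assumes the node list is already topologically sorted.

```python
# Sketch (illustrative, not DPU-v2's scheduler): group DAG nodes into
# "levels" that may execute in parallel. A node's level is one more than
# the deepest of its predecessors; level 0 holds the source nodes.
from collections import defaultdict

def parallel_levels(nodes, edges):
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    level = {}
    for v in nodes:                      # nodes assumed topologically sorted
        level[v] = 1 + max((level[u] for u in preds[v]), default=-1)
    waves = defaultdict(list)
    for v, l in level.items():
        waves[l].append(v)
    return [waves[l] for l in sorted(waves)]

edges = [("a", "c"), ("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")]
print(parallel_levels(["a", "b", "c", "d", "e"], edges))
# -> [['a', 'b'], ['c', 'd'], ['e']]
```

The size of each level is an upper bound on the parallelism a processor can exploit in that wave.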
The rest of this section discusses the challenges for
parallel processing of irregular DAGs and the techniques
introduced in this work to address them.
A. Making irregular data accesses predictable
Fig. 2(a) shows a common approach for accelerating
DAGs, in which a DAG is partitioned into smaller sub-
graphs that can be executed on parallel cores, e.g., in a
multicore CPU. This approach is used in works like [39],
[44], [46]. The cores have to synchronize occasionally for
race-free communication that happens through a shared
memory structure like an L3 cache in CPUs or a shared
scratchpad in [46]. Due to the DAG irregularity, the ac-
cesses to this shared memory structure usually happen to
random addresses, inducing frequent bank conflicts in the
typically-used banked memory structures. These conflicts
could become a severe bottleneck. For instance, in [46],
43% of the load requests result in bank conflicts. Work in
[11], [43] alleviates the impact of these conflicts through
aggressive reordering of requests, but incurs the overhead
of a reordering buffer. Yet, none of these works attempt to
avoid the conflicts altogether. Given that the DAG structure
is known at compile time, it could be possible to perform
the memory allocation such that the simultaneous requests
do not access the same bank. The main challenge for such
conflict avoidance is that it is not straightforward to predict
which requests happen simultaneously. The parallel cores
normally execute asynchronously, i.e., a stall in one core (e.g.,
to wait for the next instruction) does not stall the others,
preventing a compiler from predicting conflicts.
Fig. 2(b) shows an alternative approach that makes the
simultaneous irregular accesses predictable by pushing the
interconnect closer to the datapath and forcing the PE to
operate in a synchronous way. This limits the irregular
accesses to the register file, while the memory/scratchpad
can be accessed in a regular pattern. In every cycle, the
compiler knows which registers are accessed and can thereby
also predict register bank conflicts. The datapath with par-
allel PEs, however, cannot be programmed with SIMD-
like instructions because the PEs will gather data from
different addresses in the register banks owing to DAG
irregularity. Instead, VLIW (very long instruction word)-like
instructions can be used. Yet, unlike conventional VLIW
with a partitioned register file [2], the datapath does not
have to be limited to an array of PEs, but complex dataflow
patterns can be utilized.
This work hypothesizes that the architecture in fig. 2(b)
can improve throughput by avoiding bank conflicts and by
using a datapath tuned for the target workloads.
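When the per-cycle schedule is fixed, this conflict avoidance amounts to coloring a conflict graph of variables that are read together. The `assign_banks` helper below is a simplified greedy illustration of that idea under these assumptions, not DPU-v2's actual allocator.

```python
# Sketch of the idea behind fig. 2(b) (a simplification, not the DPU-v2
# compiler): with a fixed per-cycle schedule, the compiler knows which
# variables are read in the same cycle and can greedily place each variable
# in a bank not used by any variable it co-occurs with.
def assign_banks(cycles, n_banks):
    """cycles: list of variable groups read in the same cycle."""
    bank = {}
    for group in cycles:
        for var in group:
            if var in bank:
                continue
            # banks already taken by variables sharing a cycle with var
            taken = {bank[v] for g in cycles if var in g for v in g if v in bank}
            free = [b for b in range(n_banks) if b not in taken]
            if not free:
                raise ValueError(f"greedy pass found no conflict-free bank for {var}")
            bank[var] = free[0]
    return bank

# Two cycles, each reading two variables, mapped onto a 2-bank file.
print(assign_banks([["x", "y"], ["x", "z"]], n_banks=2))
# -> {'x': 0, 'y': 1, 'z': 1}
```

Greedy coloring can fail where an optimal assignment exists; §IV describes how the actual compiler handles residual conflicts.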
B. Reducing data accesses
To ease the problem of bank conflicts further and to reduce
energy consumption, the intermediate results in the com-
putation could be immediately consumed in the datapath,
avoiding reads/writes from a register file. A commonly used
topology for such data reuse is a systolic array of PEs [27]
(fig. 3(a)), yet it is unsuitable for sparse DAGs due to severe
hardware under-utilization as described below.
[Figure 3 panels: (a) a systolic array with n inputs and n²/4 PEs; (b) a tree with n inputs and n−1 PEs; (c) peak utilization (%) of systolic vs. tree datapaths for 2, 4, 8, and 16 inputs.]
Figure 3. Systolic arrays are under-utilized by irregular DAGs, while a
tree-shaped datapath is a promising alternative.
Peak utilization: To check the suitability of the systolic
array, the spatial datapath mapper from [34] is used to find
the largest subgraph in our target DAGs (described in §V-A)
that can be mapped to the array. This essentially is the peak
utilization possible for the datapath. Fig. 3(c) shows that the
peak utilization of the systolic array drops rapidly with the
increasing number of inputs, and even a 4×4 array with 8
inputs cannot be properly utilized.
Tree of PEs: The tree-structured datapath (fig. 3(b)) is a
promising alternative candidate, as multi-input DAG nodes
can be spatially unrolled as binary trees. As shown in
fig. 3(c), the spatial mapper confirms that there exist sub-
graphs in DAGs that can fully utilize the datapath.
The spatial mapper used here can quantify the peak
utilization, but the constrained-optimization-based approach
used in the mapper is too slow to fully map large DAGs with
tens of thousands of nodes. For a scalable solution, this work
presents a heuristic algorithm to decompose irregular DAGs
into tree-like subgraphs (§IV-A).
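As a rough illustration of what such a decomposition does (far simpler than the §IV-A heuristic itself), one can cut every edge leaving a multi-fanout node; in the remaining graph every node feeds exactly one consumer, so each connected component is a tree.

```python
# Simplified illustration (not the paper's §IV-A algorithm): cutting every
# edge that leaves a multi-fanout node yields a forest, because each
# surviving node has at most one outgoing edge.
from collections import defaultdict

def tree_edges(edges):
    fanout = defaultdict(int)
    for u, _ in edges:
        fanout[u] += 1
    kept = [(u, v) for u, v in edges if fanout[u] == 1]  # tree edges
    cut = [(u, v) for u, v in edges if fanout[u] > 1]    # go via register file
    return kept, cut

edges = [("a", "e"), ("b", "e"), ("b", "f"), ("e", "g"), ("f", "g")]
kept, cut = tree_edges(edges)
print(kept)  # -> [('a', 'e'), ('e', 'g'), ('f', 'g')]
print(cut)   # -> [('b', 'e'), ('b', 'f')]
```

Values on cut edges must round-trip through the register file, so a good heuristic also tries to minimize the number of cuts.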
C. Supporting irregular data accesses
Section §II-A noted that DAG irregularity causes irreg-
ular register accesses. Fig. 4 illustrates this with a concrete
example. Fig. 4(a) shows a sample DAG and a target
datapath consisting of a PE tree and a banked register file for
inputs/outputs. Fig. 4(b) shows an example decomposition
of the DAG in tree-shaped structures that can be sequentially
mapped to the datapath. The outputs of some of the nodes
(like node a) are required to be stored in the register file
for later consumption, while others (like node d) are fully
consumed within the datapath. Fig. 4(c) shows the temporal
execution with the evolving state of the register file. The
DAG irregularity manifests in the form of irregular accesses
to register banks as explained next.
Irregular bank accesses: The inputs and outputs of the PE
tree access different banks over the cycles. For example,
PE2 reads banks 2 and 3 at T0, and banks 1 and 3 at
T1. The access pattern cannot be regularized by simply
remapping the variables to different banks or remapping
nodes to different PEs. Consider the input variables of node
b, which are also consumed by c and d. At T0, they are
consumed by one PE, while at T1, by two PEs, one each. It is
[Figure 4 panels: (a) a sample DAG to be mapped on a tree of PEs (PE0–PE2) with a banked register file (banks 0–3); (b) a possible decomposition of the DAG along with the input/output variables; (c) the sequence of execution over cycles T0–T3 along with the state of the register file.]
Figure 4. Example of a DAG execution on a tree of PEs, causing irregular
register accesses.
not possible to map the variables among banks such that the
PEs read the same banks at both T0 and T1. One alternative
could be to execute the nodes c and d in different cycles, but
that would lead to idle PEs. Hence, for high PE utilization,
some form of flexibility is required in interconnecting the
inputs/outputs of PE trees to the register banks.
Irregular addresses: PEs read different addresses from
different banks in a cycle. This is because input variables
consumed simultaneously may have been generated in
different cycles (e.g., the inputs of T3). Similarly, variables
that are generated together can be consumed in different
cycles (e.g., the outputs of T1). Thus, unlike
SIMD execution, all the register banks do not use the same
read/write addresses, requiring flexible addressing hardware.
Replication is not a solution: It may appear that replicating
a variable in multiple banks and multiple addresses can solve
both of the above problems for register reads. For such
replications, a variable would be written as well as read
multiple times. However, there are three problems:
• The register file can be severely underutilized due to replicated copies of the same data.
• The energy consumption increases due to more writes compared to fig. 4's approach of single writes.
• The replication alleviates the irregular reads but exacerbates the irregularity of writes. Hence, it simply shifts the problem from reads to writes.
Hence, this work aims to use flexible interconnects and
independent register addresses to ease the gathering of data
from irregular locations without relying on data replication,
enabling high-throughput execution.
III. DPU-V2 ARCHITECTURE TEMPLATE
This section describes the proposed processor architecture
template (fig. 5(a)) designed for unstructured DAGs.
A. Parallel tree of PEs
As shown in fig. 5(a), the datapath consists of many
parallel processing elements (PEs) to execute the DAG
nodes.
PE: Each PE can be configured to perform a basic
arithmetic operation (+ and ×). Additionally, a PE can
bypass one of its inputs to its output.
Trees of PEs: PEs are interconnected in T parallel
structures with a tree topology containing D layers, where
outputs of a PE layer are connected to the inputs of
the next PE layer. This enables immediate reuse of in-
termediate results, avoiding register file accesses and the
associated energy consumption.
Pipeline stages: The PE outputs are registered, resulting
in D pipeline stages in the trees.
B. Shared register file
The PE trees read and write data from a shared register
file containing B parallel banks with R registers per bank
(fig. 5(a)). The number of banks B is chosen such that
there is one register bank for each input of the trees (i.e.,
B = T × 2^D). As such, the register file has enough
bandwidth to feed the data into the trees in every cycle.
However, due to DAG irregularity, the parallel inputs needed
in a clock cycle typically do not reside at the same address
in the different banks, as discussed in §II-C. To handle
this irregularity, the banks are made independent in terms
of read/write addressing. Each bank can read/write from/to
any location, independent of the other banks. This flexibility
comes at the overhead of additional instruction bits to encode
these addresses. While this overhead is present for reads, it
is alleviated for writes as discussed below.
Automatic write address generation: To reduce the instruc-
tion size, a novel register-writing policy is used, alleviating
the need to encode write addresses in instructions. The
policy is to always write data to the empty location with
the lowest address within a register bank. Thus, instructions
do not control where a write happens, but the register bank
itself chooses the appropriate location for the incoming data.
Why does the policy work? To perform correct execution
with such an automatic write policy, the compiler should be
able to predict at compile time the write addresses that will
be chosen at run time. Since our target DAGs are known
at compile time and the PEs execute synchronously, the
instruction execution sequence is deterministic. This enables
the compiler to predict write addresses that will be chosen
in every cycle, given that the execution begins with a known
state of the register banks (e.g., all the banks are empty).
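The policy and its compile-time prediction can be modeled in a few lines. The `Bank` class below is an illustrative sketch of a single bank of R entries, not the actual RTL: running the same model inside the compiler reproduces the addresses the hardware will choose.

```python
# Sketch of the "lowest empty address" write policy for one register bank.
# Because instruction execution is deterministic, the same model run at
# compile time predicts the addresses chosen by the hardware at run time.
class Bank:
    def __init__(self, R):
        self.regs = [None] * R       # None marks an empty location

    def write(self, value):
        addr = self.regs.index(None)  # lowest empty location
        self.regs[addr] = value
        return addr

    def free(self, addr):
        self.regs[addr] = None        # value is dead; location reusable

bank = Bank(R=4)
print(bank.write("x"))   # -> 0
print(bank.write("y"))   # -> 1
bank.free(0)             # "x" is no longer needed
print(bank.write("z"))   # -> 0  (reuses the lowest empty address)
```

The compiler must also insert the corresponding frees when a variable's last consumer executes, so that both models agree on which locations are empty.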