
A. Making irregular data accesses predictable
Fig. 2(a) shows a common approach for accelerating
DAGs, in which a DAG is partitioned into smaller sub-
graphs that can be executed on parallel cores, e.g., in a
multicore CPU. This approach is used in works like [39],
[44], [46]. The cores have to synchronize occasionally for
race-free communication that happens through a shared
memory structure like an L3 cache in CPUs or a shared
scratchpad in [46]. Due to the DAG irregularity, the ac-
cesses to this shared memory structure usually happen to
random addresses, inducing frequent bank conflicts in the
typically used banked memory structures. These conflicts
could become a severe bottleneck. For instance, in [46],
43% of the load requests result in bank conflicts. Work in
[11], [43] alleviates the impact of these conflicts through
aggressive reordering of requests, but incurs the overhead
of a reordering buffer. Yet, none of these works attempt to
avoid the conflicts altogether. Given that the DAG structure
is known at compile time, it could be possible to perform
the memory allocation such that the simultaneous requests
do not access the same bank. The main challenge for such
conflict avoidance is that it is not straightforward to predict
which requests happen simultaneously. The parallel cores
normally execute asynchronously, i.e., a stall in one core (e.g.,
to wait for the next instruction) does not stall the others,
preventing a compiler from predicting conflicts.
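To make the idea concrete, the following is a minimal sketch (not any specific system's allocator) of how a compiler could place variables in banks once simultaneous accesses are predictable: variables accessed in the same cycle form a conflict graph, and a greedy coloring assigns each variable a bank distinct from its co-scheduled neighbors. The function name and schedule format are hypothetical.

```python
def assign_banks(schedule, n_banks):
    """schedule: list of cycles, each the set of variables accessed
    simultaneously in that cycle. Returns {variable: bank} such that
    no two variables in the same cycle share a bank, or raises if the
    greedy coloring runs out of banks."""
    # Build the conflict graph: same-cycle variables must differ in bank.
    conflicts = {}
    for cycle in schedule:
        for v in cycle:
            conflicts.setdefault(v, set()).update(set(cycle) - {v})
    # Greedy coloring, highest-degree variables first.
    bank_of = {}
    for v in sorted(conflicts, key=lambda v: -len(conflicts[v])):
        used = {bank_of[u] for u in conflicts[v] if u in bank_of}
        free = [b for b in range(n_banks) if b not in used]
        if not free:
            raise ValueError(f"no conflict-free bank for {v}")
        bank_of[v] = free[0]
    return bank_of

sched = [{"a", "b"}, {"c", "d"}, {"a", "c"}]
banks = assign_banks(sched, n_banks=2)
# every cycle now touches distinct banks
assert all(len({banks[v] for v in c}) == len(c) for c in sched)
```

Note that this only works when the schedule is known exactly, which is precisely what asynchronous cores prevent; greedy coloring can also fail where a smarter assignment would succeed, so it is a sketch of feasibility, not an optimal allocator.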
Fig. 2(b) shows an alternative approach that makes the
simultaneous irregular accesses predictable by pushing the
interconnect closer to the datapath and forcing the PEs to
operate synchronously. This limits the irregular
accesses to the register file, while the memory/scratchpad
can be accessed in a regular pattern. In every cycle, the
compiler knows which registers are accessed and can thereby
also predict register bank conflicts. The datapath with par-
allel PEs, however, cannot be programmed with SIMD-
like instructions because the PEs will gather data from
different addresses in the register banks owing to DAG
irregularity. Instead, VLIW (very long instruction word)-like
instructions can be used. Yet, unlike conventional VLIW
with a partitioned register file [2], the datapath does not
have to be limited to an array of PEs, but complex dataflow
patterns can be utilized.
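The contrast with SIMD can be sketched as follows: in a hypothetical VLIW-like encoding, every PE slot in the long instruction word carries its own operand addresses, so irregular per-PE register-bank reads are still fixed at compile time. The `Slot` layout below is illustrative, not the paper's actual instruction format.

```python
from dataclasses import dataclass

@dataclass
class Slot:
    """One PE's portion of the long instruction word (hypothetical)."""
    op: str        # e.g. "mul", "add"
    src: tuple     # ((bank, offset), (bank, offset)) per operand
    dst: tuple     # (bank, offset), or None if consumed in-datapath

# One cycle's instruction for a 2-PE datapath: PE0 and PE1 read from
# different banks/offsets, which a single SIMD address could not encode.
vliw_word = [
    Slot("mul", ((0, 3), (2, 1)), (1, 0)),
    Slot("add", ((1, 2), (3, 0)), None),
]

def banks_read(word):
    # Banks touched this cycle, visible to the compiler.
    return [b for slot in word for (b, _) in slot.src]

assert banks_read(vliw_word) == [0, 2, 1, 3]  # all distinct: conflict-free
```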
This work hypothesizes that the architecture in fig. 2(b)
can improve throughput by avoiding bank conflicts and by
using a datapath tuned for the target workloads.
B. Reducing data accesses
To ease the problem of bank conflicts further and to reduce
energy consumption, the intermediate results in the com-
putation could be immediately consumed in the datapath,
avoiding reads/writes from a register file. A commonly used
topology for such data reuse is a systolic array of PEs [27]
(fig. 3(a)), yet it is unsuitable for sparse DAGs due to severe
hardware under-utilization as described below.
[Fig. 3: (a) systolic array, n inputs, n²/4 PEs; (b) tree, n inputs, n−1 PEs; (c) peak utilization (%) vs. number of inputs (2–16) for both datapaths.]
Figure 3. Systolic arrays are under-utilized by irregular DAGs, while a
tree-shaped datapath is a promising alternative.
Peak utilization: To check the suitability of the systolic
array, the spatial datapath mapper from [34] is used to find
the largest subgraph in our target DAGs (described in §V-A)
that can be mapped to the array. This is essentially the peak
utilization achievable for the datapath. Fig. 3(c) shows that the
peak utilization of the systolic array drops rapidly with the
increasing number of inputs, and even a 4×4 array with 8
inputs cannot be properly utilized.
Tree of PEs: The tree-structured datapath (fig. 3(b)) is a
promising alternative candidate, as multi-input DAG nodes
can be spatially unrolled as binary trees. As shown in
fig. 3(c), the spatial mapper confirms that there exist sub-
graphs in DAGs that can fully utilize the datapath.
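The spatial unrolling mentioned above can be illustrated with a small sketch: an n-input associative node split recursively into a balanced binary tree of 2-input PEs, which uses exactly n−1 PEs, matching the tree datapath of fig. 3(b). The representation is illustrative only.

```python
def unroll(inputs):
    """Unroll an n-input associative node into a balanced binary tree,
    represented as nested 2-tuples (each tuple = one 2-input PE)."""
    if len(inputs) == 1:
        return inputs[0]
    mid = len(inputs) // 2
    return (unroll(inputs[:mid]), unroll(inputs[mid:]))

def count_pes(tree):
    # Every internal tuple node corresponds to one 2-input PE.
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_pes(tree[0]) + count_pes(tree[1])

tree = unroll(["a", "b", "c", "d", "e"])
assert count_pes(tree) == 4  # n inputs need n-1 PEs (here n = 5)
```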
The spatial mapper used here can quantify the peak
utilization, but the constrained-optimization-based approach
used in the mapper is too slow to fully map large DAGs with
tens of thousands of nodes. For a scalable solution, this work
presents a heuristic algorithm to decompose irregular DAGs
into tree-like subgraphs (§IV-A).
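To give a flavor of such a decomposition (this toy heuristic is not the paper's §IV-A algorithm), one simple greedy rule walks the DAG in topological order and fuses a node into a producer's subgraph whenever that producer has a single consumer, so the intermediate value never needs the register file; multi-consumer outputs start new subgraphs.

```python
from collections import defaultdict

def decompose(succs):
    """succs: {node: [consumer, ...]} adjacency of the DAG.
    Returns {node: subgraph_id} under a toy single-consumer fusion rule."""
    preds = defaultdict(list)
    for u, vs in succs.items():
        for v in vs:
            preds[v].append(u)
    indeg = {u: len(preds[u]) for u in succs}
    ready = [u for u in succs if indeg[u] == 0]  # Kahn-style order
    tree_of, next_id = {}, 0
    while ready:
        u = ready.pop()
        # Producers whose only consumer is u can keep their result
        # inside the datapath; fuse u with the first such producer.
        single = [p for p in preds[u] if len(succs[p]) == 1]
        if single:
            tree_of[u] = tree_of[single[0]]
        else:
            tree_of[u] = next_id
            next_id += 1
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return tree_of

dag = {"a": ["c", "d"], "b": ["c"], "c": ["e"], "d": ["e"], "e": []}
t = decompose(dag)
assert t["c"] == t["b"]   # b feeds only c: fused into one subgraph
assert t["a"] != t["c"]   # a's output has two consumers: stored, new tree
```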
C. Supporting irregular data accesses
Section II-A noted that DAG irregularity causes irregular
register accesses. Fig. 4 illustrates this with a concrete
example. Fig. 4(a) shows a sample DAG and a target
datapath consisting of a PE tree and a banked register file for
inputs/outputs. Fig. 4(b) shows an example decomposition
of the DAG into tree-shaped structures that can be sequentially
mapped to the datapath. The outputs of some of the nodes
(like node a) are required to be stored in the register file
for later consumption, while others (like node d) are fully
consumed within the datapath. Fig. 4(c) shows the temporal
execution with the evolving state of the register file. The
DAG irregularity manifests in the form of irregular accesses
to register banks as explained next.
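A hypothetical per-cycle trace in the spirit of fig. 4 (the exact bank numbers beyond PE2's are assumed for illustration) makes the irregularity visible: each PE reads from whichever banks happen to hold its inputs, so the bank pattern changes cycle to cycle even though the scratchpad side stays regular.

```python
# cycle -> {PE: (bank of src0, bank of src1)}; PE2's banks follow the
# text (banks 2 and 3 at T0, banks 1 and 3 at T1), PE1's are assumed.
trace = {
    "T0": {"PE1": (0, 1), "PE2": (2, 3)},
    "T1": {"PE1": (0, 2), "PE2": (1, 3)},
}

def conflict_free(cycle):
    # A cycle is conflict-free iff no bank is read twice.
    banks = [b for pes in cycle.values() for b in pes]
    return len(banks) == len(set(banks))

assert all(conflict_free(c) for c in trace.values())
```

A compile-time placement only has to guarantee this distinctness property per cycle; it does not need the pattern itself to be regular.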
Irregular bank accesses: The inputs and outputs of the PE
tree access different banks over the cycles. For example,
PE2 reads banks 2 and 3 at T0 and banks 1 and 3 at
T1. The access pattern cannot be regularized by simply
remapping the variables to different banks or remapping
nodes to different PEs. Consider the input variables of node
b, which are also consumed by c and d. At T0, they are
consumed by one PE, while at T1, by two PEs, one input each. It is