
of data transferred between each layer of the model. On
average, 10.2 Mbits of data was transferred between layers.
Given an average WiFi bandwidth of 6 Mbps for a low-end
edge network, this gives us a communication time of 1.7s.
This is 7.5x slower than the compute time. In reality, many
models are larger than ResNet50 and will therefore be split
across devices, so each device will have fewer operations to
execute. This means that communication time will outweigh
compute time as the bottleneck. Therefore, we can simplify
the expression for bottleneck latency to the following:
$$\beta = \max_{k \in [K]} \gamma_k \qquad (2)$$
Since throughput is defined as $1/\beta$, by minimizing the bottleneck latency we maximize inference throughput. Additionally,
we assume that all nodes are homogeneous in RAM. If the
devices do not have the same capacity, the algorithm takes the smallest memory capacity across all nodes in the cluster as the capacity of each node. In this
paper, we primarily analyze image and text models due to
their prevalence on the edge for visual analytics applications
[22], [34].
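As a small illustration of this metric, the sketch below computes the bottleneck latency of Eq. (2) and the resulting throughput; the per-partition latency values are assumed placeholders, not measurements from this paper.

```python
# Illustrative sketch only: gamma values are assumed placeholders.
gammas = [0.8, 1.7, 1.2]      # per-partition latencies gamma_k (seconds)

beta = max(gammas)            # bottleneck latency, Eq. (2)
throughput = 1.0 / beta       # inferences per second

print(f"bottleneck latency = {beta:.2f}s, throughput = {throughput:.3f}/s")
```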
Our main contribution in this paper is a novel partitioning
and placement algorithm for DNNs across a cluster of edge
devices distributed spatially within the same WiFi network.
The algorithm finds the candidate partition points, finds the
optimal partition sizes to transfer the least amount of data, and
finds the arrangement of nodes with the highest bandwidth.
Together, these aim to minimize the resulting bottleneck
latency according to the throughput metric. We found that
our algorithm results in a 10x improvement over a random
partitioning/placement algorithm, and a 35% reduction in
bottleneck latency for systems with 50 compute nodes. We
empirically observe an average approximation ratio of 1.092
for the bottleneck latency (i.e. it is 9.2% more than the optimal
bottleneck latency, on average).
II. RELATED WORK
Early works on the topic of partitioning DNN models divided them into head and tail models, with the former distilled
to enable running on a resource-constrained device and reduce
data transfer [20]. Some prior works on DNN edge inference
mathematically perform DNN model slicing by layer [35],
[36], after calculating layer impact during the training stage;
these do not account for communication demands on the
edge. Others abstract model layers into certain “execution
units” [7], [17], which they then choose to slice based on
certain resource requirements. Li et al. [16] regressively
predict a layer’s latency demand and optimize communication
bandwidth accordingly. DeeperThings [29] performs layer
fusion on CNNs to optimize data transfer. These works are
optimized for a hybrid edge-cloud pipeline and do not address
the demands of a cluster of edge devices. Couper [13] uses
a similar partitioning scheme to minimize inter-partition data
transfer, but does not address the communication bottleneck
associated with an edge cluster. Hu et al. [14] optimize
the partitioning of a CNN onto a set of devices by taking
compute time as a bottleneck, while employing compression
to deal with communication constraints, and do not consider
placement. Our paper builds on and differentiates itself from these works by addressing the bandwidth limitation of an edge cluster and by maximizing inter-node bandwidth during the placement stage to minimize bottleneck latency.
III. PARTITIONING AND PLACEMENT ALGORITHM
We are given two graphs:
1) An unweighted DAG $G_m$ representing the computation
graph of a DNN, where each vertex represents a layer in
the model. This DAG can be found using common ML
libraries such as Tensorflow [1] and Keras [8].
2) A weighted complete graph $G_c$ representing the communication graph of a cluster of homogeneous physical
compute nodes, where each vertex represents a physical
compute node and each edge represents the bandwidth
between those nodes. The graph is complete because we
assume that these edge devices will communicate over
the same WiFi network.
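To make these inputs concrete, the following sketch builds a toy $G_m$ and $G_c$ with the networkx library; the layer names, node names, and bandwidth value are assumed placeholders, not values from this paper.

```python
# Illustrative sketch of the two input graphs (toy values only).
import networkx as nx

# G_m: unweighted DAG of model layers (a tiny linear example here; a real
# graph would be extracted from a TensorFlow/Keras model).
G_m = nx.DiGraph()
G_m.add_edges_from([("input", "conv1"), ("conv1", "conv2"), ("conv2", "fc")])

# G_c: complete weighted graph of compute nodes; edge weights model the
# pairwise bandwidths (Mbps) over the shared WiFi network.
G_c = nx.complete_graph(["node0", "node1", "node2"])
for u, v in G_c.edges:
    G_c[u][v]["bandwidth_mbps"] = 6.0  # assumed low-end WiFi bandwidth
```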
Our goal is to optimally partition the model and place these
partitions on a set of edge devices. We do so as follows.
A. Converting a Complex DAG to a Linear DAG
First, we need to distill $G_m$ into a linear DAG. The
vertices where it is possible to partition the model are called
“candidate partition points.” We illustrate this in Figure 2.
For each vertex $v \in V$, edge set $E$, and source vertex $s$ of $G_m$, find the longest path from $s$ to $v$. This can be done by topologically sorting the DAG and, for each vertex in the resulting list, relaxing each neighbor of that vertex. We call the length of this longest path the topological depth of that vertex in the graph. Let $LP(v)$ denote the length of the longest path from $s$ to $v$.
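A minimal sketch of this longest-path computation, assuming $G_m$ is a networkx DiGraph as in the sketch above:

```python
import networkx as nx

def longest_paths(G_m, s):
    """Return LP(v), the topological depth of each vertex reachable from s."""
    LP = {v: float("-inf") for v in G_m}
    LP[s] = 0
    # Process vertices in topological order, relaxing each outgoing edge.
    for u in nx.topological_sort(G_m):
        if LP[u] == float("-inf"):
            continue                         # u is not reachable from s
        for v in G_m.successors(u):
            LP[v] = max(LP[v], LP[u] + 1)
    return LP
```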
To verify that all paths from vertex $v_{prev}$ go through vertex $v$, we use a modified DFS that recurses on the outgoing edges of each vertex. If we encounter a vertex with a greater topological depth than $v$, return false. If we reach vertex $v$, return true. Let $AP(v_{prev}, v)$ denote the result of this algorithm.
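A sketch of this check, reusing the $LP$ values computed above (the helper name is ours, not the paper's):

```python
def all_paths_through(G_m, LP, v_prev, v):
    """AP(v_prev, v): do all paths leaving v_prev pass through v?"""
    def dfs(u):
        if u == v:
            return True                  # reached v along this path
        if LP[u] > LP[v]:
            return False                 # this path bypasses v
        return all(dfs(w) for w in G_m.successors(u))
    return all(dfs(w) for w in G_m.successors(v_prev))
```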
Given the previously found candidate partition point $p_{k-1}$ and the current vertex $u$, the next candidate partition point $p_k = u$ iff:
1) $LP(u) \neq LP(v) \;\; \forall v \in V \setminus \{u\}$
2) $AP(p_{k-1}, u) = \text{true}$
with $p_0 = s$.
The time complexity of $LP$ is $O(V+E)$. $AP$ runs in polynomial time because it returns upon reaching a vertex with a greater topological depth. Therefore, this algorithm runs in polynomial time.
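Combining the two routines above, a sketch of the candidate partition point selection (the function name is ours) could look like:

```python
from collections import Counter

def candidate_partition_points(G_m, s):
    """Return [p_0, p_1, ...] satisfying conditions 1) and 2) above."""
    LP = longest_paths(G_m, s)
    depth_counts = Counter(LP.values())
    points = [s]                                   # p_0 = s
    for u in sorted(G_m, key=LP.get):              # scan by topological depth
        if u == s:
            continue
        unique_depth = depth_counts[LP[u]] == 1    # condition 1)
        if unique_depth and all_paths_through(G_m, LP, points[-1], u):
            points.append(u)                       # condition 2) holds
    return points
```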
Figure 2 shows the candidate partition points at certain sections of the DAG of ResNet50 [12] and InceptionResNetV2
[30]. Each rectangle represents a model layer in the DAG.