Partitioning and Placement of Deep Neural Networks on Distributed Edge Devices to Maximize Inference Throughput

Arjun Parthasarathy
Crystal Springs Uplands School
Email: aparthasarathy23@csus.org
Bhaskar Krishnamachari
University of Southern California
Email: bkrishna@usc.edu
Abstract—Edge inference has become more widespread, as its diverse applications range from retail to wearable technology. Clusters of networked, resource-constrained edge devices are becoming common, yet no system exists to split a DNN across these clusters while maximizing the inference throughput of the system. We present an algorithm that partitions DNNs and distributes them across a set of edge devices with the goal of minimizing the bottleneck latency and therefore maximizing inference throughput. The algorithm scales well across different node memory capacities and numbers of nodes. We find that we can reduce the bottleneck latency by 10x over a random algorithm and by 35% over a greedy joint partitioning-placement algorithm. Furthermore, we find empirically that, for the set of representative models we tested, the algorithm produces results within 9.2% of the optimal bottleneck latency.
I. INTRODUCTION
Deep Neural Networks (DNNs) have greatly accelerated machine learning across different disciplines, such as Computer Vision [5] and Natural Language Processing [21]. Edge inference is becoming an increasingly popular field with multiple facets [33], as sensor-driven computation in IoT systems necessitates DNN inference in the field. IoT applications for edge inference range from retail to wearable technology [4], [6].
The edge can come in multiple configurations [19], [28], and there are multiple approaches to facilitate edge inference. For cloud-edge hybrid inference, one such approach is model compression [11], which deals exclusively with DNN optimization but does not address the system's runtime configuration. In this paper, we focus on clusters of resource-constrained edge devices. These edge clusters are becoming increasingly common due to their low cost and scalability at the edge [24]. Unlike a cloud data center, the edge brings system resource limitations and communication bottlenecks between devices.
With this in mind, we address the following problem: How can we take advantage of multi-device edge clusters to enable high-performance DNN inference while respecting computational resource constraints and taking into account the heterogeneity of communication links?
To partition a deep learning model, we first split the model into components that are executed sequentially. Each partition is assigned to a different edge device, and once each node performs inference with its piece of the model, that intermediate inference result is sent to the next node with the corresponding partition in the sequence. This inference pipeline is shown in Figure 1.

Fig. 1: Partitioning and Distributing a Model Across Edge Devices to Create an Inference Pipeline
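As a toy illustration of this pipeline (not the paper's implementation; the partition functions below are hypothetical stand-ins for DNN sub-models, and the network transfer is simulated by a function call):

```python
from typing import Callable, List

# A "partition" is modeled here as any callable mapping an input batch to an
# intermediate (or final) result; real partitions would be DNN sub-models.
Partition = Callable[[List[float]], List[float]]

def pipeline_inference(partitions: List[Partition], batch: List[float]) -> List[float]:
    """Run one batch through each partition in sequence. In the real system,
    each partition lives on a different edge device and the intermediate
    result is sent over the network to the next device in the pipeline."""
    activations = batch
    for run_partition in partitions:
        activations = run_partition(activations)  # stands in for compute + transfer
    return activations

# Toy usage with three stand-in partitions.
parts: List[Partition] = [
    lambda x: [v + 1.0 for v in x],
    lambda x: [v * 2.0 for v in x],
    lambda x: [v - 3.0 for v in x],
]
print(pipeline_inference(parts, [1.0, 2.0]))  # -> [1.0, 3.0]
```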
In an edge cluster, although each node has lower computational power, we can take advantage of this inference pipelining to increase system throughput. Since each node can perform inference with its partition individually, prior nodes in the pipeline can send their finished inference results to subsequent nodes in the pipeline and accept new batches.
We define the throughput metric of a system as the number of inference cycles it can perform per unit time. As we showed in our previous work DEFER [23], we can achieve higher throughput with distributed edge inference, as opposed to inference on a single device, because of pipelining. The throughput is defined as the reciprocal of the bottleneck latency. For nodes $[k] = \{1, 2, \ldots, k\}$, the bottleneck latency $\beta$ is defined as

$$S = \{\, c_k, \gamma_k \mid k \in [k] \,\}$$
$$\beta = \max_{s \in S} s \qquad (1)$$

where $c_k$ is the compute time of the operations on node $k$, and $\gamma_k$ is the communication time between node $k-1$ and node $k$.

We use ResNet50 [12], which is a representative model for our use case. On a Raspberry Pi 4, its inference time was measured to be 225 ms [25]. Next, we found the amount of data transferred between each layer of the model: on average, 10.2 Mbits of data is transferred between layers. Given an average WiFi bandwidth of 6 Mbps for a low-end edge network, this gives a communication time of 1.7 s, which is 7.5x slower than the compute time. In reality, many models are larger than ResNet50 and will therefore be split across devices, so each device will have fewer operations to execute. This means that communication time will outweigh compute time as the bottleneck. Therefore, we can simplify the expression for the bottleneck latency to:

$$\beta = \max_{k \in [k]} \gamma_k \qquad (2)$$
Since throughput is defined as $1/\beta$, by minimizing the bottleneck latency we maximize inference throughput. Additionally, we assume that all nodes are homogeneous in RAM; if the devices do not all have the same capacity, the algorithm takes the smallest memory capacity across all nodes in the cluster as the capacity of each node. In this paper, we primarily analyze image and text models due to their prevalence on the edge for visual analytics applications [22], [34].
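As a concrete illustration of equations (1) and (2), the following is a minimal sketch (not the paper's code), using the ResNet50-on-Raspberry-Pi figures quoted above as hypothetical inputs:

```python
from typing import Sequence

def bottleneck_latency(compute_times: Sequence[float], comm_times: Sequence[float]) -> float:
    """Equation (1): beta is the largest of all per-node compute times c_k
    and per-link communication times gamma_k."""
    return max(list(compute_times) + list(comm_times))

def comm_bottleneck_latency(comm_times: Sequence[float]) -> float:
    """Equation (2): when communication dominates, beta = max_k gamma_k."""
    return max(comm_times)

def throughput(beta: float) -> float:
    """Inference cycles per unit time = 1 / beta."""
    return 1.0 / beta

# Worked numbers from the text: ~10.2 Mbit transferred per hop over a 6 Mbps
# link (~1.7 s) versus ~0.225 s of compute per node on a Raspberry Pi 4.
comm = 10.2 / 6.0                      # ~1.7 s per inter-node transfer
compute = 0.225                        # s of compute per node
beta = bottleneck_latency([compute] * 3, [comm, comm])
print(beta, throughput(beta))          # ~1.7 s, ~0.59 inference cycles/s
```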
Our main contribution in this paper is a novel partitioning and placement algorithm for DNNs across a cluster of edge devices distributed spatially within the same WiFi network. The algorithm finds the candidate partition points, finds the optimal partition sizes to transfer the least amount of data, and finds the arrangement of nodes with the highest bandwidth. Together, these aim to minimize the resulting bottleneck latency according to the throughput metric. We found that our algorithm yields a 10x improvement over a random partitioning/placement algorithm and a 35% reduction in bottleneck latency over a greedy joint partitioning-placement algorithm for systems with 50 compute nodes. We empirically observe an average approximation ratio of 1.092 for the bottleneck latency (i.e., it is 9.2% more than the optimal bottleneck latency, on average).
II. RELATED WORK
Early works on the topic of partitioning DNN models divided them into head and tail models, with the former distilled to enable running on a resource-constrained device and to reduce data transfer [20]. Some prior works on DNN edge inference mathematically perform DNN model slicing by layer [35], [36], after calculating layer impact during the training stage; these do not account for communication demands on the edge. Others abstract model layers into certain “execution units” [7], [17], which they then choose to slice based on certain resource requirements. Li et al. [16] regressively predict a layer's latency demand and optimize communication bandwidth accordingly. DeeperThings [29] performs layer fusion on CNNs to optimize data transfer. These works are optimized for a hybrid edge-cloud pipeline and do not address the demands of a cluster of edge devices. Couper [13] uses a similar partitioning scheme to minimize inter-partition data transfer, but does not address the communication bottleneck associated with an edge cluster. Hu et al. [14] optimize the partitioning of a CNN onto a set of devices by taking compute time as the bottleneck, while employing compression to deal with communication constraints, and do not consider placement. Our paper builds on and differentiates itself from these works by addressing the bandwidth limitation of an edge cluster; it aims to maximize inter-node bandwidth during the placement stage to minimize bottleneck latency.
III. PARTITIONING AND PLACEMENT ALGORITHM
We are given two graphs:
1) An unweighted DAG $G_m$ representing the computation graph of a DNN, where each vertex represents a layer in the model. This DAG can be found using common ML libraries such as TensorFlow [1] and Keras [8].
2) A weighted complete graph $G_c$ representing the communication graph of a cluster of homogeneous physical compute nodes, where each vertex represents a physical compute node and each edge represents the bandwidth between those nodes. The graph is complete because we assume that these edge devices will communicate over the same WiFi network.
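For concreteness, a minimal sketch of these two inputs using networkx (the layer names and bandwidth figures below are invented for illustration):

```python
import networkx as nx

# 1) G_m: unweighted DAG of the model, one vertex per layer.
G_m = nx.DiGraph()
G_m.add_edges_from([
    ("input", "conv1"),
    ("conv1", "branch_a"), ("conv1", "branch_b"),  # a residual-style split
    ("branch_a", "add"), ("branch_b", "add"),
    ("add", "fc"),
])

# 2) G_c: weighted complete graph of the cluster; edge weights are the
#    measured pairwise bandwidths (Mbps) over the shared WiFi network.
devices = ["node0", "node1", "node2"]
G_c = nx.complete_graph(devices)
bandwidth_mbps = {("node0", "node1"): 6.0, ("node0", "node2"): 4.5, ("node1", "node2"): 5.2}
for (u, v), bw in bandwidth_mbps.items():
    G_c[u][v]["bandwidth"] = bw
```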
Our goal is to optimally partition the model and place these
partitions on a set of edge devices. We do so as follows.
A. Converting a Complex DAG to a Linear DAG
First, we need to distill $G_m$ into a linear DAG. The vertices where it is possible to partition the model are called "candidate partition points." We illustrate this in Figure 2.

For each vertex $v \in V$ of $G_m$, with edge set $E$ and source vertex $s$, find the longest path from $s$ to $v$. This can be done by topologically sorting the DAG and, for each vertex in the resulting list, relaxing each neighbor of that vertex. We call the length of this longest path the topological depth of that vertex in the graph. Let $LP(v)$ denote the length of the longest path from $s$ to $v$.
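A minimal sketch of this longest-path (topological depth) computation, assuming the model DAG is represented as a networkx DiGraph as sketched earlier:

```python
import networkx as nx

def longest_path_lengths(G_m: nx.DiGraph, source) -> dict:
    """LP(v): length of the longest path from the source layer to each vertex,
    computed by relaxing the outgoing edges of each vertex in topological order."""
    LP = {v: float("-inf") for v in G_m.nodes}
    LP[source] = 0
    for u in nx.topological_sort(G_m):
        if LP[u] == float("-inf"):
            continue  # vertex not reachable from the source
        for v in G_m.successors(u):
            LP[v] = max(LP[v], LP[u] + 1)
    return LP
```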
To verify that all paths from vertex $v_{prev}$ go through vertex $v$, use a modified DFS that recurses on the incident edges of each vertex. If we encounter a vertex with a greater topological depth than $v$, return false. If we reach vertex $v$, return true. Let $AP(v_{prev}, v)$ denote the result of this algorithm.
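A sketch of the AP check (again not the paper's code; it assumes the DiGraph and the LP values from the previous sketch, and that every path out of v_prev eventually reaches v or some vertex deeper than v):

```python
def all_paths_through(G_m, LP, v_prev, v) -> bool:
    """AP(v_prev, v): DFS forward from v_prev; a branch succeeds when it
    reaches v, and fails if it reaches a vertex deeper than v without
    having passed through v."""
    def dfs(u) -> bool:
        for w in G_m.successors(u):
            if w == v:
                continue            # this branch passes through v
            if LP[w] > LP[v]:
                return False        # escaped past v's depth without hitting v
            if not dfs(w):
                return False
        return True
    return dfs(v_prev)
```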
Given the previously found candidate partition point $p_{k-1}$ and the current vertex $u$, the next candidate partition point $p_k = u$ iff:
1) $LP(u) \neq LP(v) \;\; \forall v \in V \setminus \{u\}$
2) $AP(p_{k-1}, u) = \text{true}$
with $p_0 = s$.

The time complexity of $LP$ is $O(V + E)$. $AP$ runs in polynomial time because it returns upon reaching a vertex with a greater topological depth. Therefore, this algorithm runs in polynomial time.
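Putting the two conditions together, a sketch of the candidate-partition-point scan, reusing longest_path_lengths and all_paths_through from the sketches above (condition 1 is checked as "u is the only vertex at its topological depth"):

```python
from collections import Counter

def candidate_partition_points(G_m, source) -> list:
    """Return [p_0 = s, p_1, ...]: the vertices at which the model DAG can be
    cut into a linear sequence of partitions, per conditions 1) and 2)."""
    LP = longest_path_lengths(G_m, source)
    depth_counts = Counter(LP.values())   # how many vertices share each depth

    points = [source]                     # p_0 = s
    for u in sorted(G_m.nodes, key=lambda v: LP[v]):
        if u == source:
            continue
        if depth_counts[LP[u]] != 1:      # condition 1: LP(u) != LP(v) for all v != u
            continue
        if all_paths_through(G_m, LP, points[-1], u):   # condition 2: AP(p_{k-1}, u)
            points.append(u)
    return points

# On the toy G_m from earlier, this returns ["input", "conv1", "add", "fc"];
# the parallel branch vertices are not candidate partition points.
```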
Figure 2 shows the candidate partition points at certain sections of the DAGs of ResNet50 [12] and InceptionResNetV2 [30]. Each rectangle represents a model layer in the DAG.