Partitioning and Placement of Deep Neural Networks on Distributed Edge Devices to Maximize Inference Throughput

Arjun Parthasarathy
Crystal Springs Uplands School
Email: aparthasarathy23@csus.org
Bhaskar Krishnamachari
University of Southern California
Email: bkrishna@usc.edu
Abstract—Edge inference has become more widespread, as its diverse applications range from retail to wearable technology. Clusters of networked, resource-constrained edge devices are becoming common, yet no system exists to split a DNN across these clusters while maximizing the inference throughput of the system. We present an algorithm that partitions DNNs and distributes them across a set of edge devices with the goal of minimizing the bottleneck latency and therefore maximizing inference throughput. The algorithm scales well across different node memory capacities and numbers of nodes. We find that we can reduce the bottleneck latency by 10x over a random algorithm and by 35% over a greedy joint partitioning-placement algorithm. Furthermore, we find empirically that, for the set of representative models we tested, the algorithm produces results within 9.2% of the optimal bottleneck latency.
I. INTRODUCTION
Deep Neural Networks (DNNs) have greatly accelerated machine learning across different disciplines, such as Computer Vision [5] and Natural Language Processing [21]. Edge inference is becoming an increasingly popular field with multiple facets [33], as sensor-driven computation in IoT systems necessitates DNN inference in the field. IoT applications for edge inference range from retail to wearable technology [4], [6].
The edge can come in multiple configurations [19], [28], and there are multiple approaches to facilitate edge inference. For cloud-edge hybrid inference, one such approach is model compression [11], which deals exclusively with DNN optimization but does not address the system's runtime configuration. In this paper, we focus on clusters of resource-constrained edge devices. These edge clusters are becoming increasingly common due to their low cost and scalability at the edge [24]. Unlike a cloud data center, the edge brings system resource limitations and communication bottlenecks between devices.
With this in mind, we address the following problem: How can we take advantage of multi-device edge clusters to enable high-performance DNN inference while respecting computational resource constraints and taking into account the heterogeneity of communication links?
To partition a deep learning model, we first split the model into components that are executed sequentially. Each partition is assigned to a different edge device, and once each node performs inference with its piece of the model, that intermediate inference result is sent to the next node with the corresponding partition in the sequence. This inference pipeline is shown in Figure 1.

Fig. 1: Partitioning and Distributing a Model Across Edge Devices to Create an Inference Pipeline
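As a toy illustration of this pipeline (not the paper's implementation; the partition functions below are hypothetical stand-ins for DNN sub-models, and the network transfer is simulated by a function call):

```python
from typing import Callable, List

# A "partition" is modeled here as any callable mapping an input batch to an
# intermediate (or final) result; real partitions would be DNN sub-models.
Partition = Callable[[List[float]], List[float]]

def pipeline_inference(partitions: List[Partition], batch: List[float]) -> List[float]:
    """Run one batch through each partition in sequence. In the real system,
    each partition lives on a different edge device and the intermediate
    result is sent over the network to the next device in the pipeline."""
    activations = batch
    for run_partition in partitions:
        activations = run_partition(activations)  # stands in for compute + transfer
    return activations

# Toy usage with three stand-in partitions.
parts: List[Partition] = [
    lambda x: [v + 1.0 for v in x],
    lambda x: [v * 2.0 for v in x],
    lambda x: [v - 3.0 for v in x],
]
print(pipeline_inference(parts, [1.0, 2.0]))  # -> [1.0, 3.0]
```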
In an edge cluster, although each node has lower computational power, we can take advantage of this inference pipelining to increase system throughput. Since each node can perform inference with its partition individually, prior nodes in the pipeline can send their finished inference results to subsequent nodes in the pipeline and accept new batches.
We define the throughput metric of a system as the number of inference cycles it can perform per unit time. As we showed in our previous work DEFER [23], we can achieve higher throughput with distributed edge inference, as opposed to inference on a single device, because of pipelining. The throughput is defined as the reciprocal of the bottleneck latency. For nodes $[k] = \{1, 2, \ldots, k\}$, the bottleneck latency $\beta$ is defined as

$$S = \{\, c_k, \gamma_k \mid k \in [k] \,\}$$
$$\beta = \max_{s \in S} s \qquad (1)$$

where $c_k$ is the compute time of the operations on node $k$, and $\gamma_k$ is the communication time between node $k-1$ and node $k$.

We use ResNet50 [12], which is a representative model for our use case. On a Raspberry Pi 4, its inference time was measured to be 225 ms [25]. Next, we found the amount of data transferred between each layer of the model: on average, 10.2 Mbits of data is transferred between layers. Given an average WiFi bandwidth of 6 Mbps for a low-end edge network, this gives a communication time of 1.7 s, which is 7.5x slower than the compute time. In reality, many models are larger than ResNet50 and will therefore be split across devices, so each device will have fewer operations to execute. This means that communication time will outweigh compute time as the bottleneck. Therefore, we can simplify the expression for the bottleneck latency to:

$$\beta = \max_{k \in [k]} \gamma_k \qquad (2)$$
Since throughput is defined as $1/\beta$, by minimizing the bottleneck latency we maximize inference throughput. Additionally, we assume that all nodes are homogeneous in RAM; if the devices do not all have the same capacity, the algorithm takes the smallest memory capacity across all nodes in the cluster as the capacity of each node. In this paper, we primarily analyze image and text models due to their prevalence on the edge for visual analytics applications [22], [34].
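As a concrete illustration of equations (1) and (2), the following is a minimal sketch (not the paper's code), using the ResNet50-on-Raspberry-Pi figures quoted above as hypothetical inputs:

```python
from typing import Sequence

def bottleneck_latency(compute_times: Sequence[float], comm_times: Sequence[float]) -> float:
    """Equation (1): beta is the largest of all per-node compute times c_k
    and per-link communication times gamma_k."""
    return max(list(compute_times) + list(comm_times))

def comm_bottleneck_latency(comm_times: Sequence[float]) -> float:
    """Equation (2): when communication dominates, beta = max_k gamma_k."""
    return max(comm_times)

def throughput(beta: float) -> float:
    """Inference cycles per unit time = 1 / beta."""
    return 1.0 / beta

# Worked numbers from the text: ~10.2 Mbit transferred per hop over a 6 Mbps
# link (~1.7 s) versus ~0.225 s of compute per node on a Raspberry Pi 4.
comm = 10.2 / 6.0                      # ~1.7 s per inter-node transfer
compute = 0.225                        # s of compute per node
beta = bottleneck_latency([compute] * 3, [comm, comm])
print(beta, throughput(beta))          # ~1.7 s, ~0.59 inference cycles/s
```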
Our main contribution in this paper is a novel partitioning and placement algorithm for DNNs across a cluster of edge devices distributed spatially within the same WiFi network. The algorithm finds the candidate partition points, finds the optimal partition sizes to transfer the least amount of data, and finds the arrangement of nodes with the highest bandwidth. Together, these aim to minimize the resulting bottleneck latency according to the throughput metric. We found that our algorithm yields a 10x improvement over a random partitioning/placement algorithm and a 35% reduction in bottleneck latency over a greedy joint partitioning-placement algorithm for systems with 50 compute nodes. We empirically observe an average approximation ratio of 1.092 for the bottleneck latency (i.e., it is 9.2% more than the optimal bottleneck latency, on average).
II. RELATED WORK
Early works on the topic of partitioning DNN models divided them into head and tail models, with the former distilled to enable running on a resource-constrained device and to reduce data transfer [20]. Some prior works on DNN edge inference mathematically perform DNN model slicing by layer [35], [36], after calculating layer impact during the training stage; these do not account for communication demands on the edge. Others abstract model layers into certain “execution units” [7], [17], which they then choose to slice based on certain resource requirements. Li et al. [16] regressively predict a layer's latency demand and optimize communication bandwidth accordingly. DeeperThings [29] performs layer fusion on CNNs to optimize data transfer. These works are optimized for a hybrid edge-cloud pipeline and do not address the demands of a cluster of edge devices. Couper [13] uses a similar partitioning scheme to minimize inter-partition data transfer, but does not address the communication bottleneck associated with an edge cluster. Hu et al. [14] optimize the partitioning of a CNN onto a set of devices by taking compute time as the bottleneck, while employing compression to deal with communication constraints, and do not consider placement. Our paper builds on and differentiates itself from these works by addressing the bandwidth limitation of an edge cluster; it aims to maximize inter-node bandwidth during the placement stage to minimize bottleneck latency.
III. PARTITIONING AND PLACEMENT ALGORITHM
We are given two graphs:
1) An unweighted DAG $G_m$ representing the computation graph of a DNN, where each vertex represents a layer in the model. This DAG can be found using common ML libraries such as TensorFlow [1] and Keras [8].
2) A weighted complete graph $G_c$ representing the communication graph of a cluster of homogeneous physical compute nodes, where each vertex represents a physical compute node and each edge represents the bandwidth between those nodes. The graph is complete because we assume that these edge devices will communicate over the same WiFi network.
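For concreteness, a minimal sketch of these two inputs using networkx (the layer names and bandwidth figures below are invented for illustration):

```python
import networkx as nx

# 1) G_m: unweighted DAG of the model, one vertex per layer.
G_m = nx.DiGraph()
G_m.add_edges_from([
    ("input", "conv1"),
    ("conv1", "branch_a"), ("conv1", "branch_b"),  # a residual-style split
    ("branch_a", "add"), ("branch_b", "add"),
    ("add", "fc"),
])

# 2) G_c: weighted complete graph of the cluster; edge weights are the
#    measured pairwise bandwidths (Mbps) over the shared WiFi network.
devices = ["node0", "node1", "node2"]
G_c = nx.complete_graph(devices)
bandwidth_mbps = {("node0", "node1"): 6.0, ("node0", "node2"): 4.5, ("node1", "node2"): 5.2}
for (u, v), bw in bandwidth_mbps.items():
    G_c[u][v]["bandwidth"] = bw
```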
Our goal is to optimally partition the model and place these
partitions on a set of edge devices. We do so as follows.
A. Converting a Complex DAG to a Linear DAG
First, we need to distill $G_m$ into a linear DAG. The vertices where it is possible to partition the model are called "candidate partition points." We illustrate this in Figure 2.

For each vertex $v \in V$ of $G_m$, with edge set $E$ and source vertex $s$, find the longest path from $s$ to $v$. This can be done by topologically sorting the DAG and, for each vertex in the resulting list, relaxing each neighbor of that vertex. We call the length of this longest path the topological depth of that vertex in the graph. Let $LP(v)$ denote the length of the longest path from $s$ to $v$.
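A minimal sketch of this longest-path (topological depth) computation, assuming the model DAG is represented as a networkx DiGraph as sketched earlier:

```python
import networkx as nx

def longest_path_lengths(G_m: nx.DiGraph, source) -> dict:
    """LP(v): length of the longest path from the source layer to each vertex,
    computed by relaxing the outgoing edges of each vertex in topological order."""
    LP = {v: float("-inf") for v in G_m.nodes}
    LP[source] = 0
    for u in nx.topological_sort(G_m):
        if LP[u] == float("-inf"):
            continue  # vertex not reachable from the source
        for v in G_m.successors(u):
            LP[v] = max(LP[v], LP[u] + 1)
    return LP
```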
To verify that all paths from vertex $v_{prev}$ go through vertex $v$, use a modified DFS that recurses on the incident edges of each vertex. If we encounter a vertex with a greater topological depth than $v$, return false. If we reach vertex $v$, return true. Let $AP(v_{prev}, v)$ denote the result of this algorithm.
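A sketch of the AP check (again not the paper's code; it assumes the DiGraph and the LP values from the previous sketch, and that every path out of v_prev eventually reaches v or some vertex deeper than v):

```python
def all_paths_through(G_m, LP, v_prev, v) -> bool:
    """AP(v_prev, v): DFS forward from v_prev; a branch succeeds when it
    reaches v, and fails if it reaches a vertex deeper than v without
    having passed through v."""
    def dfs(u) -> bool:
        for w in G_m.successors(u):
            if w == v:
                continue            # this branch passes through v
            if LP[w] > LP[v]:
                return False        # escaped past v's depth without hitting v
            if not dfs(w):
                return False
        return True
    return dfs(v_prev)
```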
Given the previously found candidate partition point $p_{k-1}$ and the current vertex $u$, the next candidate partition point $p_k = u$ iff:
1) $LP(u) \neq LP(v) \;\; \forall v \in V \setminus \{u\}$
2) $AP(p_{k-1}, u) = \text{true}$
with $p_0 = s$.

The time complexity of $LP$ is $O(V + E)$. $AP$ runs in polynomial time because it returns upon reaching a vertex with a greater topological depth. Therefore, this algorithm runs in polynomial time.
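Putting the two conditions together, a sketch of the candidate-partition-point scan, reusing longest_path_lengths and all_paths_through from the sketches above (condition 1 is checked as "u is the only vertex at its topological depth"):

```python
from collections import Counter

def candidate_partition_points(G_m, source) -> list:
    """Return [p_0 = s, p_1, ...]: the vertices at which the model DAG can be
    cut into a linear sequence of partitions, per conditions 1) and 2)."""
    LP = longest_path_lengths(G_m, source)
    depth_counts = Counter(LP.values())   # how many vertices share each depth

    points = [source]                     # p_0 = s
    for u in sorted(G_m.nodes, key=lambda v: LP[v]):
        if u == source:
            continue
        if depth_counts[LP[u]] != 1:      # condition 1: LP(u) != LP(v) for all v != u
            continue
        if all_paths_through(G_m, LP, points[-1], u):   # condition 2: AP(p_{k-1}, u)
            points.append(u)
    return points

# On the toy G_m from earlier, this returns ["input", "conv1", "add", "fc"];
# the parallel branch vertices are not candidate partition points.
```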
Figure 2 shows the candidate partition points at certain sections of the DAGs of ResNet50 [12] and InceptionResNetV2 [30]. Each rectangle represents a model layer in the DAG.