Intel Movidius) speeds up the inference time of small models without affecting accuracy for a class of models. Unfortunately, larger deep learning models (which achieve higher accuracy than smaller models) are still not within the latency and memory budget of these devices.
Larger models require cloud GPU resources, but this comes at the cost of network delays. This is unacceptable for live and streaming applications. In summary, edge processing provides a latency advantage, but there remains a significant accuracy gap between real-time prediction on an edge device and offline prediction in a resource-rich setting [20]. Our goal in REACT is to leverage cloud processing in tandem with edge processing to bridge the accuracy gap while preserving the latency advantage of edge processing.
3 REACT DESIGN
For real-time edge inference, we propose a system that uses an
edge-cloud architecture while retaining the low latency of edge
devices but achieving higher accuracy than an edge-only approach.
In this section, we discuss how we leverage the cloud models to
influence and improve edge results.
Basic Approach:
It is known that video frames are spatiotemporally correlated. Typically, it is sufficient to invoke edge object detection once every few frames. As illustrated in Figure 2(a), edge detection runs on every 5th frame. As shown in the figure, a comparatively lightweight object tracking operation can be employed to interpolate the intermediate frames. Additionally, to improve the accuracy of inference, select frames are asynchronously transmitted to the cloud for inference. Depending on the network conditions (RTT, bandwidth, etc.) and the cloud server configuration (GPU type, memory, etc.), cloud detections become available to the edge device only after a few frames. These newer cloud detections, covering objects that were previously undetected, can be brought forward to the current frame using another instance of an object tracker running over the past buffered frames. Video frames retain spatial and temporal context depending on the scene and camera dynamics. Our key insight is that these asynchronous detections from the cloud can improve overall system performance because the scene usually does not change abruptly. See Figure 2(b) for a visual result of the approach.
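To make this control flow concrete, the following Python sketch outlines one possible per-frame loop for the approach described above. It is illustrative only: the helpers video_stream, detect_on_edge, Tracker, send_to_cloud, poll_cloud, fast_forward, and render are hypothetical placeholders (not REACT's actual interfaces), and the interval values are assumptions.

# Illustrative per-frame loop; all helper names are hypothetical placeholders.
EDGE_INTERVAL = 5      # run the edge detector on every 5th frame (as in Figure 2(a))
CLOUD_INTERVAL = 30    # asynchronously offload a frame to the cloud every 30th frame

tracker = Tracker()    # lightweight tracker used to interpolate intermediate frames
frame_buffer = []      # recent frames, kept so late cloud results can be "replayed"

for i, frame in enumerate(video_stream()):
    frame_buffer.append((i, frame))
    if i % EDGE_INTERVAL == 0:
        detections = detect_on_edge(frame)     # heavier DNN, invoked sparsely
        tracker.reset(frame, detections)
    else:
        detections = tracker.track(frame)      # cheap interpolation between detections
    if i % CLOUD_INTERVAL == 0:
        send_to_cloud(i, frame)                # non-blocking; the reply arrives frames later
    cloud_result = poll_cloud()                # None until an annotation has arrived
    if cloud_result is not None:
        # track the (stale) cloud boxes forward over the buffered frames,
        # then merge them with the current edge/tracker detections
        detections = fast_forward(cloud_result, frame_buffer, detections)
    render(frame, detections)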
Challenges:
Nevertheless, designing a system that utilizes the above approach requires addressing several challenges. First, combining the detections from two sources, i.e., the local edge detections and the delayed cloud detections, is not straightforward. Each of the two sources produces a separate list of objects, each represented by a ⟨class_label, bounding_box, confidence_score⟩ tuple. A fusion algorithm must consider several cases (e.g., class label mismatches, misaligned bounding boxes) to consolidate the edge and cloud detections into a single list. Second, some or all of the cloud objects may be “stale”, i.e., outside the current edge frame. The longer it takes to perform fusion, the greater the risk of such staleness, especially if the scene changes rapidly. Thus, to minimize this risk, once the old cloud annotations are received, they must be quickly processed at the edge to help with the current frame.
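For illustration only, the sketch below shows this detection tuple and a toy consolidation routine based on bounding-box overlap (IoU). It is not REACT's fusion algorithm; the box format, the 0.5 IoU threshold, and the keep-the-higher-confidence rule are all assumptions.

from dataclasses import dataclass

@dataclass
class Detection:            # the <class_label, bounding_box, confidence_score> tuple
    label: str
    box: tuple              # (x, y, w, h): center coordinates plus width and height (assumed)
    score: float

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def fuse(edge_dets, cloud_dets, iou_thresh=0.5):
    """Toy fusion: pair overlapping boxes across the two lists, resolve label/confidence
    conflicts by keeping the higher-confidence entry, and pass through unmatched objects."""
    fused, matched = [], set()
    for e in edge_dets:
        best_j, best_iou = None, iou_thresh
        for j, c in enumerate(cloud_dets):
            overlap = iou(e.box, c.box)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append(e)                                  # edge-only object
        else:
            matched.add(best_j)
            c = cloud_dets[best_j]
            fused.append(e if e.score >= c.score else c)     # conflict resolution
    fused += [c for j, c in enumerate(cloud_dets) if j not in matched]  # cloud-only objects
    return fused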
Another challenge when running detection models on live videos at the edge is minimizing resource utilization while maintaining detection accuracy. Previous studies with edge-only detection systems have shown that running a deep neural network (DNN) on every frame of a video can drain system resources (e.g., battery) quickly [2]. In our case, with a distributed edge-cloud architecture, several resource constraints need to be considered simultaneously. For example, cloud detections are more accurate because one can run computationally expensive models with access to server-class GPU resources. However, bandwidth constraints or a limited cloud budget might restrict their use to once every few frames. Moreover, if the scene change is insignificant, it would be prudent not to invoke object detection at either the edge or the cloud. Conversely, for more dynamic scenes, increasing the frequency of edge detection might result in excessive heat generation from the modest GPUs used on edge devices, leading to throttling.
Next, we present our system called REACT, which overcomes the above challenges. Primarily, REACT consists of three components: i) the REACT Edge Manager, ii) the Cloud-Edge Fusion Unit, and iii) the REACT Model Server. Below, we describe them in more detail.
3.1 REACT Edge Manager
The REACT Edge Manager (REM) consists of different modules that, put together, enable fast and accurate object detection at the edge.
Change detector:
Previous studies have shown that running object detection on every frame of a video can drain system resources (e.g., battery) quickly [2]. REM provides two parameters, namely the detection frequency at the edge (𝑘) and at the cloud (𝑚), to modulate the number of frames between object detection invocations. Intuitively, if there is little object displacement across frames, running detection models frequently wastes resources. REM therefore employs a change detector that computes the optical flow between successive frames. This captures the relative motion of the scene, consisting of objects and the camera, similar to [2, 10, 18]. Thus, object detection is invoked only on every 𝑘th frame at the edge and every 𝑚th frame at the cloud, and only if this motion is greater than a pre-decided threshold.
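As one possible realization (an assumption on our part, not necessarily REACT's implementation), the change detector can be approximated with OpenCV's dense Farneback optical flow, gating detector invocations on the mean flow magnitude:

import cv2
import numpy as np

def scene_motion(prev_gray, curr_gray):
    """Mean optical-flow magnitude between two consecutive grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return float(np.mean(mag))

MOTION_THRESHOLD = 1.0   # pre-decided threshold; the value here is an assumption

def should_invoke_detector(frame_idx, prev_gray, curr_gray, freq):
    """Invoke detection on every freq-th frame (k for the edge, m for the cloud),
    but only if the scene motion exceeds the threshold."""
    return frame_idx % freq == 0 and scene_motion(prev_gray, curr_gray) > MOTION_THRESHOLD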
Edge Object Detector:
Every 𝑘th frame, REM triggers the edge object detector module, which in turn outputs a list of ⟨𝑙, 𝑝, 𝑐⟩ tuples. Here, 𝑙 and 𝑐 are the class label (e.g., car, person) and the confidence score (between 0 and 1) associated with a detected object, respectively. 𝑝 = (𝑥, 𝑦, 𝑤, ℎ) represents the bounding box of each detected object, where (𝑥, 𝑦) is the center coordinate of the object, and 𝑤 and ℎ are the width and height of the bounding box. To avoid multiple bounding boxes for the same object, we use non-maximum suppression (NMS), which removes locally repeated detections.
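A minimal greedy NMS sketch over such ⟨𝑙, 𝑝, 𝑐⟩ detections, reusing the hypothetical Detection class and iou() helper from the fusion sketch above (the 0.5 IoU threshold is illustrative):

def non_max_suppression(detections, iou_thresh=0.5):
    """Greedy NMS: keep the highest-confidence box of each group of overlapping,
    same-class detections and drop the locally repeated ones."""
    kept = []
    for d in sorted(detections, key=lambda d: d.score, reverse=True):
        if all(d.label != k.label or iou(d.box, k.box) < iou_thresh for k in kept):
            kept.append(d)
    return kept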
Main Object tracker:
REM employs a CPU-based object tracker, a computationally cheaper technique, between frames for which object detections are available. For example, a CSRT [25] tracker can process images at >40 fps (on an Nvidia Jetson Xavier). However, as the displacement of objects between frames increases, the tracker accuracy reduces. The tracker module accounts for this degradation by multiplying every tracked object's confidence score by a decay rate 𝛿 ∈ [0, 1]. As the confidence scores reduce with every passing frame under this multiplier, the module sweeps over the list of objects and discards the ones with low confidence scores (i.e., 𝑐 < 0.5).
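The decay-and-prune step can be written compactly as follows; the value of 𝛿 shown is an illustrative assumption, while the 0.5 cutoff is the threshold stated above:

def decay_and_prune(tracked, delta=0.9, min_score=0.5):
    """Multiply each tracked object's confidence by the decay rate delta,
    then discard objects whose confidence has fallen below min_score."""
    for d in tracked:          # Detection objects, as in the earlier sketch
        d.score *= delta
    return [d for d in tracked if d.score >= min_score]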
Cloud communicator:
The REM includes a communication module responsible for sending every 𝑚th frame (the cloud detection frequency) to the cloud and receiving the associated output annotations. Similar to edge detections, the cloud annotations consist of a