Intel Movidius) speeds up the inference time of small models without affecting accuracy for a class of models. Unfortunately, larger deep learning models (which achieve higher accuracy than smaller models) are still not within the latency and memory budget of these devices.
Larger models require cloud GPU resources, but this comes at the cost of network delays. This is unacceptable for live and streaming applications. In summary, edge processing provides a latency advantage, but there remains a significant accuracy gap between real-time prediction on an edge device and offline prediction in a resource-rich setting [20]. Our goal in REACT is to leverage cloud processing in tandem with edge processing to bridge the accuracy gap while preserving the latency advantage of edge processing.
3 REACT DESIGN
For real-time edge inference, we propose a system that uses an
edge-cloud architecture while retaining the low latency of edge
devices but achieving higher accuracy than an edge-only approach.
In this section, we discuss how we leverage the cloud models to
influence and improve edge results.
Basic Approach:
It is known that video frames are spatiotemporally correlated. Typically, it is sufficient to invoke edge object detection once every few frames. As illustrated in Figure 2(a), edge detection runs on every 5th frame. As shown in the figure, a comparatively lightweight object tracking operation can be employed to interpolate the intermediate frames. Additionally, to improve the accuracy of inference, select frames are asynchronously transmitted to the cloud for inference. Depending on the network conditions (RTT, bandwidth, etc.) and the cloud server configuration (GPU type, memory, etc.), cloud detections become available to the edge device only after a few frames. These newer cloud detections, covering objects that were previously undetected, can be brought forward to the current frame using another instance of an object tracker running over the past buffered frames. Video frames retain spatial and temporal context depending on the scene and camera dynamics. Our key insight is that these asynchronous detections from the cloud can improve overall system performance because the scene usually does not change abruptly. See Figure 2(b) for a visual result of the approach.
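To make this control flow concrete, the following Python sketch outlines one possible per-frame loop for the approach described above. It is illustrative only: the helpers video_stream, detect_on_edge, Tracker, send_to_cloud, poll_cloud, fast_forward, and render are hypothetical placeholders (not REACT's actual interfaces), and the interval values are assumptions.

# Illustrative per-frame loop; all helper names are hypothetical placeholders.
EDGE_INTERVAL = 5      # run the edge detector on every 5th frame (as in Figure 2(a))
CLOUD_INTERVAL = 30    # asynchronously offload a frame to the cloud every 30th frame

tracker = Tracker()    # lightweight tracker used to interpolate intermediate frames
frame_buffer = []      # recent frames, kept so late cloud results can be "replayed"

for i, frame in enumerate(video_stream()):
    frame_buffer.append((i, frame))
    if i % EDGE_INTERVAL == 0:
        detections = detect_on_edge(frame)     # heavier DNN, invoked sparsely
        tracker.reset(frame, detections)
    else:
        detections = tracker.track(frame)      # cheap interpolation between detections
    if i % CLOUD_INTERVAL == 0:
        send_to_cloud(i, frame)                # non-blocking; the reply arrives frames later
    cloud_result = poll_cloud()                # None until an annotation has arrived
    if cloud_result is not None:
        # track the (stale) cloud boxes forward over the buffered frames,
        # then merge them with the current edge/tracker detections
        detections = fast_forward(cloud_result, frame_buffer, detections)
    render(frame, detections)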
Challenges:
Nevertheless, designing a system that utilizes the above approach requires addressing several challenges. First, combining the detections from two sources, i.e., the local edge detections and the delayed cloud detections, is not straightforward. Each of the two sources produces a separate list of objects, each represented by a ⟨class_label, bounding_box, confidence_score⟩ tuple. A fusion algorithm must consider several cases (e.g., class label mismatches, misaligned bounding boxes) to consolidate the edge and cloud detections into a single list. Second, some or all of the cloud objects may be “stale”, i.e., outside the current edge frame. The longer it takes to perform fusion, the greater the risk of such staleness, especially if the scene changes rapidly. Thus, to minimize this risk, once the old cloud annotations are received, they must be quickly processed at the edge to help with the current frame.
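For illustration only, the sketch below shows this detection tuple and a toy consolidation routine based on bounding-box overlap (IoU). It is not REACT's fusion algorithm; the box format, the 0.5 IoU threshold, and the keep-the-higher-confidence rule are all assumptions.

from dataclasses import dataclass

@dataclass
class Detection:            # the <class_label, bounding_box, confidence_score> tuple
    label: str
    box: tuple              # (x, y, w, h): center coordinates plus width and height (assumed)
    score: float

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def fuse(edge_dets, cloud_dets, iou_thresh=0.5):
    """Toy fusion: pair overlapping boxes across the two lists, resolve label/confidence
    conflicts by keeping the higher-confidence entry, and pass through unmatched objects."""
    fused, matched = [], set()
    for e in edge_dets:
        best_j, best_iou = None, iou_thresh
        for j, c in enumerate(cloud_dets):
            overlap = iou(e.box, c.box)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fused.append(e)                                  # edge-only object
        else:
            matched.add(best_j)
            c = cloud_dets[best_j]
            fused.append(e if e.score >= c.score else c)     # conflict resolution
    fused += [c for j, c in enumerate(cloud_dets) if j not in matched]  # cloud-only objects
    return fused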
Another challenge when running detection models on live videos at the edge is minimizing resource utilization while maintaining detection accuracy. Previous studies with edge-only detection systems have shown that running a deep neural network (DNN) on every frame of a video can drain system resources (e.g., battery) quickly [2]. In our case, with a distributed edge-cloud architecture, several resource constraints need to be considered simultaneously. For example, cloud detections are more accurate because one can run computationally expensive models with access to server-class GPU resources. However, bandwidth constraints or a limited cloud budget might restrict their use to once every few frames. Moreover, if the scene change is insignificant, it would be prudent not to invoke object detection at either the edge or the cloud. Conversely, for more dynamic scenes, increasing the frequency of edge detection might result in excessive heat generation from the modest GPUs used on edge devices, leading to throttling.
Next, we present our system called REACT, which overcomes the above challenges. Primarily, REACT consists of three components: i) the REACT Edge Manager, ii) the Cloud-Edge Fusion Unit, and iii) the REACT Model Server. Below, we describe them in more detail.
3.1 REACT Edge Manager
The REACT Edge Manager (REM) consists of different modules that, put together, enable fast and accurate object detection at the edge.
Change detector:
Previous studies have shown that running object detection on every frame of a video can drain system resources (e.g., battery) quickly [2]. REM provides two parameters, namely the detection frequency at the edge (𝑘) and at the cloud (𝑚), to modulate the number of frames between object detection invocations. Intuitively, if there is little object displacement across frames, running detection models frequently wastes resources. REM therefore employs a change detector that computes the optical flow between successive frames. This captures the relative motion of the scene, consisting of objects and the camera, similar to [2, 10, 18]. Thus, object detection is invoked only on every 𝑘th frame at the edge and every 𝑚th frame at the cloud, and only if this motion is greater than a pre-decided threshold.
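As one possible realization (an assumption on our part, not necessarily REACT's implementation), the change detector can be approximated with OpenCV's dense Farneback optical flow, gating detector invocations on the mean flow magnitude:

import cv2
import numpy as np

def scene_motion(prev_gray, curr_gray):
    """Mean optical-flow magnitude between two consecutive grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return float(np.mean(mag))

MOTION_THRESHOLD = 1.0   # pre-decided threshold; the value here is an assumption

def should_invoke_detector(frame_idx, prev_gray, curr_gray, freq):
    """Invoke detection on every freq-th frame (k for the edge, m for the cloud),
    but only if the scene motion exceeds the threshold."""
    return frame_idx % freq == 0 and scene_motion(prev_gray, curr_gray) > MOTION_THRESHOLD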
Edge Object Detector:
Every 𝑘th frame, REM triggers the edge object detector module, which in turn outputs a list of ⟨𝑙, 𝑝, 𝑐⟩ tuples. Here, 𝑙 and 𝑐 are the class label (e.g., car, person) and the confidence score (between 0 and 1) associated with a detected object, respectively. 𝑝 = (𝑥, 𝑦, 𝑤, ℎ) represents the bounding box of each detected object, where (𝑥, 𝑦) is the center coordinate of the object, and 𝑤 and ℎ are the width and height of the bounding box. To avoid multiple bounding boxes for the same object, we use non-maximum suppression (NMS), which removes locally repeated detections.
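A minimal greedy NMS sketch over such ⟨𝑙, 𝑝, 𝑐⟩ detections, reusing the hypothetical Detection class and iou() helper from the fusion sketch above (the 0.5 IoU threshold is illustrative):

def non_max_suppression(detections, iou_thresh=0.5):
    """Greedy NMS: keep the highest-confidence box of each group of overlapping,
    same-class detections and drop the locally repeated ones."""
    kept = []
    for d in sorted(detections, key=lambda d: d.score, reverse=True):
        if all(d.label != k.label or iou(d.box, k.box) < iou_thresh for k in kept):
            kept.append(d)
    return kept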
Main Object tracker:
REM employs a CPU-based object tracker, a computationally cheaper technique, between frames for which object detections are available. For example, a CSRT [25] tracker can process images at >40 fps (on an Nvidia Jetson Xavier). However, as the displacement of objects between frames increases, the tracker accuracy reduces. The tracker module accounts for this degradation by multiplying every tracked object's confidence score by a decay rate 𝛿 ∈ [0, 1]. As the confidence scores reduce with every passing frame under this multiplier, the module sweeps over the list of objects and discards the ones with low confidence scores (i.e., 𝑐 < 0.5).
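The decay-and-prune step can be written compactly as follows; the value of 𝛿 shown is an illustrative assumption, while the 0.5 cutoff is the threshold stated above:

def decay_and_prune(tracked, delta=0.9, min_score=0.5):
    """Multiply each tracked object's confidence by the decay rate delta,
    then discard objects whose confidence has fallen below min_score."""
    for d in tracked:          # Detection objects, as in the earlier sketch
        d.score *= delta
    return [d for d in tracked if d.score >= min_score]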
Cloud communicator:
The REM includes a communication module responsible for sending every 𝑚th frame (the cloud detection frequency) to the cloud and receiving the associated output annotations. Similar to edge detections, the cloud annotations consist of a