MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera
Videos with Spherical Buffers and Padded Convolutions
Mathias Parger1, Chengcheng Tang2, Thomas Neff1, Christopher D. Twigg2,
Cem Keskin2, Robert Wang2, Markus Steinberger1
1Graz University of Technology, 2Meta Reality Labs
1{mathias.parger, thomas.neff, steinberger}@icg.tugraz.at
2{chengcheng.tang, cdtwigg, cemkeskin, rywang}@meta.com
Figure 1: MotionDeltaCNN leverages temporal continuity to accelerate CNN inference for videos with moving cameras (bottom) by processing only the sparse frame differences based on a two-dimensional ring buffer, the spherical buffer (top). Instead of computing each frame individually, we process only the changes (row b) between the aligned current (row a) and previous frames. Our padded convolution creates dilated values (top) which allow for seamless attachment of newly unveiled tiles onto the existing spherical buffer (row c). With these concepts, MotionDeltaCNN speeds up inference by processing only new regions or updated pixels (white areas in row d) without additional memory allocation.
Abstract
Convolutional neural network inference on video input is computationally expensive and requires high memory bandwidth. Recently, DeltaCNN [26] managed to reduce the cost by only processing pixels with significant updates over the previous frame. However, DeltaCNN relies on static camera input. Moving cameras add new challenges in how to fuse newly unveiled image regions with already processed regions efficiently to minimize the update rate, without increasing memory overhead and without knowing the camera extrinsics of future frames. In this work, we propose MotionDeltaCNN, a sparse CNN inference framework that supports moving cameras. We introduce spherical buffers and padded convolutions to enable seamless fusion of newly unveiled regions and previously processed regions without increasing memory footprint. Our evaluation shows that we outperform DeltaCNN by up to 90% for moving camera videos.
1. Introduction
Real-time inference of convolutional neural networks (CNN) with video streams remains a power-consuming task and is often infeasible on mobile devices due to hardware and thermal limitations, despite recent efforts aiming at efficient CNN inference through pruning [19, 14], quantization [16, 24, 22], specialized hardware [6, 13, 5] or network optimization [28, 31]. Video input allows for performance optimization through temporal similarity between frames. One common method is to use large, slow networks for accurate predictions at key frames, updated with small, fast networks at intermediate frames [34, 33, 20, 18, 9, 21, 12]. However, this approach requires special network design and training and is not suitable for significant frame changes. Sparse convolutions, on the other hand, can be used with existing pre-trained models, accelerating the large network to the speed of a smaller network without impacting prediction accuracy significantly [25, 26].
Figure 2: To support moving camera input in sparse inference of frame differences, a naïve approach (a) is to embed the input onto a larger image and only overwrite pixels that are covered in the next frame. This approach comes with large overhead in memory consumption (storing the additional pixels of the embedded frame) and computational cost (checking all pixels for updates). MotionDeltaCNN (b) uses original shape buffers and feature maps, avoiding the memory overhead and reducing the number of pixels that have to be checked for updates.
The linearity of convolutions enables the accumulation of updates over consecutive frames. After processing the first frame densely, upcoming frames can be processed using the difference between the current and the previous frame, the Delta, as the network input. Assuming a static camera scenario, this results in large image regions with no frame-to-frame difference. Static elements like background or stationary objects do not require updates after the initial frame. Researchers exploited this sparsity in previous work to reduce the number of FLOPs required for CNN inference [25, 4, 11, 1, 8], but were mostly unable to accelerate inference in practice. With the recent progress in DeltaCNN [26], theoretical FLOP reductions were translated into practical speedups on GPUs, using a custom sparse CNN implementation to exploit sparsity in all layers of the CNN.
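Because convolution is linear, convolving only the Delta (without bias) and accumulating the result onto the previous output yields exactly the dense result. The following PyTorch sketch merely illustrates this identity; the layer shape and the changed region are hypothetical, and a real sparse implementation would skip zero-valued tiles instead of convolving the full Delta tensor.

```python
import torch
import torch.nn.functional as F

# Sketch of Delta-based convolution (static camera case):
# conv(x_t) == conv(x_{t-1}) + conv_no_bias(x_t - x_{t-1}), by linearity.
torch.manual_seed(0)
weight = torch.randn(8, 3, 3, 3)    # hypothetical 3x3 layer, 3 input / 8 output channels
bias = torch.randn(8)

x_prev = torch.randn(1, 3, 64, 64)           # previous frame (already processed densely)
x_curr = x_prev.clone()
x_curr[:, :, 20:30, 20:30] += 1.0            # only a small region changed

y_prev = F.conv2d(x_prev, weight, bias, padding=1)

# Sparse path: convolve only the Delta, without bias, and accumulate onto the previous output.
delta = x_curr - x_prev
y_curr_sparse = y_prev + F.conv2d(delta, weight, bias=None, padding=1)

# Dense reference for comparison.
y_curr_dense = F.conv2d(x_curr, weight, bias, padding=1)
assert torch.allclose(y_curr_sparse, y_curr_dense, atol=1e-4)
```

In an actual sparse pipeline, only tiles whose Delta exceeds a threshold would be convolved; the dense calls above only verify the identity.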
DeltaCNN achieves speedups on multiple tasks and datasets, but like prior work, it is optimized for static camera inputs. Even small frame-to-frame camera motion necessitates reprocessing large portions of the image, or the entire image. Spatially aligning consecutive frames, e.g., by leveraging camera extrinsics from IMUs or SLAM on mobile devices, can increase per-pixel similarity and thus update sparsity. Due to the change of camera location and orientation, an aligned new frame may "grow" beyond the initial field of view and out of the previous-results buffers. One way to solve this issue is to embed inputs in a larger frame by padding the input image and drawing updates on top of it. However, the downside of this approach is that padded input images increase the memory consumption and computational cost significantly, as shown in Figure 2.
In this work, we propose MotionDeltaCNN, a sparse CNN inference framework that allows moving camera input with marginal memory overhead. Compared to previous work, we achieve up to 90% higher frame rates in videos with moving cameras, opening up a new perspective for CNN inference optimization on low-power devices such as surveillance cameras, smartphones, or VR headsets.
Our main contributions are:

• We propose MotionDeltaCNN, the first framework that leverages temporal continuity to accelerate CNN inference for videos with moving cameras by processing only the sparse frame differences.
• At the core of this work, we propose padded convolutions for seamless integration of newly unveiled pixels without the need to reprocess already seen pixels.
• We design a two-dimensional ring buffer, a spherical buffer, with wrapped coordinates. Our buffer allows partial growth, reset and initialization of new tiles without additional memory allocation.
• We show how MotionDeltaCNN can also be used to speed up applications with static cameras when only parts of the image require processing.
2. Related Work
The idea of only processing updated pixels in a convo-
lutional layer has been proposed in various ways, with Re-
current Residual Module (RRM) [
25
] being the first work
to apply this concept to videos. RRM uses the difference
between current and previous input, the Delta, as input to
convolutional layers. RRM demonstrates large theoretical
reductions in FLOPs compared to dense inference resulting
from skipping computations involving (nearly) zero-valued
entries in feature maps. In practice, skipping individual
values is infeasible on most inference hardware like GPUs
due to their single instruction, multiple data (SIMD) design.
[Figure 3 layout: rows F1 and F2; columns Input, Delta, Convolution + Bias, Accumulated Output.]
Figure 3: The main concepts of MotionDeltaCNN. We align frame 2 (F2) with frame 1 (F1) using homography matrices. For the intersecting region, we propagate only the aligned frame-to-frame differences (Delta). The newly uncovered regions are propagated directly. The buffer regions that will store the uncovered regions are reset to zero before processing the current Delta input. After convolving frame 2, we add the bias to all previously unseen regions, and accumulate the result onto our spherical buffer using the offset coordinates for the current frame.
Furthermore, RRM only processes convolutional layers on
sparse Delta input. All remaining layers are processed on
the full, dense feature maps and therefore require expensive
value-conversions before and after each convolutional layer.
Skip-Convolution [11] and CBInfer [4] improve the concept of RRM using a per-pixel sparse mask instead of a per-value one. Like RRM, CBInfer uses a threshold to truncate insignificant updates in the input feature map. This threshold can be tuned for each layer for maximum sparsity while keeping the accuracy at a desired level. In contrast, Skip-Convolution decides per output pixel whether an update is required or can be skipped. In both cases, only the convolutional layers (and pooling layers in the case of CBInfer) are processed sparsely, requiring expensive conversions between dense and sparse features before and after each of these layers, leaving large performance potential unused.
DeltaCNN [26] propagates sparse Delta features end-to-end using a sparse Delta feature map together with an update mask. This greatly reduces the memory bandwidth compared to previous work. Propagating sparse updates end-to-end eliminates the necessity of converting the feature maps from dense accumulated values to sparse Deltas, and it speeds up other layers bottlenecked by memory bandwidth like activation, batch normalization, upsampling, etc. Using a custom CUDA implementation, DeltaCNN outperforms cuDNN and CBInfer several times over on video input. However, DeltaCNN, as well as all other previous work, assumes static camera input. Even single-pixel camera motion between two frames can result in nearly dense updates when the image contains high-frequency features.
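End-to-end propagation of Deltas requires non-linear layers to keep an accumulated buffer of their dense input, since, e.g., ReLU is not linear in the Delta. The following sketch, with hypothetical names rather than the library's actual API, shows how a Delta-aware ReLU can emit an output Delta that stays consistent with dense inference.

```python
import torch

class DeltaReLU(torch.nn.Module):
    """Sketch of a Delta-aware ReLU as used conceptually in DeltaCNN-style pipelines.

    The layer accumulates the dense input over previous frames so that it can
    emit an output Delta matching dense inference. The buffer handling is
    illustrative only; a real implementation works on sparse tiles.
    """
    def __init__(self):
        super().__init__()
        self.accumulated = None  # dense input accumulated over previous frames

    def forward(self, x_delta: torch.Tensor) -> torch.Tensor:
        if self.accumulated is None:
            # First (dense) frame: initialize the buffer, behave like plain ReLU.
            self.accumulated = x_delta.clone()
            return torch.relu(x_delta)
        y_prev = torch.relu(self.accumulated)
        self.accumulated += x_delta
        y_curr = torch.relu(self.accumulated)
        # Output Delta; it is zero wherever the input Delta was zero, so sparsity is preserved.
        return y_curr - y_prev
```

A real implementation additionally carries a per-pixel update mask and skips untouched tiles entirely; the dense tensors here only illustrate the bookkeeping.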
Event Neural Networks show that computational savings are strongly impacted by the intensity of camera motion [8]. With slightly shaking cameras, theoretical FLOP savings were reduced by 40% compared to static camera videos, while moving cameras reduced them by 60%. Depending on the distribution of the pixel updates, the impact on real hardware is likely even higher.
Incremental Sparse Convolution uses sparse convolutions to incrementally update 3D segmentation masks online [23]. It solves a similar problem of reusing results from previously segmented regions while allowing new regions to be attached on the fly. Due to its reliance on non-dilating 3D convolutions [10], attaching new regions leads to artifacts along the region borders. We solve this issue using a standard, dilating convolution and by processing all outputs that are affected by the current input.
We propose MotionDeltaCNN: building upon DeltaCNN, we design and implement a sparse inference extension that supports moving cameras. Compared to DeltaCNN, we add support for variable resolution inputs, spherical buffers, dynamic buffer allocation & initialization, and padded convolutions for seamless attachment of newly unveiled regions.
3. Method
MotionDeltaCNN relies on the following concepts:
Frame Alignment: The current frame is aligned with the initial frame of the sequence to maximize the overlap of consistent features (see Figure 3, Delta).
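As a rough sketch of this step, the current frame can be warped into the coordinate system of the initial frame before the Delta is computed. The homography is assumed to be given (e.g., from IMU/SLAM data or feature matching); the function name and the use of OpenCV are illustrative choices, not the paper's implementation.

```python
import cv2
import numpy as np

def align_to_initial(frame: np.ndarray, H_curr_to_init: np.ndarray, out_size) -> np.ndarray:
    """Warp the current frame into the initial frame's coordinate system.

    H_curr_to_init: 3x3 homography mapping current-frame pixels to initial-frame
    pixels (assumed to be provided externally).
    out_size: (width, height) of the reference buffer.
    """
    return cv2.warpPerspective(frame, H_curr_to_init, out_size,
                               flags=cv2.INTER_LINEAR,
                               borderMode=cv2.BORDER_CONSTANT)

# The Delta is then taken only inside the region where the warped frame overlaps
# previously processed content; newly unveiled pixels are propagated directly.
```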
Spherical Buffers: We use wrapped buffer offset coordinates in all non-linear layers to align them with the input (see Figure 3, Accumulated Output).
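A minimal sketch of the wrapped addressing follows; the class name, the patch-based accumulate operation, and the assumption that a patch is never larger than the buffer are simplifications for illustration.

```python
import numpy as np

class SphericalBuffer:
    """Toy 2D ring buffer: world coordinates wrap around the buffer edges,
    so the buffer never grows when the camera moves."""
    def __init__(self, channels: int, height: int, width: int):
        self.data = np.zeros((channels, height, width), dtype=np.float32)

    def accumulate(self, delta: np.ndarray, top: int, left: int):
        """Add a Delta patch whose top-left corner sits at world pixel (top, left)."""
        c, h, w = delta.shape
        rows = (top + np.arange(h)) % self.data.shape[1]    # wrapped row indices
        cols = (left + np.arange(w)) % self.data.shape[2]   # wrapped column indices
        self.data[:, rows[:, None], cols[None, :]] += delta

buf = SphericalBuffer(channels=8, height=64, width=64)
buf.accumulate(np.ones((8, 8, 8), dtype=np.float32), top=60, left=60)  # wraps around both edges
```

When camera motion unveils a new tile, the wrapped region it maps to is simply reset and re-initialized instead of allocating new memory.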
Dynamic Initialization: When a new region is first unveiled due to camera motion, MotionDeltaCNN adds the layer biases on the fly to all newly unveiled pixels of the feature maps to allow for seamless integration with previously processed pixels (see Figure 3, Bias).
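Conceptually, and simplifying away the tiling, the bias of a convolutional layer is added exactly once per pixel: densely on the first frame, and afterwards only to pixels that become visible for the first time. The shapes and names below are illustrative, not the actual kernel.

```python
import numpy as np

C, H, W = 8, 4, 4
out_delta = np.zeros((C, H, W), dtype=np.float32)   # Delta output of a conv layer, bias not yet applied
bias = np.random.randn(C).astype(np.float32)
newly_unveiled = np.zeros((H, W), dtype=bool)
newly_unveiled[:, 2:] = True                         # columns unveiled by camera motion

# Add the layer bias only where a pixel is seen for the first time;
# previously seen pixels already received the bias in an earlier frame.
out_delta[:, newly_unveiled] += bias[:, None]
```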
Padded Convolutions: We add additional padding to convolutional layers to process all pixels that are affected by the kernel. These dilated pixels are stored in truncated values buffers to enable seamless connection of potentially unveiled neighbor regions in upcoming frames (see Figure 5).
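As a rough illustration of the idea, simplified to a single 3x3 layer and ignoring the tiling details: convolving an update region with one extra ring of padding yields all outputs that the region influences, including the dilated border that spills beyond it.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
weight = torch.randn(8, 3, 3, 3)        # hypothetical 3x3 convolution, 3 -> 8 channels
tile = torch.randn(1, 3, 16, 16)        # one update region / newly unveiled tile

# Standard "same" convolution: outputs cover only the tile itself.
out_same = F.conv2d(tile, weight, padding=1)       # 1 x 8 x 16 x 16

# Padded convolution: one extra ring of padding also computes the outputs that the
# tile influences outside its own borders (the dilated values).
out_padded = F.conv2d(tile, weight, padding=2)     # 1 x 8 x 18 x 18
assert torch.allclose(out_padded[:, :, 1:-1, 1:-1], out_same, atol=1e-4)

# The outer ring of out_padded is what would be kept in the truncated values buffer,
# so that a tile unveiled later can attach to it without reprocessing seen pixels.
```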