[Figure 3 panels: Input (F1, F2), Delta, Convolution + Bias, Accumulated Output]
Figure 3: The main concepts of MotionDeltaCNN. We align frame 2 (F2) with frame 1 (F1) using homography matrices.
For the intersecting region, we propagate only the aligned frame-to-frame differences (Delta). The newly uncovered regions
are propagated directly. The buffer regions that store the newly uncovered regions are reset to zero before processing the
current Delta input. After convolving frame 2, we add the bias to all previously unseen regions and accumulate the result onto
our spherical buffer using the offset coordinates of the current frame.
Furthermore, RRM processes only the convolutional layers on sparse Delta input. All remaining layers are processed on the full, dense feature maps and therefore require expensive value conversions before and after each convolutional layer.
Skip-Convolution [11] and CBInfer [4] improve the concept of RRM by using a per-pixel sparse mask instead of a per-value mask. Like RRM, CBInfer uses a threshold to truncate insignificant updates in the input feature map. This threshold can be tuned per layer for maximum sparsity while keeping the accuracy at a desired level. In contrast, Skip-Convolution decides per output pixel whether an update is required or can be skipped. In both cases, only the convolutional layers (and, in the case of CBInfer, pooling layers) are processed sparsely, requiring expensive conversions between dense and sparse features before and after each of these layers and leaving large performance potential unused.
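To make the dense-sparse round trip concrete, the sketch below is a minimal PyTorch illustration (the function name, tensor layout, and threshold value tau are hypothetical, not taken from the CBInfer or Skip-Convolution implementations): the frame-to-frame difference is thresholded into a per-pixel update mask, only the truncated Delta is convolved, and the result is added back onto the cached dense output.

```python
import torch
import torch.nn.functional as F

def thresholded_delta_conv(prev_input, cur_input, prev_output, weight, tau=0.05):
    # Per-pixel update mask: a pixel counts as updated if any channel changed by more than tau.
    delta = cur_input - prev_input                         # (N, C, H, W)
    mask = delta.abs().amax(dim=1, keepdim=True) > tau     # (N, 1, H, W)

    # Dense -> sparse: truncate insignificant updates.
    sparse_delta = delta * mask

    # Convolving the Delta alone is valid because convolution is linear;
    # the layer bias is already contained in the cached dense output.
    delta_out = F.conv2d(sparse_delta, weight, padding=weight.shape[-1] // 2)

    # Sparse -> dense: accumulate onto the cached output for the next (dense) layer.
    return prev_output + delta_out
```

The two conversions bracketing the convolution are exactly the overhead that end-to-end Delta propagation removes.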
DeltaCNN [26] propagates sparse Delta features end-to-end using a sparse Delta feature map together with an update mask. This greatly reduces the memory bandwidth compared to previous work. Propagating sparse updates end-to-end eliminates the need to convert the feature maps from dense accumulated values to sparse Deltas, and it speeds up other layers bottlenecked by memory bandwidth, such as activation, batch normalization, and upsampling. Using a custom CUDA implementation, DeltaCNN outperforms cuDNN and CBInfer by a large margin on video input. However, DeltaCNN, like all other previous work, assumes a static camera. Even a single pixel of camera motion between two frames can result in nearly dense updates when the image contains high-frequency features.
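The buffer handling this requires at nonlinear layers can be sketched as follows; this is a simplified, hypothetical PyTorch module, not the DeltaCNN CUDA kernels. Convolutions map input Deltas directly to output Deltas, and only nonlinearities keep a dense accumulation buffer from which they emit the Delta of their output.

```python
import torch
import torch.nn.functional as F

class DeltaReLU(torch.nn.Module):
    """Nonlinearity in Delta mode: accumulates incoming Deltas into a dense buffer,
    applies the nonlinearity to the accumulated value, and emits the output Delta."""
    def __init__(self):
        super().__init__()
        self.accum = None      # dense accumulated input (allocated on the first frame)
        self.prev_out = None   # previous dense output of the nonlinearity

    def forward(self, delta):
        if self.accum is None:
            self.accum = torch.zeros_like(delta)
            self.prev_out = torch.zeros_like(delta)
        self.accum += delta
        out = torch.relu(self.accum)
        delta_out = out - self.prev_out
        self.prev_out = out
        return delta_out

def delta_forward(delta_in, conv1_w, conv2_w, relu):
    # Convolutions are linear, so they propagate Deltas without any buffer;
    # only the nonlinearity needs its accumulated state.
    x = F.conv2d(delta_in, conv1_w, padding=1)
    x = relu(x)
    return F.conv2d(x, conv2_w, padding=1)
```

In the actual implementation the nonlinearity is also processed sparsely via the update mask; the buffer here only illustrates why Deltas can flow end-to-end.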
Event Neural Networks show that computational savings are strongly impacted by the intensity of camera motion [8]. With slightly shaking cameras, theoretical FLOP savings were reduced by 40% compared to static-camera videos, while moving cameras reduced them by 60%. Depending on the distribution of the pixel updates, the impact on real hardware is likely even higher.
Incremental Sparse Convolution uses sparse convolutions to incrementally update 3D segmentation masks online [23]. It solves a similar problem of reusing results from previously segmented regions while allowing new regions to be attached on the fly. Due to its reliance on non-dilating 3D convolutions [10], attaching new regions leads to artifacts along the region borders. We solve this issue by using a standard, dilating convolution and by processing all outputs that are affected by the current input.
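To make the difference concrete, the toy comparison below (hypothetical tensors, not taken from [23] or [10]) shows that a standard convolution dilates a single active input pixel into a full kernel-sized output neighborhood, whereas a non-dilating convolution writes only at already-active sites, so outputs next to a newly attached region never receive the contributions a dense pass would give them.

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 7, 7)
x[0, 0, 3, 3] = 1.0                      # single active (updated) input pixel
w = torch.ones(1, 1, 3, 3)

dense = F.conv2d(x, w, padding=1)        # standard conv: the full 3x3 neighborhood is affected
active = (x != 0)                        # non-dilating conv: write only at active input sites
non_dilating = dense * active            # contributions at the border are silently dropped

print((dense != 0).sum().item())         # 9 affected outputs
print((non_dilating != 0).sum().item())  # 1 affected output -> artifacts at region borders
```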
We propose MotionDeltaCNN: building upon DeltaCNN, we design and implement a sparse inference extension that supports moving cameras. Compared to DeltaCNN, we add support for variable-resolution inputs, spherical buffers, dynamic buffer allocation and initialization, and padded convolutions for seamless attachment of newly unveiled regions.
3. Method
MotionDeltaCNN relies on the following concepts:
Frame Alignment The current frame is aligned with the
initial frame of the sequence to maximize the overlap of
consistent features (see Figure 3 Delta).
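One way to obtain such an alignment is a feature-based homography estimate followed by a warp. The sketch below uses OpenCV's ORB features and RANSAC as an illustrative choice; the method itself only assumes that homography matrices relating the frames are available.

```python
import cv2
import numpy as np

def align_to_reference(cur_frame, ref_frame):
    """Warp the current frame into the coordinate system of the reference (first) frame
    so that static scene content overlaps and frame-to-frame Deltas stay sparse."""
    orb = cv2.ORB_create()
    k_ref, d_ref = orb.detectAndCompute(ref_frame, None)
    k_cur, d_cur = orb.detectAndCompute(cur_frame, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d_cur, d_ref), key=lambda m: m.distance)[:100]

    src = np.float32([k_cur[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)

    h, w = ref_frame.shape[:2]
    return cv2.warpPerspective(cur_frame, H, (w, h)), H
```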
Spherical Buffers We use wrapped buffer offset coordi-
nates in all non-linear layers to align them with the input
(see Figure 3 Accumulated Output).
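A minimal NumPy sketch of such a buffer (the class and method names are hypothetical): per-frame integer offsets are wrapped with a modulo so that memory leaving the field of view on one side is reused for content entering on the other.

```python
import numpy as np

class SphericalBuffer:
    """2D ring buffer: per-frame integer offsets are wrapped modulo the buffer size."""
    def __init__(self, channels, height, width):
        self.data = np.zeros((channels, height, width), dtype=np.float32)
        self.h, self.w = height, width

    def accumulate(self, delta, offset_y, offset_x):
        c, h, w = delta.shape
        ys = (np.arange(h) + offset_y) % self.h      # wrapped row coordinates
        xs = (np.arange(w) + offset_x) % self.w      # wrapped column coordinates
        self.data[:, ys[:, None], xs[None, :]] += delta

    def reset_region(self, ys, xs):
        # Zero the cells that now hold newly unveiled (previously off-screen) pixels.
        self.data[:, ys[:, None], xs[None, :]] = 0.0
```

Negative offsets wrap as well, so panning in any direction reuses the same fixed-size buffer.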
Dynamic Initialization When a new region is first unveiled due to camera motion, MotionDeltaCNN adds biases on the fly to all newly unveiled pixels of the feature maps to allow for seamless integration with previously processed pixels (see Figure 3 Bias).
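In pseudocode terms (a hypothetical helper with PyTorch-style tensors), this amounts to adding the layer bias exactly once to every pixel that appears for the first time.

```python
import torch

def add_bias_to_new_pixels(accum_out, bias, seen_mask, new_mask):
    # Pixels seen before already carry the bias in their accumulated output;
    # newly unveiled pixels receive it exactly once, when they first appear.
    accum_out = accum_out + bias.view(1, -1, 1, 1) * new_mask   # new_mask: (N, 1, H, W) in {0, 1}
    seen_mask = seen_mask | new_mask.bool()
    return accum_out, seen_mask
```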
Padded Convolutions We add additional padding to convolutional layers to process all pixels that are affected by the kernel. These dilated pixels are stored in truncated-values buffers to enable seamless connection of potentially unveiled neighboring regions in upcoming frames (see Figure 5).
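A sketch of this behavior under stated assumptions (stride-1 convolution; the visible mask and truncated-values buffer are hypothetical names and are assumed to match the padded output shape):

```python
import torch
import torch.nn.functional as F

def padded_delta_conv(delta_tile, weight, truncated, visible_mask):
    # Padding of 2r (instead of the usual r) computes every output pixel the kernel
    # can reach, extending r pixels beyond the border of the input tile.
    r = weight.shape[-1] // 2
    out = F.conv2d(delta_tile, weight, padding=2 * r)

    # Outputs inside the currently visible region contribute immediately; the rest is
    # kept in the truncated-values buffer and merged back once the neighboring region
    # is unveiled in a later frame.
    visible = out * visible_mask
    truncated = truncated + out * (1.0 - visible_mask)
    return visible, truncated
```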