[Figure 3 panels: Input (F1, F2), Delta, Convolution + Bias, Accumulated Output]
Figure 3: The main concepts of MotionDeltaCNN. We align frame 2 (F2) with frame 1 (F1) using homography matrices.
For the intersecting region, we propagate only the aligned frame-to-frame differences (Delta). The newly uncovered regions
are propagated directly. The buffer regions that store the newly uncovered regions are reset to zero before processing the
current Delta input. After convolving frame 2, we add the bias to all previously unseen regions and accumulate the result onto
our spherical buffer using the offset coordinates of the current frame.
Furthermore, RRM processes only the convolutional layers on sparse Delta input. All remaining layers are processed on the full, dense feature maps and therefore require expensive value conversions before and after each convolutional layer.
Skip-Convolution [11] and CBInfer [4] improve the concept of RRM by using a per-pixel sparse mask instead of a per-value mask. Like RRM, CBInfer uses a threshold to truncate insignificant updates in the input feature map. This threshold can be tuned per layer for maximum sparsity while keeping the accuracy at a desired level. In contrast, Skip-Convolution decides per output pixel whether an update is required or can be skipped. In both cases, only the convolutional layers (and, in the case of CBInfer, pooling layers) are processed sparsely, requiring expensive conversions between dense and sparse features before and after each of these layers and leaving large performance potential unused.
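To make the dense-sparse round trip concrete, the sketch below is a minimal PyTorch illustration (the function name, tensor layout, and threshold value tau are hypothetical, not taken from the CBInfer or Skip-Convolution implementations): the frame-to-frame difference is thresholded into a per-pixel update mask, only the truncated Delta is convolved, and the result is added back onto the cached dense output.

```python
import torch
import torch.nn.functional as F

def thresholded_delta_conv(prev_input, cur_input, prev_output, weight, tau=0.05):
    # Per-pixel update mask: a pixel counts as updated if any channel changed by more than tau.
    delta = cur_input - prev_input                         # (N, C, H, W)
    mask = delta.abs().amax(dim=1, keepdim=True) > tau     # (N, 1, H, W)

    # Dense -> sparse: truncate insignificant updates.
    sparse_delta = delta * mask

    # Convolving the Delta alone is valid because convolution is linear;
    # the layer bias is already contained in the cached dense output.
    delta_out = F.conv2d(sparse_delta, weight, padding=weight.shape[-1] // 2)

    # Sparse -> dense: accumulate onto the cached output for the next (dense) layer.
    return prev_output + delta_out
```

The two conversions bracketing the convolution are exactly the overhead that end-to-end Delta propagation removes.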
DeltaCNN [26] propagates sparse Delta features end-to-end using a sparse Delta feature map together with an update mask. This greatly reduces the memory bandwidth compared to previous work. Propagating sparse updates end-to-end eliminates the need to convert the feature maps from dense accumulated values to sparse Deltas, and it speeds up other layers bottlenecked by memory bandwidth, such as activation, batch normalization, and upsampling. Using a custom CUDA implementation, DeltaCNN outperforms cuDNN and CBInfer by a large margin on video input. However, DeltaCNN, like all other previous work, assumes a static camera. Even a single pixel of camera motion between two frames can result in nearly dense updates when the image contains high-frequency features.
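The buffer handling this requires at nonlinear layers can be sketched as follows; this is a simplified, hypothetical PyTorch module, not the DeltaCNN CUDA kernels. Convolutions map input Deltas directly to output Deltas, and only nonlinearities keep a dense accumulation buffer from which they emit the Delta of their output.

```python
import torch
import torch.nn.functional as F

class DeltaReLU(torch.nn.Module):
    """Nonlinearity in Delta mode: accumulates incoming Deltas into a dense buffer,
    applies the nonlinearity to the accumulated value, and emits the output Delta."""
    def __init__(self):
        super().__init__()
        self.accum = None      # dense accumulated input (allocated on the first frame)
        self.prev_out = None   # previous dense output of the nonlinearity

    def forward(self, delta):
        if self.accum is None:
            self.accum = torch.zeros_like(delta)
            self.prev_out = torch.zeros_like(delta)
        self.accum += delta
        out = torch.relu(self.accum)
        delta_out = out - self.prev_out
        self.prev_out = out
        return delta_out

def delta_forward(delta_in, conv1_w, conv2_w, relu):
    # Convolutions are linear, so they propagate Deltas without any buffer;
    # only the nonlinearity needs its accumulated state.
    x = F.conv2d(delta_in, conv1_w, padding=1)
    x = relu(x)
    return F.conv2d(x, conv2_w, padding=1)
```

In the actual implementation the nonlinearity is also processed sparsely via the update mask; the buffer here only illustrates why Deltas can flow end-to-end.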
Event Neural Networks show that computational savings are strongly impacted by the intensity of camera motion [8]. With slightly shaking cameras, theoretical FLOP savings were reduced by 40% compared to static-camera videos, while moving cameras reduced them by 60%. Depending on the distribution of the pixel updates, the impact on real hardware is likely even higher.
Incremental Sparse Convolution uses sparse convolutions to incrementally update 3D segmentation masks online [23]. It solves a similar problem of reusing results from previously segmented regions while allowing new regions to be attached on the fly. Due to its reliance on non-dilating 3D convolutions [10], attaching new regions leads to artifacts along the region borders. We solve this issue by using a standard, dilating convolution and by processing all outputs that are affected by the current input.
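To make the difference concrete, the toy comparison below (hypothetical tensors, not taken from [23] or [10]) shows that a standard convolution dilates a single active input pixel into a full kernel-sized output neighborhood, whereas a non-dilating convolution writes only at already-active sites, so outputs next to a newly attached region never receive the contributions a dense pass would give them.

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 7, 7)
x[0, 0, 3, 3] = 1.0                      # single active (updated) input pixel
w = torch.ones(1, 1, 3, 3)

dense = F.conv2d(x, w, padding=1)        # standard conv: the full 3x3 neighborhood is affected
active = (x != 0)                        # non-dilating conv: write only at active input sites
non_dilating = dense * active            # contributions at the border are silently dropped

print((dense != 0).sum().item())         # 9 affected outputs
print((non_dilating != 0).sum().item())  # 1 affected output -> artifacts at region borders
```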
We propose MotionDeltaCNN: building upon DeltaCNN, we design and implement a sparse inference extension that supports moving cameras. Compared to DeltaCNN, we add support for variable-resolution inputs, spherical buffers, dynamic buffer allocation and initialization, and padded convolutions for seamless attachment of newly unveiled regions.
3. Method
MotionDeltaCNN relies on the following concepts:
Frame Alignment The current frame is aligned with the
initial frame of the sequence to maximize the overlap of
consistent features (see Figure 3 Delta).
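One way to obtain such an alignment is a feature-based homography estimate followed by a warp. The sketch below uses OpenCV's ORB features and RANSAC as an illustrative choice; the method itself only assumes that homography matrices relating the frames are available.

```python
import cv2
import numpy as np

def align_to_reference(cur_frame, ref_frame):
    """Warp the current frame into the coordinate system of the reference (first) frame
    so that static scene content overlaps and frame-to-frame Deltas stay sparse."""
    orb = cv2.ORB_create()
    k_ref, d_ref = orb.detectAndCompute(ref_frame, None)
    k_cur, d_cur = orb.detectAndCompute(cur_frame, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(d_cur, d_ref), key=lambda m: m.distance)[:100]

    src = np.float32([k_cur[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)

    h, w = ref_frame.shape[:2]
    return cv2.warpPerspective(cur_frame, H, (w, h)), H
```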
Spherical Buffers We use wrapped buffer offset coordi-
nates in all non-linear layers to align them with the input
(see Figure 3 Accumulated Output).
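A minimal NumPy sketch of such a buffer (the class and method names are hypothetical): per-frame integer offsets are wrapped with a modulo so that memory leaving the field of view on one side is reused for content entering on the other.

```python
import numpy as np

class SphericalBuffer:
    """2D ring buffer: per-frame integer offsets are wrapped modulo the buffer size."""
    def __init__(self, channels, height, width):
        self.data = np.zeros((channels, height, width), dtype=np.float32)
        self.h, self.w = height, width

    def accumulate(self, delta, offset_y, offset_x):
        c, h, w = delta.shape
        ys = (np.arange(h) + offset_y) % self.h      # wrapped row coordinates
        xs = (np.arange(w) + offset_x) % self.w      # wrapped column coordinates
        self.data[:, ys[:, None], xs[None, :]] += delta

    def reset_region(self, ys, xs):
        # Zero the cells that now hold newly unveiled (previously off-screen) pixels.
        self.data[:, ys[:, None], xs[None, :]] = 0.0
```

Negative offsets wrap as well, so panning in any direction reuses the same fixed-size buffer.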
Dynamic Initialization When a new region is first unveiled due to camera motion, MotionDeltaCNN adds biases on the fly to all newly unveiled pixels of the feature maps to allow for seamless integration with previously processed pixels (see Figure 3 Bias).
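In pseudocode terms (a hypothetical helper with PyTorch-style tensors), this amounts to adding the layer bias exactly once to every pixel that appears for the first time.

```python
import torch

def add_bias_to_new_pixels(accum_out, bias, seen_mask, new_mask):
    # Pixels seen before already carry the bias in their accumulated output;
    # newly unveiled pixels receive it exactly once, when they first appear.
    accum_out = accum_out + bias.view(1, -1, 1, 1) * new_mask   # new_mask: (N, 1, H, W) in {0, 1}
    seen_mask = seen_mask | new_mask.bool()
    return accum_out, seen_mask
```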
Padded Convolutions We add additional padding to convolutional layers to process all pixels that are affected by the kernel. These dilated pixels are stored in truncated-values buffers to enable seamless connection of potentially unveiled neighboring regions in upcoming frames (see Figure 5).
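A sketch of this behavior under stated assumptions (stride-1 convolution; the visible mask and truncated-values buffer are hypothetical names and are assumed to match the padded output shape):

```python
import torch
import torch.nn.functional as F

def padded_delta_conv(delta_tile, weight, truncated, visible_mask):
    # Padding of 2r (instead of the usual r) computes every output pixel the kernel
    # can reach, extending r pixels beyond the border of the input tile.
    r = weight.shape[-1] // 2
    out = F.conv2d(delta_tile, weight, padding=2 * r)

    # Outputs inside the currently visible region contribute immediately; the rest is
    # kept in the truncated-values buffer and merged back once the neighboring region
    # is unveiled in a later frame.
    visible = out * visible_mask
    truncated = truncated + out * (1.0 - visible_mask)
    return visible, truncated
```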