1 INTRODUCTION
Event-based or neuromorphic cameras provide many advantages such as high-frequency output,
high dynamic range, and low power consumption. However, their sensor output is a sparse,
asynchronous image representation, which is fundamentally different from traditional, dense images.
This hinders the use of convolutional layers, an essential building block of current
state-of-the-art image-processing networks. Classical convolutions on sparse data, as produced
e.g. by event cameras, are inefficient, as a large part of the computed feature map defaults to zero.
Furthermore, the sparsity of the data is quickly lost, as the non-zero sites spread rapidly with each
convolution. To alleviate this problem, changes to the convolutional layers have been proposed.
Sparse convolutional layers [6] compute convolutions only at active (i.e. non-zero) sites. The
sub-type of 'valid' or 'submanifold' sparse convolutional layers furthermore tries to preserve the
sparsity of the data by producing output signals only at active sites, which makes them highly
efficient at the cost of restricting signal propagation. Non-valid sparse convolutions are semantically
equivalent to dense convolution layers in that they compute the same result given identical inputs.
Valid or submanifold sparse convolution layers, on the other hand, differ from dense convolutions,
but still provide a good approximation of full convolutions on sparse data.
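The following sketch shows a submanifold sparse convolution built with SCN; it is modeled on the
SparseConvNet example code, so the exact signatures may differ between library versions:

import torch
import sparseconvnet as scn

# 2D submanifold convolution: dimension, nIn, nOut, filter size, bias.
model = scn.Sequential() \
    .add(scn.InputLayer(2, torch.LongTensor([64, 64]))) \
    .add(scn.SubmanifoldConvolution(2, 1, 16, 3, False)) \
    .add(scn.SparseToDense(2, 16))

# Three active sites, given as (y, x, batch index) with one feature each.
locations = torch.LongTensor([[10, 10, 0], [10, 11, 0], [40, 20, 0]])
features = torch.ones(3, 1)

out = model([locations, features])
# Outputs are produced only at the three active sites; the densified
# tensor is zero everywhere else.
print(out.shape)  # torch.Size([1, 16, 64, 64])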
Messikommer et al. [8] further introduce asynchronicity into the network. This allows
samples to be fed into the network in parts, as they are produced by the sensor, and thus reduces
latency in real-time applications. Several small batches of events from the same sample can be
processed sequentially, producing identical results to synchronous layers once the whole sample
has been processed. However, [8] only implemented a proof of concept: the project only includes
asynchronous submanifold sparse convolutional and batch-norm layers, whereas the sparseconvnet
(SCN) project [6] provides a full-fledged library. Furthermore, asynchronous models cannot be
trained, as the index_add operation used in the forward function is not supported by PyTorch's
automatic gradient tracking. This, however, is not a problem in practice, as each layer is functionally
equivalent to its SCN counterpart: it is possible to train an architecturally identical SCN
network and transfer the weights. As the asynchronous property is only relevant during inference,
this does not pose a limitation.
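Such a weight transfer can be as simple as copying the trained parameters layer by layer, provided
both models enumerate their layers in the same order; a minimal sketch (the one-to-one parameter
correspondence and the model names are assumptions of this illustration):

import torch

def transfer_weights(scn_model, asyn_model):
    # Copy trained SCN parameters into an architecturally identical
    # asynchronous model (illustration only; assumes both models list
    # their parameters in the same order with matching shapes).
    with torch.no_grad():
        for scn_param, asyn_param in zip(scn_model.parameters(),
                                         asyn_model.parameters()):
            assert scn_param.shape == asyn_param.shape
            asyn_param.copy_(scn_param)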
An alternative approach is to first convert the sparse event representation to dense frames,
using a learning-based approach [9]. This way, however, one loses all computational advantages
that the sparse representation offers. Notably, it is also possible to synthesize events from a dense
frame-based representation [3].
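For intuition, the simplest, non-learned variant of such a conversion just accumulates event
polarities into a dense frame; [9] instead learns the reconstruction. A minimal sketch, assuming
events are given as (x, y, t, p) tuples:

import numpy as np

def events_to_frame(events, height, width):
    # Naive event-to-frame conversion: accumulate signed polarities
    # into a dense 2D histogram.
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y, t, p in events:
        frame[y, x] += 1.0 if p > 0 else -1.0
    return frame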
Furthermore, Cannici et al. [1] proposed a leaky surface layer, which integrates the
event-to-frame conversion directly into the target network. This way the network becomes stateful
and resembles a spiking model [7].
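Conceptually, such a leaky surface maintains a per-pixel state that is incremented by incoming
events and decays as time passes; the following sketch captures this idea (the linear decay and
parameter names are our assumptions, see [1] for the exact formulation):

import torch

class LeakySurface(torch.nn.Module):
    # Stateful event-to-frame layer: events increment the surface,
    # which leaks towards zero between updates.

    def __init__(self, height, width, decay=1e-3):
        super().__init__()
        self.decay = decay
        self.register_buffer("surface", torch.zeros(height, width))
        self.last_t = 0.0

    def forward(self, events):
        # events: iterable of (x, y, t, p) rows, sorted by timestamp t
        for x, y, t, p in events:
            leak = self.decay * (float(t) - self.last_t)
            self.surface = (self.surface - leak).clamp(min=0)
            self.surface[int(y), int(x)] += 1.0
            self.last_t = float(t)
        return self.surface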
1.1 Contributions and outline
In this work, we use the YOLO v1 model [10] as a simple but powerful dense object-detection
baseline. We model sparse networks architecturally identical to YOLO v1 using the SCN [6] and
asynet [8] frameworks. These serve as a case study to evaluate the performance of sparse and
asynchronous vs. dense object detection.
We implement all variants in PyTorch and evaluate their predictive performance and runtime
requirements against the dense variant. To this end, we convert the KITTI Vision dataset to events
using [3]. This allows us to answer the question of whether these novel technologies are a viable
optimization over dense convolutional layers, or whether they fall short of the expectations in practice.
The remainder of this work is structured as follows: First, section 2 introduces the data formats
required in the following sections. Next, section 3 details the major changes and additions to