(ii) Events generated by the same object are spatially close
to each other.
Based on these two properties, we aim to group events
along two dimensions, spatial and temporal, to isolate events
by their source objects and thus identify object boundaries.
Biologically inspired spiking neurons, specifically Leaky
Integrate and Fire (LIF) neurons [9], can leverage spatio-temporal
information and are hence well suited for detecting boundaries
along both of these dimensions. These neurons mimic biological
neural activity and respond to the temporal properties of their
inputs: a neuron generates an output spike only if its input
events arrive at a rate above a certain frequency. A neural
architecture built from such spiking neurons is therefore
sensitive to the temporal structure of the input events. The
connectivity between the neurons in a network can also be
arranged so that spatially close events support each other.
Further, spiking neurons operate asynchronously, making them
compatible with the asynchronous outputs of event cameras.
These properties, together with the fact that spiking neural
architectures can be more energy efficient than their artificial
neural counterparts (as shown in [10]–[12]), make them an ideal
candidate for object detection with low latency and energy
overhead in autonomous navigation systems.
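To make the temporal behavior concrete, the following is a minimal sketch of an event-driven LIF neuron. It is illustrative only: the exact model details and all parameter values here are assumptions, not those of our architecture. The membrane potential leaks between events, so only a sufficiently high input event rate pushes it over the firing threshold.

```python
import math

class LIFNeuron:
    """Event-driven leaky integrate-and-fire neuron (illustrative sketch).
    The membrane potential decays between input events and is incremented
    on each event; a spike is emitted when it crosses a threshold."""

    def __init__(self, tau=10e-3, threshold=1.0, weight=0.3):
        self.tau = tau              # leak time constant in seconds (assumed value)
        self.threshold = threshold  # firing threshold (assumed value)
        self.weight = weight        # potential increment per event (assumed value)
        self.v = 0.0                # membrane potential
        self.t_last = 0.0           # timestamp of the previous input event

    def receive(self, t):
        """Integrate one input event at time t; return True on a spike."""
        # Leak: exponential decay of the potential since the last event.
        self.v *= math.exp(-(t - self.t_last) / self.tau)
        self.t_last = t
        # Integrate: accumulate the incoming event's contribution.
        self.v += self.weight
        # Fire: only events arriving faster than the leak can dissipate
        # them push the potential over the threshold.
        if self.v >= self.threshold:
            self.v = 0.0  # reset after the spike
            return True
        return False
```

With these illustrative values (tau = 10 ms, weight = 0.3, threshold = 1.0), the neuron spikes only when successive events arrive within a few milliseconds of each other; slower event streams decay away before reaching the threshold.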
In this work, we develop a novel, energy-efficient object
detection technique that first isolates objects based on their
speed of movement, using inputs from an event camera and a
lightweight, single-layer spiking neural network. Once objects
are separated by speed, their corresponding events are grouped
based on their spatial characteristics. For this, we utilize
existing clustering techniques to further separate events
belonging to different objects based on their spatial proximity.
Here, we take advantage of the fact that events from the same
object share similar temporal characteristics, such as speed of
movement, and spatial characteristics, such as originating from
pixels that are close together. Our experiments show that this
approach detects objects more efficiently in terms of both
latency and energy consumption. By isolating events based on
the speed of their corresponding objects, we also eliminate
non-relevant information caused by noise and static background
objects. (If static objects must also be detected, this can be
done by clustering the residual events that do not propagate
through the spiking architecture.) Further, this benefits the
clustering techniques, whose operational complexity is lowered
by the reduced number of samples (events) that need to be
clustered.
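The sketch below shows how the two stages could compose, reusing the hypothetical LIFNeuron class above. DBSCAN stands in for any clustering method that needs no prior knowledge of the cluster count, and the parameter values are placeholders; the sketch also batches events for readability, whereas the actual system operates asynchronously as events arrive.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_objects(events, lif_grid, eps=5.0, min_samples=10):
    """Illustrative two-stage pipeline. `events` is an iterable of
    (x, y, t) tuples; `lif_grid` is a 2D list of LIFNeuron objects,
    one per pixel. eps and min_samples are assumed DBSCAN settings."""
    # Stage 1 (temporal): route each event to the LIF neuron at its
    # pixel and keep only events whose neuron spikes, i.e., events
    # arriving fast enough to beat the leak. Noise and slow background
    # events are filtered out here.
    kept = [(x, y) for (x, y, t) in events if lif_grid[y][x].receive(t)]
    if not kept:
        return []

    # Stage 2 (spatial): group the surviving events by pixel proximity.
    # DBSCAN requires no prior knowledge of the number or size of clusters.
    pts = np.array(kept)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)

    # Report one bounding box per cluster (label -1 is DBSCAN's noise label).
    boxes = []
    for lbl in set(labels) - {-1}:
        cluster = pts[labels == lbl]
        boxes.append((cluster.min(axis=0), cluster.max(axis=0)))
    return boxes
```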
We summarize the main contributions as follows:
1) We develop an object detection algorithm that solely
relies on the event camera outputs and does not need any
additional information from traditional frame cameras.
2) Unlike many existing works on object detection, which
accumulate events over a time duration to create a frame,
we perform object detection asynchronously as the events
are generated by the event camera.
3) The proposed spike-based network for separating ob-
jects based on their motion consists of a single layer,
resulting in a detection algorithm with lower latency and
energy overhead.
4) The outputs of the proposed spiking architecture (which
isolates objects based on their speed), can be used with
any spatial clustering technique that does not require
prior knowledge of the number or the size of clusters.
5) The spiking architecture is scene-independent: its
parameters do not need to be trained for the scene of
deployment. They directly correspond to the speed of
the objects and can be fine-tuned prior to deployment.
II. RELATED WORK
A. Object Detection in the Event Camera Domain
Object detection is a topic that has been extensively
studied in the computer vision community. There have been
works ranging from simple feature detectors [6], [13], to
more complex learning-based methods [14]. There has also
been significant interest in the learning community in
neural networks that not only detect objects but also
classify them [7]. However, in autonomous navigation, where
object classification is not a priority, the latency and
energy efficiency of the underlying algorithms take precedence.
As discussed earlier, traditional frame-based algorithms
fail to operate on event camera outputs due to the absence of
photometric characteristics such as texture and light intensity.
However, owing to the numerous advantages of event cam-
eras, including higher operating speed, wider dynamic range,
and lower power consumption, there has been a substantial
interest in the community to develop algorithms that are more
suited towards this domain.
Initial event-based detection algorithms, such as [15], [16],
focused on detecting patterns present in the event camera
output. The authors in [17] used a simple blob detector
to detect the inherent patterns present in the event data, and
[18] used a plane fitting method on the distribution of events
to identify corners. A recent work adapted Gaussian mixture
modelling to detect patterns in the event data [19]. These
methods, however, fail in scenarios where there are events
generated by the background. As a solution, [20] proposed a
motion compensation technique to eliminate events generated
by the background, by estimating the system’s ego-motion.
The optimization involved, however, adds significant latency
and computational overhead to the system.
To improve detection accuracy, several recent research
efforts have focused on utilizing information from both
frame and event cameras [21], [22]. These hybrid methods
detect features on the frames and track the objects through
events. Since their detection relies on frame inputs, they
cannot operate in scenarios with a wide dynamic range and
are computationally expensive.