of-the-art techniques that use architectures similar to ours,
but without explicit perspective handling. The implementation
of our method is available at https://github.com/cvlab-epfl/perspective-aware-obstacles.
II. RELATED WORK
A complete overview of state-of-the-art road anomaly detection methods can be found in [8]. In short, many of the most effective monocular methods, like ours, generate synthetic training data to compensate for the lack of a sufficiently diverse annotated road obstacle dataset. We therefore focus on these methods, and then discuss other attempts at exploiting perspective information for diverse tasks.
A. Synthetic Training Data for Obstacle Detection
There is a practically unbounded variety of unexpected objects that can pose a collision threat on roads. To handle this diversity, most existing obstacle detection methods rely on creating synthetic training data. This data is typically created from background traffic frames, often taken from Cityscapes [10], into which synthetic obstacles are inserted.
In [2], the synthetic anomalies are generated by altering the
semantic class of existing object instances and synthesizing an
image from those altered labels. In [3], this is complemented
by adding the Cityscapes void regions as obstacles. However,
many of the objects exploited by these techniques are located
above or away from the road, and the resulting training data
only yields limited performance for small on-road obstacles.
Our results show that we outperform these methods.
In [7], synthetic obstacles are obtained by cropping random polygons within the background frame and copying their content onto the road, or filling them with a random color. Other
methods [4], [6], [11] inject object instances extracted from
various image datasets. While this can be done effectively, it
remains suboptimal because the objects are placed at random
locations, without accounting for their size or for the scene
geometry. This is what we address here by explicitly exploiting
perspective information, and we demonstrate that it yields a
substantial performance boost.
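To make this limitation concrete, the sketch below is our own minimal reconstruction of the copy-paste scheme of [7] described above; it is not the authors' code, and the function name and parameter values are illustrative. Note that the obstacle radius and location are drawn independently of scene depth, which is precisely the shortcoming that perspective-aware placement addresses.

```python
import numpy as np
from skimage.draw import polygon  # rasterizes polygon vertices into pixel indices

def inject_random_polygon(image, road_mask, rng, fill_prob=0.5, n_vertices=6):
    """Paste a random polygon onto the road as a stand-in obstacle.

    image:     (H, W, 3) uint8 background traffic frame.
    road_mask: (H, W) bool, True where the road surface is visible.
    Returns the augmented image and the obstacle mask used as a training label.
    """
    h, w = road_mask.shape
    # Sample a destination center uniformly over road pixels, ignoring
    # scene geometry entirely.
    ys, xs = np.nonzero(road_mask)
    i = rng.integers(len(ys))
    cy, cx = int(ys[i]), int(xs[i])

    # Random polygon around the center. The radius is drawn independently of
    # depth, so distant obstacles come out unrealistically large in pixels.
    radius = rng.integers(10, 60)
    angles = np.sort(rng.uniform(0.0, 2.0 * np.pi, n_vertices))
    rr, cc = polygon(cy + radius * np.sin(angles),
                     cx + radius * np.cos(angles), shape=(h, w))

    out = image.copy()
    if rng.random() < fill_prob:
        out[rr, cc] = rng.integers(0, 256, size=3)  # fill with a random color
    else:
        # Copy content from a nearby region of the same frame.
        dy, dx = rng.integers(-50, 51, size=2)
        out[rr, cc] = image[np.clip(rr + dy, 0, h - 1), np.clip(cc + dx, 0, w - 1)]

    label = np.zeros((h, w), dtype=bool)
    label[rr, cc] = True
    return out, label
```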
B. Exploiting Perspective Information
Earlier works [12], [13] propose a lightweight sliding-window classifier of drivable space using a pyramid of input patches whose dimensions depend on their distance from the horizon. These patches are then rescaled according to their distance to the camera, ensuring that similar obstacles have similar pixel sizes when presented to the classifier, regardless of the effects of perspective in the original image. This use of perspective information to overcome scale variance is effective, but it cannot easily be combined with standard CNNs, which operate on the whole image rather than on individually rescaled patches.
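For illustration, the following is a minimal sketch of this patch-rescaling idea under a pinhole, flat-ground assumption; the function name and all parameter values are ours and purely illustrative.

```python
import numpy as np
from skimage.transform import resize  # anti-aliased rescaling

def extract_rescaled_patch(image, u, v, f=720.0, cam_height=1.5,
                           horizon_v=360.0, obj_size_m=0.4, out_px=32):
    """Crop a patch centered at pixel (u, v) whose side matches the expected
    pixel footprint of an obj_size_m-wide object on the ground plane, then
    rescale it to the classifier's fixed input size.

    Pinhole ground-plane model: a road point imaged at row v lies at depth
    z = f * cam_height / (v - horizon_v), so the object spans roughly
    f * obj_size_m / z pixels.
    """
    z = f * cam_height / max(v - horizon_v, 1e-3)   # depth in meters
    side = int(round(f * obj_size_m / z))           # projected size in pixels
    half = max(side // 2, 1)
    patch = image[int(v) - half:int(v) + half,
                  int(u) - half:int(u) + half]      # no border handling here
    # After this normalization, similar obstacles cover a similar number of
    # pixels regardless of where they appear in the image.
    return resize(patch, (out_px, out_px), anti_aliasing=True)
```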
For any perspective camera, distortion depends on image
position. A popular approach to enabling a deep network
to account for this in its predictions is to provide it with
pixel coordinates as input. In [14], [15], [16], this is achieved by treating normalized pixel coordinates as two additional input channels. In [17], the pixel coordinates are used to compute an attention map, exploiting the fact that the class distribution correlates with image height; for example, the sky class appears predominantly at the top of the image. Another way to implicitly account for perspective effects is to introduce extra network branches that process the image at different scales and fuse the results [18], [19]. However, this strategy, like those relying on pixel coordinates, does not explicitly leverage the perspective information available when working with a calibrated camera, as is typically the case in self-driving.
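The coordinate-channel input of [14], [15], [16] is simple to express; the following is a generic sketch of the idea rather than any particular published implementation.

```python
import torch

def append_coord_channels(x):
    """Append normalized pixel coordinates as two extra channels, so that a
    convolutional network can condition its predictions on image position.

    x: (N, C, H, W) image batch -> (N, C + 2, H, W)
    """
    n, _, h, w = x.shape
    ys = torch.linspace(-1.0, 1.0, h, device=x.device)
    xs = torch.linspace(-1.0, 1.0, w, device=x.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")        # each (H, W)
    coords = torch.stack([yy, xx]).expand(n, -1, -1, -1)  # (N, 2, H, W)
    return torch.cat([x, coords], dim=1)
```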
None of these obstacle-detection algorithms explicitly accounts for the relationship between projected object size and distance. This relationship can be encoded in scale maps that specify the expected real-world size covered by an image pixel as a function of its position. Scale maps have been used for obstacle and
anomaly detection [20], [21]. In [20], the scale information
is used to crop and resize image regions before passing them
to a vehicle detection network, which then sees the cars at an approximately constant scale. This requires running
the detector multiple times on the crops. By contrast, our
method processes the whole image at once, and the model
learns how to leverage the perspective information to adjust
the features. In [21], the scale maps are used to rectify the road
surfaces, and obstacles are then detected in the rectified views.
Unlike these methods, we exploit perspective maps as input to
our network, instead of using them for image pre-processing.
This avoids the visual artifacts caused by image warping and yields higher accuracy, as we show in our experiments.
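As a concrete illustration of what a scale map contains, the sketch below derives one from a flat-road, pinhole-camera assumption; the parameterization is ours and purely illustrative.

```python
import torch

def ground_plane_scale_map(h, w, f=720.0, cam_height=1.5, horizon_v=360.0):
    """Build a per-pixel scale map S where S[v, u] approximates the size in
    meters covered by one pixel, assuming the pixel shows a flat road seen
    by a pinhole camera of focal length f mounted at height cam_height.

    A ground point imaged at row v lies at depth z = f * cam_height / (v - horizon_v),
    and at depth z one pixel subtends about z / f meters, which simplifies to
    cam_height / (v - horizon_v).
    """
    v = torch.arange(h, dtype=torch.float32).unsqueeze(1).expand(h, w)
    # Rows at or above the horizon do not intersect the road; clamp them.
    return cam_height / (v - horizon_v).clamp(min=1.0)
```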
Scale maps have also been extensively investigated for
crowd counting purposes [22], [23], [24], [25], [26], [27].
In [22], [23], the models predict perspective information based
on observed body and head size. In [24], an unsupervised
meta-learning method is deployed to learn perspective maps,
which are then used to warp the input images so that they
depict a uniform scale, as in [25]. In [26], a scale map serves
as an extra channel alongside the RGB image and is passed
through the backbone feature extractor, whereas in [27] an
additional branch is added to the backbone to process the
single-channel scale map and to concatenate the resulting
features afterwards. In short, perspective information is used
during feature computation. In this paper, we follow a different
track and incorporate the scale map at different levels of a
feature pyramid network. Our experiments show this to be
more effective. Furthermore, we argue and demonstrate that,
for anomaly detection, incorporating perspective information
into the network is not enough; one must also exploit it when
synthesizing training data.
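One plausible way to realize such a pyramid-level fusion is sketched below; the exact design is detailed later in the paper, so this should be read as a schematic illustration rather than a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareFusion(nn.Module):
    """Fuse a single-channel perspective scale map into each level of a
    feature pyramid: downsample the map to the level's resolution,
    concatenate it, and project back to the original channel count."""

    def __init__(self, channels_per_level):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c + 1, c, kernel_size=1) for c in channels_per_level
        )

    def forward(self, pyramid, scale_map):
        # pyramid: list of (N, C_i, H_i, W_i) tensors; scale_map: (N, 1, H, W)
        fused = []
        for feat, proj in zip(pyramid, self.proj):
            s = F.interpolate(scale_map, size=feat.shape[-2:],
                              mode="bilinear", align_corners=False)
            fused.append(proj(torch.cat([feat, s], dim=1)))
        return fused
```

Because the 1x1 projection restores each level's channel count, a fusion of this kind can be inserted into an existing pyramid without modifying the downstream prediction heads.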
C. Fusing RGB and Depth for Road Obstacle Detection
When depth information is available, from stereo camera
disparity or RGB-D sensors, it can be fused with the RGB
appearance to improve obstacle detection. For example, [28]
combines semantic segmentation with stereo-based detections;
MergeNet [29] extracts complementary features from RGB-D;
RFNet [30]’s two-stream backbone extracts RGB and depth