Perspective Aware Road Obstacle Detection

© 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Krzysztof Lis, Sina Honari, Pascal Fua, and Mathieu Salzmann
Abstract—While road obstacle detection techniques have become increasingly effective, they typically ignore the fact that, in practice, the apparent size of obstacles decreases as their distance to the vehicle increases. In this paper, we account for this by computing a scale map encoding the apparent size of a hypothetical object at every image location. We then leverage this perspective map to (i) generate training data by injecting onto the road synthetic objects whose size corresponds to the perspective foreshortening; and (ii) incorporate perspective information in the decoding part of the detection network to guide the obstacle detector. Our results on standard benchmarks show that, together, these two strategies significantly boost the obstacle detection performance, allowing our approach to consistently outperform state-of-the-art methods in terms of instance-level obstacle detection.
Index Terms—Computer Vision for Transportation, Data Sets for Robotic Vision, Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization.
I. INTRODUCTION
VISION-BASED driving assistance is now commercially available [1] and enables vehicles to plan a path within the predicted drivable space while avoiding other traffic. However, unusual and unexpected obstacles lying on the road remain a potential danger. Since not every vehicle has stereo cameras or a LiDAR sensor to detect them in 3D, much effort has recently been devoted to detecting them in a monocular fashion via learning-based strategies. Such road obstacle detection can also benefit robots operating in novel environments. Given that the set of such objects is open-ended, obtaining exhaustive datasets of real images annotated with such obstacles for training purposes is impractical. Hence, many state-of-the-art deep learning approaches [2], [3], [4], [5] rely on synthetically generated training data, e.g., obtained by cutting out objects and inserting them into individual frames of the Cityscapes dataset.

However, these methods fail to leverage, both while generating training data and performing the actual detection, the predictable perspective foreshortening in images captured by vehicles' front-facing cameras. It is standard practice [4], [6], [7] to insert objects of arbitrary sizes at any image location in the training data and to detect objects at multiple scales irrespective of where they appear in the image. This does not
exploit the well-known fact that more distant objects tend to be smaller and that, given a calibrated camera, the relationship between real and projected sizes is known.

Manuscript received: September 13, 2022; Revised December 21, 2022; Accepted February 9, 2023. This paper was recommended for publication by Associate Editor I. Gilitschenski and Editor C. Cadena Lerma upon evaluation of the reviewers' comments. The work was supported in part by the International Chair Drive for All - MINES ParisTech - Peugeot-Citroën - Safran - Valeo.

All authors are with the Computer Vision Laboratory, EPFL, Lausanne, Switzerland (krzysztof.lis@epfl.ch, lis.krzysztof@protonmail.com; sina.honari@gmail.com; pascal.fua@epfl.ch; mathieu.salzmann@epfl.ch).

Digital Object Identifier (DOI): 10.1109/LRA.2023.3245410

Figure 1: Far and relevant vs. close and irrelevant. (a) Original image. The green circle denotes a real obstacle far away, and the red circle indicates nearby but harmless leaves. (b) The perspective map indicates, at each pixel, the size in pixels of a hypothetical meter-wide object at that location. (c) Our approach uses the perspective map to distinguish relevant objects from irrelevant ones. It correctly flags in red the pixels of the real obstacle while ignoring the leaves. (d) Without the perspective-aware training set, a network with a similar architecture flags them all.
In this work, we show that leveraging the perspective information substantially increases performance. To this end, as shown in Fig. 1, we compute a scale map whose pixel values denote the apparent size in pixels of a hypothetical meter-wide object placed at that point on the road. We then exploit this information in two complementary ways:

Perspective-Aware Synthetic Object Injection. Instead of uniformly injecting synthetic objects into road scenes to synthesize training data, as in [4], [6], [7], we use the perspective map to appropriately set the projected size of the objects we insert.

Perspective-Aware Architecture. We feed the perspective map at multiple levels of a feature pyramid network, enabling it to learn the realistic relationship between distance and size embodied in our training set and in real road scenes.
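To make the scale-map construction concrete, the sketch below derives it from pinhole geometry under the common flat-road assumption: a road point imaged at row v has depth Z = f·h/(v − c_y), so a W-meter-wide object standing there spans f·W/Z pixels. The function name and all parameter values are illustrative assumptions, not taken from the paper or its released code.

```python
import numpy as np

def scale_map(height_px, width_px, f, cy, cam_height_m, obj_width_m=1.0):
    """Per-pixel apparent size (in pixels) of a hypothetical object of
    width `obj_width_m` standing on a flat road.

    Assumes a pinhole camera with focal length `f` (pixels) and horizon
    row `cy`, mounted `cam_height_m` meters above a flat, level road with
    zero pitch. A simplified stand-in for the paper's scale map.
    """
    v = np.arange(height_px, dtype=np.float32)[:, None]  # image rows, shape (H, 1)
    depth = np.full_like(v, np.inf)                      # rows at or above the horizon: no road
    below = v > cy
    depth[below] = f * cam_height_m / (v[below] - cy)    # Z = f * h / (v - cy)
    size_px = f * obj_width_m / depth                    # s = f * W / Z, zero above the horizon
    return np.tile(size_px, (1, width_px))               # constant along each image row

# Illustrative values, loosely modeled on a Cityscapes-like camera setup:
smap = scale_map(1024, 2048, f=2262.0, cy=513.0, cam_height_m=1.5)
```

Under this model the map depends only on the image row, which matches the intuition of Fig. 1(b): sizes shrink toward the horizon and vanish above it.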
The bottom portion of Fig. 1 illustrates the benefits of our approach. It not only detects small far-away obstacles but also avoids false alarms arising from small irregularities near the car, such as the leaves here, because their size at this image location does not match that of real threats to the vehicle. Our results show that these strategies together significantly improve the accuracy of road obstacle detection, particularly in terms of instance-level detection, which is critical for a self-driving car that needs to identify all potential hazards on the road.
We evaluate our approach on the Segment Me If You Can [8] benchmark's obstacle track and the Lost&Found [9] test subset. We demonstrate that it significantly outperforms state-of-the-art techniques that use architectures similar to ours, but without explicit perspective handling. The implementation of our method is available at https://github.com/cvlab-epfl/perspective-aware-obstacles.
II. RELATED WORK
A complete overview of state-of-the-art road anomaly detection methods can be found in [8]. In short, many of the most effective monocular methods, like ours, generate synthetic training data to compensate for the lack of a sufficiently diverse annotated road obstacle dataset. We therefore focus on these methods, and then discuss other attempts at exploiting perspective information for diverse tasks.
A. Synthetic Training Data for Obstacle Detection
There is an intractable variety of unexpected objects that can pose a collision threat on roads. To handle this diversity, most existing obstacle detection methods rely on creating synthetic training data. It is typically built from background traffic frames, often taken from Cityscapes [10], into which synthetic obstacles are inserted.

In [2], the synthetic anomalies are generated by altering the semantic class of existing object instances and synthesizing an image from the altered labels. In [3], this is complemented by adding the Cityscapes void regions as obstacles. However, many of the objects exploited by these techniques are located above or away from the road, and the resulting training data only yields limited performance for small on-road obstacles. Our results show that we outperform these methods.
In [7], synthetic obstacles are obtained by cropping random polygons within the background frame and copying their content onto the road, or filling them with a random color. Other methods [4], [6], [11] inject object instances extracted from various image datasets. While this can be done effectively, it remains suboptimal because the objects are placed at random locations, without accounting for their size or for the scene geometry. This is what we address here by explicitly exploiting perspective information, and we demonstrate that it yields a substantial performance boost.
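As a concrete illustration of the difference with uniform injection, the sketch below pastes an object cutout onto a randomly chosen road pixel after resizing it according to the scale map, so that its projected width matches the local perspective. The helper names and the alpha-compositing details are our own assumptions, not the authors' pipeline.

```python
import numpy as np
import cv2

def inject_obstacle(frame, road_mask, smap, cutout_rgba, obj_width_m, rng):
    """Paste an object cutout onto a random road pixel, sized by the
    perspective map so its projected width matches the local scale.

    `cutout_rgba` is an RGBA crop assumed to show an object that is
    `obj_width_m` meters wide; `smap` is the per-pixel width of a 1 m
    object (see the scale_map sketch above). Illustrative only.
    """
    ys, xs = np.nonzero(road_mask)
    i = rng.integers(len(ys))
    y, x = int(ys[i]), int(xs[i])                        # random on-road anchor point
    target_w = max(2, int(obj_width_m * smap[y, x]))     # perspective-consistent pixel width
    s = target_w / cutout_rgba.shape[1]
    obj = cv2.resize(cutout_rgba, None, fx=s, fy=s, interpolation=cv2.INTER_AREA)
    h, w = obj.shape[:2]
    y0, x0 = y - h, x - w // 2                           # bottom edge rests on the anchor
    if y0 < 0 or x0 < 0 or x0 + w > frame.shape[1]:
        return frame                                     # skip placements that leave the image
    alpha = obj[..., 3:4].astype(np.float32) / 255.0
    roi = frame[y0:y, x0:x0 + w].astype(np.float32)
    frame[y0:y, x0:x0 + w] = (alpha * obj[..., :3] + (1.0 - alpha) * roi).astype(np.uint8)
    return frame
```

A uniform-injection baseline would simply draw `target_w` at random; the single line tying it to `smap[y, x]` is what makes the placement perspective-aware.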
B. Exploiting Perspective Information
Earlier works [12], [13] propose a lightweight sliding-window classifier of drivable space using a pyramid of input patches whose dimensions depend on their distance from the horizon. These patches are rescaled according to their distance to the camera, ensuring that similar obstacles have similar pixel sizes when presented to the classifier, regardless of the effects of perspective in the original image. This use of perspective information to overcome scale variance is effective, but it cannot easily be combined with standard CNNs, which operate on the whole image rather than on individually rescaled patches.
For any perspective camera, distortion depends on image position. A popular approach to enabling a deep network to account for this in its predictions is to provide it with pixel coordinates as input. In [14], [15], [16], this is achieved by treating normalized pixel coordinates as two additional channels. In [17], the pixel coordinates are used to compute an attention map, exploiting the fact that the class distribution correlates with the image height; for example, the sky class appears predominantly at the top of the image. Another way to implicitly account for perspective effects is to introduce extra network branches that process the image at different scales and fuse the results [18], [19]. However, this strategy, like those relying on pixel coordinates, does not explicitly leverage the perspective information available when working with a calibrated camera, as is typically the case in self-driving.
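A minimal sketch of the pixel-coordinate idea of [14], [15], [16], assuming the usual formulation in which normalized coordinates are appended as extra input channels (PyTorch, illustrative only):

```python
import torch

def add_coord_channels(x: torch.Tensor) -> torch.Tensor:
    """Append normalized (row, column) coordinate channels to a batch of
    images, in the spirit of the extra-channel inputs of [14], [15], [16].
    x has shape (B, C, H, W); the result has shape (B, C + 2, H, W)."""
    b, _, h, w = x.shape
    rows = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
    cols = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([x, rows, cols], dim=1)
```

Such channels tell the network where a pixel is, but not how large a real object would appear there; that mapping is exactly what a calibrated scale map supplies.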
None of these obstacle-detection algorithms explicitly accounts for the relationship between projected object size and distance. This can be done by creating scale maps that encode the expected real-world size of an image pixel depending on its position. Scale maps have been used for obstacle and anomaly detection [20], [21]. In [20], the scale information is used to crop and resize image regions before passing them to a vehicle detection network, which thus views the cars at an approximately constant scale. This requires running the detector multiple times on the crops. By contrast, our method processes the whole image at once, and the model learns how to leverage the perspective information to adjust its features. In [21], the scale maps are used to rectify the road surfaces, and obstacles are then detected in the rectified views. Unlike these methods, we exploit perspective maps as input to our network, instead of using them for image pre-processing. This avoids the visual artifacts caused by image warping, which yields higher accuracy, as we will show in our experiments.
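For contrast, a rectification-based pipeline in the spirit of [21] might warp the road to a top-down view of uniform scale before detection. The sketch below, with entirely illustrative parameters, shows such a warp and hints at the interpolation artifacts that motivate our input-map alternative.

```python
import numpy as np
import cv2

def rectify_road(image, src_quad, metres_wh, px_per_m=50):
    """Warp a quadrilateral road patch to a top-down view of uniform scale,
    the kind of pre-processing used by rectification-based detectors.

    `src_quad` lists the image corners (TL, TR, BR, BL) of a road patch
    whose real extent is `metres_wh` = (width_m, length_m); all values
    here are illustrative.
    """
    w_m, l_m = metres_wh
    dst_w, dst_h = int(w_m * px_per_m), int(l_m * px_per_m)
    dst_quad = np.float32([[0, 0], [dst_w, 0], [dst_w, dst_h], [0, dst_h]])
    H = cv2.getPerspectiveTransform(np.float32(src_quad), dst_quad)
    # Distant rows of the source image are stretched heavily here; the
    # resulting interpolation artifacts are what our input-map approach avoids.
    return cv2.warpPerspective(image, H, (dst_w, dst_h))
```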
Scale maps have also been extensively investigated for crowd counting [22], [23], [24], [25], [26], [27]. In [22], [23], the models predict perspective information based on observed body and head sizes. In [24], an unsupervised meta-learning method is deployed to learn perspective maps, which are then used to warp the input images so that they depict a uniform scale, as in [25]. In [26], a scale map serves as an extra channel alongside the RGB image and is passed through the backbone feature extractor, whereas in [27] an additional branch is added to the backbone to process the single-channel scale map and concatenate the resulting features afterwards. In short, perspective information is used during feature computation. In this paper, we follow a different track and incorporate the scale map at different levels of a feature pyramid network. Our experiments show this to be more effective. Furthermore, we argue and demonstrate that, for anomaly detection, incorporating perspective information into the network is not enough; one must also exploit it when synthesizing the training data.
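The decoder-side integration can be pictured as follows: at each pyramid level, the full-resolution perspective map is resampled to that level's grid and concatenated with the features before convolution. The sketch below is a hedged approximation; channel counts and fusion details are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerspectiveFPNLevel(nn.Module):
    """One decoder level that fuses pyramid features with the scale map.

    Sketch only: channel counts and fusion details are assumptions. The
    point is that the scale map is resampled and injected at every
    pyramid resolution, not only at the network input.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 1, out_ch, kernel_size=3, padding=1)

    def forward(self, feat, scale_map):
        # Resample the full-resolution (B, 1, H, W) scale map to this level's grid.
        smap = F.interpolate(scale_map, size=feat.shape[-2:],
                             mode='bilinear', align_corners=False)
        return F.relu(self.conv(torch.cat([feat, smap], dim=1)))

# Usage: for a level with features feat of shape (B, C, H_l, W_l):
#   level = PerspectiveFPNLevel(C, 128)
#   out = level(feat, smap)    # smap: the perspective map of Fig. 1(b)
```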
C. Fusing RGB and Depth for Road Obstacle Detection
When depth information is available, from stereo camera disparity or RGB-D sensors, it can be fused with the RGB appearance to improve obstacle detection. For example, [28] combines semantic segmentation with stereo-based detections; MergeNet [29] extracts complementary features from RGB-D input; RFNet [30] uses a two-stream backbone to extract RGB and depth features that are then fused.