of-the-art techniques that use architectures similar to ours,
but without explicit perspective handling. The implementation
of our method is available at https://github.com/cvlab-epfl/perspective-aware-obstacles.
II. RELATED WORK
A complete overview of state-of-the-art road anomaly detection methods can be found in [8]. In short, many of the most effective monocular methods, like ours, generate synthetic training data to compensate for the lack of a sufficiently diverse annotated road obstacle dataset. We therefore focus on these methods, and then discuss other attempts at exploiting perspective information for diverse tasks.
A. Synthetic Training Data for Obstacle Detection
There is a practically unbounded variety of unexpected objects that can pose a collision threat on roads. To handle this diversity, most existing obstacle detection methods rely on creating synthetic training data. This data is typically created from background traffic frames, often taken from Cityscapes [10], into which synthetic obstacles are inserted.
In [2], the synthetic anomalies are generated by altering the
semantic class of existing object instances and synthesizing an
image from those altered labels. In [3], this is complemented
by adding the Cityscapes void regions as obstacles. However,
many of the objects exploited by these techniques are located
above or away from the road, and the resulting training data
only yields limited performance for small on-road obstacles.
Our results show that we outperform these methods.
In [7], synthetic obstacles are obtained by cropping random polygons within the background frame and copying their content onto the road, or filling them with a random color. Other
methods [4], [6], [11] inject object instances extracted from
various image datasets. While this can be done effectively, it
remains suboptimal because the objects are placed at random
locations, without accounting for their size or for the scene
geometry. This is what we address here by explicitly exploiting
perspective information, and we demonstrate that it yields a
substantial performance boost.
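To make this limitation concrete, the sketch below is our own minimal reconstruction of the copy-paste scheme of [7] described above; it is not the authors' code, and the function name and parameter values are illustrative. Note that the obstacle radius and location are drawn independently of scene depth, which is precisely the shortcoming that perspective-aware placement addresses.

```python
import numpy as np
from skimage.draw import polygon  # rasterizes polygon vertices into pixel indices

def inject_random_polygon(image, road_mask, rng, fill_prob=0.5, n_vertices=6):
    """Paste a random polygon onto the road as a stand-in obstacle.

    image:     (H, W, 3) uint8 background traffic frame.
    road_mask: (H, W) bool, True where the road surface is visible.
    Returns the augmented image and the obstacle mask used as a training label.
    """
    h, w = road_mask.shape
    # Sample a destination center uniformly over road pixels, ignoring
    # scene geometry entirely.
    ys, xs = np.nonzero(road_mask)
    i = rng.integers(len(ys))
    cy, cx = int(ys[i]), int(xs[i])

    # Random polygon around the center. The radius is drawn independently of
    # depth, so distant obstacles come out unrealistically large in pixels.
    radius = rng.integers(10, 60)
    angles = np.sort(rng.uniform(0.0, 2.0 * np.pi, n_vertices))
    rr, cc = polygon(cy + radius * np.sin(angles),
                     cx + radius * np.cos(angles), shape=(h, w))

    out = image.copy()
    if rng.random() < fill_prob:
        out[rr, cc] = rng.integers(0, 256, size=3)  # fill with a random color
    else:
        # Copy content from a nearby region of the same frame.
        dy, dx = rng.integers(-50, 51, size=2)
        out[rr, cc] = image[np.clip(rr + dy, 0, h - 1), np.clip(cc + dx, 0, w - 1)]

    label = np.zeros((h, w), dtype=bool)
    label[rr, cc] = True
    return out, label
```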
B. Exploiting Perspective Information
Earlier works [12], [13] propose a lightweight sliding-window classifier of drivable space using a pyramid of input patches whose dimensions depend on their distance from the horizon. These patches are then rescaled according to their distance to the camera, ensuring that similar obstacles have similar pixel sizes when presented to the classifier, regardless of the effects of perspective in the original image. This use of perspective information to overcome scale variance is effective, but it cannot easily be combined with standard CNNs, which operate on the whole image rather than on individually rescaled patches.
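For illustration, the following is a minimal sketch of this patch-rescaling idea under a pinhole, flat-ground assumption; the function name and all parameter values are ours and purely illustrative.

```python
import numpy as np
from skimage.transform import resize  # anti-aliased rescaling

def extract_rescaled_patch(image, u, v, f=720.0, cam_height=1.5,
                           horizon_v=360.0, obj_size_m=0.4, out_px=32):
    """Crop a patch centered at pixel (u, v) whose side matches the expected
    pixel footprint of an obj_size_m-wide object on the ground plane, then
    rescale it to the classifier's fixed input size.

    Pinhole ground-plane model: a road point imaged at row v lies at depth
    z = f * cam_height / (v - horizon_v), so the object spans roughly
    f * obj_size_m / z pixels.
    """
    z = f * cam_height / max(v - horizon_v, 1e-3)   # depth in meters
    side = int(round(f * obj_size_m / z))           # projected size in pixels
    half = max(side // 2, 1)
    patch = image[int(v) - half:int(v) + half,
                  int(u) - half:int(u) + half]      # no border handling here
    # After this normalization, similar obstacles cover a similar number of
    # pixels regardless of where they appear in the image.
    return resize(patch, (out_px, out_px), anti_aliasing=True)
```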
For any perspective camera, distortion depends on image
position. A popular approach to enabling a deep network
to account for this in its predictions is to provide it with
pixel coordinates as input. In [14], [15], [16], this is achieved by treating normalized pixel coordinates as two additional input channels. In [17], the pixel coordinates are used to compute an attention map, exploiting the fact that the class distribution correlates with image height; for example, the sky class appears predominantly at the top of the image. Another way to implicitly account for perspective effects is to introduce extra network branches that process the image at different scales and fuse the results [18], [19]. However, this strategy, like those relying on pixel coordinates, does not explicitly leverage the perspective information available when working with a calibrated camera, as is typically the case in self-driving.
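The coordinate-channel input of [14], [15], [16] is simple to express; the following is a generic sketch of the idea rather than any particular published implementation.

```python
import torch

def append_coord_channels(x):
    """Append normalized pixel coordinates as two extra channels, so that a
    convolutional network can condition its predictions on image position.

    x: (N, C, H, W) image batch -> (N, C + 2, H, W)
    """
    n, _, h, w = x.shape
    ys = torch.linspace(-1.0, 1.0, h, device=x.device)
    xs = torch.linspace(-1.0, 1.0, w, device=x.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")        # each (H, W)
    coords = torch.stack([yy, xx]).expand(n, -1, -1, -1)  # (N, 2, H, W)
    return torch.cat([x, coords], dim=1)
```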
None of these obstacle-detection algorithms explicitly accounts for the relationship between projected object size and distance. This relationship can be encoded in scale maps that specify the expected real-world size covered by an image pixel as a function of its position. Scale maps have been used for obstacle and
anomaly detection [20], [21]. In [20], the scale information
is used to crop and resize image regions before passing them
to a vehicle detection network, which then sees the cars at an approximately constant scale. This requires running
the detector multiple times on the crops. By contrast, our
method processes the whole image at once, and the model
learns how to leverage the perspective information to adjust
the features. In [21], the scale maps are used to rectify the road
surfaces, and obstacles are then detected in the rectified views.
Unlike these methods, we exploit perspective maps as input to
our network, instead of using them for image pre-processing.
This avoids the visual artifacts caused by image warping and yields higher accuracy, as we show in our experiments.
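As a concrete illustration of what a scale map contains, the sketch below derives one from a flat-road, pinhole-camera assumption; the parameterization is ours and purely illustrative.

```python
import torch

def ground_plane_scale_map(h, w, f=720.0, cam_height=1.5, horizon_v=360.0):
    """Build a per-pixel scale map S where S[v, u] approximates the size in
    meters covered by one pixel, assuming the pixel shows a flat road seen
    by a pinhole camera of focal length f mounted at height cam_height.

    A ground point imaged at row v lies at depth z = f * cam_height / (v - horizon_v),
    and at depth z one pixel subtends about z / f meters, which simplifies to
    cam_height / (v - horizon_v).
    """
    v = torch.arange(h, dtype=torch.float32).unsqueeze(1).expand(h, w)
    # Rows at or above the horizon do not intersect the road; clamp them.
    return cam_height / (v - horizon_v).clamp(min=1.0)
```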
Scale maps have also been extensively investigated for
crowd counting purposes [22], [23], [24], [25], [26], [27].
In [22], [23], the models predict perspective information based
on observed body and head size. In [24], an unsupervised
meta-learning method is deployed to learn perspective maps,
which are then used to warp the input images so that they
depict a uniform scale, as in [25]. In [26], a scale map serves
as an extra channel alongside the RGB image and is passed
through the backbone feature extractor, whereas in [27] an
additional branch is added to the backbone to process the
single-channel scale map and to concatenate the resulting
features afterwards. In short, perspective information is used
during feature computation. In this paper, we follow a different
track and incorporate the scale map at different levels of a
feature pyramid network. Our experiments show this to be
more effective. Furthermore, we argue and demonstrate that,
for anomaly detection, incorporating perspective information
into the network is not enough; one must also exploit it when
synthesizing training data.
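One plausible way to realize such a pyramid-level fusion is sketched below; the exact design is detailed later in the paper, so this should be read as a schematic illustration rather than a definitive implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareFusion(nn.Module):
    """Fuse a single-channel perspective scale map into each level of a
    feature pyramid: downsample the map to the level's resolution,
    concatenate it, and project back to the original channel count."""

    def __init__(self, channels_per_level):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c + 1, c, kernel_size=1) for c in channels_per_level
        )

    def forward(self, pyramid, scale_map):
        # pyramid: list of (N, C_i, H_i, W_i) tensors; scale_map: (N, 1, H, W)
        fused = []
        for feat, proj in zip(pyramid, self.proj):
            s = F.interpolate(scale_map, size=feat.shape[-2:],
                              mode="bilinear", align_corners=False)
            fused.append(proj(torch.cat([feat, s], dim=1)))
        return fused
```

Because the 1x1 projection restores each level's channel count, a fusion of this kind can be inserted into an existing pyramid without modifying the downstream prediction heads.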
C. Fusing RGB and Depth for Road Obstacle Detection
When depth information is available, from stereo camera
disparity or RGB-D sensors, it can be fused with the RGB
appearance to improve obstacle detection. For example, [28]
combines semantic segmentation with stereo-based detections;
MergeNet [29] extracts complementary features from RGB-D;
RFNet [30]’s two-stream backbone extracts RGB and depth