CERBERUS: Simple and Effective All-In-One Automotive Perception Model with Multi Task Learning

Carmelo Scribano1,2,*, Giorgia Franchini1, Ignacio Sañudo Olmedo3 and Marko Bertogna1,3
Abstract: Perceiving the surrounding environment is essential for enabling autonomous or assisted driving functionalities. Common tasks in this domain include detecting road users, as well as determining lane boundaries and classifying driving conditions. Over the last few years, a large variety of powerful Deep Learning models have been proposed to address individual tasks of camera-based automotive perception with astonishing performance. However, the limited capabilities of in-vehicle embedded computing platforms cannot cope with the computational effort required to run a heavy model for each individual task. In this work, we present CERBERUS (CEnteR Based End-to-end peRception Using a Single model), a lightweight model that leverages a multitask-learning approach to enable the execution of multiple perception tasks at the cost of a single inference. The code will be made publicly available at https://github.com/cscribano/CERBERUS.
I. INTRODUCTION
Multi Task Learning [1] is a branch of machine learning in which several learning tasks are solved at the same time, while exploiting commonalities and differences between the tasks. Developing lightweight yet powerful Multi Task models that can solve several perception tasks in a single forward pass, while maximizing the parameter sharing among the individual tasks, is an enabling capability to bring deep-learning based perception to production vehicles with limited computing resources. The recent introduction of the freely available Berkeley Deep Drive dataset (BDD100K) [2] represents an important resource for the development of multi-task perception models. Several Multi Task models [3], [4] have recently been proposed, showing encouraging results.
In this short paper we propose a Multi Task model that can simultaneously address the tasks of 1) road object detection (also classifying the object's occlusion), 2) lane estimation and 3) image classification for weather (sunny, rainy, cloudy, etc.), driving scene (highway, city streets, etc.) and time of day (morning, night, etc.). Taking inspiration from modern single-stage object detectors [5]-[7], we cast both object detection and lane estimation as the regression of heatmaps (encoding the likelihood of the presence of objects or lanes at any spatial location) and offsets (used to decode the individual bounding boxes or lane instances).
1 Carmelo Scribano, Giorgia Franchini and Marko Bertogna are part of the Department of Physics, Informatics and Mathematics of the University of Modena and Reggio Emilia, 41125, Modena, Italy. {name}.{surname}@unimore.it
2 Carmelo Scribano is part of the Department of Mathematical, Physical and Computer Sciences of the University of Parma, 43124, Parma, Italy.
3 Ignacio Sañudo Olmedo and Marko Bertogna are part of HIPERT s.r.l, 41122, Modena, Italy. {name}{surname}@hipert.it
* Corresponding Author

Fig. 1: Qualitative inference results of CERBERUS on publicly available footage: (left) superimposed heatmaps, (right) output for object detection and lane estimation.
The advantages of this choice are manifold: 1) Employing the same representation across the different tasks makes the training process simple, even allowing to optimize for the same objective function. 2) The anchor-free approach allows for simple and efficient decoding of both the detection bounding boxes and the road marking lanes. 3) The heatmap representation can be trivially extended to produce instance-level predictions in a bottom-up fashion, as we demonstrate by incorporating the occlusion classification task in the object detection head. We design our model's architecture based on simple and well understood patterns, while focusing on a modular approach so that the final model can be tailored to the deployment requirements. Leveraging a novel objective function for the heatmap regression task, our model can be trained end-to-end while simultaneously optimizing for all the objectives. Finally, we tuned the model to reduce the computational footprint to the greatest extent possible, experimenting with efficient backbones [8], [9], and to ensure ease of deployment by refraining from leveraging exotic layers. The experiments presented suggest that the proposed approach represents a strong baseline and an important step towards a universal model for end-to-end perception.
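To make the joint end-to-end optimization concrete, the following is a minimal Python/PyTorch sketch of a weighted sum of per-task losses. It is only an illustration under our own assumptions: the heatmap term uses a plain MSE placeholder rather than the novel heatmap objective mentioned above (which is not detailed in this section), and the argument names (e.g. center_mask, a hypothetical binary map of annotated center cells) are not taken from the paper.

import torch.nn.functional as F

def combined_loss(hm_pred, hm_gt, off_pred, off_gt, center_mask,
                  tag_logits, tag_labels, w=(1.0, 1.0, 1.0)):
    # Heatmap regression: MSE used here as a placeholder objective.
    l_hm = F.mse_loss(hm_pred, hm_gt)
    # Corner offsets: L1, evaluated only at annotated center cells.
    mask = center_mask.unsqueeze(1)                            # (B, 1, H, W)
    l_off = (F.l1_loss(off_pred, off_gt, reduction="none")
             * mask).sum() / mask.sum().clamp(min=1)
    # Image tagging: one cross-entropy term per attribute
    # (weather, scene, time of day).
    l_tag = sum(F.cross_entropy(logits, labels)
                for logits, labels in zip(tag_logits, tag_labels))
    return w[0] * l_hm + w[1] * l_off + w[2] * l_tag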
II. RELATED WORKS
A. Object Detection
Deep learning based object detectors can be classified into multi-stage and single-stage ones. The former [10]-[13] include a first module that proposes a large set of image regions likely to contain an object; later stages then classify and refine the predictions. These methods are known to be highly accurate, but computationally expensive. In contrast, single-stage detectors do not rely on a region proposal mechanism. Anchor-based detectors [14]-[16] output a set of candidate boxes at predefined locations in the form of offsets with respect to a predefined set of anchor boxes, requiring an expensive non-maximum suppression step to obtain the final detections. On the other hand, more recent anchor-free detectors [5]-[7], [17]-[19] output predictions only at object locations and thus require little post-processing. These methods are often based on the detection of keypoints or center points, or even on transformer decoders.
B. Lane Estimation
Lane estimation methods are divided into segmentation-based and detection-based ones. The first category [20], [21] casts lane estimation as a pixel-wise classification problem (either a pixel belongs to a lane or not), with the addition of a mechanism to associate the lane masks to separate marking instances. Detection-based methods instead have a lot in common with object detectors; as such, some methods regress the lane instances using an anchor mechanism [22], [23]. Finally, recent works cast lane estimation as a regression of keypoints, together with embeddings [24] or offsets [25] used to aggregate the keypoints into distinct lanes.
C. Multi-Task Approaches
To the best of our knowledge, the closest works to ours are the very recent YoloP [3] and HybridNets [4]. In our understanding, a main limitation of such models is the authors' choice of relying on heterogeneous representations for the individual tasks (e.g., combining a Yolo-style [15] object detection head with a segmentation head for lane estimation). Such a choice makes the model inherently convoluted and the optimization phase harder, because it relies on completely different objective functions.
III. METHOD
A. Problem Formulation
As mentioned in the introduction, the proposed model is able to address multiple distinct tasks required for vision-based perception in a single inference step. In particular, given as input a single RGB image $I \in \mathbb{R}^{(w \times h \times 3)}$, we address the following tasks:
a) Road Object Detection: predicting a bounding box and an associated class label for each object, among 10 distinct object classes. Additionally, for each object, we provide a binary label which indicates whether the object is fully visible or partially occluded.
b) Lane Markings Estimation: regressing a polynomial curve and an associated class label for each visible lane marking.
c) Image Tagging: providing three distinct multi-class classification labels associated with the whole frame: weather (7 classes), scene (7 classes) and time of day (4 classes).
We train and evaluate our model on BDD100K. This publicly available dataset consists of 70K training images and 10K validation images, sampled at 10Hz from a large set of 100K driving videos (around 40s each). Videos are captured in a wide range of scene types and conditions, allowing us to train robust models that generalize to real driving conditions.
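As a reference for the overall input/output structure, the following is a minimal PyTorch-style sketch of a model exposing the three groups of outputs described above. The layer sizes, head layouts and all module names are illustrative assumptions and do not reproduce the actual CERBERUS architecture (which relies on the efficient backbones [8], [9] cited in the introduction).

import torch
import torch.nn as nn

class CerberusLikeModel(nn.Module):
    # Illustrative only: one shared backbone, three task-specific heads.
    def __init__(self, num_classes=10, stride=4):
        super().__init__()
        # Placeholder feature extractor standing in for an efficient backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=stride, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        # Detection head: per-class center heatmaps, 4-channel corner
        # offsets and a 1-channel occlusion map, all at output stride S.
        self.det_heatmap = nn.Conv2d(128, num_classes, 1)
        self.det_offsets = nn.Conv2d(128, 4, 1)
        self.det_occlusion = nn.Conv2d(128, 1, 1)
        # Lane head: a single keypoint heatmap (hypothetical layout).
        self.lane_heatmap = nn.Conv2d(128, 1, 1)
        # Tagging head: weather (7), scene (7), time of day (4).
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.tag_weather = nn.Linear(128, 7)
        self.tag_scene = nn.Linear(128, 7)
        self.tag_daytime = nn.Linear(128, 4)

    def forward(self, image):
        feats = self.backbone(image)               # (B, 128, h/S, w/S)
        pooled = self.pool(feats).flatten(1)       # (B, 128)
        return {
            "det_heatmap": torch.sigmoid(self.det_heatmap(feats)),
            "det_offsets": self.det_offsets(feats),
            "det_occlusion": torch.sigmoid(self.det_occlusion(feats)),
            "lane_heatmap": torch.sigmoid(self.lane_heatmap(feats)),
            "weather": self.tag_weather(pooled),
            "scene": self.tag_scene(pooled),
            "daytime": self.tag_daytime(pooled),
        }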
B. Object Detection

The desired output for the object detection task is a set of $N_O$ detections in the form $(x_1, y_1, x_2, y_2, c, o)$, where $x_1, y_1$ (resp. $x_2, y_2$) define the image-space coordinates of the top-left (resp. bottom-right) corner of the bounding box, $c$ characterizes the object class and $o$ is the binary label for the object's occlusion state. Our approach builds heavily on the anchor-free, keypoint-based detector described in [5].
For each object class $k \in C_O$, with $|C_O| = 10$, we output a heatmap $H_D^k \in \mathbb{R}^{(w/S \times h/S)}$ (with $S$ being the output stride) that encodes the center points of all the objects in $I$ belonging to class $k$. Given a set of detection ground truths, first the target keypoints in output space, $c^i = \left(\frac{x_1^i + x_2^i}{2S}, \frac{y_1^i + y_2^i}{2S}\right)$ for $i \in \{0, \ldots, N_O^k\}$, are computed as the geometrical center point of each box rescaled by the output stride (rounding to the nearest integer), as in fig. 2 (left). Then the target heatmap $H_D^k$ for the $k$-th object class is obtained by computing, for each keypoint, a Gaussian centered at the keypoint's location and taking the element-wise maximum:

$$H_D^k(x, y) = \max_{j} \exp\left(-\frac{(x - c_x^j)^2 + (y - c_y^j)^2}{\sigma^2}\right), \quad j \in \{0, \ldots, N_O^k\} \quad (1)$$

with $\sigma$ being a hyperparameter.
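For illustration, the following is a minimal NumPy sketch of how a single class' target heatmap could be rendered according to eq. (1); the function name and the fixed value of sigma are our own assumptions, not details from the paper.

import numpy as np

def render_center_heatmap(boxes, img_w, img_h, stride=4, sigma=2.0):
    # boxes: iterable of (x1, y1, x2, y2) in image coordinates, one class.
    out_w, out_h = img_w // stride, img_h // stride
    heatmap = np.zeros((out_h, out_w), dtype=np.float32)
    ys, xs = np.mgrid[0:out_h, 0:out_w]              # output-space coordinate grid
    for x1, y1, x2, y2 in boxes:
        # Geometric box center, rescaled by the output stride and rounded
        # (clipped to the grid for boxes touching the image border).
        cx = min(int(round((x1 + x2) / (2 * stride))), out_w - 1)
        cy = min(int(round((y1 + y2) / (2 * stride))), out_h - 1)
        # Gaussian centered at the keypoint, combined by element-wise max (eq. 1).
        gauss = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / sigma ** 2)
        heatmap = np.maximum(heatmap, gauss)
    return heatmap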
In addition, we regress a double offset map (shared across all the object classes) $O_D \in \mathbb{R}^{(w/S \times h/S \times 4)}$. The values at the coordinates of each object center $c = (c_x, c_y)$ correspond to the box-corner offset vectors from the object center to the top-left and bottom-right corners of the target bounding box, as in fig. 2 (right):

$$O_D(c_x, c_y) = \left(c_x - \frac{x_1}{S},\; c_y - \frac{y_1}{S},\; c_x - \frac{x_2}{S},\; c_y - \frac{y_2}{S}\right)$$
We also regress the occlusion state for each detected object: this is simply achieved by predicting an occlusion map $V_D \in \mathbb{R}^{(w/S \times h/S \times 1)}$, where the value at a center point's coordinates $(c_x, c_y)$ is meant to be 0 if the object is fully visible and 1 otherwise.
Fig. 2: Definition of the object detection targets: (left) object center-point; (right) box-corner offsets, top-left (red) and bottom-right (blue).
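Following the same conventions, the sketch below (again with hypothetical names) shows how the offset map $O_D$ and the occlusion map $V_D$ could be filled at each object's center cell.

import numpy as np

def render_offset_and_occlusion_maps(boxes, occluded, img_w, img_h, stride=4):
    # boxes: (x1, y1, x2, y2) per object in image coordinates;
    # occluded: per-object binary labels (0 = fully visible, 1 = occluded).
    out_w, out_h = img_w // stride, img_h // stride
    offsets = np.zeros((out_h, out_w, 4), dtype=np.float32)    # O_D
    occlusion = np.zeros((out_h, out_w, 1), dtype=np.float32)  # V_D
    for (x1, y1, x2, y2), occ in zip(boxes, occluded):
        cx = min(int(round((x1 + x2) / (2 * stride))), out_w - 1)
        cy = min(int(round((y1 + y2) / (2 * stride))), out_h - 1)
        # Offsets from the center cell to the rescaled box corners.
        offsets[cy, cx] = (cx - x1 / stride, cy - y1 / stride,
                           cx - x2 / stride, cy - y2 / stride)
        occlusion[cy, cx, 0] = float(occ)
    return offsets, occlusion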
At inference time, the predicted object detections for each class $k$ are retrieved by first taking the local maxima (the peaks of the Gaussians) from the keypoint heatmap, $P^k = \left[(c_x^1, c_y^1)^k, \ldots, (c_x^{N_O^k}, c_y^{N_O^k})^k\right]$; the bounding boxes are then reconstructed by summing, for each center keypoint, the predicted box-corner offset vector taken from $O_D$ at the center point's coordinates. The occlusion classification is retrieved from $V_D$ in the same way.
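A possible decoding routine consistent with this description is sketched below. The 3x3 max-pooling used for peak extraction and the confidence threshold are common CenterNet-style choices that we assume here; they are not details specified in the paper.

import torch
import torch.nn.functional as F

def decode_detections(heatmaps, offsets, occlusion, stride=4, score_thr=0.3):
    # heatmaps: (K, H, W), offsets: (4, H, W), occlusion: (1, H, W),
    # all for a single image, heatmap/occlusion values already in [0, 1].
    # Keep only local maxima: a cell survives if it equals its 3x3 neighborhood max.
    pooled = F.max_pool2d(heatmaps.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    peaks = (heatmaps == pooled) & (heatmaps > score_thr)
    cls_ids, cys, cxs = torch.nonzero(peaks, as_tuple=True)

    detections = []
    for k, cy, cx in zip(cls_ids.tolist(), cys.tolist(), cxs.tolist()):
        ox1, oy1, ox2, oy2 = offsets[:, cy, cx].tolist()
        # Invert O_D(c) = (c_x - x1/S, c_y - y1/S, c_x - x2/S, c_y - y2/S)
        # to recover the image-space box corners.
        x1, y1 = (cx - ox1) * stride, (cy - oy1) * stride
        x2, y2 = (cx - ox2) * stride, (cy - oy2) * stride
        occluded = occlusion[0, cy, cx].item() > 0.5
        detections.append((x1, y1, x2, y2, k, occluded, heatmaps[k, cy, cx].item()))
    return detections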