
CERBERUS: Simple and Effective All-In-One Automotive Perception
Model with Multi Task Learning
Carmelo Scribano1,2,∗, Giorgia Franchini1, Ignacio Sañudo Olmedo3 and Marko Bertogna1,3
Abstract— Perceiving the surrounding environment is essen-
tial for enabling autonomous or assisted driving functionalities.
Common tasks in this domain include detecting road users,
as well as determining lane boundaries and classifying driving
conditions. Over the last few years, a large variety of powerful Deep Learning models have been proposed to address individual tasks of camera-based automotive perception with astonishing performance. However, the limited capabilities of
in-vehicle embedded computing platforms cannot cope with
the computational effort required to run a heavy model for
each individual task. In this work, we present CERBERUS
(CEnteR Based End-to-end peRception Using a Single model), a
lightweight model that leverages a multitask-learning approach
to enable the execution of multiple perception tasks at the cost
of a single inference. The code will be made publicly available
at https://github.com/cscribano/CERBERUS.
I. INTRODUCTION
Multi Task Learning [1] is a branch of machine learning
in which several learning tasks are solved at the same time,
while exploiting commonalities and differences between the tasks.
Developing lightweight yet powerful Multi Task models
that can solve several perception tasks in a single forward
pass, while maximizing the parameter sharing among the
individual tasks, is an enabling capability to bring deep-
learning based perception to production vehicles with limited
computing resources. The recent introduction of the freely
available dataset Berkeley Deep Drive (BDD100K) [2] rep-
resents an important resource for the development of multi-task perception models. Several Multi Task models [3], [4] have lately been proposed, showing encouraging results.
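To make the idea concrete, the essence of such multi-task models is a single shared backbone whose features feed several lightweight task heads, so the expensive computation runs once per image. The following is a minimal illustrative sketch (the shapes, weights, and head names are hypothetical, not the CERBERUS architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weights: one shared backbone and three lightweight task heads.
W_backbone = rng.normal(size=(32, 64))
W_det = rng.normal(size=(64, 6))    # e.g. detection parameters
W_lane = rng.normal(size=(64, 4))   # e.g. lane parameters
W_cls = rng.normal(size=(64, 3))    # e.g. scene classes

def forward(x):
    feat = np.maximum(x @ W_backbone, 0.0)  # shared features, computed once
    return {
        "detection": feat @ W_det,          # each head is a cheap projection
        "lanes": feat @ W_lane,
        "classification": feat @ W_cls,
    }
```

All three outputs are produced from one forward pass through the backbone, which is what makes the approach attractive on embedded hardware.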
In this short paper we propose a Multi Task model that can
simultaneously address the tasks of 1) road object detection
(also classifying the object’s occlusion), 2) lane estimation
and 3) image classification for weather (sunny, rainy, cloudy, etc.), driving scene (highway, city streets, etc.) and time of day (morning, night, etc.). Taking inspiration from modern single
stage object detectors [5]–[7], we decided to cast both the
task of object detection and lane estimation as regression of
heatmaps (to encode the likelihood of the presence of objects
or lanes at any spatial location) and offsets (used to decode
the individual bounding boxes or lane instances).
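This heatmap-plus-offset formulation can be sketched as follows (a simplified illustration of the general technique, not the paper's exact encoder/decoder; function names and thresholds are hypothetical):

```python
import numpy as np

def render_center(heatmap, cx, cy, sigma=2.0):
    """Splat a Gaussian peak at an object (or lane point) center
    onto the target heatmap."""
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)  # keep the max where peaks overlap
    return heatmap

def decode_boxes(heatmap, offset, size, thresh=0.9):
    """Turn heatmap peaks plus per-pixel offset and size maps
    into bounding boxes (x1, y1, x2, y2)."""
    boxes = []
    for y, x in zip(*np.where(heatmap >= thresh)):
        cx = x + offset[0, y, x]          # sub-pixel center refinement
        cy = y + offset[1, y, x]
        bw, bh = size[0, y, x], size[1, y, x]
        boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return boxes
```

Lane instances can be decoded analogously, with the offset maps pointing from sampled lane points to their instance anchor instead of parameterizing a box.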
The advantages of this choice are manifold: 1) Employing the same representation across different tasks makes
1Carmelo Scribano, Giorgia Franchini and Marko Bertogna are
part of the Department of Physics, Informatics and Mathematics of
the University of Modena and Reggio Emilia, 41125, Modena, Italy.
{name}.{surname}@unimore.it
2Carmelo Scribano is part of the Department of Mathematical, Physical
and Computer Sciences of the University of Parma, 43124, Parma, Italy.
3Ignacio Sañudo Olmedo and Marko Bertogna are part of HIPERT s.r.l., 41122, Modena, Italy. {name}{surname}@hipert.it
*Corresponding Author
Fig. 1: Qualitative inference results of CERBERUS on pub-
licly available footage. (Left) superimposed heatmaps; (right) output for object detection and lane estimation.
the training process simple, even allowing optimization of the same objective function. 2) The anchor-free approach allows for simple and efficient decoding of both the detection bounding boxes and the lane markings. 3) The heatmap
representation can be trivially extended to produce instance-level predictions in a bottom-up fashion, as we demonstrate by
incorporating the occlusion classification task in the object
detection head. We design our model’s architecture based on
simple and well-understood patterns, while focusing on a modular approach so that the final model can be tailored to the deployment requirements. Leveraging a
novel objective function for the heatmap regression task,
our model can be trained end-to-end while simultaneously
optimizing for all the objectives. Finally, we tuned the model to minimize its computational footprint, experimenting with efficient backbones [8], [9], and to ensure ease of deployment by refraining from using exotic layers. The experiments presented suggest that
the proposed approach represents a strong baseline and an
important step towards a universal model for end-to-end perception.
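The novel heatmap objective itself is not detailed in this section. For orientation only, a widely used objective for heatmap regression in center-based detectors is the penalty-reduced focal loss popularized by CornerNet/CenterNet-style models; the sketch below illustrates that standard formulation, which is not necessarily the objective proposed here:

```python
import numpy as np

def heatmap_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss (CornerNet/CenterNet style).
    `target` is 1.0 at exact centers and decays as a Gaussian nearby;
    the (1 - target)^beta factor down-weights pixels close to a peak."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = target == 1.0
    pos_loss = -((1.0 - pred[pos]) ** alpha) * np.log(pred[pos])
    neg_loss = -((1.0 - target[~pos]) ** beta) \
               * (pred[~pos] ** alpha) * np.log(1.0 - pred[~pos])
    n_pos = max(int(pos.sum()), 1)  # normalize by the number of positives
    return (pos_loss.sum() + neg_loss.sum()) / n_pos
```

Because the same heatmap representation is shared across tasks, a single loss of this form can in principle supervise both the detection and the lane heads.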
II. RELATED WORKS
A. Object Detection
Deep learning based object detectors can be classified into Multi-Stage and Single-Stage approaches. The former [10]–[13] include a first module that proposes a large set of image regions likely to contain an object; later stages then classify and refine these proposals. Such methods are known to be highly accurate, but computationally expensive. In contrast, single-stage detectors do not rely on a region proposal mechanism. In
this sense, anchor-based detectors [14]–[16] output a set of candidate boxes at predefined locations, in the form of offsets with respect to a predefined set of anchor boxes, requiring an expensive non-maximum suppression step to obtain
the final detections. On the other hand, more recent anchor-
arXiv:2210.00756v1 [cs.CV] 3 Oct 2022