
CERBERUS: Simple and Effective All-In-One Automotive Perception
Model with Multi Task Learning
Carmelo Scribano1,2,∗, Giorgia Franchini1, Ignacio Sañudo Olmedo3 and Marko Bertogna1,3
Abstract— Perceiving the surrounding environment is essen-
tial for enabling autonomous or assisted driving functionalities.
Common tasks in this domain include detecting road users,
as well as determining lane boundaries and classifying driving
conditions. Over the last few years, a large variety of powerful Deep Learning models have been proposed to address individual tasks of camera-based automotive perception with astonishing performance. However, the limited capabilities of
in-vehicle embedded computing platforms cannot cope with
the computational effort required to run a heavy model for
each individual task. In this work, we present CERBERUS
(CEnteR Based End-to-end peRception Using a Single model), a
lightweight model that leverages a multitask-learning approach
to enable the execution of multiple perception tasks at the cost
of a single inference. The code will be made publicly available
at https://github.com/cscribano/CERBERUS.
I. INTRODUCTION
Multi Task Learning [1] is a branch of machine learning
in which several learning tasks are solved at the same time,
while exploiting commonalities and differences between the tasks.
Developing lightweight yet powerful Multi Task models
that can solve several perception tasks in a single forward
pass, while maximizing the parameter sharing among the
individual tasks, is an enabling capability to bring deep-
learning based perception to production vehicles with limited
computing resources. The recent introduction of the freely
available dataset Berkeley Deep Drive (BDD100K) [2] rep-
resents an important resource for the development of multi-task perception models. Several Multi Task models [3], [4] have lately been proposed, showing encouraging results.
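To make the idea concrete, the essence of such multi-task models is a single shared backbone whose features feed several lightweight task heads, so the expensive computation runs once per image. The following is a minimal illustrative sketch (the shapes, weights, and head names are hypothetical, not the CERBERUS architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weights: one shared backbone and three lightweight task heads.
W_backbone = rng.normal(size=(32, 64))
W_det = rng.normal(size=(64, 6))    # e.g. detection parameters
W_lane = rng.normal(size=(64, 4))   # e.g. lane parameters
W_cls = rng.normal(size=(64, 3))    # e.g. scene classes

def forward(x):
    feat = np.maximum(x @ W_backbone, 0.0)  # shared features, computed once
    return {
        "detection": feat @ W_det,          # each head is a cheap projection
        "lanes": feat @ W_lane,
        "classification": feat @ W_cls,
    }
```

All three outputs are produced from one forward pass through the backbone, which is what makes the approach attractive on embedded hardware.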
In this short paper we propose a Multi Task model that can
simultaneously address the tasks of 1) road object detection
(also classifying the object’s occlusion), 2) lane estimation
and 3) image classification for weather (sunny, rainy, cloudy, etc.), driving scene (highway, city streets, etc.) and time of day (morning, night, etc.). Taking inspiration from modern single
stage object detectors [5]–[7], we decided to cast both the
task of object detection and lane estimation as regression of
heatmaps (to encode the likelihood of the presence of objects
or lanes at any spatial location) and offsets (used to decode
the individual bounding boxes or lane instances).
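This heatmap-plus-offset formulation can be sketched as follows (a simplified illustration of the general technique, not the paper's exact encoder/decoder; function names and thresholds are hypothetical):

```python
import numpy as np

def render_center(heatmap, cx, cy, sigma=2.0):
    """Splat a Gaussian peak at an object (or lane point) center
    onto the target heatmap."""
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)  # keep the max where peaks overlap
    return heatmap

def decode_boxes(heatmap, offset, size, thresh=0.9):
    """Turn heatmap peaks plus per-pixel offset and size maps
    into bounding boxes (x1, y1, x2, y2)."""
    boxes = []
    for y, x in zip(*np.where(heatmap >= thresh)):
        cx = x + offset[0, y, x]          # sub-pixel center refinement
        cy = y + offset[1, y, x]
        bw, bh = size[0, y, x], size[1, y, x]
        boxes.append((cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2))
    return boxes
```

Lane instances can be decoded analogously, with the offset maps pointing from sampled lane points to their instance anchor instead of parameterizing a box.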
The advantages of this choice are manifold: 1) Employing the same representation across different tasks makes
1Carmelo Scribano, Giorgia Franchini and Marko Bertogna are
part of the Department of Physics, Informatics and Mathematics of
the University of Modena and Reggio Emilia, 41125, Modena, Italy.
{name}.{surname}@unimore.it
2Carmelo Scribano is part of the Department of Mathematical, Physical
and Computer Sciences of the University of Parma, 43124, Parma, Italy.
3Ignacio Sañudo Olmedo and Marko Bertogna are part of HIPERT s.r.l., 41122, Modena, Italy. {name}{surname}@hipert.it
*Corresponding Author
Fig. 1: Qualitative inference results of CERBERUS on pub-
licly available footage. (Left) superimposed heatmaps; (right) output for object detection and lane estimation.
the training process simple, even allowing optimization of the same objective function. 2) The anchor-free approach allows for simple and efficient decoding of both the detection bounding boxes and the lane markings. 3) The heatmap
representation can be trivially extended to produce instance-level predictions in a bottom-up fashion, as we demonstrate by
incorporating the occlusion classification task in the object
detection head. We design our model’s architecture based on
simple and well-understood patterns, while focusing on a modular approach so that the final model can be tailored to the deployment requirements. Leveraging a
novel objective function for the heatmap regression task,
our model can be trained end-to-end while simultaneously
optimizing for all the objectives. Finally, we tuned the model to minimize its computational footprint, experimenting with efficient backbones [8], [9], and to ensure ease of deployment by refraining from using exotic layers. The experiments presented suggest that
the proposed approach represents a strong baseline and an
important step towards a universal model for end-to-end perception.
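The novel heatmap objective itself is not detailed in this section. For orientation only, a widely used objective for heatmap regression in center-based detectors is the penalty-reduced focal loss popularized by CornerNet/CenterNet-style models; the sketch below illustrates that standard formulation, which is not necessarily the objective proposed here:

```python
import numpy as np

def heatmap_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss (CornerNet/CenterNet style).
    `target` is 1.0 at exact centers and decays as a Gaussian nearby;
    the (1 - target)^beta factor down-weights pixels close to a peak."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = target == 1.0
    pos_loss = -((1.0 - pred[pos]) ** alpha) * np.log(pred[pos])
    neg_loss = -((1.0 - target[~pos]) ** beta) \
               * (pred[~pos] ** alpha) * np.log(1.0 - pred[~pos])
    n_pos = max(int(pos.sum()), 1)  # normalize by the number of positives
    return (pos_loss.sum() + neg_loss.sum()) / n_pos
```

Because the same heatmap representation is shared across tasks, a single loss of this form can in principle supervise both the detection and the lane heads.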
II. RELATED WORKS
A. Object Detection
Deep learning based object detectors can be classified into Multi-Stage and Single-Stage approaches. The former [10]–[13] include a first module that proposes a large set of image regions likely to contain an object; later stages then classify and refine these proposals. Such methods are known to be highly accurate, but computationally expensive. In contrast, single-stage detectors do not rely on a region proposal mechanism. In
this sense, anchor-based detectors [14]–[16] output a set of candidate boxes at predefined locations, in the form of offsets with respect to a predefined set of anchor boxes, requiring an expensive non-maximum suppression step to obtain
the final detections. On the other hand, more recent anchor-
arXiv:2210.00756v1 [cs.CV] 3 Oct 2022