
Can Transformer Attention Spread Give Insights Into Uncertainty of
Detected and Tracked Objects?
Felicia Ruppel1,2, Florian Faion1, Claudius Gläser1 and Klaus Dietmayer2
Abstract— Transformers have recently been utilized to per-
form object detection and tracking in the context of autonomous
driving. One unique characteristic of these models is that
attention weights are computed in each forward pass, giving
insights into the model’s interior, in particular, which part
of the input data it deemed interesting for the given task.
Such an attention matrix over the input grid is available for
each detected (or tracked) object in every transformer decoder
layer. In this work, we investigate the distribution of these
attention weights: How do they change through the decoder
layers and through the lifetime of a track? Can they be used
to infer additional information about an object, such as a
detection uncertainty? Especially in unstructured environments,
or environments that were not common during training, a
reliable measure of detection uncertainty is crucial to decide
whether the system can still be trusted or not.
I. INTRODUCTION
Object detection and tracking are essential tasks in a
perception pipeline for autonomous and automated driving.
Only with knowledge about surrounding objects are downstream
tasks, such as prediction and planning, possible. In such
a system, where the cascading effects of perception errors
can be detrimental, it is very important to be able to quan-
tify the reliability of the detection and tracking output. In
object detection, uncertainty can stem from two sources [1]:
Epistemic uncertainty stems from the model itself,
e.g. when an observation is made that was not present in
the training dataset. Unstructured and dynamic environments
can also cause such an uncertainty, as their versatility can
hardly be captured in a training dataset. Second, aleatoric
uncertainty stems from sensor noise, and also encompasses
uncertainty caused by low visibility and increased distance
from the sensor [1].
While state-of-the-art object detection methods have been
based on deep learning for many years, both with image
input [2] as well as on point clouds [3], [4], it is a recent
phenomenon that deep learning based models are also used
for joint tracking and detection [5], [6], [7]. Such trackers
aim to utilize the detector’s latent space to infer additional in-
formation about a tracked object, rather than relying on low-
dimensional bounding boxes as input. However, they have
the drawback that they do not output an uncertainty, as a
conventional method, e.g. tracking based on a Kalman filter [8],
would. While deep learning based detectors and
trackers usually output a confidence score or class probability
score per estimated object, these generally cannot be used
as a reliable uncertainty measure; additional measures are
necessary to capture uncertainty [9].

1Robert Bosch GmbH, Corporate Research, 71272 Renningen, Germany, {firstname.lastname}@de.bosch.com
2Institute of Measurement, Control and Microtechnology, Ulm University, Germany, {firstname.lastname}@uni-ulm.de

Fig. 1. Example of estimated bounding boxes with their respective attention
covariance matrices, pictured as ellipses. Ground truth boxes are denoted by
dotted grey lines, while estimated boxes, attention weights, attention means
and ellipses are colored. Excerpt from the bird's-eye-view grid at a distance
of 30 to 50 meters from the ego vehicle.
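To illustrate the attention spread pictured in Fig. 1, the attention weights of one object over the bird's-eye-view grid can be treated as a discrete distribution over grid cell positions, whose weighted mean and covariance yield the plotted ellipse. The following Python sketch shows one plausible way to compute these quantities; the grid layout, cell size and function name are illustrative assumptions, not the paper's implementation.

import numpy as np

def attention_mean_cov(attn_map, cell_size=0.5):
    """Weighted mean and covariance of one object's attention map.

    attn_map:  2D array of attention weights over the BEV grid
               (one row of a decoder attention matrix, reshaped to the grid).
    cell_size: metric edge length of one grid cell (assumed value, in meters).
    """
    h, w = attn_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1) * cell_size
    weights = attn_map.ravel() / attn_map.sum()

    mean = weights @ coords                            # attention mean, shape (2,)
    centered = coords - mean
    cov = (weights[:, None] * centered).T @ centered   # attention covariance, (2, 2)
    return mean, cov

# The ellipse axes correspond to the eigenvectors of cov, scaled by the
# square roots of its eigenvalues:
# eigvals, eigvecs = np.linalg.eigh(cov)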
One approach towards joint object detection and tracking
is the usage of transformer models [10], which were able
to achieve state-of-the-art results in some domains [11],
[6]. Transformers are based on attention, i.e. the interaction
between input tokens, which is why these models allow for
a unique insight into their reasoning: One can visualize the
attention matrices that are computed in each model forward
pass and investigate which part of the input data was used
to generate a certain output. In previous work, we developed
a transformer based model for detection and tracking [7] in
the context of autonomous driving that operates on (lidar)
point clouds. An example of visualized attention weights
from the tracking model is pictured in Figure 1. In empirical
observations, a more focused attention tends to lead to a more
accurate detection. In this paper, we therefore investigate whether
the attention weight distribution can give insights into detection
uncertainty. An uncertainty indicator would be
very valuable towards the ability to use transformer based