
Scenario-based Evaluation of Prediction Models for Automated Vehicles
Manuel Muñoz Sánchez1, Jos Elfring2, Emilia Silvas3 and René van de Molengraft1
Abstract— To operate safely, an automated vehicle (AV) must
anticipate how the environment around it will evolve. For that
purpose, it is important to know which prediction models are
most appropriate for every situation. Currently, assessment of
prediction models is often performed over a set of trajectories
without distinction of the type of movement they capture, resulting
in the inability to determine the suitability of each model for
different situations. In this work we illustrate how standardized
evaluation methods result in wrong conclusions regarding a
model’s predictive capabilities, preventing a clear assessment
of prediction models and potentially leading to dangerous on-
road situations. We argue that, in line with evaluation practices
in safety assessment for AVs, prediction models should be
assessed in a scenario-based fashion. To encourage
scenario-based assessment of prediction models and illustrate
the dangers of improper assessment, we categorize trajectories
of the Waymo Open Motion dataset according to the type of
movement they capture. Next, three different models are
thoroughly evaluated for different trajectory types and prediction
horizons. Results show that common evaluation methods are
insufficient and that assessment should be tailored to the
application in which the model will operate.
I. INTRODUCTION
Automated vehicles (AVs) have become popular in recent
years since they have the potential to increase road safety,
efficiency and comfort [1]–[3]. To operate safely, an AV must
accurately anticipate the future motion of other road users
(RUs) in its surroundings [4]. To build trajectory prediction
models, deep learning (DL) techniques are gaining attention
[5], since they can effectively learn complex interactions
between different RUs [6], [7] and the road infrastructure
[8], [9] from past observations to produce more accurate
predictions. Traditionally, training these models effectively
was challenging, since the large amounts of data required were
not readily available. However, this issue has been alleviated
in recent years with the release of several large public
datasets [10]–[14]. A common practice to assess a model’s
predictive accuracy is to consider a fraction of the dataset
reserved for this purpose (commonly referred to as test
data), and to compare the model’s predictions with the real
trajectories. The output of prediction models may vary, hence
different metrics exist to quantify the disparity between the
real and predicted trajectories [4]. For example, some models
This work was supported by SAFE-UP under EU’s Horizon 2020 research
and innovation programme, grant agreement 861570.
1Manuel Muñoz Sánchez, Emilia Silvas, Jos Elfring and René van de
Molengraft are with the Department of Mechanical Engineering, Eindhoven
University of Technology, Eindhoven, The Netherlands.
2Jos Elfring is also with the Product Unit Autonomous Driving, TomTom,
Amsterdam, The Netherlands.
3Emilia Silvas is also with the Department of Integrated Vehicle Safety,
TNO, Helmond, The Netherlands.
[Fig. 1 image: an AV (A) among surrounding road users (B-K); legend: true trajectory, accurate prediction, inaccurate prediction.]
Fig. 1. Example where a model that is accurate on average fails to predict
a pedestrian trajectory, leading to a dangerous situation.
produce a single prediction, while others produce a set of
feasible trajectories with an associated confidence for each.
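For concreteness, the sketch below implements the displacement metrics most commonly used for this comparison in the literature surveyed in [4]: average and final displacement error (ADE/FDE) for single-mode models, and a minADE variant for multi-modal ones. The metrics are not defined in this section, so this is an illustrative sketch assuming their standard definitions; the function names and toy trajectory are ours.

```python
import numpy as np

def ade(pred, true):
    """Average Displacement Error: mean Euclidean distance between
    predicted and true positions over the prediction horizon."""
    return float(np.linalg.norm(pred - true, axis=-1).mean())

def fde(pred, true):
    """Final Displacement Error: Euclidean distance at the last step."""
    return float(np.linalg.norm(pred[-1] - true[-1]))

def min_ade(preds, true):
    """For multi-modal models: ADE of the best of K predicted
    trajectories, each of shape (T, 2)."""
    return min(ade(p, true) for p in preds)

# Toy example: a 3-step horizon of 2D positions.
true = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.1], [1.1, 0.0], [2.0, 0.3]])
print(f"ADE: {ade(pred, true):.2f} m, FDE: {fde(pred, true):.2f} m")
```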
Despite the existence of various evaluation metrics for
prediction models, several challenges remain unaddressed in
current evaluation practices, such as the inability of these
metrics to capture a model’s robustness or generalization
capabilities [5]. Perhaps the most severe shortcoming is that
all trajectories are considered equal for error computation
despite capturing significantly different behaviors, which
can lead to dangerous situations due to misjudgment of
a model’s suitability for specific situations. For instance,
consider the situation shown in Fig. 1, where an AV (A)
predicts the future trajectory of surrounding RUs (B-K) in a
crowded urban scenario. Current evaluation practices would
deem this model suitable for RU trajectory prediction in
crowded urban scenarios, since its predictions are highly
accurate on average. It accurately predicts pedestrians on the
sidewalk (B-D), crossing at designated crossings (E,F), and
lane-following cyclists and vehicles (G-I). However, in this
example only a few of these RUs are relevant to the AV (I,
J). Additionally, failure cases like the pedestrians crossing at
non-designated crossings (J, K) can go unnoticed since all
trajectories are considered equally for error computation.
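As a minimal sketch of the scenario-based aggregation advocated here, the snippet below reports errors per movement category instead of a single dataset-wide mean. The category labels and error values are hypothetical, chosen to mirror Fig. 1, where the two poorly predicted crossings (J, K) vanish in the average:

```python
import numpy as np
from collections import defaultdict

def evaluate_by_category(errors, categories):
    """Aggregate per-trajectory errors per movement category
    instead of reporting a single dataset-wide average."""
    per_cat = defaultdict(list)
    for err, cat in zip(errors, categories):
        per_cat[cat].append(err)
    return {cat: float(np.mean(errs)) for cat, errs in per_cat.items()}

# Hypothetical errors (m): the dataset-wide mean (0.80 m)
# hides the large errors for J and K.
errors = [0.2, 0.3, 0.2, 0.2, 0.3, 0.4, 0.3, 0.2, 3.1, 2.8]
categories = (["sidewalk"] * 3 + ["designated_crossing"] * 2
              + ["lane_following"] * 3 + ["undesignated_crossing"] * 2)
print(evaluate_by_category(errors, categories))
# ≈ {'sidewalk': 0.23, 'designated_crossing': 0.25,
#    'lane_following': 0.30, 'undesignated_crossing': 2.95}
```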
The importance of a thorough evaluation for different
types of trajectories has been recognized previously [15].
However, current efforts to improve evaluation of prediction
models focus mainly on interactions between pedestrians
(e.g. collision-avoidance [15]), and disregard interactions of
RUs with the road infrastructure (e.g. pedestrian stops at
a red traffic light). Additionally, the evaluation procedure
should provide a transparent assessment of a model's suitability
for the intended application. For instance, for AVs, an
inaccurate prediction for a pedestrian walking in front of the
vehicle should be considered more important or severe than one
for a pedestrian walking behind the vehicle or far away from it.
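One possible way to make such relevance explicit in the evaluation is to weight per-RU errors by proximity and by whether the RU is ahead of the AV. The heuristic below is purely illustrative and not a method proposed in this paper; the weighting scheme and all names are ours:

```python
import numpy as np

def relevance_weighted_error(errors, ru_positions, av_position, av_heading):
    """Weight per-RU prediction errors by a toy relevance heuristic:
    RUs ahead of the AV and close to it count more than RUs
    behind it or far away."""
    errors = np.asarray(errors, dtype=float)
    av_position = np.asarray(av_position, dtype=float)
    heading = np.array([np.cos(av_heading), np.sin(av_heading)])
    weights = []
    for pos in ru_positions:
        offset = np.asarray(pos, dtype=float) - av_position
        ahead = 1.0 if offset @ heading > 0 else 0.2  # behind the AV matters less
        weights.append(ahead / (1.0 + np.linalg.norm(offset)))  # nearby matters more
    weights = np.asarray(weights)
    return float((weights * errors).sum() / weights.sum())

# A 2 m error on a pedestrian 5 m ahead dominates a 0.1 m error
# on one 5 m behind: weighted score ≈ 1.68 vs a plain mean of 1.05.
print(relevance_weighted_error(
    errors=[2.0, 0.1],
    ru_positions=[(5.0, 0.0), (-5.0, 0.0)],
    av_position=(0.0, 0.0),
    av_heading=0.0))
```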