
Fig. 3. A merge scenario produces a bi-modal distribution (black
samples). Optimising closest-mode (minADE/FDE) evaluations favours
diverse predictions (green), while probabilistic (predRMS) evaluations
favour predictions close to the mean (red), which minimise the penalty
of incorrect mode estimates. Scoring well on both requires diverse
predictions together with accurate estimates of mode probabilities.

These limitations can be addressed using a Gaussian
Mixture Model (GMM), which represents the full predicted
distribution, along with probability estimates of each mode.
This is preferred over increasing the number of samples, as
GMMs provide a compact encoding of the distribution and
a practical means of evaluating the probability distribution.
Previous methods [4], [5] have used GMMs on the NGSIM
dataset, evaluated using negative log-likelihood (NLL).
Other methods [6], [7] produce mode probability estimates,
which are evaluated using predicted-mode RMS (predRMS)
(see Section IV-B).
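As a minimal sketch of how a GMM prediction can be scored with NLL, the following computes the standard negative log-likelihood of an observed 2D position under a mixture of axis-aligned Gaussians. The function names, the diagonal-covariance assumption, and the example weights, means and standard deviations are all illustrative assumptions, not taken from the methods cited above; this is the conventional NLL, not the revised measure proposed in this paper.

```python
import math

def diag_gaussian_log_pdf(point, mean, std):
    """Log density of an axis-aligned Gaussian (diagonal covariance)."""
    log_p = 0.0
    for x, mu, sigma in zip(point, mean, std):
        z = (x - mu) / sigma
        log_p += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return log_p

def gmm_nll(point, weights, means, stds):
    """Negative log-likelihood of `point` under a GMM (weights must be > 0)."""
    log_terms = [
        math.log(w) + diag_gaussian_log_pdf(point, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return -(peak + math.log(sum(math.exp(t - peak) for t in log_terms)))

# Hypothetical bi-modal prediction at one future timestep (x, y in metres).
weights = [0.7, 0.3]
means = [(10.0, 0.0), (10.0, 3.5)]
stds = [(1.0, 0.5), (1.0, 0.5)]

nll_likely_mode = gmm_nll((10.0, 0.0), weights, means, stds)
nll_unlikely_mode = gmm_nll((10.0, 3.5), weights, means, stds)
nll_between_modes = gmm_nll((10.0, 1.75), weights, means, stds)
```

In this toy setup, an observation at the centre of the higher-weighted mode receives the lowest NLL, an observation at the lower-weighted mode a higher one, and an observation between the two modes the highest, which illustrates why NLL is sensitive to both mode placement and mode probability.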
Probabilistic and closest-mode evaluations provide
complementary measures that are more informative than either
alone, and are analogous to precision and recall in binary
classification. We argue that an effective predictor for inter-
active scenarios needs to optimise both measures, to demon-
strate that it is able to closely capture distinct behaviour
modes, while also accurately representing probabilities. This
is a challenging task, as the two kinds of evaluation
reward conflicting prediction strategies. Closest-mode
evaluations (minADE/FDE/MR) favour diverse predictions,
while probabilistic evaluations (predRMS, NLL) favour
conservative predictions close to the mean of expected
behaviours, where the cost of incorrect mode estimates is
minimised (Figure 3). Optimising both evaluation approaches
together demonstrates accurate multi-modal prediction, and
reduces the over-representation of unlikely predictions seen
in Figure 2.
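The tension between the two evaluation families can be illustrated with a small sketch. It assumes that minADE/minFDE take the best of all predicted modes and that predRMS is the RMS displacement error of the single highest-probability mode; the precise definitions are given in Section IV-B, so the function names and the toy trajectories below are assumptions made for illustration only.

```python
import math

def displacement(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def ade(pred, truth):
    """Average displacement error over all timesteps."""
    return sum(displacement(p, q) for p, q in zip(pred, truth)) / len(truth)

def min_ade(modes, truth):
    """Closest-mode ADE: error of the best predicted mode."""
    return min(ade(mode, truth) for mode in modes)

def min_fde(modes, truth):
    """Closest-mode FDE: final-step error of the best predicted mode."""
    return min(displacement(mode[-1], truth[-1]) for mode in modes)

def pred_rms(modes, probs, truth):
    """RMS error of the most probable mode (assumed predRMS definition)."""
    best = max(range(len(modes)), key=lambda k: probs[k])
    squared = [displacement(p, q) ** 2 for p, q in zip(modes[best], truth)]
    return math.sqrt(sum(squared) / len(squared))

# Toy bi-modal scene: the agent actually takes the upper branch (y = +1).
truth = [(1.0, 1.0), (2.0, 1.0)]

# Diverse predictor: covers both branches, but ranks the wrong one highest.
diverse_modes = [[(1.0, 1.0), (2.0, 1.0)], [(1.0, -1.0), (2.0, -1.0)]]
diverse_probs = [0.4, 0.6]

# Conservative predictor: a single trajectory at the mean of the two branches.
mean_modes = [[(1.0, 0.0), (2.0, 0.0)]]
mean_probs = [1.0]

diverse_scores = (min_ade(diverse_modes, truth), pred_rms(diverse_modes, diverse_probs, truth))
mean_scores = (min_ade(mean_modes, truth), pred_rms(mean_modes, mean_probs, truth))
```

Here the diverse predictor achieves a better minADE but a worse predRMS than the mean predictor, because its probability estimate selects the wrong branch. A predictor scores well on both only if its modes are diverse and its mode probabilities are accurate.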
To that end, we present DiPA (Diverse and Probabilisti-
cally Accurate) – a fast method for predicting in interactive
scenarios using a GMM encoding, that is able to optimise
both objectives together, by producing a diverse set of predic-
tions with accurate probability estimates. This allows distinct
behaviours to be accurately modelled, while producing an
accurate representation of the full trajectory distribution. DiPA
improves over previous methods [1], [3] on closest-mode
evaluations on the INTERACTION dataset [8], and over
previous methods [4], [7] on probabilistic evaluations
on NGSIM [9]. DiPA also improves over a baseline method,
Multiple-Futures Prediction (MFP) [5], when both
closest-mode and probabilistic measures are compared together. This
demonstrates a predictor that is suitable for supporting an
AV planner in interactive scenarios.
Beyond highlighting the importance of evaluating predic-
tors with both closest-mode and probabilistic evaluations, the
key contributions are: 1) a fast prediction architecture with
a flexible representation that processes agent interactions
across wide-ranging road layouts and produces high-accuracy
predictions on interactive scenarios, 2) a training regime that
supports a diverse set of predicted modes using a GMM-
based spatial distribution, with accurate probability esti-
mates, and 3) a revision to the NLL measure for evaluating
GMM predictions, to correct for an important limitation.
II. RELATED WORK
A number of different structures have been used for
prediction of agents in road scenes, including graph-, goal-
and regression-based methods.
StarNet [1] represents the scene and agents using vector-
based graphs, and uses a combined representation of agents
within their own reference frame and from the points of view
of other agents. Further graph-based methods such as [3],
[10], [11] combine map information and agent positions into a
shared representation, typically processed with a Graph
Neural Network [12] in an encoder-decoder framework.
These methods allow encoding the static layout of the scene
and various agents in a generalisable way, and have shown
good results on closest-mode prediction.
Goal-based methods [13], [14], [15], [16], [17] identify
a number of potential future targets that each agent may
head towards, determine likelihoods of each, and produce
predicted trajectories towards those goals. Flash [7] uses
a combination of Bayesian inverse-planning and mixture-
density networks to produce accurate predictions of trajec-
tories in highway driving scenarios. Goal-based methods
use the map to inform trajectory generation, and can use
kinematically-sound trajectory generators. However, this can
limit diversity in other factors, such as motion profile
and path variations, compared to data-driven methods.
Regression-based methods use representations that directly
map observations to predicted outputs. SAMMP [4] produces
joint predictions of the spatial distribution of vehicles, using
a multi-head self-attention function to capture interactions
between agents. Multiple-Futures Prediction (MFP) [5] mod-
els the joint futures of a number of interacting agents, using
learnt latent variables for generating predicted future modes.
Mersch et al. [18] present a temporal-convolution method
for predicting interacting vehicles in a highway scenario
where neighbouring agents are assigned specific roles based
on relative positions to a central agent. These regression-
based methods can be fast and accurate, but may have
limited generalisability to different layouts when role-based
representation of inputs is used.
Existing interactive prediction methods using the INTERACTION
dataset have demonstrated good results on closest-mode
evaluations (minADE/FDE/MR) [1], [2], [3]. These
have typically used a prediction encoding using a fixed
number of modes, each represented as a trajectory sample.
Optimising closest-mode evaluations produces diverse pre-
dictions, which closely capture distinct modes of behaviour.