such as NLP [28] and vision [29], [30]. In other autonomous driving tasks, such as control, some methods design synthetic ground truths [31] to help the learning of their driving agents. Concurrently with this work, PreTraM [32] applies contrastive learning between local map rasters
and trajectories to reinforce the learned relationship between
both, while SSL-Lanes [1] uses a multitasking framework
to analyze the performance improvement of four auxiliary
tasks.
III. METHOD
We proceed as follows. First, we describe our base prediction model, on top of which we apply the auxiliary task in different frameworks. We then describe each element of this task, and finally detail its application in each framework, either pretraining or multitask.
A. Base trajectory prediction model
Our encoder is the same as in the GOHOME model
[33], which uses attention and graph convolutions to update
agent features with map information and to model agent
interactions. However, instead of decoding a heatmap of the
probability density of the final position of the tracked agent,
we, like most of the state of the art, decode the $k$ full trajectory predictions directly through a multi-layer perceptron, and the probability logits of each prediction separately through a simple linear layer.
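As an illustration, the decoding heads described above could be sketched as follows; the embedding dimension, layer sizes, and variable names are placeholders chosen here, not the exact implementation:

import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    # Sketch of the decoding heads: an MLP regresses the k full trajectories,
    # and a simple linear layer produces one probability logit per prediction.
    def __init__(self, d_model=128, k=6, horizon=30):
        super().__init__()
        self.k, self.horizon = k, horizon
        self.traj_head = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, k * horizon * 2),  # k modes, horizon (x, y) points each
        )
        self.prob_head = nn.Linear(d_model, k)    # one logit per predicted mode

    def forward(self, agent_feat):
        # agent_feat: (batch, d_model) agent embedding from the GOHOME-style encoder
        trajs = self.traj_head(agent_feat).view(-1, self.k, self.horizon, 2)
        logits = self.prob_head(agent_feat)
        return trajs, logits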
The trajectory loss function in this case selects the prediction closest to the ground-truth by the usual winner-takes-all method according to minFDE$_k$. Once the closest prediction is selected, we apply a smooth L1 loss [34] between its $N$ points and the $N$ points of the ground-truth, and finally average the results to get the main trajectory loss $L_{traj}$. There is also a loss associated with the prediction probabilities, $L_{prob}$, which in our case is the same as the one used in TPCN [19]. In the end we combine these losses into $L_{main} = L_{traj} + L_{prob}$.
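For concreteness, a minimal sketch of this combined loss is given below, assuming the tensor shapes indicated in the comments; the cross-entropy term is only a stand-in, since the actual probability loss follows TPCN [19]:

import torch
import torch.nn.functional as F

def main_loss(trajs, logits, gt):
    # trajs:  (batch, k, N, 2) predicted trajectories
    # logits: (batch, k)       probability logits, one per prediction
    # gt:     (batch, N, 2)    ground-truth future trajectory
    # Winner-takes-all: pick the prediction with the smallest final displacement error.
    fde = torch.norm(trajs[:, :, -1] - gt[:, None, -1], dim=-1)   # (batch, k)
    winner = fde.argmin(dim=1)
    best = trajs[torch.arange(trajs.size(0)), winner]             # (batch, N, 2)
    # Smooth L1 between the N predicted points and the N ground-truth points.
    l_traj = F.smooth_l1_loss(best, gt)
    # Stand-in probability loss (the paper uses the TPCN formulation instead).
    l_prob = F.cross_entropy(logits, winner)
    return l_traj + l_prob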
B. Design of the Map Trajectories auxiliary task
Our aim is to design an auxiliary task that uses map
information as a way of improving map comprehension.
However, such a task needs to share important features with
the main goal of trajectory prediction in order to avoid
forgetting the information learned [35] or having conflicting
information from both tasks. One essential feature of motion
forecasting is multi-modality, and a natural approach to incorporate it into the SSL task while exploiting the map is to make the network predict, given a starting position, all the trajectories that an agent at that position could take within a given time horizon. Each trajectory can be split into two parts: the past and the future. All trajectories generated for the same starting position share the same past, but their possible futures vary.
1) Map exploration: Each data set provides its HD-map
as a graph that consists of lanelets (nodes) and edges. The
lanelets represent road sections 10 to 20 meters long on average; each contains the succession of coordinates of the center line of its associated lane segment, which also gives its direction. The edges of the graph represent the connectivity of the lanelets. For clarity, in the following we differentiate a "path", a sequence of lanelets that may be travelled by an agent, from a "trajectory", a precise sequence of coordinates taken at regular time intervals and of definite length.
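For illustration, such a lanelet graph could be represented as follows; the field names are ours and do not correspond to any specific dataset API:

from dataclasses import dataclass, field
import numpy as np

@dataclass
class Lanelet:
    # A lanelet node: a short lane section with its ordered center-line.
    lanelet_id: int
    centerline: np.ndarray                             # (P, 2) coordinates in driving direction
    successors: list = field(default_factory=list)     # IDs of lanelets reachable next
    predecessors: list = field(default_factory=list)   # IDs of lanelets leading into this one

# The local HD-map graph is then a simple lookup table from ID to lanelet.
lane_graph: dict = {}  # {lanelet_id: Lanelet}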
The pretraining data consists of a list of such lanelets, each paired with the local graph it belongs to. This allows a starting position to be easily chosen, from which all the possible paths are built using the connectivity of the graph every time a sample is taken.
From the starting lanelet, we do a depth-first search to
find all possible paths. We stop the search once a maximum
distance is reached, or when there are no more successor
nodes. We store the paths found as lists of lanelet IDs, and for each list we concatenate the corresponding center lines to create a "guide-line" for the possible trajectories.
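A possible sketch of this search and of the guide-line construction, with the connectivity, center-line lengths, and coordinates passed as plain dictionaries (names are illustrative):

import numpy as np

def enumerate_paths(start_id, successors, lengths, max_dist):
    # successors: {lanelet_id: [successor ids]}  -- graph connectivity
    # lengths:    {lanelet_id: center-line length in meters}
    # Depth-first search that stops once max_dist is exceeded or no successor remains.
    paths = []

    def dfs(node, path, dist):
        path = path + [node]
        dist += lengths[node]
        succ = successors.get(node, [])
        if dist >= max_dist or not succ:
            paths.append(path)
            return
        for nxt in succ:
            dfs(nxt, path, dist)

    dfs(start_id, [], 0.0)
    return paths

def guide_line(path, centerlines):
    # Concatenate the center-lines of the lanelets in a path into one guide-line.
    return np.concatenate([centerlines[i] for i in path], axis=0)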
2) Synthetic speeds and accelerations: In order to bring
the network input as close as possible to that of the main
trajectory prediction task, we create the ground-truths of the possible trajectories and a synthetic past history from sampled values of speed and acceleration.
We sample the agent's initial speed from a uniform distribution, which allows the model to adapt more easily to the different speed distributions found in each data set. In a certain percentage of samples, we also draw a random acceleration from a Laplace distribution centered on 0, with a scale chosen to fit the empirical data. This acceleration is kept constant throughout the past, but to create diversity in the future modalities, we add random noise to the acceleration in the future part of the trajectory, also drawn from a Laplace distribution of smaller scale.
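A minimal sketch of this sampling step is given below; every numerical parameter is an illustrative placeholder rather than a value used in our experiments:

import numpy as np

rng = np.random.default_rng()

def sample_kinematics(v_max=20.0, p_accel=0.5, accel_scale=1.0, noise_scale=0.3):
    # Initial speed: uniform, so the model is not tied to one dataset's speed distribution.
    v0 = rng.uniform(0.0, v_max)
    # Constant past acceleration, drawn from a zero-centered Laplace in a fraction of samples.
    a_past = rng.laplace(0.0, accel_scale) if rng.random() < p_accel else 0.0
    # Future acceleration: past value plus smaller-scale Laplace noise, to diversify modalities.
    a_future = a_past + rng.laplace(0.0, noise_scale)
    return v0, a_past, a_future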
3) Ground-truth and past generation: Using the guide-
line generated by the node paths, we interpolate its coor-
dinates to create a ground-truth trajectory that has constant
acceleration, consistent with the speed and acceleration sam-
pled. We take care to extrapolate the trajectory if there are not enough points, or to cut it off early once the desired number of points is reached, so that every possible trajectory
is an array of equal size. The past is generated in the same
way, except that we simply traverse the starting lanelet’s
predecessors instead of its successors for the node ID list,
and only generate one trajectory.
Finally, to simulate perception noise encountered in the real data, we add independent Gaussian noise to each step of the past trajectory.
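A possible sketch of this generation step, interpolating points at constant acceleration along a guide-line and perturbing the past with Gaussian noise (all parameters are illustrative):

import numpy as np

def trajectory_along(guide, v0, accel, n_points, dt=0.1):
    # guide: (P, 2) concatenated center-line coordinates of a path.
    # Arc-length travelled at each step: s(t) = v0*t + 0.5*accel*t^2.
    seg = np.linalg.norm(np.diff(guide, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])          # arc-length at each guide vertex
    t = dt * np.arange(1, n_points + 1)
    s = np.clip(v0 * t + 0.5 * accel * t**2, 0.0, None)
    # Interpolate along the guide-line; points beyond its end are extrapolated below.
    traj = np.stack([np.interp(s, cum, guide[:, 0]),
                     np.interp(s, cum, guide[:, 1])], axis=1)
    over = s > cum[-1]
    if over.any():
        direction = guide[-1] - guide[-2]
        direction = direction / (np.linalg.norm(direction) + 1e-9)
        traj[over] = guide[-1] + (s[over] - cum[-1])[:, None] * direction
    return traj  # always exactly n_points points

def noisy_past(past, sigma=0.2, rng=np.random.default_rng()):
    # Independent Gaussian noise on every past step, simulating perception noise.
    return past + rng.normal(0.0, sigma, size=past.shape)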
4) Network output: The network outputs $n_{pred}$ predictions of map trajectories. The loss function is based upon
the matching of predictions and ground-truths.
Most often, the number of actual ground-truths $n_{gt}$ is less than $n_{pred}$, in which case some predictions are left unmatched to any ground-truth. We then run into an assignment problem of deciding which predictions should be matched to a certain