in combination with a CNN-based network architecture [45,46,47,48,49,50]. Gao et al. [51]
show the advantages of object-level representations for motion forecasting via Graph Neural Net-
works (GNN). Several follow-ups to this work use object-level representations in combination with
Transformer-based architectures [52,53,54]. Our key distinctions when compared to these methods
are the architectural simplicity of PlanT (our use of simple self-attention transformer blocks and
the proposed route representation) as well as our closed-loop evaluation protocol (we evaluate the
driving performance in simulation and report online driving metrics).
Explainability. Explaining the decisions of neural networks is a rapidly evolving research
field [55,56,57,58,59,60,61]. In the context of self-driving cars, existing work uses text [62]
or heatmaps [63] to explain decisions. In our work, we can directly obtain post hoc explanations for
decisions of our learning-based PlanT architecture by considering its learned attention. While the
concurrent work CAPO [64] uses a similar strategy, it only considers pedestrian-ego interactions on
an empty route, whereas we consider the full planning task in an urban environment with dense traffic.
Furthermore, we introduce a simple metric to measure the quality of explanations for a planner.
3 Planning Transformers
In this section, we provide details about our task setup, novel scene representation, simple but ef-
fective architecture, and training strategy resulting in state-of-the-art performance. A PyTorch-style
pseudo-code snippet outlining PlanT and its training is provided in the supplementary material.
Task. We consider the task of point-to-point navigation in an urban setting where the goal is to drive
from a start to a goal location while reacting to other dynamic agents and following traffic rules.
We use Imitation Learning (IL) to train the driving agent. The goal of IL is to learn a policy $\pi$ that
imitates the behavior of an expert $\pi^*$ (the expert implementation is described in Section 4). In our
setup, the policy is a mapping $\pi: \mathcal{X} \rightarrow \mathcal{W}$ from our novel object-level input representation $\mathcal{X}$ to
the future trajectory $\mathcal{W}$ of an expert driver. For following traffic rules, we assume access to the state
of the next traffic light relevant to the ego vehicle, $l \in \{\text{green}, \text{red}\}$.
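To make this interface concrete, the sketch below shows one imitation-learning step for a policy that maps object tokens and the traffic-light state to future waypoints. The waypoint shapes and the L1 loss are placeholders for illustration, not the paper's exact training code (which is given as PyTorch-style pseudo-code in the supplementary material).

```python
import torch.nn as nn
import torch.nn.functional as F

def imitation_step(policy: nn.Module, tokens, light_state, expert_waypoints, optimizer):
    """One gradient step of imitation learning: fit the policy's predicted
    trajectory to the expert trajectory W for the same scene (illustrative sketch)."""
    # tokens: (B, num_objects, 6) object-level representation X_t
    # light_state: (B,) traffic-light state l (e.g., 0 = green, 1 = red)
    # expert_waypoints: (B, num_waypoints, 2) expert future trajectory W
    pred_waypoints = policy(tokens, light_state)
    loss = F.l1_loss(pred_waypoints, expert_waypoints)  # placeholder imitation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```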
Tokenization. To encode the task-specific information required from the scene, we represent it using
a set of objects, with vehicles and segments of the route each being assigned an oriented bounding
box in BEV space (Fig. 1, right). Let $\mathcal{X}_t = \mathcal{V}_t \cup \mathcal{S}_t$, where $\mathcal{V}_t \in \mathbb{R}^{V_t \times A}$ and $\mathcal{S}_t \in \mathbb{R}^{S_t \times A}$ represent the
set of vehicles and the set of route segments at time-step $t$ with $A = 6$ attributes each. Specifically, if
$o_{i,t} \in \mathcal{X}_t$ represents a particular object, the attributes of $o_{i,t}$ include an object type-specific attribute
$z_{i,t}$ (described below), the position of the bounding box $(x_{i,t}, y_{i,t})$ relative to the ego vehicle, the
orientation $\phi_{i,t} \in [0, 2\pi]$, and the extent $(w_{i,t}, h_{i,t})$. Thus, each object $o_{i,t}$ can be described as a
vector $o_{i,t} = \{z_{i,t}, x_{i,t}, y_{i,t}, \phi_{i,t}, w_{i,t}, h_{i,t}\}$, or concisely as $\{o_{i,t,a}\}_{a=1}^{6}$.
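As a concrete illustration of this layout, the following minimal sketch packs one object into the 6-attribute vector in the order defined above; the helper name is ours, not part of the paper's code.

```python
import numpy as np

def object_token(z, x, y, phi, w, h):
    """Pack one object into the 6-attribute vector {z, x, y, phi, w, h}:
    z    - type-specific attribute (speed for vehicles, ordering for route segments)
    x, y - bounding-box position relative to the ego vehicle
    phi  - orientation in [0, 2*pi]
    w, h - bounding-box extent
    """
    return np.array([z, x, y, phi, w, h], dtype=np.float32)
```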
For the vehicles Vt, we extract the attributes directly from the simulator in our main experiments
and use an off-the-shelf perception module based on CenterNet [65] (described in the supplementary
material) for experiments involving a full driving system. We consider only vehicles up to a distance
$D_{\text{max}}$ from the ego vehicle, and use $o_{i,t,1}$ (i.e., $z_{i,t}$) to represent the speed.
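A minimal sketch of how vehicle tokens could be assembled under these rules is shown below; the record fields, the cutoff value assigned to D_MAX, and the function name are assumptions for illustration rather than the paper's settings.

```python
import numpy as np

D_MAX = 30.0  # assumed value of the cutoff distance D_max (not the paper's setting)

def vehicle_tokens(vehicles):
    """Build 6-attribute tokens for all vehicles within D_MAX of the ego vehicle.
    Each record is assumed to hold speed, ego-relative position (x, y),
    orientation phi, and extent (w, h)."""
    tokens = [
        [v["speed"], v["x"], v["y"], v["phi"], v["w"], v["h"]]  # z := speed
        for v in vehicles
        if np.hypot(v["x"], v["y"]) <= D_MAX
    ]
    return np.array(tokens, dtype=np.float32).reshape(-1, 6)
```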
To obtain the route segments $\mathcal{S}_t$, we first sample a dense set of $N_t$ points $U_t \in \mathbb{R}^{N_t \times 2}$ along the
route ahead of the ego vehicle at time-step $t$. We directly use the ground-truth points from CARLA
as $U_t$ in our main experiments and predict them with a perception module for the PlanT with per-
ception experiments in Section 4.1. The points are subsampled using the Ramer-Douglas-Peucker
algorithm [66,67] to select a subset $\hat{U}_t$. One segment spans the area between two points subsam-
pled from the route, $u_{i,t}, u_{i+1,t} \in \hat{U}_t$. Specifically, $o_{i,t,1}$ (i.e., $z_{i,t}$) denotes the ordering for the
current time-step $t$, starting from 0 for the segment closest to the ego vehicle. We set the segment
length $o_{i,t,6} = \|u_{i,t} - u_{i+1,t}\|_2$, and the width, $o_{i,t,5}$, equal to the lane width. In addition, we clip
$o_{i,t,6} \leq L_{\text{max}}, \forall i, t$, and always input a fixed number of segments $N_s$ to our policy. More details
and visualizations of the route representation are provided in the supplementary material.
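The sketch below illustrates this construction with a textbook Ramer-Douglas-Peucker routine. The epsilon, lane width, and maximum-length values are placeholders, using the segment center as the token position is our assumption, and padding/truncating to the fixed number of segments $N_s$ is omitted.

```python
import numpy as np

def rdp(points, eps):
    """Textbook Ramer-Douglas-Peucker simplification of a 2D polyline."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    diff = points - start
    # perpendicular distance of each point to the start-end chord
    dists = np.abs(chord[0] * diff[:, 1] - chord[1] * diff[:, 0]) / (np.linalg.norm(chord) + 1e-8)
    idx = int(np.argmax(dists))
    if dists[idx] > eps:
        left = rdp(points[: idx + 1], eps)
        right = rdp(points[idx:], eps)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

def route_tokens(route_points, eps=0.5, lane_width=3.5, l_max=10.0):
    """Turn the dense route polyline U_t ahead of the ego vehicle into segment
    tokens (z = ordering, x, y = segment center, phi = heading,
    w = lane width, h = clipped segment length)."""
    pts = rdp(np.asarray(route_points, dtype=np.float32), eps)
    tokens = []
    for i in range(len(pts) - 1):
        a, b = pts[i], pts[i + 1]
        center = (a + b) / 2.0
        phi = np.arctan2(b[1] - a[1], b[0] - a[0])
        length = min(float(np.linalg.norm(b - a)), l_max)  # clip to L_max
        tokens.append([i, center[0], center[1], phi, lane_width, length])
    return np.array(tokens, dtype=np.float32)
```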
Token Embeddings. Our model is illustrated in Fig. 2. As a first step, applying a transformer
backbone requires the generation of embeddings for each input token, for which we define a linear
projection $\rho: \mathbb{R}^6 \rightarrow \mathbb{R}^H$ (where $H$ is the desired hidden dimensionality). To obtain token em-
beddings $e_{i,t}$, we add the projected input tokens $o_{i,t}$ to a learnable object type embedding vector
$e_v \in \mathbb{R}^H$ or $e_s \in \mathbb{R}^H$, indicating to which type the token belongs (vehicle or route segment).
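A minimal PyTorch sketch of this embedding step is given below; the hidden size and module names are ours. Adding the type embedding, rather than concatenating it, keeps the token dimensionality at $H$.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Project 6-attribute object tokens to hidden size H and add a learnable
    type embedding (index 0: vehicle -> e_v, index 1: route segment -> e_s)."""
    def __init__(self, hidden_dim: int = 256):  # hidden size H is an assumed value
        super().__init__()
        self.proj = nn.Linear(6, hidden_dim)         # linear projection rho: R^6 -> R^H
        self.type_emb = nn.Embedding(2, hidden_dim)  # learnable e_v and e_s

    def forward(self, tokens: torch.Tensor, token_types: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, 6); token_types: (B, N) integer type indices in {0, 1}
        return self.proj(tokens) + self.type_emb(token_types)
```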