PlanT: Explainable Planning Transformers via
Object-Level Representations
Katrin Renz1,2, Kashyap Chitta1,2, Otniel-Bogdan Mercea1,
A. Sophia Koepke1, Zeynep Akata1,2,3, Andreas Geiger1,2
1University of Tübingen 2Max Planck Institute for Intelligent Systems, Tübingen
3Max Planck Institute for Informatics, Saarbrücken
https://www.katrinrenz.de/plant
Abstract: Planning an optimal route in a complex environment requires efficient
reasoning about the surrounding scene. While human drivers prioritize important
objects and ignore details not relevant to the decision, learning-based planners
typically extract features from dense, high-dimensional grid representations con-
taining all vehicle and road context information. In this paper, we propose PlanT,
a novel approach for planning in the context of self-driving that uses a standard
transformer architecture. PlanT is based on imitation learning with a compact
object-level input representation. On the Longest6 benchmark for CARLA, PlanT
outperforms all prior methods (matching the driving score of the expert) while
being 5.3× faster than equivalent pixel-based planning baselines during inference.
Combining PlanT with an off-the-shelf perception module provides a sensor-
based driving system that is more than 10 points better in terms of driving score
than the existing state of the art. Furthermore, we propose an evaluation protocol
to quantify the ability of planners to identify relevant objects, providing insights
regarding their decision-making. Our results indicate that PlanT can focus on the
most relevant object in the scene, even when this object is geometrically distant.
Keywords: Autonomous Driving, Transformers, Explainability
1 Introduction
The ability to plan is an important aspect of human intelligence, allowing us to solve complex nav-
igation tasks. For example, to change lanes on a busy highway, a driver must wait for sufficient
space in the new lane and adjust the speed based on the expected behavior of the other vehicles. Hu-
mans quickly learn this and can generalize to new scenarios, a trait we would also like autonomous
agents to have. Due to the difficulty of the planning task, the field of autonomous driving is shifting
away from traditional rule-based algorithms [1,2,3,4,5,6,7,8] towards learning-based solu-
tions [9,10,11,12,13,14]. Learning-based planners directly map the environmental state represen-
tation (e.g., HD maps and object bounding boxes) to waypoints or vehicle controls. They emerged
as a scalable alternative to rule-based planners which require significant manual effort to design.
Interestingly, while humans reason about the world in terms of objects [15,16,17], most existing
learned planners [9,12,18] choose a high-dimensional pixel-level input representation by rendering
bird’s eye view (BEV) images of detailed HD maps (Fig. 1, left). It is widely believed that this
kind of accurate scene understanding is key for robust self-driving vehicles, leading to significant
interest in recovering pixel-level BEV information from sensor inputs [19,20,21,22,23,24]. In
this paper, we investigate whether such detailed representations are actually necessary to achieve
convincing planning performance. We propose PlanT, a learning-based planner that leverages an
object-level representation (Fig. 1, right) as an input to a transformer encoder [25]. We represent a
scene as a set of features corresponding to (1) nearby vehicles and (2) the route the planner must
follow. We show that despite the low feature dimensionality, our model achieves state-of-the-art
results. We then propose a novel evaluation scheme and metric to analyze explainability which is
generally applicable to any learning-based planner. Specifically, we test the ability of a planner to
identify the objects that are the most relevant to account for to plan a collision-free route.
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.
arXiv:2210.14222v1 [cs.RO] 25 Oct 2022
Figure 1: Scene Representations for Planning. As an alternative to the dominant paradigm of
pixel-level planners (left), we show the effectiveness of compact object-level representations (right).
We perform a detailed empirical analysis of learning-based planning on the Longest6 bench-
mark [26] of the CARLA simulator [27]. We first identify the key missing elements in the design of
existing learned planners such as their incomplete field of view and sub-optimal dataset and model
sizes. We then show the advantages of our proposed transformer architecture, including improve-
ments in performance and significantly faster inference times. Finally, we show that the attention
weights of the transformer, which are readily accessible, can be used to represent object relevance.
Our qualitative and quantitative results on explainability confirm that PlanT attends to the objects
that match our intuition for the relevance of objects for safe driving.
Contributions. (1) Using a simple object-level representation, we significantly improve upon the
previous state of the art for planning on CARLA via PlanT, our novel transformer-based approach.
(2) Through a comprehensive experimental study, we identify that the ego vehicle’s route, a full
360° field of view, and information about vehicle speeds are critical elements of a planner’s input
representation. (3) We propose a protocol and metric for evaluating a planner’s prioritization of
obstacles in a scene and show that PlanT is more explainable than CNN-based methods, i.e., the
attention weights of the transformer identify the most relevant objects more reliably.
2 Related Work
Intermediate Representations for Driving. Early work on decoupling end-to-end driving into two
stages predicts a set of low-dimensional affordances from sensor inputs with CNNs which are then
input to a rule-based planner [28]. These affordances are scene-descriptive attributes (e.g. emergency
brake, red light, center-line distance, angle) that are compact, yet comprehensive enough to enable
simple driving tasks, such as urban driving on the initial version of CARLA [27]. Unfortunately,
methods based on affordances perform poorly on subsequent benchmarks in CARLA which involve
higher task complexity [29]. Most state-of-the-art driving models instead rely heavily on annotated
2D data either as intermediate representations or auxiliary training objectives [26,30]. Several sub-
sequent studies show that using semantic segmentation as an intermediate representation helps for
navigational tasks [31,32,33,34]. More recently, there has been a rapid growth in interest in using
BEV semantic segmentation maps as the input representation to planners [9,12,30,18]. To reduce
the immense labeling cost of such segmentation methods, Behl et al. [35] propose visual abstrac-
tions, which are label-efficient alternatives to dense 2D semantic segmentation maps. They show
that reduced class counts and the use of bounding boxes instead of pixel-accurate masks for certain
classes is sufficient. Wang et al. [36] explore the use of object-centric representations for planning
by explicitly extracting objects and rendering them into a BEV input for a planner. However, so
far, the literature lacks a systematic analysis of whether object-centric representations are better or
worse than BEV context techniques for planning in dense traffic, which we address in this work.
We keep our representation simple and compact by directly considering the set of objects as inputs
to our models. In addition to baselines using CNNs to process the object-centric representation, we
show that using a transformer leads to improved performance, efficiency, and explainability.
Transformers for Forecasting. Transformers obtain impressive results in several research ar-
eas [25,37,38,39], including simple interactive environments such as Atari games [40,41,42,
43,44]. While the end objective differs, one application domain that involves similar challenges
to planning is motion forecasting. Most existing motion forecasting methods use a rasterized input
in combination with a CNN-based network architecture [45,46,47,48,49,50]. Gao et al. [51]
show the advantages of object-level representations for motion forecasting via Graph Neural Net-
works (GNN). Several follow-ups to this work use object-level representations in combination with
Transformer-based architectures [52,53,54]. Our key distinctions when compared to these methods
are the architectural simplicity of PlanT (our use of simple self-attention transformer blocks and
the proposed route representation) as well as our closed-loop evaluation protocol (we evaluate the
driving performance in simulation and report online driving metrics).
Explainability. Explaining the decisions of neural networks is a rapidly evolving research
field [55,56,57,58,59,60,61]. In the context of self-driving cars, existing work uses text [62]
or heatmaps [63] to explain decisions. In our work, we can directly obtain post hoc explanations for
decisions of our learning-based PlanT architecture by considering its learned attention. While the
concurrent work CAPO [64] uses a similar strategy, it only considers pedestrian-ego interactions on
an empty route, while we consider the full planning task in an urban environment with dense traffic.
Furthermore, we introduce a simple metric to measure the quality of explanations for a planner.
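As a rough illustration of this idea, the sketch below ranks input objects by the attention mass they receive from the ego token in one self-attention layer. Averaging over heads and the choice of ego token index are assumptions for illustration, not necessarily the paper's exact aggregation scheme.

```python
import torch

def rank_objects_by_attention(attn: torch.Tensor, ego_index: int = 0) -> torch.Tensor:
    """Rank object tokens by the attention they receive from the ego token.

    attn: (num_heads, num_tokens, num_tokens) attention weights from one
          self-attention layer (rows attend over columns).
    Returns token indices sorted from most to least relevant.
    """
    ego_attention = attn.mean(dim=0)[ego_index]   # average over heads, take the ego row
    return torch.argsort(ego_attention, descending=True)
```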
3 Planning Transformers
In this section, we provide details about our task setup, novel scene representation, simple but ef-
fective architecture, and training strategy resulting in state-of-the-art performance. A PyTorch-style
pseudo-code snippet outlining PlanT and its training is provided in the supplementary material.
Task. We consider the task of point-to-point navigation in an urban setting where the goal is to drive
from a start to a goal location while reacting to other dynamic agents and following traffic rules.
We use Imitation Learning (IL) to train the driving agent. The goal of IL is to learn a policy π that
imitates the behavior of an expert π* (the expert implementation is described in Section 4). In our
setup, the policy is a mapping π : X → W from our novel object-level input representation X to
the future trajectory W of an expert driver. For following traffic rules, we assume access to the state
of the next traffic light relevant to the ego vehicle, l ∈ {green, red}.
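To make the mapping concrete, the following PyTorch-style sketch shows one possible imitation-learning training step. The L1 waypoint loss, the policy call signature, and the batch field names are assumptions made for illustration; the paper's actual pseudo-code is in its supplementary material.

```python
import torch.nn.functional as F

def il_training_step(policy, batch, optimizer):
    """One imitation-learning step: regress the expert's future trajectory.

    Assumed (hypothetical) batch layout:
      batch["objects"]     - object-level scene representation X_t
      batch["light_state"] - traffic-light state l, encoded e.g. as 0 (green) / 1 (red)
      batch["expert_wp"]   - expert future waypoints W, shape (B, T, 2)
    """
    pred_wp = policy(batch["objects"], batch["light_state"])  # pi: X -> W
    loss = F.l1_loss(pred_wp, batch["expert_wp"])             # imitate the expert trajectory
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```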
Tokenization. To encode the task-specific information required from the scene, we represent it using
a set of objects, with vehicles and segments of the route each being assigned an oriented bounding
box in BEV space (Fig. 1, right). Let X_t = V_t ∪ S_t, where V_t ∈ R^{V_t × A} and S_t ∈ R^{S_t × A} represent the
set of vehicles and the set of route segments at time-step t, with A = 6 attributes each. Specifically, if
o_{i,t} ∈ X_t represents a particular object, the attributes of o_{i,t} include an object type-specific attribute
z_{i,t} (described below), the position of the bounding box (x_{i,t}, y_{i,t}) relative to the ego vehicle, the
orientation ϕ_{i,t} ∈ [0, 2π], and the extent (w_{i,t}, h_{i,t}). Thus, each object o_{i,t} can be described as a
vector o_{i,t} = {z_{i,t}, x_{i,t}, y_{i,t}, ϕ_{i,t}, w_{i,t}, h_{i,t}}, or concisely as {o_{i,t,a}}_{a=1}^{6}.
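A minimal sketch of this 6-attribute token follows; the field order mirrors the vector o_{i,t} = {z, x, y, ϕ, w, h}, while the dataclass name and the `as_vector` helper are ours, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class ObjectToken:
    """One row of X_t: an oriented bounding box with A = 6 attributes."""
    z: float    # type-specific attribute: speed (vehicle) or ordering index (route segment)
    x: float    # position relative to the ego vehicle
    y: float
    phi: float  # orientation in [0, 2*pi]
    w: float    # box extent: width
    h: float    # box extent: length

    def as_vector(self) -> list:
        return [self.z, self.x, self.y, self.phi, self.w, self.h]
```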
For the vehicles V_t, we extract the attributes directly from the simulator in our main experiments
and use an off-the-shelf perception module based on CenterNet [65] (described in the supplementary
material) for experiments involving a full driving system. We consider only vehicles up to a distance
D_max from the ego vehicle, and use o_{i,t,1} (i.e., z_{i,t}) to represent the speed.
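A sketch of how vehicle tokens could be assembled, reusing the ObjectToken sketch above. The D_max filtering and the use of speed as the type-specific attribute follow the text; the input format (a list of dicts with ego-relative fields) and all names are assumptions.

```python
import math

def build_vehicle_tokens(vehicles, d_max):
    """Keep vehicles within d_max of the ego and pack them into 6-attribute tokens.

    `vehicles` is assumed to be a list of dicts with ego-relative fields:
    x, y (position), yaw (orientation), width, length, speed.
    """
    tokens = []
    for v in vehicles:
        if math.hypot(v["x"], v["y"]) > d_max:
            continue  # outside the considered radius
        tokens.append(ObjectToken(
            z=v["speed"],                  # o_{i,t,1}: speed for vehicle tokens
            x=v["x"], y=v["y"],
            phi=v["yaw"] % (2 * math.pi),  # orientation wrapped to [0, 2*pi)
            w=v["width"], h=v["length"],
        ))
    return tokens
```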
To obtain the route segments S_t, we first sample a dense set of N_t points U_t ∈ R^{N_t × 2} along the
route ahead of the ego vehicle at time-step t. We directly use the ground-truth points from CARLA
as U_t in our main experiments and predict them with a perception module for the PlanT with
perception experiments in Section 4.1. The points are subsampled using the Ramer-Douglas-Peucker
algorithm [66, 67] to select a subset Û_t. One segment spans the area between two points subsampled
from the route, u_{i,t}, u_{i+1,t} ∈ Û_t. Specifically, o_{i,t,1} (i.e., z_{i,t}) denotes the ordering for the
current time-step t, starting from 0 for the segment closest to the ego vehicle. We set the segment
length o_{i,t,6} = ||u_{i,t} − u_{i+1,t}||_2, and the width, o_{i,t,5}, equal to the lane width. In addition, we clip
o_{i,t,6} ≤ L_max ∀ i, t, and always input a fixed number of segments N_s to our policy. More details
and visualizations of the route representation are provided in the supplementary material.
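A possible segment construction is sketched below, again reusing the ObjectToken sketch above. It assumes the route points have already been subsampled (e.g., with a Ramer-Douglas-Peucker implementation); the ordering index as z, the clipped length as the 6th attribute, and the lane width as the 5th follow the text, while the box anchor (segment midpoint) and the zero-padding scheme for a fixed N_s are assumptions.

```python
import math

def build_route_tokens(points, lane_width, l_max, n_segments):
    """Turn subsampled route points into route-segment tokens.

    `points` is an ordered list of (x, y) route points relative to the ego,
    already reduced with Ramer-Douglas-Peucker; the first point is the closest.
    """
    tokens = []
    for i in range(len(points) - 1):
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        dx, dy = x1 - x0, y1 - y0
        length = min(math.hypot(dx, dy), l_max)       # clip o_{i,t,6} <= L_max
        tokens.append(ObjectToken(
            z=float(i),                               # ordering index, 0 = closest segment
            x=(x0 + x1) / 2, y=(y0 + y1) / 2,         # segment midpoint (assumed anchor)
            phi=math.atan2(dy, dx) % (2 * math.pi),   # segment heading
            w=lane_width,                             # o_{i,t,5}: lane width
            h=length,                                 # o_{i,t,6}: clipped segment length
        ))
    # Fixed number of segments N_s: truncate or pad with zero tokens (assumed scheme).
    tokens = tokens[:n_segments]
    while len(tokens) < n_segments:
        tokens.append(ObjectToken(0.0, 0.0, 0.0, 0.0, 0.0, 0.0))
    return tokens
```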
Token Embeddings. Our model is illustrated in Fig. 2. As a first step, applying a transformer
backbone requires the generation of embeddings for each input token, for which we define a linear
projection ρ : R^6 → R^H (where H is the desired hidden dimensionality). To obtain token
embeddings e_{i,t}, we add the projected input tokens ρ(o_{i,t}) to a learnable object type embedding vector
e^v ∈ R^H or e^s ∈ R^H, indicating to which type the token belongs (vehicle or route segment).
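A minimal PyTorch sketch of this embedding step: a linear projection ρ : R^6 → R^H plus a learnable type embedding added to each projected token (row 0 standing in for e^v, row 1 for e^s). The module and argument names are ours; storing the two type vectors in an nn.Embedding is one convenient implementation choice, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Projects 6-attribute object tokens to R^H and adds a learnable type embedding."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(6, hidden_dim)         # rho: R^6 -> R^H
        self.type_emb = nn.Embedding(2, hidden_dim)  # row 0: vehicle e^v, row 1: route segment e^s

    def forward(self, tokens: torch.Tensor, token_types: torch.Tensor) -> torch.Tensor:
        # tokens:      (B, N, 6) object attribute vectors
        # token_types: (B, N) long tensor, 0 for vehicles, 1 for route segments
        return self.proj(tokens) + self.type_emb(token_types)
```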