in combination with a CNN-based network architecture [45,46,47,48,49,50]. Gao et al. [51]
show the advantages of object-level representations for motion forecasting via Graph Neural Net-
works (GNN). Several follow-ups to this work use object-level representations in combination with
Transformer-based architectures [52,53,54]. Our key distinctions when compared to these methods
are the architectural simplicity of PlanT (our use of simple self-attention transformer blocks and
the proposed route representation) as well as our closed-loop evaluation protocol (we evaluate the
driving performance in simulation and report online driving metrics).
Explainability. Explaining the decisions of neural networks is a rapidly evolving research
field [55,56,57,58,59,60,61]. In the context of self-driving cars, existing work uses text [62]
or heatmaps [63] to explain decisions. In our work, we can directly obtain post hoc explanations for
decisions of our learning-based PlanT architecture by considering its learned attention. While the
concurrent work CAPO [64] uses a similar strategy, it only considers pedestrian-ego interactions on
an empty route, whereas we consider the full planning task in an urban environment with dense traffic.
Furthermore, we introduce a simple metric to measure the quality of explanations for a planner.
3 Planning Transformers
In this section, we provide details about our task setup, novel scene representation, simple but ef-
fective architecture, and training strategy resulting in state-of-the-art performance. A PyTorch-style
pseudo-code snippet outlining PlanT and its training is provided in the supplementary material.
Task. We consider the task of point-to-point navigation in an urban setting where the goal is to drive
from a start to a goal location while reacting to other dynamic agents and following traffic rules.
We use Imitation Learning (IL) to train the driving agent. The goal of IL is to learn a policy $\pi$ that
imitates the behavior of an expert $\pi^*$ (the expert implementation is described in Section 4). In our
setup, the policy is a mapping $\pi: \mathcal{X} \rightarrow \mathcal{W}$ from our novel object-level input representation $\mathcal{X}$ to
the future trajectory $\mathcal{W}$ of an expert driver. For following traffic rules, we assume access to the state
of the next traffic light relevant to the ego vehicle, $l \in \{\text{green}, \text{red}\}$.
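To make this interface concrete, the sketch below shows one imitation-learning step for a policy that maps object tokens and the traffic-light state to future waypoints. The waypoint shapes and the L1 loss are placeholders for illustration, not the paper's exact training code (which is given as PyTorch-style pseudo-code in the supplementary material).

```python
import torch.nn as nn
import torch.nn.functional as F

def imitation_step(policy: nn.Module, tokens, light_state, expert_waypoints, optimizer):
    """One gradient step of imitation learning: fit the policy's predicted
    trajectory to the expert trajectory W for the same scene (illustrative sketch)."""
    # tokens: (B, num_objects, 6) object-level representation X_t
    # light_state: (B,) traffic-light state l (e.g., 0 = green, 1 = red)
    # expert_waypoints: (B, num_waypoints, 2) expert future trajectory W
    pred_waypoints = policy(tokens, light_state)
    loss = F.l1_loss(pred_waypoints, expert_waypoints)  # placeholder imitation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```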
Tokenization. To encode the task-specific information required from the scene, we represent it using
a set of objects, with vehicles and segments of the route each being assigned an oriented bounding
box in BEV space (Fig. 1, right). Let $\mathcal{X}_t = \mathcal{V}_t \cup \mathcal{S}_t$, where $\mathcal{V}_t \in \mathbb{R}^{V_t \times A}$ and $\mathcal{S}_t \in \mathbb{R}^{S_t \times A}$ represent the
set of vehicles and the set of route segments at time-step $t$ with $A = 6$ attributes each. Specifically, if
$o_{i,t} \in \mathcal{X}_t$ represents a particular object, the attributes of $o_{i,t}$ include an object type-specific attribute
$z_{i,t}$ (described below), the position of the bounding box $(x_{i,t}, y_{i,t})$ relative to the ego vehicle, the
orientation $\phi_{i,t} \in [0, 2\pi]$, and the extent $(w_{i,t}, h_{i,t})$. Thus, each object $o_{i,t}$ can be described as a
vector $o_{i,t} = \{z_{i,t}, x_{i,t}, y_{i,t}, \phi_{i,t}, w_{i,t}, h_{i,t}\}$, or concisely as $\{o_{i,t,a}\}_{a=1}^{6}$.
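As a concrete illustration of this layout, the following minimal sketch packs one object into the 6-attribute vector in the order defined above; the helper name is ours, not part of the paper's code.

```python
import numpy as np

def object_token(z, x, y, phi, w, h):
    """Pack one object into the 6-attribute vector {z, x, y, phi, w, h}:
    z    - type-specific attribute (speed for vehicles, ordering for route segments)
    x, y - bounding-box position relative to the ego vehicle
    phi  - orientation in [0, 2*pi]
    w, h - bounding-box extent
    """
    return np.array([z, x, y, phi, w, h], dtype=np.float32)
```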
For the vehicles Vt, we extract the attributes directly from the simulator in our main experiments
and use an off-the-shelf perception module based on CenterNet [65] (described in the supplementary
material) for experiments involving a full driving system. We consider only vehicles up to a distance
$D_{\text{max}}$ from the ego vehicle, and use $o_{i,t,1}$ (i.e., $z_{i,t}$) to represent the speed.
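A minimal sketch of how vehicle tokens could be assembled under these rules is shown below; the record fields, the cutoff value assigned to D_MAX, and the function name are assumptions for illustration rather than the paper's settings.

```python
import numpy as np

D_MAX = 30.0  # assumed value of the cutoff distance D_max (not the paper's setting)

def vehicle_tokens(vehicles):
    """Build 6-attribute tokens for all vehicles within D_MAX of the ego vehicle.
    Each record is assumed to hold speed, ego-relative position (x, y),
    orientation phi, and extent (w, h)."""
    tokens = [
        [v["speed"], v["x"], v["y"], v["phi"], v["w"], v["h"]]  # z := speed
        for v in vehicles
        if np.hypot(v["x"], v["y"]) <= D_MAX
    ]
    return np.array(tokens, dtype=np.float32).reshape(-1, 6)
```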
To obtain the route segments $\mathcal{S}_t$, we first sample a dense set of $N_t$ points $U_t \in \mathbb{R}^{N_t \times 2}$ along the
route ahead of the ego vehicle at time-step $t$. We directly use the ground-truth points from CARLA
as $U_t$ in our main experiments and predict them with a perception module for the PlanT with per-
ception experiments in Section 4.1. The points are subsampled using the Ramer-Douglas-Peucker
algorithm [66,67] to select a subset $\hat{U}_t$. One segment spans the area between two points subsam-
pled from the route, $u_{i,t}, u_{i+1,t} \in \hat{U}_t$. Specifically, $o_{i,t,1}$ (i.e., $z_{i,t}$) denotes the ordering for the
current time-step $t$, starting from 0 for the segment closest to the ego vehicle. We set the segment
length $o_{i,t,6} = \|u_{i,t} - u_{i+1,t}\|_2$, and the width, $o_{i,t,5}$, equal to the lane width. In addition, we clip
$o_{i,t,6} \leq L_{\text{max}}, \forall i, t$, and always input a fixed number of segments $N_s$ to our policy. More details
and visualizations of the route representation are provided in the supplementary material.
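The sketch below illustrates this construction with a textbook Ramer-Douglas-Peucker routine. The epsilon, lane width, and maximum-length values are placeholders, using the segment center as the token position is our assumption, and padding/truncating to the fixed number of segments $N_s$ is omitted.

```python
import numpy as np

def rdp(points, eps):
    """Textbook Ramer-Douglas-Peucker simplification of a 2D polyline."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    diff = points - start
    # perpendicular distance of each point to the start-end chord
    dists = np.abs(chord[0] * diff[:, 1] - chord[1] * diff[:, 0]) / (np.linalg.norm(chord) + 1e-8)
    idx = int(np.argmax(dists))
    if dists[idx] > eps:
        left = rdp(points[: idx + 1], eps)
        right = rdp(points[idx:], eps)
        return np.vstack([left[:-1], right])
    return np.vstack([start, end])

def route_tokens(route_points, eps=0.5, lane_width=3.5, l_max=10.0):
    """Turn the dense route polyline U_t ahead of the ego vehicle into segment
    tokens (z = ordering, x, y = segment center, phi = heading,
    w = lane width, h = clipped segment length)."""
    pts = rdp(np.asarray(route_points, dtype=np.float32), eps)
    tokens = []
    for i in range(len(pts) - 1):
        a, b = pts[i], pts[i + 1]
        center = (a + b) / 2.0
        phi = np.arctan2(b[1] - a[1], b[0] - a[0])
        length = min(float(np.linalg.norm(b - a)), l_max)  # clip to L_max
        tokens.append([i, center[0], center[1], phi, lane_width, length])
    return np.array(tokens, dtype=np.float32)
```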
Token Embeddings. Our model is illustrated in Fig. 2. As a first step, applying a transformer
backbone requires the generation of embeddings for each input token, for which we define a linear
projection $\rho: \mathbb{R}^6 \rightarrow \mathbb{R}^H$ (where $H$ is the desired hidden dimensionality). To obtain token em-
beddings $e_{i,t}$, we add the projected input tokens $o_{i,t}$ to a learnable object type embedding vector
$e_v \in \mathbb{R}^H$ or $e_s \in \mathbb{R}^H$, indicating to which type the token belongs (vehicle or route segment).
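A minimal PyTorch sketch of this embedding step is given below; the hidden size and module names are ours. Adding the type embedding, rather than concatenating it, keeps the token dimensionality at $H$.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Project 6-attribute object tokens to hidden size H and add a learnable
    type embedding (index 0: vehicle -> e_v, index 1: route segment -> e_s)."""
    def __init__(self, hidden_dim: int = 256):  # hidden size H is an assumed value
        super().__init__()
        self.proj = nn.Linear(6, hidden_dim)         # linear projection rho: R^6 -> R^H
        self.type_emb = nn.Embedding(2, hidden_dim)  # learnable e_v and e_s

    def forward(self, tokens: torch.Tensor, token_types: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, 6); token_types: (B, N) integer type indices in {0, 1}
        return self.proj(tokens) + self.type_emb(token_types)
```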