
[Figure 1, left: architecture diagram. Labels: Observations, Tokenizer, Transformer Encoder, Latent Space, Causal Transformer, Hypothesis 1 ... Hypothesis K.]
Table (Figure 1, right): Representations and prediction types in common approaches.

Method                    | Rep.   | Maps | Cam. | Partial obs. | Stochast. | Prediction type
--------------------------|--------|------|------|--------------|-----------|----------------
Chai et al. [6]           | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Ivanovic et al. [5]       | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Gu et al. [22]            | Vector | ✓    | ✗    | ✗            | Goal      | Per-agent
Nayakanti et al. [16]     | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Shi et al. [23]           | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Itkina et al. [1]         | L-OGM  | ✗    | ✗    | ✓            | ✗         | Scene
Lange et al. [3]          | L-OGM  | ✗    | ✗    | ✓            | ✗         | Scene
Toyungyernsub et al. [24] | L-OGM  | ✗    | ✗    | ✓            | ✗         | Scene
Mahjourian et al. [8]     | V-OGM  | ✓    | ✗    | ✗            | ✗         | Scene
Mersch et al. [25]        | PCL    | ✗    | ✗    | ✓            | ✗         | Scene
Wu et al. [26]            | PCL    | ✓    | ✓    | ✓            | ✗         | Scene
LOPR (ours)               | L-OGM  | ✓    | ✓    | ✓            | Variat.   | Scene
Figure 1: (Left) Latent Occupancy PRediction (LOPR). We decouple the prediction task into task-independent representation learning and task-dependent prediction in the latent space. (Right) Comparison with other approaches in terms of representation type, sensors, stochasticity assumptions, and prediction type. Only LOPR makes stochastic predictions of the scene conditioned on all sensors without the need for manually labeled data.
Given these challenges, occupancy grid maps generated from LiDAR measurements (L-OGMs) have gained popularity as a scene representation for prediction. This popularity is due to their minimal data preprocessing requirements, which eliminate the need for manual labeling, their ability to model the joint prediction of a scene with an arbitrary number of agents (including interactions between agents), and their robustness to partial observability and detection failures [1–3, 17]. In addition, the sole requirement for their deployment is a LiDAR sensor, simplifying transfer between different platforms. We focus on end-to-end prediction of ego-centric L-OGMs generated with uncertainty-aware occupancy state estimation approaches [18]. Due to its generality and ability to scale with unlabeled data, we hypothesize that such an L-OGM prediction framework could also serve as a pre-training objective, i.e., a foundation model, for supervised tasks such as trajectory prediction.
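To make the input representation concrete, the sketch below rasterizes a single ego-centric LiDAR scan into a naive binary occupancy grid. It is a minimal illustration with assumed grid parameters; the L-OGMs used in this work come from an uncertainty-aware occupancy state estimator [18], not from raw hit counts.

    import numpy as np

    def lidar_to_ogm(points_xy, grid_size=128, extent_m=40.0):
        """Rasterize ego-centric LiDAR returns into a naive occupancy grid.

        points_xy: (N, 2) array of LiDAR hits in the ego frame [m].
        Returns a (grid_size, grid_size) grid with values in {0, 1}.
        Grid size and metric extent are illustrative assumptions.
        """
        res = 2.0 * extent_m / grid_size                    # meters per cell
        cells = ((points_xy + extent_m) / res).astype(int)  # ego at grid center
        inside = np.all((cells >= 0) & (cells < grid_size), axis=1)
        grid = np.zeros((grid_size, grid_size), dtype=np.float32)
        grid[cells[inside, 1], cells[inside, 0]] = 1.0      # mark occupied cells
        return grid

    # Example: rasterize a synthetic scan of 1000 random returns.
    ogm = lidar_to_ogm(np.random.uniform(-40.0, 40.0, size=(1000, 2)))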
The task of OGM prediction is typically approached similarly to video prediction, by framing the problem as self-supervised sequence-to-sequence learning. In this approach, a scenario is divided into a history sequence and a target prediction sequence. ConvLSTM-based architectures [19] have been used in previous work for this task due to their ability to handle the spatiotemporal representation of inputs and outputs [1–3, 20, 21]. These approaches are optimized end-to-end in grid cell space, do not account for the stochasticity present in the scene, and neglect other available sensor modalities, e.g., RGB cameras and high definition (HD) maps. As a result, they suffer from blurry predictions, especially at longer time horizons. We propose a prediction framework that reasons over potential futures in the latent space of generative models. It is trained on sensor modalities such as L-OGMs, 2D RGB cameras, and maps without the need for manual labeling. We illustrate our framework in Fig. 1 and compare it with other methods.
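As a sketch of this conventional setup, the snippet below splits a scenario into history and target sequences and computes the per-cell objective typically used to train ConvLSTM-style predictors end-to-end in grid cell space. The sequence lengths and the stand-in forecast are assumptions for illustration; averaging a per-cell loss over all possible futures is one reason such predictions blur at longer horizons.

    import torch
    import torch.nn.functional as F

    def split_scenario(ogm_seq, history_len=5):
        """Split a (T, C, H, W) OGM scenario into conditioning frames and
        the frames the model must predict. Lengths are illustrative."""
        return ogm_seq[:history_len], ogm_seq[history_len:]

    def grid_cell_loss(pred_logits, target):
        """Per-cell binary cross-entropy: the typical grid-space objective
        for deterministic OGM predictors."""
        return F.binary_cross_entropy_with_logits(pred_logits, target)

    # Example with a synthetic 15-frame scenario of 128x128 grids.
    scenario = (torch.rand(15, 1, 128, 128) > 0.5).float()
    history, target = split_scenario(scenario)
    pred_logits = torch.zeros_like(target)  # stand-in for a ConvLSTM forecast
    loss = grid_cell_loss(pred_logits, target)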
Recent work has shown generative models can produce high-quality [27, 28] and controllable [29–31] samples. In robotics, generative models have been used to find compact representations of images in planning [32–34], control [35–37], and simulation [38]. We claim that generative models are
similarly capable of accurately encoding and decoding L-OGMs, alongside providing a controllable
latent space for high-quality predictions. We employ a generative model to learn a low-dimensional
latent space, which encodes the features needed to generate realistic predictions and makes use of
available input modalities, such as L-OGM, RGB camera, and map-based observations. We then
train a stochastic prediction network in this latent space to capture the dynamics of the scene.
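A minimal sketch of this decoupled design follows: observed frames are first encoded to latents by a pretrained generative model (not shown), and a stochastic sequence model predicts a distribution over the next latent, from which multiple futures can be sampled and decoded. The module names, sizes, and GRU backbone are assumptions for illustration, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class StochasticLatentPredictor(nn.Module):
        """Predict a distribution over the next latent given observed latents.
        A schematic stand-in for the prediction network; sizes are assumed."""
        def __init__(self, latent_dim=256, hidden_dim=512):
            super().__init__()
            self.rnn = nn.GRU(latent_dim, hidden_dim, batch_first=True)
            self.to_mu = nn.Linear(hidden_dim, latent_dim)
            self.to_logvar = nn.Linear(hidden_dim, latent_dim)

        def forward(self, z_history):
            # z_history: (B, T, latent_dim) latents of the observed frames.
            h, _ = self.rnn(z_history)
            mu = self.to_mu(h[:, -1])
            logvar = self.to_logvar(h[:, -1])
            # Reparameterized sample: one stochastic hypothesis per call.
            z_next = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return z_next, mu, logvar

    # Sampling the predictor repeatedly yields multiple future hypotheses,
    # each decoded back to L-OGMs by the frozen generative decoder (not shown).
    predictor = StochasticLatentPredictor()
    z_hist = torch.randn(2, 5, 256)  # batch of 2 scenes, 5 observed frames
    z_next, mu, logvar = predictor(z_hist)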
Existing object-based methods use a vectorized representation to predict trajectories [5, 6, 22] or vectorized OGMs (V-OGMs) [8], overlooking important perceptual cues in their predictions. Prior L-OGM-based works [1, 3, 25] do not use the available sensor modalities and consider only deterministic predictions. Our framework addresses these weaknesses through the following contributions:
• We introduce a framework named Latent Occupancy PRediction (LOPR), which performs
stochastic L-OGM prediction in the latent space of a generative architecture conditioned
on other available sensor modalities, like RGB cameras and maps.