LOPR: Latent Occupancy PRediction using
Generative Models
Bernard Lange, Masha Itkina, and Mykel J. Kochenderfer
Department of Aeronautics and Astronautics
Stanford University
United States
{blange, mitkina, mykel}@stanford.edu
Abstract: Environment prediction frameworks are integral for autonomous vehicles, enabling safe navigation in dynamic environments. LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's eye-view scene representation that facilitates joint scene prediction without relying on manual labeling, unlike commonly used trajectory prediction frameworks. Prior approaches have optimized deterministic L-OGM prediction architectures directly in grid cell space. While these methods have achieved some degree of success, they occasionally produce unrealistic and incorrect predictions. We claim that the quality and realism of the forecasted occupancy grids can be enhanced with the use of generative models. We propose a framework that decouples occupancy prediction into representation learning and stochastic prediction within the learned latent space. Our approach allows for conditioning the model on other available sensor modalities, such as RGB cameras and high-definition maps. We demonstrate that our approach achieves state-of-the-art performance and is readily transferable between different robotic platforms, evaluating on the real-world NuScenes and Waymo Open datasets and on a custom dataset collected on an experimental vehicle platform.
Keywords: Occupancy Prediction, Autonomous Driving, Generative Models
1 Introduction
Accurate environment prediction algorithms are essential for autonomous vehicle (AV) navigation
in urban settings. Experienced drivers understand scene semantics and recognize the intents of other
agents to anticipate their trajectories and safely navigate to their destination. To replicate this process
in AVs and other robotic platforms, many environment prediction approaches have been proposed,
employing different environment representations and modeling assumptions [19].
The modern AV stack comprises several sequential modules trained independently on labeled data. For environment reasoning, object-based prediction algorithms are commonly used [4, 6, 7, 10], which rely on perception systems to create a vectorized representation of the scene with pre-defined agents and environmental features. However, this approach has multiple limitations. 1) It generates marginalized future trajectories for individual agents rather than a holistic scene prediction, which complicates integration with planning modules [11]. 2) It relies on labeled data, sourced either manually or from off-board perception systems [12, 13], which diverges from the noisy detections available on-board. 3) It does not take any sensor measurements into account and depends solely on object detection algorithms, which can fail in suboptimal conditions [14, 15]. These approaches also exclude social and topological cues that humans naturally perceive, emphasizing the importance of end-to-end perception-prediction learning [16]. These drawbacks render the AV stack susceptible to cascading failures and can lead to poor generalization to unforeseen scenarios. These limitations underscore the need for alternative end-to-end, self-supervised environment modeling approaches.
[Figure 1, left panel: diagram of the LOPR architecture, showing observations, a tokenizer, a transformer encoder, the latent space, a causal transformer, and K prediction hypotheses (Hypothesis 1, Hypothesis 2, ..., Hypothesis K).]
Method                    | Rep.   | Maps / Cam. / Partial obs. | Stochast. | Prediction
Chai et al. [6]           | Vector | ✗ ✗                        | GMM       | Per-agent
Ivanovic et al. [5]       | Vector | ✗ ✗                        | GMM       | Per-agent
Gu et al. [22]            | Vector | ✗ ✗                        | Goal      | Per-agent
Nayakanti et al. [16]     | Vector | ✗ ✗                        | GMM       | Per-agent
Shi et al. [23]           | Vector | ✗ ✗                        | GMM       | Per-agent
Itkina et al. [1]         | L-OGM  | ✗ ✗                        | —         | Scene
Lange et al. [3]          | L-OGM  | ✗ ✗                        | —         | Scene
Toyungyernsub et al. [24] | L-OGM  | ✗ ✗                        | —         | Scene
Mahjourian et al. [8]     | V-OGM  | ✗ ✗                        | —         | Scene
Mersch et al. [25]        | PCL    | ✗ ✗                        | —         | Scene
Wu et al. [26]            | PCL    | ✓ ✓ ✓                      | —         | Scene
LOPR (ours)               | L-OGM  | ✓ ✓ ✓                      | Variat.   | Scene
Representations and prediction types in common approaches.
Figure 1: (Left) Latent Occupancy PRediction (LOPR). We decouple the prediction task into task-independent representation learning and task-dependent prediction in the latent space. (Right) Comparison with other approaches in terms of representation type, sensors, stochasticity assumptions, and prediction type. Only LOPR makes stochastic predictions of the scene conditioned on all sensors without the need for manually labeled data.
Given these challenges, occupancy grid maps generated from LiDAR measurements (L-OGMs) have gained popularity as a form of scene representation for prediction. This popularity is due to their minimal data preprocessing requirements (eliminating the need for manual labeling), their ability to model the joint prediction of the scene with an arbitrary number of agents (including interactions between agents), and their robustness to partial observability and detection failures [13, 17]. In addition, the sole requirement for their deployment is a LiDAR sensor, simplifying transfer between different platforms. Our focus is on end-to-end ego-centric L-OGM prediction, where the grids are generated using uncertainty-aware occupancy state estimation approaches [18]. Due to its generality and ability to scale with unlabeled data, we hypothesize that such an L-OGM prediction framework could also serve as a pre-training objective, i.e., a foundation model, for supervised tasks such as trajectory prediction.
The task of OGM prediction is typically approached similarly to video prediction, by framing the problem as self-supervised sequence-to-sequence learning. In this approach, a scenario is dissected into a history sequence and a target prediction sequence. ConvLSTM-based architectures [19] have been used in previous work for this task due to their ability to handle the spatiotemporal representation of inputs and outputs [13, 20, 21]. These approaches are optimized end-to-end in grid cell space, do not account for the stochasticity present in the scene, and neglect other available sensor modalities, e.g., RGB cameras and high-definition (HD) maps. As a result, they suffer from blurry predictions, especially at longer time horizons. We propose a prediction framework that reasons over potential futures in the latent space of generative models. It is trained on sensor modalities such as L-OGMs, 2D RGB cameras, and maps without the need for manual labeling. We illustrate our framework in Fig. 1 and compare it with other methods.
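Returning to the sequence-to-sequence framing above, the history/target split amounts to a simple slice of the OGM sequence tensor. The sketch below is a generic illustration, not the paper's code; the tensor shapes and the five-frame history are hypothetical choices.

```python
import torch

def split_sequence(ogms: torch.Tensor, num_history: int = 5):
    """Split an OGM sequence into history (model input) and target (prediction) frames.

    ogms: tensor of shape (batch, time, height, width) with occupancy probabilities in [0, 1].
    Returns (history, target) with shapes (batch, num_history, H, W) and (batch, T - num_history, H, W).
    """
    history = ogms[:, :num_history]
    target = ogms[:, num_history:]
    return history, target

# Example: 20-frame ego-centric grids at 128x128 resolution (hypothetical sizes).
ogms = torch.rand(8, 20, 128, 128)
history, target = split_sequence(ogms, num_history=5)
```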
Recent work has shown that generative models can produce high-quality [27, 28] and controllable [29–31] samples. In robotics, generative models have been used to find compact representations of images in planning [32–34], control [35–37], and simulation [38]. We claim that generative models are similarly capable of accurately encoding and decoding L-OGMs, alongside providing a controllable latent space for high-quality predictions. We employ a generative model to learn a low-dimensional latent space, which encodes the features needed to generate realistic predictions and makes use of available input modalities, such as L-OGM, RGB camera, and map-based observations. We then train a stochastic prediction network in this latent space to capture the dynamics of the scene.
Existing object-based methods use a vectorized representation to predict trajectories [5, 6, 22] or vectorized OGMs (V-OGMs) [8], overlooking important perceptual cues in their predictions. Prior L-OGM-based works [1, 3, 25] do not use available sensor modalities and consider only deterministic predictions. Our framework addresses these weaknesses in the following contributions:
• We introduce a framework named Latent Occupancy PRediction (LOPR), which performs stochastic L-OGM prediction in the latent space of a generative architecture conditioned on other available sensor modalities, like RGB cameras and maps.
• Through experiments on NuScenes [39] and the Waymo Open Dataset [40], we show that LOPR outperforms SOTA OGM prediction methods qualitatively and quantitatively.
• We demonstrate that LOPR can be conveniently transferred between different robotic platforms by additionally evaluating our framework on a custom robotic dataset.
2 Related Work
OGM Prediction: The majority of prior work in OGM prediction generates OGMs from LiDAR measurements (L-OGMs) and uses an adaptation of the recurrent neural network (RNN) with convolutions. Dequaire et al. [41] applied a Deep Tracking approach [42] to track objects through occlusions and predict future binary OGMs with an RNN and a spatial transformer [43]. Schreiber et al. [20] provided dynamic occupancy grid maps (DOGMas) with cell-wise velocity estimates as input to a ConvLSTM [19] for environment prediction from a stationary platform. Schreiber et al. [21] extended this work to forecast DOGMas in a moving ego-vehicle setting. Mohajerin and Rohani [17] applied a difference learning approach to predict OGMs as seen from the coordinate frame of the first observed time step. Itkina et al. [1] used the PredNet ConvLSTM architecture [44] to achieve ego-centric OGM prediction. Lange et al. [3] reduced the blurring and the gradual disappearance of dynamic obstacles in the predicted grids by developing an attention-augmented ConvLSTM mechanism. Concurrently, Toyungyernsub et al. [2] addressed obstacle disappearance with a double-prong framework assuming knowledge of the static and dynamic obstacles. Predicted OGMs often lack agent identity information. Mahjourian et al. [8] addressed this by exploiting occupancy flow estimates to trace back the agent identities from the observed OGM frames. They used upstream perception detections to generate OGMs with vectorized representations (V-OGMs) and hence require manual labeling. Unlike prior work, we perform stochastic OGM predictions in the latent space of generative models of all available sensors without any manual labeling.
Representation Learning in Robotics: The objective of representation learning is to identify a low-dimensional and disentangled representation that makes it easier to achieve the desired performance on a task. Many robotics applications are inspired by the seminal papers on the autoencoder (AE) [45–47], the variational autoencoder (VAE) [29], and the generative adversarial network (GAN) [27]. Ha and Schmidhuber [32] proposed a World Model, where they used a VAE to compress observations and maximize the expected cumulative reward. Kim et al. [38] applied the World Model to neural network simulation for autonomous driving, where they merged the VAE [29] and StyleGAN [28] to increase the fidelity of the generated scenes. Similarly, latent spaces have been used in a plethora of other planning and control approaches to learn latent dynamics from pixels [36, 37, 48–50], generate fully imagined trajectories [51], model multi-agent interactions [33], learn competitive policies through self-play [34], imagine goals in goal-conditioned policies [52, 53], and perform meta- and offline reinforcement learning [54, 55]. Diffusion-based generative models [56–59] are increasingly gaining traction due to their high sampling quality. However, their computationally intensive sampling process poses a significant obstacle for real-time robotic applications in the context of perceptual generation. In video prediction tasks, variational and adversarial components have been incorporated into the architecture to capture data stochasticity [60] and improve the realism of the forecasted frames [61], respectively. Since then, large-scale architectures (over 300 million parameters) [62, 63] have been developed for general video prediction, which are not real-time capable on robot hardware. We present the first method, to our knowledge, that performs stochastic L-OGM prediction with transformers entirely in the latent space of a generative model while remaining real-time feasible and parameter efficient (less than 4 million parameters).
3 LOPR: Latent Occupancy PRediction
We propose the Latent Occupancy PRediction (LOPR) model, a framework designed to generate stochastic scene predictions in the form of OGMs. The model uses representations provided by sensor modalities, such as LiDAR-generated OGMs, RGB cameras, and maps. It does not require any manually labeled data and can be deployed on any robot equipped with at least a LiDAR sensor.
[Figure 2 diagram: (1) Representation learning (for each input modality) with an encoder and decoder; (2) stochastic sequence prediction with L-OGM, camera, and map encoders, tokenizers, a scene encoder, an autoregressive decoder, an L-OGM decoder, and an inference network used at train time only (past + future), with a sampling path used at test time.]
Figure 2: The illustration shows the LOPR framework, which consists of (1) representation learning and (2) stochastic sequence prediction. In the representation learning stage, we train an encoder and a decoder in an unsupervised manner. In the sequence prediction stage, we convert our OGM dataset to the low-dimensional representation and perform training entirely in the latent space of our pre-trained generative model.
A visualization of the framework is provided in Fig. 2. The model separates the prediction task into (1) learning the environment representation and (2) making predictions in the latent space of a generative model. In the representation learning phase, a VAE-GAN is trained to acquire a pre-trained latent space of rasterized sensor measurements. During the prediction stage, an autoregressive transformer [64] network is trained within the pre-trained latent space to predict future OGMs. It operates over patches of each latent vector to reduce the dimensionality of the prediction network and employs a series of auxiliary tasks incorporating various sequence masking strategies during training to further improve performance.
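To make the two-stage design concrete, the following sketch outlines how inference could proceed entirely in latent space. It is our own illustrative outline under assumed interfaces, not the paper's implementation: `encoder`, `predictor`, and `decoder` stand in for the pre-trained VAE-GAN encoder/decoder and the latent-space autoregressive transformer, and tokenization into latent patches is folded into `predictor` for brevity.

```python
import torch

@torch.no_grad()
def predict_future_ogms(encoder, predictor, decoder, obs_ogms, num_future: int):
    """Encode observed OGMs, roll the predictor forward in latent space, then decode.

    obs_ogms: (batch, P, C, H, W) observed occupancy grids.
    Returns predicted grids of shape (batch, num_future, C, H, W).
    """
    P = obs_ogms.shape[1]
    # Stage 1: compress each observed frame with the frozen, pre-trained encoder.
    latents = [encoder(obs_ogms[:, t]) for t in range(P)]        # each (batch, c, h, w)
    latents = torch.stack(latents, dim=1)                        # (batch, P, c, h, w)

    # Stage 2: autoregressively predict future latent frames.
    for _ in range(num_future):
        next_latent = predictor(latents)                         # (batch, c, h, w)
        latents = torch.cat([latents, next_latent.unsqueeze(1)], dim=1)

    # Decode only the predicted frames back to grid-cell space.
    future = latents[:, P:]
    return torch.stack([decoder(future[:, t]) for t in range(num_future)], dim=1)
```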
3.1 Representation Learning
In the first stage of training, we acquire a pre-trained latent space of all input modalities by training an encoder E and a decoder D. Given an input modality x ∈ R^{C×W×H}, the encoder outputs a low-dimensional latent vector z ∈ R^{c×h×w}, and the decoder maps the latent vector to a reconstruction x̂ ∈ R^{C×W×H}. The framework is trained using a combination of perceptual loss [65], Kullback-Leibler (KL) regularization [29], patch-based adversarial losses [66], and path regularization [28]:

\mathcal{L}_{\text{VAE-GAN}} = \min_{E,D} \max_{\psi} \left( \mathcal{L}_{\text{LPIPS}}(x, D(E(x))) - \gamma \mathcal{L}_{\text{adv}}(D(E(x))) + \beta \mathcal{L}_{\text{KL}}(x; E, D) + \mathcal{L}_{\text{reg}} \right). \quad (1)
We employ the adversarial loss to increase the visual fidelity of the generated samples, and the KL regularization to encourage the posterior q(z | x) to be clustered close to the prior p(z) = N(0, I).
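As a rough illustration of how the Eq. (1) objective can be optimized with alternating generator/discriminator updates, consider the minimal sketch below. It is our own example rather than the paper's implementation: the `encoder`, `decoder`, `discriminator`, and `perceptual_loss` callables are assumed interfaces, the loss weights are placeholders, the non-saturating logistic loss is one common realization of the adversarial min-max, and the path-length regularizer is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def vae_gan_losses(encoder, decoder, discriminator, perceptual_loss, x,
                   gamma: float = 0.1, beta: float = 1e-4):
    """Assemble generator- and discriminator-side terms of an Eq. (1)-style objective.

    encoder(x) is assumed to return the posterior mean and log-variance of z;
    perceptual_loss(a, b) is assumed to return a per-sample LPIPS distance.
    """
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)      # reparameterization trick
    x_hat = decoder(z)

    # Reconstruction (LPIPS) and KL terms, minimized w.r.t. the encoder/decoder.
    rec = perceptual_loss(x, x_hat).mean()
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())

    # Patch-based adversarial term: the encoder/decoder try to make the
    # discriminator score reconstructions as real.
    g_adv = F.softplus(-discriminator(x_hat)).mean()
    generator_loss = rec + beta * kl + gamma * g_adv

    # The discriminator (psi) is updated separately to distinguish real grids
    # from reconstructions.
    discriminator_loss = (F.softplus(-discriminator(x)).mean()
                          + F.softplus(discriminator(x_hat.detach())).mean())
    return generator_loss, discriminator_loss
```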
3.2 Stochastic Sequence Prediction
Given the pre-trained latent space of our sensor data, we train a stochastic sequence prediction network that receives a history of observations and outputs a distribution over potential future scenarios p_θ(z_{P:T} | z_{0:P}), where z_{0:P} represents the compressed observations over P timesteps, T denotes the total sequence length, and θ are the network weights. To simplify notation, we allow a slight abuse of notation: when z_t corresponds to an observation, it includes all sensor modalities; when it refers to the future, it includes only the OGM representation. The environment prediction task is inherently multimodal, and the latent vectors contributing to this stochasticity are unobservable. Drawing from prior work on video prediction [60], we introduce a latent vector z_stoch ∼ p(z_stoch) to encapsulate this stochasticity and extend our model to p_θ(z_{P:T} | z_{0:P}, z_stoch). During training, we infer z_stoch with an inference network, z_stoch ∼ q_φ(z_stoch | z_{0:T}), which approximates the true posterior p(z_stoch | z_{0:T}); at test time, we sample z_stoch from a pre-defined prior p(z_stoch).
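The train/test asymmetry in how z_stoch is obtained can be sketched as follows. This is our illustration in the spirit of the learned-posterior video prediction setup referenced above [60]; the `inference_net` interface and `z_dim` are assumptions, and the KL term shown is the regularizer typically paired with such a variational posterior.

```python
import torch

def sample_z_stoch(inference_net, latent_seq, z_dim: int, train: bool):
    """Sample the stochastic latent z_stoch.

    At train time it is drawn from the approximate posterior q_phi(z_stoch | z_{0:T})
    produced by the inference network; at test time it is drawn from the prior N(0, I).
    latent_seq: (batch, T, ...) sequence of per-frame latents.
    """
    if train:
        mu, logvar = inference_net(latent_seq)                         # (batch, z_dim) each
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)        # reparameterized sample
        kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())  # KL(q || N(0, I))
        return z, kl
    z = torch.randn(latent_seq.shape[0], z_dim, device=latent_seq.device)
    return z, torch.zeros((), device=latent_seq.device)
```

At test time, drawing several samples of z_stoch and decoding each predicted latent sequence yields the multiple scene hypotheses illustrated in Fig. 1.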