
[Figure 1, left: architecture diagram. Labels: Observations, Tokenizer, Transformer Encoder, Latent Space, Causal Transformer, Hypothesis 1 ... Hypothesis K.]
Table (Figure 1, right): Representations and prediction types in common approaches.

Method                    | Rep.   | Maps | Cam. | Partial obs. | Stochast. | Prediction type
--------------------------|--------|------|------|--------------|-----------|----------------
Chai et al. [6]           | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Ivanovic et al. [5]       | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Gu et al. [22]            | Vector | ✓    | ✗    | ✗            | Goal      | Per-agent
Nayakanti et al. [16]     | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Shi et al. [23]           | Vector | ✓    | ✗    | ✗            | GMM       | Per-agent
Itkina et al. [1]         | L-OGM  | ✗    | ✗    | ✓            | ✗         | Scene
Lange et al. [3]          | L-OGM  | ✗    | ✗    | ✓            | ✗         | Scene
Toyungyernsub et al. [24] | L-OGM  | ✗    | ✗    | ✓            | ✗         | Scene
Mahjourian et al. [8]     | V-OGM  | ✓    | ✗    | ✗            | ✗         | Scene
Mersch et al. [25]        | PCL    | ✗    | ✗    | ✓            | ✗         | Scene
Wu et al. [26]            | PCL    | ✓    | ✓    | ✓            | ✗         | Scene
LOPR (ours)               | L-OGM  | ✓    | ✓    | ✓            | Variat.   | Scene
Figure 1: (Left) Latent Occupancy PRediction (LOPR). We decouple the prediction task into task-independent representation learning and task-dependent prediction in the latent space. (Right) Comparison with other approaches in terms of representation type, sensors, stochasticity assumptions, and prediction type. Only LOPR makes stochastic predictions of the scene conditioned on all sensors without the need for manually labeled data.
Given these challenges, occupancy grid maps generated from LiDAR measurements (L-OGMs) have gained popularity as a scene representation for prediction. This popularity is due to their minimal data preprocessing requirements, which eliminate the need for manual labeling, their ability to model the joint prediction of a scene with an arbitrary number of agents (including interactions between agents), and their robustness to partial observability and detection failures [1–3, 17]. In addition, the sole requirement for their deployment is a LiDAR sensor, simplifying transfer between different platforms. We focus on end-to-end prediction of ego-centric L-OGMs generated with uncertainty-aware occupancy state estimation approaches [18]. Due to its generality and ability to scale with unlabeled data, we hypothesize that such an L-OGM prediction framework could also serve as a pre-training objective, i.e., a foundation model, for supervised tasks such as trajectory prediction.
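To make the input representation concrete, the sketch below rasterizes a single ego-centric LiDAR scan into a naive binary occupancy grid. It is a minimal illustration with assumed grid parameters; the L-OGMs used in this work come from an uncertainty-aware occupancy state estimator [18], not from raw hit counts.

    import numpy as np

    def lidar_to_ogm(points_xy, grid_size=128, extent_m=40.0):
        """Rasterize ego-centric LiDAR returns into a naive occupancy grid.

        points_xy: (N, 2) array of LiDAR hits in the ego frame [m].
        Returns a (grid_size, grid_size) grid with values in {0, 1}.
        Grid size and metric extent are illustrative assumptions.
        """
        res = 2.0 * extent_m / grid_size                    # meters per cell
        cells = ((points_xy + extent_m) / res).astype(int)  # ego at grid center
        inside = np.all((cells >= 0) & (cells < grid_size), axis=1)
        grid = np.zeros((grid_size, grid_size), dtype=np.float32)
        grid[cells[inside, 1], cells[inside, 0]] = 1.0      # mark occupied cells
        return grid

    # Example: rasterize a synthetic scan of 1000 random returns.
    ogm = lidar_to_ogm(np.random.uniform(-40.0, 40.0, size=(1000, 2)))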
The task of OGM prediction is typically approached similarly to video prediction, by framing the problem as self-supervised sequence-to-sequence learning. In this approach, a scenario is divided into a history sequence and a target prediction sequence. ConvLSTM-based architectures [19] have been used in previous work for this task due to their ability to handle the spatiotemporal representation of inputs and outputs [1–3, 20, 21]. These approaches are optimized end-to-end in grid cell space, do not account for the stochasticity present in the scene, and neglect other available sensor modalities, e.g., RGB cameras and high definition (HD) maps. As a result, they suffer from blurry predictions, especially at longer time horizons. We propose a prediction framework that reasons over potential futures in the latent space of generative models. It is trained on sensor modalities such as L-OGMs, 2D RGB cameras, and maps without the need for manual labeling. We illustrate our framework in Fig. 1 and compare it with other methods.
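As a sketch of this conventional setup, the snippet below splits a scenario into history and target sequences and computes the per-cell objective typically used to train ConvLSTM-style predictors end-to-end in grid cell space. The sequence lengths and the stand-in forecast are assumptions for illustration; averaging a per-cell loss over all possible futures is one reason such predictions blur at longer horizons.

    import torch
    import torch.nn.functional as F

    def split_scenario(ogm_seq, history_len=5):
        """Split a (T, C, H, W) OGM scenario into conditioning frames and
        the frames the model must predict. Lengths are illustrative."""
        return ogm_seq[:history_len], ogm_seq[history_len:]

    def grid_cell_loss(pred_logits, target):
        """Per-cell binary cross-entropy: the typical grid-space objective
        for deterministic OGM predictors."""
        return F.binary_cross_entropy_with_logits(pred_logits, target)

    # Example with a synthetic 15-frame scenario of 128x128 grids.
    scenario = (torch.rand(15, 1, 128, 128) > 0.5).float()
    history, target = split_scenario(scenario)
    pred_logits = torch.zeros_like(target)  # stand-in for a ConvLSTM forecast
    loss = grid_cell_loss(pred_logits, target)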
Recent work has shown generative models can produce high-quality [27, 28] and controllable [29–31] samples. In robotics, generative models have been used to find compact representations of images in planning [32–34], control [35–37], and simulation [38]. We claim that generative models are
similarly capable of accurately encoding and decoding L-OGMs, alongside providing a controllable
latent space for high-quality predictions. We employ a generative model to learn a low-dimensional
latent space, which encodes the features needed to generate realistic predictions and makes use of
available input modalities, such as L-OGM, RGB camera, and map-based observations. We then
train a stochastic prediction network in this latent space to capture the dynamics of the scene.
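A minimal sketch of this decoupled design follows: observed frames are first encoded to latents by a pretrained generative model (not shown), and a stochastic sequence model predicts a distribution over the next latent, from which multiple futures can be sampled and decoded. The module names, sizes, and GRU backbone are assumptions for illustration, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class StochasticLatentPredictor(nn.Module):
        """Predict a distribution over the next latent given observed latents.
        A schematic stand-in for the prediction network; sizes are assumed."""
        def __init__(self, latent_dim=256, hidden_dim=512):
            super().__init__()
            self.rnn = nn.GRU(latent_dim, hidden_dim, batch_first=True)
            self.to_mu = nn.Linear(hidden_dim, latent_dim)
            self.to_logvar = nn.Linear(hidden_dim, latent_dim)

        def forward(self, z_history):
            # z_history: (B, T, latent_dim) latents of the observed frames.
            h, _ = self.rnn(z_history)
            mu = self.to_mu(h[:, -1])
            logvar = self.to_logvar(h[:, -1])
            # Reparameterized sample: one stochastic hypothesis per call.
            z_next = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return z_next, mu, logvar

    # Sampling the predictor repeatedly yields multiple future hypotheses,
    # each decoded back to L-OGMs by the frozen generative decoder (not shown).
    predictor = StochasticLatentPredictor()
    z_hist = torch.randn(2, 5, 256)  # batch of 2 scenes, 5 observed frames
    z_next, mu, logvar = predictor(z_hist)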
Existing object-based methods use a vectorized representation to predict trajectories [5, 6, 22] or vectorized OGMs (V-OGMs) [8], overlooking important perceptual cues in their predictions. Prior L-OGM-based works [1, 3, 25] do not use the available sensor modalities and consider only deterministic predictions. Our framework addresses these weaknesses through the following contributions:
• We introduce a framework named Latent Occupancy PRediction (LOPR), which performs
stochastic L-OGM prediction in the latent space of a generative architecture conditioned
on other available sensor modalities, like RGB cameras and maps.