motion from other actors, which is arguably more relevant for motion planning, with ego-motion. In this paper, we propose forecasting emergent occupancy to isolate the motion of other actors. While we focus on self-supervised learning at scale, we acknowledge that for motion planning, some semantic labelling is required (e.g., the state of a traffic light), which can be incorporated via semi-supervised learning.
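To make the distinction concrete, the following is a minimal sketch (in Python/NumPy; the function and variable names are illustrative, and ego poses are assumed to be available from localization) of factoring out ego-motion by re-expressing a past LiDAR sweep in the current ego frame. After this alignment, static structure coincides across sweeps, so residual changes in occupancy can be attributed to other actors.

import numpy as np

def to_current_frame(sweep_xyz, T_world_from_sweep, T_world_from_current):
    """Re-express a past LiDAR sweep in the current ego frame.

    sweep_xyz:            (N, 3) points in the past sensor frame
    T_world_from_sweep:   (4, 4) ego pose when the sweep was captured
    T_world_from_current: (4, 4) ego pose at the current timestep
    """
    # Homogeneous coordinates for the rigid transform.
    pts = np.hstack([sweep_xyz, np.ones((len(sweep_xyz), 1))])
    # Past sensor frame -> world -> current ego frame.
    T_current_from_sweep = np.linalg.inv(T_world_from_current) @ T_world_from_sweep
    return (pts @ T_current_from_sweep.T)[:, :3]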
Differentiable raycasting: Differentiable raycasting has shown great promise
in learning the underlying scene structure given samples of observations for
downstream novel view synthesis [15], pose estimation [31], etc. In contrast, our
application is best described as “space-time scene completion”, where we learn a
network to predict an explicit space-time occupancy volume. Furthermore, our
approach differs from prior work in several ways: we take LiDAR sequences as input and raycast LiDAR sweeps given future occupancy and sensor pose, and we work with explicit volumetric representations [13] for dynamic scenes using a feed-forward network rather than test-time optimization [19].
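For concreteness, below is a simplified sketch (in PyTorch; the names and the expected-depth formulation follow the generic volumetric-rendering recipe rather than any particular implementation) of rendering per-ray depths differentiably from an occupancy volume. Comparing such rendered depths against the ranges of an observed LiDAR sweep yields a loss that backpropagates into the predicted occupancy.

import torch
import torch.nn.functional as F

def expected_depth(occ_logits, ray_origins, ray_dirs, depths):
    """Differentiably render expected per-ray depths from an occupancy volume.

    All spatial quantities are in normalized grid coordinates in [-1, 1].
    occ_logits:  (D, H, W) unnormalized occupancy volume
    ray_origins: (R, 3) ray origins (sensor pose)
    ray_dirs:    (R, 3) unit ray directions
    depths:      (S,) sample distances along each ray
    """
    # Sample points along every ray: (R, S, 3).
    pts = ray_origins[:, None, :] + depths[None, :, None] * ray_dirs[:, None, :]
    # Trilinear lookup of occupancy probability at the sampled points.
    grid = pts.view(1, -1, 1, 1, 3)                   # (1, R*S, 1, 1, 3)
    vol = occ_logits[None, None]                      # (1, 1, D, H, W)
    occ = torch.sigmoid(F.grid_sample(vol, grid, align_corners=True))
    occ = occ.view(len(ray_dirs), len(depths))        # (R, S)
    # Probability that a ray is still free before sample s (front-to-back).
    free = torch.cumprod(1.0 - occ + 1e-7, dim=-1)
    free = torch.cat([torch.ones_like(free[:, :1]), free[:, :-1]], dim=-1)
    weights = free * occ                              # per-sample termination probability
    return (weights * depths[None, :]).sum(-1)        # (R,) expected depth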
Self-supervision: Standard P&P solutions do not scale given how fast log
data is collected by large fleets and how slow it is to curate object track labels. To
enable learning on massive amounts of unlabeled logs, supervision from simulation [8,5,6,7], auto-labeling using multi-view constraints [21], and self-supervision have been proposed. Notably, tasks that can be naturally self-supervised by LiDAR sweeps, e.g., scene flow [16], have the potential to generalize better as they
can leverage more data. More recently, LiDAR self-supervision has been explored
in the context of point cloud forecasting [28,29,30]. However, when predicting
future sweeps given the history, as stated before, past approaches tend to couple the motion of the world with that of the ego-vehicle [28].
Motion Planning: An understanding of what is around an AV and what will
happen next [26] is crucial. This is typically done in bird's-eye-view (BEV) space by building a modular P&P pipeline. Although BEV motion planning does not precisely reflect planning in the 3D world, it is widely used as a high-resolution yet computation- and memory-efficient representation [32,24,3]. However, training such modules often requires a massive amount of data. End-to-end
learned planners requiring less human annotation have emerged, with end-to-end
imitation learning (IL) methods showing particular promise [6,23,5]. Such methods typically learn a neural network that maps sensor data either to actions (known as behavior cloning) or to an “action-ready” cost function (known as inverse optimal control) [18]. However, they are often criticized for lacking explainable intermediate representations, which makes them less accountable in safety-critical applications [20]. More recently, end-to-end learned but modular methods that produce explainable representations, e.g., neural motion planners [32,24,3], have been proposed. However, these still require costly object track labels. Unlike them, our approach learns intermediate representations that are explainable quantities for safety-critical motion planning without the need for track labels.
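To illustrate how such an “action-ready” cost function is consumed at planning time, here is a toy sketch (in Python/NumPy; all names are hypothetical and the discretization is deliberately naive): candidate trajectories are scored against a BEV cost map, e.g., a predicted occupancy grid, and the cheapest one is selected.

import numpy as np

def pick_trajectory(cost_map, trajectories, resolution, origin):
    """Score candidate BEV trajectories against a cost map and pick the cheapest.

    cost_map:     (H, W) per-cell cost in bird's-eye view (e.g., predicted occupancy)
    trajectories: (K, T, 2) candidate (x, y) waypoints in metric coordinates
    resolution:   meters per BEV cell
    origin:       (2,) metric (x, y) coordinates of cell (0, 0)
    """
    # Convert metric waypoints to integer grid indices, clipped to the map bounds.
    idx = np.round((trajectories - origin) / resolution).astype(int)
    idx = np.clip(idx, 0, np.array(cost_map.shape)[::-1] - 1)
    # Accumulate the cost of the cells each trajectory visits (row = y, col = x).
    costs = cost_map[idx[..., 1], idx[..., 0]].sum(axis=1)
    return trajectories[np.argmin(costs)]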