Differentiable Raycasting for Self-supervised
Occupancy Forecasting
Tarasha Khurana1⋆, Peiyun Hu2⋆, Achal Dave3, Jason Ziglar2, David Held1,2,
and Deva Ramanan1,2
1Carnegie Mellon University
2Argo AI
3Amazon
(a) Sensor position y, Scene s. (b) New sensor position y + ∆y, Scene s. (c) New sensor position y + ∆y, New scene s + ∆s.
Fig. 1. We propose emergent occupancy as a novel self-supervised representation for
motion planning. Occupancy is independent of changes in sensor pose ∆y, which is in
contrast to prior work on self-supervised learning from LiDAR [29,30,16,10], specifically,
ego-centric freespace [10], which changes with (a-b) sensor pose motion ∆y and (b-
c) scene motion ∆s. We use differentiable raycasting to naturally decouple ego motion
from scene motion, allowing us to learn to forecast occupancy by self-supervision from
pose-aligned LiDAR sweeps.
Abstract. Motion planning for safe autonomous driving requires learn-
ing how the environment around an ego-vehicle evolves with time. Ego-
centric perception of driveable regions in a scene not only changes with
the motion of actors in the environment, but also with the movement of
the ego-vehicle itself. Self-supervised representations proposed for large-
scale planning, such as ego-centric freespace, confound these two motions,
making the representation difficult to use for downstream motion plan-
ners. In this paper, we use geometric occupancy as a natural alternative
to view-dependent representations such as freespace. Occupancy maps
naturally disentangle the motion of the environment from the motion of
the ego-vehicle. However, one cannot directly observe the full 3D occu-
pancy of a scene (due to occlusion), making it difficult to use as a signal
for learning. Our key insight is to use differentiable raycasting to “render”
future occupancy predictions into future LiDAR sweep predictions, which
can be compared with ground-truth sweeps for self-supervised learning.
The use of differentiable raycasting allows occupancy to emerge as an
internal representation within the forecasting network. In the absence
of ground-truth occupancy, we quantitatively evaluate the forecasting of
raycasted LiDAR sweeps and show improvements of up to 15 F1 points.
⋆ Equal contribution
arXiv:2210.01917v2 [cs.CV] 18 Oct 2022
For downstream motion planners, where emergent occupancy can directly
identify non-driveable regions, this representation reduces the number of
collisions with objects by up to 17% relative to freespace-centric motion
planners.
1 Introduction
To navigate in complex and dynamic environments such as urban cores, au-
tonomous vehicles need to perceive actors and predict their future movements.
Such knowledge is often represented in some form of forecasted occupancy [24],
which downstream motion planners rely on to produce safe trajectories. When
tackling the tasks of perception and prediction, standard solutions consist of per-
ceptual modules such as object detection, tracking, and trajectory forecasting,
which require a massive amount of object track labels. Such solutions do not
scale given the speed at which log data is collected by large fleets.
Freespace versus occupancy: To avoid the need for costly human anno-
tations, and to enable learning at scale, self-supervised representations such as
ego-centric freespace [10] have been proposed. However, such a representation
couples the motion of the world with the motion of the ego-vehicle (Fig. 1). Our
key innovation in this paper is to learn an ego-pose independent and explain-
able representation for safe motion planning, which we call emergent occupancy.
Emergent occupancy decouples ego motion and scene motion using differentiable
raycasting: we design a network that learns to “space-time complete” the future
volumetric state of the world (in a world-coordinate frame) given past LiDAR
observations. Consider an ego-vehicle that moves in a static scene. Here, LiDAR
returns (even when aligned to a world-coordinate frame) will still swim along
the surfaces of the fixed scene (Fig. 2). This implies that even when the world
is static, most of what the ego-vehicle observes through the LiDAR sensor ap-
pears to move with complex nonlinear motion, but in fact those observations
can be fully explained by static geometry and ego-motion (via raycasting). Li-
DAR forecasters need to implicitly predict this ego-motion of the car to produce
accurate future returns. However, we argue that such prediction does not make
sense for autonomous agents that plan their future motion. Importantly, our dif-
ferentiable raycasting network has access to future camera ego-poses as input,
both during training (since they are available in archival logs) and testing (since
state-of-the-art planners explicitly search over candidate trajectories).
Self-supervision: Note that ground-truth future volumetric occupancy is
largely unavailable without human supervision, because the full 3D world is
rarely observed; the ego-vehicle only sees a limited number of future views as
recorded in a single archival log. To this end, we apply a differentiable raycaster
that projects the forecasted volumetric occupancy into a LiDAR sweep, as seen
by the future ego-vehicle motion in the log. We then use the difference between
the raycasted sweep and actual sweep as a signal for self-supervised learning,
allowing us to train models on massive amounts of unannotated logs.
Planning: Lastly, we show that such forecasted space-time occupancy can
be jointly learned with space-time costmaps for end-to-end motion planning.
y (left), y + ∆y (right)
Fig. 2. We pose-align two successive LiDAR sweeps of a static scene s to a common
world coordinate frame (using the notation of Fig. 1). Even though there is zero scene
motion ∆s, points appear to drift or swim across surfaces. This is due to the fact that
points are obtained by intersecting rays from a moving sensor ∆y with static scene
geometry. This in turn implies that points can appear to move since they are not tied
to physical locations on a surface. This apparent movement (s̃) is in general a complex
nonlinear transformation, even when the sensor motion ∆y is a simple translation (as
shown above). Traditional methods for self-supervised LiDAR forecasting [29,30,16,10]
require predicting the complex transformation s̃, which depends on the unknown ∆y,
while our differentiable-raycasting framework assumes ∆y is an input, dramatically
simplifying the task of the forecasting network. From a planning perspective, we argue
that the future (planned) change-in-pose should be an input rather than an output.
Owing to LiDAR self-supervision, we are able to train on recent unsupervised
LiDAR datasets [14] that are orders of magnitude larger than their annotated
counterparts, resulting in significant improvement in accuracy for both forecasted
occupancy and motion plans. Interestingly, as we increase the amount of archival
training data at the cost of zero additional human annotation, object shapes,
tracks, and multiple futures “emerge” in the occupancy predicted by
our model, despite there being no direct supervision on ground-truth occupancy.
2 Related Work
Occupancy as a scene representation: Knowledge regarding what is around
an autonomous vehicle (AV) and what will happen next is captured in differ-
ent representations throughout the standard modular perception and predic-
tion (P&P) pipeline [12,27,4,25]. Instead of separate optimization of these mod-
ules [26,17], Sadat et al. [24] propose bird’s-eye view (BEV) semantic occupancy
that is end-to-end optimizable. As an alternative to semantic occupancy, Hu et
al. [11] propose BEV ego-centric freespace that can be self-supervised by ray-
casting on aligned LiDAR sweeps. However, the ego-centric freespace entangles
motion from other actors, which is arguably more relevant for motion planning,
with ego-motion. In this paper, we propose emergent occupancy to isolate motion
of other actors. While we focus on self-supervised learning at scale, we acknowl-
edge that for motion planning, some semantic labelling is required (e.g., state of
a traffic light) which can be incorporated via semi-supervised learning.
Differentiable raycasting: Differentiable raycasting has shown great promise
in learning the underlying scene structure given samples of observations for
downstream novel view synthesis [15], pose estimation [31], etc. In contrast, our
application is best described as “space-time scene completion”, where we learn a
network to predict an explicit space-time occupancy volume. Furthermore, our
approach differs from existing approaches in the following ways. We use LiDAR
sequences as input and raycast LiDAR sweeps given future occupancy and sensor
pose. We work with explicit volumetric representations [13] for dynamic scenes
with a feed-forward network instead of test-time optimization [19].
Self-supervision: Standard P&P solutions do not scale given how fast log
data is collected by large fleets and how slow it is to curate object track labels. To
enable learning on massive amounts of unlabeled logs, supervision from simula-
tion [8,5,6,7], auto labeling using multi-view constraints [21], and self-supervision
have been proposed. Notably, tasks that can be naturally self-supervised by Li-
DAR sweeps, e.g., scene flow [16], have the potential to generalize better as they
can leverage more data. More recently, LiDAR self-supervision has been explored
in the context of point cloud forecasting [28,29,30]. However, when predicting
future sweeps given the history, as stated before, past approaches often tend to
couple motion of the world with the motion of the ego-vehicle [28].
Motion Planning: An understanding of what is around an AV and what will
happen next [26] is crucial. This is typically done in the bird’s eye-view (BEV)
space by building a modular P&P pipeline. Although BEV motion planning does
not precisely reflect planning in the 3D world, it is widely used as the highest-
resolution and computation- and memory-efficient representation [32,24,3]. How-
ever, training such modules often requires a massive amount of data. End-to-end
learned planners requiring less human annotation have emerged, with end-to-end
imitation learning (IL) methods showing particular promise [6,23,5]. Such meth-
ods often learn a neural network to map sensor data to either action (known
as behavior cloning) or “action-ready” cost function (known as inverse optimal
control) [18]. However, they are often criticized for lack of explainable inter-
mediate representations, making them less accountable for safety-critical appli-
cations [20]. More recently, end-to-end learned but modular methods producing
explainable representations, e.g., neural motion planners [32,24,3] have been pro-
posed. However, these still require costly object track labels. Unlike them, our
approach learns explainable intermediate representations for safety-critical
motion planning without the need for track labels.
3 Method
Autonomous fleets provide an abundance of aligned sequences of LiDAR sweeps
x and ego-vehicle trajectories y. How can we make use of such data to improve
perception, prediction, and planning? In the sections to follow, we first define
occupancy. Then we describe a self-supervised approach to predicting future
occupancy. Finally, we describe an approach for integrating this forecasted oc-
cupancy into neural motion planners. Note that in the text that follows, we use
ego-centric freespace and freespace interchangeably.
3.1 Occupancy
We define occupancy as the state of occupied space at a particular time instance.
We use z to denote the true occupancy, which may not be directly observable
due to visibility constraints. Let us write

z[u] ∈ {0, 1},  u = (x, y, t),  u ∈ U,    (1)

to denote the occupancy of a voxel u in the space-time voxel grid U, which
can be occupied (1) or free (0). The spatial index of u, i.e., (x, y), represents the
spatial location from a bird’s-eye view. Given a sequence of aligned sensor data
and ego-vehicle trajectory (x, y), there may be multiple plausible occupancy
states z that “explain” the sensor measurements. We denote this set of plausible
occupancy states as Z.
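To make the notation concrete, here is a toy NumPy sketch of such a space-time grid; the sizes and the obstacle placement are hypothetical (real grids cover a large bird’s-eye-view area over many timesteps):

```python
import numpy as np

# Toy instance of the space-time occupancy grid z of Eq. 1.
# Index u = (x, y, t): (x, y) is the BEV cell, t the timestep.
X, Y, T = 4, 4, 2
z = np.zeros((X, Y, T), dtype=np.uint8)   # all voxels free

z[1, 2, 0] = 1   # an obstacle occupies cell (1, 2) at t = 0
z[2, 2, 1] = 1   # ...and has moved one cell over by t = 1
```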
Forecasting Occupancy. Suppose we split an aligned sequence of LiDAR
sweeps and ego-vehicle trajectory (x, y) into a historic pair (x1, y1) and a future
pair (x2, y2). Our goal is to learn a function f that takes historical observations
(x1, y1) as input and predicts emergent future occupancy ẑ2. Formally,

ẑ2 = f(x1, y1).    (2)

If the true occupancy z2 were observable, we could directly supervise our fore-
caster f. Unfortunately, in practice, we only observe LiDAR sweeps x. We show
in the next section how to supervise f with LiDAR sweeps using differentiable
raycasting techniques.
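The history/future split above can be sketched as follows; the sweep and pose placeholders and the history length of three frames are illustrative assumptions, not the paper’s settings:

```python
# Toy sketch of splitting one pose-aligned log into history and future.
# Strings stand in for aligned LiDAR sweeps x and ego poses y.
x = ["sweep_0", "sweep_1", "sweep_2", "sweep_3", "sweep_4"]
y = ["pose_0", "pose_1", "pose_2", "pose_3", "pose_4"]

n_history = 3
x1, y1 = x[:n_history], y[:n_history]   # input to the forecaster f (Eq. 2)
x2, y2 = x[n_history:], y[n_history:]   # held-out future; x2 supervises f
```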
3.2 Raycasting
Given an occupancy estimate ẑ, sensor origin y, and directional unit vectors for
rays r, a differentiable raycaster R can raycast LiDAR sweeps x̂. We use d̂ to
represent the expected distance these rays travel before hitting obstacles:
d̂ = R(r; ẑ, y). Then we can reconstruct the raycast LiDAR sweep x̂ as x̂ = y + d̂ r.
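One common way to make R differentiable is to sample occupancy probabilities along each ray and compute the expected termination distance. The NumPy sketch below illustrates this general technique, not necessarily the paper’s exact raycaster; in practice the same math would be written in an autodiff framework (e.g., PyTorch) so that gradients flow back to ẑ:

```python
import numpy as np

def expected_ray_depth(occ_probs, t_samples):
    """Expected distance a ray travels before hitting an obstacle.

    occ_probs:  (N,) occupancy probabilities sampled along the ray
                from ẑ, ordered nearest-to-farthest, each in [0, 1].
    t_samples:  (N,) distances of those samples from the sensor origin y.

    The ray terminates at sample i with probability
        w_i = p_i * prod_{j<i} (1 - p_j),
    i.e. it must pass every closer sample and then hit sample i.
    The expected depth is sum_i w_i * t_i, plus a far-plane term
    for the leftover probability that the ray hits nothing.
    """
    p = np.asarray(occ_probs, dtype=float)
    t = np.asarray(t_samples, dtype=float)
    # probability of reaching each sample unobstructed (transmittance)
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - p)[:-1]])
    w = transmittance * p                  # per-sample termination weights
    far = t[-1] * (1.0 - w.sum())          # "no hit" mass -> farthest sample
    return float((w * t).sum() + far)
```

For example, a ray whose samples at depths 1, 2, 3 have occupancy 0, 1, 0 returns an expected depth of exactly 2, since the ray is certain to terminate at the second sample.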
3.3 Learning to Forecast Occupancy
Given the predicted occupancy ẑ2 (Eq. 2) and the captured sensor pose y2, a dif-
ferentiable raycaster R can take rays r2 as input and produce d̂2 = R(r2; ẑ2, y2).
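The difference between the raycast depths d̂2 and the depths of the observed future sweep x2 along the same rays r2 then serves as the self-supervised signal. An L1 penalty per ray, sketched below, is one plausible choice; the paper’s exact loss may differ:

```python
import numpy as np

def sweep_l1_loss(d_hat, d_gt):
    """Self-supervised loss comparing raycast expected depths d_hat
    (from the predicted occupancy via R) against the depths of the
    ground-truth future sweep along the same rays."""
    d_hat = np.asarray(d_hat, dtype=float)
    d_gt = np.asarray(d_gt, dtype=float)
    return float(np.mean(np.abs(d_hat - d_gt)))
```

Because R is differentiable, minimizing this loss by gradient descent pushes the forecaster’s occupancy ẑ2 toward states that explain the observed returns.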