motion from other actors, which is arguably more relevant for motion planning, with ego-motion. In this paper, we propose forecasting emergent occupancy to isolate the motion of other actors. While we focus on self-supervised learning at scale, we acknowledge that for motion planning, some semantic labelling is required (e.g., the state of a traffic light), which can be incorporated via semi-supervised learning.
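To make the distinction concrete, the following is a minimal sketch (in Python/NumPy; the function and variable names are illustrative, and ego poses are assumed to be available from localization) of factoring out ego-motion by re-expressing a past LiDAR sweep in the current ego frame. After this alignment, static structure coincides across sweeps, so residual changes in occupancy can be attributed to other actors.

import numpy as np

def to_current_frame(sweep_xyz, T_world_from_sweep, T_world_from_current):
    """Re-express a past LiDAR sweep in the current ego frame.

    sweep_xyz:            (N, 3) points in the past sensor frame
    T_world_from_sweep:   (4, 4) ego pose when the sweep was captured
    T_world_from_current: (4, 4) ego pose at the current timestep
    """
    # Homogeneous coordinates for the rigid transform.
    pts = np.hstack([sweep_xyz, np.ones((len(sweep_xyz), 1))])
    # Past sensor frame -> world -> current ego frame.
    T_current_from_sweep = np.linalg.inv(T_world_from_current) @ T_world_from_sweep
    return (pts @ T_current_from_sweep.T)[:, :3]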
Differentiable raycasting: Differentiable raycasting has shown great promise
in learning the underlying scene structure given samples of observations for
downstream novel view synthesis [15], pose estimation [31], etc. In contrast, our
application is best described as “space-time scene completion”, where we learn a
network to predict an explicit space-time occupancy volume. Furthermore, our
approach differs from prior work in several ways: we take LiDAR sequences as input and raycast LiDAR sweeps given future occupancy and sensor pose, and we work with explicit volumetric representations [13] for dynamic scenes using a feed-forward network rather than test-time optimization [19].
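For concreteness, below is a simplified sketch (in PyTorch; the names and the expected-depth formulation follow the generic volumetric-rendering recipe rather than any particular implementation) of rendering per-ray depths differentiably from an occupancy volume. Comparing such rendered depths against the ranges of an observed LiDAR sweep yields a loss that backpropagates into the predicted occupancy.

import torch
import torch.nn.functional as F

def expected_depth(occ_logits, ray_origins, ray_dirs, depths):
    """Differentiably render expected per-ray depths from an occupancy volume.

    All spatial quantities are in normalized grid coordinates in [-1, 1].
    occ_logits:  (D, H, W) unnormalized occupancy volume
    ray_origins: (R, 3) ray origins (sensor pose)
    ray_dirs:    (R, 3) unit ray directions
    depths:      (S,) sample distances along each ray
    """
    # Sample points along every ray: (R, S, 3).
    pts = ray_origins[:, None, :] + depths[None, :, None] * ray_dirs[:, None, :]
    # Trilinear lookup of occupancy probability at the sampled points.
    grid = pts.view(1, -1, 1, 1, 3)                   # (1, R*S, 1, 1, 3)
    vol = occ_logits[None, None]                      # (1, 1, D, H, W)
    occ = torch.sigmoid(F.grid_sample(vol, grid, align_corners=True))
    occ = occ.view(len(ray_dirs), len(depths))        # (R, S)
    # Probability that a ray is still free before sample s (front-to-back).
    free = torch.cumprod(1.0 - occ + 1e-7, dim=-1)
    free = torch.cat([torch.ones_like(free[:, :1]), free[:, :-1]], dim=-1)
    weights = free * occ                              # per-sample termination probability
    return (weights * depths[None, :]).sum(-1)        # (R,) expected depth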
Self-supervision: Standard P&P solutions do not scale given how fast log
data is collected by large fleets and how slow it is to curate object track labels. To
enable learning on massive amounts of unlabeled logs, supervision from simulation [8,5,6,7], auto-labeling using multi-view constraints [21], and self-supervision have been proposed. Notably, tasks that can be naturally self-supervised by LiDAR sweeps, e.g., scene flow [16], have the potential to generalize better as they
can leverage more data. More recently, LiDAR self-supervision has been explored
in the context of point cloud forecasting [28,29,30]. However, when predicting
future sweeps given the history, as stated before, past approaches tend to couple the motion of the world with that of the ego-vehicle [28].
Motion Planning: An understanding of what is around an AV and what will
happen next [26] is crucial. This is typically done in bird's-eye-view (BEV) space by building a modular P&P pipeline. Although BEV motion planning does not precisely reflect planning in the 3D world, it is widely used as a high-resolution yet computation- and memory-efficient representation [32,24,3]. However, training such modules often requires a massive amount of data. End-to-end
learned planners requiring less human annotation have emerged, with end-to-end
imitation learning (IL) methods showing particular promise [6,23,5]. Such methods typically learn a neural network that maps sensor data either to actions (known as behavior cloning) or to an “action-ready” cost function (known as inverse optimal control) [18]. However, they are often criticized for lacking explainable intermediate representations, which makes them less accountable in safety-critical applications [20]. More recently, end-to-end learned but modular methods that produce explainable representations, e.g., neural motion planners [32,24,3], have been proposed. However, these still require costly object track labels. Unlike them, our approach learns intermediate representations that are explainable quantities for safety-critical motion planning without the need for track labels.
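To illustrate how such an “action-ready” cost function is consumed at planning time, here is a toy sketch (in Python/NumPy; all names are hypothetical and the discretization is deliberately naive): candidate trajectories are scored against a BEV cost map, e.g., a predicted occupancy grid, and the cheapest one is selected.

import numpy as np

def pick_trajectory(cost_map, trajectories, resolution, origin):
    """Score candidate BEV trajectories against a cost map and pick the cheapest.

    cost_map:     (H, W) per-cell cost in bird's-eye view (e.g., predicted occupancy)
    trajectories: (K, T, 2) candidate (x, y) waypoints in metric coordinates
    resolution:   meters per BEV cell
    origin:       (2,) metric (x, y) coordinates of cell (0, 0)
    """
    # Convert metric waypoints to integer grid indices, clipped to the map bounds.
    idx = np.round((trajectories - origin) / resolution).astype(int)
    idx = np.clip(idx, 0, np.array(cost_map.shape)[::-1] - 1)
    # Accumulate the cost of the cells each trajectory visits (row = y, col = x).
    costs = cost_map[idx[..., 1], idx[..., 0]].sum(axis=1)
    return trajectories[np.argmin(costs)]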