CramNet: Camera-Radar Fusion
with Ray-Constrained Cross-Attention
for Robust 3D Object Detection
Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty,
Nicholas Armstrong-Crews, Tiffany Chen, Dragomir Anguelov
Waymo
Abstract.
Robust 3D object detection is critical for safe autonomous
driving. Camera and radar sensors are synergistic as they capture com-
plementary information and work well under different environmental
conditions. Fusing camera and radar data is challenging, however, as each
of the sensors lacks information along a perpendicular axis, that is, depth
is unknown to camera and elevation is unknown to radar. We propose the
camera-radar matching network CramNet, an efficient approach to fuse
the sensor readings from camera and radar in a joint 3D space. To lever-
age radar range measurements for better camera depth predictions, we
propose a novel ray-constrained cross-attention mechanism that resolves
the ambiguity in the geometric correspondences between camera features
and radar features. Our method supports training with sensor modality
dropout, which leads to robust 3D object detection, even when a camera
or radar sensor suddenly malfunctions on a vehicle. We demonstrate the
effectiveness of our fusion approach through extensive experiments on
the RADIATE dataset, one of the few large-scale datasets that provide
radar radio frequency imagery. A camera-only variant of our method
achieves competitive performance in monocular 3D object detection on
the Waymo Open Dataset.
Keywords: Sensor fusion; cross attention; robust 3D object detection.
1 Introduction
3D object detection that is robust to different weather conditions and sensor
failures is critical for safe autonomous driving. Fusion between camera and
radar sensors stands out as they are both relatively resistant to various weather
conditions [2] compared to the popular lidar sensor [3]. A fusion design that naturally handles single-sensor failures (camera or radar) is thus desirable and boosts safety in an autonomous driving system (Figure 1).
Most sensor fusion research has focused on fusion between lidar and another sensor [32,54,7,11,50,51,19,11,31,57,39] because lidar provides complete geometric information, i.e., azimuth, range, and elevation.
Fig. 1:
Our approach takes as input a camera image (top left) and a radar RF
image (bottom left). The model then predicts foreground segmentation for both
native 2D representations before projecting the foreground points with features
into a joint 3D space (middle bottom) for sensor fusion. Finally, the method
runs sparse convolutions in the joint space for 3D object detection. The network
architecture naturally supports training with sensor dropout. This allows the
resulting model to cope with sensor failures at inference time, as it can run on camera-only or radar-only input depending on which sensors are available.
Sparse correspondences between lidar and another sensor are thus well defined, making lidar an ideal carrier for fusion. On the other hand, even though camera and radar sensors are lighter and cheaper, consume less power, and endure longer than lidar, camera-radar fusion remains understudied. Camera-radar fusion is especially challenging as each sensor lacks information along one perpendicular axis: depth is unknown for camera and elevation is unknown for emerging imaging radar, as summarized in Table 1. Radar produces radio frequency (RF) imagery that encodes the environment approximately in the bird's-eye view (BEV) with various noise patterns; an example is shown in Figure 1. As a result, camera data (in perspective view) and radar data (in BEV) form many-to-many mappings, and the exact matching is unclear from geometry alone.
To solve the matching problem, we consider three possible schemes for fusion:
(1) Perspective view primary [32]: This scheme implies we trust the depth reasoning from the perspective view. One can project camera pixels to their 3D locations with depth estimates and find their vertical nearest neighbors among the corresponding radar points. If depth is unknown, one can project a pixel along a ray in 3D and perform matching.
(2) Bird's-eye view primary [50]: This scheme implies we trust the elevation reasoning from the bird's-eye view. However, since it is difficult to predict elevation from radar imagery directly, one might borrow elevation information from a map. Hence, the inferred elevation for radar is sometimes inaccurate, resulting in rare usage unless lidar is available.
(3) Cross-view matching [13]: This scheme implies we perform matching in a joint 3D space. For example, one can use supplementary information (a map or camera depth estimation) to upgrade camera and radar 2D image pixels to 3D point clouds (with some uncertainty) and perform matching between the point clouds directly.
Sensor | Azimuth | Range | Elevation | Resistance to weather | 3D detection literature
Camera | yes     | no    | yes       | medium                | abundant
Radar  | yes     | yes   | no        | high                  | scarce
Lidar  | yes     | yes   | yes       | low                   | abundant
Table 1: Characteristics of major sensors commonly used for autonomous driving. Both camera and radar tend to be less affected by inclement weather than lidar scanners. However, whereas a regular camera does not directly measure range, radar does not measure elevation. This poses a unique challenge for fusing camera and radar readings, as the geometric correspondences between the two sensors are underconstrained. Overall, camera-radar fusion is still underexplored in the literature. Although there exist radars that measure elevation, this paper focuses on planar radar, which is currently more common in automotive applications.
This is supposedly the most powerful scheme if we can properly handle the uncertainties. Our architecture is designed to enable this matching scheme, hence the name CramNet (Camera and RAdar Matching Network).
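To make the geometry of cross-view matching concrete, the sketch below lifts a camera pixel (with an estimated depth) and a radar BEV cell (with an assumed elevation) into a common 3D frame. This is our own illustration rather than code from the paper; the function names and calibration inputs (K, T_cam_to_world, T_radar_to_world) are placeholders.

```python
import numpy as np

def lift_camera_pixel(u, v, depth, K, T_cam_to_world):
    """Back-project pixel (u, v) with an estimated depth into world coordinates.

    Placeholder helper: K is the 3x3 pinhole intrinsic matrix and
    T_cam_to_world a 4x4 rigid transform, both assumed known from calibration.
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction, camera frame
    p_cam = ray_cam * depth                             # 3D point, camera frame
    return (T_cam_to_world @ np.append(p_cam, 1.0))[:3]

def lift_radar_cell(range_m, azimuth_rad, assumed_elevation, T_radar_to_world):
    """Lift a radar BEV cell (range, azimuth) into world coordinates.

    Planar radar measures no elevation, so an assumed height (e.g., from a map
    or a fixed prior) is plugged in -- this is the radar's uncertain axis.
    """
    p_radar = np.array([range_m * np.cos(azimuth_rad),
                        range_m * np.sin(azimuth_rad),
                        assumed_elevation,
                        1.0])
    return (T_radar_to_world @ p_radar)[:3]
```

Once both sensors are expressed in the same world frame, matching reduces to a (soft) nearest-neighbor association between the two point sets, with the camera's depth and the radar's elevation as the uncertain axes that the remainder of the paper addresses.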
Since the effectiveness of projecting into 3D space relies heavily on accurate camera depth estimates, we propose a ray-constrained cross-attention mechanism that leverages radar for better depth estimation. The idea is to match radar responses along the camera ray emitted from each pixel: the correct projection should lie at the locations where radar senses reflections. Our architecture is further designed to accept sensor failures naturally. As shown in Figure 1, the model can operate even when one of the modalities is corrupted during inference. To this end, we incorporate sensor dropout [7,52] in the point cloud fusion stage during training to boost sensor robustness.
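As a rough illustration of how such a mechanism could be wired up (our sketch, not the authors' implementation: the number of depth samples, the search window, the BEV extent, the bilinear grid sampling, and the residual fusion are all assumptions), each foreground pixel's feature acts as a query, candidate depths sampled along its ray are projected into the radar BEV feature map to form keys and values, and the attention weights yield a refined, radar-consistent depth:

```python
import torch
import torch.nn.functional as F

def ray_constrained_cross_attention(cam_feat, radar_bev, ray_xy, depth_init,
                                    num_samples=8, delta=2.0, bev_extent=100.0):
    """Refine per-pixel depth by attending to radar features along the camera ray.

    cam_feat:   (N, C)  features of N foreground camera pixels (queries).
    radar_bev:  (C, H, W) radar BEV feature map.
    ray_xy:     (N, 2)  unit ray direction of each pixel on the BEV plane.
    depth_init: (N,)    initial monocular depth estimate per pixel.
    Assumed hyperparameters: num_samples candidate depths within +/- delta meters,
    a square BEV of side bev_extent meters centered at the sensor.
    """
    N, C = cam_feat.shape
    offsets = torch.linspace(-delta, delta, num_samples, device=depth_init.device)
    depths = depth_init[:, None] + offsets[None, :]                   # (N, S)
    # Candidate locations on the BEV plane: ray direction * candidate range.
    xy = ray_xy[:, None, :] * depths[..., None]                       # (N, S, 2)
    # Normalize metric coordinates to [-1, 1] for grid_sample.
    grid = (xy / (bev_extent / 2)).clamp(-1, 1).view(1, N, num_samples, 2)
    # Bilinearly sample radar features at every candidate location.
    keys = F.grid_sample(radar_bev[None], grid, align_corners=False)  # (1, C, N, S)
    keys = keys[0].permute(1, 2, 0)                                   # (N, S, C)
    # Scaled dot-product attention of the pixel feature against its ray samples.
    attn = torch.softmax((keys @ cam_feat[:, :, None]).squeeze(-1) / C ** 0.5, dim=-1)
    depth_refined = (attn * depths).sum(-1)                           # expected depth
    fused_feat = (attn[..., None] * keys).sum(1) + cam_feat           # residual fusion
    return depth_refined, fused_feat
```

A depth refined this way can then drive the camera branch's 2D-to-3D projection in the architecture of Figure 2.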
We summarize the contributions of this paper as follows:
1. We present a camera-radar fusion architecture for 3D object detection that is flexible enough to fall back to a single sensor modality in the event of a sensor failure.
2. We demonstrate that the sensor fusion model effectively leverages data from both sensors as the model outperforms both the camera-only and the radar-only variants significantly.
3. We propose a ray-constrained cross-attention mechanism that leverages the range measurements from radar to improve camera depth estimates, leading to improved detection performance.
4. We incorporate sensor dropout during training to further improve the accuracy and the robustness of camera-radar 3D object detection.
5. We demonstrate state-of-the-art radar-only and camera-radar detection performance on the RADIATE dataset [40] and competitive camera-only detection performance on the Waymo Open Dataset [47].
2 Related Work
Camera-based 3D object detection.
Monocular camera 3D object detection was first approached by directly extending 2D detection architectures and incorporating geometric relationships between the 2D perspective view and 3D space [6,27,4,43,23,44,8,16]. Utilizing pixel-wise depth maps as an additional input shows improved results, either for lifting detected boxes [26,42] or for projecting image pixels into 3D point clouds [53,24,58,9,55] (also known as Pseudo-LiDAR [53]). More recently, another camp of methods has emerged as promising: projecting intermediate features into BEV grid features along the projection ray without explicitly forming 3D point clouds [36,46,34,18].
The BEV grid methods benefit from naturally expressing the 3D projection
uncertainty along the depth dimension. However, these methods suffer from
significantly increased compute requirements as the detection range expands. In
contrast, we model the depth uncertainty through sampling along the projection
ray and consulting radar features for more accurate range signals. This also
enables the adoption of foreground extraction that allows a balanced trade-off
between detection range and computation.
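For contrast, the following generic sketch shows the BEV-grid lifting style of the methods cited above (not any specific method's code, and not CramNet): every pixel spreads its feature over a categorical depth distribution and is splatted into a dense BEV grid, which is where the growth of compute with detection range comes from. Shapes, names, and the scatter scheme are illustrative assumptions.

```python
import torch

def lift_splat_bev(img_feat, depth_logits, pixel_rays, grid_size, cell_m):
    """Illustrative BEV-grid lifting in the Lift-Splat style (not the paper's method).

    img_feat:     (H*W, C) image features.
    depth_logits: (H*W, D) per-pixel categorical depth distribution over D bins.
    pixel_rays:   (H*W, D, 2) precomputed BEV (x, y) location, in meters, of every
                  pixel/depth-bin pair, already shifted so the grid origin is (0, 0).
    grid_size:    number of BEV cells per side; cell_m: meters per cell.
    The dense outer product below scales with both image size and the number of
    depth bins, and the BEV grid itself grows quadratically with detection range.
    """
    H_W, C = img_feat.shape
    depth_prob = depth_logits.softmax(dim=1)                      # (H*W, D)
    lifted = depth_prob[..., None] * img_feat[:, None, :]         # (H*W, D, C)
    # Scatter-add every lifted feature into its BEV cell.
    idx = (pixel_rays / cell_m).long().clamp(0, grid_size - 1)    # (H*W, D, 2)
    flat = idx[..., 0] * grid_size + idx[..., 1]                  # (H*W, D)
    bev = torch.zeros(grid_size * grid_size, C,
                      dtype=img_feat.dtype, device=img_feat.device)
    bev.index_add_(0, flat.reshape(-1), lifted.reshape(-1, C))
    return bev.view(grid_size, grid_size, C)
```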
Radar-based 3D object detection.
Frequency-modulated continuous wave (FMCW) radar data is usually presented in two representations: radio frequency (RF) images and radar points. The RF images are generated from the raw radar signals using a series of fast Fourier transforms and encode a wide variety of sensing context, whereas the radar points are derived from these RF images through a peak detection algorithm such as the Constant False Alarm Rate (CFAR) algorithm [35]. The downside of radar points is that recall is imperfect and the contextual information of radar returns is lost, with only the range, azimuth, and Doppler information retained. As a result, radar points are not suitable for effective single-modality object detection [38,33], which is why most works use this data format only to foster fusion [2,13,29,28]. On the other hand, the RF images retain rich environmental context and even complete object motion information, enabling a deep learning model to understand the semantic meaning of a scene [25,40]. Our work is therefore built upon radar RF images and can produce reasonable 3D object detection predictions with radar-only inputs.
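As background on how radar points are extracted from RF imagery, below is a minimal 1D cell-averaging CFAR sketch (our illustration; the window sizes and threshold factor are arbitrary, and automotive CFAR variants such as OS-CFAR differ in how they estimate the noise floor):

```python
import numpy as np

def ca_cfar_1d(power, num_train=16, num_guard=4, scale=3.0):
    """Cell-averaging CFAR on a 1D power profile (e.g., one azimuth row of range bins).

    For each cell under test, estimate the noise floor as the mean of
    `num_train` training cells on each side, skipping `num_guard` guard cells,
    and flag a detection when power > scale * noise estimate.
    """
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for i in range(half, n - half):
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        noise = np.concatenate([left, right]).mean()
        detections[i] = power[i] > scale * noise
    return detections

# The flagged cells become sparse "radar points" (range/azimuth/Doppler only),
# which illustrates the information loss relative to the full RF image that
# motivates operating on RF imagery directly.
```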
Sensor fusion for 3D object detection.
Sensor fusion for 3D object detection has been studied extensively using lidar and camera. The reasons are twofold: 1) lidar scans provide comprehensive 3D representations for inferring correspondences between sensors, and 2) camera images contain more semantic information to further boost recognition ability. Various directions have been explored, such as image detection in 2D before projecting into frustums [32,54], two-stage frameworks with object-centric modality fusion [7,11,17], image-feature-based lidar point decoration [50,51], and multi-level fusion [19,11,31]. Since sparse correspondences between camera and lidar are well defined, fusion is mostly focused on integrating information rather than matching points from different sensors.
As a result, these fusion techniques are not directly applicable to camera-radar fusion, where associations are underconstrained. Early work by Lim et al. [20] applies feature fusion directly between camera and radar features without any geometric considerations. Recently, more works tend to leverage camera models and geometry for association.
Fig. 2:
Architecture overview. Our method can be partitioned into three stages:
(1a) camera 2D foreground segmentation and depth estimation, (1b) radar 2D
foreground segmentation, (2) projection from 2D to 3D and subsequent point
cloud fusion, and (3) 3D foreground point cloud object detection. The cross-
attention mechanism modifies the camera depth estimation by consulting radar
features, as further illustrated in Figure 3. The modality coding module appends
a camera or radar binary code to the features that are fed into the 3D stage,
enabling sensor dropout and enhancing robustness. We depict the camera stream
in blue, the radar stream in green, and the fused stream in red.
For example, CenterFusion [28] creates camera object proposal frustums to associate radar features, and GRIF Net [13] projects 3D RoIs to the camera perspective view and the radar BEV to associate features. Our model, on the other hand, fuses camera-radar data in a joint 3D space with the flexibility to perform 3D detection with either single modality, leading to increased robustness.
3 CramNet for Robust 3D Object Detection
We describe the overall architecture for camera-radar fusion in Section 3.1. In Section 3.2, we then introduce a ray-constrained cross-attention mechanism that leverages radar for better camera 3D point localization. Finally, in Section 3.3, we propose sensor dropout, which can be integrated seamlessly into the architecture to further improve the robustness of 3D object detection.
3.1 Overall Architecture
Our model architecture, shown in Figure 2, is inspired by Range Sparse Net (RSN) [48], an efficient two-stage lidar-based object detection framework. The RSN framework takes perspective range images as input, segments foreground pixels in the perspective view, extracts 3D (BEV) features on foreground regions using sparse convolution [56], and performs CenterNet-style [60] detection. We adapt the framework for camera-radar fusion, and the overall architecture can be partitioned into three stages, as illustrated in Figure 2.
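To summarize the data flow, here is a high-level skeleton of the three stages as depicted in Figure 2. It is a schematic sketch rather than the released implementation: the five callables, the per-point feature layout, and the stand-in for the sparse-convolution detector are all assumptions.

```python
import torch

def cramnet_forward(cam_net, radar_net, lift_camera, lift_radar, detector_3d,
                    image, radar_rf, training=False, p_drop=0.0):
    """Schematic forward pass of the three-stage pipeline (illustrative only).

    The five callables are placeholders for the real modules: the 2D camera
    network (foreground + depth), the 2D radar network (foreground), the two
    2D-to-3D lifting functions, and the 3D detector with CenterNet-style heads.
    Per-point features are assumed to be tensors of shape (N, C).
    """
    # Stage 1: per-sensor foreground extraction in the native 2D views.
    cam_feat, cam_fg, cam_depth = cam_net(image)
    radar_feat, radar_fg = radar_net(radar_rf)

    # Stage 2: lift foreground pixels/cells into a joint 3D space.
    cam_pts = lift_camera(cam_fg, cam_depth)   # depth possibly refined by cross-attention
    radar_pts = lift_radar(radar_fg)           # planar radar needs an assumed elevation

    # Modality coding: append a binary camera(0)/radar(1) flag to each feature.
    cam_feat = torch.cat([cam_feat, torch.zeros_like(cam_feat[:, :1])], dim=1)
    radar_feat = torch.cat([radar_feat, torch.ones_like(radar_feat[:, :1])], dim=1)

    # Sensor dropout during training: occasionally remove an entire modality so
    # the 3D stage learns to detect from either sensor alone.
    if training and torch.rand(()).item() < p_drop:
        if torch.rand(()).item() < 0.5:
            cam_pts, cam_feat = cam_pts[:0], cam_feat[:0]
        else:
            radar_pts, radar_feat = radar_pts[:0], radar_feat[:0]

    # Stage 3: sparse (BEV) convolutions and detection heads on the fused
    # foreground point cloud.
    points = torch.cat([cam_pts, radar_pts], dim=0)
    features = torch.cat([cam_feat, radar_feat], dim=0)
    return detector_3d(points, features)
```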