CramNet: Camera-Radar Fusion
with Ray-Constrained Cross-Attention
for Robust 3D Object Detection
Jyh-Jing Hwang, Henrik Kretzschmar, Joshua Manela, Sean Rafferty,
Nicholas Armstrong-Crews, Tiffany Chen, Dragomir Anguelov
Waymo
Abstract.
Robust 3D object detection is critical for safe autonomous
driving. Camera and radar sensors are synergistic as they capture com-
plementary information and work well under different environmental
conditions. Fusing camera and radar data is challenging, however, as each
of the sensors lacks information along a perpendicular axis, that is, depth
is unknown to camera and elevation is unknown to radar. We propose the
camera-radar matching network CramNet, an efficient approach to fuse
the sensor readings from camera and radar in a joint 3D space. To lever-
age radar range measurements for better camera depth predictions, we
propose a novel ray-constrained cross-attention mechanism that resolves
the ambiguity in the geometric correspondences between camera features
and radar features. Our method supports training with sensor modality
dropout, which leads to robust 3D object detection, even when a camera
or radar sensor suddenly malfunctions on a vehicle. We demonstrate the
effectiveness of our fusion approach through extensive experiments on
the RADIATE dataset, one of the few large-scale datasets that provide
radar radio frequency imagery. A camera-only variant of our method
achieves competitive performance in monocular 3D object detection on
the Waymo Open Dataset.
Keywords: Sensor fusion; cross attention; robust 3D object detection.
1 Introduction
3D object detection that is robust to different weather conditions and sensor
failures is critical for safe autonomous driving. Fusion between camera and
radar sensors stands out as they are both relatively resistant to various weather
conditions [2] compared to the popular lidar sensor [3]. A fusion design that naturally handles single-sensor failures (camera or radar) is thus desirable and boosts safety in an autonomous driving system (Figure 1).
Most sensor fusion research has focused on fusion between lidar and another sensor [32,54,7,11,50,51,19,11,31,57,39] because lidar provides complete geometric information, i.e., azimuth, range, and elevation.
Fig. 1:
Our approach takes as input a camera image (top left) and a radar RF
image (bottom left). The model then predicts foreground segmentation for both
native 2D representations before projecting the foreground points with features
into a joint 3D space (middle bottom) for sensor fusion. Finally, the method
runs sparse convolutions in the joint space for 3D object detection. The network
architecture naturally supports training with sensor dropout. This allows the
resulting model to cope with sensor failures at inference time, as it can run on camera-only or radar-only input depending on which sensors are available.
Sparse correspondences between lidar and another sensor are thus well defined, making lidar an ideal carrier for fusion. On the other hand, even though camera and radar sensors are lighter and cheaper, consume less power, and endure longer than lidar, camera-radar fusion remains understudied. Camera-radar fusion is especially challenging as each sensor lacks information along one perpendicular axis: depth is unknown for camera and elevation is unknown for emerging imaging radar, as summarized in Table 1. Radar produces radio frequency (RF) imagery that encodes the environment approximately in the bird's-eye view (BEV) with various noise patterns; an example is shown in Figure 1. As a result, camera data (in perspective view) and radar data (in BEV) form many-to-many mappings, and the exact matching is unclear from geometry alone.
To solve the matching problem, we consider three possible schemes for fusion:
(1) Perspective view primary [32]: This scheme implies we trust the depth reasoning from the perspective view. One can project camera pixels to their 3D locations with depth estimates and find their vertical nearest neighbors among the corresponding radar points. If depth is unknown, one can project a pixel along a ray in 3D and perform matching.
(2) Bird's-eye view primary [50]: This scheme implies we trust the elevation reasoning from the bird's-eye view. However, since it is difficult to predict elevation from radar imagery directly, one might borrow elevation information from a map. Hence, the inferred elevation for radar is sometimes inaccurate, resulting in rare usage unless lidar is available.
(3) Cross-view matching [13]: This scheme implies we perform matching in a joint 3D space. For example, one can use supplementary information (a map or camera depth estimation) to upgrade camera and radar 2D image pixels to 3D point clouds (with some uncertainty) and perform matching between the point clouds directly.
Sensor | Azimuth | Range | Elevation | Resistance to weather | 3D detection literature
Camera | yes     | no    | yes       | medium                | abundant
Radar  | yes     | yes   | no        | high                  | scarce
Lidar  | yes     | yes   | yes       | low                   | abundant
Table 1: Characteristics of major sensors commonly used for autonomous driving. Both camera and radar tend to be less affected by inclement weather than lidar scanners. However, whereas a regular camera does not directly measure range, radar does not measure elevation. This poses a unique challenge for fusing camera and radar readings, as the geometric correspondences between the two sensors are underconstrained. Overall, camera-radar fusion is still underexplored in the literature. Although there exist radars that measure elevation, this paper focuses on planar radar, which is currently more common in automotive applications.
This is supposedly the most powerful scheme if we can properly handle the uncertainties. Our architecture is designed to enable this matching scheme, hence the name CramNet (Camera and RAdar Matching Network).
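To make the geometry of cross-view matching concrete, the sketch below lifts a camera pixel (with an estimated depth) and a radar BEV cell (with an assumed elevation) into a common 3D frame. This is our own illustration rather than code from the paper; the function names and calibration inputs (K, T_cam_to_world, T_radar_to_world) are placeholders.

```python
import numpy as np

def lift_camera_pixel(u, v, depth, K, T_cam_to_world):
    """Back-project pixel (u, v) with an estimated depth into world coordinates.

    Placeholder helper: K is the 3x3 pinhole intrinsic matrix and
    T_cam_to_world a 4x4 rigid transform, both assumed known from calibration.
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction, camera frame
    p_cam = ray_cam * depth                             # 3D point, camera frame
    return (T_cam_to_world @ np.append(p_cam, 1.0))[:3]

def lift_radar_cell(range_m, azimuth_rad, assumed_elevation, T_radar_to_world):
    """Lift a radar BEV cell (range, azimuth) into world coordinates.

    Planar radar measures no elevation, so an assumed height (e.g., from a map
    or a fixed prior) is plugged in -- this is the radar's uncertain axis.
    """
    p_radar = np.array([range_m * np.cos(azimuth_rad),
                        range_m * np.sin(azimuth_rad),
                        assumed_elevation,
                        1.0])
    return (T_radar_to_world @ p_radar)[:3]
```

Once both sensors are expressed in the same world frame, matching reduces to a (soft) nearest-neighbor association between the two point sets, with the camera's depth and the radar's elevation as the uncertain axes that the remainder of the paper addresses.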
Since the effectiveness of projecting into 3D space relies heavily on accurate camera depth estimates, we propose a ray-constrained cross-attention mechanism that leverages radar for better depth estimation. The idea is to match radar responses along the camera ray emitted from each pixel: the correct projection should lie at the locations where radar senses reflections. Our architecture is further designed to accept sensor failures naturally. As shown in Figure 1, the model can operate even when one of the modalities is corrupted during inference. To this end, we incorporate sensor dropout [7,52] in the point cloud fusion stage during training to boost sensor robustness.
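As a rough illustration of how such a mechanism could be wired up (our sketch, not the authors' implementation: the number of depth samples, the search window, the BEV extent, the bilinear grid sampling, and the residual fusion are all assumptions), each foreground pixel's feature acts as a query, candidate depths sampled along its ray are projected into the radar BEV feature map to form keys and values, and the attention weights yield a refined, radar-consistent depth:

```python
import torch
import torch.nn.functional as F

def ray_constrained_cross_attention(cam_feat, radar_bev, ray_xy, depth_init,
                                    num_samples=8, delta=2.0, bev_extent=100.0):
    """Refine per-pixel depth by attending to radar features along the camera ray.

    cam_feat:   (N, C)  features of N foreground camera pixels (queries).
    radar_bev:  (C, H, W) radar BEV feature map.
    ray_xy:     (N, 2)  unit ray direction of each pixel on the BEV plane.
    depth_init: (N,)    initial monocular depth estimate per pixel.
    Assumed hyperparameters: num_samples candidate depths within +/- delta meters,
    a square BEV of side bev_extent meters centered at the sensor.
    """
    N, C = cam_feat.shape
    offsets = torch.linspace(-delta, delta, num_samples, device=depth_init.device)
    depths = depth_init[:, None] + offsets[None, :]                   # (N, S)
    # Candidate locations on the BEV plane: ray direction * candidate range.
    xy = ray_xy[:, None, :] * depths[..., None]                       # (N, S, 2)
    # Normalize metric coordinates to [-1, 1] for grid_sample.
    grid = (xy / (bev_extent / 2)).clamp(-1, 1).view(1, N, num_samples, 2)
    # Bilinearly sample radar features at every candidate location.
    keys = F.grid_sample(radar_bev[None], grid, align_corners=False)  # (1, C, N, S)
    keys = keys[0].permute(1, 2, 0)                                   # (N, S, C)
    # Scaled dot-product attention of the pixel feature against its ray samples.
    attn = torch.softmax((keys @ cam_feat[:, :, None]).squeeze(-1) / C ** 0.5, dim=-1)
    depth_refined = (attn * depths).sum(-1)                           # expected depth
    fused_feat = (attn[..., None] * keys).sum(1) + cam_feat           # residual fusion
    return depth_refined, fused_feat
```

A depth refined this way can then drive the camera branch's 2D-to-3D projection in the architecture of Figure 2.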
We summarize the contributions of this paper as follows:
1. We present a camera-radar fusion architecture for 3D object detection that is flexible enough to fall back to a single sensor modality in the event of a sensor failure.
2. We demonstrate that the sensor fusion model effectively leverages data from both sensors as the model outperforms both the camera-only and the radar-only variants significantly.
3. We propose a ray-constrained cross-attention mechanism that leverages the range measurements from radar to improve camera depth estimates, leading to improved detection performance.
4. We incorporate sensor dropout during training to further improve the accuracy and the robustness of camera-radar 3D object detection.
5. We demonstrate state-of-the-art radar-only and camera-radar detection performance on the RADIATE dataset [40] and competitive camera-only detection performance on the Waymo Open Dataset [47].
2 Related Work
Camera-based 3D object detection.
Monocular camera 3D object detection was first approached by directly extending 2D detection architectures and incorporating geometric relationships between the 2D perspective view and 3D space [6,27,4,43,23,44,8,16]. Utilizing pixel-wise depth maps as an additional input shows improved results, either for lifting detected boxes [26,42] or for projecting image pixels into 3D point clouds [53,24,58,9,55] (also known as Pseudo-LiDAR [53]). More recently, another camp of methods has emerged as promising: projecting intermediate features into BEV grid features along the projection ray without explicitly forming 3D point clouds [36,46,34,18].
The BEV grid methods benefit from naturally expressing the 3D projection
uncertainty along the depth dimension. However, these methods suffer from
significantly increased compute requirements as the detection range expands. In
contrast, we model the depth uncertainty through sampling along the projection
ray and consulting radar features for more accurate range signals. This also
enables the adoption of foreground extraction that allows a balanced trade-off
between detection range and computation.
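For contrast, the following generic sketch shows the BEV-grid lifting style of the methods cited above (not any specific method's code, and not CramNet): every pixel spreads its feature over a categorical depth distribution and is splatted into a dense BEV grid, which is where the growth of compute with detection range comes from. Shapes, names, and the scatter scheme are illustrative assumptions.

```python
import torch

def lift_splat_bev(img_feat, depth_logits, pixel_rays, grid_size, cell_m):
    """Illustrative BEV-grid lifting in the Lift-Splat style (not the paper's method).

    img_feat:     (H*W, C) image features.
    depth_logits: (H*W, D) per-pixel categorical depth distribution over D bins.
    pixel_rays:   (H*W, D, 2) precomputed BEV (x, y) location, in meters, of every
                  pixel/depth-bin pair, already shifted so the grid origin is (0, 0).
    grid_size:    number of BEV cells per side; cell_m: meters per cell.
    The dense outer product below scales with both image size and the number of
    depth bins, and the BEV grid itself grows quadratically with detection range.
    """
    H_W, C = img_feat.shape
    depth_prob = depth_logits.softmax(dim=1)                      # (H*W, D)
    lifted = depth_prob[..., None] * img_feat[:, None, :]         # (H*W, D, C)
    # Scatter-add every lifted feature into its BEV cell.
    idx = (pixel_rays / cell_m).long().clamp(0, grid_size - 1)    # (H*W, D, 2)
    flat = idx[..., 0] * grid_size + idx[..., 1]                  # (H*W, D)
    bev = torch.zeros(grid_size * grid_size, C,
                      dtype=img_feat.dtype, device=img_feat.device)
    bev.index_add_(0, flat.reshape(-1), lifted.reshape(-1, C))
    return bev.view(grid_size, grid_size, C)
```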
Radar-based 3D object detection.
Frequency-modulated continuous wave (FMCW) radar data is usually presented in two representations: radio frequency (RF) images and radar points. The RF images are generated from the raw radar signals using a series of fast Fourier transforms and encode a wide variety of sensing context, whereas the radar points are derived from these RF images through a peak detection algorithm such as the Constant False Alarm Rate (CFAR) algorithm [35]. The downside of radar points is that recall is imperfect and the contextual information of radar returns is lost, with only the range, azimuth, and Doppler information retained. As a result, radar points are not suitable for effective single-modality object detection [38,33], which is why most works use this data format only to foster fusion [2,13,29,28]. On the other hand, the RF images retain rich environmental context and even complete object motion information, enabling a deep learning model to understand the semantic meaning of a scene [25,40]. Our work is therefore built upon radar RF images and can produce reasonable 3D object detection predictions with radar-only inputs.
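As background on how radar points are extracted from RF imagery, below is a minimal 1D cell-averaging CFAR sketch (our illustration; the window sizes and threshold factor are arbitrary, and automotive CFAR variants such as OS-CFAR differ in how they estimate the noise floor):

```python
import numpy as np

def ca_cfar_1d(power, num_train=16, num_guard=4, scale=3.0):
    """Cell-averaging CFAR on a 1D power profile (e.g., one azimuth row of range bins).

    For each cell under test, estimate the noise floor as the mean of
    `num_train` training cells on each side, skipping `num_guard` guard cells,
    and flag a detection when power > scale * noise estimate.
    """
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for i in range(half, n - half):
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        noise = np.concatenate([left, right]).mean()
        detections[i] = power[i] > scale * noise
    return detections

# The flagged cells become sparse "radar points" (range/azimuth/Doppler only),
# which illustrates the information loss relative to the full RF image that
# motivates operating on RF imagery directly.
```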
Sensor fusion for 3D object detection.
Sensor fusion for 3D object detection has been studied extensively using lidar and camera. The reasons are twofold: 1) lidar scans provide comprehensive 3D representations for inferring correspondences between sensors, and 2) camera images contain more semantic information to further boost recognition ability. Various directions have been explored, such as image detection in 2D before projecting into frustums [32,54], two-stage frameworks with object-centric modality fusion [7,11,17], image-feature-based lidar point decoration [50,51], and multi-level fusion [19,11,31]. Since sparse correspondences between camera and lidar are well defined, fusion is mostly focused on integrating information rather than matching points from different sensors.
As a result, these fusion techniques are not directly applicable to camera-radar fusion, where associations are underconstrained. Early work by Lim et al. [20] applies feature fusion directly between camera and radar features without any geometric considerations. Recently, more works tend to leverage camera models and geometry for association.
Fig. 2:
Architecture overview. Our method can be partitioned into three stages:
(1a) camera 2D foreground segmentation and depth estimation, (1b) radar 2D
foreground segmentation, (2) projection from 2D to 3D and subsequent point
cloud fusion, and (3) 3D foreground point cloud object detection. The cross-
attention mechanism modifies the camera depth estimation by consulting radar
features, as further illustrated in Figure 3. The modality coding module appends
a camera or radar binary code to the features that are fed into the 3D stage,
enabling sensor dropout and enhancing robustness. We depict the camera stream
in blue, the radar stream in green, and the fused stream in red.
For example, CenterFusion [28] creates camera object proposal frustums to associate radar features, and GRIF Net [13] projects 3D RoIs to the camera perspective view and the radar BEV to associate features. Our model, on the other hand, fuses camera-radar data in a joint 3D space with the flexibility to perform 3D detection with either single modality, leading to increased robustness.
3 CramNet for Robust 3D Object Detection
We describe the overall architecture for camera-radar fusion in Section 3.1. In Section 3.2, we then introduce a ray-constrained cross-attention mechanism that leverages radar for better camera 3D point localization. Finally, in Section 3.3, we propose sensor dropout, which can be integrated seamlessly into the architecture to further improve the robustness of 3D object detection.
3.1 Overall Architecture
Our model architecture, shown in Figure 2, is inspired by Range Sparse Net (RSN) [48], an efficient two-stage lidar-based object detection framework. The RSN framework takes perspective range images as input, segments foreground pixels in the perspective view, extracts 3D (BEV) features on foreground regions using sparse convolution [56], and performs CenterNet-style [60] detection. We adapt the framework for camera-radar fusion, and the overall architecture can be partitioned into three stages, as illustrated in Figure 2.
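To summarize the data flow, here is a high-level skeleton of the three stages as depicted in Figure 2. It is a schematic sketch rather than the released implementation: the five callables, the per-point feature layout, and the stand-in for the sparse-convolution detector are all assumptions.

```python
import torch

def cramnet_forward(cam_net, radar_net, lift_camera, lift_radar, detector_3d,
                    image, radar_rf, training=False, p_drop=0.0):
    """Schematic forward pass of the three-stage pipeline (illustrative only).

    The five callables are placeholders for the real modules: the 2D camera
    network (foreground + depth), the 2D radar network (foreground), the two
    2D-to-3D lifting functions, and the 3D detector with CenterNet-style heads.
    Per-point features are assumed to be tensors of shape (N, C).
    """
    # Stage 1: per-sensor foreground extraction in the native 2D views.
    cam_feat, cam_fg, cam_depth = cam_net(image)
    radar_feat, radar_fg = radar_net(radar_rf)

    # Stage 2: lift foreground pixels/cells into a joint 3D space.
    cam_pts = lift_camera(cam_fg, cam_depth)   # depth possibly refined by cross-attention
    radar_pts = lift_radar(radar_fg)           # planar radar needs an assumed elevation

    # Modality coding: append a binary camera(0)/radar(1) flag to each feature.
    cam_feat = torch.cat([cam_feat, torch.zeros_like(cam_feat[:, :1])], dim=1)
    radar_feat = torch.cat([radar_feat, torch.ones_like(radar_feat[:, :1])], dim=1)

    # Sensor dropout during training: occasionally remove an entire modality so
    # the 3D stage learns to detect from either sensor alone.
    if training and torch.rand(()).item() < p_drop:
        if torch.rand(()).item() < 0.5:
            cam_pts, cam_feat = cam_pts[:0], cam_feat[:0]
        else:
            radar_pts, radar_feat = radar_pts[:0], radar_feat[:0]

    # Stage 3: sparse (BEV) convolutions and detection heads on the fused
    # foreground point cloud.
    points = torch.cat([cam_pts, radar_pts], dim=0)
    features = torch.cat([cam_feat, radar_feat], dim=0)
    return detector_3d(points, features)
```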