Homogeneous Multi-modal Feature Fusion and
Interaction for 3D Object Detection
Xin Li1, Botian Shi2, Yuenan Hou2, Xingjiao Wu1,3, Tianlong Ma1,3,
Yikang Li2†, and Liang He1†
1East China Normal University 2Shanghai AI Lab 3Fudan University
{sankin0528, wuxingjiao2885}@gmail.com
{shibotian, houyuenan, liyikang}@pjlab.org.cn {tlma, lhe}@cs.ecnu.edu.cn
† Corresponding authors
Abstract. Multi-modal 3D object detection has been an active research
topic in autonomous driving. Nevertheless, it is non-trivial to explore the
cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud
features that are projected onto the 2D image plane or combine the sparse
point cloud with dense image pixels. These fusion approaches often suf-
fer from severe information loss, thus causing sub-optimal performance.
To address these problems, we construct a homogeneous structure be-
tween the point cloud and images to avoid projective information loss by
transforming the camera features into the LiDAR 3D space. In this paper,
we propose a homogeneous multi-modal feature fusion and interaction
method (HMFI) for 3D object detection. Specifically, we first design an
image voxel lifter module (IVLM) to lift 2D image features into the 3D
space and generate homogeneous image voxel features. Then, we fuse
the voxelized point cloud features with the image features from different
regions by introducing the self-attention based query fusion mechanism
(QFM). Next, we propose a voxel feature interaction module (VFIM) to
enforce the consistency of semantic information from identical objects in
the homogeneous point cloud and image voxel representations, which can
provide object-level alignment guidance for cross-modal feature fusion
and strengthen the discriminative ability in complex backgrounds. We
conduct extensive experiments on the KITTI and Waymo Open Datasets,
and the proposed HMFI achieves better performance than state-of-the-art
multi-modal methods. In particular, for 3D cyclist detection on the KITTI
benchmark, HMFI surpasses all published algorithms by a large margin.
Keywords: 3D object detection, multi-modal, feature-level fusion, self-attention
1 Introduction
3D object detection is an important task that aims to precisely localize and classify each object in 3D space, thus allowing vehicles to perceive and understand their surrounding environment comprehensively.
[Figure 1 graphic: (a) schematic of RoI fusion, point/voxel fusion, and homogeneous fusion (ours); (b) 3D car mean AP (%) vs. inference time (ms) for AVOD, MVXNet, PI-RCNN, MMF, 3D-CVF, EPNet, and ours.]
Fig. 1: (a) Schematic comparison between different feature-level fusion based
methods. (b) Quantitative comparison with competitive multi-modal feature-
level fusion methods. Our method achieves a good performance-efficiency trade-off for the car category (mean AP over all difficulty levels) on the KITTI [7] benchmark.
So far, various LiDAR-based and image-based 3D detection approaches [33,34,36,24,40,18,41,39,9,6,26] have been proposed.
LiDAR-based methods can achieve superior performance over image-based approaches because point clouds contain precise spatial information. However, LiDAR points are usually sparse and lack color and texture information. Image-based approaches, in contrast, are better at capturing semantic information but suffer from the absence of depth measurements. Therefore, multi-modal 3D object detection is a promising direction that can fully utilize the complementary information of images and point clouds.
Recent multi-modal approaches can be generally categorized into two types: decision-level fusion and feature-level fusion. Decision-level fusion methods ensemble the objects detected separately in each modality, so their performance is bounded by each individual detector [29]. Feature-level fusion is more prevalent because it fuses the rich, informative features of the two modalities. Three representative feature-level fusion schemes are depicted in Fig. 1(a). The first fuses multi-modal features at regions of interest (RoI). However, such methods suffer severe spatial information loss when projecting 3D points onto the bird's-eye view (BEV) or front view (FV) of the 2D plane, even though 3D information plays a key role in accurate 3D object localization. Another line of work conducts fusion at the point/voxel level [43,49,55,21,22,50,14,59], which achieves complementary fusion at a much finer granularity by combining low-level multi-modal features at 3D points or 2D pixels. However, these methods can only establish a relatively coarse correspondence between the point/voxel features and image features. Moreover, both schemes usually suffer from severe information loss due to the mismatched projection between dense 2D image pixels and sparse 3D LiDAR points.
To address the aforementioned problems, we propose a homogeneous fusion scheme that lifts image features from the 2D plane into a dense 3D voxel structure. Based on this scheme, we propose the Homogeneous Multi-modal Feature Fusion and Interaction method (HMFI), which exploits the complementary information in multi-modal features and alleviates the severe information loss caused by dimension-reducing projection. Furthermore, we build a cross-modal feature interaction between the point cloud features and image features at the object level, based on the homogeneous 3D structure, to strengthen the model's ability to fuse image semantic information with the point cloud.
Specifically, we first design an image voxel lifter module (IVLM) that lifts 2D image features into 3D space, guided by the point cloud as a depth hint, and constructs a homogeneous voxel structure of the 2D image for multi-modal feature fusion, so that fusing the two modalities does not cause information loss. We also observe that the homogeneous voxel structure of the cross-modal data facilitates feature fusion and interaction. Thus, we introduce the query fusion mechanism (QFM), a self-attention-based operation that adaptively combines point cloud and image features. Each point cloud voxel queries all image voxels to achieve homogeneous feature fusion, and the result is combined with the original point cloud voxel features to form the joint camera-LiDAR features. QFM enables each point cloud voxel to adaptively perceive image features in the common 3D space and fuse the two homogeneous representations effectively.
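To make the query fusion step concrete, the sketch below gives one plausible, single-head reading of QFM: each non-empty point cloud voxel feature serves as a query over all image voxel features of the same frame, and the attended image context is concatenated with the original LiDAR voxel feature to form a joint camera-LiDAR feature. The tensor shapes, the single attention head, and the `fuse` projection are illustrative assumptions, not the exact HMFI implementation.

```python
import torch
import torch.nn as nn


class QueryFusionSketch(nn.Module):
    """Illustrative single-head query fusion: LiDAR voxels attend to image voxels.

    Assumed shapes for one frame: lidar_vox (N, C), image_vox (M, C), where N and M
    are the numbers of non-empty LiDAR/image voxels and C is the channel dimension.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        # Project the concatenated [LiDAR, attended image] feature back to C channels.
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, lidar_vox: torch.Tensor, image_vox: torch.Tensor) -> torch.Tensor:
        q = self.q(lidar_vox)                                         # (N, C)
        k = self.k(image_vox)                                         # (M, C)
        v = self.v(image_vox)                                         # (M, C)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (N, M) attention weights
        attended = attn @ v                                           # image context per LiDAR voxel, (N, C)
        return self.fuse(torch.cat([lidar_vox, attended], dim=-1))    # joint camera-LiDAR feature, (N, C)


# Usage with random features for a single frame.
fusion = QueryFusionSketch(channels=64)
joint = fusion(torch.randn(500, 64), torch.randn(800, 64))  # -> (500, 64)
```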
Besides, instead of refining features with RoI-based pooling that fuses low-level LiDAR and camera features with the joint camera-LiDAR features, we build a feature interaction between the homogeneous point cloud and image voxel features. We consider that, although the point cloud and image representations belong to different modalities, their object-level semantic properties should be similar in the homogeneous structure. Therefore, to strengthen the abstract representations of the point cloud and images in a shared 3D space and exploit the similarity of identical objects' properties across the two modalities, we propose a voxel feature interaction module (VFIM) at the object level to improve the consistency of the point cloud and image homogeneous representations within 3D RoIs. To be specific, we use voxel RoI pooling [6] to extract features from the two homogeneous representations according to the predicted proposals and produce a paired RoI feature set. We then apply a cosine similarity loss [5] between each pair of RoI features to enforce the consistency of object-level properties in the point cloud and images (a minimal illustrative sketch of this consistency loss is given after the contribution list below). Building the feature interaction on these homogeneous paired RoI features improves the object-level semantic consistency between the two homogeneous representations and enhances the model's ability to achieve cross-modal feature fusion. Extensive experiments conducted on the KITTI and Waymo Open Datasets demonstrate that the proposed method achieves better performance than state-of-the-art multi-modal methods. Our contributions are summarized as follows:
1. We propose an image voxel lifter module (IVLM) to lift 2D image features into 3D space and construct two homogeneous features for multi-modal fusion, which retains the original information of the image and point cloud.
2. We introduce the query fusion mechanism (QFM) to effectively fuse the two homogeneous representations of point cloud voxel features and image voxel features, enabling the fused voxels to adaptively perceive objects in a unified 3D space in each frame.
3. We propose a voxel feature interaction module (VFIM) to improve the consistency of identical objects' semantic information in the homogeneous point cloud and image voxel features, which guides the cross-modal feature fusion and greatly improves the detection performance.
4. Extensive experiments demonstrate the effectiveness of the proposed HMFI, which achieves competitive performance on the KITTI and Waymo Open Datasets. Notably, on the KITTI benchmark, HMFI surpasses all published competitive methods by a large margin in detecting cyclists.
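As referenced in the introduction, the following minimal sketch illustrates the object-level consistency idea behind VFIM: paired RoI features pooled from the point cloud and image voxel branches for the same proposals are pulled together by a cosine-similarity loss. The voxel RoI pooling itself is omitted, and the function name, pairing convention, and loss form are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def voxel_roi_consistency_loss(lidar_roi_feats: torch.Tensor,
                               image_roi_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity consistency over paired RoI features.

    Both tensors are assumed to be (num_proposals, feat_dim), where row i of each
    tensor was pooled from the same 3D proposal in the LiDAR and image voxel branches.
    The loss is 1 - cos(f_lidar, f_image), averaged over proposals, so identical
    object-level representations yield zero loss.
    """
    cos = F.cosine_similarity(lidar_roi_feats, image_roi_feats, dim=-1)  # (num_proposals,)
    return (1.0 - cos).mean()


# Example: 32 proposals with 128-dim pooled features from each branch.
loss = voxel_roi_consistency_loss(torch.randn(32, 128), torch.randn(32, 128))
```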
2 Related Works
2.1 LiDAR-based 3D Object Detection
Point-based methods: These methods [33,34,42,40] take the raw point cloud as input and employ stacked MLP layers to extract point features. PointRCNN [40] uses PointNets [33,34] as the point cloud encoder, generates proposals based on the extracted semantic and geometric features, and refines these coarse proposals via a 3D RoI pooling operation. Point-GNN [42] designs a graph neural network to detect 3D objects, encoding the point cloud in a fixed-radius near-neighbor graph. Since point clouds are unordered and large in number, point-based methods typically suffer from high computational costs.
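For illustration, the sketch below builds the kind of fixed-radius near-neighbor graph that Point-GNN-style detectors operate on, using a k-d tree; the radius value and the SciPy-based construction are assumptions, not Point-GNN's actual implementation.

```python
import numpy as np
from scipy.spatial import cKDTree


def build_radius_graph(points_xyz: np.ndarray, radius: float = 1.0) -> np.ndarray:
    """Return edges (i, j) for all point pairs within `radius` of each other.

    points_xyz: (N, 3) array of LiDAR point coordinates. The resulting edge set is
    the fixed-radius neighbor graph on which a GNN can aggregate per-point features.
    """
    tree = cKDTree(points_xyz)
    pairs = tree.query_pairs(r=radius, output_type="ndarray")  # (E, 2) unique pairs
    # Duplicate each pair in both directions so message passing is symmetric.
    return np.concatenate([pairs, pairs[:, ::-1]], axis=0)


edges = build_radius_graph(np.random.rand(1000, 3) * 50.0, radius=1.0)
```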
Voxel-based methods: These approaches [57,63,6,41,20,39,27] convert the point cloud into voxels and utilize voxel encoding layers to extract voxel features. SECOND [57] proposes a novel sparse convolution layer to replace the original computation-intensive 3D convolution. PointPillars [18] converts the point cloud into a pseudo-image and applies a 2D CNN to produce the final detection results. Some other works [6,39,20,41,26] follow [57] in utilizing 3D sparse convolutional operations to encode voxel features and obtain more accurate detection results in a coarse-to-fine two-stage manner. The more recent CT3D [38] designs a channel-wise transformer architecture to build a 3D object detection framework with minimal hand-crafted design.
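To make the voxelization step concrete, the following sketch groups raw LiDAR points into a regular voxel grid and mean-pools their coordinates as per-voxel features; the grid range, voxel size, and mean pooling are simplified assumptions rather than any specific detector's pipeline.

```python
import numpy as np


def voxelize_mean(points, voxel_size=(0.2, 0.2, 0.2),
                  pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Group LiDAR points into a regular voxel grid and mean-pool xyz per voxel.

    points: (N, 3) array; pc_range = (xmin, ymin, zmin, xmax, ymax, zmax).
    Returns integer voxel grid indices (V, 3) and per-voxel mean coordinates (V, 3).
    """
    low, high = np.array(pc_range[:3]), np.array(pc_range[3:])
    pts = points[np.all((points >= low) & (points < high), axis=1)]
    coords = ((pts - low) / np.array(voxel_size)).astype(np.int64)    # grid index of each point
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)  # one row per occupied voxel
    inverse = inverse.reshape(-1)                                     # guard against shape quirks across NumPy versions
    feats = np.zeros((len(voxels), 3))
    np.add.at(feats, inverse, pts)                                    # sum the points falling in each voxel
    counts = np.bincount(inverse, minlength=len(voxels)).reshape(-1, 1)
    return voxels, feats / counts


# Example on random points inside the range.
coords, feats = voxelize_mean(np.random.rand(5000, 3) * [70.0, 80.0, 4.0] + [0.0, -40.0, -3.0])
```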
2.2 Image-based 3D Object Detection
Many researchers have also investigated how to perform 3D detection from camera images [24,60,25,9,36]. Specifically, CaDDN [36] designs a Frustum Feature Network to project image information into 3D space. In contrast, we directly introduce depth bins through point cloud projection and use a non-parametric module to lift image features into 3D space. LIGA-Stereo [9] utilizes a LiDAR-based model to guide the training of a stereo-based 3D detection model and achieves state-of-the-art performance among stereo-based detectors.
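Since the depth bins mentioned above are obtained by projecting the point cloud into the image, the sketch below shows that projection under common KITTI-style assumptions: points are transformed into the camera frame with an extrinsic matrix, projected with the camera intrinsics, and their depths are uniformly quantized into bins. The function and matrix names (`lidar_depth_bins`, `lidar_to_cam`, `intrinsics`) and the uniform binning are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def lidar_depth_bins(points_xyz: np.ndarray,
                     lidar_to_cam: np.ndarray,   # (4, 4) extrinsic from calibration (assumed given)
                     intrinsics: np.ndarray,     # (3, 3) camera matrix K
                     num_bins: int = 80,
                     max_depth: float = 80.0):
    """Project LiDAR points onto the image plane and quantize their depths into bins.

    Returns pixel coordinates (M, 2) and integer bin indices (M,) for points in front
    of the camera; uniform binning over [0, max_depth] is assumed.
    """
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # (N, 4) homogeneous coordinates
    cam = (lidar_to_cam @ homo.T).T[:, :3]                         # points in the camera frame
    cam = cam[cam[:, 2] > 0.1]                                     # keep points in front of the camera
    uv = (intrinsics @ cam.T).T                                    # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                                    # normalize by depth to get pixel coords
    bins = np.clip((cam[:, 2] / max_depth * num_bins).astype(np.int64), 0, num_bins - 1)
    return uv, bins
```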