Homogeneous Multi-modal Feature Fusion and
Interaction for 3D Object Detection
Xin Li1, Botian Shi2, Yuenan Hou2, Xingjiao Wu1,3, Tianlong Ma1,3,
Yikang Li2†, and Liang He1†
1East China Normal University 2Shanghai AI Lab 3Fudan University
{sankin0528, wuxingjiao2885}@gmail.com
{shibotian, houyuenan, liyikang}@pjlab.org.cn {tlma, lhe}@cs.ecnu.edu.cn
† Corresponding authors
Abstract. Multi-modal 3D object detection has been an active research
topic in autonomous driving. Nevertheless, it is non-trivial to explore the
cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud
features that are projected onto the 2D image plane or combine the sparse
point cloud with dense image pixels. These fusion approaches often suf-
fer from severe information loss, thus causing sub-optimal performance.
To address these problems, we construct a homogeneous structure be-
tween the point cloud and images to avoid projective information loss by
transforming the camera features into the LiDAR 3D space. In this paper,
we propose a homogeneous multi-modal feature fusion and interaction
method (HMFI) for 3D object detection. Specifically, we first design an
image voxel lifter module (IVLM) to lift 2D image features into the 3D
space and generate homogeneous image voxel features. Then, we fuse
the voxelized point cloud features with the image features from different
regions by introducing the self-attention based query fusion mechanism
(QFM). Next, we propose a voxel feature interaction module (VFIM) to
enforce the consistency of semantic information from identical objects in
the homogeneous point cloud and image voxel representations, which can
provide object-level alignment guidance for cross-modal feature fusion
and strengthen the discriminative ability in complex backgrounds. We
conduct extensive experiments on the KITTI and Waymo Open Datasets,
and the proposed HMFI achieves better performance than state-of-the-art
multi-modal methods. In particular, for 3D cyclist detection on the KITTI
benchmark, HMFI surpasses all published algorithms by a large margin.
Keywords: 3D object detection, multi-modal, feature-level fusion, self-attention
1 Introduction
3D object detection is an important task that aims to precisely localize and classify each object in 3D space, thus allowing vehicles to perceive and understand their surrounding environment comprehensively.
[Figure 1 graphic: (a) schematic of RoI fusion, point/voxel fusion, and homogeneous fusion (ours); (b) 3D car mean AP (%) vs. inference time (ms) for AVOD, MVXNet, PI-RCNN, MMF, 3D-CVF, EPNet, and ours.]
Fig. 1: (a) Schematic comparison between different feature-level fusion based
methods. (b) Quantitative comparison with competitive multi-modal feature-
level fusion methods. Our method achieves a good performance-efficiency trade-off for the car category (mean AP over all difficulty levels) on the KITTI [7] benchmark.
So far, various LiDAR-based and image-based 3D detection approaches [33,34,36,24,40,18,41,39,9,6,26] have been proposed.
LiDAR-based methods can achieve superior performance over image-based approaches because point clouds contain precise spatial information. However, LiDAR points are usually sparse and lack color and texture information. Image-based approaches, in contrast, are better at capturing semantic information but suffer from the absence of depth measurements. Therefore, multi-modal 3D object detection is a promising direction that can fully utilize the complementary information of images and point clouds.
Recent multi-modal approaches can be generally categorized into two types: decision-level fusion and feature-level fusion. Decision-level fusion methods ensemble the objects detected separately in each modality, so their performance is bounded by each individual detector [29]. Feature-level fusion is more prevalent because it fuses the rich, informative features of the two modalities. Three representative feature-level fusion schemes are depicted in Fig. 1(a). The first fuses multi-modal features at regions of interest (RoI). However, such methods suffer severe spatial information loss when projecting 3D points onto the bird's-eye view (BEV) or front view (FV) of the 2D plane, even though 3D information plays a key role in accurate 3D object localization. Another line of work conducts fusion at the point/voxel level [43,49,55,21,22,50,14,59], which achieves complementary fusion at a much finer granularity by combining low-level multi-modal features at 3D points or 2D pixels. However, these methods can only establish a relatively coarse correspondence between the point/voxel features and image features. Moreover, both schemes usually suffer from severe information loss due to the mismatched projection between dense 2D image pixels and sparse 3D LiDAR points.
To address the aforementioned problems, we propose a homogeneous fusion scheme that lifts image features from the 2D plane into a dense 3D voxel structure. Based on this scheme, we propose the Homogeneous Multi-modal Feature Fusion and Interaction method (HMFI), which exploits the complementary information in multi-modal features and alleviates the severe information loss caused by dimension-reducing projection. Furthermore, we build a cross-modal feature interaction between the point cloud features and image features at the object level, based on the homogeneous 3D structure, to strengthen the model's ability to fuse image semantic information with the point cloud.
Specifically, we first design an image voxel lifter module (IVLM) that lifts 2D image features into 3D space, guided by the point cloud as a depth hint, and constructs a homogeneous voxel structure of the 2D image for multi-modal feature fusion, so that fusing the two modalities does not cause information loss. We also observe that the homogeneous voxel structure of the cross-modal data facilitates feature fusion and interaction. Thus, we introduce the query fusion mechanism (QFM), a self-attention-based operation that adaptively combines point cloud and image features. Each point cloud voxel queries all image voxels to achieve homogeneous feature fusion, and the result is combined with the original point cloud voxel features to form the joint camera-LiDAR features. QFM enables each point cloud voxel to adaptively perceive image features in the common 3D space and fuse the two homogeneous representations effectively.
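To make the query fusion step concrete, the sketch below gives one plausible, single-head reading of QFM: each non-empty point cloud voxel feature serves as a query over all image voxel features of the same frame, and the attended image context is concatenated with the original LiDAR voxel feature to form a joint camera-LiDAR feature. The tensor shapes, the single attention head, and the `fuse` projection are illustrative assumptions, not the exact HMFI implementation.

```python
import torch
import torch.nn as nn


class QueryFusionSketch(nn.Module):
    """Illustrative single-head query fusion: LiDAR voxels attend to image voxels.

    Assumed shapes for one frame: lidar_vox (N, C), image_vox (M, C), where N and M
    are the numbers of non-empty LiDAR/image voxels and C is the channel dimension.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        # Project the concatenated [LiDAR, attended image] feature back to C channels.
        self.fuse = nn.Linear(2 * channels, channels)

    def forward(self, lidar_vox: torch.Tensor, image_vox: torch.Tensor) -> torch.Tensor:
        q = self.q(lidar_vox)                                         # (N, C)
        k = self.k(image_vox)                                         # (M, C)
        v = self.v(image_vox)                                         # (M, C)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (N, M) attention weights
        attended = attn @ v                                           # image context per LiDAR voxel, (N, C)
        return self.fuse(torch.cat([lidar_vox, attended], dim=-1))    # joint camera-LiDAR feature, (N, C)


# Usage with random features for a single frame.
fusion = QueryFusionSketch(channels=64)
joint = fusion(torch.randn(500, 64), torch.randn(800, 64))  # -> (500, 64)
```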
Besides, instead of refining features with RoI-based pooling that fuses low-level LiDAR and camera features with the joint camera-LiDAR features, we build a feature interaction between the homogeneous point cloud and image voxel features. We consider that, although the point cloud and image representations belong to different modalities, their object-level semantic properties should be similar in the homogeneous structure. Therefore, to strengthen the abstract representations of the point cloud and images in a shared 3D space and exploit the similarity of identical objects' properties across the two modalities, we propose a voxel feature interaction module (VFIM) at the object level to improve the consistency of the point cloud and image homogeneous representations within 3D RoIs. To be specific, we use voxel RoI pooling [6] to extract features from the two homogeneous representations according to the predicted proposals and produce a paired RoI feature set. We then apply a cosine similarity loss [5] between each pair of RoI features to enforce the consistency of object-level properties in the point cloud and images (a minimal illustrative sketch of this consistency loss is given after the contribution list below). Building the feature interaction on these homogeneous paired RoI features improves the object-level semantic consistency between the two homogeneous representations and enhances the model's ability to achieve cross-modal feature fusion. Extensive experiments conducted on the KITTI and Waymo Open Datasets demonstrate that the proposed method achieves better performance than state-of-the-art multi-modal methods. Our contributions are summarized as follows:
1. We propose an image voxel lifter module (IVLM) to lift 2D image features into 3D space and construct two homogeneous features for multi-modal fusion, which retains the original information of the image and point cloud.
2. We introduce the query fusion mechanism (QFM) to effectively fuse the two homogeneous representations of point cloud voxel features and image voxel features, enabling the fused voxels to adaptively perceive objects in a unified 3D space in each frame.
3. We propose a voxel feature interaction module (VFIM) to improve the consistency of identical objects' semantic information in the homogeneous point cloud and image voxel features, which guides the cross-modal feature fusion and greatly improves the detection performance.
4. Extensive experiments demonstrate the effectiveness of the proposed HMFI, which achieves competitive performance on the KITTI and Waymo Open Datasets. Notably, on the KITTI benchmark, HMFI surpasses all published competitive methods by a large margin in detecting cyclists.
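As referenced in the introduction, the following minimal sketch illustrates the object-level consistency idea behind VFIM: paired RoI features pooled from the point cloud and image voxel branches for the same proposals are pulled together by a cosine-similarity loss. The voxel RoI pooling itself is omitted, and the function name, pairing convention, and loss form are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def voxel_roi_consistency_loss(lidar_roi_feats: torch.Tensor,
                               image_roi_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity consistency over paired RoI features.

    Both tensors are assumed to be (num_proposals, feat_dim), where row i of each
    tensor was pooled from the same 3D proposal in the LiDAR and image voxel branches.
    The loss is 1 - cos(f_lidar, f_image), averaged over proposals, so identical
    object-level representations yield zero loss.
    """
    cos = F.cosine_similarity(lidar_roi_feats, image_roi_feats, dim=-1)  # (num_proposals,)
    return (1.0 - cos).mean()


# Example: 32 proposals with 128-dim pooled features from each branch.
loss = voxel_roi_consistency_loss(torch.randn(32, 128), torch.randn(32, 128))
```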
2 Related Works
2.1 LiDAR-based 3D Object Detection
Point-based methods: These methods [33,34,42,40] take the raw point cloud as input and employ stacked MLP layers to extract point features. PointRCNN [40] uses PointNets [33,34] as the point cloud encoder, generates proposals based on the extracted semantic and geometric features, and refines these coarse proposals via a 3D RoI pooling operation. Point-GNN [42] designs a graph neural network to detect 3D objects, encoding the point cloud in a fixed-radius near-neighbor graph. Since point clouds are unordered and large in number, point-based methods typically suffer from high computational costs.
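For illustration, the sketch below builds the kind of fixed-radius near-neighbor graph that Point-GNN-style detectors operate on, using a k-d tree; the radius value and the SciPy-based construction are assumptions, not Point-GNN's actual implementation.

```python
import numpy as np
from scipy.spatial import cKDTree


def build_radius_graph(points_xyz: np.ndarray, radius: float = 1.0) -> np.ndarray:
    """Return edges (i, j) for all point pairs within `radius` of each other.

    points_xyz: (N, 3) array of LiDAR point coordinates. The resulting edge set is
    the fixed-radius neighbor graph on which a GNN can aggregate per-point features.
    """
    tree = cKDTree(points_xyz)
    pairs = tree.query_pairs(r=radius, output_type="ndarray")  # (E, 2) unique pairs
    # Duplicate each pair in both directions so message passing is symmetric.
    return np.concatenate([pairs, pairs[:, ::-1]], axis=0)


edges = build_radius_graph(np.random.rand(1000, 3) * 50.0, radius=1.0)
```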
Voxel-based methods: These approaches [57,63,6,41,20,39,27] convert the point cloud into voxels and utilize voxel encoding layers to extract voxel features. SECOND [57] proposes a novel sparse convolution layer to replace the original computation-intensive 3D convolution. PointPillars [18] converts the point cloud into a pseudo-image and applies a 2D CNN to produce the final detection results. Some other works [6,39,20,41,26] follow [57] in utilizing 3D sparse convolutional operations to encode voxel features and obtain more accurate detection results in a coarse-to-fine two-stage manner. The more recent CT3D [38] designs a channel-wise transformer architecture to build a 3D object detection framework with minimal hand-crafted design.
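To make the voxelization step concrete, the following sketch groups raw LiDAR points into a regular voxel grid and mean-pools their coordinates as per-voxel features; the grid range, voxel size, and mean pooling are simplified assumptions rather than any specific detector's pipeline.

```python
import numpy as np


def voxelize_mean(points, voxel_size=(0.2, 0.2, 0.2),
                  pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Group LiDAR points into a regular voxel grid and mean-pool xyz per voxel.

    points: (N, 3) array; pc_range = (xmin, ymin, zmin, xmax, ymax, zmax).
    Returns integer voxel grid indices (V, 3) and per-voxel mean coordinates (V, 3).
    """
    low, high = np.array(pc_range[:3]), np.array(pc_range[3:])
    pts = points[np.all((points >= low) & (points < high), axis=1)]
    coords = ((pts - low) / np.array(voxel_size)).astype(np.int64)    # grid index of each point
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)  # one row per occupied voxel
    inverse = inverse.reshape(-1)                                     # guard against shape quirks across NumPy versions
    feats = np.zeros((len(voxels), 3))
    np.add.at(feats, inverse, pts)                                    # sum the points falling in each voxel
    counts = np.bincount(inverse, minlength=len(voxels)).reshape(-1, 1)
    return voxels, feats / counts


# Example on random points inside the range.
coords, feats = voxelize_mean(np.random.rand(5000, 3) * [70.0, 80.0, 4.0] + [0.0, -40.0, -3.0])
```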
2.2 Image-based 3D Object Detection
Many researchers have also investigated how to perform 3D detection from camera images [24,60,25,9,36]. Specifically, CaDDN [36] designs a Frustum Feature Network to project image information into 3D space. In contrast, we directly introduce depth bins through point cloud projection and use a non-parametric module to lift image features into 3D space. LIGA-Stereo [9] utilizes a LiDAR-based model to guide the training of a stereo-based 3D detection model and achieves state-of-the-art performance among stereo-based detectors.
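Since the depth bins mentioned above are obtained by projecting the point cloud into the image, the sketch below shows that projection under common KITTI-style assumptions: points are transformed into the camera frame with an extrinsic matrix, projected with the camera intrinsics, and their depths are uniformly quantized into bins. The function and matrix names (`lidar_depth_bins`, `lidar_to_cam`, `intrinsics`) and the uniform binning are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def lidar_depth_bins(points_xyz: np.ndarray,
                     lidar_to_cam: np.ndarray,   # (4, 4) extrinsic from calibration (assumed given)
                     intrinsics: np.ndarray,     # (3, 3) camera matrix K
                     num_bins: int = 80,
                     max_depth: float = 80.0):
    """Project LiDAR points onto the image plane and quantize their depths into bins.

    Returns pixel coordinates (M, 2) and integer bin indices (M,) for points in front
    of the camera; uniform binning over [0, max_depth] is assumed.
    """
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # (N, 4) homogeneous coordinates
    cam = (lidar_to_cam @ homo.T).T[:, :3]                         # points in the camera frame
    cam = cam[cam[:, 2] > 0.1]                                     # keep points in front of the camera
    uv = (intrinsics @ cam.T).T                                    # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                                    # normalize by depth to get pixel coords
    bins = np.clip((cam[:, 2] / max_depth * num_bins).astype(np.int64), 0, num_bins - 1)
    return uv, bins
```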