To address the aforementioned problems, we propose a homogeneous fusion
scheme that lifts image features from the 2D plane to a dense 3D voxel structure.
Within this scheme, we propose the Homogeneous Multi-modal
Feature Fusion and Interaction method (HMFI), which exploits the complemen-
tary information in multi-modal features and alleviates the severe information loss
caused by dimensionality-reducing mappings. Furthermore, we build a cross-
modal feature interaction between the point cloud features and the image features at
the object level, based on the homogeneous 3D structure, to strengthen the model’s
ability to fuse image semantic information with the point cloud.
Specifically, we first design an image voxel lifter module (IVLM) that lifts 2D
image features into 3D space and constructs a homogeneous voxel structure
of the 2D images for multi-modal feature fusion, guided by the point cloud
as a depth hint. This lifting causes no information loss when fusing the two
modalities; a minimal sketch of the operation is given below.
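The following PyTorch sketch illustrates one plausible reading of this lifting step: LiDAR points are projected onto the image plane, the 2D features at the hit pixels are gathered, and they are scatter-averaged into the same voxel grid the point cloud occupies. The function name `lift_image_features` and all shapes and arguments are illustrative assumptions, not the authors' implementation.

```python
import torch

def lift_image_features(img_feat, points, proj_mat, voxel_size, pc_range, grid_shape):
    """Hypothetical sketch of IVLM-style lifting (not the paper's code).

    img_feat:   (C, H, W) 2D image feature map
    points:     (N, 3) LiDAR points acting as the depth hint
    proj_mat:   (3, 4) camera projection matrix
    voxel_size: (3,) voxel edge lengths along (x, y, z)
    pc_range:   (6,) [x_min, y_min, z_min, x_max, y_max, z_max]
    grid_shape: (X, Y, Z) number of voxels per axis
    """
    C, H, W = img_feat.shape
    n = points.shape[0]
    # Project LiDAR points into pixel coordinates (homogeneous).
    hom = torch.cat([points, points.new_ones(n, 1)], dim=1)          # (N, 4)
    uvw = hom @ proj_mat.t()                                         # (N, 3)
    z = uvw[:, 2].clamp(min=1e-6)
    u = (uvw[:, 0] / z).long()
    v = (uvw[:, 1] / z).long()
    # Voxel index of every point in the shared 3D grid.
    idx = ((points - points.new_tensor(pc_range[:3])) /
           points.new_tensor(voxel_size)).long()                     # (N, 3)
    # Keep points that land inside both the image and the voxel grid.
    ok = ((u >= 0) & (u < W) & (v >= 0) & (v < H) & (uvw[:, 2] > 0) &
          (idx >= 0).all(dim=1) & (idx < idx.new_tensor(grid_shape)).all(dim=1))
    idx, u, v = idx[ok], u[ok], v[ok]
    flat = (idx[:, 0] * grid_shape[1] + idx[:, 1]) * grid_shape[2] + idx[:, 2]
    # Sample the image feature at each projected pixel and scatter-mean it
    # into the voxel grid, yielding a homogeneous image voxel structure.
    feats = img_feat[:, v, u].t()                                    # (M, C)
    vox = img_feat.new_zeros(grid_shape[0] * grid_shape[1] * grid_shape[2], C)
    cnt = img_feat.new_zeros(vox.shape[0], 1)
    vox.index_add_(0, flat, feats)
    cnt.index_add_(0, flat, feats.new_ones(len(feats), 1))
    return (vox / cnt.clamp(min=1.0)).view(*grid_shape, C)
```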
We also notice that the homogeneous voxel structure of cross-modal data
can help in feature fusion and interaction. Thus, we introduce the query fusion
mechanism (QFM), a self-attention based operation that
adaptively combines point cloud and image features, as sketched below. Each point cloud voxel
queries all image voxels to achieve homogeneous feature fusion, and the result is
combined with the original point cloud voxel features to form the joint camera-LiDAR features.
QFM enables each point cloud voxel to adaptively perceive image features in the common
3D space and to fuse these two homogeneous representations effectively.
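A minimal cross-attention sketch of QFM follows, under the assumption that the point cloud voxel features act as queries and the lifted image voxel features act as keys and values; the class name `QueryFusion`, the shared feature dimension, and the concatenation-based combination are hypothetical, not the paper's exact design.

```python
import torch
import torch.nn as nn

class QueryFusion(nn.Module):
    """Hypothetical QFM sketch: each point cloud voxel attends over the
    image voxels, and the attended image feature is combined with the
    original LiDAR voxel feature into a joint camera-LiDAR feature."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)        # queries from point cloud voxels
        self.k = nn.Linear(dim, dim)        # keys from image voxels
        self.v = nn.Linear(dim, dim)        # values from image voxels
        self.out = nn.Linear(2 * dim, dim)  # fuse attended + original

    def forward(self, pts_vox, img_vox):
        # pts_vox: (Np, C) point cloud voxel features
        # img_vox: (Ni, C) lifted image voxel features (same 3D structure)
        q = self.q(pts_vox)                                   # (Np, C)
        k, v = self.k(img_vox), self.v(img_vox)               # (Ni, C)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
        fused_img = attn @ v                                  # (Np, C)
        # Concatenate attended image features with the original point
        # cloud voxel features to form the joint camera-LiDAR features.
        return self.out(torch.cat([pts_vox, fused_img], dim=-1))
```

The dense Np × Ni attention map is shown only for clarity; a practical implementation would restrict the queries to non-empty voxels to keep memory tractable.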
Besides, rather than fusing low-level LiDAR and camera features with the
joint camera-LiDAR features through refinement in region-of-interest (RoI)
based pooling, we explore building a feature interaction between the homogeneous
point cloud and image voxel features. We consider that, although the point
cloud and image representations come from different modalities, their object-level
semantic properties should be similar in the homogeneous structure. Therefore, to
strengthen the abstract representation of the point cloud and images in a shared 3D
space and exploit the similarity of identical objects’ properties across the two modalities,
we propose a voxel feature interaction module (VFIM) at the object level to im-
prove the consistency of the point cloud and image homogeneous representations in
the 3D RoIs. To be specific, we use voxel RoI pooling [6] to extract features
from these two homogeneous representations according to the predicted proposals,
producing a paired RoI feature set. We then apply a cosine similarity loss [5]
between each pair of RoI features to enforce the consistency of object-level
properties in the point cloud and images, as sketched below. In VFIM, building the feature interaction
on these homogeneous paired RoI features improves the object-level semantic
consistency between the two homogeneous representations and enhances the model’s
ability to achieve cross-modal feature fusion.
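As a hedged illustration of the interaction objective, the consistency constraint can be written as a cosine similarity loss over the paired RoI features; the function name `vfim_consistency_loss` and the mean reduction are assumptions for the sketch.

```python
import torch
import torch.nn.functional as F

def vfim_consistency_loss(pts_roi: torch.Tensor, img_roi: torch.Tensor) -> torch.Tensor:
    """Hypothetical VFIM objective sketch: voxel RoI pooling yields one
    feature per proposal from the point cloud voxels (pts_roi) and one
    from the image voxels (img_roi), both of shape (R, C) for R proposals.
    A cosine similarity loss pulls each pair together, enforcing
    object-level consistency between the two homogeneous representations."""
    cos = F.cosine_similarity(pts_roi, img_roi, dim=-1)  # (R,)
    # Maximizing similarity is equivalent to minimizing (1 - cos) per pair.
    return (1.0 - cos).mean()
```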
Extensive experiments conducted
on the KITTI and Waymo Open Dataset benchmarks demonstrate that the proposed method
achieves better performance than state-of-the-art multi-modal
methods. Our contributions are summarized as follows:
1. We propose an image voxel lifter module (IVLM) to lift 2D image features
into 3D space and construct two homogeneous features for multi-modal
fusion, which retains the original information of both the image and the point cloud.