CAGroup3D: Class-Aware Grouping for 3D Object
Detection on Point Clouds
Haiyang Wang1,6,7
, Lihe Ding2
, Shaocong Dong2, Shaoshuai Shi3
,
Aoxue Li4, Jianan Li2, Zhenguo Li4, Liwei Wang1,5
1Center for Data Science, Peking University 2Beijing Institute of Technology
3Max Planck Institute for Informatics 4Huawei Noah’s Ark Lab, China
5Key Laboratory of Machine Perception, MOE, School of Intelligence Science and Technology,
Peking University 6Peng Cheng Laboratory 7Pazhou Laboratory (Huangpu)
{wanghaiyang@stu, wanglw@cls}.pku.edu.cn, {dean.dinglihe, shaocong}@bit.edu.cn
sshi@mpi-inf.mpg.de, lijianan15@gmail.com {liaoxue2, Li.Zhenguo}@huawei.com
Abstract
We present a novel two-stage fully sparse convolutional 3D object detection frame-
work, named CAGroup3D. Our method first generates high-quality 3D proposals
by leveraging a class-aware local grouping strategy on object surface voxels with
the same semantic predictions, which accounts for the semantic consistency and
diverse locality abandoned in previous bottom-up approaches. Then, to recover the
features of voxels missed due to incorrect voxel-wise segmentation, we build a fully
sparse convolutional RoI pooling module that directly aggregates fine-grained
spatial information from the backbone for further proposal refinement. It is
memory- and computation-efficient and better encodes the geometry-specific
features of each 3D proposal. Our model achieves state-of-the-art 3D detection
performance with remarkable gains of +3.6% on ScanNet V2 and +2.6% on
SUN RGB-D in terms of mAP@0.25. Code will be available at
https://github.com/Haiyang-W/CAGroup3D.
1 Introduction
As a crucial step towards understanding 3D visual world, 3D object detection aims to estimate the
oriented 3D bounding boxes and semantic labels of objects in real 3D scenes. It has been studied for
a long time in both academia and industry since it benefits various downstream applications, such as
autonomous driving [2, 36], robotics [54, 37] and augmented reality [1, 3]. In this paper, we focus on
detecting 3D objects from unordered, sparse and irregular point clouds. These natural characteristics
make it more challenging to directly extend well-studied 2D techniques to 3D detection.
Unlike 3D object detection in autonomous driving scenarios, which considers only bird's eye view
(BEV) boxes [30, 49, 19, 11, 32, 31], most existing 3D indoor object detectors [27, 38, 22, 5, 51]
typically handle this task in a bottom-up scheme: they extract point-wise features from the
input point cloud and then group the points into their respective instances to generate a set of
proposals. However, these grouping algorithms are usually carried out in a class-agnostic manner,
which abandons semantic consistency within the same group and also ignores the diverse locality among
different categories. For example, VoteNet [27] learns point-wise center offsets and aggregates the
points that vote to the same semantic-irrelevant local region. Though impressive, as shown in Figure
1, these methods may fail in cluttered indoor scenes where objects are close together but belong to
different categories. Moreover, since object sizes vary across categories, a class-agnostic
*Equal contribution.
†Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.04264v1 [cs.CV] 9 Oct 2022
Figure 1: Class-agnostic grouping methods suffer from (a) mis-grouping of different categories within
the same local regions, and (b) partial or over-coverage of the object surfaces and outliers from the cluttered scene.
local grouping may only partially cover the boundary points of large objects and involve more noisy
outliers for small objects.
Hence, we propose CAGroup3D, a two-stage fully convolutional 3D object detection framework.
Our method consists of two novel components. The first is a class-aware 3D proposal generation
module, which generates reliable proposals by applying a class-specific local grouping strategy to
object surface voxels that share the same semantic prediction. The second is an efficient fully sparse
convolutional RoI pooling module that recovers the features of surface voxels missed due to
semantic segmentation errors, so as to improve the quality of the predicted boxes.
Specifically, a backbone network with 3D sparse convolutions is first utilized to extract descriptive
voxel-wise features from the raw point cloud. Based on the learned features, we apply a class-aware
local grouping module to cluster surface voxels into their corresponding instance centroids. Different
from [27], in order to enforce semantic consistency, we not only shift voxels of the same
instance towards a common centroid but also predict per-voxel semantic scores. Given the contiguously
distributed vote points and their semantic predictions, we voxelize them according to the
predicted semantic categories and vote coordinates, generating class-specific 3D voxels for
each category. The voxel size of each category is adaptive to its average spatial dimension.
To maintain a fully convolutional structure, we apply sparse convolution as the grouping operation,
centered on each voted voxel, to aggregate adjacent voxel features in the same semantic space. Note
that these grouping layers are class-dependent but share the same kernel size, so larger classes
are aggregated over larger local regions.
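As a rough sketch of the class-aware voxelization step (the helper name and dictionary-based output are hypothetical; the actual module operates on sparse GPU tensors and feeds the result to sparse-convolution grouping layers), the vote points could be binned per predicted semantic class with class-specific voxel sizes:

```python
import numpy as np

def class_aware_voxelize(votes, sem_labels, class_voxel_sizes):
    """Voxelize vote points separately per predicted semantic class.

    votes:             (N, 3) predicted instance-centre coordinates
    sem_labels:        (N,)   per-vote semantic class ids
    class_voxel_sizes: dict mapping class id -> voxel edge length,
                       proportional to that class's average object size
    Returns {class id: (unique voxel indices, per-voxel lists of vote
    ids)}, i.e. the class-specific vote voxels that the sparse grouping
    layers would then convolve over.
    """
    out = {}
    for cls, vsize in class_voxel_sizes.items():
        mask = sem_labels == cls
        if not mask.any():
            continue
        # Quantize votes of this class with its own voxel size.
        idx = np.floor(votes[mask] / vsize).astype(np.int64)
        uniq, inv = np.unique(idx, axis=0, return_inverse=True)
        inv = inv.ravel()
        ids = np.flatnonzero(mask)
        # Collect the original vote ids falling into each occupied voxel.
        groups = [ids[inv == k] for k in range(len(uniq))]
        out[cls] = (uniq, groups)
    return out
```

Because each class uses its own voxel size, a large class (e.g. a table) is binned coarsely and a small class (e.g. a chair) finely, which is what lets same-kernel-size grouping layers cover class-appropriate local regions.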
Secondly, given the proposal candidates, fine-grained features within each 3D proposal need to
be revisited from the 3D backbone through a pooling operation for the subsequent box refinement.
However, state-of-the-art pooling strategies [31, 9] are memory- and computation-intensive due to the
hand-crafted set abstraction [25]. Besides, their max-pooling operation also harms the geometric
distribution. To tackle this problem, we propose the RoI-Conv pooling module, which directly adopts
well-optimized 3D sparse convolutions to aggregate voxel features from the backbone. It encodes
effective geometric representations with a memory-efficient design for further proposal refinement.
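A minimal dense sketch of the RoI-guided sampling idea follows (the function name, mean aggregation, and axis-aligned boxes are simplifying assumptions; the actual module handles oriented proposals and replaces the mean with learned sparse convolutions):

```python
import numpy as np

def roi_guided_sample(voxel_xyz, voxel_feat, box, grid=3):
    """Gather backbone voxels that fall inside a proposal box and scatter
    their features onto a small grid x grid x grid local volume -- the
    kind of input a sparse "RoI-Conv" layer would then convolve.

    voxel_xyz:  (N, 3) voxel centres from the backbone
    voxel_feat: (N, C) voxel features
    box:        (6,)   axis-aligned proposal [cx, cy, cz, dx, dy, dz]
    """
    ctr, size = box[:3], box[3:6]
    lo, hi = ctr - size / 2, ctr + size / 2
    # RoI-guided sampling: keep only voxels inside the proposal.
    inside = np.all((voxel_xyz >= lo) & (voxel_xyz < hi), axis=1)
    vol = np.zeros((grid, grid, grid, voxel_feat.shape[1]))
    cnt = np.zeros((grid, grid, grid, 1))
    # Map each kept voxel to a cell of the local RoI grid.
    cell = np.clip(((voxel_xyz[inside] - lo) / size * grid).astype(int),
                   0, grid - 1)
    for c, f in zip(cell, voxel_feat[inside]):
        vol[tuple(c)] += f
        cnt[tuple(c)] += 1
    return vol / np.maximum(cnt, 1)  # mean stands in for learned conv
```

Unlike set-abstraction pooling, nothing here requires per-point ball queries or max-pooling: the gathered voxels keep their relative spatial layout inside the RoI grid, which is what a convolution can exploit.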
In summary, our contributions are three-fold: 1) We propose a novel class-aware 3D proposal
generation strategy, which considers both the voxel-wise semantic consistency within the same local
group and the object-level shape diversity among different categories. 2) We present the RoI-Conv
pooling module, an efficient fully convolutional 3D pooling operation for revisiting voxel features
directly from the backbone to refine 3D proposals. 3) Our approach outperforms state-of-the-art methods
with remarkable gains on two challenging indoor datasets, i.e., ScanNet V2 [7] and SUN RGB-D [33],
demonstrating its effectiveness and generality.
2 Related Work
3D Object Detection on Point Clouds.
Detecting 3D objects from point clouds is challenging due to their orderless, sparse and irregular
characteristics. Previous approaches can be coarsely classified into two lines in terms of point
representations, i.e., voxel-based methods [53, 45, 32, 44, 31, 49] and point-based methods
[27, 38, 5, 22, 51, 46]. Voxel-based methods are mainly applied in outdoor autonomous driving
scenarios where objects are distributed on a large-scale 2D ground plane. They process the sparse
point clouds with efficient 3D sparse convolutions, then project the 3D volumes to 2D grids for
detecting bird's eye view (BEV) boxes with a 2D ConvNet. Powered by the PointNet series [25, 29],
point-based methods are also widely used to predict 3D bounding boxes. Most existing methods work
in a bottom-up manner, extracting point-wise features and grouping them to obtain object features.
This pipeline has been a great success for estimating 3D boxes directly from cluttered and dense
3D scenes. However, due to the hand-crafted point sampling and computation-intensive grouping
scheme of PointNet++ [29], these methods are difficult to extend to large-scale point clouds.
Hence, we propose a fully convolutional bottom-up framework to efficiently detect 3D boxes
directly from dense 3D point clouds.
Feature Grouping.
Feature grouping is a crucial step in bottom-up 3D object detectors [27, 38, 22, 5, 51, 35], which
cluster groups of point-wise features to generate high-quality 3D bounding boxes. Among the
numerous successors, the voting-based framework [27] is widely used; it groups the points that vote
to the same local region. Though impressive, it does not consider semantic consistency and thus may
fail in cluttered indoor scenes where objects of different classes are distributed closely. Moreover,
voting-based methods usually adopt a class-agnostic local region for all objects, which may
incorrectly group the boundary points of large objects and involve more noise points for small
objects. To address the above limitations, we present a class-aware local grouping strategy that
aggregates points of the same category within class-specific center regions.
Two-stage 3D Object Detection.
Many state-of-the-art methods apply RCNN-style 2D detectors to 3D scenes, using a 3D RoI-pooling
scheme or its variants [30, 32, 9, 31, 47, 43] to aggregate the features within 3D proposals for box
refinement in a second stage. These pooling algorithms are usually equipped with set abstraction [25]
to encode local spatial features, which consists of a hand-crafted query operation (e.g., ball query
[25] or vector query [9]) to capture the local points and a max-pooling operation to aggregate the
assigned features. These RoI pooling modules are therefore mostly computationally expensive.
Moreover, the max-pooling operation also harms the spatial distribution information. To tackle these
problems, we propose RoI-Conv pooling, a memory- and computation-efficient fully convolutional RoI
pooling operation that aggregates proposal-specific features for the subsequent refinement.
3 Methodology
In this paper, we propose CAGroup3D, a two-stage fully convolutional 3D object detection framework
for estimating accurate 3D bounding boxes from point clouds. The overall architecture of CAGroup3D
is depicted in Figure 2. Our framework consists of three major components: an efficient 3D voxel
CNN with sparse convolutions as the backbone network for point cloud feature learning (§3.1);
a class-aware 3D proposal generation module that predicts high-quality 3D proposals by aggregating
voxel features of the same category within class-specific local regions (§3.2); and an RoI-Conv
pooling module that extracts complete and fine-grained voxel features directly from the backbone to
revisit mis-segmented surface voxels and refine the 3D proposals. Finally, we formulate the learning
objective of our framework in §A.4.
3.1 3D Voxel CNN for Point Cloud Feature Learning
To generate accurate 3D proposals, we first need to learn discriminative geometric representations
of the input point cloud. Voxel CNNs with 3D sparse convolutions [32, 44, 53, 13, 12] are widely
used by state-of-the-art 3D detectors thanks to their high efficiency and the scalability of
converting point clouds to regular 3D volumes. In this paper, we adopt a sparse-convolution-based
backbone for feature encoding and 3D proposal generation.
A 3D backbone network with high-resolution feature maps and large receptive fields is critical for
accurate 3D bounding box estimation and voxel-wise semantic segmentation; the latter is closely tied
to the accuracy of the subsequent grouping module. To maintain these two characteristics, inspired
by the success of the HRNet series [39, 16, 34] in the segmentation community, we implement a 3D
voxel bilateral network with dual resolution based on ResNet [15]. For brevity, we refer to it as
BiResNet. As shown in Figure 2, our backbone network contains two branches. One is a sparse
modification of
Figure 2: The overall architecture of CAGroup3D. (a) 3D proposals are generated by class-aware
local grouping in the vote space with the same semantic predictions. (b) Features within the 3D
proposals are aggregated by the efficient RoI-Conv pooling module for the subsequent box refinement.
ResNet18 [15] in which all 2D convolutions are replaced with 3D sparse convolutions; it extracts
multi-scale contextual information with appropriate downsampling modules. The other is an auxiliary
branch that maintains a high-resolution feature map at 1/2 the resolution of the input 3D voxels.
Specifically, the auxiliary branch is inserted after the first stage of the ResNet backbone and
contains no downsampling operations. Similar to [39], we adopt a bridge operation between the two
paths to perform bilateral feature fusion. Finally, fine-grained voxel-wise geometric features with
rich contextual information are generated by the high-resolution branch and feed the following
modules. Experiments also demonstrate that our voxel backbone performs better than the previous
FPN-based ResNet [20]. More architecture details are in the Appendix.
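The bridge between the two branches can be sketched as follows (a minimal sketch assuming nearest-neighbour upsampling and additive fusion; the real network uses sparse voxel tensors and learned convolutions in the bridge):

```python
import numpy as np

def bilateral_fuse(hi_feat, lo_feat):
    """Fuse the high-resolution branch (stride 2 of the input voxels)
    with the downsampled context branch by upsampling + addition,
    mirroring the "bridge" between the two BiResNet paths.
    Dense arrays of shape (D, H, W, C) stand in for sparse tensors."""
    k = hi_feat.shape[0] // lo_feat.shape[0]  # spatial ratio of branches
    # Nearest-neighbour upsample of the low-resolution context features.
    up = lo_feat.repeat(k, 0).repeat(k, 1).repeat(k, 2)
    return hi_feat + up  # high-res geometry enriched with context
```

The design choice this illustrates: the high-resolution path preserves fine voxel geometry for segmentation, while the ResNet path contributes a large receptive field; addition after upsampling lets both survive into the fused feature map.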
3.2 Class-Aware 3D Proposal Generation
Given the voxel-wise geometric features produced by the backbone network, a bottom-up grouping
algorithm is generally adopted to aggregate object surface voxels into their respective ground-truth
instances and generate reliable 3D proposals. The voting-based grouping method [27], which is
performed in a class-agnostic manner, has shown great success in 3D object detection. It
reformulates Hough voting to learn point-wise center offsets, and then generates object candidates
by clustering the points that vote to similar center regions. However, this method may incorrectly
group outliers in cluttered indoor scenarios (e.g., votes that are close together but belong to
different categories), which degrades 3D detection performance. Moreover, due to the diverse object
sizes across categories, class-agnostic local regions may mis-group the boundary points of large
objects and involve more noise points for small objects.
To address these limitations, we propose the class-aware 3D proposal generation module, which first
produces voxel-wise predictions (e.g., semantic maps and geometric shifts), and then clusters
object surface voxels with the same semantic prediction using class-specific local groups.
Voxel-wise Semantic and Vote Prediction.
After obtaining the voxel features from the backbone network, two branches are constructed to output
the voxel-wise semantic scores and center offset vectors. Specifically, the backbone network
generates $N$ non-empty voxels $\{o_i\}_{i=1}^{N}$, where $o_i = [x_i; f_i]$ with coordinates
$x_i \in \mathbb{R}^3$ and features $f_i \in \mathbb{R}^C$. A voting branch encodes the voxel
feature $f_i$ to learn the spatial center offset $\Delta x_i \in \mathbb{R}^3$ and feature offset
$\Delta f_i \in \mathbb{R}^C$. Based on the learned spatial and feature offsets, we shift voxel
$o_i$ to the center of its respective instance and generate vote
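The voxel-wise prediction step can be sketched as follows (a simplified sketch: the offsets and semantic logits are passed in directly instead of being produced by the learned voting and semantic branches, so the example stays self-contained):

```python
import numpy as np

def vote(xyz, feat, d_xyz, d_feat, sem_logits):
    """Voxel-wise vote generation: shift each surface voxel
    o_i = [x_i; f_i] by its predicted spatial offset dx_i and feature
    offset df_i, and take the argmax semantic class per voxel.

    xyz: (N, 3) voxel centres, feat: (N, C) voxel features,
    d_xyz: (N, 3), d_feat: (N, C) predicted offsets,
    sem_logits: (N, K) semantic scores over K classes."""
    votes = xyz + d_xyz          # vote coordinates near instance centres
    vote_feat = feat + d_feat    # shifted vote features
    sem = sem_logits.argmax(-1)  # per-voxel semantic prediction
    return votes, vote_feat, sem
```

The (vote coordinate, semantic label) pairs returned here are exactly what the class-aware voxelization and grouping described in this section consume.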