two lines in terms of point representations, i.e., the voxel-based methods [53, 45, 32, 44, 31, 49] and the point-based methods [27, 38, 5, 22, 51, 46]. Voxel-based methods are mainly applied in outdoor autonomous driving scenarios, where objects are distributed on a large-scale 2D ground plane. They process the sparse point clouds with efficient 3D sparse convolution, then project the 3D volumes to 2D grids and detect bird's eye view (BEV) bboxes with a 2D ConvNet. Powered by the PointNet series [25, 29], point-based methods are also widely used to predict 3D bounding boxes. Most existing methods work in a bottom-up manner: they extract point-wise features and group them to obtain object features. This pipeline has been very successful at estimating 3D bboxes directly from cluttered and dense 3D scenes. However, due to the hand-crafted point sampling and computation-intensive grouping scheme used in PointNet++ [29], these methods are difficult to extend to large-scale point clouds. Hence, we propose a fully convolutional bottom-up framework that efficiently detects 3D bboxes directly from dense 3D point clouds.
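To make the scaling bottleneck concrete, below is a minimal PyTorch sketch (not the paper's code) of the two hand-crafted operations that PointNet++-style grouping relies on: farthest point sampling and ball query. Both scale with the product of the number of sampled centers and the number of raw points, which is what makes them costly on large-scale scenes.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """xyz: (N, 3). Returns indices of m points, each maximally far from
    the ones already chosen. Sequential: m passes over all N points."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = int(torch.randint(n, (1,)))
    for i in range(m):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)  # distance to newest center
        dist = torch.minimum(dist, d)                # distance to nearest chosen center
        farthest = int(dist.argmax())                # next: the most distant point
    return idx

def ball_query(xyz: torch.Tensor, centers: torch.Tensor,
               radius: float, k: int) -> torch.Tensor:
    """Gather up to k neighbor indices within `radius` of each center.
    Computes a full (M, N) distance matrix; real CUDA kernels avoid this
    memory cost but keep the same asymptotic work."""
    d2 = torch.cdist(centers, xyz) ** 2
    d2[d2 > radius ** 2] = float("inf")             # mask points outside the ball
    # Note: if a ball holds fewer than k points, the extra indices are
    # arbitrary; library implementations pad with the first valid neighbor.
    return d2.topk(k, dim=1, largest=False).indices
```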
Feature Grouping.
Feature grouping is a crucial step for bottom-up 3D object detectors [27, 38, 22, 5, 51, 35], which cluster groups of point-wise features to generate high-quality 3D bounding boxes. Among the numerous successors, the voting-based framework [27] is widely used; it groups the points that vote to the same local region. Though impressive, it does not consider semantic consistency and may therefore fail in cluttered indoor scenes where objects of different classes are distributed close to each other. Moreover, voting-based methods usually adopt a class-agnostic local region for all objects, which may incorrectly group the boundary points of large objects and involve more noise points for small objects. To address these limitations, we present a class-aware local grouping strategy that aggregates the points of the same category within class-specific center regions.
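As an illustration only (the interface and exact grouping rule below are our assumptions, not the paper's implementation), the following sketch captures how class-aware grouping differs from class-agnostic voting: a point joins a candidate's group only if it agrees on the predicted class, and the local region radius depends on that class.

```python
import torch

def class_aware_group(centers, center_cls, votes, point_cls, radii):
    """Hypothetical sketch of class-aware local grouping.
    centers:    (M, 3) candidate object centers
    center_cls: (M,)   predicted class id of each candidate
    votes:      (N, 3) per-point predicted (voted) object centers
    point_cls:  (N,)   predicted semantic class id per point
    radii:      (num_classes,) class-specific grouping radius
    Returns one tensor of point indices per candidate center."""
    groups = []
    for c, k in zip(centers, center_cls):
        same_cls = point_cls == k                    # semantic consistency check
        close = (votes - c).norm(dim=1) < radii[k]   # class-specific local region
        groups.append(torch.nonzero(same_cls & close).squeeze(1))
    return groups
```

A class-agnostic baseline would drop the `same_cls` mask and use a single shared radius, which is exactly what over-grows groups on small objects and clips them on large ones.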
Two-stage 3D Object Detection.
Many state-of-the-art methods apply RCNN-style 2D detectors to 3D scenes, using a 3D RoI-pooling scheme or its variants [30, 32, 9, 31, 47, 43] to aggregate proposal-specific features for box refinement in a second stage. These pooling algorithms are usually equipped with set abstraction [25] to encode local spatial features, which consists of a hand-crafted query operation (e.g., ball query [25] or vector query [9]) to capture the local points and a max-pooling operation to aggregate the assigned features. These RoI pooling modules are therefore mostly computationally expensive. Moreover, the max-pooling operation also discards spatial distribution information. To tackle these problems, we propose RoI-Conv pooling, a memory- and computation-efficient fully convolutional RoI pooling operation that aggregates proposal-specific features for the subsequent refinement.
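For reference, here is a stripped-down sketch (interface hypothetical, per-RoI for clarity) of the set-abstraction-style pooling criticized above. The final max over the neighborhood keeps only per-channel peaks, which is the loss of spatial distribution the text refers to.

```python
import torch

def ball_query_roi_pool(roi_center, radius, xyz, feats, k=16):
    """One RoI: gather up to k point features within `radius` of the RoI
    center, then max-pool them into a single vector."""
    d = (xyz - roi_center).norm(dim=1)              # distance to RoI center
    idx = torch.nonzero(d < radius).squeeze(1)[:k]  # hand-crafted ball query
    if idx.numel() == 0:                            # empty RoI -> zero feature
        return feats.new_zeros(feats.shape[1])
    return feats[idx].max(dim=0).values             # max-pool drops point layout
```

Any permutation or rearrangement of the k points inside the ball yields the same output vector, so the pooled feature cannot distinguish, say, points clustered at a box corner from points spread along a face.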
3 Methodology
In this paper, we propose CAGroup3D, a two-stage fully convolutional 3D object detection framework for estimating accurate 3D bounding boxes from point clouds. The overall architecture of CAGroup3D is depicted in Figure 2. Our framework consists of three major components: an efficient 3D voxel CNN with sparse convolution as the backbone network for point cloud feature learning (§3.1); a class-aware 3D proposal generation module that predicts high-quality 3D proposals by aggregating voxel features of the same category within class-specific local regions (§3.2); and a RoI-Conv pooling module that directly extracts complete and fine-grained voxel features from the backbone to revisit mis-segmented surface voxels and refine the 3D proposals. Finally, we formulate the learning objective of our framework in §A.4.
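Structurally, the pipeline can be summarized by the skeleton below; the module names are placeholders we introduce for illustration, not the paper's actual classes.

```python
import torch.nn as nn

class CAGroup3DSketch(nn.Module):
    """Illustrative two-stage pipeline matching the component list above."""
    def __init__(self, backbone, proposal_head, roi_pool, refine_head):
        super().__init__()
        self.backbone = backbone            # sparse-conv 3D voxel CNN (3.1)
        self.proposal_head = proposal_head  # class-aware proposal generation (3.2)
        self.roi_pool = roi_pool            # RoI-Conv pooling over backbone voxels
        self.refine_head = refine_head      # second-stage box refinement

    def forward(self, points):
        voxels = self.backbone(points)                # voxel features + semantics
        proposals = self.proposal_head(voxels)        # first-stage 3D boxes
        roi_feats = self.roi_pool(voxels, proposals)  # re-aggregate backbone voxels
        return self.refine_head(roi_feats, proposals) # refined 3D boxes
```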
3.1 3D Voxel CNN for Point Cloud Feature Learning
For generating accurate 3D proposals, we first need to learn a discriminative geometric representation of the input point clouds. Voxel CNNs with 3D sparse convolution [32, 44, 53, 13, 12] are widely used by state-of-the-art 3D detectors thanks to their high efficiency and the scalability of converting point clouds into regular 3D volumes. In this paper, we adopt a sparse-convolution-based backbone for feature encoding and 3D proposal generation.
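As a concrete (if simplified) example of the point-to-volume conversion this backbone relies on, the sketch below quantizes points to a grid and average-pools features per occupied voxel; production pipelines use dedicated sparse-convolution libraries rather than this plain-PyTorch version.

```python
import torch

def voxelize(points: torch.Tensor, voxel_size: float):
    """points: (N, 3 + C) with xyz first. Returns occupied voxel coordinates
    and the mean feature per voxel -- the regular, sparse 3D volume that
    sparse convolution then operates on."""
    coords = torch.floor(points[:, :3] / voxel_size).long()        # grid indices
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)   # occupied voxels
    feats = torch.zeros(uniq.shape[0], points.shape[1] - 3)
    feats.index_add_(0, inv, points[:, 3:])                        # sum per voxel
    counts = torch.zeros(uniq.shape[0]).index_add_(
        0, inv, torch.ones(points.shape[0]))
    return uniq, feats / counts.unsqueeze(1)                       # mean per voxel
```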
A 3D backbone network equipped with high-resolution feature maps and large receptive fields is critical for accurate 3D bounding box estimation and voxel-wise semantic segmentation; the latter is closely related to the accuracy of the succeeding grouping module. To maintain these two characteristics, inspired by the success of the HRNet series [39, 16, 34] in the segmentation community, we implement a 3D voxel bilateral network with dual resolution based on ResNet [15]. For brevity, we refer to it as BiResNet. As shown in Figure 2, our backbone network contains two branches. One is the sparse modification of