two lines in terms of point representations, i.e., the voxel-based methods [53, 45, 32, 44, 31, 49] and the point-based methods [27, 38, 5, 22, 51, 46]. Voxel-based methods are mainly applied in outdoor autonomous driving scenarios, where objects are distributed on a large-scale 2D ground plane. They process the sparse point clouds with efficient 3D sparse convolution, then project the 3D volumes to 2D grids and detect bird's eye view (BEV) bboxes with a 2D ConvNet. Powered by the PointNet series [25, 29], point-based methods are also widely used to predict 3D bounding boxes. Most existing methods work in a bottom-up manner: they extract point-wise features and group them to obtain object features. This pipeline has been very successful at estimating 3D bboxes directly from cluttered and dense 3D scenes. However, due to the hand-crafted point sampling and computation-intensive grouping scheme used in PointNet++ [29], these methods are difficult to extend to large-scale point clouds. Hence, we propose a fully convolutional bottom-up framework that efficiently detects 3D bboxes directly from dense 3D point clouds.
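To make the scaling bottleneck concrete, below is a minimal PyTorch sketch (not the paper's code) of the two hand-crafted operations that PointNet++-style grouping relies on: farthest point sampling and ball query. Both scale with the product of the number of sampled centers and the number of raw points, which is what makes them costly on large-scale scenes.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """xyz: (N, 3). Returns indices of m points, each maximally far from
    the ones already chosen. Sequential: m passes over all N points."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = int(torch.randint(n, (1,)))
    for i in range(m):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)  # distance to newest center
        dist = torch.minimum(dist, d)                # distance to nearest chosen center
        farthest = int(dist.argmax())                # next: the most distant point
    return idx

def ball_query(xyz: torch.Tensor, centers: torch.Tensor,
               radius: float, k: int) -> torch.Tensor:
    """Gather up to k neighbor indices within `radius` of each center.
    Computes a full (M, N) distance matrix; real CUDA kernels avoid this
    memory cost but keep the same asymptotic work."""
    d2 = torch.cdist(centers, xyz) ** 2
    d2[d2 > radius ** 2] = float("inf")             # mask points outside the ball
    # Note: if a ball holds fewer than k points, the extra indices are
    # arbitrary; library implementations pad with the first valid neighbor.
    return d2.topk(k, dim=1, largest=False).indices
```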
Feature Grouping.
Feature grouping is a crucial step for bottom-up 3D object detectors [27, 38, 22, 5, 51, 35], which cluster groups of point-wise features to generate high-quality 3D bounding boxes. Among the numerous successors, the voting-based framework [27] is widely used; it groups the points that vote to the same local region. Though impressive, it does not consider semantic consistency and may therefore fail in cluttered indoor scenes where objects of different classes are distributed close to each other. Moreover, voting-based methods usually adopt a class-agnostic local region for all objects, which may incorrectly group the boundary points of large objects and involve more noise points for small objects. To address these limitations, we present a class-aware local grouping strategy that aggregates the points of the same category within class-specific center regions.
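As an illustration only (the interface and exact grouping rule below are our assumptions, not the paper's implementation), the following sketch captures how class-aware grouping differs from class-agnostic voting: a point joins a candidate's group only if it agrees on the predicted class, and the local region radius depends on that class.

```python
import torch

def class_aware_group(centers, center_cls, votes, point_cls, radii):
    """Hypothetical sketch of class-aware local grouping.
    centers:    (M, 3) candidate object centers
    center_cls: (M,)   predicted class id of each candidate
    votes:      (N, 3) per-point predicted (voted) object centers
    point_cls:  (N,)   predicted semantic class id per point
    radii:      (num_classes,) class-specific grouping radius
    Returns one tensor of point indices per candidate center."""
    groups = []
    for c, k in zip(centers, center_cls):
        same_cls = point_cls == k                    # semantic consistency check
        close = (votes - c).norm(dim=1) < radii[k]   # class-specific local region
        groups.append(torch.nonzero(same_cls & close).squeeze(1))
    return groups
```

A class-agnostic baseline would drop the `same_cls` mask and use a single shared radius, which is exactly what over-grows groups on small objects and clips them on large ones.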
Two-stage 3D Object Detection.
Many state-of-the-art methods apply RCNN-style 2D detectors to 3D scenes, using a 3D RoI-pooling scheme or its variants [30, 32, 9, 31, 47, 43] to aggregate proposal-specific features for box refinement in a second stage. These pooling algorithms are usually equipped with set abstraction [25] to encode local spatial features, which consists of a hand-crafted query operation (e.g., ball query [25] or vector query [9]) to capture the local points and a max-pooling operation to aggregate the assigned features. These RoI pooling modules are therefore mostly computationally expensive. Moreover, the max-pooling operation also discards spatial distribution information. To tackle these problems, we propose RoI-Conv pooling, a memory- and computation-efficient fully convolutional RoI pooling operation that aggregates proposal-specific features for the subsequent refinement.
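For reference, here is a stripped-down sketch (interface hypothetical, per-RoI for clarity) of the set-abstraction-style pooling criticized above. The final max over the neighborhood keeps only per-channel peaks, which is the loss of spatial distribution the text refers to.

```python
import torch

def ball_query_roi_pool(roi_center, radius, xyz, feats, k=16):
    """One RoI: gather up to k point features within `radius` of the RoI
    center, then max-pool them into a single vector."""
    d = (xyz - roi_center).norm(dim=1)              # distance to RoI center
    idx = torch.nonzero(d < radius).squeeze(1)[:k]  # hand-crafted ball query
    if idx.numel() == 0:                            # empty RoI -> zero feature
        return feats.new_zeros(feats.shape[1])
    return feats[idx].max(dim=0).values             # max-pool drops point layout
```

Any permutation or rearrangement of the k points inside the ball yields the same output vector, so the pooled feature cannot distinguish, say, points clustered at a box corner from points spread along a face.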
3 Methodology
In this paper, we propose CAGroup3D, a two-stage fully convolutional 3D object detection framework for estimating accurate 3D bounding boxes from point clouds. The overall architecture of CAGroup3D is depicted in Figure 2. Our framework consists of three major components: an efficient 3D voxel CNN with sparse convolution as the backbone network for point cloud feature learning (§3.1); a class-aware 3D proposal generation module that predicts high-quality 3D proposals by aggregating voxel features of the same category within class-specific local regions (§3.2); and a RoI-Conv pooling module that directly extracts complete and fine-grained voxel features from the backbone to revisit mis-segmented surface voxels and refine the 3D proposals. Finally, we formulate the learning objective of our framework in §A.4.
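Structurally, the pipeline can be summarized by the skeleton below; the module names are placeholders we introduce for illustration, not the paper's actual classes.

```python
import torch.nn as nn

class CAGroup3DSketch(nn.Module):
    """Illustrative two-stage pipeline matching the component list above."""
    def __init__(self, backbone, proposal_head, roi_pool, refine_head):
        super().__init__()
        self.backbone = backbone            # sparse-conv 3D voxel CNN (3.1)
        self.proposal_head = proposal_head  # class-aware proposal generation (3.2)
        self.roi_pool = roi_pool            # RoI-Conv pooling over backbone voxels
        self.refine_head = refine_head      # second-stage box refinement

    def forward(self, points):
        voxels = self.backbone(points)                # voxel features + semantics
        proposals = self.proposal_head(voxels)        # first-stage 3D boxes
        roi_feats = self.roi_pool(voxels, proposals)  # re-aggregate backbone voxels
        return self.refine_head(roi_feats, proposals) # refined 3D boxes
```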
3.1 3D Voxel CNN for Point Cloud Feature Learning
For generating accurate 3D proposals, we first need to learn a discriminative geometric representation of the input point clouds. Voxel CNNs with 3D sparse convolution [32, 44, 53, 13, 12] are widely used by state-of-the-art 3D detectors thanks to their high efficiency and the scalability of converting point clouds into regular 3D volumes. In this paper, we adopt a sparse-convolution-based backbone for feature encoding and 3D proposal generation.
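As a concrete (if simplified) example of the point-to-volume conversion this backbone relies on, the sketch below quantizes points to a grid and average-pools features per occupied voxel; production pipelines use dedicated sparse-convolution libraries rather than this plain-PyTorch version.

```python
import torch

def voxelize(points: torch.Tensor, voxel_size: float):
    """points: (N, 3 + C) with xyz first. Returns occupied voxel coordinates
    and the mean feature per voxel -- the regular, sparse 3D volume that
    sparse convolution then operates on."""
    coords = torch.floor(points[:, :3] / voxel_size).long()        # grid indices
    uniq, inv = torch.unique(coords, dim=0, return_inverse=True)   # occupied voxels
    feats = torch.zeros(uniq.shape[0], points.shape[1] - 3)
    feats.index_add_(0, inv, points[:, 3:])                        # sum per voxel
    counts = torch.zeros(uniq.shape[0]).index_add_(
        0, inv, torch.ones(points.shape[0]))
    return uniq, feats / counts.unsqueeze(1)                       # mean per voxel
```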
A 3D backbone network equipped with high-resolution feature maps and large receptive fields is critical for accurate 3D bounding box estimation and voxel-wise semantic segmentation; the latter is closely related to the accuracy of the succeeding grouping module. To maintain these two characteristics, inspired by the success of the HRNet series [39, 16, 34] in the segmentation community, we implement a 3D voxel bilateral network with dual resolution based on ResNet [15]. For brevity, we refer to it as BiResNet. As shown in Figure 2, our backbone network contains two branches. One is the sparse modification of