Prototypical VoteNet for Few-Shot 3D Point Cloud
Object Detection
Shizhen Zhao, Xiaojuan Qi
The University of Hong Kong
{zhaosz,xjqi}@eee.hku.hk
Abstract
Most existing 3D point cloud object detection approaches heavily rely on large amounts of labeled training data. However, the labeling process is costly and time-consuming. This paper considers few-shot 3D point cloud object detection, where only a few annotated samples of novel classes are needed, together with abundant samples of base classes. To this end, we propose Prototypical VoteNet to recognize and localize novel instances, which incorporates two new modules: a Prototypical Vote Module (PVM) and a Prototypical Head Module (PHM). Specifically, since basic 3D geometric structures can be shared among categories, PVM is designed to leverage class-agnostic geometric prototypes, learned from base classes, to refine local features of novel categories. PHM is then proposed to utilize class prototypes to enhance the global feature of each object, facilitating subsequent object localization and classification; it is trained with an episodic training strategy. To evaluate the model in this new setting, we contribute two new benchmark datasets, FS-ScanNet and FS-SUNRGBD. We conduct extensive experiments to demonstrate the effectiveness of Prototypical VoteNet; our proposed method shows significant and consistent improvements over baselines on the two benchmark datasets. The project page is available at https://shizhen-zhao.github.io/FS3D_page/.
1 Introduction
3D object detection aims to localize and recognize objects from point clouds, with many applications in augmented reality, autonomous driving, and robotics manipulation. Recently, a number of fully supervised 3D object detection approaches have made remarkable progress with deep learning [23, 19, 32, 25]. Nonetheless, their success heavily relies on large amounts of labeled training data, which are time-consuming and costly to obtain. In contrast, a human can quickly learn to recognize novel classes after seeing only a few samples. To imitate this human ability, we consider few-shot 3D point cloud object detection, which aims to train a model to recognize novel categories from limited annotated samples of novel classes together with sufficient annotated data of base classes.
Few-shot learning has been extensively studied in various 2D visual understanding tasks such as object detection [40, 41, 44, 47], image classification [15, 10, 3, 33], and semantic segmentation [24, 22, 50, 20]. Early attempts [10, 17, 12, 39] employ meta-learning to learn transferable knowledge from a collection of tasks and have attained remarkable progress. Recently, benefiting from large-scale datasets (e.g., ImageNet [7]) and advanced pre-training methods [28, 51, 11, 56], finetuning large-scale pre-trained visual models on downstream few-shot datasets has emerged as an effective approach to this problem [34, 40, 57]. Among these different streams of work, prototype-based methods [43, 55, 21, 18] have been incorporated into both and show great advantages, since they can capture representative features of categories that can be further utilized for feature refinement [47, 53] or classification [27, 33].

Figure 1: Illustration of the basic geometry of 3D objects, which can be shared among classes.
This motivates us to explore effective 3D cues for building prototypes for few-shot 3D detection. Unlike 2D visual data, 3D data is free of the distortions caused by perspective projection and offers geometric cues with accurate shape and scale information. Besides, the 3D primitives that constitute objects can often be shared among different categories: for instance, as shown in Figure 1, rectangular plates and corners can be found in many categories. Based on these observations, we propose Prototypical VoteNet, which employs such robust 3D shape and primitive cues to build geometric prototypes that facilitate representation learning in the few-shot setting.
Prototypical VoteNet incorporates two new modules, namely the Prototypical Vote Module (PVM) and the Prototypical Head Module (PHM), to enhance local and global feature learning, respectively, for few-shot 3D detection. Specifically, based on the features extracted by a backbone network (i.e., PointNet++ [26]), PVM first constructs a class-agnostic 3D primitive memory bank to store geometric prototypes, which are shared by all categories and updated iteratively during training. To exploit the transferability of geometric structures, PVM then incorporates a multi-head cross-attention module to associate geometric prototypes with points in a given scene and uses them to refine the point feature representations. PVM is mainly designed to exploit geometric structures shared between base and novel categories to enhance local feature learning in the few-shot setting. Further, to facilitate learning discriminative features for object categorization, PHM is designed to employ a multi-head cross-attention module that leverages class-specific prototypes, built from a few support samples, to refine the global representation of each object. Moreover, episodic training [33, 39] is adopted to simulate few-shot circumstances, where PHM is trained on a distribution of similar few-shot tasks instead of only one target object detection task.
Our contributions are listed as follows:

• We are the first to study the promising few-shot 3D point cloud object detection task, which requires a model to detect objects of new classes given only a few examples.

• We propose Prototypical VoteNet, which incorporates a Prototypical Vote Module and a Prototypical Head Module, to address this new challenge. The Prototypical Vote Module leverages class-agnostic geometric prototypes to enhance the local features of novel samples; the Prototypical Head Module utilizes class-specific prototypes to refine object features with the aid of episodic training.

• We contribute two new benchmark dataset settings, FS-ScanNet and FS-SUNRGBD, specifically designed for this problem. Our experimental results on these two benchmarks show that the proposed model effectively addresses few-shot 3D point cloud object detection, yielding significant improvements over several competitive baseline approaches.
2 Related Work
3D Point Cloud Object Detection.
Current 3D point cloud object detection approaches can be divided into two streams: grid projection/voxelization based [48, 16, 36, 4, 54, 42] and point-based [31, 23, 19, 9, 2]. The former projects the point cloud onto 2D grids or into 3D voxels so that advanced convolutional networks can be directly applied. The latter methods take the raw point cloud as input and use a point feature extraction network such as PointNet++ [26] to generate point-wise features for the subsequent detection. Although these fully supervised approaches have achieved promising 3D detection performance, their requirement for large amounts of training data precludes their application in many real-world scenarios where training data is costly or hard to acquire. To alleviate this limitation, we explore the direction of few-shot 3D object detection in this paper.

[Figure 2 diagram: backbone → Prototypical Vote Module (cross-attention between point features and the geometric prototype bank, add & norm, vote layer, prototype update) → sampling & grouping → Prototypical Head Module (cross-attention between object features and class prototypes pooled from the support set via a shared network, add & norm) → prediction layer.]

Figure 2: Illustration of Prototypical VoteNet. Prototypical VoteNet introduces two modules for few-shot 3D detection: 1) the Prototypical Vote Module, which enhances the local feature representation of novel samples by leveraging geometric prototypes, and 2) the Prototypical Head Module, which refines the global features of novel objects by utilizing class-specific prototypes.
Few-Shot Recognition.
Few-shot recognition aims to classify novel instances given abundant base samples and only a few novel samples. Simple pre-training and finetuning approaches first train the model on the base classes and then finetune it on the novel categories [3, 8]. Meta-learning based methods [10, 17, 12, 39, 33] are proposed to learn classifiers across tasks and then transfer them to the few-shot classification task. The most related work is Prototypical Network [33], which represents a class by a single prototype, so that classification can be performed by computing distances to the prototype representation of each class. The above works mainly focus on 2D image understanding. Recently, some few-shot learning approaches for point cloud understanding [30, 53, 49] have been proposed. For instance, Sharma et al. [53] propose a graph-based method to propagate knowledge from the few-shot samples to the input point cloud. However, no prior work studies few-shot 3D point cloud object detection. In this paper, we present the first study of this problem and introduce the spirit of Prototypical Network into few-shot 3D object detection via 3D geometric prototypes and 3D class-specific prototypes.
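For concreteness, the Prototypical Network classification rule [33] can be sketched in a few lines of Python. This is a generic illustration of the prototype idea rather than our detection pipeline; the function name, tensor shapes, and the squared Euclidean metric are our own illustrative choices:

```python
import torch

def prototypical_classify(support_feats, support_labels, query_feats, num_classes):
    """Prototypical Network rule: a class prototype is the mean embedding of
    its support samples; a query is classified by a softmax over negative
    squared Euclidean distances to the prototypes.

    support_feats:  (S, d) support embeddings (at least 1 shot per class assumed)
    support_labels: (S,)   integer labels in [0, num_classes)
    query_feats:    (Q, d) query embeddings
    """
    prototypes = torch.stack([
        support_feats[support_labels == r].mean(dim=0)  # e_r = mean of class r
        for r in range(num_classes)
    ])                                                  # (num_classes, d)
    dists = torch.cdist(query_feats, prototypes) ** 2   # (Q, num_classes)
    return torch.softmax(-dists, dim=-1)                # nearer prototype -> higher prob
```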
2D Few-Shot Object Detection.
Most existing 2D few-shot detectors employ a meta-learning [41, 15, 47] or fine-tuning based mechanism [45, 44, 27, 37]. In particular, Kang et al. [15] propose a one-stage few-shot detector that contains a meta feature learner and a feature re-weighting module. Meta R-CNN [47] performs meta-learning over RoI (Region-of-Interest) features and incorporates it into Faster R-CNN [29] and Mask R-CNN [12]. TFA [40] reveals that simply fine-tuning the box classifier and regressor outperforms many meta-learning based methods. Cao et al. [1] improve few-shot detection performance by associating each novel class with a well-trained base class based on their semantic similarity.
3 Our Approach
In few-shot 3D point cloud object detection (FS3D), the object class set $C$ is split into $C_{base}$ and $C_{novel}$ such that $C = C_{base} \cup C_{novel}$ and $C_{base} \cap C_{novel} = \emptyset$. For each class $r \in C$, its annotation dataset $T_r$ contains all the data samples with object bounding boxes, that is, $T_r = \{(u, P) \mid u \in \mathbb{R}^6, P \in \mathbb{R}^{N \times 3}\}$. Here, $(u, P)$ denotes a 3D object bounding box $u = (x, y, z, h, w, l)$, representing the box center location and box dimensions, in a point cloud scene $P$.

There are only a few examples/shots for each novel class $r \in C_{novel}$, which are known as support samples. Besides, there are plenty of annotated samples for each base class $r \in C_{base}$. Given the above dataset, FS3D aims to train a model to detect object instances of the novel classes, leveraging the sufficient annotations for base categories $C_{base}$ and the limited annotations for novel categories $C_{novel}$.
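As an illustration of this formulation, the following minimal sketch assembles the FS3D training annotations: the full annotation set for each base class and a random $K$-shot support set per novel class. The dictionary layout of `T` and the helper name `build_fs3d_split` are hypothetical conveniences for exposition, not part of any released code:

```python
import random

def build_fs3d_split(T, base_classes, novel_classes, k_shot, seed=0):
    """Assemble FS3D training annotations: every annotated sample (u, P) for
    each base class, but only K randomly chosen support samples per novel class.

    T: dict mapping class id r -> list of (u, P) pairs, where u is the 6-d box
       (x, y, z, h, w, l) and P is the N x 3 point cloud scene containing it.
    """
    rng = random.Random(seed)  # fixed seed so the support set is reproducible
    base_set = {r: list(T[r]) for r in base_classes}                    # abundant
    support_set = {r: rng.sample(T[r], k_shot) for r in novel_classes}  # K shots
    return base_set, support_set
```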
In the following, we introduce Prototypical VoteNet for few-shot 3D object detection. We first describe the preliminaries of our framework in Section 3.1, which adopts the architecture of VoteNet-style 3D detectors [25, 52, 2]. We then present Prototypical VoteNet, consisting of the Prototypical Vote Module (Section 3.2.1) and the Prototypical Head Module (Section 3.2.2), which enhance feature learning for FS3D.
3.1 Preliminaries
VoteNet-style 3D detectors [25, 52, 2] take a point cloud scene $P_i$ as input, and localize and categorize 3D objects. As shown in Figure 2, such a detector first incorporates a 3D backbone network (i.e., PointNet++ [26]), parameterized by $\theta_1$, with downsampling layers for point feature extraction, as in Equation (1):

$$F_i = h_1(P_i; \theta_1), \tag{1}$$

where $N$ and $M$ denote the original and subsampled numbers of points, respectively, $P_i \in \mathbb{R}^{N \times 3}$ represents an input point cloud scene $i$, and $F_i \in \mathbb{R}^{M \times (3+d)}$ contains the subsampled scene points (also called seeds) with $d$-dimensional features and 3-dimensional location coordinates.
Then, $F_i$ is fed into the vote module with parameters $\theta_2$, which outputs a 3-dimensional coordinate offset $\Delta d_j = (\Delta x_j, \Delta y_j, \Delta z_j)$ relative to the corresponding object center $c = (c_x, c_y, c_z)$ and a residual feature vector $\Delta f_j$ for each point $j$ in $F_i = \{f_j\}_i$, as in Equation (2):

$$\{\Delta d_j, \Delta f_j\}_i = h_2(F_i; \theta_2). \tag{2}$$

Given the predicted offset $\Delta d_j$, the estimated object center $c_j = (c_{x_j}, c_{y_j}, c_{z_j})$ that point $j$ belongs to can be calculated as in Equation (3):

$$c_{x_j} = x_j + \Delta x_j, \quad c_{y_j} = y_j + \Delta y_j, \quad c_{z_j} = z_j + \Delta z_j. \tag{3}$$

Similarly, the point features are updated as $F_i \leftarrow F_i + \Delta F_i$, where $\Delta F_i = \{\Delta f_j\}_i$.
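In code, Equations (2) and (3) reduce to predicting per-seed offsets and adding them to the seed coordinates and features. A minimal PyTorch-style sketch follows; the two-layer MLP and the tensor shapes are assumptions for illustration, not the exact architecture:

```python
import torch.nn as nn

class VoteLayer(nn.Module):
    """h2 in Equation (2): predicts a 3-d center offset and a d-dim feature
    residual for every seed point (the two-layer MLP is an assumed design)."""
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 3 + d))

    def forward(self, xyz, feats):
        # xyz: (M, 3) seed coordinates; feats: (M, d) seed features
        out = self.mlp(feats)
        delta_xyz, delta_feats = out[:, :3], out[:, 3:]
        centers = xyz + delta_xyz        # Eq. (3): c_j = (x_j + dx_j, ...)
        new_feats = feats + delta_feats  # F_i <- F_i + delta F_i
        return centers, new_feats
```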
Next, the detector samples object centers from $\{(c_{x_j}, c_{y_j}, c_{z_j})\}_i$ using farthest point sampling and groups points with nearby centers together (see Figure 2: sampling & grouping) to form a set of object proposals $O_i = \{o_t\}_i$. Each object proposal is characterized by a feature vector $f_{o_t}$, obtained by applying a max-pooling operation over the features of all points belonging to $o_t$.

Further, given the object features $\{f_{o_t}\}_i$, the prediction layer with parameters $\theta_3$ is adopted to yield the bounding box $b_t$, objectness score $s_t$, and classification logits $r_t$ for each object proposal $o_t$, following Equation (4):

$$\{b_t, s_t, r_t\}_i = h_3(\{f_{o_t}\}_i; \theta_3). \tag{4}$$
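Putting Equations (1)-(4) together, the VoteNet-style forward pass can be summarized as the sketch below. The callables `backbone`, `vote_layer`, `sample_and_group`, and `prediction_head` are placeholders standing in for PointNet++ [26], the vote module $h_2$, farthest-point-sampling-based grouping with max-pooling, and the prediction layer $h_3$, respectively; this is a schematic, not a reference implementation:

```python
def votenet_forward(points, backbone, vote_layer, sample_and_group, prediction_head):
    """Schematic VoteNet-style forward pass over one scene P_i of shape (N, 3)."""
    xyz, feats = backbone(points)            # Eq. (1): M seeds, (M, 3) + (M, d)
    centers, feats = vote_layer(xyz, feats)  # Eqs. (2)-(3): vote toward centers
    # Farthest point sampling over voted centers, grouping of nearby votes into
    # proposals, and max-pooling member features into one vector f_ot per proposal.
    proposal_feats = sample_and_group(centers, feats)            # (T, d)
    boxes, objectness, logits = prediction_head(proposal_feats)  # Eq. (4)
    return boxes, objectness, logits
```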
3.2 Prototypical VoteNet
Here, we present Prototypical VoteNet, which incorporates two new designs, the Prototypical Vote Module (PVM) and the Prototypical Head Module (PHM), to improve feature learning for novel categories with few annotated samples (see Figure 2). Specifically, PVM builds a class-agnostic memory bank of geometric prototypes $G = \{g_k\}_{k=1}^{K}$ of size $K$, which models transferable class-agnostic 3D primitives learned from the rich base categories, and further employs them to enhance local feature representations for novel categories via a multi-head cross-attention module. The enhanced features are then used by the Vote Layer to output the coordinate and feature offsets as in Equation (2). Second, to facilitate learning discriminative features for novel class prediction, PHM employs an attention-based design to leverage class-specific prototypes $E = \{e_r\}_{r=1}^{R}$, extracted from the support set $D_{support}$ with $R$ categories, to refine the global discriminative feature representing each object proposal (see Figure 2). The output features are fed to the prediction layer to produce results as in Equation (4). To make the model more generalizable to novel classes, we exploit the episodic training [33, 39] strategy to train PHM, where a distribution of similar few-shot tasks, instead of only one object detection task, is learned in the training phase. PVM and PHM are elaborated in the following sections.
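As a rough sketch of the shared cross-attention mechanism underlying both modules (the residual-plus-LayerNorm layout follows the "Add & Norm" blocks of Figure 2, while the head count, batching convention, and class name are our assumptions):

```python
import torch.nn as nn

class PrototypeCrossAttention(nn.Module):
    """Refine features by attending to a prototype bank, followed by the
    residual 'Add & Norm' of Figure 2. Queries are the features to refine;
    keys and values are the prototypes."""
    def __init__(self, d, num_heads=4):  # num_heads=4 is an assumed setting
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, feats, prototypes):
        # feats: (B, M, d) point or object features; prototypes: (B, K, d) bank
        refined, _ = self.attn(query=feats, key=prototypes, value=prototypes)
        return self.norm(feats + refined)
```

In PVM, the prototype bank would hold the $K$ class-agnostic geometric prototypes $g_k$ and the queries are seed point features; in PHM, it would hold the $R$ class prototypes $e_r$ pooled from the support set and the queries are object proposal features.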