with sufficient annotations for base categories $C_{\text{base}}$ and limited annotations for novel categories $C_{\text{novel}}$.
In the following, we introduce Prototypical VoteNet for few-shot 3D object detection. We first describe the preliminaries of our framework in Section 3.1, which adopts the architecture of VoteNet-style 3D detectors [25, 52, 2]. Then, we present Prototypical VoteNet, consisting of the Prototypical Vote Module (Section 3.2.1) and the Prototypical Head Module (Section 3.2.2), to enhance feature learning for FS3D.
3.1 Preliminaries
VoteNet-style 3D detectors [25, 52, 2] take a point cloud scene $P_i$ as input, and localize and categorize 3D objects. As shown in Figure 2, the detector first incorporates a 3D backbone network (i.e., PointNet++ [26]) parameterized by $\theta_1$ with downsampling layers for point feature extraction, as in Equation (1):
$$F_i = h_1(P_i; \theta_1), \tag{1}$$
where $N$ and $M$ represent the original and subsampled numbers of points, respectively, $P_i \in \mathbb{R}^{N \times 3}$ represents an input point cloud scene $i$, and $F_i \in \mathbb{R}^{M \times (3+d)}$ contains the subsampled scene points (also called seeds) with $d$-dimensional features and 3-dimensional location coordinates.
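To make the interface concrete, below is a minimal PyTorch sketch of the backbone contract in Equation (1). The class `ToyBackbone`, its layer sizes, and the random subsampling are illustrative assumptions standing in for PointNet++'s learned set-abstraction layers; only the input/output shapes, $N \times 3$ to $M \times (3+d)$, follow the text.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Hypothetical stand-in for the PointNet++ backbone h1(.; theta1).

    Maps an (N, 3) point cloud to M seed points with (3 + d)-dim outputs:
    3 location coordinates plus a d-dim feature per seed.
    """
    def __init__(self, d: int = 256, m: int = 1024):
        super().__init__()
        self.m = m
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, d),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3). Random subsampling replaces PointNet++'s learned
        # set-abstraction downsampling purely for illustration.
        idx = torch.randperm(points.shape[0])[: self.m]
        seeds_xyz = points[idx]                             # (M, 3)
        seeds_feat = self.mlp(seeds_xyz)                    # (M, d)
        return torch.cat([seeds_xyz, seeds_feat], dim=-1)   # (M, 3 + d)

P_i = torch.randn(20000, 3)      # one point cloud scene
F_i = ToyBackbone()(P_i)         # Equation (1): F_i = h1(P_i; theta1)
print(F_i.shape)                 # torch.Size([1024, 259])
```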
Then, $F_i$ is fed into the vote module with parameters $\theta_2$, which outputs a 3-dimensional coordinate offset $\Delta d_j = (\Delta x_j, \Delta y_j, \Delta z_j)$ relative to the corresponding object center $c = (c_x, c_y, c_z)$ and a residual feature vector $\Delta f_j$ for each point $j$ in $F_i = \{f_j\}_i$, as in Equation (2):
$$\{\Delta d_j, \Delta f_j\}_i = h_2(F_i; \theta_2). \tag{2}$$
Given the predicted offset $\Delta d_j$, the estimated object center $c_j = (c_{x_j}, c_{y_j}, c_{z_j})$ that point $j$ belongs to can be calculated as in Equation (3):
$$c_{x_j} = x_j + \Delta x_j, \quad c_{y_j} = y_j + \Delta y_j, \quad c_{z_j} = z_j + \Delta z_j. \tag{3}$$
Similarly, the point features are updated as $F_i \leftarrow F_i + \Delta F_i$, where $\Delta F_i = \{\Delta f_j\}_i$.
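A minimal sketch of Equations (2) and (3) follows, assuming a simple shared MLP as $h_2$; the architecture and sizes of `ToyVoteModule` are hypothetical, while the offset/residual split and the center and feature updates follow the equations above.

```python
import torch
import torch.nn as nn

class ToyVoteModule(nn.Module):
    """Minimal sketch of the vote module h2(.; theta2), Equation (2).

    For each seed j it predicts a 3-dim center offset (delta_d) and a
    d-dim feature residual (delta_f); layer sizes are illustrative.
    """
    def __init__(self, d: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 3 + d),   # first 3 dims: delta_d, rest: delta_f
        )

    def forward(self, seeds: torch.Tensor):
        xyz, feat = seeds[:, :3], seeds[:, 3:]
        out = self.net(feat)
        delta_d, delta_f = out[:, :3], out[:, 3:]
        centers = xyz + delta_d      # Equation (3): c_j = x_j + delta_d_j
        new_feat = feat + delta_f    # feature update: F_i <- F_i + dF_i
        return centers, new_feat

seeds = torch.randn(1024, 3 + 256)        # F_i from the backbone
centers, vote_feat = ToyVoteModule()(seeds)
print(centers.shape, vote_feat.shape)     # (1024, 3), (1024, 256)
```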
Next, the detector samples object centers from $\{(c_{x_j}, c_{y_j}, c_{z_j})\}_i$ using farthest point sampling and groups points with nearby centers together (see Figure 2: Sampling & grouping) to form a set of object proposals $O_i = \{o_t\}_i$. Each object proposal is characterized by a feature vector $f_{o_t}$, which is obtained by applying a max-pooling operation on the features of all points belonging to $o_t$.
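This sampling-and-grouping step can be sketched as below; the plain-PyTorch farthest point sampling and the fixed grouping radius are illustrative assumptions, while the max pooling over grouped vote features matches the description above.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, t: int) -> torch.Tensor:
    """Farthest point sampling over predicted centers.

    xyz: (M, 3) vote centers; returns indices of T sampled centers.
    """
    m = xyz.shape[0]
    chosen = torch.zeros(t, dtype=torch.long)   # start from point 0
    dist = torch.full((m,), float("inf"))
    for i in range(1, t):
        # track each point's distance to its nearest chosen center,
        # then pick the point farthest from all chosen so far
        dist = torch.minimum(dist, (xyz - xyz[chosen[i - 1]]).pow(2).sum(-1))
        chosen[i] = dist.argmax()
    return chosen

def group_and_pool(centers, feats, t: int = 128, radius: float = 0.3):
    """Group votes around T sampled centers and max-pool their features.

    The radius is an assumed hyperparameter; each proposal feature f_{o_t}
    is the max over the features of votes within the radius.
    """
    idx = farthest_point_sampling(centers, t)
    proposal_feats = []
    for c in centers[idx]:
        mask = (centers - c).pow(2).sum(-1).sqrt() < radius
        proposal_feats.append(feats[mask].max(dim=0).values)
    return torch.stack(proposal_feats)   # (T, d)

centers, feats = torch.randn(1024, 3), torch.randn(1024, 256)
f_o = group_and_pool(centers, feats)
print(f_o.shape)                         # torch.Size([128, 256])
```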
Further, equipped with the object features $\{f_{o_t}\}_i$, the prediction layer with parameters $\theta_3$ is adopted to yield the bounding box $b_t$, objectness score $s_t$, and classification logits $r_t$ for each object proposal $o_t$, following Equation (4):
$$\{b_t, s_t, r_t\}_i = h_3(\{f_{o_t}\}_i; \theta_3). \tag{4}$$
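A minimal sketch of the prediction layer $h_3$ in Equation (4) is given below; `ToyPredictionHead`, the 6-dimensional box parameterization, and the class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ToyPredictionHead(nn.Module):
    """Sketch of the prediction layer h3(.; theta3), Equation (4).

    From each proposal feature f_{o_t} it regresses a bounding box b_t
    (here an assumed 6-dim center-plus-size parameterization), an
    objectness score s_t, and classification logits r_t.
    """
    def __init__(self, d: int = 256, num_classes: int = 18):
        super().__init__()
        self.box = nn.Linear(d, 6)          # assumed box parameterization
        self.objectness = nn.Linear(d, 1)
        self.cls = nn.Linear(d, num_classes)

    def forward(self, f_o: torch.Tensor):
        return self.box(f_o), self.objectness(f_o).sigmoid(), self.cls(f_o)

f_o = torch.randn(128, 256)
b, s, r = ToyPredictionHead()(f_o)
print(b.shape, s.shape, r.shape)   # (128, 6), (128, 1), (128, 18)
```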
3.2 Prototypical VoteNet
Here, we present Prototypical VoteNet, which incorporates two new designs, the Prototypical Vote Module (PVM) and the Prototypical Head Module (PHM), to improve feature learning for novel categories with few annotated samples (see Figure 2). Specifically, PVM builds a class-agnostic memory bank of geometric prototypes $G = \{g_k\}_{k=1}^{K}$ with a size of $K$, which models transferable class-agnostic 3D primitives learned from rich base categories, and further employs them to enhance local feature representation for novel categories via a multi-head cross-attention module. The enhanced features are then utilized by the Vote Layer to output the offsets of coordinates and features as in Equation (2).
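A minimal sketch of PVM's cross-attention step follows, assuming the geometric prototypes are stored as a learnable tensor used as keys and values for the seed-feature queries; the bank size $K$, head count, residual connection, and the omission of any memory-bank update rule are all simplifications here, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ToyPVM(nn.Module):
    """Minimal sketch of the Prototypical Vote Module's attention step.

    A memory bank G of K class-agnostic geometric prototypes serves as
    keys/values in multi-head cross-attention; seed features act as
    queries and are refined before voting.
    """
    def __init__(self, d: int = 256, k: int = 64, heads: int = 4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(k, d))   # G = {g_k}
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, seed_feat: torch.Tensor) -> torch.Tensor:
        # seed_feat: (M, d) -> queries; prototypes -> keys and values.
        q = seed_feat.unsqueeze(0)            # (1, M, d)
        kv = self.prototypes.unsqueeze(0)     # (1, K, d)
        refined, _ = self.attn(q, kv, kv)
        return seed_feat + refined.squeeze(0)  # assumed residual refinement

seed_feat = torch.randn(1024, 256)
refined = ToyPVM()(seed_feat)      # then fed to the Vote Layer
print(refined.shape)               # torch.Size([1024, 256])
```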
Second, to facilitate learning discriminative features for novel class prediction, PHM employs an attention-based design that leverages class-specific prototypes $E = \{e_r\}_{r=1}^{R}$, extracted from the support set $D_{\text{support}}$ with $R$ categories, to refine the global discriminative feature representing each object proposal (see Figure 2). The output features are fed to the prediction layer to produce results as in Equation (4). To make the model more generalizable to novel classes, we exploit the episodic training [33, 39] strategy to train PHM, where a distribution of similar few-shot tasks, instead of only one object detection task, is learned in the training phase. PVM and PHM are elaborated in the following sections. A minimal sketch of PHM's attention step is shown below; how the class-specific prototypes are computed from $D_{\text{support}}$ (here, pre-averaged support features passed in directly) and the episodic sampling loop are assumptions for illustration.
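```python
import torch
import torch.nn as nn

class ToyPHM(nn.Module):
    """Minimal sketch of the Prototypical Head Module's attention step.

    Class-specific prototypes E = {e_r} (assumed here to be averaged
    support-set object features for R categories) refine each proposal
    feature via cross-attention before the prediction layer.
    """
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, f_o: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # f_o: (T, d) proposal features; e: (R, d) class prototypes.
        q, kv = f_o.unsqueeze(0), e.unsqueeze(0)
        refined, _ = self.attn(q, kv, kv)
        return f_o + refined.squeeze(0)   # assumed residual refinement

# One episodic-training step samples a few-shot task; here R = 5 class
# prototypes stand in for features computed from the sampled support set.
e = torch.randn(5, 256)
f_o = torch.randn(128, 256)
print(ToyPHM()(f_o, e).shape)     # torch.Size([128, 256])
```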