with sufficient annotations for base categories $C_{\text{base}}$ and limited annotations for novel categories $C_{\text{novel}}$.
In the following, we introduce Prototypical VoteNet for few-shot 3D object detection. We first describe the preliminaries of our framework in Section 3.1, which adopts the architecture of VoteNet-style 3D detectors [25, 52, 2]. Then, we present Prototypical VoteNet, consisting of the Prototypical Vote Module (Section 3.2.1) and the Prototypical Head Module (Section 3.2.2), to enhance feature learning for FS3D.
3.1 Preliminaries
VoteNet-style 3D detectors [25, 52, 2] take a point cloud scene $P_i$ as input, and localize and categorize 3D objects. As shown in Figure 2, the detector first incorporates a 3D backbone network (i.e., PointNet++ [26]) parameterized by $\theta_1$ with downsampling layers for point feature extraction, as in Equation (1):
$$F_i = h_1(P_i; \theta_1), \tag{1}$$
where $N$ and $M$ represent the original and subsampled numbers of points, respectively, $P_i \in \mathbb{R}^{N \times 3}$ represents an input point cloud scene $i$, and $F_i \in \mathbb{R}^{M \times (3+d)}$ contains the subsampled scene points (also called seeds) with $d$-dimensional features and 3-dimensional location coordinates.
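To make the interface concrete, below is a minimal PyTorch sketch of the backbone contract in Equation (1). The class `ToyBackbone`, its layer sizes, and the random subsampling are illustrative assumptions standing in for PointNet++'s learned set-abstraction layers; only the input/output shapes, $N \times 3$ to $M \times (3+d)$, follow the text.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Hypothetical stand-in for the PointNet++ backbone h1(.; theta1).

    Maps an (N, 3) point cloud to M seed points with (3 + d)-dim outputs:
    3 location coordinates plus a d-dim feature per seed.
    """
    def __init__(self, d: int = 256, m: int = 1024):
        super().__init__()
        self.m = m
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, d),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3). Random subsampling replaces PointNet++'s learned
        # set-abstraction downsampling purely for illustration.
        idx = torch.randperm(points.shape[0])[: self.m]
        seeds_xyz = points[idx]                             # (M, 3)
        seeds_feat = self.mlp(seeds_xyz)                    # (M, d)
        return torch.cat([seeds_xyz, seeds_feat], dim=-1)   # (M, 3 + d)

P_i = torch.randn(20000, 3)      # one point cloud scene
F_i = ToyBackbone()(P_i)         # Equation (1): F_i = h1(P_i; theta1)
print(F_i.shape)                 # torch.Size([1024, 259])
```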
Then, $F_i$ is fed into the vote module with parameters $\theta_2$, which outputs a 3-dimensional coordinate offset $\Delta d_j = (\Delta x_j, \Delta y_j, \Delta z_j)$ relative to the corresponding object center $c = (c_x, c_y, c_z)$ and a residual feature vector $\Delta f_j$ for each point $j$ in $F_i = \{f_j\}_i$, as in Equation (2):
$$\{\Delta d_j, \Delta f_j\}_i = h_2(F_i; \theta_2). \tag{2}$$
Given the predicted offset $\Delta d_j$, the estimated object center $c_j = (c_{x_j}, c_{y_j}, c_{z_j})$ that point $j$ belongs to can be calculated as in Equation (3):
$$c_{x_j} = x_j + \Delta x_j, \quad c_{y_j} = y_j + \Delta y_j, \quad c_{z_j} = z_j + \Delta z_j. \tag{3}$$
Similarly, the point features are updated as $F_i \leftarrow F_i + \Delta F_i$, where $\Delta F_i = \{\Delta f_j\}_i$.
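A minimal sketch of Equations (2) and (3) follows, assuming a simple shared MLP as $h_2$; the architecture and sizes of `ToyVoteModule` are hypothetical, while the offset/residual split and the center and feature updates follow the equations above.

```python
import torch
import torch.nn as nn

class ToyVoteModule(nn.Module):
    """Minimal sketch of the vote module h2(.; theta2), Equation (2).

    For each seed j it predicts a 3-dim center offset (delta_d) and a
    d-dim feature residual (delta_f); layer sizes are illustrative.
    """
    def __init__(self, d: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 3 + d),   # first 3 dims: delta_d, rest: delta_f
        )

    def forward(self, seeds: torch.Tensor):
        xyz, feat = seeds[:, :3], seeds[:, 3:]
        out = self.net(feat)
        delta_d, delta_f = out[:, :3], out[:, 3:]
        centers = xyz + delta_d      # Equation (3): c_j = x_j + delta_d_j
        new_feat = feat + delta_f    # feature update: F_i <- F_i + dF_i
        return centers, new_feat

seeds = torch.randn(1024, 3 + 256)        # F_i from the backbone
centers, vote_feat = ToyVoteModule()(seeds)
print(centers.shape, vote_feat.shape)     # (1024, 3), (1024, 256)
```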
Next, the detector samples object centers from $\{(c_{x_j}, c_{y_j}, c_{z_j})\}_i$ using farthest point sampling and groups points with nearby centers together (see Figure 2: Sampling & grouping) to form a set of object proposals $O_i = \{o_t\}_i$. Each object proposal is characterized by a feature vector $f_{o_t}$, which is obtained by applying a max-pooling operation on the features of all points belonging to $o_t$.
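This sampling-and-grouping step can be sketched as below; the plain-PyTorch farthest point sampling and the fixed grouping radius are illustrative assumptions, while the max pooling over grouped vote features matches the description above.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, t: int) -> torch.Tensor:
    """Farthest point sampling over predicted centers.

    xyz: (M, 3) vote centers; returns indices of T sampled centers.
    """
    m = xyz.shape[0]
    chosen = torch.zeros(t, dtype=torch.long)   # start from point 0
    dist = torch.full((m,), float("inf"))
    for i in range(1, t):
        # track each point's distance to its nearest chosen center,
        # then pick the point farthest from all chosen so far
        dist = torch.minimum(dist, (xyz - xyz[chosen[i - 1]]).pow(2).sum(-1))
        chosen[i] = dist.argmax()
    return chosen

def group_and_pool(centers, feats, t: int = 128, radius: float = 0.3):
    """Group votes around T sampled centers and max-pool their features.

    The radius is an assumed hyperparameter; each proposal feature f_{o_t}
    is the max over the features of votes within the radius.
    """
    idx = farthest_point_sampling(centers, t)
    proposal_feats = []
    for c in centers[idx]:
        mask = (centers - c).pow(2).sum(-1).sqrt() < radius
        proposal_feats.append(feats[mask].max(dim=0).values)
    return torch.stack(proposal_feats)   # (T, d)

centers, feats = torch.randn(1024, 3), torch.randn(1024, 256)
f_o = group_and_pool(centers, feats)
print(f_o.shape)                         # torch.Size([128, 256])
```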
Further, equipped with the object features $\{f_{o_t}\}_i$, the prediction layer with parameters $\theta_3$ is adopted to yield the bounding box $b_t$, objectness score $s_t$, and classification logits $r_t$ for each object proposal $o_t$, following Equation (4):
$$\{b_t, s_t, r_t\}_i = h_3(\{f_{o_t}\}_i; \theta_3). \tag{4}$$
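A minimal sketch of the prediction layer $h_3$ in Equation (4) is given below; `ToyPredictionHead`, the 6-dimensional box parameterization, and the class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ToyPredictionHead(nn.Module):
    """Sketch of the prediction layer h3(.; theta3), Equation (4).

    From each proposal feature f_{o_t} it regresses a bounding box b_t
    (here an assumed 6-dim center-plus-size parameterization), an
    objectness score s_t, and classification logits r_t.
    """
    def __init__(self, d: int = 256, num_classes: int = 18):
        super().__init__()
        self.box = nn.Linear(d, 6)          # assumed box parameterization
        self.objectness = nn.Linear(d, 1)
        self.cls = nn.Linear(d, num_classes)

    def forward(self, f_o: torch.Tensor):
        return self.box(f_o), self.objectness(f_o).sigmoid(), self.cls(f_o)

f_o = torch.randn(128, 256)
b, s, r = ToyPredictionHead()(f_o)
print(b.shape, s.shape, r.shape)   # (128, 6), (128, 1), (128, 18)
```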
3.2 Prototypical VoteNet
Here, we present Prototypical VoteNet, which incorporates two new designs, the Prototypical Vote Module (PVM) and the Prototypical Head Module (PHM), to improve feature learning for novel categories with few annotated samples (see Figure 2). Specifically, PVM builds a class-agnostic memory bank of geometric prototypes $G = \{g_k\}_{k=1}^{K}$ with a size of $K$, which models transferable class-agnostic 3D primitives learned from rich base categories, and further employs them to enhance local feature representation for novel categories via a multi-head cross-attention module. The enhanced features are then utilized by the Vote Layer to output the offsets of coordinates and features as in Equation (2).
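A minimal sketch of PVM's cross-attention step follows, assuming the geometric prototypes are stored as a learnable tensor used as keys and values for the seed-feature queries; the bank size $K$, head count, residual connection, and the omission of any memory-bank update rule are all simplifications here, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ToyPVM(nn.Module):
    """Minimal sketch of the Prototypical Vote Module's attention step.

    A memory bank G of K class-agnostic geometric prototypes serves as
    keys/values in multi-head cross-attention; seed features act as
    queries and are refined before voting.
    """
    def __init__(self, d: int = 256, k: int = 64, heads: int = 4):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(k, d))   # G = {g_k}
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, seed_feat: torch.Tensor) -> torch.Tensor:
        # seed_feat: (M, d) -> queries; prototypes -> keys and values.
        q = seed_feat.unsqueeze(0)            # (1, M, d)
        kv = self.prototypes.unsqueeze(0)     # (1, K, d)
        refined, _ = self.attn(q, kv, kv)
        return seed_feat + refined.squeeze(0)  # assumed residual refinement

seed_feat = torch.randn(1024, 256)
refined = ToyPVM()(seed_feat)      # then fed to the Vote Layer
print(refined.shape)               # torch.Size([1024, 256])
```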
Second, to facilitate learning discriminative features for novel class prediction, PHM employs an attention-based design that leverages class-specific prototypes $E = \{e_r\}_{r=1}^{R}$, extracted from the support set $D_{\text{support}}$ with $R$ categories, to refine the global discriminative feature representing each object proposal (see Figure 2). The output features are fed to the prediction layer to produce results as in Equation (4). To make the model more generalizable to novel classes, we exploit the episodic training [33, 39] strategy to train PHM, where a distribution of similar few-shot tasks, instead of only one object detection task, is learned in the training phase. PVM and PHM are elaborated in the following sections. A minimal sketch of PHM's attention step is shown below; how the class-specific prototypes are computed from $D_{\text{support}}$ (here, pre-averaged support features passed in directly) and the episodic sampling loop are assumptions for illustration.
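```python
import torch
import torch.nn as nn

class ToyPHM(nn.Module):
    """Minimal sketch of the Prototypical Head Module's attention step.

    Class-specific prototypes E = {e_r} (assumed here to be averaged
    support-set object features for R categories) refine each proposal
    feature via cross-attention before the prediction layer.
    """
    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, f_o: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # f_o: (T, d) proposal features; e: (R, d) class prototypes.
        q, kv = f_o.unsqueeze(0), e.unsqueeze(0)
        refined, _ = self.attn(q, kv, kv)
        return f_o + refined.squeeze(0)   # assumed residual refinement

# One episodic-training step samples a few-shot task; here R = 5 class
# prototypes stand in for features computed from the sampled support set.
e = torch.randn(5, 256)
f_o = torch.randn(128, 256)
print(ToyPHM()(f_o, e).shape)     # torch.Size([128, 256])
```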