categories often appear together in a scene. The ZSS problem studied in this paper belongs to transductive GZSL.
Impressive progress on ZSS has been achieved for 2D images [3, 20, 21, 26, 43]. These methods typically generate fake features of unseen categories for training the classifier, or enhance the structural consistency between the visual features and the semantic representation. ZSS is not yet fully explored in the 3D point cloud scenario. To the best of our knowledge, only one method [30] investigates this problem. It generates unseen-class features from semantic embeddings for training the classifier. However, the fine-grained relationship between the language and the 3D geometric elements of seen and unseen categories, which is important for reasoning about unseen object types, is not explicitly considered.
In this paper, we investigate transductive zero-shot segmentation (ZSS) on the 3D point cloud, i.e., the visual features of unseen categories are available during training [30, 43]. Our key observation is that 3D geometric elements are essential cues that imply a novel 3D object type (Figure 1). For example, chairs and sofas share similar geometric elements, such as armrests, backrests and cushions, and they are also close in the semantic embedding space (Figure 4). Based on this observation, we propose a novel framework that learns geometric primitives shared between seen and unseen categories and aligns the language semantics with the learned geometric primitives in a fine-grained manner. Specifically, inspired by the bag-of-words model [19, 39], we formulate a novel point visual representation that encodes geometric primitive information, i.e., the similarity vector of the point’s feature to the geometric primitives, where the geometric primitives are a group of learnable prototypes updated by back-propagation. To bridge the language and the geometric primitives, we first construct the language semantic representation as a mixture-distributed embedding, because a 3D object is composed of multiple geometric primitives.
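To make the encoding concrete, the following is a minimal PyTorch sketch of such a similarity-vector representation over learnable prototypes; the module name, dimensions, and softmax temperature are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricPrimitives(nn.Module):
    """Learnable prototypes; each row is one geometric primitive (sketch)."""

    def __init__(self, num_primitives: int = 64, feat_dim: int = 128):
        super().__init__()
        # Prototypes are ordinary parameters, so they are updated by
        # back-propagation together with the backbone.
        self.prototypes = nn.Parameter(torch.randn(num_primitives, feat_dim))

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (N, feat_dim) per-point features from a 3D backbone.
        # A point's visual representation is its similarity vector to all
        # primitives, in the spirit of a bag-of-words encoding.
        f = F.normalize(point_feats, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        sim = f @ p.t()                       # (N, num_primitives) cosine similarity
        return F.softmax(sim / 0.1, dim=-1)   # temperature 0.1 is an assumption
```

In this view, categories that share parts, such as chairs and sofas, activate overlapping primitives, which is what allows the representation to transfer to unseen classes.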
Besides, the network is naturally biased towards the seen classes, leading to significant misclassification of the unseen classes (Figure 5). To this end, we propose an Unknown-aware InfoNCE Loss that aligns the visual and semantic representations in a fine-grained manner while alleviating the misclassification issue. Essentially, it pushes the unseen visual representations away from the seen categories’ semantic representations (Figure 4), enabling the network to distinguish seen from unseen objects. At inference time, under the guidance of the semantic representations, a novel object is represented with the learned geometric primitives and can be classified into the correct class.
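To make the alignment concrete, below is a small PyTorch sketch of an unknown-aware InfoNCE-style objective; the function name, the explicit repulsion term, and the loss weight are our own assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def unknown_aware_infonce(visual, semantic, labels, seen_mask, tau=0.07):
    """Illustrative sketch, not the paper's exact loss.

    visual:    (N, D) per-point visual representations.
    semantic:  (C, D) per-class semantic representations (seen + unseen).
    labels:    (N,) class indices; for unseen points these are category-level
               assignments from the transductive setting, not per-point GT.
    seen_mask: (C,) bool, True for seen classes.
    """
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = v @ s.t() / tau                 # (N, C) scaled cosine similarities
    # InfoNCE alignment: pull each point toward its own class embedding and
    # away from all other class embeddings.
    loss_align = F.cross_entropy(logits, labels)
    # Unknown-aware repulsion: penalize unseen points' similarity to
    # seen-class anchors, so the network separates seen from unseen objects.
    unseen_pts = ~seen_mask[labels]          # points whose class is unseen
    if unseen_pts.any():
        seen_logits = logits[unseen_pts][:, seen_mask]
        loss_repel = torch.logsumexp(seen_logits, dim=-1).mean()
    else:
        loss_repel = logits.new_zeros(())
    return loss_align + 0.1 * loss_repel     # 0.1 weight is an assumption
```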
Extensive experiments conducted on the S3DIS, ScanNet, SemanticKITTI and nuScenes datasets show that our method outperforms other state-of-the-art methods, with improvements of 17.8%, 30.4%, 9.2% and 7.9%, respectively.
The contributions of our work are as follows.
• To solve transductive zero-shot segmentation on the 3D point cloud, we propose a novel framework that models the fine-grained relationship between language and geometric primitives to transfer knowledge from seen to unseen categories.
• We propose an Unknown-aware InfoNCE Loss for fine-grained visual and semantic representation alignment across seen and unseen categories.
• Our method achieves state-of-the-art performance for zero-shot point cloud segmentation on the S3DIS, ScanNet, SemanticKITTI and nuScenes datasets.
2 RELATED WORK
2.1 Zero-Shot Segmentation on 2D Images
Zero-shot semantic segmentation (ZSS) is dominated by generalized zero-shot learning because objects of seen and unseen categories often appear together in a scene. ZS3Net [3] generates pixel-wise fake features from the semantic information of unseen classes and then integrates the real features of seen classes for training the classifier. Gu et al. [20] further improve ZS3Net by introducing a contextual module that generates context-aware visual features from semantic information. Li et al. [26] propose a Consistent Structural Relation Learning (CSRL) approach to model category-level semantic relations and to learn a better visual feature generator. However, they improperly use the ground-truth pixel locations of each unseen object for fake feature generation. Hu et al. [21] improve performance by suppressing noisy and outlying training samples from seen classes with Bayesian uncertainty estimation. Still, there is an obvious bias between real and fake features that hinders knowledge transfer from seen classes to unseen classes. Zhang et al. [43] replace the unseen objects with other images to generate training samples and perform segmentation with prototypical matching and open-set rejection. Lv et al. [29] mitigate this issue using a transductive setting that exploits both labelled seen images and unlabelled unseen images for training. In this paper, we follow the transductive setting that leverages unseen objects’ features for supervision, while the ground-truth pixel locations of individual unseen objects are not accessible, which naturally fits the semantic segmentation task.
2.2 Zero-Shot Learning on 3D Point Cloud
Unlike the promising progress of zero-shot learning on 2D images, few studies have been conducted on the 3D point cloud. Some methods [11–14] study point cloud classification. Cheraghian et al. [14] adapt PointNet [34] to extract object representations and GloVe [33] or W2V [32] to obtain semantic information for reasoning about unseen object types. Cheraghian et al. [11] adopt the GZSL setting and propose a loss function composed of a regression term [44] and a skewness term [35, 36] to alleviate the hubness problem, i.e., the tendency of the model to predict only a few target classes for most of the test instances. Cheraghian et al. [12] further improve [11] by using a triplet loss. To the best of our knowledge, only one method [30] has been proposed for semantic segmentation; it generates fake features with class prototypes for training the classifier. However, it does not explicitly consider the 3D geometric elements shared across the seen and unseen categories, which are important cues for aligning the 3D visual features and the semantic embedding. In this paper, our method learns geometric primitives that transfer knowledge from seen to unseen categories for more accurate reasoning about unseen-category objects. Besides, instead of generating fake features for the unseen classes, we directly leverage the unseen visual features extracted from the backbone network, which is more natural under the transductive setting.