Bridging Language and Geometric Primitives for Zero-shot Point Cloud Segmentation

Runnan Chen
The University of Hong Kong
Xinge Zhu
The Chinese University of Hong Kong
Nenglun Chen
The University of Hong Kong
Wei Li
Inceptio
Yuexin Ma
ShanghaiTech University
Ruigang Yang
Inceptio
Wenping Wang
Texas A&M University
ABSTRACT
We investigate transductive zero-shot point cloud semantic segmentation, where the network is trained on seen objects and able to segment unseen objects. The 3D geometric elements are essential cues to imply a novel 3D object type. However, previous methods neglect the fine-grained relationship between language and the 3D geometric elements. To this end, we propose a novel framework to learn the geometric primitives shared by seen and unseen categories' objects, and employ a fine-grained alignment between language and the learned geometric primitives. Therefore, guided by language, the network recognizes novel objects represented with geometric primitives. Specifically, we formulate a novel point visual representation, the similarity vector of the point's feature to the learnable prototypes, where the prototypes automatically encode geometric primitives via back-propagation. Besides, we propose a novel Unknown-aware InfoNCE Loss to align the visual representation with language in a fine-grained manner. Extensive experiments show that our method significantly outperforms other state-of-the-art methods in harmonic mean intersection-over-union (hIoU), with improvements of 17.8%, 30.4%, 9.2% and 7.9% on the S3DIS, ScanNet, SemanticKITTI and nuScenes datasets, respectively. Code is available1.
KEYWORDS
Zero-shot learning, semantic segmentation, point cloud
ACM Reference Format:
Runnan Chen, Xinge Zhu, Nenglun Chen, Wei Li, Yuexin Ma, Ruigang Yang,
and Wenping Wang. 2023. Bridging Language and Geometric Primitives
for Zero-shot Point Cloud Segmentation. In Proceedings of the 31st ACM
International Conference on Multimedia (MM ’23), October 29–November
3, 2023, Ottawa, ON, Canada. ACM, New York, NY, USA, 9 pages. https:
//doi.org/10.1145/3581783.3612409
1https://github.com/runnanchen/Zero-Shot-Point-Cloud-Segmentation.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM ’23, October 29–November 3, 2023, Ottawa, ON, Canada.
©2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0108-5/23/10. . . $15.00
https://doi.org/10.1145/3581783.3612409
[Figure 1: diagram mapping seen categories (table, desk) and unseen categories (chairs, sofa, bookshelf, cabinet) to shared geometric primitives (cuboid, cube, sphere, cone, cylinder, torus).]
Figure 1: A 3D object consists of geometric primitives such as cuboid, cube, cylinder, etc. The 3D geometric elements are essential cues that imply a novel 3D object type. For example, both the table and the desk have cuboid and cylinder structures (green and blue dashed lines) and similar language semantics (Figure 4). Our method employs fine-grained alignment between language ("table", "desk", etc.) and the geometric primitives shared by seen and unseen categories. Therefore, guided by language, the network is able to recognize novel objects represented with the learned geometric primitives.
1 INTRODUCTION
Semantic segmentation on the point cloud is a fundamental task in
3D scene understanding, boosting the development of autonomous
driving, service robots, digital cities, etc. Although some recent methods [6, 10, 16, 22–25, 27, 38, 41, 42, 45] achieve promising performance, they heavily rely on labour-intensive annotations for supervision. By leveraging word embeddings as auxiliary information, zero-shot semantic segmentation can recognize unseen objects whose labels are unavailable. It is beneficial for visual perception in a new scene that contains novel objects. It can also serve as a pre-annotation tool for automatically labelling novel objects [5, 7–9, 28, 30].
Zero-shot learning (ZSL) focuses on transferring knowledge from seen to unseen categories. The preliminary ZSL setting predicts only unseen categories, while generalized ZSL (GZSL) predicts both seen and unseen categories. In terms of training data, ZSL includes inductive and transductive settings [30, 43]. Only the seen-class samples and labels are available for training the network in the inductive setting, whereas in the transductive setting the unlabeled objects of unseen classes are also accessible. In the semantic segmentation scenario, transductive GZSL is the more common setting for zero-shot segmentation (ZSS) because seen and unseen categories often appear together in a scene. The ZSS problem we study belongs to transductive GZSL.

arXiv:2210.09923v3 [cs.CV] 29 Sep 2023
Impressive progress on ZSS has been achieved on 2D images [3, 20, 21, 26, 43]. These methods typically generate fake features of unseen categories for training the classifier, or enhance the structural consistency between visual features and semantic representations. ZSS is not fully explored in the 3D point cloud scenario. To the best of our knowledge, only one method [30] investigates this problem; it generates unseen-class features with semantic embeddings for training the classifier. However, the fine-grained relationship between the language and the 3D geometric elements in seen and unseen categories, which is important for reasoning about unseen object types, is not explicitly considered.
In this paper, we investigate transductive zero-shot segmentation (ZSS) on the 3D point cloud, i.e., the visual features of unseen categories are available during training [30, 43]. Our key observation is that 3D geometric elements are essential cues that imply a novel 3D object type (Figure 1). For example, chairs and sofas have similar geometric elements such as armrests, backrests and cushions, and they are also close in the semantic embedding space (Figure 4). Based on this observation, we propose a novel framework that learns the geometric primitives shared by seen and unseen categories and aligns the language semantics with the learned geometric primitives in a fine-grained manner. Specifically, inspired by the bag-of-words model [19, 39], we formulate a novel point visual representation that encodes geometric primitive information, i.e., the similarity vector of a point's feature to the geometric primitives, where the geometric primitives are a group of learnable prototypes updated by back-propagation. To bridge the language and the geometric primitives, we first construct the language semantic representation as a mixture-distributed embedding, because a 3D object is composed of multiple geometric primitives. Besides, the network is naturally biased towards the seen classes, leading to significant misclassification of the unseen classes (Figure 5). To this end, we propose an Unknown-aware InfoNCE Loss that aligns the visual and semantic representations in a fine-grained manner while alleviating the misclassification issue. Essentially, it pushes the unseen visual representations away from the seen categories' semantic representations (Figure 4), enabling the network to distinguish seen from unseen objects. At inference time, under the guidance of the semantic representations, a novel object is represented with the learned geometric primitives and can be classified into the correct class.
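As a concrete illustration of the two components above, the following is a minimal PyTorch-style sketch, not the paper's exact implementation: the primitive-similarity representation maps each point feature to a similarity vector over learnable prototypes, and a schematic unknown-aware contrastive loss treats seen-class semantic anchors only as negatives for the unlabeled (unseen) points. All function names, tensor shapes and temperature values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def primitive_representation(point_feats, prototypes, tau=0.1):
    """Similarity vector of each point's feature to learnable prototypes.

    point_feats: (N, D) per-point backbone features.
    prototypes:  (K, D) learnable prototype embeddings (the geometric
                 primitives), updated by back-propagation.
    Returns an (N, K) normalized similarity vector per point.
    """
    feats = F.normalize(point_feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    return F.softmax(feats @ protos.t() / tau, dim=-1)

def unknown_aware_infonce(visual, semantic, labels, seen_mask, tau=0.07):
    """Schematic unknown-aware contrastive loss (illustrative only).

    visual:    (N, K) point representations (e.g. primitive similarities).
    semantic:  (C, K) per-class semantic anchors from word embeddings.
    labels:    (N,) class indices; only meaningful where seen_mask is True.
    seen_mask: (N,) True for points belonging to seen classes.
    """
    logits = F.normalize(visual, dim=-1) @ F.normalize(semantic, dim=-1).t() / tau
    # Seen points: standard InfoNCE pull toward their own class anchor.
    loss_seen = F.cross_entropy(logits[seen_mask], labels[seen_mask])
    # Unseen points: seen-class anchors act only as negatives, so penalize
    # any similarity to them, pushing unseen points away from seen semantics.
    seen_classes = torch.unique(labels[seen_mask])
    loss_unseen = torch.logsumexp(
        logits[~seen_mask][:, seen_classes], dim=-1).mean()
    return loss_seen + loss_unseen

# Toy usage with hypothetical sizes: 32 points, 64-dim features,
# 16 primitives, 5 classes of which {0, 1, 2} are seen.
prototypes = torch.nn.Parameter(torch.randn(16, 64))
rep = primitive_representation(torch.randn(32, 64), prototypes)
semantic = torch.randn(5, 16)
labels = torch.randint(0, 3, (32,))
seen_mask = torch.arange(32) < 24
loss = unknown_aware_infonce(rep, semantic, labels, seen_mask)
```

In training, the backbone and prototypes would receive gradients from this loss; the actual method additionally builds the semantic anchors as mixture-distributed embeddings over primitives, which this sketch omits.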
Extensive experiments conducted on the S3DIS, ScanNet, SemanticKITTI and nuScenes datasets show that our method outperforms other state-of-the-art methods, with hIoU improvements of 17.8%, 30.4%, 9.2% and 7.9%, respectively.
The contributions of our work are as follows.
• To solve transductive zero-shot segmentation on the 3D point cloud, we propose a novel framework that models the fine-grained relationship between language and geometric primitives, transferring knowledge from seen to unseen categories.
• We propose an Unknown-aware InfoNCE Loss for fine-grained visual and semantic representation alignment among seen and unseen categories.
• Our method achieves state-of-the-art performance for zero-shot point cloud segmentation on the S3DIS, ScanNet, SemanticKITTI and nuScenes datasets.
2 RELATED WORK
2.1 Zero-Shot Segmentation on 2D images
Zero-shot semantic segmentation (ZSS) is dominated by generalized zero-shot learning because objects of seen and unseen categories often appear together in a scene. ZS3Net [3] generates pixel-wise fake features from the semantic information of unseen classes and then integrates the real features of seen classes for training the classifier. Gu et al. [20] further improve ZS3Net by introducing a contextual module that generates context-aware visual features from semantic information. Li et al. [26] propose a Consistent Structural Relation Learning (CSRL) approach to model category-level semantic relations and to learn a better visual feature generator. However, these methods improperly use the ground-truth pixel locations of each unseen object for fake feature generation. Hu et al. [21] improve performance by alleviating noisy and outlying training samples from seen classes with Bayesian uncertainty estimation. There is an obvious bias between real and fake features that hinders knowledge transfer from seen to unseen classes. Zhang et al. [43] replace the unseen objects with other images to generate training samples and perform segmentation with prototypical matching and open-set rejection. Lv et al. [29] mitigate this issue using a transductive setting that uses both labelled seen images and unlabeled unseen images for training. In this paper, we follow the transductive setting that leverages unseen objects' features for supervision, where the ground-truth pixel locations of individual unseen objects are not accessible, which naturally fits the semantic segmentation scenario.
2.2 Zero-shot Learning on 3D Point Cloud
Unlike the promising progress of zero-shot learning on 2D images, few studies have been conducted on the 3D point cloud. Some methods [11–14] study point cloud classification. Cheraghian et al. [14] adapt PointNet [34] to extract object representations and use GloVe [33] or W2V [32] to obtain semantic information for reasoning about unseen object types. Cheraghian et al. [11] adopt the GZSL setting and propose a loss function composed of a regression term [44] and a skewness term [35, 36] to alleviate the hubness problem, i.e., the model may predict only a few target classes for most test instances. Cheraghian et al. [12] further improve [11] by using a triplet loss. To the best of our knowledge, only one method [30] is proposed for semantic segmentation; it generates fake features with class prototypes for training the classifier. However, it does not explicitly consider the 3D geometric elements shared across seen and unseen categories, which are important cues for aligning 3D visual features and semantic embeddings. In this paper, our method learns geometric primitives that transfer knowledge from seen to unseen categories for more accurate reasoning about unseen-category objects. Besides, instead of generating fake features for the unseen classes, we directly leverage the unseen visual features extracted from the backbone network, which is more intuitive and natural under the transductive setting.