categories often appear together in a scene. The ZSS problem studied in this paper belongs to transductive GZSL.
Impressive progress on ZSS has been achieved for 2D images [3, 20, 21, 26, 43]. These methods typically generate fake features of unseen categories for training the classifier, or enhance the structural consistency between the visual features and the semantic representation. ZSS is not yet fully explored in the 3D point cloud scenario. To the best of our knowledge, only one method [30] investigates this problem. It generates unseen-class features from semantic embeddings for training the classifier. However, the fine-grained relationship between the language and the 3D geometric elements of seen and unseen categories, which is important for reasoning about unseen object types, is not explicitly considered.
In this paper, we investigate transductive zero-shot segmentation (ZSS) on the 3D point cloud, i.e., the visual features of unseen categories are available during training [30, 43]. Our key observation is that 3D geometric elements are essential cues that imply a novel 3D object type (Figure 1). For example, chairs and sofas share similar geometric elements, such as armrests, backrests and cushions, and they are also close in the semantic embedding space (Figure 4). Based on this observation, we propose a novel framework that learns geometric primitives shared between seen and unseen categories and aligns the language semantics with the learned geometric primitives in a fine-grained manner. Specifically, inspired by the bag-of-words model [19, 39], we formulate a novel point visual representation that encodes geometric primitive information, i.e., the similarity vector of the point’s feature to the geometric primitives, where the geometric primitives are a group of learnable prototypes updated by back-propagation. To bridge the language and the geometric primitives, we first construct the language semantic representation as a mixture-distributed embedding, because a 3D object is composed of multiple geometric primitives.
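To make the encoding concrete, the following is a minimal PyTorch sketch of such a similarity-vector representation over learnable prototypes; the module name, dimensions, and softmax temperature are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricPrimitives(nn.Module):
    """Learnable prototypes; each row is one geometric primitive (sketch)."""

    def __init__(self, num_primitives: int = 64, feat_dim: int = 128):
        super().__init__()
        # Prototypes are ordinary parameters, so they are updated by
        # back-propagation together with the backbone.
        self.prototypes = nn.Parameter(torch.randn(num_primitives, feat_dim))

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (N, feat_dim) per-point features from a 3D backbone.
        # A point's visual representation is its similarity vector to all
        # primitives, in the spirit of a bag-of-words encoding.
        f = F.normalize(point_feats, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        sim = f @ p.t()                       # (N, num_primitives) cosine similarity
        return F.softmax(sim / 0.1, dim=-1)   # temperature 0.1 is an assumption
```

In this view, categories that share parts, such as chairs and sofas, activate overlapping primitives, which is what allows the representation to transfer to unseen classes.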
Besides, the network is naturally biased towards the seen classes, leading to significant misclassification of the unseen classes (Figure 5). To this end, we propose an Unknown-aware InfoNCE Loss that aligns the visual and semantic representations in a fine-grained manner while alleviating the misclassification issue. Essentially, it pushes the unseen visual representations away from the seen categories’ semantic representations (Figure 4), enabling the network to distinguish seen from unseen objects. At inference time, under the guidance of the semantic representations, a novel object is represented with the learned geometric primitives and can be classified into the correct class.
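To make the alignment concrete, below is a small PyTorch sketch of an unknown-aware InfoNCE-style objective; the function name, the explicit repulsion term, and the loss weight are our own assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def unknown_aware_infonce(visual, semantic, labels, seen_mask, tau=0.07):
    """Illustrative sketch, not the paper's exact loss.

    visual:    (N, D) per-point visual representations.
    semantic:  (C, D) per-class semantic representations (seen + unseen).
    labels:    (N,) class indices; for unseen points these are category-level
               assignments from the transductive setting, not per-point GT.
    seen_mask: (C,) bool, True for seen classes.
    """
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = v @ s.t() / tau                 # (N, C) scaled cosine similarities
    # InfoNCE alignment: pull each point toward its own class embedding and
    # away from all other class embeddings.
    loss_align = F.cross_entropy(logits, labels)
    # Unknown-aware repulsion: penalize unseen points' similarity to
    # seen-class anchors, so the network separates seen from unseen objects.
    unseen_pts = ~seen_mask[labels]          # points whose class is unseen
    if unseen_pts.any():
        seen_logits = logits[unseen_pts][:, seen_mask]
        loss_repel = torch.logsumexp(seen_logits, dim=-1).mean()
    else:
        loss_repel = logits.new_zeros(())
    return loss_align + 0.1 * loss_repel     # 0.1 weight is an assumption
```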
Extensive experiments conducted on the S3DIS, ScanNet, SemanticKITTI and nuScenes datasets show that our method outperforms other state-of-the-art methods, with improvements of 17.8%, 30.4%, 9.2% and 7.9%, respectively.
The contributions of our work are as follows.
• To solve transductive zero-shot segmentation on the 3D point cloud, we propose a novel framework that models the fine-grained relationship between language and geometric primitives to transfer knowledge from seen to unseen categories.
• We propose an Unknown-aware InfoNCE Loss for fine-grained visual and semantic representation alignment across seen and unseen categories.
• Our method achieves state-of-the-art performance for zero-shot point cloud segmentation on the S3DIS, ScanNet, SemanticKITTI and nuScenes datasets.
2 RELATED WORK
2.1 Zero-Shot Segmentation on 2D Images
Zero-shot semantic segmentation (ZSS) is dominated by generalized zero-shot learning because objects of seen and unseen categories often appear together in a scene. ZS3Net [3] generates pixel-wise fake features from the semantic information of unseen classes and then integrates the real features of seen classes for training the classifier. Gu et al. [20] further improve ZS3Net by introducing a contextual module that generates context-aware visual features from semantic information. Li et al. [26] propose a Consistent Structural Relation Learning (CSRL) approach to model category-level semantic relations and to learn a better visual feature generator. However, they improperly use the ground-truth pixel locations of each unseen object for fake feature generation. Hu et al. [21] improve performance by suppressing noisy and outlying training samples from seen classes with Bayesian uncertainty estimation. Still, there is an obvious bias between real and fake features that hinders knowledge transfer from seen classes to unseen classes. Zhang et al. [43] replace the unseen objects with other images to generate training samples and perform segmentation with prototypical matching and open-set rejection. Lv et al. [29] mitigate this issue using a transductive setting that exploits both labelled seen images and unlabelled unseen images for training. In this paper, we follow the transductive setting that leverages unseen objects’ features for supervision, while the ground-truth pixel locations of individual unseen objects are not accessible, which naturally fits the semantic segmentation task.
2.2 Zero-Shot Learning on 3D Point Cloud
Unlike the promising progress of zero-shot learning on 2D images, few studies have been conducted on the 3D point cloud. Some methods [11–14] study point cloud classification. Cheraghian et al. [14] adapt PointNet [34] to extract object representations and GloVe [33] or W2V [32] to obtain semantic information for reasoning about unseen object types. Cheraghian et al. [11] adopt the GZSL setting and propose a loss function composed of a regression term [44] and a skewness term [35, 36] to alleviate the hubness problem, i.e., the tendency of the model to predict only a few target classes for most of the test instances. Cheraghian et al. [12] further improve [11] by using a triplet loss. To the best of our knowledge, only one method [30] has been proposed for semantic segmentation; it generates fake features with class prototypes for training the classifier. However, it does not explicitly consider the 3D geometric elements shared across the seen and unseen categories, which are important cues for aligning the 3D visual features and the semantic embedding. In this paper, our method learns geometric primitives that transfer knowledge from seen to unseen categories for more accurate reasoning about unseen-category objects. Besides, instead of generating fake features for the unseen classes, we directly leverage the unseen visual features extracted from the backbone network, which is more natural under the transductive setting.