Transformers [39] are emerging machine learning models that have undergone explosive development in natural language processing [8] and computer vision [3,9]. Unlike Convolutional Neural Networks [22], which operate on a fixed image lattice, the attention mechanism in Transformers [39] injects positional embeddings into individual tokens that are themselves orderless. This makes Transformers a viable representation and computation framework for 3D point cloud recognition. Applying vanilla Transformers [9,39] with the standard self-attention mechanism directly to point clouds, however, leads to a sub-optimal solution (see the ablation study in Section 5.1): the weighted-sum mechanism learns an average of the tokens, which is not ideal for structure extraction. Existing works such as PCT [13] therefore rely on a special non-learnable feature engineering step that pre-extracts features for each sampled point.
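Concretely, standard single-head self-attention [39] computes
$$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
so each output token is a convex, attention-weighted average of the value vectors; such averaging tends to smooth features rather than expose the local geometric structure that point cloud recognition relies on.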
Given these challenges, we propose in this paper a new point cloud recognition method, the Position-to-Structure Attention TransFormer (PS-Former), which features two key components:
1. A learnable Condensation Layer that performs point cloud downsampling and feature extraction automatically. Unlike the Transformer-based PCT approach [13], which adopts a fixed farthest point sampling strategy and a KNN-grouping feature engineering process, we extract structural features by utilizing the internal self-attention matrix that computes the point relations.
2. A Position-to-Structure Attention mechanism that recursively enriches the structure information using the position attention branch. This differs from the standard cross-attention mechanism, in which two branches attend to each other symmetrically. An illustration can be found in Figure 4, and a code sketch follows this list.
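To make the two components above concrete, the following PyTorch sketch illustrates the underlying ideas: attention-guided downsampling in place of farthest point sampling, and an asymmetric cross-attention in which structure features query positional features. This is a minimal illustration under our own simplifying assumptions (single-head attention, an illustrative top-k scoring rule, and hypothetical names such as condense and PositionToStructureAttention); the precise PS-Former formulation is given in Section 3.

```python
# Minimal sketch of the two ideas; not the exact PS-Former layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


def condense(points: torch.Tensor, feats: torch.Tensor, k: int):
    """Attention-guided downsampling: score each point by the attention
    mass it receives under plain self-attention, then keep the top k.
    The scoring rule here is an illustrative assumption."""
    d = feats.shape[-1]
    attn = F.softmax(feats @ feats.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, N, N)
    score = attn.sum(dim=-2)                     # attention received per point, (B, N)
    idx = score.topk(k, dim=-1).indices          # indices of the k most-attended points
    expand = idx.unsqueeze(-1)
    return (points.gather(1, expand.expand(-1, -1, points.shape[-1])),
            feats.gather(1, expand.expand(-1, -1, d)))


class PositionToStructureAttention(nn.Module):
    """Asymmetric cross-attention: structure features emit the queries,
    position features supply keys/values, so positional information
    flows into the structure branch rather than symmetrically back."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # queries from the structure branch
        self.k = nn.Linear(dim, dim)  # keys from the position branch
        self.v = nn.Linear(dim, dim)  # values from the position branch

    def forward(self, structure: torch.Tensor, position: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(structure), self.k(position), self.v(position)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return structure + attn @ v  # enrich structure features with positional context
```

In the full model these operations are applied repeatedly; the exact scoring rule, normalization, and multi-head design are specified in Section 3.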
We conduct experiments on three main point cloud recognition tasks, ModelNet40 classification [44], ShapeNet part segmentation [51], and S3DIS scene segmentation [1], to evaluate the effectiveness of our proposed model. ModelNet40 classification requires recognizing the entire input point cloud, while the latter two tasks focus on labeling individual points. PS-Former achieves competitive results compared with state-of-the-art methods and improves over PCT [13], which computes features before the attention layers by grouping points based on their distances in the original 3D space.
2. Related Work
3D Point Cloud Recognition. Due to the point cloud’s non-grid data structure, a number of prior works first convert the points into a grid data structure, e.g., 2D images or 3D voxels [12,37]. After this pre-processing, the grid data can be directly learned by CNN-like structures [22]. However, the conversion from point clouds to volumetric data may lead to information loss, e.g., occlusion in 2D images and resolution bottlenecks in 3D voxels. The seminal work of PointNet [32] instead performs learning directly on the original point cloud data, adopting a max-pooling operation to attain permutation invariance over the point sets. Since PointNet [32], a wealth of methods has been proposed along this direction. Some methods [38,43,47] attempt to simulate the convolution process of 2D images, whereas other approaches [40,48] adopt graph convolutional networks (GCNs) to build connections among neighboring points for feature extraction.
Transformer Architecture. A notable recent development in natural language processing is the invention and widespread adoption of Transformer architectures [7,39]. At their core, Transformers [39] model the relations among tokens with two attention mechanisms, namely self-attention and cross-attention. Another important component of Transformers is the positional encoding, which embeds position information into the tokens and relaxes the requirement of maintaining an order for the input data. Recently, Transformers have also been successfully adopted for image classification [9] and object detection [3]. Typically, vision Transformers adopt absolute/relative positional encoding strategies similar to those used in language Transformers [36,39] to encode the grid structure of 2D images. In 3D point cloud tasks, the input data contain only 3D coordinates without any texture information, which makes computing structure features from positions a challenging task. In point cloud recognition, Transformers were adopted in [46]; PCT [13] applies the sampling and grouping operations introduced by PointNet++ [34] to capture structure information.
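As a concrete example of such an encoding, the absolute sinusoidal scheme of [39] assigns a 1D token position $pos$ the vector
$$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d}}\right),$$
which presupposes a canonical ordering of tokens; unordered 3D point coordinates admit no such ordering, underlining the challenge noted above.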
Compared to existing 3D point cloud recognition methods [24,28,41,45], our proposed PS-Former model consists of learnable components that are generic and easy to adapt. In contrast to competing Transformer-based approaches such as PCT [13], PS-Former (1) replaces the non-learnable feature engineering stage with a Condensation Layer, and (2) incorporates a newly designed Position-to-Structure Attention mechanism that learns informative features from point positions and their neighboring structures.
3. Method
In this section, we present our proposed model, the Position-to-Structure Transformer (PS-Former). We first give an overview of the model, followed by descriptions of its two key components: the Position-to-Structure Attention Layer and the Condensation Layer.