Point Cloud Recognition with Position-to-Structure Attention Transformers
Zheng Ding
UC San Diego
James Hou
The Bishop’s School
Zhuowen Tu
UC San Diego
Abstract
In this paper, we present Position-to-Structure Attention
Transformers (PS-Former), a Transformer-based algorithm
for 3D point cloud recognition. PS-Former deals with the
challenge in 3D point cloud representation where points are
not positioned in a fixed grid structure and have limited
feature description (only 3D coordinates (x, y, z) for scat-
tered points). Existing Transformer-based architectures in
this domain often require a pre-specified feature engineer-
ing step to extract point features. Here, we introduce two
new aspects in PS-Former: 1) a learnable condensation
layer that performs point downsampling and feature extrac-
tion; and 2) a Position-to-Structure Attention mechanism
that recursively enriches the structural information with the
position attention branch. Compared with competing methods, PS-Former is generic, relies on fewer heuristic feature designs, and demonstrates competitive experimental results on three 3D point cloud tasks: classification, part segmentation, and scene segmentation.
1. Introduction
3D point cloud recognition is an active research area in
computer vision that has made steady progress in recent years [13,24,32,34,46]. The availability of large-scale
3D datasets [44] and real-world applications [15] in au-
tonomous driving [4,30], computer graphics [31], and 3D
scene understanding [16] make the task of 3D point cloud
recognition increasingly important.
Amongst recent 3D shape representations for deep-
learning based recognition tasks, including meshes [11],
voxels (or volumetric grid) [42], and implicit functions [5],
the point cloud representation [32] remains a viable choice
to represent 3D shapes due to its flexibility and effectiveness
in computation and modeling. However, adopting the point
cloud representation in downstream tasks poses some special challenges: 1) unlike the voxel-based volumetric representation, which has a fixed grid structure to which 3D convolutions [42] can be readily applied, point clouds are essentially collections of scattered points with no inherent order; 2) each
[Figure 1 diagram: input points pass through a Condensation Layer and then Position-to-Structure Layers, producing position features and structure features.]
Figure 1. Overview of our proposed method: Our PS-Former
learns rich feature descriptions together with robust point local
graphs. PS-Former is more effective at learning 3D point cloud representations than the vanilla self-attention and cross-attention mechanisms in standard Transformers, as demonstrated in the ablation study (Section 5.1). Left: Illustration of
local structure relations (local graphs of points and their neigh-
boring points). Right: PS-Former Pipeline. It first extracts the
structure relations (local graphs) of individual points via a learn-
able Condensation Layer. Then, a Position-to-Structure Attention
mechanism is applied to enrich the structure features from the po-
sition features in a recursive fashion.
sample point carries only the 3D coordinate information
(x, y, z) without a rich explicit feature description. A good understanding of the overall shape, as well as of the object parts represented by a point cloud, depends on extracting the “correct” and “informative” features of the individual points through their relations with their neighboring points (context).
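To make neighborhood-based point features concrete, here is a minimal NumPy sketch (illustrative only; the function name and the choice of relative offsets as the feature are our own simplification, not taken from any particular method) that computes, for each point, the offsets to its k nearest neighbors:

```python
import numpy as np

def knn_relative_features(points, k):
    """For each point, find its k nearest neighbors (by Euclidean distance)
    and return the relative offsets (neighbor - point) as a crude
    local-structure feature of shape (N, k, 3)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (N, N)
    np.fill_diagonal(d2, np.inf)           # exclude each point itself
    idx = np.argsort(d2, axis=1)[:, :k]    # (N, k) neighbor indices
    return points[idx] - points[:, None, :]

pts = np.array([[0., 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5], [5, 5, 6]])
feats = knn_relative_features(pts, k=2)
print(feats.shape)  # → (5, 2, 3)
```

Such offsets encode local geometry (edge directions, local density) that the raw coordinates alone do not expose.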
The main challenge: a chicken-and-egg problem.
Point clouds come in as scattered points without known
connections and relations. Point cloud recognition is a
chicken-and-egg problem: a rich feature description bene-
fits from the robust extraction of the point structure relations
(local graphs), whereas creating a reliable local graph also
depends on an informative feature description for the points.
In a nutshell, the local graph building and feature extraction
processes are tightly coupled in 3D point cloud recognition,
which is the central issue we are combating here.
arXiv:2210.02030v1 [cs.CV] 5 Oct 2022
Transformers [39] are emerging machine learning mod-
els that have undergone exploding development in natural
language processing [8] and computer vision [3,9]. Unlike Convolutional Neural Networks [22], which operate on a fixed image lattice, the attention mechanism in Transformers [39] includes positional embeddings in the individual tokens, which are themselves orderless.
This makes Transformers a viable representation and computation framework for 3D point cloud recognition. Applying vanilla Transformers to point clouds [9,39] under the standard self-attention mechanism, however, leads to a sub-optimal solution (see the ablation study in Section 5.1): the weighted-sum mechanism learns an average of the tokens, which is not ideal for structure extraction. Existing works such as PCT [13] rely on a special non-learnable feature engineering step that pre-extracts features for each sampled point.
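Both properties discussed above — the weighted-average behavior and the indifference to token order — can be seen in a small sketch of generic scaled dot-product self-attention with random weights (purely illustrative; not any paper's implementation): every output row is a softmax-weighted average of the value rows, and permuting the input points merely permutes the outputs.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Standard scaled dot-product self-attention: each output token is a
    softmax-weighted average of the value tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    s = q @ k.T / np.sqrt(k.shape[-1])
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)   # each attention row sums to 1
    return a @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                 # 8 points, raw (x, y, z) only
Wq, Wk, Wv = (rng.normal(size=(3, 16)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)

# Permutation-equivariance: reordering the points just reorders the outputs.
perm = rng.permutation(8)
assert np.allclose(self_attention(x[perm], Wq, Wk, Wv), out[perm])
```

The equivariance is what makes Transformers suitable for orderless points; the averaging is what limits structure extraction when the tokens carry only coordinates.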
Given these challenges, we propose in this paper a new point cloud recognition method, the Position-to-Structure Attention TransFormer (PS-Former), which has two key properties:
1. A learnable Condensation Layer that performs point
cloud downsampling and feature extraction automat-
ically. Unlike the Transformer-based PCT approach
[13], where a fixed strategy using farthest point sam-
pling and a feature engineering process using KNN
grouping are adopted, we extract structural features by
utilizing the internal self-attention matrix for comput-
ing the point relations.
2. A Position-to-Structure Attention mechanism that
recursively enriches the structure information using the
position attention branch. This is different from the
standard cross-attention mechanism where two work-
ing branches cross attend each other in a symmetric
way. An illustration can be found in Figure 4.
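As a rough, hypothetical sketch of this one-directional idea — the function name, tensor shapes, and residual connection below are our own simplifications, not the paper's exact layer — the structure branch issues the queries while keys and values come from the position branch, so position information flows into the structure features but not vice versa:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def position_to_structure_attention(pos, struct, Wq, Wk, Wv):
    """One-directional cross-attention: structure tokens query the position
    branch, enriching the structure features while leaving the position
    branch unchanged (unlike symmetric cross-attention)."""
    q = struct @ Wq                  # queries from the structure branch
    k, v = pos @ Wk, pos @ Wv        # keys/values from the position branch
    a = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return struct + a @ v            # residual update of structure features

rng = np.random.default_rng(1)
pos = rng.normal(size=(8, 32))       # position features for 8 points
struct = rng.normal(size=(8, 32))    # structure features for the same points
Wq, Wk, Wv = (rng.normal(size=(32, 32)) for _ in range(3))
enriched = position_to_structure_attention(pos, struct, Wq, Wk, Wv)
print(enriched.shape)  # → (8, 32)
```

Applying this update recursively, layer by layer, is what lets positional cues progressively refine the structural description.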
We conduct experiments on three main point cloud
recognition tasks including ModelNet40 classification [44],
ShapeNet part segmentation [51], and S3DIS scene segmentation [1], to evaluate the effectiveness of our proposed
model. ModelNet40 classification requires the recognition
of the entire input point cloud while the latter two focus
on single point labeling. PS-Former achieves competitive results compared with state-of-the-art methods and improves over PCT [13], which computes features before the attention layers by grouping points based on distances in the original 3D space.
2. Related Work
3D Point Cloud Recognition. Because point clouds lack a grid data structure, a number of earlier works first convert the points to a grid representation, e.g., 2D images or 3D voxels [12,37]. After this pre-processing, the grid data can be learned directly by CNN-like architectures [22]. However, this conversion from point clouds to volumetric data may lead to information loss, e.g., occlusion in 2D images and a resolution bottleneck in 3D voxels. The seminal work
of PointNet [32] chose to perform learning directly on the
original point cloud data, adopting a max-pooling operation to achieve invariance to point ordering. Since PointNet [32], a wealth of methods has been proposed along this direction. Some methods [38,43,47] attempt to simulate the convolution process of 2D images, whereas other approaches [40,48] adopt
graph convolutional neural networks (GCN) to build con-
nections for the neighboring points for feature extraction.
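A simplified NumPy sketch of this graph-based family — an EdgeConv-style operation in the spirit of DGCNN, with our own helper names and a single linear layer standing in for the learned MLP — builds a kNN graph, forms edge features, and max-pools over each point's neighbors:

```python
import numpy as np

def edge_conv(points, feats, W, k=3):
    """EdgeConv-style aggregation: build a kNN graph in coordinate space,
    form edge features [x_i, x_j - x_i], apply a shared linear map + ReLU,
    then max-pool over each point's k neighbors."""
    d2 = ((points[:, None] - points[None]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    idx = np.argsort(d2, axis=1)[:, :k]                       # (N, k)
    center = np.repeat(feats[:, None], k, axis=1)             # (N, k, F)
    edge = np.concatenate([center, feats[idx] - center], -1)  # (N, k, 2F)
    return np.maximum(edge @ W, 0).max(axis=1)                # (N, out)

rng = np.random.default_rng(2)
pts = rng.normal(size=(10, 3))
W = rng.normal(size=(6, 16))            # maps 2F = 6 edge dims to 16
out = edge_conv(pts, pts, W, k=3)       # use xyz itself as the input feature
print(out.shape)  # → (10, 16)
```

The max-pool over neighbors, like PointNet's global max-pool, keeps the operation invariant to the ordering of each point's neighbors.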
Transformer Architecture. A notable recent develop-
ment in natural language processing is the invention and
widespread adoption of the Transformer architectures [7,
39]. At the core, Transformers [39] model the relations
among tokens with two attention mechanisms, namely self-
attention and cross-attention. Another important compo-
nent of Transformers is the positional encoding that embeds
the position information into the tokens, relaxing the re-
quirement to maintain the order for the input data. Recently,
Transformers have also been successfully adopted in image
classification [9] and object detection [3]. Typically, vision
transformers adopt similar absolute/relative positional en-
coding strategies used in language transformers [36,39] to
encode grid structures from 2D images. In 3D point cloud
tasks, the input data contain only the 3D coordinates without any texture information, which makes extracting structure features from positions alone a challenging task. In point cloud recognition, Transformers
were adopted in [46]; PCT [13] applies sampling and group-
ing operations introduced by PointNet++ [34] to capture the
structure information.
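For reference, the fixed farthest point sampling step that PCT inherits from PointNet++ (applied before kNN grouping) can be sketched as follows; this greedy NumPy version is illustrative, not the original implementation:

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedy farthest point sampling: starting from point 0, repeatedly
    pick the point farthest from the already-selected set, keeping the
    sampled subset spread over the shape."""
    chosen = [0]
    dist = ((points - points[0]) ** 2).sum(-1)  # squared distance to the set
    for _ in range(m - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, ((points - points[nxt]) ** 2).sum(-1))
    return np.array(chosen)

pts = np.array([[0., 0, 0], [0.1, 0, 0], [10, 0, 0], [0, 10, 0]])
idx = farthest_point_sampling(pts, 3)
print(idx)  # → [0 2 3] (skips the near-duplicate point 1)
```

Because this sampling is fixed and non-learnable, it selects points purely by geometry, which is exactly the stage PS-Former's Condensation Layer is designed to learn instead.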
Compared to existing 3D point cloud recognition meth-
ods [24,28,41,45], our proposed PS-Former model con-
sists of learnable components that are generic and easy to
adapt. Compared with competing Transformer-based approaches such as PCT [13], PS-Former (1) replaces the non-learnable feature engineering stage with a Condensation Layer and
(2) incorporates newly designed Position-to-Structure At-
tention that learns informative features from point positions
and their neighboring structures.
3. Method
In this section, we present our proposed model, the Position-to-Structure Attention Transformer (PS-Former). We first give an
overview of our model, followed by a description of the two
key components of our model: Position-Structure Attention
Layer and Condensation Layer.