Transformers [39] are emerging machine learning models that have undergone explosive development in natural language processing [8] and computer vision [3,9]. Unlike Convolutional Neural Networks [22], which operate on a fixed image lattice, the attention mechanism in Transformers [39] injects positional embeddings into individual tokens that are themselves orderless. This makes Transformers a viable representation and computation framework for 3D point cloud recognition. Applying vanilla Transformers [9,39] with the standard self-attention mechanism directly to point clouds, however, leads to a sub-optimal solution (see the ablation study in Section 5.1): the weighted-sum mechanism learns an average of the tokens, which is not ideal for structure extraction. Existing works such as PCT [13] therefore rely on a special non-learnable feature engineering step that pre-extracts features for each sampled point.
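Concretely, standard single-head self-attention [39] computes
$$\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
so each output token is a convex, attention-weighted average of the value vectors; such averaging tends to smooth features rather than expose the local geometric structure that point cloud recognition relies on.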
Given these challenges, we propose in this paper a new point cloud recognition method, the Position-to-Structure Attention TransFormer (PS-Former), which features two key components:
1. A learnable Condensation Layer that performs point cloud downsampling and feature extraction automatically. Unlike the Transformer-based PCT approach [13], which adopts a fixed farthest point sampling strategy and a KNN-grouping feature engineering process, we extract structural features by utilizing the internal self-attention matrix that computes the point relations.
2. A Position-to-Structure Attention mechanism that recursively enriches the structure information using the position attention branch. This differs from the standard cross-attention mechanism, in which two branches attend to each other symmetrically. An illustration can be found in Figure 4, and a code sketch follows this list.
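To make the two components above concrete, the following PyTorch sketch illustrates the underlying ideas: attention-guided downsampling in place of farthest point sampling, and an asymmetric cross-attention in which structure features query positional features. This is a minimal illustration under our own simplifying assumptions (single-head attention, an illustrative top-k scoring rule, and hypothetical names such as condense and PositionToStructureAttention); the precise PS-Former formulation is given in Section 3.

```python
# Minimal sketch of the two ideas; not the exact PS-Former layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


def condense(points: torch.Tensor, feats: torch.Tensor, k: int):
    """Attention-guided downsampling: score each point by the attention
    mass it receives under plain self-attention, then keep the top k.
    The scoring rule here is an illustrative assumption."""
    d = feats.shape[-1]
    attn = F.softmax(feats @ feats.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, N, N)
    score = attn.sum(dim=-2)                     # attention received per point, (B, N)
    idx = score.topk(k, dim=-1).indices          # indices of the k most-attended points
    expand = idx.unsqueeze(-1)
    return (points.gather(1, expand.expand(-1, -1, points.shape[-1])),
            feats.gather(1, expand.expand(-1, -1, d)))


class PositionToStructureAttention(nn.Module):
    """Asymmetric cross-attention: structure features emit the queries,
    position features supply keys/values, so positional information
    flows into the structure branch rather than symmetrically back."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # queries from the structure branch
        self.k = nn.Linear(dim, dim)  # keys from the position branch
        self.v = nn.Linear(dim, dim)  # values from the position branch

    def forward(self, structure: torch.Tensor, position: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(structure), self.k(position), self.v(position)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return structure + attn @ v  # enrich structure features with positional context
```

In the full model these operations are applied repeatedly; the exact scoring rule, normalization, and multi-head design are specified in Section 3.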
We conduct experiments on three main point cloud recognition tasks, ModelNet40 classification [44], ShapeNet part segmentation [51], and S3DIS scene segmentation [1], to evaluate the effectiveness of our proposed model. ModelNet40 classification requires recognizing the entire input point cloud, while the latter two tasks focus on labeling individual points. PS-Former achieves competitive results compared with state-of-the-art methods and improves over PCT [13], which computes features before the attention layers by grouping points based on their distances in the original 3D space.
2. Related Work
3D Point Cloud Recognition. Due to the point cloud’s non-grid data structure, a number of prior works first convert the points into a grid data structure, e.g., 2D images or 3D voxels [12,37]. After this pre-processing, the grid data can be directly learned by CNN-like structures [22]. However, the conversion from point clouds to volumetric data may lead to information loss, e.g., occlusion in 2D images and resolution bottlenecks in 3D voxels. The seminal work of PointNet [32] instead performs learning directly on the original point cloud data, adopting a max-pooling operation to attain permutation invariance over the point sets. Since PointNet [32], a wealth of methods has been proposed along this direction. Some methods [38,43,47] attempt to simulate the convolution process of 2D images, whereas other approaches [40,48] adopt graph convolutional networks (GCNs) to build connections among neighboring points for feature extraction.
Transformer Architecture. A notable recent development in natural language processing is the invention and widespread adoption of Transformer architectures [7,39]. At their core, Transformers [39] model the relations among tokens with two attention mechanisms, namely self-attention and cross-attention. Another important component of Transformers is the positional encoding, which embeds position information into the tokens and relaxes the requirement of maintaining an order for the input data. Recently, Transformers have also been successfully adopted for image classification [9] and object detection [3]. Typically, vision Transformers adopt absolute/relative positional encoding strategies similar to those used in language Transformers [36,39] to encode the grid structure of 2D images. In 3D point cloud tasks, the input data contain only 3D coordinates without any texture information, which makes computing structure features from positions a challenging task. In point cloud recognition, Transformers were adopted in [46]; PCT [13] applies the sampling and grouping operations introduced by PointNet++ [34] to capture structure information.
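As a concrete example of such an encoding, the absolute sinusoidal scheme of [39] assigns a 1D token position $pos$ the vector
$$PE_{(pos,\,2i)}=\sin\!\left(\frac{pos}{10000^{2i/d}}\right),\qquad PE_{(pos,\,2i+1)}=\cos\!\left(\frac{pos}{10000^{2i/d}}\right),$$
which presupposes a canonical ordering of tokens; unordered 3D point coordinates admit no such ordering, underlining the challenge noted above.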
Compared to existing 3D point cloud recognition methods [24,28,41,45], our proposed PS-Former model consists of learnable components that are generic and easy to adapt. In contrast to competing Transformer-based approaches such as PCT [13], PS-Former (1) replaces the non-learnable feature engineering stage with a Condensation Layer, and (2) incorporates a newly designed Position-to-Structure Attention mechanism that learns informative features from point positions and their neighboring structures.
3. Method
In this section, we present our proposed model, the Position-to-Structure Transformer (PS-Former). We first give an overview of the model, followed by descriptions of its two key components: the Position-to-Structure Attention Layer and the Condensation Layer.