vector attention inherits the merits of both vector attention and multi-head attention while being more
powerful and efficient.
Furthermore, point positions carry important geometric information for 3D semantic understanding, so the positional relationships among 3D points are more critical than those among 2D pixels. However, previous 3D position encoding schemes mostly follow 2D ones and do not fully exploit the geometric knowledge in 3D coordinates. To this end, we strengthen the position encoding mechanism by applying an additional position encoding multiplier to the relation vector. This design strengthens the positional relationship information in the model, and we validate its effectiveness in our experiments.
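To make the multiplicative scheme concrete, the following is a minimal NumPy sketch of applying both an additive position bias and a multiplicative position encoding to a relation vector. All names (`relation`, `delta_p`, `pos_mult`, `pos_bias`) and the stand-in `mlp` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 16                              # n neighbor pairs, c channels

relation = rng.standard_normal((n, c))    # query-key relation vector
delta_p = rng.standard_normal((n, 3))     # relative 3D coordinates

def mlp(x, out_dim, seed):
    # stand-in for a learned MLP: one random linear layer + ReLU
    w = np.random.default_rng(seed).standard_normal((x.shape[1], out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)

pos_bias = mlp(delta_p, c, seed=1)        # conventional additive encoding
pos_mult = mlp(delta_p, c, seed=2)        # additional multiplicative encoding

# the multiplier scales the relation vector; the bias is then added
strengthened = relation * pos_mult + pos_bias
print(strengthened.shape)
```

The multiplier lets the relative coordinates gate each relation channel directly, rather than only shifting it.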
Moreover, it is worth noting that the irregular, non-uniform spatial distribution of points poses a significant challenge to pooling modules for point cloud processing. Previous point cloud pooling approaches rely on a combination of sampling methods (e.g., farthest point sampling [4] or grid sampling [5]) and neighbor query methods (e.g., kNN or radius query), which is time-consuming and not spatially well aligned. To overcome this problem, we go beyond the pooling paradigm of combining sampling and query, and instead divide the point cloud into non-overlapping partitions, directly fusing points within the same partition. We use uniform grids as the partition divider and achieve significant improvements.
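The partition-based pooling described above can be sketched as follows: assign each point to a uniform grid cell and fuse all points in the same cell. This is a minimal sketch assuming mean fusion; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def grid_pool(coords, feats, grid_size):
    """Fuse all points falling into the same uniform grid cell (mean fusion)."""
    cells = np.floor(coords / grid_size).astype(np.int64)      # (n, 3) cell indices
    # unique cells; `inverse` maps each point to its cell id
    uniq, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    pooled_feats = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(pooled_feats, inverse, feats)                    # sum per cell
    counts = np.bincount(inverse).astype(float)
    pooled_feats /= counts[:, None]                            # mean per cell
    pooled_coords = uniq * grid_size + grid_size / 2           # cell centers
    return pooled_coords, pooled_feats

coords = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.1, 0.0, 0.0]])
feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
pc, pf = grid_pool(coords, feats, grid_size=1.0)
print(pf)  # first two points share a cell: [[2., 3.], [5., 6.]]
```

Because the partitions are non-overlapping, every point is aggregated exactly once, avoiding the sampling-plus-query pipeline entirely.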
In conclusion, we propose Point Transformer V2, which improves Point Transformer [1] from several perspectives:
• We propose an effective grouped vector attention (GVA) with a novel weight encoding layer that enables efficient information exchange within and among attention groups.
• We introduce an improved position encoding scheme to better utilize point cloud coordinates and further enhance the spatial reasoning ability of the model.
• We design a partition-based pooling strategy that enables more efficient and spatially better-aligned information aggregation than previous methods.
We conducted extensive analysis and controlled experiments to validate our designs. The results indicate that PTv2 outperforms its predecessors and sets a new state of the art on various 3D understanding tasks.
2 Related Works
Image transformers.
With the great success of ViT [6], the dominance of convolution in vision tasks has been shaken, and transformers have become a trend in 2D image understanding [7, 8, 9, 10]. ViT introduces the far-reaching scaled dot-product self-attention and multi-head self-attention mechanism [3] from NLP into vision by treating image patches as tokens. However, operating global attention on the entire image consumes excessive memory. To address this, Swin Transformer [7] introduces a grid-based local attention mechanism that operates transformer blocks within a sequence of shifted windows.
Point cloud understanding.
Learning-based methods for processing 3D point clouds can be classified into the following types: projection-based, voxel-based, and point-based networks. An intuitive way to process irregular inputs like point clouds is to transform irregular representations into regular ones. Projection-based methods project 3D point clouds onto various image planes and utilize 2D CNN-based backbones to extract feature representations [11, 12, 13, 14]. An alternative approach operates convolutions in 3D by transforming irregular point clouds into regular voxel representations [15, 16]. These voxel-based methods suffered from inefficiency due to the sparsity of point clouds until the introduction and implementation of sparse convolution [17, 18]. Point-based methods extract features directly from the point cloud rather than projecting or quantizing irregular point clouds onto regular grids in 2D or 3D [19, 4, 20, 5]. The recently proposed transformer-based point cloud understanding approaches, introduced in the next paragraph, also fall into the category of point-based methods.
Point cloud transformers.
Transformer-based networks belong to the category of point-based networks for point cloud understanding. During the surge of research on vision transformers, Zhao et al. [1] and Guo et al. [21] concurrently published their explorations of applying attention to point cloud understanding, becoming pioneers in this direction. PCT [21], proposed by Guo et al., performs global attention directly on the point cloud. Their work, similar to ViT, is limited