Point Transformer V2: Grouped Vector Attention
and Partition-based Pooling
Xiaoyang Wu1, Yixing Lao2, Li Jiang3, Xihui Liu1, Hengshuang Zhao1
1The University of Hong Kong   2Intel Labs   3Max Planck Institute
{xywu3, hszhao}@cs.hku.hk
Abstract
As a pioneering work exploring transformer architecture for 3D point cloud understanding, Point Transformer achieves impressive results on multiple highly competitive benchmarks. In this work, we analyze the limitations of the Point Transformer and propose our powerful and efficient Point Transformer V2 model with novel designs that overcome the limitations of previous work. In particular, we first propose group vector attention, which is more effective than the previous version of vector attention. Inheriting the advantages of both learnable weight encoding and multi-head attention, we present a highly effective implementation of grouped vector attention with a novel grouped weight encoding layer. We also strengthen the position information for attention by an additional position encoding multiplier. Furthermore, we design novel and lightweight partition-based pooling methods which enable better spatial alignment and more efficient sampling. Extensive experiments show that our model achieves better performance than its predecessor and achieves state-of-the-art on several challenging 3D point cloud understanding benchmarks, including 3D point cloud segmentation on ScanNet v2 and S3DIS and 3D point cloud classification on ModelNet40. Our code will be available at https://github.com/Gofinge/PointTransformerV2.
1 Introduction
Point Transformer (PTv1) [1] introduces self-attention networks to 3D point cloud understanding. Combining vector attention [2] with a U-Net-style encoder-decoder framework, PTv1 achieves remarkable performance on several 3D point cloud recognition tasks, including shape classification, object part segmentation, and semantic scene segmentation.
In this work, we analyze the limitations of Point Transformer (PTv1) [1] and propose a new elegant
and powerful backbone named Point Transformer V2 (PTv2). Our PTv2 improves upon PTv1 with
several novel designs, including the advanced grouped vector attention with improved position
encoding, and the efficient partition-based pooling scheme.
The vector attention layers in PTv1 utilize MLPs as the weight encoding to map the subtraction relation of query and key into an attention weight vector that can modulate the individual channels of the value vector. However, as the model goes deeper and the number of channels increases, the number of weight encoding parameters also increases drastically, leading to severe overfitting and limiting the model depth. To address this problem, we present grouped vector attention with a more parameter-efficient formulation, where the vector attention is divided into groups with shared vector attention weights. Meanwhile, we show that the well-known multi-head attention [3] and the vector attention [2, 1] are degenerate cases of our proposed grouped vector attention. Our proposed grouped
vector attention inherits the merits of both vector attention and multi-head attention while being more
powerful and efficient.
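To make the parameter growth concrete, the following is a rough back-of-the-envelope sketch (an illustration, not taken from the paper), counting the weights of a one-hidden-layer MLP weight encoding whose output has c channels (vector attention) versus g channels (grouped vector attention); the hidden width equal to c and the omission of biases are assumptions made only for simplicity.

```python
# Back-of-the-envelope illustration (not from the paper): parameters of a
# one-hidden-layer MLP weight encoding, ignoring biases. Vector attention
# maps c relation channels to c attention channels; grouped vector attention
# maps c relation channels to only g group weights, with g much smaller than c.
def weight_encoding_params(c_in: int, c_out: int, hidden: int) -> int:
    return c_in * hidden + hidden * c_out  # two linear layers: c_in -> hidden -> c_out

for c in (64, 128, 256, 512):
    vector = weight_encoding_params(c, c, hidden=c)   # vector attention: output has c channels
    grouped = weight_encoding_params(c, 8, hidden=c)  # grouped attention: output has g = 8 channels
    print(f"c={c:4d}  vector: {vector:8d} params  grouped (g=8): {grouped:7d} params")
```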
Furthermore, point positions provide important geometric information for 3D semantic understanding, so the positional relationship among 3D points is more critical than that among 2D pixels. However, previous 3D position encoding schemes mostly follow the 2D ones and do not fully exploit the geometric knowledge in 3D coordinates. To this end, we strengthen the position encoding mechanism by applying an additional position encoding multiplier to the relation vector. Such a design strengthens the positional relationship information in the model, and we validate its effectiveness in our experiments.
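The following is only an illustrative sketch of what a multiplicative position term applied to the relation vector could look like, assuming a subtraction relation and small hypothetical MLPs delta_mul and delta_bias; the precise formulation used by PTv2 is given in Sec. 3.3 and may differ from this sketch.

```python
import torch
import torch.nn as nn

c = 32  # feature channels (illustrative value)
# Hypothetical MLPs producing a multiplicative and an additive position term from p_i - p_j.
delta_mul = nn.Sequential(nn.Linear(3, c), nn.ReLU(), nn.Linear(c, c))
delta_bias = nn.Sequential(nn.Linear(3, c), nn.ReLU(), nn.Linear(c, c))

q = torch.randn(c)            # query features of point i
k = torch.randn(16, c)        # key features of 16 neighbors of point i
rel_pos = torch.randn(16, 3)  # relative positions p_i - p_j for those neighbors

relation = q - k                                                # subtraction relation gamma(q, k)
relation = relation * delta_mul(rel_pos) + delta_bias(rel_pos)  # position multiplier plus additive bias
print(relation.shape)                                           # torch.Size([16, 32])
```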
Moreover, it is worth noting that the irregular, non-uniform spatial distribution of points poses a significant challenge to the pooling modules for point cloud processing. Previous point cloud pooling approaches rely on a combination of sampling methods (e.g., farthest point sampling [4] or grid sampling [5]) and neighbor query methods (e.g., kNN or radius query), which is time-consuming and not spatially well aligned. To overcome this problem, we go beyond the pooling paradigm of combining sampling and query, and divide the point cloud into non-overlapping partitions to directly fuse points within the same partition. We use uniform grids as the partition divider and achieve significant improvement.
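As a minimal sketch of the grid-partition idea (not the paper's implementation; the grid size, the averaging fusion, and the helper name grid_pool are assumptions for illustration), points can be bucketed by integer grid coordinates and their features fused within each occupied cell:

```python
import numpy as np

def grid_pool(positions: np.ndarray, features: np.ndarray, grid_size: float = 0.1):
    """Fuse points that fall into the same uniform grid cell (illustrative sketch).

    positions: (n, 3) point coordinates, features: (n, c) point features.
    Returns pooled positions and features, one entry per occupied cell.
    """
    cells = np.floor(positions / grid_size).astype(np.int64)    # integer cell index per point
    _, inverse = np.unique(cells, axis=0, return_inverse=True)  # map each point to its cell id
    n_cells = inverse.max() + 1
    counts = np.bincount(inverse).astype(np.float64)[:, None]
    pooled_pos = np.zeros((n_cells, 3))
    pooled_feat = np.zeros((n_cells, features.shape[1]))
    np.add.at(pooled_pos, inverse, positions)                   # sum positions per cell
    np.add.at(pooled_feat, inverse, features)                   # sum features per cell
    return pooled_pos / counts, pooled_feat / counts            # mean-fuse within each cell

# Example: 1000 random points with 16-dim features
pos, feat = np.random.rand(1000, 3), np.random.rand(1000, 16)
p, f = grid_pool(pos, feat, grid_size=0.2)
print(p.shape, f.shape)
```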
In conclusion, we propose Point Transformer V2, which improves Point Transformer [1] from several perspectives:
• We propose an effective grouped vector attention (GVA) with a novel weight encoding layer that enables efficient information exchange within and among attention groups.
• We introduce an improved position encoding scheme to utilize point cloud coordinates better and further enhance the spatial reasoning ability of the model.
• We design the partition-based pooling strategy to enable more efficient and spatially better-aligned information aggregation compared to previous methods.
We conducted extensive analysis and controlled experiments to validate our designs. Our results
indicate that PTv2 outperforms predecessor works and sets the new state-of-the-art on various 3D
understanding tasks.
2 Related Works
Image transformers. With the great success of ViT [6], the absolute dominance of convolution in vision tasks is shaken by Vision Transformer, which becomes a trend in 2D image understanding [7, 8, 9, 10]. ViT introduces the far-reaching scaled dot-product self-attention and multi-head self-attention theory [3] from NLP into vision by considering image patches as tokens. However, operating global attention on the entire image consumes excessive memory. To solve the memory consumption problem, Swin Transformer [7] introduces a grid-based local attention mechanism that operates the transformer block within a sequence of shifted windows.
Point cloud understanding. Learning-based methods for processing 3D point clouds can be classified into the following types: projection-based, voxel-based, and point-based networks. An intuitive way to process irregular inputs like point clouds is to transform irregular representations into regular ones. Projection-based methods project 3D point clouds onto various image planes and utilize 2D CNN-based backbones to extract feature representations [11, 12, 13, 14]. An alternative approach operates convolutions in 3D by transforming irregular point clouds into regular voxel representations [15, 16]. These voxel-based methods suffered from inefficiency because of the sparsity of point clouds until the introduction and implementation of sparse convolution [17, 18]. Point-based methods extract features directly from the point cloud rather than projecting or quantizing irregular point clouds onto regular grids in 2D or 3D [19, 4, 20, 5]. The recently proposed transformer-based point cloud understanding approaches, introduced in the next paragraph, also fall into the category of point-based methods.
Point cloud transformers. Transformer-based networks belong to the category of point-based networks for point cloud understanding. During the research upsurge of vision transformers, and at almost the same time, Zhao et al. [1] and Guo et al. [21] published their explorations of applying attention to point cloud understanding, becoming pioneers in this direction.
Figure 1: Comparison of the attention, position encoding, and pooling mechanisms between PTv1 and PTv2. Top-left: the vector attention (Sec. 3.1) with position encoding (Sec. 3.3) in PTv1. Bottom-left: our grouped vector attention (Sec. 3.2, denoted by red) with improved position encoding (Sec. 3.3, denoted by blue) in PTv2. Top-right: the sampling-based pooling and interpolation-based unpooling in PTv1. Bottom-right: our partition-based pooling and unpooling in PTv2 (Sec. 3.4).
PCT [21], proposed by Guo et al., performs global attention directly on the point cloud; similar to ViT, it is limited by memory consumption and computational complexity. Meanwhile, based on the vector attention theory proposed in SAN [2], Point Transformer [1], proposed by Zhao et al., directly performs local attention between each point and its adjacent points, which alleviates the memory problem mentioned above. Point Transformer achieves remarkable results in multiple point cloud understanding tasks and state-of-the-art results on several competitive challenges. In this work, we analyze the limitations of the Point Transformer [1] and propose several novel architecture designs for the attention and pooling modules to improve the effectiveness and efficiency of the Point Transformer. Our proposed model, Point Transformer V2, performs better than the Point Transformer across a variety of 3D scene understanding tasks.
3 Point Transformer V2
We analyze the limitations of Point Transformer V1 (PTv1) [1] and propose our Point Transformer V2 (PTv2), which includes several modules improved upon PTv1. We begin by introducing the mathematical formulation and revisiting the vector self-attention used in PTv1 in Sec. 3.1. Based on the observation that the number of parameters of PTv1 increases drastically with increased model depth and channel size, we propose our powerful and efficient grouped vector attention in Sec. 3.2. Further, we introduce our improved position encoding in Sec. 3.3 and the new pooling method in Sec. 3.4. We finally describe our network architecture in Sec. 3.5.
3.1 Problem Formulation and Background
Problem formulation. Let $\mathcal{M} = (\mathcal{P}, \mathcal{F})$ be a 3D point cloud scene containing a set of points $x_i = (p_i, f_i) \in \mathcal{M}$, where $p_i \in \mathbb{R}^3$ represents the point position and $f_i \in \mathbb{R}^c$ represents the point features. Point cloud semantic segmentation aims to predict a class label for each point $x_i$, and the goal of scene classification is to predict a class label for each scene $\mathcal{M}$. $\mathcal{M}(p)$ denotes a mapping function that maps the point at position $p$ to a subset of $\mathcal{M}$ denoted as the "reference set". Next, we revisit the self-attention mechanism used in PTv1 [1].
Local attention. Conducting global attention [6, 21] over all points in a scene is computationally heavy and infeasible for large-scale 3D scenes. Therefore, we apply local attention, where the attention for each point $x_i$ works within a subset of points, i.e., the reference point set $\mathcal{M}(p_i)$.
Shifted-grid attention [7], where attention is alternately applied over two sets of non-overlapping image grids, has become a common practice [22, 23, 24, 25] for image transformers. Similarly, the 3D space can be split into uniform non-overlapping grid cells, and the reference set is defined as the points within the same grid cell, i.e., $\mathcal{M}(p_i) = \{(p_j, f_j) \mid p_j \text{ in the same grid cell as } p_i\}$. However, such attention relies on a cumbersome grid-shifting operation to achieve a global receptive field, and it does not work well on point clouds, where the point densities in different grid cells are inconsistent.
PTv1 adopts neighborhood attention, where the reference point set is a local neighborhood of the given point, i.e., $\mathcal{M}(p_i) = \{(p_j, f_j) \mid p_j \in \text{Neighborhood}(p_i)\}$. Specifically, the neighborhood point set $\mathcal{M}(p_i)$ is defined as the $k$ nearest neighboring (kNN) points of $p_i$ in PTv1. Our experiments (Sec. 4.3) show that neighborhood attention is more effective than shifted-grid attention, so our approach adopts neighborhood attention.
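To make the definition of the reference set concrete, here is a brute-force kNN sketch (illustrative only and O(n^2); the function name knn_reference_sets and the choice of k are assumptions, and the actual implementation relies on efficient neighbor queries):

```python
import numpy as np

def knn_reference_sets(positions: np.ndarray, k: int = 16) -> np.ndarray:
    """Return, for each point, the indices of its k nearest neighbors (brute-force sketch).

    positions: (n, 3). This is only meant to make M(p_i) concrete, not to be efficient.
    """
    diff = positions[:, None, :] - positions[None, :, :]  # (n, n, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)                  # (n, n) pairwise distances
    return np.argsort(dist, axis=1)[:, :k]                # indices of the k closest points

pos = np.random.rand(100, 3)
neighbors = knn_reference_sets(pos, k=8)  # neighbors[i] plays the role of M(p_i)
print(neighbors.shape)                    # (100, 8)
```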
Scalar attention and vector attention. Given a point $x_i = (p_i, f_i) \in \mathcal{M}$, we apply linear projections or MLPs to project the point features $f_i$ to the feature vectors of query $q_i$, key $k_i$, and value $v_i$, each with $c_h$ channels. The standard scalar attention (SA) operated on the point $x_i$ and its reference point set $\mathcal{M}(p_i)$ can be represented as follows,
$$ w_{ij} = \langle q_i, k_j \rangle / \sqrt{c_h}, \qquad f_i^{\mathrm{attn}} = \sum_{x_j \in \mathcal{M}(p_i)} \mathrm{Softmax}(w_i)_j \, v_j. \tag{1} $$
The attention weights in the above formulation are scalars computed from the scaled dot-product [3] between the query and key vectors. Multi-head scalar attention (MSA) [3] is an extension of SA which runs several scalar attentions in parallel. MSA is widely applied in transformers, and we will show in Sec. 3.2 that MSA is a degenerate case of our proposed grouped vector attention.
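A minimal sketch of Eq. 1 for a single query point against its reference set, assuming the features have already been projected to queries, keys, and values (tensor shapes are illustrative assumptions):

```python
import torch

def scalar_attention(q, k, v):
    """Eq. 1 for one query point: q is (c_h,), k and v are (m, c_h) for m reference points."""
    c_h = q.shape[-1]
    w = (k @ q) / c_h ** 0.5     # scalar weight per reference point (scaled dot-product)
    a = torch.softmax(w, dim=0)  # normalize over the reference set
    return a @ v                 # weighted sum of value vectors -> (c_h,)

q = torch.randn(32)
k, v = torch.randn(16, 32), torch.randn(16, 32)
print(scalar_attention(q, k, v).shape)  # torch.Size([32])
```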
Instead of the scalar attention weights, PTv1 applies vector attention, where the attention weights are vectors that can modulate the individual feature channels. In SA, the scalar attention weight is computed by the scaled dot-product between the query and key vectors. In vector attention, a weight encoding function encodes the relation between query and key into a vector. The vector attention [2] is formulated as follows,
$$ w_{ij} = \omega(\gamma(q_i, k_j)), \qquad f_i^{\mathrm{attn}} = \sum_{x_j \in \mathcal{M}(p_i)} \mathrm{Softmax}(W_i)_j \odot v_j, \tag{2} $$
where $\odot$ is the Hadamard product, $\gamma$ is a relation function (e.g., subtraction), and $\omega: \mathbb{R}^c \mapsto \mathbb{R}^c$ is a learnable weight encoding (e.g., an MLP) that computes the attention vectors to re-weight $v_j$ by channel before aggregation. Fig. 2(a) shows a method using vector attention with linear weight encoding.
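A minimal sketch of Eq. 2 for a single query point, assuming subtraction as the relation function and a small MLP as the weight encoding (the layer sizes are assumptions); position encoding is omitted here:

```python
import torch
import torch.nn as nn

c = 32
omega = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, c))  # weight encoding MLP (sizes assumed)

def vector_attention(q, k, v):
    """Eq. 2 for one query point: q is (c,), k and v are (m, c) for m reference points."""
    r = q - k                    # relation gamma: subtraction, shape (m, c)
    w = omega(r)                 # attention vector per reference point, shape (m, c)
    a = torch.softmax(w, dim=0)  # softmax over the reference set, per channel
    return (a * v).sum(dim=0)    # Hadamard product, then aggregate -> (c,)

q = torch.randn(c)
k, v = torch.randn(16, c), torch.randn(16, c)
print(vector_attention(q, k, v).shape)  # torch.Size([32])
```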
3.2 Grouped Vector Attention
In vector attention, as the network goes deeper and there are more feature encoding channels, the
number of parameters for the weight encoding layer increases drastically. The large parameter size
restricts the efficiency and generalization ability of the model. In order to overcome the limitations of
vector attention, we introduce the grouped vector attention, as illustrated in Fig. 1(left).
Attention groups. We divide the channels of the value vector $v \in \mathbb{R}^c$ evenly into $g$ groups ($1 \le g \le c$). The weight encoding layer outputs a grouped attention vector with $g$ channels instead of $c$ channels. Channels of $v$ within the same attention group share the same scalar attention weight from the grouped attention vector. Mathematically,
$$ w_{ij} = \omega(\gamma(q_i, k_j)), \qquad f_i^{\mathrm{attn}} = \sum_{x_j \in \mathcal{M}(p_i)} \sum_{l=1}^{g} \sum_{m=1}^{c/g} \mathrm{Softmax}(W_i)_{jl} \, v_j^{\,lc/g+m}, \tag{3} $$
where $\gamma$ is the relation function and $\omega: \mathbb{R}^c \mapsto \mathbb{R}^g$ is the learnable grouped weight encoding defined in the next paragraph. The second equation in Eq. 3 is the grouped vector aggregation. Fig. 2(b) presents a vanilla GVA implemented with a fully connected weight encoding; the number of parameters of the grouped weight encoding function is reduced compared with vector attention (Fig. 2(a)), leading to a more powerful and efficient model.
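A minimal sketch of Eq. 3 for a single query point, assuming a plain MLP in place of the paper's grouped weight encoding layer and illustrative values for c and g; channels within a group share one attention weight:

```python
import torch
import torch.nn as nn

c, g = 32, 4                                                        # channels and groups (illustrative values)
omega = nn.Sequential(nn.Linear(c, c), nn.ReLU(), nn.Linear(c, g))  # weight encoding mapping c -> g

def grouped_vector_attention(q, k, v):
    """Eq. 3 for one query point: q is (c,), k and v are (m, c) for m reference points."""
    m = k.shape[0]
    w = omega(q - k)                           # (m, g): one scalar weight per group per reference point
    a = torch.softmax(w, dim=0)                # softmax over the reference set
    a = a.unsqueeze(-1).expand(m, g, c // g)   # broadcast each group weight to its c/g channels
    return (a.reshape(m, c) * v).sum(dim=0)    # modulate value channels and aggregate -> (c,)

q = torch.randn(c)
k, v = torch.randn(16, c), torch.randn(16, c)
print(grouped_vector_attention(q, k, v).shape)  # torch.Size([32])
```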