vector attention inherits the merits of both vector attention and multi-head attention while being more
powerful and efficient.
Furthermore, point positions carry important geometric information for 3D semantic understanding, so the positional relationships among 3D points are more critical than those among 2D pixels. However, previous 3D position encoding schemes mostly follow 2D ones and do not fully exploit the geometric knowledge in 3D coordinates. To this end, we strengthen the position encoding mechanism by applying an additional position encoding multiplier to the relation vector. This design strengthens the positional relationship information in the model, and we validate its effectiveness in our experiments.
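To make the multiplicative scheme concrete, the following is a minimal NumPy sketch of applying both an additive position bias and a multiplicative position encoding to a relation vector. All names (`relation`, `delta_p`, `pos_mult`, `pos_bias`) and the stand-in `mlp` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 16                              # n neighbor pairs, c channels

relation = rng.standard_normal((n, c))    # query-key relation vector
delta_p = rng.standard_normal((n, 3))     # relative 3D coordinates

def mlp(x, out_dim, seed):
    # stand-in for a learned MLP: one random linear layer + ReLU
    w = np.random.default_rng(seed).standard_normal((x.shape[1], out_dim)) * 0.1
    return np.maximum(x @ w, 0.0)

pos_bias = mlp(delta_p, c, seed=1)        # conventional additive encoding
pos_mult = mlp(delta_p, c, seed=2)        # additional multiplicative encoding

# the multiplier scales the relation vector; the bias is then added
strengthened = relation * pos_mult + pos_bias
print(strengthened.shape)
```

The multiplier lets the relative coordinates gate each relation channel directly, rather than only shifting it.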
Moreover, it is worth noting that the irregular, non-uniform spatial distribution of points poses a significant challenge to pooling modules for point cloud processing. Previous point cloud pooling approaches rely on a combination of sampling methods (e.g., farthest point sampling [4] or grid sampling [5]) and neighbor query methods (e.g., kNN or radius query), which is time-consuming and not spatially well aligned. To overcome this problem, we go beyond the pooling paradigm of combining sampling and query, and instead divide the point cloud into non-overlapping partitions, directly fusing points within the same partition. We use uniform grids as the partition divider and achieve significant improvements.
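The partition-based pooling described above can be sketched as follows: assign each point to a uniform grid cell and fuse all points in the same cell. This is a minimal sketch assuming mean fusion; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def grid_pool(coords, feats, grid_size):
    """Fuse all points falling into the same uniform grid cell (mean fusion)."""
    cells = np.floor(coords / grid_size).astype(np.int64)      # (n, 3) cell indices
    # unique cells; `inverse` maps each point to its cell id
    uniq, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    pooled_feats = np.zeros((len(uniq), feats.shape[1]))
    np.add.at(pooled_feats, inverse, feats)                    # sum per cell
    counts = np.bincount(inverse).astype(float)
    pooled_feats /= counts[:, None]                            # mean per cell
    pooled_coords = uniq * grid_size + grid_size / 2           # cell centers
    return pooled_coords, pooled_feats

coords = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.1, 0.0, 0.0]])
feats = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
pc, pf = grid_pool(coords, feats, grid_size=1.0)
print(pf)  # first two points share a cell: [[2., 3.], [5., 6.]]
```

Because the partitions are non-overlapping, every point is aggregated exactly once, avoiding the sampling-plus-query pipeline entirely.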
In conclusion, we propose Point Transformer V2, which improves Point Transformer [1] from several perspectives:
• We propose an effective grouped vector attention (GVA) with a novel weight encoding layer that enables efficient information exchange within and among attention groups.
• We introduce an improved position encoding scheme to better utilize point cloud coordinates and further enhance the spatial reasoning ability of the model.
• We design a partition-based pooling strategy that enables more efficient and spatially better-aligned information aggregation than previous methods.
We conducted extensive analysis and controlled experiments to validate our designs. The results indicate that PTv2 outperforms its predecessors and sets a new state of the art on various 3D understanding tasks.
2 Related Works
Image transformers.
With the great success of ViT [6], the dominance of convolution in vision tasks has been shaken, and transformers have become a trend in 2D image understanding [7, 8, 9, 10]. ViT introduces the far-reaching scaled dot-product self-attention and multi-head self-attention mechanism [3] from NLP into vision by treating image patches as tokens. However, operating global attention on the entire image consumes excessive memory. To address this, Swin Transformer [7] introduces a grid-based local attention mechanism that operates transformer blocks within a sequence of shifted windows.
Point cloud understanding.
Learning-based methods for processing 3D point clouds can be classified into the following types: projection-based, voxel-based, and point-based networks. An intuitive way to process irregular inputs like point clouds is to transform irregular representations into regular ones. Projection-based methods project 3D point clouds onto various image planes and utilize 2D CNN-based backbones to extract feature representations [11, 12, 13, 14]. An alternative approach operates convolutions in 3D by transforming irregular point clouds into regular voxel representations [15, 16]. These voxel-based methods suffered from inefficiency due to the sparsity of point clouds until the introduction and implementation of sparse convolution [17, 18]. Point-based methods extract features directly from the point cloud rather than projecting or quantizing irregular point clouds onto regular grids in 2D or 3D [19, 4, 20, 5]. The recently proposed transformer-based point cloud understanding approaches, introduced in the next paragraph, also fall into the category of point-based methods.
Point cloud transformers.
Transformer-based networks belong to the category of point-based networks for point cloud understanding. During the surge of research on vision transformers, Zhao et al. [1] and Guo et al. [21] concurrently published their explorations of applying attention to point cloud understanding, becoming pioneers in this direction. PCT [21], proposed by Guo et al., performs global attention directly on the point cloud. Their work, similar to ViT, is limited