2.2.2 Probabilistic Registration.
Probabilistic registration methods model the distribution of a point cloud as a density function, often via GMMs, and perform alignment either through a correlation-based approach or within an EM-based optimization framework [49, 23]. One common formulation, exemplified by CPD [29] and FilterReg [12], represents the geometry of the target point cloud as a GMM distribution over 3D Euclidean space; the source point cloud is then fitted to this distribution under the maximum likelihood estimation (MLE) framework. Another line of work, including GMMReg [20], JRMPC [8], and DeepGMR [49], builds GMM probability distributions on both the source and the target point clouds. These methods are robust to outliers, noise, and density variation [49]. However, they usually assume that both point clouds share the same GMM parameters or that one of the point clouds is "perfect", leading to biased solutions whenever both point clouds contain noise and outliers, as in partial-to-partial registration [8]. Motivated by these observations, we propose an overlap-guided 3D point cloud registration algorithm based on Gaussian Mixture Models (GMMs) that can be applied to partial-to-partial registration problems.
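To make the MLE objective concrete, the following is a minimal sketch (our illustration, not code from CPD [29] or FilterReg [12]) that evaluates the log-likelihood of a transformed source cloud under an isotropic target GMM; an EM-style registration method alternates such evaluations with updates of the rigid transform. All names and the shared isotropic-variance assumption are ours.

```python
import math
import torch

def gmm_log_likelihood(src, mu, sigma2, weights):
    """Log-likelihood of (transformed) source points under an isotropic target GMM.

    src:     (N, 3) source points after applying a candidate rotation/translation
    mu:      (J, 3) component means, e.g. target points or cluster centroids
    sigma2:  shared isotropic variance of all components (scalar)
    weights: (J,) mixing weights summing to one
    """
    # Squared distance from every source point to every component mean.
    d2 = torch.cdist(src, mu) ** 2                      # (N, J)
    # Log-density of each point under each isotropic 3D Gaussian component.
    log_comp = (weights.log()[None, :]
                - 1.5 * math.log(2.0 * math.pi * sigma2)
                - d2 / (2.0 * sigma2))                  # (N, J)
    # MLE objective: sum over points of log p(x_i), logsumexp over components.
    return torch.logsumexp(log_comp, dim=1).sum()
```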
2.3. Transformers in 3D Point Clouds
Due to their inherent permutation invariance and strong global feature learning ability, 3D Transformers are well suited for point cloud processing and analysis, and they have surpassed the performance of state-of-the-art non-Transformer algorithms. A-SCN [42] is the first example of a Transformer-based approach for point cloud learning. Later, PCT [13], a pure global Transformer network, generates positional embeddings from the 3D coordinates of points and adopts an offset-attention module to enrich each point's features with information from its local neighborhood. PointTransformer [52] constructs self-attention networks for general 3D tasks using nearest-neighbor search.
However, the computation and memory overheads of the original Transformer grow quadratically with the size of the feature map. Efforts to reduce this quadratic complexity have mainly focused on self-attention. For instance, PatchFormer [50] reduces the size of the attention map by first splitting the raw point cloud into patches and then aggregating the local features in each patch to generate the attention matrix. FastPointTransformer proposes centroid-aware voxelization and devoxelization techniques to reduce space complexity. However, these works are less suitable for feature matching, which requires self- and cross-attention on features within and between point clouds, respectively. To this end, we propose clustered attention, which groups a set of points into $J$ clusters and computes attention only over these clusters, resulting in linear complexity for a fixed number of clusters.
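To illustrate the idea (a simplified sketch, not the exact formulation developed in Sec. 3): attention is computed against $J$ cluster summaries rather than all $M$ points, so the cost scales as $O(NJ)$. The hard assignment `assign` (e.g., from k-means or farthest point sampling) and all names are hypothetical.

```python
import torch

def clustered_attention(q_feats, kv_feats, assign):
    """Attention against J cluster summaries instead of all key/value points.

    q_feats:  (N, d) query features
    kv_feats: (M, d) key/value features
    assign:   (M,) long tensor of cluster ids in [0, J)
    Returns (N, d) attended features at O(N * J) cost instead of O(N * M).
    """
    J = int(assign.max()) + 1
    M, d = kv_feats.shape
    # Summarize each cluster by the mean feature of its members.
    counts = torch.zeros(J).index_add_(0, assign, torch.ones(M))
    sums = torch.zeros(J, d).index_add_(0, assign, kv_feats)
    centroids = sums / counts.clamp(min=1)[:, None]                  # (J, d)
    # Standard scaled dot-product attention, but over only J centroids.
    attn = torch.softmax(q_feats @ centroids.T / d ** 0.5, dim=-1)   # (N, J)
    return attn @ centroids                                          # (N, d)
```

Using the same features for queries and keys/values gives clustered self-attention; feeding the other point cloud's features as `kv_feats` gives the cross-attention variant.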
3. Our Approach
Rigid point cloud registration aims to recover the rigid transformation $\mathbf{T} \in SE(3)$, consisting of a rotation $\mathbf{R} \in SO(3)$ and a translation $\mathbf{t} \in \mathbb{R}^3$, that optimally aligns the source point cloud $\mathbf{P} = \{\mathbf{p}_i \in \mathbb{R}^3 \mid i = 1, 2, \dots, N\}$ to the target point cloud $\mathbf{Q} = \{\mathbf{q}_j \in \mathbb{R}^3 \mid j = 1, 2, \dots, M\}$, where $N$ and $M$ denote the numbers of points in $\mathbf{P}$ and $\mathbf{Q}$, respectively. Fig. 1 illustrates our framework, which consists of three modules: feature extraction, overlap region detection, and overlap-guided GMM registration. The shared-weight encoder first extracts point-wise features $\mathbf{F}^p$ and $\mathbf{F}^q$ from the point clouds $\mathbf{P}$ and $\mathbf{Q}$, respectively. The clustered self-attention module then updates $\mathbf{F}^p$ and $\mathbf{F}^q$ to capture global context. Next, the overlap region detection module projects the updated features of $\mathbf{P}$ and $\mathbf{Q}$ to overlap scores $\mathbf{o}^p$ and $\mathbf{o}^q$, respectively. $\mathbf{F}^p$, $\mathbf{F}^q$, $\mathbf{o}^p$, and $\mathbf{o}^q$ are then used to estimate the distributions (GMMs) of $\mathbf{P}$ and $\mathbf{Q}$. Finally, weighted SVD is adopted to estimate the rigid transformation $\mathbf{T}$ from the estimated distributions.
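For the last step, a generic weighted Kabsch solver illustrates what "weighted SVD" computes; this is a minimal sketch under our own naming, not the exact implementation, where `src`, `tgt`, and `w` stand for corresponding points and weights derived from the GMMs and overlap scores.

```python
import torch

def weighted_svd(src, tgt, w):
    """Closed-form weighted rigid alignment (weighted Kabsch) via SVD.

    src: (K, 3) source points; tgt: (K, 3) corresponding target points
    w:   (K,) non-negative weights
    Returns R (3, 3), t (3,) minimizing sum_k w_k ||R src_k + t - tgt_k||^2.
    """
    w = w / w.sum()
    mu_src = (w[:, None] * src).sum(0)      # weighted source centroid
    mu_tgt = (w[:, None] * tgt).sum(0)      # weighted target centroid
    # Weighted cross-covariance between the centered point sets.
    H = (src - mu_src).T @ (w[:, None] * (tgt - mu_tgt))
    U, _, Vt = torch.linalg.svd(H)
    # Guard against a reflection so that R stays in SO(3).
    if torch.linalg.det(Vt.T @ U.T) < 0:
        Vt = torch.cat([Vt[:2], -Vt[2:]], dim=0)
    R = Vt.T @ U.T
    t = mu_tgt - R @ mu_src
    return R, t
```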
3.1. Feature Extraction
The feature extraction network consists of a Dynamic Graph Convolutional Neural Network (DGCNN), positional encoding, and a clustered self-attention network. Given a point cloud pair $\mathbf{P}$ and $\mathbf{Q}$, DGCNN extracts their associated features $\mathbf{F}^p = \{\mathbf{f}^p_i \in \mathbb{R}^d \mid i = 1, 2, \dots, N\}$ and $\mathbf{F}^q = \{\mathbf{f}^q_j \in \mathbb{R}^d \mid j = 1, 2, \dots, M\}$, where $d = 512$.
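For background, one EdgeConv layer of the kind DGCNN stacks can be sketched as follows (a simplified PyTorch-style sketch with our own naming, not the authors' implementation; `mlp` stands for the shared point-wise MLP, e.g. `torch.nn.Sequential(torch.nn.Linear(2 * d_in, d_out), torch.nn.ReLU())`):

```python
import torch

def edge_conv(x, mlp, k=20):
    """One simplified EdgeConv layer (DGCNN building block).

    x:   (N, d_in) per-point features (xyz coordinates for the first layer)
    mlp: module mapping (N, k, 2 * d_in) -> (N, k, d_out), applied point-wise
    k:   number of nearest neighbors, found in the current feature space
    Returns (N, d_out) features after max-pooling over each neighborhood.
    """
    # k-nearest neighbors in feature space (the "dynamic" graph).
    idx = torch.cdist(x, x).topk(k, largest=False).indices    # (N, k)
    neighbors = x[idx]                                        # (N, k, d_in)
    center = x[:, None, :].expand_as(neighbors)               # (N, k, d_in)
    # Edge features: the center point concatenated with relative offsets.
    edges = torch.cat([center, neighbors - center], dim=-1)   # (N, k, 2*d_in)
    return mlp(edges).max(dim=1).values                       # (N, d_out)
```

The full network stacks several such layers, recomputing the k-NN graph at each one, and ends with $d = 512$ output channels.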
3.1.1 Attention Module
Transformer training and inference in previous works can be computationally expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution correspondence prediction tasks. To mitigate this limitation, our novel cluster-based Transformer architecture operates after local feature extraction: $\mathbf{F}^p$ and $\mathbf{F}^q$ are passed through the attention module to extract context-dependent point-wise features. Intuitively, the self-attention module transforms the DGCNN features into more informative representations that facilitate matching.
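For contrast with the clustered design, vanilla self-attention materializes a full $N \times N$ attention map, which is the quadratic cost referred to above; a stripped-down sketch (single head, omitting the learned query/key/value projections):

```python
import torch

def full_self_attention(feats):
    """Vanilla scaled dot-product self-attention: O(N^2) time and memory.

    feats: (N, d) point-wise features, e.g. the DGCNN output.
    """
    d = feats.shape[-1]
    attn = torch.softmax(feats @ feats.T / d ** 0.5, dim=-1)  # (N, N) map
    return attn @ feats                                       # (N, d)
```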
Spherical positional encoding. Transformers are typically fed with only high-level features, which may not explicitly encode the geometric structure of the point cloud [39, 16]. This makes the learned features geometrically less discriminative, causing severe matching ambiguity and numerous outlier matches, especially in low-overlap cases [33]. A straightforward recipe is to explicitly inject positional encodings of the 3D point coordinates, which endow the per-point features with intrinsic geometric properties by adding unique positional information that enhances the distinction among point features in indistinctive regions [52]. However, the resulting coordinate-based attentions are naturally transformation-variant [33], while registration requires transformation invari-