[11]–[13] utilize DNNs to detect the keypoints of each object
and subsequently compute the 6D pose parameters using
Perspective-n-Point (PnP) for 2D keypoints or least-squares fitting for 3D keypoints.
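For 3D keypoints, the least-squares step is commonly the closed-form Kabsch/Umeyama alignment; the minimal NumPy sketch below is for illustration only (the function and variable names are placeholders), while the 2D case would instead call a PnP solver such as OpenCV's cv2.solvePnP.

```python
import numpy as np

def kabsch_pose(src_kpts, dst_kpts):
    """Closed-form least-squares rigid transform (R, t) mapping
    src_kpts (N, 3) onto dst_kpts (N, 3) via SVD (Kabsch/Umeyama)."""
    src_c, dst_c = src_kpts.mean(axis=0), dst_kpts.mean(axis=0)
    H = (src_kpts - src_c).T @ (dst_kpts - dst_c)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```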
While DNN-based methods can estimate poses much more rapidly, they still struggle to achieve high accuracy due to errors in
segmentation or regression. To achieve higher accuracy and
stability, many works have adopted pose refinement methods,
of which the most common is the Iterative Closest Point
(ICP) [14] algorithm. Given an estimated pose, ICP finds the nearest neighbor of each source point in the target point cloud, treats that neighbor as the corresponding point, and iteratively re-solves for the optimal rigid transformation. Works such as [8], [15] additionally use DNNs to extract richer features for better performance.
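For reference, the core ICP loop can be sketched compactly on top of the kabsch_pose helper above; the iteration budget and convergence tolerance are illustrative choices, not taken from [14].

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_refine(src, dst, R, t, iters=30, tol=1e-6):
    """Refine an initial pose (R, t) aligning src (N, 3) to dst (M, 3)."""
    tree = cKDTree(dst)                     # nearest-neighbor search structure
    prev_err = np.inf
    for _ in range(iters):
        moved = src @ R.T + t               # apply the current pose estimate
        dist, idx = tree.query(moved)       # closest target point per source point
        R, t = kabsch_pose(src, dst[idx])   # re-solve the pose on these matches
        if abs(prev_err - dist.mean()) < tol:
            break                           # stop once the mean error plateaus
        prev_err = dist.mean()
    return R, t
```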
However, as pose estimation networks have improved, the accuracy gains contributed by these refinement methods have steadily diminished. The limited accuracy of existing registration methods
can be attributed to their reliance on incomplete point clouds to
register entire object mesh point clouds, resulting in numerous
erroneous correspondences. Moreover, although color information is widely used in 6D pose estimation, its potential to enhance registration accuracy remains largely unexplored: conventional methods do not effectively exploit color and are designed primarily for the large-scale optimization problem of point cloud registration rather than for the small-scale problem of pose refinement, leaving this direction largely untapped.
Our refinement method mainly contains two modules.
Firstly, we propose a point cloud completion network to fully
utilize the point cloud and RGB data. The network's composite encoder has two branches: a local branch that fuses RGB and point cloud information at each corresponding pixel, and a global branch that extracts a feature of the whole point cloud. The decoder follows [16] and employs
a multistage point generation structure. Additionally, we add
a keypoint detection module to the point cloud completion
network during the training process to improve the sensitivity
of the completed point cloud to pose accuracy, leading to better
pose optimization.
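To make the two-branch design concrete, the schematic PyTorch sketch below shows one plausible encoder of this kind; the layer widths, MLP depths, and max-pooling choice are illustrative placeholders, not the exact configuration of our network.

```python
import torch
import torch.nn as nn

class CompositeEncoder(nn.Module):
    """Schematic two-branch encoder: a local branch fuses per-pixel RGB and
    point features, and a global branch summarizes the whole point cloud."""
    def __init__(self, d_rgb=32, d_pts=32, d_glob=256):
        super().__init__()
        self.rgb_mlp = nn.Sequential(nn.Linear(3, d_rgb), nn.ReLU())
        self.pts_mlp = nn.Sequential(nn.Linear(3, d_pts), nn.ReLU())
        self.glob_mlp = nn.Sequential(nn.Linear(3, d_glob), nn.ReLU())

    def forward(self, rgb, pts):
        # rgb, pts: (B, N, 3) colors and coordinates at corresponding pixels
        local = torch.cat([self.rgb_mlp(rgb), self.pts_mlp(pts)], dim=-1)
        glob = self.glob_mlp(pts).max(dim=1).values    # PointNet-style pooling
        glob = glob.unsqueeze(1).expand(-1, pts.shape[1], -1)
        return torch.cat([local, glob], dim=-1)        # per-point fused feature
```

Secondly, to use color and point cloud data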
in registration and to enhance method stability, we propose
a novel method named Color supported Iterative KeyPoint
(CIKP), which samples the point cloud surrounding each keypoint and leverages both RGB and point cloud information to refine object keypoints iteratively.
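As a simplified, schematic illustration (not the exact CIKP procedure), one color-supported refinement step in this spirit could be sketched as follows, reusing kabsch_pose from above; the neighborhood radius and color weight lam are placeholder parameters.

```python
import numpy as np
from scipy.spatial import cKDTree

def cikp_like_step(model_kpts, model_xyz, model_rgb,
                   scene_xyz, scene_rgb, R, t, radius=0.02, lam=0.5):
    """One hypothetical step: around each posed model keypoint, gather nearby
    scene points, match them to the model in a joint (xyz, lam*rgb) space,
    and re-solve the pose from all matches."""
    posed = model_xyz @ R.T + t                       # model under current pose
    tree = cKDTree(np.hstack([posed, lam * model_rgb]))
    src, dst = [], []
    for kp in model_kpts @ R.T + t:                   # posed keypoints
        near = np.linalg.norm(scene_xyz - kp, axis=1) < radius
        if not near.any():
            continue                                  # keypoint not observed
        feats = np.hstack([scene_xyz[near], lam * scene_rgb[near]])
        _, idx = tree.query(feats)                    # color-aware matching
        src.append(model_xyz[idx])
        dst.append(scene_xyz[near])
    return kabsch_pose(np.vstack(src), np.vstack(dst))
```

The continue branch above already hints at the failure mode discussed next: keypoints whose neighborhoods are missing from the observed cloud contribute nothing to the fit.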
However, CIKP struggles to refine every keypoint when the observed point cloud is incomplete, which limits its performance. To address
this issue, we introduce a combination of our completion
network and the CIKP method, referred to as Point Cloud
Completion and Keypoint Refinement with Fusion (PCKRF).
This integrated approach enables the refinement of the initial
pose prediction from the pose estimation network. We further
conduct extensive experiments on the YCB-Video [10] and Occlusion LineMOD [6] datasets to evaluate our method. The results
demonstrate that our method can be effectively integrated with
most existing pose estimation techniques, leading to improved
performance in most cases.
Our main contributions are threefold:
• PCKRF: A pipeline that combines our completion network and the CIKP method, utilizing RGBD information and keypoints throughout the refinement.
• A novel point cloud completion network that includes a composite encoder and a keypoint detection module.
• A novel iterative pose refinement method, CIKP, that uses both RGB and point cloud information based on keypoint refinement.
Experiments demonstrate that PCKRF exhibits superior stability compared to existing approaches when optimizing initial poses with relatively high precision, and it achieves promising results even in challenging scenarios involving textureless and symmetric objects.
II. RELATED WORKS
A. Pose Estimation
Pose estimation methods can be categorized into two types
based on their optimization goal: holistic and keypoint-based
methods. Holistic methods predict the 3D position and orientation of objects directly from the provided RGB and/or
depth images. Traditional template-based methods construct
a rigid template for an object from different viewpoints and
compute the best-matched pose for the given image [17],
[18]. Recently, some works have utilized DNNs to directly regress
or classify the 6D pose of objects. PoseCNN [10] used a
multi-stage network to predict pose. It first utilized Hough
Voting to determine the center location of objects and then
directly regressed 3D rotation parameters. SSD-6D [19] first
detected objects in the images and then classified their poses.
DenseFusion [8] fused RGB and depth values at the per-pixel level, which has strongly influenced subsequent RGBD-based 6D pose estimation methods. However, the non-linearity of the rotation space makes it challenging for such direct regression losses to converge. Recently, Neural Radiance Fields have also been employed for 6D pose estimation, showing considerable research potential [20].
Pose estimation using only point cloud information is also
called point cloud registration. Recently, the advancements
in deep neural networks, particularly in three-dimensional
geometry with methods like PointNet [21] and DGCNN [22],
have significantly propelled the progress of deep point cloud
registration. These methods center on using deep neural networks to extract features from cross-source point clouds; the extracted features then serve as the basis for correspondence-based registration or are used to directly regress transformation matrices. Techniques like SpinNet [23] aim
to extract robust point descriptors through specialized neural
network designs, focusing on feature learning. However, its
reliance on a voxelization preprocessing step poses a challenge
when dealing with cross-modality point clouds. Another approach, D3Feat [24], constructs features based on k-nearest
neighbors. Nonetheless, this descriptor tends to struggle when
confronted with significant density disparities. Beyond these