6D Pose Estimation for Textureless Objects on RGB Frames using
Multi-View Optimization
Jun Yang*, Wenjie Xue†, Sahar Ghavidel†, and Steven L. Waslander*
*Jun Yang and Steven L. Waslander are with the University of Toronto Institute for Aerospace Studies and the Robotics Institute. {jun.yang, steven.waslander}@robotics.utias.utoronto.ca
†Wenjie Xue and Sahar Ghavidel are with Epson Canada. {mark.xue, sahar.ghavidel}@ea.epson.com
This work was supported by Epson Canada Ltd.
Abstract— 6D pose estimation of textureless objects is a
valuable but challenging task for many robotic applications. In
this work, we propose a framework to address this challenge
using only RGB images acquired from multiple viewpoints. The
core idea of our approach is to decouple 6D pose estimation
into a sequential two-step process, first estimating the 3D
translation and then the 3D rotation of each object. This
decoupled formulation first resolves the scale and depth ambiguities inherent to single RGB images, and the resulting scale estimate then greatly simplifies the accurate identification of the object orientation in the second stage. Moreover, to accommodate the multi-modal distribution
present in rotation space, we develop an optimization scheme
that explicitly handles object symmetries and counteracts
measurement uncertainties. Compared with the state-of-the-art multi-view approach, we demonstrate that our method achieves substantial improvements on a challenging 6D pose estimation dataset for textureless objects.
I. INTRODUCTION
Textureless rigid objects occur frequently in industrial
environments and are of significant interest in many
robotic applications. The task of 6D pose estimation aims
to detect these objects of known geometry and estimate
their 6DoF (degrees of freedom) poses, i.e., 3D translations
and 3D rotations, with respect to a global coordinate
frame. In robotic manipulation tasks, accurate object
poses are required for path planning and grasp execution [1], [2], [3]. For robotic navigation, 6D object poses
provide useful information to the robot for localization
and obstacle avoidance [4], [5], [6], [7].
Due to the lack of appearance features, the problem of 6D pose estimation for textureless objects has historically been addressed mainly with depth data [8], [9], [10], [11], [12] or
RGB-D images [13], [2], [14], [15], [16]. These approaches
can achieve strong pose estimation performance when
given high-quality depth data. Despite recent advances
in depth acquisition technology, commodity-level depth
cameras produce depth maps with low accuracy and
missing data when surfaces are too glossy or dark [17],
[18], or the object is transparent [19], [20]. Hence, in the
past decade, RGB-based solutions have received considerable attention as an alternative [21], [22]. Driven by advances in deep learning, learning-based approaches have recently been shown to significantly
boost object pose estimation performance using only RGB images [23], [24], [25], [26], [27]. However, due to the scale, depth, and perspective ambiguities inherent to a single viewpoint, RGB-based solutions usually yield low accuracy in their final 6D pose estimates.
To mitigate these ambiguities, recent works utilize multiple RGB frames acquired from different viewpoints to enhance their pose estimation results [28], [29], [30], [6], [31], [7]. In particular,
these approaches can be further categorized into offline
batch-based solutions [28], [30], where all the frames are
provided at once, and incremental solutions [29], [6], [31],
[7], where frames are provided sequentially. While fusing pose estimates from different viewpoints can improve overall performance, handling severe inconsistencies across views, such as appearance ambiguities, rotational symmetries, and occlusions, remains challenging. To address
these challenges, in this work, we propose a decoupled
formulation to factorize the 6D pose estimation problem
into a sequential two-step optimization process. Figure 1
shows an overview of the framework. Based on the per-
frame predictions of the object’s segmentation mask and
2D center from neural networks, we first optimize the 3D
translation and obtain the object’s scale in the image. The
acquired scale greatly simplifies the object rotation esti-
mation problem with a template-matching method [21]. A
max-mixture formulation [32] is finally adopted to accom-
modate the multi-modal output distribution present in
rotation space. We conduct extensive experiments on the
challenging ROBI dataset [33]. Compared with the state-of-the-art method CosyPose [28], our method achieves a substantial improvement (28.5% and 3.4% over its RGB and RGB-D versions, respectively).
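To make the first step concrete, the sketch below shows one standard way to recover an object's 3D center from per-frame 2D center predictions: linear (DLT-style) triangulation, assuming pinhole cameras with known intrinsics and extrinsics. It is a minimal illustration of the idea rather than the paper's exact objective, and all names are ours.

    import numpy as np

    def triangulate_object_center(centers_px, intrinsics, poses_w2c):
        # DLT-style triangulation of an object's 3D center from its
        # predicted 2D centers in several calibrated views.
        #   centers_px : list of (u, v) pixel predictions, one per view
        #   intrinsics : list of 3x3 camera matrices K
        #   poses_w2c  : list of 3x4 world-to-camera matrices [R | t]
        # Returns the world-frame 3D point minimizing the algebraic error.
        rows = []
        for (u, v), K, P in zip(centers_px, intrinsics, poses_w2c):
            M = K @ P                     # 3x4 projection matrix for this view
            rows.append(u * M[2] - M[0])  # each view contributes two linear
            rows.append(v * M[2] - M[1])  # constraints on the homogeneous X
        A = np.asarray(rows)
        _, _, Vt = np.linalg.svd(A)       # homogeneous least squares via SVD
        X = Vt[-1]                        # singular vector of smallest value
        return X[:3] / X[3]

Once the translation, and hence the object's projected scale, is fixed, the second stage only needs to match orientation templates at a single known scale.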
In summary, our key contributions are:
• We propose a novel 6D object pose estimation approach that decouples the problem into a sequential two-step process. This process resolves the depth ambiguities of individual RGB frames and greatly improves the estimation of the rotation parameters.
• To deal with the multi-modal uncertainties of object rotation, we develop a rotation optimization scheme that explicitly handles object symmetries and counteracts measurement ambiguities (see the sketch after this list).
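As a rough sketch of this scheme (the residual below is our illustrative assumption, not the paper's exact formulation), the max-mixture idea [32] can be applied to the symmetry-induced modes of a rotation measurement: each measurement is scored against only its most likely mode instead of a sum over all modes.

    import numpy as np

    def geodesic_distance(R_a, R_b):
        # Angle (radians) of the relative rotation between two rotation
        # matrices, from trace(R_a^T R_b) = 1 + 2 cos(theta).
        cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
        return np.arccos(np.clip(cos_theta, -1.0, 1.0))

    def max_mixture_rotation_error(R_est, R_meas, symmetries):
        # Every proper symmetry S of the object makes R_est @ S an equally
        # valid hypothesis, so the measurement likelihood is a mixture with
        # one mode per symmetry. Following the max-mixture idea, only the
        # single most likely component is scored; with equal component
        # weights this reduces to the minimum geodesic error over all
        # symmetry-transformed hypotheses.
        #   symmetries : list of 3x3 rotation matrices, incl. the identity
        errors = [geodesic_distance(R_est @ S, R_meas) for S in symmetries]
        return min(errors)  # residual for a nonlinear least-squares solver

For an object with a 4-fold symmetry about its z-axis, symmetries would hold the four rotations by multiples of 90 degrees about z; because the selected component can change between solver iterations, the optimization can escape an initially wrong mode.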
II. RELATED WORK
A. Object Pose Estimation from a Single RGB Image
Many approaches have been presented in recent years
to address the pose estimation problem for textureless