6D Pose Estimation for Textureless Objects on RGB Frames using
Multi-View Optimization
Jun Yang*, Wenjie Xue†, Sahar Ghavidel†, and Steven L. Waslander*
Abstract— 6D pose estimation of textureless objects is a valuable but challenging task for many robotic applications. In this work, we propose a framework to address this challenge using only RGB images acquired from multiple viewpoints. The core idea of our approach is to decouple 6D pose estimation into a sequential two-step process, first estimating the 3D translation and then the 3D rotation of each object. This decoupled formulation first resolves the scale and depth ambiguities in single RGB images, and uses these estimates to accurately identify the object orientation in the second stage, which is greatly simplified with an accurate scale estimate. Moreover, to accommodate the multi-modal distribution present in rotation space, we develop an optimization scheme that explicitly handles object symmetries and counteracts measurement uncertainties. In comparison to the state-of-the-art multi-view approach, we demonstrate that the proposed approach achieves substantial improvements on a challenging 6D pose estimation dataset for textureless objects.
I. INTRODUCTION
Texture-less rigid objects occur frequently in industrial
environments and are of significant interest in many
robotic applications. The task of 6D pose estimation aims
to detect these objects of known geometry and estimate
their 6DoF (Degree of Freedom) poses, i.e., 3D translations
and 3D rotations, with respect to a global coordinate
frame. In robotic manipulation tasks, accurate object
poses are required for path planning and grasp execu-
tions [1], [2], [3]. For robotic navigation, 6D object poses
provide useful information to the robot for localization
and obstacle avoidance [4], [5], [6], [7].
Due to the lack of appearance features, the problem of 6D pose estimation for textureless objects has historically been addressed mainly with depth data [8], [9], [10], [11], [12] or
RGB-D images [13], [2], [14], [15], [16]. These approaches
can achieve strong pose estimation performance when
given high-quality depth data. Despite recent advances
in depth acquisition technology, commodity-level depth
cameras produce depth maps with low accuracy and
missing data when surfaces are too glossy or dark [17],
[18], or the object is transparent [19], [20]. Hence, in the
past decade, RGB-based solutions have received a lot of
attention as an alternative approach [21], [22]. Due to
the advancements in deep learning, some learning-based
approaches have recently been shown to significantly
This work was supported by Epson Canada Ltd.
*Jun Yang and Steven L. Waslander are with University of Toronto
Institute for Aerospace Studies and Robotics Institute. {jun.yang,
steven.waslander}@robotics.utias.utoronto.ca
†Wenjie Xue and Sahar Ghavidel are with Epson Canada
{mark.xue, sahar.ghavidel}@ea.epson.com
boost the object pose estimation performance using only
RGB images [23], [24], [25], [26], [27]. However, due to
the scale, depth, and perspective ambiguities inherent to
a single viewpoint, RGB-based solutions usually have low
accuracy for the final estimated 6D poses.
To this end, recent works utilize multiple RGB frames
acquired from different viewpoints to enhance their pose
estimation results [28], [29], [30], [6], [31], [7]. In particular,
these approaches can be further categorized into offline
batch-based solutions [28], [30], where all the frames are
provided at once, and incremental solutions [29], [6], [31],
[7], where frames are provided sequentially. While fusing
pose estimates from different viewpoints can improve
the overall performance, handling extreme inconsistency,
such as appearance ambiguities, rotational symmetries,
and possible occlusions, is still challenging. To address
these challenges, in this work, we propose a decoupled
formulation to factorize the 6D pose estimation problem
into a sequential two-step optimization process. Figure 1
shows an overview of the framework. Based on the per-frame predictions of the object's segmentation mask and 2D center from neural networks, we first optimize the 3D translation and obtain the object's scale in the image. The acquired scale greatly simplifies the object rotation estimation problem with a template-matching method [21]. A max-mixture formulation [32] is finally adopted to accommodate the multi-modal output distribution present in
rotation space. We conduct extensive experiments on the
challenging ROBI dataset [33]. In comparison to the state-
of-the-art method CosyPose [28], we achieve a substantial
improvement with our method (28.5% and 3.4% over its
RGB and RGBD version, respectively).
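The translation step relies on per-frame 2D center predictions together with known camera poses. A standard way to recover a 3D point from such measurements is linear (DLT) triangulation; the sketch below illustrates the general technique under noise-free inputs and a shared intrinsic matrix, and is not the authors' exact optimizer — the function and variable names are illustrative.

```python
import numpy as np

def triangulate_center(centers_px, K, cam_poses_w):
    """Estimate a 3D point in the world frame from 2D detections in
    several calibrated views via linear least squares (DLT).

    centers_px  : list of (u, v) pixel coordinates of the object center
    K           : 3x3 camera intrinsic matrix (shared by all views)
    cam_poses_w : list of 4x4 camera-to-world transforms T_wc
    """
    rows = []
    for (u, v), T_wc in zip(centers_px, cam_poses_w):
        T_cw = np.linalg.inv(T_wc)            # world -> camera
        P = K @ T_cw[:3, :]                   # 3x4 projection matrix
        # Each view contributes two linear constraints on the 3D point:
        # u * (p3 . X) - (p1 . X) = 0 and v * (p3 . X) - (p2 . X) = 0.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Solve A x = 0 for the homogeneous point via SVD (smallest singular vector).
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```

With two or more views whose baselines are not degenerate, the smallest singular vector of the stacked constraint matrix recovers the 3D center up to numerical precision.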
In summary, our key contributions are:
• We propose a novel 6D object pose estimation approach that decouples the problem into a sequential two-step process. This process resolves the depth ambiguities from individual RGB frames and greatly improves the estimate of the rotation parameters.
• To deal with the multi-modal uncertainties of object rotation, we develop a rotation optimization scheme that explicitly handles object symmetries and counteracts measurement ambiguities.
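To make the symmetry issue in the second contribution concrete: a naive geodesic rotation error penalizes estimates that are visually indistinguishable from the ground truth under the object's symmetries. A common remedy, sketched below (not necessarily the paper's exact scheme), is to take the minimum error over a user-supplied symmetry group; `sym_rotations` is an assumed input.

```python
import numpy as np

def rot_geodesic_deg(R1, R2):
    """Geodesic angle (degrees) between two rotation matrices."""
    cos = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def sym_aware_error(R_est, R_gt, sym_rotations):
    """Smallest geodesic error over all symmetry-equivalent ground truths.

    sym_rotations: rotations S such that R_gt @ S renders identically
    (e.g., {I, Rz(180)} for an object with a 2-fold symmetry about z).
    """
    return min(rot_geodesic_deg(R_est, R_gt @ S) for S in sym_rotations)
```

For an object with a 180° symmetry about its z-axis, an estimate flipped by exactly 180° has a naive error of 180° but a symmetry-aware error of 0°.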
II. RELATED WORK
A. Object Pose Estimation from a Single RGB Image
arXiv:2210.11554v2 [cs.RO] 21 Feb 2023

Fig. 1: An overview of the proposed multi-view object pose estimation pipeline with a two-step optimization formulation.

Many approaches have been presented in recent years to address the pose estimation problem for texture-less objects with only RGB images. Due to the lack of appearance features, traditional methods usually tackle the problem via holistic template matching techniques [21], [34],
[35], but are susceptible to scale change and cluttered en-
vironments. More recently, deep learning techniques, such
as convolutional neural networks (CNNs), have been em-
ployed to overcome these challenges. As two pioneering
methods, SSD-6D [23] and PoseCNN [24] developed the
CNN architectures to estimate the 6D object poses from
a single RGB image. In comparison, some recent works
leverage CNNs to first predict 2D object keypoints [36],
[37], [26] or dense 2D-3D correspondences [38], [39], [27],
[40], and then compute the pose through 2D-3D corre-
spondences with a PnP algorithm [41]. Although these
methods show good 2D detection results, the accuracy
in the final 6D poses is generally low.
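To make the correspondence-plus-PnP pipeline concrete, below is a minimal Direct Linear Transform (DLT) PnP sketch: given 2D-3D correspondences, it solves for the 3x4 projection matrix linearly and projects its left block onto SO(3). This is a simple, noise-sensitive variant for illustration only; the PnP algorithm cited as [41] and practical systems use more robust solvers (e.g., EPnP inside RANSAC).

```python
import numpy as np

def dlt_pnp(pts3d, pts2d, K):
    """Recover (R, t) from >=6 non-coplanar 2D-3D correspondences
    with the Direct Linear Transform, a basic linear PnP variant."""
    # Work in normalized image coordinates: x_n = K^{-1} [u, v, 1]^T.
    pts2d_n = (np.linalg.inv(K) @ np.hstack(
        [pts2d, np.ones((len(pts2d), 1))]).T).T
    rows = []
    for (X, Y, Z), (x, y, _) in zip(pts3d, pts2d_n):
        # Two linear constraints per correspondence on P = [R | t].
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -x*X, -x*Y, -x*Z, -x])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -y*X, -y*Y, -y*Z, -y])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    P = Vt[-1].reshape(3, 4)          # solution up to sign and scale
    M = P[:, :3]
    if np.linalg.det(M) < 0:          # fix the sign ambiguity
        P, M = -P, -M
    s = np.cbrt(np.linalg.det(M))     # det(sR) = s^3, so recover s
    P = P / s
    # Project the left 3x3 block onto SO(3) for a proper rotation.
    U, _, Vt2 = np.linalg.svd(P[:, :3])
    return U @ Vt2, P[:, 3]
```

With exact synthetic correspondences the pose is recovered to numerical precision; real detections require outlier rejection around this step.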
B. Object Pose Estimation from Multiple Viewpoints
Multi-view approaches aim to resolve the scale and
depth ambiguities that commonly occur in the single
viewpoint setting and improve the accuracy of the esti-
mated poses. Traditional works utilize local features [42],
[43] and cannot handle textureless objects. Recently, the
multi-view object pose estimation problem has been revisited with neural networks. These approaches use an offline, batch-based optimization formulation, where all the frames are given at once, to obtain a single consistent scene interpretation [44], [28], [20], [30]. Compared to
batch-based methods, other works solve the multi-view
pose estimation problem in an online manner. These
works estimate camera poses and object poses simulta-
neously, known as object-level SLAM [5], [6], [45], [7],
or estimate object poses with known camera poses [1],
[29], [31]. Although these methods show performance im-
provements with only RGB images, they still face difficulty
in dealing with object scales, rotational symmetries, and
measurement uncertainties.
With the per-frame neural network predictions as mea-
surements, our work resolves the depth and scale am-
biguities by a decoupled formulation. It also explicitly
handles rotational symmetries and measurement uncer-
tainties within an incremental online framework.
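One building block of such an incremental framework is fusing per-view translation measurements as they arrive. Under the simplifying assumption that each view yields an independent Gaussian estimate of the object translation, the information-form (inverse-covariance) update is a few lines; this is a sketch of the general technique, not the authors' exact update rule.

```python
import numpy as np

def fuse_gaussians(estimates):
    """Fuse independent Gaussian measurements (mu_k, Sigma_k) of the
    same 3D translation into a single Gaussian, in information form:
        Lambda = sum_k Sigma_k^{-1}
        mu     = Lambda^{-1} sum_k Sigma_k^{-1} mu_k
    """
    Lam = np.zeros((3, 3))   # accumulated information matrix
    eta = np.zeros(3)        # accumulated information vector
    for mu, Sigma in estimates:
        info = np.linalg.inv(Sigma)
        Lam += info
        eta += info @ mu
    Sigma_fused = np.linalg.inv(Lam)
    return Sigma_fused @ eta, Sigma_fused
```

Because the update only accumulates `Lam` and `eta`, each new viewpoint can be folded in online without revisiting earlier frames, and the fused covariance shrinks as views are added.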
III. APPROACH OVERVIEW AND PROBLEM FORMULATION
Given the 3D object model and multi-view images, the goal of 6D object pose estimation is to estimate the rigid transformation $T_{wo} \in SE(3)$ from the object model frame $O$ to a global (world) frame $W$. We assume that we know the camera poses $T_{wc} \in SE(3)$ with respect to the world frame. These can be obtained by robot forward kinematics and eye-in-hand calibration when the camera is mounted on the end-effector of a robotic arm [46], or by off-the-shelf SLAM methods for a hand-held camera [47], [48].

Given measurements $Z_{1:k}$ up to viewpoint $k$, we aim to estimate the posterior distribution of the 6D object pose $P(R_{wo}, t_{wo} \mid Z_{1:k})$. The direct computation of this distribution is generally not feasible since the object translation $t_{wo}$ and rotation $R_{wo}$ have distinct distributions. Specifically, the translation distribution $P(t_{wo})$ is straightforward and expected to be unimodal. In contrast, the distribution of the object rotation $P(R_{wo})$ is less obvious due to complex uncertainties arising from shape symmetries, appearance ambiguities, and possible occlusions. Inspired by [29], we decouple the pose posterior $P(R_{wo}, t_{wo} \mid Z_{1:k})$ into:

$$P(R_{wo}, t_{wo} \mid Z_{1:k}) = P(R_{wo} \mid Z_{1:k}, t_{wo}) \, P(t_{wo} \mid Z_{1:k}) \quad (1)$$

where $P(t_{wo} \mid Z_{1:k})$ can be formulated as a unimodal Gaussian distribution $\mathcal{N}(t_{wo} \mid \mu, \Sigma)$, and $P(R_{wo} \mid Z_{1:k}, t_{wo})$ is the rotation distribution conditioned on the input images $Z_{1:k}$ and the 3D translation $t_{wo}$. To represent the complex rotation uncertainties, similar to [42], we formulate $P(R_{wo} \mid Z_{1:k}, t_{wo})$ as a mixture of Gaussians:

$$P(R_{wo} \mid Z_{1:k}, t_{wo}) = \sum_{i=1}^{N} w_i \, \mathcal{N}(R_{wo} \mid \mu_i, \Sigma_i) \quad (2)$$

which consists of $N$ Gaussian components. The coefficient $w_i$ denotes the weight of the mixture component, and $\mu_i$
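The mixture in Eq. (2) is awkward inside a least-squares optimizer because its negative log-likelihood involves a log-sum. The max-mixture idea of [32] replaces the sum over components with a max, keeping the cost piecewise quadratic. The 1-D sketch below (e.g., over a scalar rotation-angle residual) illustrates the substitution; the paper's formulation operates on SO(3) rather than scalars.

```python
import numpy as np

def gmm_loglik(x, comps):
    """Log-likelihood of x under a 1-D Gaussian mixture (true sum form).
    comps: list of (weight, mean, stddev) tuples."""
    terms = [np.log(w) - 0.5 * ((x - mu) / s) ** 2
             - np.log(s * np.sqrt(2 * np.pi))
             for w, mu, s in comps]
    return np.logaddexp.reduce(terms)   # numerically stable log-sum-exp

def max_mixture_loglik(x, comps):
    """Max-mixture surrogate: keep only the dominant component.
    This lower-bounds the true mixture log-likelihood, and near a
    component mean the bound is tight."""
    terms = [np.log(w) - 0.5 * ((x - mu) / s) ** 2
             - np.log(s * np.sqrt(2 * np.pi))
             for w, mu, s in comps]
    return max(terms)
```

At each optimizer iteration the max selects one active component (e.g., one symmetry hypothesis), so the problem locally reduces to an ordinary unimodal Gaussian factor.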