6D Pose Estimation for Textureless Objects on RGB Frames using
Multi-View Optimization
Jun Yang*, Wenjie Xue†, Sahar Ghavidel†, and Steven L. Waslander*
*Jun Yang and Steven L. Waslander are with the University of Toronto Institute for Aerospace Studies and the Robotics Institute. {jun.yang, steven.waslander}@robotics.utias.utoronto.ca
†Wenjie Xue and Sahar Ghavidel are with Epson Canada. {mark.xue, sahar.ghavidel}@ea.epson.com
This work was supported by Epson Canada Ltd.
Abstract— 6D pose estimation of textureless objects is a
valuable but challenging task for many robotic applications. In
this work, we propose a framework to address this challenge
using only RGB images acquired from multiple viewpoints. The
core idea of our approach is to decouple 6D pose estimation
into a sequential two-step process, first estimating the 3D
translation and then the 3D rotation of each object. This
decoupled formulation first resolves the scale and depth ambiguities inherent to single RGB images, and the resulting scale estimate then greatly simplifies the accurate identification of the object orientation in the second stage. Moreover, to accommodate the multi-modal distribution
present in rotation space, we develop an optimization scheme
that explicitly handles object symmetries and counteracts
measurement uncertainties. Compared with the state-of-the-art multi-view approach, we demonstrate that our method achieves substantial improvements on a challenging 6D pose estimation dataset for textureless objects.
I. INTRODUCTION
Textureless rigid objects occur frequently in industrial
environments and are of significant interest in many
robotic applications. The task of 6D pose estimation aims
to detect these objects of known geometry and estimate
their 6DoF (degrees of freedom) poses, i.e., 3D translations
and 3D rotations, with respect to a global coordinate
frame. In robotic manipulation tasks, accurate object
poses are required for path planning and grasp execution [1], [2], [3]. For robotic navigation, 6D object poses
provide useful information to the robot for localization
and obstacle avoidance [4], [5], [6], [7].
Due to the lack of appearance features, the problem of 6D pose estimation for textureless objects has historically been addressed mainly with depth data [8], [9], [10], [11], [12] or
RGB-D images [13], [2], [14], [15], [16]. These approaches
can achieve strong pose estimation performance when
given high-quality depth data. Despite recent advances
in depth acquisition technology, commodity-level depth
cameras produce depth maps with low accuracy and
missing data when surfaces are too glossy or dark [17],
[18], or the object is transparent [19], [20]. Hence, in the
past decade, RGB-based solutions have received considerable attention as an alternative [21], [22]. Driven by advances in deep learning, learning-based approaches have recently been shown to significantly
boost object pose estimation performance using only RGB images [23], [24], [25], [26], [27]. However, due to the scale, depth, and perspective ambiguities inherent to a single viewpoint, RGB-based solutions usually yield low accuracy in their final 6D pose estimates.
To mitigate these ambiguities, recent works utilize multiple RGB frames acquired from different viewpoints to enhance their pose estimation results [28], [29], [30], [6], [31], [7]. In particular,
these approaches can be further categorized into offline
batch-based solutions [28], [30], where all the frames are
provided at once, and incremental solutions [29], [6], [31],
[7], where frames are provided sequentially. While fusing pose estimates from different viewpoints can improve overall performance, handling severe inconsistencies across views, such as appearance ambiguities, rotational symmetries, and occlusions, remains challenging. To address
these challenges, in this work, we propose a decoupled
formulation to factorize the 6D pose estimation problem
into a sequential two-step optimization process. Figure 1
shows an overview of the framework. Based on the per-
frame predictions of the object’s segmentation mask and
2D center from neural networks, we first optimize the 3D
translation and obtain the object’s scale in the image. The
acquired scale greatly simplifies the object rotation esti-
mation problem with a template-matching method [21]. A
max-mixture formulation [32] is finally adopted to accom-
modate the multi-modal output distribution present in
rotation space. We conduct extensive experiments on the
challenging ROBI dataset [33]. Compared with the state-of-the-art method CosyPose [28], our method achieves a substantial improvement (28.5% and 3.4% over its RGB and RGB-D versions, respectively).
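To make the first step concrete, the sketch below shows one standard way to recover an object's 3D center from per-frame 2D center predictions: linear (DLT-style) triangulation, assuming pinhole cameras with known intrinsics and extrinsics. It is a minimal illustration of the idea rather than the paper's exact objective, and all names are ours.

    import numpy as np

    def triangulate_object_center(centers_px, intrinsics, poses_w2c):
        # DLT-style triangulation of an object's 3D center from its
        # predicted 2D centers in several calibrated views.
        #   centers_px : list of (u, v) pixel predictions, one per view
        #   intrinsics : list of 3x3 camera matrices K
        #   poses_w2c  : list of 3x4 world-to-camera matrices [R | t]
        # Returns the world-frame 3D point minimizing the algebraic error.
        rows = []
        for (u, v), K, P in zip(centers_px, intrinsics, poses_w2c):
            M = K @ P                     # 3x4 projection matrix for this view
            rows.append(u * M[2] - M[0])  # each view contributes two linear
            rows.append(v * M[2] - M[1])  # constraints on the homogeneous X
        A = np.asarray(rows)
        _, _, Vt = np.linalg.svd(A)       # homogeneous least squares via SVD
        X = Vt[-1]                        # singular vector of smallest value
        return X[:3] / X[3]

Once the translation, and hence the object's projected scale, is fixed, the second stage only needs to match orientation templates at a single known scale.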
In summary, our key contributions are:
• We propose a novel 6D object pose estimation approach that decouples the problem into a sequential two-step process. This process resolves the depth ambiguities of individual RGB frames and greatly improves the estimation of the rotation parameters.
• To deal with the multi-modal uncertainties of object rotation, we develop a rotation optimization scheme that explicitly handles object symmetries and counteracts measurement ambiguities (see the sketch after this list).
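As a rough sketch of this scheme (the residual below is our illustrative assumption, not the paper's exact formulation), the max-mixture idea [32] can be applied to the symmetry-induced modes of a rotation measurement: each measurement is scored against only its most likely mode instead of a sum over all modes.

    import numpy as np

    def geodesic_distance(R_a, R_b):
        # Angle (radians) of the relative rotation between two rotation
        # matrices, from trace(R_a^T R_b) = 1 + 2 cos(theta).
        cos_theta = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
        return np.arccos(np.clip(cos_theta, -1.0, 1.0))

    def max_mixture_rotation_error(R_est, R_meas, symmetries):
        # Every proper symmetry S of the object makes R_est @ S an equally
        # valid hypothesis, so the measurement likelihood is a mixture with
        # one mode per symmetry. Following the max-mixture idea, only the
        # single most likely component is scored; with equal component
        # weights this reduces to the minimum geodesic error over all
        # symmetry-transformed hypotheses.
        #   symmetries : list of 3x3 rotation matrices, incl. the identity
        errors = [geodesic_distance(R_est @ S, R_meas) for S in symmetries]
        return min(errors)  # residual for a nonlinear least-squares solver

For an object with a 4-fold symmetry about its z-axis, symmetries would hold the four rotations by multiples of 90 degrees about z; because the selected component can change between solver iterations, the optimization can escape an initially wrong mode.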
II. RELATED WORK
A. Object Pose Estimation from a Single RGB Image
Many approaches have been presented in recent years
to address the pose estimation problem for textureless