that end, we envision a pipeline consisting of a multi-view
detector followed by a multi-view pose estimator, and this
work focuses on the latter.
We present a novel multi-view RGB-only pose estimation method, using recent learned 2D-3D correspondence
distributions [5] and epipolar geometry to sample 3D-3D
correspondences, as outlined in Figure 1. Pose hypotheses are
sampled from the resulting correspondences and scored by
their agreement with the 2D-3D correspondence distributions
across views. Finally, the best pose hypothesis is refined to
maximize this agreement across views.
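As a point of reference for the geometric operation underlying this sampling step, the snippet below triangulates a single scene point from a pair of corresponding 2D points in two calibrated views using OpenCV. It is only a minimal illustration of lifting cross-view correspondences to 3D; our actual sampling procedure is described in Section III, and all camera parameters below are illustrative values.

import numpy as np
import cv2

# Shared intrinsics for both views (illustrative values).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# Projection matrices of two calibrated views: the first at the world
# origin, the second with its camera center 0.1 m to the right.
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# One pair of corresponding image points (2xN arrays) in the two views.
pts0 = np.array([[320.0], [240.0]])
pts1 = np.array([[260.0], [240.0]])

# Triangulate to a homogeneous 3D point and dehomogenize.
X_h = cv2.triangulatePoints(P0, P1, pts0, pts1)   # 4xN
X = (X_h[:3] / X_h[3]).T                          # Nx3, here approx. (0, 0, 1)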
Our primary contribution is a state-of-the-art multi-view
RGB-only pose estimation method, combining learned 2D-
3D correspondence distributions and epipolar geometry. We
also contribute multi-view extensions of the pose scoring and
refinement proposed in [5].
We present related work in Section II, describe our method in
detail in Section III, explain how we evaluate it in Section IV, present and
discuss our findings in Section V, and comment on limitations and
future work in Section VI.
II. RELATED WORK
This section reviews the literature on
estimating 6D poses of rigid objects from
multiple RGB or RGB-D sensors, under the assumption that
accurate 3D models of the objects are available. Even within
this limited scope, a large number of methods have been
published, so this review is limited to recent publications
that serve as representative works of the various pose
estimation methodologies.
Using a coarse categorization, pose estimation methods
can be divided into surface-based methods and image-based
methods. Surface-based methods rely on the creation of a
3D point cloud by reconstructing points on an object's surface,
e.g. using traditional stereo methods which triangulate
surface points from 2D-2D point correspondences under
epipolar constraints [6], using deep-learning-based multi-view
stereo [7], or using depth sensors [8]. The depth information
from multiple RGB-D views has also been fused to form
more complete point clouds [9]. The object pose can then be
inferred from the point cloud by finding 3D-3D correspondences
between the cloud and the model. This has e.g. been
done using deep learning to estimate 3D object keypoints
directly from sparse point clouds [10], or by estimating 2D-3D
correspondences and lifting them to 3D-3D correspondences
with a depth sensor [11]. From 3D-3D correspondences, the
Kabsch algorithm can be used to compute the relative pose,
often as part of a RANSAC procedure [12] for robustness
against outliers. Pose refinement is then often performed, e.g.
using ICP [13].
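To make the Kabsch step concrete, a minimal NumPy sketch is given below; it computes the least-squares rigid transform aligning model points to scene points from a set of 3D-3D correspondences and would, in practice, be wrapped in a RANSAC loop over minimal sets of three correspondences.

import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) aligning P (Nx3) to Q (Nx3), i.e. Q ~ P @ R.T + t."""
    P_c, Q_c = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Q_c.T @ P_c)                      # SVD of the cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])    # avoid reflections
    R = U @ D @ Vt
    t = Q.mean(axis=0) - R @ P.mean(axis=0)
    return R, t

# Toy check: recover a known rotation about the z-axis and a translation.
theta = np.deg2rad(30.0)
R_gt = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0,            0.0,           1.0]])
t_gt = np.array([0.1, -0.2, 0.5])
P = np.random.rand(100, 3)            # model points
Q = P @ R_gt.T + t_gt                 # corresponding scene points
R_est, t_est = kabsch(P, Q)
assert np.allclose(R_est, R_gt) and np.allclose(t_est, t_gt)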
Surface-based methods rely on the accuracy of the reconstructed
point clouds, so their performance depends on
the quality of the depth sensor or on whether accurate 2D-2D
correspondences can be found for triangulation. This can
be problematic for industrial applications since accurate
depth sensors are expensive and industrial objects tend to
be symmetric with featureless surfaces, which makes for a
challenging correspondence problem [1].
Image-based pose estimation methods estimate a pose
directly from the image without reconstructing the object
surface in 3D. One approach to image-based pose estimation
is to establish 2D-3D correspondences and then apply
the Perspective-n-Point (PnP) algorithm [12], which exploits
the fact that the projections of four or more non-collinear object points
uniquely define an object's pose. The PnP algorithm is the
foundation of many traditional as well as contemporary pose
estimation methods, e.g. [14], which uses deep learning to
regress bounding box corners, and [5], which learns 2D-3D
correspondence distributions. Other image-based deep learning
methods include direct regression of object orientation [2],
[15] and treating rotation estimation as a classification
problem [16].
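As an illustration of this correspondence-plus-PnP approach, the example below recovers an object pose from 2D-3D correspondences with OpenCV's RANSAC-based PnP solver; the intrinsics, ground-truth pose, and cube-shaped model are synthetic and serve only to show the interface.

import numpy as np
import cv2

# Hypothetical intrinsics and ground-truth pose, used only to synthesize
# consistent 2D-3D correspondences for this example.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
rvec_gt = np.array([0.1, -0.2, 0.3])   # axis-angle rotation
tvec_gt = np.array([0.05, 0.0, 0.6])   # translation in meters

# 3D model points (corners of a 10 cm cube) and their 2D projections,
# i.e. a set of 2D-3D correspondences.
object_points = (np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                          dtype=np.float64) - 0.5) * 0.1
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)
image_points = image_points.reshape(-1, 2)

# Recover the pose with RANSAC-based PnP; rvec/tvec give the object pose
# in the camera frame, and inliers indexes the consistent correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation matrix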
While most pose estimation methods, such as [2], [14],
assume a uni-modal pose distribution and handle symmetries
explicitly to better justify this assumption, methods
such as [5] allow multi-modal pose distributions and
handle ambiguities like symmetry implicitly.
All of the previously mentioned methods focus on estimating
the single most probable pose given the image.
There also exist methods that estimate entire pose distributions
[17], [18].
The above image-based pose estimation methods are all
single-view, and thus suffer from the aforementioned single-
view ambiguities. Several pose estimation methods have
been proposed to aggregate information from multiple views.
These can roughly be divided into two categories,
depending on whether the aggregation is done as high-level
pose refinement or low-level image feature aggregation.
High-level view aggregation has e.g. been done using
object-level SLAM, which simultaneously refines the poses
of multiple objects together with the multi-view camera extrinsics
[19]; pose voting, which increases the reliability of pose
estimates through a voting scheme [20]; or pose refinement
that minimizes a loss based on the similarity between observed
and rendered correspondence features across views [11],
[21]. Most of these methods assume that the object symmetries
are provided, such that the pose ambiguities caused by object
symmetry can be explicitly accounted for. This has e.g. been
done by explicitly incorporating symmetries in the objective
of a scene graph optimization [2], or by using symmetries
together with estimated extrinsics to explicitly add priors
on the object pose [22]. Methods that estimate full pose
distributions [17], [18] are particularly well suited for pose-
level multi-view aggregation [23]. However, these methods
have yet to demonstrate state-of-the-art performance on established
benchmarks.
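To illustrate what pose-level aggregation can look like, the sketch below fuses hypothetical per-view log-likelihoods over a shared, discrete set of candidate rotations under a conditional-independence assumption; the candidate grid and per-view scores are random placeholders, and the distribution-based methods above use their own parameterizations.

import numpy as np
from scipy.spatial.transform import Rotation

# A shared, discrete set of candidate object rotations in the world frame
# (random here, purely for illustration).
candidates = Rotation.random(4096, random_state=0)

# Hypothetical per-view log-likelihoods over the candidate set, e.g. from
# evaluating each view's estimated pose distribution at every candidate.
log_p_views = [np.random.randn(len(candidates)) for _ in range(3)]

# Assuming views are conditionally independent given the object pose,
# fusion reduces to summing log-likelihoods and renormalizing.
fused = np.sum(log_p_views, axis=0)
fused -= fused.max()                               # numerical stability
posterior = np.exp(fused) / np.exp(fused).sum()
R_best = candidates[int(np.argmax(posterior))].as_matrix()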
Low-level aggregation of image data from multiple views
has been done by using DenseFusion [24] to aggregate
learned visual and geometric features from multiple
views [25], by estimating centroids and image curvatures for
use in a multi-view render-and-compare scheme [4], or by
formulating pose refinement as an optimization problem with
an objective based on the object shape reprojection error in