Multi-view object pose estimation from
correspondence distributions and epipolar geometry
Rasmus Laurvig Haugaard and Thorbjørn Mosekjær Iversen
Abstract: In many automation tasks involving manipulation of rigid objects, the poses of the objects must be acquired. Vision-based pose estimation using a single RGB or RGB-D sensor is especially popular due to its broad applicability. However, single-view pose estimation is inherently limited by depth ambiguity and by ambiguities imposed by various phenomena like occlusion, self-occlusion, and reflections. Aggregation of information from multiple views can potentially resolve these ambiguities, but the current state-of-the-art multi-view pose estimation method only uses multiple views to aggregate single-view pose estimates, and thus relies on obtaining good single-view estimates. We present a multi-view pose estimation method which aggregates learned 2D-3D distributions from multiple views for both the initial estimate and optional refinement. Our method performs probabilistic sampling of 3D-3D correspondences under epipolar constraints, using learned 2D-3D correspondence distributions which are implicitly trained to respect visual ambiguities such as symmetry. Evaluation on the T-LESS dataset shows that our method reduces pose estimation errors by 80-91% compared to the best single-view method, and we present state-of-the-art results on T-LESS with four views, even compared with methods using five and eight views.
I. INTRODUCTION
Many robotics tasks involve precise manipulation of rigid
objects. This is especially true for industrial robotics where
high precision pose estimates are crucial for successful exe-
cution of demanding tasks such as bin picking and assembly.
Single-view pose estimation applies to a wide range of
scenarios where pose estimation is desired, from augmented
reality to self-driving cars to object manipulation, including
bin picking and assembly. Therefore, from both a practical
and scientific perspective, it is of great interest to push the
limits of single-view pose estimation. For example, much
effort has been focused on obtaining robustness to partial occlusions, with popular benchmarks targeting pose estimation even for objects that are 90% occluded [1], and thus, state-of-the-art methods show quite remarkable robustness towards occlusion. Single-view pose estimation is however inherently limited, not only by occlusion from other objects, but also by self-occlusion and notably depth ambiguity. Estimating depth from a single color image has relatively high uncertainty as a consequence of the small effect that a change in depth has on the size of the object in the image. Depth sensors can be used to reduce depth ambiguity; however, cheap depth sensors are of questionable quality, good depth sensors are expensive, and depth does not mitigate the other ambiguities of single-view pose estimation.
All authors are from SDU Robotics, Maersk Mc-Kinney Moller Institute,
University of Southern Denmark. The authors gratefully acknowledge the
support from Innovation Fund Denmark through the project MADE Fast.
{rlha,thmi}@mmmi.sdu.dk
1) u1 ∼ p(u1)
2) c ∼ p(c | u1)
3) u2 ∼ p(u2 | u1, c)
Fig. 1. A real example of sampling a 3D-3D correspondence using three image crops from different views. We propose to sample from the joint 3D-3D correspondence distribution p(x, c) of scene points, x ∈ R^3, and object points, c ∈ R^3. We let x be represented by two points, (u1, u2), from two different views, such that p(x, c) = p(u1, u2, c) = p(u1) p(c | u1) p(u2 | u1, c). 1) An image point, u1, shown by a red circle, is sampled from estimated masks across views. For brevity, the masks are not shown here. 2) An object point, c, shown by a black circle, is sampled from a learned 2D-3D distribution over the object's surface. The distribution p(c | u1) is shown in red on the 3D model. 3) The image point, u1, imposes epipolar lines in the other images, shown by red lines. We use the learned 2D-3D distribution as well as the epipolar constraints to approximate p(u2 | u1, c), which is shown in black and white, and sample u2 from this distribution, shown by a red triangle. The resulting pose estimate from our full pipeline is superimposed in the bottom row.
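To make the factorization p(x, c) = p(u1) p(c | u1) p(u2 | u1, c) concrete, the following NumPy sketch draws a single correspondence from two views. The tensor layout (per-pixel mask probabilities and per-pixel scores over K pre-sampled surface points), the Gaussian weighting of the epipolar distance, and all function and variable names are illustrative assumptions, not the exact implementation of our pipeline.

import numpy as np

def sample_correspondence(rng, mask1, corr1, corr2, surface_pts, F, sigma=2.0):
    # mask1:        (H, W) estimated object mask probabilities in view 1
    # corr1, corr2: (H, W, K) unnormalized log-scores of the learned 2D-3D
    #               distribution, relating each pixel to K sampled surface points
    # surface_pts:  (K, 3) object surface points in the object frame
    # F:            (3, 3) fundamental matrix mapping view-1 points to
    #               epipolar lines in view 2
    H, W, K = corr1.shape
    # 1) u1 ~ p(u1): sample an image point from the estimated mask in view 1
    p_u1 = (mask1 / mask1.sum()).ravel()
    y1, x1 = np.unravel_index(rng.choice(H * W, p=p_u1), (H, W))
    # 2) c ~ p(c | u1): sample an object point from the learned 2D-3D
    #    distribution at u1 (softmax over the K surface points)
    logits = corr1[y1, x1]
    p_c = np.exp(logits - logits.max())
    p_c /= p_c.sum()
    k = rng.choice(K, p=p_c)
    # 3) u2 ~ p(u2 | u1, c): combine view 2's 2D-3D scores for the same object
    #    point with a soft (Gaussian) penalty on the distance to the epipolar line
    a, b, c_ = F @ np.array([x1, y1, 1.0])
    ys, xs = np.mgrid[0:H, 0:W]
    sq_dist = (a * xs + b * ys + c_) ** 2 / (a ** 2 + b ** 2)
    logits2 = corr2[..., k] - sq_dist / (2 * sigma ** 2)
    p_u2 = np.exp(logits2 - logits2.max())
    p_u2 /= p_u2.sum()
    y2, x2 = np.unravel_index(rng.choice(H * W, p=p_u2.ravel()), (H, W))
    return (x1, y1), surface_pts[k], (x2, y2)

In this sketch, repeating steps 1)-3) would yield a set of correspondences, with each scene point x obtained by triangulating the sampled pair (u1, u2).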
In some cases, like industrial object manipulation, obtain-
ing multiple views is feasible, and intuitively, aggregation
of information from multiple views has great potential to
reduce ambiguities and obtain more robust systems. State-of-
the-art multi-view pose estimation methods either only use
multi-view information for refinement [2], [3] or use heuristic
features [4]. Also, [2]–[4] assume uni-modal distributions
in pose space, 2D-3D correspondence space, and curvature
space, respectively.
We hypothesize that there is potential in taking an ap-
proach that is more fundamentally multi-view, while using
learned features that can represent relevant ambiguities. To
that end, we envision a pipeline consisting of a multi-view
detector followed by a multi-view pose estimator, and this
work focuses on the latter.
We present a novel multi-view RGB-only pose estima-
tion method, using recent learned 2D-3D correspondence
distributions [5] and epipolar geometry to sample 3D-3D
correspondences as outlined in Figure 1. Pose hypotheses are
sampled from the resulting correspondences, and scored by
the agreement with the 2D-3D correspondence distributions
across views. Finally, the best pose hypothesis is refined to
maximize this agreement across views.
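As an illustration of the scoring step, the sketch below evaluates a pose hypothesis by summing, over all views, the log-probability that each sampled model point corresponds to the pixel it projects to. The data layout, the log-likelihood scoring rule, and the omission of visibility handling are simplifying assumptions, not the exact scoring described in Section III.

import numpy as np

def multiview_agreement(R, t, views, model_pts):
    # Score a pose hypothesis (R, t), mapping object to world coordinates,
    # against the learned 2D-3D correspondence distributions in all views.
    # views: list of dicts with
    #   'K':           (3, 3) camera intrinsics
    #   'cam_T_world': (4, 4) world-to-camera transform
    #   'log_p':       (H, W, M) per-pixel log-probabilities over the M model points
    # model_pts: (M, 3) sampled object surface points, matching the last axis of 'log_p'
    score = 0.0
    world_pts = model_pts @ R.T + t
    for v in views:
        T = v['cam_T_world']
        cam_pts = world_pts @ T[:3, :3].T + T[:3, 3]
        uvw = cam_pts @ v['K'].T
        uv = uvw[:, :2] / uvw[:, 2:3]                  # perspective projection
        H, W, _ = v['log_p'].shape
        px = np.round(uv).astype(int)
        inside = (px[:, 0] >= 0) & (px[:, 0] < W) & (px[:, 1] >= 0) & (px[:, 1] < H)
        m = np.arange(len(model_pts))
        # Sum the log-probability that model point m corresponds to the pixel it
        # projects to in this view. Visibility/self-occlusion handling is omitted.
        score += v['log_p'][px[inside, 1], px[inside, 0], m[inside]].sum()
    return score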
Our primary contribution is a state-of-the-art multi-view
RGB-only pose estimation method, combining learned 2D-
3D correspondence distributions and epipolar geometry. We
also contribute multi-view extensions of the pose scoring and
refinement proposed in [5].
We present related work in Section II, describe our method in detail in Section III, explain how we evaluate our method in Section IV, present and discuss our findings in Section V, and comment on limitations and future work in Section VI.
II. RELATED WORK
The following section reviews the literature related to
the problem of estimating 6D poses of rigid objects from
multiple RGB or RGB-D sensors, under the assumption that
accurate 3D models of the objects are available. Even under
this limited scope, there exists a large number of published
methods, and thus this review is limited to recent publications
which serve as representative works of the various pose estimation methodologies.
Using a coarse categorization, pose estimation methods can be divided into surface-based methods and image-based methods. Surface-based methods rely on the creation of a 3D point cloud by reconstructing points of an object's surface, e.g. using traditional stereo methods which triangulate surface points from 2D-2D point correspondences under epipolar constraints [6], using deep learning based multi-view stereo [7], or depth sensors [8]. The depth information from multiple RGB-D views has also been fused to form more complete point clouds [9]. The object pose can then be
inferred from the point cloud by finding 3D-3D correspon-
dences between the cloud and the model. This has e.g. been
done using deep learning to estimate 3D object keypoints
directly from sparse point clouds [10], or estimating 2D-3D
correspondences and lifting them to 3D-3D correspondences
with a depth sensor [11]. From 3D-3D correspondences, the
Kabsch algorithm can be used to compute the relative pose,
often as part of a RANSAC procedure [12] to be robust
toward outliers. Pose refinement is then often performed, e.g.
using ICP [13].
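As a reference for the Kabsch-within-RANSAC step mentioned above, the following is a minimal NumPy sketch; the inlier threshold, iteration count, and function names are placeholders, and the subsequent refinement, e.g. with ICP, is left out.

import numpy as np

def kabsch(P, Q):
    # Least-squares rigid transform (R, t) with Q ≈ R @ P + t, for P and Q of shape (N, 3).
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def ransac_kabsch(model_pts, scene_pts, iters=200, thresh=0.01, rng=None):
    # Robust pose from putative 3D-3D correspondences model_pts[i] <-> scene_pts[i].
    rng = rng or np.random.default_rng()
    best, best_inliers = None, 0
    for _ in range(iters):
        idx = rng.choice(len(model_pts), size=3, replace=False)   # minimal sample
        R, t = kabsch(model_pts[idx], scene_pts[idx])
        residuals = np.linalg.norm(model_pts @ R.T + t - scene_pts, axis=1)
        inliers = int((residuals < thresh).sum())
        if inliers > best_inliers:
            best, best_inliers = (R, t), inliers
    return best, best_inliers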
Surface-based methods rely on the accuracy of the reconstructed point clouds, so their performance depends on the quality of the depth sensor or on whether accurate 2D-2D correspondences can be found for triangulation. This can
be problematic for industrial applications since accurate
depth sensors are expensive and industrial objects tend to
be symmetric with featureless surfaces, which makes for a
challenging correspondence problem [1].
Image-based pose estimation methods estimate a pose
directly from the image without reconstructing the object
surface in 3D. One approach to image-based pose esti-
mation is establishing 2D-3D correspondences followed by
the Perspective-n-Point (PnP) algorithm [12], which utilizes the fact that the projections of 4 or more non-collinear object points uniquely define an object's pose. The PnP algorithm is the foundation for many traditional as well as contemporary pose estimation methods, e.g. [14], which uses deep learning to regress bounding box corners, and [5], which learns 2D-3D correspondence distributions. Other image-based deep learning
methods include direct regression of object orientation [2],
[15] and treating the rotation estimation as a classification
problem [16].
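As a brief illustration of the 2D-3D-correspondences-plus-PnP approach, the sketch below recovers a camera-from-object pose from given correspondences using OpenCV's RANSAC-based PnP solver; the wrapper function and its parameters are illustrative and not tied to any of the cited methods.

import numpy as np
import cv2

def pose_from_2d3d(obj_pts, img_pts, K, dist_coeffs=None):
    # obj_pts: (N, 3) object points in the model frame, img_pts: (N, 2) their
    # detected image locations, K: (3, 3) camera intrinsics. Requires N >= 4.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts.astype(np.float32), img_pts.astype(np.float32),
        K.astype(np.float32), dist_coeffs)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)          # axis-angle to rotation matrix
    return R, tvec.reshape(3), inliers  # camera-from-object pose and inlier indices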
While most pose estimation methods, such as [2], [14], assume a uni-modal pose distribution and handle symmetries explicitly to better justify this assumption, there are methods, such as [5], which allow multi-modal pose distributions and handle ambiguities like symmetry implicitly.
All of the previously mentioned methods focus on es-
timating the single most probable pose, given the image.
There also exist methods that estimate entire pose distributions [17], [18].
The above image-based pose estimation methods are all
single-view, and thus suffer from the aforementioned single-
view ambiguities. Several pose estimation methods have
been proposed to aggregate information from multiple views.
These can roughly be divided into one of two categories
depending on whether the aggregation is done as high level
pose refinement or low level image feature aggregation.
High-level view aggregation has e.g. been done using
object-level SLAM which simultaneously refines the poses
of multiple objects together with multi-view camera extrin-
sics [19], pose voting which increases reliability of pose
estimates through a voting scheme [20], or pose refinement
by minimizing a loss based on similarity between observed
and rendered correspondence features across views [11],
[21]. Most of the methods assume that the object symmetries
are provided, such that the pose ambiguities caused by object
symmetry can be explicitly accounted for. This has e.g. been
done by explicitly incorporating symmetries in the objective
of a scene graph optimization [2], or using symmetries
together with estimated extrinsics to explicitly add priors
on the object pose [22]. Methods that estimate full pose
distributions [17], [18] are particularly well suited for pose-
level multi-view aggregation [23]. However, these methods have yet to show state-of-the-art performance on established benchmarks.
Low-level aggregation of image data from multiple views
has been done by using DenseFusion [24] to aggre-
gate learned visual and geometric features from multiple
views [25], by estimating centroids and image curvatures for
use in a multi-view render-and-compare scheme [4], or by
formulating pose refinement as an optimization problem with
an objective based on the object shape reprojection error in