that end, we envision a pipeline consisting of a multi-view
detector followed by a multi-view pose estimator, and this
work focuses on the latter.
We present a novel multi-view RGB-only pose estimation method, using recent learned 2D-3D correspondence
distributions [5] and epipolar geometry to sample 3D-3D
correspondences, as outlined in Figure 1. Pose hypotheses are
sampled from the resulting correspondences and scored by
their agreement with the 2D-3D correspondence distributions
across views. Finally, the best pose hypothesis is refined to
maximize this agreement across views.
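As a point of reference for the geometric operation underlying this sampling step, the snippet below triangulates a single scene point from a pair of corresponding 2D points in two calibrated views using OpenCV. It is only a minimal illustration of lifting cross-view correspondences to 3D; our actual sampling procedure is described in Section III, and all camera parameters below are illustrative values.

import numpy as np
import cv2

# Shared intrinsics for both views (illustrative values).
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

# Projection matrices of two calibrated views: the first at the world
# origin, the second with its camera center 0.1 m to the right.
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

# One pair of corresponding image points (2xN arrays) in the two views.
pts0 = np.array([[320.0], [240.0]])
pts1 = np.array([[260.0], [240.0]])

# Triangulate to a homogeneous 3D point and dehomogenize.
X_h = cv2.triangulatePoints(P0, P1, pts0, pts1)   # 4xN
X = (X_h[:3] / X_h[3]).T                          # Nx3, here approx. (0, 0, 1)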
Our primary contribution is a state-of-the-art multi-view
RGB-only pose estimation method, combining learned 2D-
3D correspondence distributions and epipolar geometry. We
also contribute multi-view extensions of the pose scoring and
refinement proposed in [5].
We present related work in Section II, describe our method in
detail in Section III, explain how we evaluate it in Section IV, present and
discuss our findings in Section V, and comment on limitations and
future work in Section VI.
II. RELATED WORK
This section reviews the literature on
estimating 6D poses of rigid objects from
multiple RGB or RGB-D sensors, under the assumption that
accurate 3D models of the objects are available. Even within
this limited scope, a large number of methods have been
published, so this review is limited to recent publications
that serve as representative works of the various pose
estimation methodologies.
Using a coarse categorization, pose estimation methods
can be divided into surface-based methods and image-based
methods. Surface-based methods rely on the creation of a
3D point cloud by reconstructing points on an object's surface,
e.g. using traditional stereo methods which triangulate
surface points from 2D-2D point correspondences under
epipolar constraints [6], using deep-learning-based multi-view
stereo [7], or using depth sensors [8]. The depth information
from multiple RGB-D views has also been fused to form
more complete point clouds [9]. The object pose can then be
inferred from the point cloud by finding 3D-3D correspondences
between the cloud and the model. This has e.g. been
done using deep learning to estimate 3D object keypoints
directly from sparse point clouds [10], or by estimating 2D-3D
correspondences and lifting them to 3D-3D correspondences
with a depth sensor [11]. From 3D-3D correspondences, the
Kabsch algorithm can be used to compute the relative pose,
often as part of a RANSAC procedure [12] for robustness
against outliers. Pose refinement is then often performed, e.g.
using ICP [13].
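To make the Kabsch step concrete, a minimal NumPy sketch is given below; it computes the least-squares rigid transform aligning model points to scene points from a set of 3D-3D correspondences and would, in practice, be wrapped in a RANSAC loop over minimal sets of three correspondences.

import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) aligning P (Nx3) to Q (Nx3), i.e. Q ~ P @ R.T + t."""
    P_c, Q_c = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Q_c.T @ P_c)                      # SVD of the cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])    # avoid reflections
    R = U @ D @ Vt
    t = Q.mean(axis=0) - R @ P.mean(axis=0)
    return R, t

# Toy check: recover a known rotation about the z-axis and a translation.
theta = np.deg2rad(30.0)
R_gt = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0,            0.0,           1.0]])
t_gt = np.array([0.1, -0.2, 0.5])
P = np.random.rand(100, 3)            # model points
Q = P @ R_gt.T + t_gt                 # corresponding scene points
R_est, t_est = kabsch(P, Q)
assert np.allclose(R_est, R_gt) and np.allclose(t_est, t_gt)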
Surface-based methods rely on the accuracy of the reconstructed
point clouds, so their performance depends on
the quality of the depth sensor or on whether accurate 2D-2D
correspondences can be found for triangulation. This can
be problematic for industrial applications since accurate
depth sensors are expensive and industrial objects tend to
be symmetric with featureless surfaces, which makes for a
challenging correspondence problem [1].
Image-based pose estimation methods estimate a pose
directly from the image without reconstructing the object
surface in 3D. One approach to image-based pose estimation
is to establish 2D-3D correspondences and then apply
the Perspective-n-Point (PnP) algorithm [12], which exploits
the fact that the projections of four or more non-collinear object points
uniquely define an object's pose. The PnP algorithm is the
foundation of many traditional as well as contemporary pose
estimation methods, e.g. [14], which uses deep learning to
regress bounding box corners, and [5], which learns 2D-3D
correspondence distributions. Other image-based deep learning
methods include direct regression of object orientation [2],
[15] and treating rotation estimation as a classification
problem [16].
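As an illustration of this correspondence-plus-PnP approach, the example below recovers an object pose from 2D-3D correspondences with OpenCV's RANSAC-based PnP solver; the intrinsics, ground-truth pose, and cube-shaped model are synthetic and serve only to show the interface.

import numpy as np
import cv2

# Hypothetical intrinsics and ground-truth pose, used only to synthesize
# consistent 2D-3D correspondences for this example.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
rvec_gt = np.array([0.1, -0.2, 0.3])   # axis-angle rotation
tvec_gt = np.array([0.05, 0.0, 0.6])   # translation in meters

# 3D model points (corners of a 10 cm cube) and their 2D projections,
# i.e. a set of 2D-3D correspondences.
object_points = (np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                          dtype=np.float64) - 0.5) * 0.1
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)
image_points = image_points.reshape(-1, 2)

# Recover the pose with RANSAC-based PnP; rvec/tvec give the object pose
# in the camera frame, and inliers indexes the consistent correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)   # 3x3 rotation matrix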
While most pose estimation methods, such as [2], [14],
assume a uni-modal pose distribution and handle symmetries
explicitly to better justify this assumption, methods
such as [5] allow multi-modal pose distributions and
handle ambiguities like symmetry implicitly.
All of the previously mentioned methods focus on estimating
the single most probable pose given the image.
There also exist methods that estimate entire pose distributions
[17], [18].
The above image-based pose estimation methods are all
single-view, and thus suffer from the aforementioned single-
view ambiguities. Several pose estimation methods have
been proposed to aggregate information from multiple views.
These can roughly be divided into two categories,
depending on whether the aggregation is done as high-level
pose refinement or low-level image feature aggregation.
High-level view aggregation has e.g. been done using
object-level SLAM, which simultaneously refines the poses
of multiple objects together with the multi-view camera extrinsics
[19]; pose voting, which increases the reliability of pose
estimates through a voting scheme [20]; or pose refinement
that minimizes a loss based on the similarity between observed
and rendered correspondence features across views [11],
[21]. Most of these methods assume that the object symmetries
are provided, such that the pose ambiguities caused by object
symmetry can be explicitly accounted for. This has e.g. been
done by explicitly incorporating symmetries in the objective
of a scene graph optimization [2], or by using symmetries
together with estimated extrinsics to explicitly add priors
on the object pose [22]. Methods that estimate full pose
distributions [17], [18] are particularly well suited for pose-
level multi-view aggregation [23]. However, these methods
have yet to demonstrate state-of-the-art performance on established
benchmarks.
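To illustrate what pose-level aggregation can look like, the sketch below fuses hypothetical per-view log-likelihoods over a shared, discrete set of candidate rotations under a conditional-independence assumption; the candidate grid and per-view scores are random placeholders, and the distribution-based methods above use their own parameterizations.

import numpy as np
from scipy.spatial.transform import Rotation

# A shared, discrete set of candidate object rotations in the world frame
# (random here, purely for illustration).
candidates = Rotation.random(4096, random_state=0)

# Hypothetical per-view log-likelihoods over the candidate set, e.g. from
# evaluating each view's estimated pose distribution at every candidate.
log_p_views = [np.random.randn(len(candidates)) for _ in range(3)]

# Assuming views are conditionally independent given the object pose,
# fusion reduces to summing log-likelihoods and renormalizing.
fused = np.sum(log_p_views, axis=0)
fused -= fused.max()                               # numerical stability
posterior = np.exp(fused) / np.exp(fused).sum()
R_best = candidates[int(np.argmax(posterior))].as_matrix()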
Low-level aggregation of image data from multiple views
has been done by using DenseFusion [24] to aggregate
learned visual and geometric features from multiple
views [25], by estimating centroids and image curvatures for
use in a multi-view render-and-compare scheme [4], or by
formulating pose refinement as an optimization problem with
an objective based on the object shape reprojection error in