features are transformed into an intermediate representation [34,32,47,35,15], which is then used to extract the pose (using PnP [25] or variations thereof [40,15]), or the pose is extracted directly [7,47,9,24]. Assuming that the information necessary to extract the pose is already produced by step 2), the pose extraction step is the key to fast and accurate estimation.
Prior art has proposed several intermediate representations,
e.g. NOCS or keypoint heatmaps [48,35,34,51]. These representations cover the full input crop and are therefore computed at every pixel of a large spatial map, regardless of how much of the image the object actually occupies, resulting in many unnecessary and expensive computations. Moreover, some require an additional, slow RANSAC-PnP step. Other methods propose to learn the PnP operation directly [47,7,9]; while shown to be faster and more precise, they introduce additional complexity into the model without removing the information-less regions from the computational pipeline.
On top of this, an application might choose to refine the predicted pose. The most commonly used methods rely on a costly iterative render-and-compare process [28,51,24,46], making them unsuitable for real-time applications. Such ad-hoc refinement methods require large models, designed and trained solely for refinement, leaving the initial pose estimation to other methods [24,28,51]. More recently, specifically designed approaches trade off runtime against initialization quality: RePose [20] is fast but requires a good initialization, while SurfEmb [12] is very precise and robust to occlusions and symmetries but extremely slow at inference time.
In this paper, we introduce a novel method that removes
redundant computations around areas where the object is
not present, while oversampling the image regions where
it is. We achieve this by using a simple yet effective intermediate offset representation: Object Surface Keypoint Features (OSKFs). Given an initial coarse pose, we project predetermined object surface keypoints into the image plane. We generate OSKFs by sampling the extracted feature pyramid at each keypoint's current 2D location. Since the initial pose is not guaranteed to be precise, we use deformable attention to guide the sampling around each original 2D location, overcoming possible errors in the coarse pose. Concretely, we propose OSKF-PoseTransformers (OSKF-PT), a
transformer module with deformable attention mechanisms
[53], in which self-attention and cross-attention operations are performed over the set of OSKFs, outputting an improved pose. Since OSKFs are computationally inexpensive to generate, we chain multiple OSKF-PTs together in a novel Cascaded Pose Refinement (CPF) module that iteratively refines the pose and can be trained end-to-end.
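The OSKF sampling step can be sketched as follows. This is a minimal NumPy illustration with toy intrinsics; nearest-neighbour lookup stands in for the learned deformable-attention sampling, and all names are illustrative, not the paper's actual implementation:

```python
import numpy as np

def project_keypoints(pts_3d, R, t, K):
    """Project Nx3 object-frame keypoints into the image under pose (R, t)."""
    cam = pts_3d @ R.T + t              # object frame -> camera frame
    uv_h = cam @ K.T                    # apply pinhole intrinsics
    return uv_h[:, :2] / uv_h[:, 2:3]   # perspective divide -> Nx2 pixels

def sample_features(feat, uv):
    """Sample a CxHxW feature map at Nx2 pixel locations (nearest neighbour;
    the actual model samples around these points with deformable attention)."""
    c, h, w = feat.shape
    x = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return feat[:, y, x].T              # NxC keypoint features (the OSKFs)

# Toy setup: 3 surface keypoints, identity rotation, object 1 m from camera.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.05, 0.0]])
uv = project_keypoints(pts, np.eye(3), np.array([0.0, 0.0, 1.0]), K)
feats = sample_features(np.random.rand(16, 128, 128), uv)
print(uv[0], feats.shape)   # first keypoint projects to (64, 64); feats is 3x16
```

Note that only the N keypoint locations are sampled, which is why the representation stays cheap regardless of the crop's spatial resolution.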
In summary, this paper’s contributions are:
• We propose Object Surface Keypoint Features (OSKFs), a lightweight intermediate 6D pose-offset representation that is significantly less noisy and ignores uninformative regions of the feature maps, resulting in more accurate pose estimates than prior art while being considerably cheaper to generate than dense intermediate pose representations.
• We propose the OSKF-PoseTransformer (OSKF-PT), a module that utilizes a chain of self-attention and deformable-attention layers to iteratively update an initial pose guess. Due to the lightweight nature of OSKFs, our refinement is faster than any prior refinement method, taking less than 3 ms per iteration.
• We introduce CRT-6D, a fast end-to-end 6D pose estimation model that leverages cascaded iterative refinement over a chain of OSKF-PTs to achieve state-of-the-art accuracy among real-time 6D pose estimators on two challenging datasets, while running twice as fast as the fastest prior methods.
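Schematically, the cascaded refinement amounts to repeatedly applying a lightweight pose-update module. The sketch below uses a toy stand-in for the learned OSKF-PT stage (here each stage is simply assumed to remove half of the translation error; the real module predicts the update from sampled OSKFs):

```python
import numpy as np

TRUE_T = np.array([0.0, 0.0, 1.0])   # ground-truth translation (toy value)

def refine_stage(pose):
    """Toy stand-in for one OSKF-PT stage: the real module samples OSKFs at
    keypoints projected with the current pose and predicts a pose update;
    here we simply assume each stage removes half of the translation error."""
    R, t = pose
    return R, t + 0.5 * (TRUE_T - t)

pose = (np.eye(3), np.array([0.2, -0.1, 1.4]))   # coarse initial pose
for _ in range(3):                               # cascade of 3 refinement stages
    pose = refine_stage(pose)
print(np.linalg.norm(pose[1] - TRUE_T))          # residual error shrank by 8x
```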
2. Literature Review
Keypoint Detection. Object pose estimation can be seen as the inverse of camera pose estimation: the 6D pose can be extracted by solving the PnP problem, provided we can detect the pixel positions of a set of keypoints, which yields the necessary 2D-3D correspondence set. Early works chose the corners of the object's 3D bounding box as keypoints [43,35,30]. However, the projected 3D bounding-box keypoints usually lie outside the silhouette of the object, which potentially reduces the local information extraction. This shortfall was noticed by PVNet [34], which suggests finding suitable keypoints on the object's surface.
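To make the 2D-3D correspondence set concrete, here is a minimal NumPy sketch that projects hypothetical 3D bounding-box corner keypoints under a known pose; a PnP solver (e.g. OpenCV's `cv2.solvePnP`) would invert this projection given the detected 2D positions. All numbers are toy values:

```python
import numpy as np

# Hypothetical 3D keypoints: the 8 corners of a 10 cm object bounding box
# (the early works' choice; PVNet instead picks keypoints on the surface).
corners = np.array([[sx, sy, sz] for sx in (-0.05, 0.05)
                                 for sy in (-0.05, 0.05)
                                 for sz in (-0.05, 0.05)])

K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])   # ground-truth pose (toy)

cam = corners @ R.T + t                       # corners in camera frame
uv_h = cam @ K.T
uv = uv_h[:, :2] / uv_h[:, 2:3]               # "detected" 2D keypoints

# (corners[i], uv[i]) pairs form the 2D-3D correspondence set; a PnP solver
# such as cv2.solvePnP(corners, uv, K, None) recovers (R, t) from it.
print(uv.shape)   # (8, 2)
```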
Dense Object Coordinate Estimation. Instead of pre-selecting a few keypoints, NOCS [48] proposed that every pixel in the silhouette of the object be used to estimate the coordinate of the object surface point (in normalized space) projected at that pixel. In other words, every point on the surface becomes a keypoint that can be used in the 2D-3D correspondence set to solve PnP. Inspired by NOCS, Pix2Pose [32] proposed the use of a GAN to address occlusions. DPOD [51,15,47,40] suggested using UV maps and object regions instead of a 3D coordinate system, ensuring that every estimated point lies on the object's surface. Each of these methods employs an increasingly complex model, and while accuracy has improved with each one, runtime has been overlooked.
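The idea behind normalized object coordinates can be sketched as follows, assuming a simple center-and-scale normalization into a unit cube (the exact normalization varies per method; names here are illustrative):

```python
import numpy as np

def to_nocs(vertices):
    """Map Nx3 model vertices into a normalized object coordinate space:
    center the model and scale by its longest side, so every surface point
    gets a canonical coordinate in [0, 1]^3 that a network can regress
    densely, one value per pixel inside the object silhouette."""
    center = (vertices.min(axis=0) + vertices.max(axis=0)) / 2.0
    scale = (vertices.max(axis=0) - vertices.min(axis=0)).max()
    return (vertices - center) / scale + 0.5

verts = np.array([[0.0, 0.0, 0.0], [0.2, 0.1, 0.05], [0.1, 0.05, 0.02]])
nocs = to_nocs(verts)
print(nocs.min(), nocs.max())   # all normalized coordinates lie in [0, 1]
```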
Direct Pose Estimation. PoseNet [22] proposed learning quaternions to predict rotation in camera pose estimation tasks. In the field of 6D object pose estimation, PoseCNN [50] used a Lie algebra representation instead. SSD-6D [21] discretized the viewpoint space and learned to classify it, while using the mask to regress the distance to the camera.
to regress the distance to the camera. These methods are