CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement
Transformers
Pedro Castro
Imperial College London
p.castro18@imperial.ac.uk
Tae-Kyun Kim
Imperial College London, KAIST
tk.kim@imperial.ac.uk
Abstract
Learning based 6D object pose estimation methods
rely on computing large intermediate pose representations
and/or iteratively refining an initial estimation with a slow
render-compare pipeline. This paper introduces a novel
method we call Cascaded Pose Refinement Transformers,
or CRT-6D. We replace the commonly used dense interme-
diate representation with a sparse set of features sampled
from the feature pyramid, which we call OSKFs (Object
Surface Keypoint Features), where each element corresponds to an
object keypoint. We employ lightweight deformable trans-
formers and chain them together to iteratively refine pro-
posed poses over the sampled OSKFs. We achieve infer-
ence runtimes 2× faster than the closest real-time state-of-
the-art methods while supporting up to 21 objects in a sin-
gle model. We demonstrate the effectiveness of CRT-6D by
performing extensive experiments on the LM-O and YCB-
V datasets. Compared to real-time methods, we achieve
state of the art on LM-O and YCB-V, falling slightly be-
hind methods with inference runtimes one order of mag-
nitude higher. The source code is available at:
https://github.com/PedroCastro/CRT-6D
1. Introduction
Estimating the 6D pose of objects given an RGB image
remains a challenging computer vision task, yet it is indispens-
able in many real-world applications, from autonomous ve-
hicle perception and robotics to augmented reality. This
task entails the retrieval of a target object’s 3D rotation and
translation, relative to a camera, by overcoming difficult is-
sues such as occlusion, illumination and symmetries. Depth
information can be used to great effect when available [21, 28],
while monocular methods tend to underperform due to lack
of information.
Recent methods utilizing Convolutional Neural Net-
works (CNNs) have surpassed prior classical approaches
and are at the core of most recent state-of-the-art 6D ob-
[Figure 1 diagram: backbone features are sampled into OSKFs at the
current pose Pt-1; a transformer pose refiner (self-attention and
deformable cross-attention) predicts an update ΔPt, which is added
to Pt-1 to give Pt in a cascaded refinement loop.]
Figure 1: Illustrative diagram of CRT-6D. CRT-6D removes
the decoder and pose representation used by pose estimation
methods, and the renderer and refinement model used by
standard refinement pipelines. Instead, CRT-6D replaces them
with a deformable-attention-based refinement module, achieving
pose estimation and refinement within the same model. Each
refinement iteration takes less than 3 ms, making CRT-6D 2×
faster than prior real-time methods while simultaneously
achieving better accuracy.
ject pose estimators [41,6,34,39,35,47,9,24,40,32,3].
The computation pipeline of these methods can be roughly
defined by 3 steps: 1.) The object is detected in the im-
age (this is usually done using an off-the-shelf object detector
[37, 36]); 2.) Features are extracted from a cropped im-
age, around the 2D area containing the object, using an es-
tablished pre-trained CNN architecture [13, 42]; 3.) These
features are transformed into an intermediate representation
[34, 32, 47, 35, 15], which is then used to extract the pose (us-
ing PnP [25] or other variants [40, 15]), or the pose is extracted
directly [7, 47, 9, 24]. Assuming that step 2.) extracts the
information necessary to recover the pose, the pose ex-
traction step is the key to a fast and accurate estimation.

arXiv:2210.11718v1 [cs.CV] 21 Oct 2022
Prior art has proposed several intermediate representations,
e.g. NOCS, keypoint heatmaps [48, 35, 34, 51]. These rep-
resentations cover the full input crop and are therefore com-
puted at every pixel of a sizeable spatial dimension, re-
gardless of how much of the image the object occupies,
resulting in a large amount of unnecessary and expensive
computation. Moreover, some require an additional slow
RANSAC PnP step. Other methods propose to directly
learn the PnP operation [47, 7, 9]; while these are shown
to be faster and more precise, they introduce more complexity
into the model without removing the information-less re-
gions from the computational pipeline.
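The three-step pipeline described above can be sketched as follows. This is a minimal illustration only: all function names are hypothetical stubs standing in for the detector, backbone, and pose head, not any particular method's implementation.

```python
import numpy as np

def detect_object(image):
    # Step 1: an off-the-shelf detector would return a 2D bounding box;
    # stubbed here as a fixed (x, y, w, h) box.
    return (40, 30, 128, 128)

def extract_features(image, box):
    # Step 2: crop around the detected box and run a pre-trained CNN.
    # Stubbed as a stride-4 downsampled crop standing in for a feature map.
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    return crop[::4, ::4]

def estimate_pose(features):
    # Step 3: predict an intermediate representation and recover the 6D
    # pose (e.g. via PnP); stubbed as identity rotation, 1 m depth.
    R, t = np.eye(3), np.array([0.0, 0.0, 1.0])
    return R, t

image = np.zeros((240, 320, 3), dtype=np.float32)
box = detect_object(image)
features = extract_features(image, box)
R, t = estimate_pose(features)
```

The paper's argument is that step 3 dominates the accuracy/speed trade-off once step 2 has done its job.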
On top of these steps, an application might choose to refine
the predicted, possibly erroneous, pose. The most commonly used
methods rely on a costly render-compare iterative process
[28,51,24,46], making them unsuitable for real-time ap-
plications. Ad-hoc refinement methods require large mod-
els designed and trained only for refinement, leaving
the initial pose estimation to other meth-
ods [24, 28, 51]. More recently, specifically designed ap-
proaches perform a trade-off between runtime and initial-
ization: RePose [20], while fast, requires a good initialization,
and SurfEmb [12] is very precise and robust to occlusions
and symmetries but is extremely slow at inference time.
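The render-compare loop these refiners share can be sketched as below. The renderer and the update network are hypothetical stubs standing in for the heavy components that make this approach slow; the loop structure is the point.

```python
import numpy as np

def render(pose):
    # A real refiner rasterizes the object model at `pose`; stubbed as
    # a synthetic image parameterized by the pose vector.
    return np.full((64, 64), pose.sum())

def predict_update(rendered, observed):
    # A learned network would regress a pose update from the image
    # pair; stubbed as a small step that reduces the mean difference.
    return -0.01 * np.sign(rendered.mean() - observed.mean()) * np.ones(6)

observed = np.full((64, 64), 1.2)  # stand-in for the input crop
pose = np.zeros(6)                 # initial 6D pose estimate
for _ in range(4):                 # each iteration pays for a full render
    rendered = render(pose)
    pose = pose + predict_update(rendered, observed)
```

Because every iteration requires a full render plus a network pass, the cost scales linearly with the number of refinement steps, which is what CRT-6D avoids.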
In this paper, we introduce a novel method that removes
redundant computations around areas where the object is
not present, while oversampling the image regions where
it is. We achieve this by using a simple yet effective inter-
mediate offset representation: Object Surface Keypoint Fea-
tures (OSKFs). Given an initial coarse pose, we project pre-
determined object surface keypoints into the image plane.
We generate OSKFs by sampling the extracted feature pyra-
mid at each keypoint's current 2D location. Given that the ini-
tial pose is not guaranteed to be precise, we use deformable
attention to guide our sampling around the original 2D loca-
tion, overcoming possible errors in the coarse pose. There-
fore, we propose OSKF-PoseTransformers (OSKF-PT), a
transformer module with deformable attention mechanisms
[53], where self-attention and cross-attention operations are
performed over the OSKFs set, outputting an improved
pose. Since OSKFs are an inexpensive representation in
terms of computation, we chain together multiple OSKF-
PT in a novel Cascaded Pose Refinement (CPR) module to
iteratively refine the pose in a cascaded fashion, which can
be trained end-to-end.
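The OSKF sampling step can be sketched as follows, assuming a pinhole camera and nearest-neighbour lookup for brevity (the method itself uses deformable attention to sample around the projected locations; all names, shapes, and the stride value here are illustrative assumptions).

```python
import numpy as np

def sample_oskfs(feat, K, R, t, keypoints_3d, stride=8):
    """feat: (H, W, C) feature map at the given stride; K: 3x3 intrinsics."""
    cam = keypoints_3d @ R.T + t               # object frame -> camera frame
    uv = cam @ K.T                             # homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:3]                # perspective divide
    grid = np.round(uv / stride).astype(int)   # pixel -> feature-map cell
    grid[:, 0] = np.clip(grid[:, 0], 0, feat.shape[1] - 1)
    grid[:, 1] = np.clip(grid[:, 1], 0, feat.shape[0] - 1)
    return feat[grid[:, 1], grid[:, 0]]        # one C-dim feature per keypoint

K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
feat = np.random.rand(30, 40, 64)               # stride-8 map for a 320x240 crop
kps = np.random.uniform(-0.05, 0.05, (16, 3))   # 16 surface keypoints (metres)
oskfs = sample_oskfs(feat, K, np.eye(3), np.array([0.0, 0.0, 0.5]), kps)
```

The key property is that the cost is proportional to the number of keypoints, not to the spatial size of the feature map.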
In summary, this paper’s contributions are:
• We propose Object Surface Keypoint Features
(OSKFs), a lightweight intermediate 6D pose offset rep-
resentation that is significantly less noisy, ignores
unusable information in the feature maps, yields
more accurate pose estimates than prior
art, and is considerably cheaper to generate than dense
intermediate pose representations.
• We propose the OSKF-PoseTransformer (OSKF-PT), a
module that utilizes a chain of self-attention and
deformable-attention layers to iteratively update an ini-
tial pose guess. Due to the lightweight nature of OS-
KFs, our refinement is faster than any prior refinement
method, taking less than 3 ms per iteration.
• We introduce CRT-6D, a fast end-to-end 6D pose es-
timation model that leverages cascaded iterative re-
finement over a chain of OSKF-PTs to achieve state-of-
the-art accuracy among real-time 6D pose estimators on
two challenging datasets, with an inference time
2× faster than the fastest prior methods.
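The cascaded update rule, where each stage adds a predicted offset ΔPt to the previous estimate Pt-1, can be sketched with stub refiners standing in for the trained OSKF-PT stages (the stage behaviour and toy target below are fabricated for illustration).

```python
import numpy as np

def make_stage(target):
    # Stand-in for one trained OSKF-PT stage: it moves the estimate
    # halfway toward `target` (a real stage regresses ΔPt from OSKFs
    # sampled at the current pose).
    def stage(pose):
        return 0.5 * (target - pose)  # predicted offset ΔPt
    return stage

target = np.array([0.1, -0.2, 0.05, 0.0, 0.0, 0.6])  # toy ground-truth pose
stages = [make_stage(target) for _ in range(3)]      # cascade of 3 refiners

pose = np.zeros(6)          # coarse initial estimate P0
for stage in stages:        # Pt = Pt-1 + ΔPt
    pose = pose + stage(pose)
```

Because each stage resamples features at the improved pose, errors shrink geometrically through the cascade in this toy setting.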
2. Literature Review
Keypoint Detection. Object pose estimation can be seen
as the inverse of camera pose estimation: one can extract
the 6D pose by solving the PnP problem, provided we can
detect the pixel positions of keypoints to create the neces-
sary 2D-3D correspondence set. Early works started by
choosing the corners of the objects' 3D bounding boxes as keypoints
[43, 35, 30]. However, the projected 3D bounding box
keypoints usually lie outside the silhouette of the object,
which potentially reduces the local information extraction.
This shortfall was noticed by PVNet [34], which suggests
the use of the surface region to find suitable keypoints.
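The 2D-3D correspondence idea can be illustrated with a minimal DLT (direct linear transform) solve on synthetic correspondences; production systems would instead call a PnP solver, typically with RANSAC (e.g. OpenCV's solvePnPRansac). All data below is fabricated.

```python
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.5])

pts3d = rng.uniform(-0.05, 0.05, (8, 3))   # 3D object keypoints
cam = pts3d @ R.T + t                      # camera-frame coordinates
uv = cam @ K.T
uv = uv[:, :2] / uv[:, 2:3]                # "detected" pixel positions

# DLT: each correspondence gives two linear equations in the 12
# entries of the 3x4 projection matrix P.
A = []
for (X, Y, Z), (u, v) in zip(pts3d, uv):
    A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
    A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
P = np.linalg.svd(np.asarray(A))[2][-1].reshape(3, 4)  # null-space solution

proj = np.c_[pts3d, np.ones(8)] @ P.T
proj = proj[:, :2] / proj[:, 2:3]
err = np.abs(proj - uv).max()  # reprojection error of the recovered P
```

With noise-free correspondences the recovered projection matrix reprojects the keypoints essentially exactly; noisy detections are why robust RANSAC-based PnP is the slow step the paper refers to.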
Dense Object Coordinate Estimation. Instead of pre-
selecting a few keypoints, NOCS [48] was proposed, where
every pixel in the silhouette of the object is used to esti-
mate the coordinate of the object surface point (in normal-
ized space) projected at that pixel. In other words, every
point on the surface becomes a keypoint and can be
used in the 2D-3D correspondence set to solve PnP. In-
spired by NOCS, Pix2Pose [32] proposed the use of a GAN
to address occlusion. DPOD [51, 15, 47, 40] sug-
gested using UV maps and object regions instead of a 3D
coordinate system, ensuring that every estimated point lies
within the object's surface. Each of these methods has an
increasingly more complex model, and while performance
has been improved by each method, runtime has been over-
looked.
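How a dense coordinate map yields correspondences can be sketched as follows: every foreground pixel stores a predicted object-surface coordinate, so the pixel index and the stored value together form one 2D-3D pair. Shapes, the mask region, and the per-pixel predictions below are fabricated for illustration.

```python
import numpy as np

H, W = 60, 80
coord_map = np.zeros((H, W, 3))                  # predicted surface coordinates
mask = np.zeros((H, W), dtype=bool)              # object silhouette
mask[20:40, 30:50] = True
coord_map[mask] = np.random.rand(mask.sum(), 3)  # fake dense predictions

ys, xs = np.nonzero(mask)
pts2d = np.stack([xs, ys], axis=1)  # pixel positions (u, v)
pts3d = coord_map[ys, xs]           # matching predicted surface points
# (pts2d, pts3d) is the correspondence set fed to RANSAC PnP.
```

This illustrates the cost argument in the text: the number of correspondences, and hence the representation cost, grows with the silhouette area rather than with a fixed keypoint budget.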
Direct Pose Estimation. PoseNet [22] proposed learning
quaternions to predict rotation in camera pose estimation
tasks. In the 6D object pose estimation field, PoseCNN [50]
used a Lie algebra representation instead. SSD-6D [21] discretized the view-
point space and learned to classify it, while using the mask
to regress the distance to the camera. These methods are