features are transformed into an intermediate representation [34,32,47,35,15], which is then used to extract the pose (using PnP [25] or variations thereof [40,15]), or the pose is extracted directly [7,47,9,24]. Assuming that the information necessary to extract the pose is already produced by step 2), the pose extraction step is the key to fast and accurate estimation.
Prior art has proposed several intermediate representations,
e.g. NOCS or keypoint heatmaps [48,35,34,51]. These representations cover the full input crop and are therefore computed at every pixel of a large spatial map, regardless of how much of the image the object actually occupies, resulting in many unnecessary and expensive computations. Moreover, some require an additional, slow RANSAC-PnP step. Other methods propose to learn the PnP operation directly [47,7,9]; while shown to be faster and more precise, they introduce additional complexity into the model without removing the information-less regions from the computational pipeline.
On top of this, an application might choose to refine the predicted pose. The most commonly used methods rely on a costly iterative render-and-compare process [28,51,24,46], making them unsuitable for real-time applications. Such ad-hoc refinement methods require large models, designed and trained solely for refinement, leaving the initial pose estimation to other methods [24,28,51]. More recently, specifically designed approaches trade off runtime against initialization quality: RePose [20] is fast but requires a good initialization, while SurfEmb [12] is very precise and robust to occlusions and symmetries but extremely slow at inference time.
In this paper, we introduce a novel method that removes
redundant computations around areas where the object is
not present, while oversampling the image regions where
it is. We achieve this by using a simple yet effective intermediate offset representation: Object Surface Keypoint Features (OSKFs). Given an initial coarse pose, we project predetermined object surface keypoints into the image plane. We generate OSKFs by sampling the extracted feature pyramid at each keypoint's current 2D location. Since the initial pose is not guaranteed to be precise, we use deformable attention to guide the sampling around each original 2D location, overcoming possible errors in the coarse pose. Concretely, we propose OSKF-PoseTransformers (OSKF-PT), a
transformer module with deformable attention mechanisms
[53], in which self-attention and cross-attention operations are performed over the set of OSKFs, outputting an improved pose. Since OSKFs are computationally inexpensive to generate, we chain multiple OSKF-PTs together in a novel Cascaded Pose Refinement (CPF) module that iteratively refines the pose and can be trained end-to-end.
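The OSKF sampling step can be sketched as follows. This is a minimal NumPy illustration with toy intrinsics; nearest-neighbour lookup stands in for the learned deformable-attention sampling, and all names are illustrative, not the paper's actual implementation:

```python
import numpy as np

def project_keypoints(pts_3d, R, t, K):
    """Project Nx3 object-frame keypoints into the image under pose (R, t)."""
    cam = pts_3d @ R.T + t              # object frame -> camera frame
    uv_h = cam @ K.T                    # apply pinhole intrinsics
    return uv_h[:, :2] / uv_h[:, 2:3]   # perspective divide -> Nx2 pixels

def sample_features(feat, uv):
    """Sample a CxHxW feature map at Nx2 pixel locations (nearest neighbour;
    the actual model samples around these points with deformable attention)."""
    c, h, w = feat.shape
    x = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    y = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    return feat[:, y, x].T              # NxC keypoint features (the OSKFs)

# Toy setup: 3 surface keypoints, identity rotation, object 1 m from camera.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.0, 0.05, 0.0]])
uv = project_keypoints(pts, np.eye(3), np.array([0.0, 0.0, 1.0]), K)
feats = sample_features(np.random.rand(16, 128, 128), uv)
print(uv[0], feats.shape)   # first keypoint projects to (64, 64); feats is 3x16
```

Note that only the N keypoint locations are sampled, which is why the representation stays cheap regardless of the crop's spatial resolution.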
In summary, this paper’s contributions are:
• We propose Object Surface Keypoint Features (OSKFs), a lightweight intermediate 6D pose-offset representation that is significantly less noisy and ignores uninformative regions of the feature maps, resulting in more accurate pose estimates than prior art while being considerably cheaper to generate than dense intermediate pose representations.
• We propose the OSKF-PoseTransformer (OSKF-PT), a module that utilizes a chain of self-attention and deformable-attention layers to iteratively update an initial pose guess. Due to the lightweight nature of OSKFs, our refinement is faster than any prior refinement method, taking less than 3 ms per iteration.
• We introduce CRT-6D, a fast end-to-end 6D pose estimation model that leverages cascaded iterative refinement over a chain of OSKF-PTs to achieve state-of-the-art accuracy among real-time 6D pose estimators on two challenging datasets, while running twice as fast as the fastest prior methods.
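Schematically, the cascaded refinement amounts to repeatedly applying a lightweight pose-update module. The sketch below uses a toy stand-in for the learned OSKF-PT stage (here each stage is simply assumed to remove half of the translation error; the real module predicts the update from sampled OSKFs):

```python
import numpy as np

TRUE_T = np.array([0.0, 0.0, 1.0])   # ground-truth translation (toy value)

def refine_stage(pose):
    """Toy stand-in for one OSKF-PT stage: the real module samples OSKFs at
    keypoints projected with the current pose and predicts a pose update;
    here we simply assume each stage removes half of the translation error."""
    R, t = pose
    return R, t + 0.5 * (TRUE_T - t)

pose = (np.eye(3), np.array([0.2, -0.1, 1.4]))   # coarse initial pose
for _ in range(3):                               # cascade of 3 refinement stages
    pose = refine_stage(pose)
print(np.linalg.norm(pose[1] - TRUE_T))          # residual error shrank by 8x
```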
2. Literature Review
Keypoint Detection. Object pose estimation can be seen as the inverse of camera pose estimation: the 6D pose can be extracted by solving the PnP problem, provided we can detect the pixel positions of a set of keypoints, which yields the necessary 2D-3D correspondence set. Early works chose the corners of the object's 3D bounding box as keypoints [43,35,30]. However, the projected 3D bounding-box keypoints usually lie outside the silhouette of the object, which potentially reduces the local information extraction. This shortfall was noticed by PVNet [34], which suggests finding suitable keypoints on the object's surface.
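To make the 2D-3D correspondence set concrete, here is a minimal NumPy sketch that projects hypothetical 3D bounding-box corner keypoints under a known pose; a PnP solver (e.g. OpenCV's `cv2.solvePnP`) would invert this projection given the detected 2D positions. All numbers are toy values:

```python
import numpy as np

# Hypothetical 3D keypoints: the 8 corners of a 10 cm object bounding box
# (the early works' choice; PVNet instead picks keypoints on the surface).
corners = np.array([[sx, sy, sz] for sx in (-0.05, 0.05)
                                 for sy in (-0.05, 0.05)
                                 for sz in (-0.05, 0.05)])

K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])   # ground-truth pose (toy)

cam = corners @ R.T + t                       # corners in camera frame
uv_h = cam @ K.T
uv = uv_h[:, :2] / uv_h[:, 2:3]               # "detected" 2D keypoints

# (corners[i], uv[i]) pairs form the 2D-3D correspondence set; a PnP solver
# such as cv2.solvePnP(corners, uv, K, None) recovers (R, t) from it.
print(uv.shape)   # (8, 2)
```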
Dense Object Coordinate Estimation. Instead of pre-selecting a few keypoints, NOCS [48] proposed that every pixel in the silhouette of the object be used to estimate the coordinate of the object surface point (in normalized space) projected at that pixel. In other words, every point on the surface becomes a keypoint that can be used in the 2D-3D correspondence set to solve PnP. Inspired by NOCS, Pix2Pose [32] proposed the use of a GAN to address occlusions. DPOD [51,15,47,40] suggested using UV maps and object regions instead of a 3D coordinate system, ensuring that every estimated point lies on the object's surface. Each of these methods employs an increasingly complex model, and while accuracy has improved with each one, runtime has been overlooked.
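The idea behind normalized object coordinates can be sketched as follows, assuming a simple center-and-scale normalization into a unit cube (the exact normalization varies per method; names here are illustrative):

```python
import numpy as np

def to_nocs(vertices):
    """Map Nx3 model vertices into a normalized object coordinate space:
    center the model and scale by its longest side, so every surface point
    gets a canonical coordinate in [0, 1]^3 that a network can regress
    densely, one value per pixel inside the object silhouette."""
    center = (vertices.min(axis=0) + vertices.max(axis=0)) / 2.0
    scale = (vertices.max(axis=0) - vertices.min(axis=0)).max()
    return (vertices - center) / scale + 0.5

verts = np.array([[0.0, 0.0, 0.0], [0.2, 0.1, 0.05], [0.1, 0.05, 0.02]])
nocs = to_nocs(verts)
print(nocs.min(), nocs.max())   # all normalized coordinates lie in [0, 1]
```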
Direct Pose Estimation. PoseNet [22] proposed learning quaternions to predict rotation in camera pose estimation tasks. In the field of 6D object pose estimation, PoseCNN [50] used a Lie algebra representation instead. SSD-6D [21] discretized the viewpoint space and learned to classify it, while using the mask to regress the distance to the camera.
to regress the distance to the camera. These methods are