Motivated by this, we introduce a video-based 6D object pose estimation framework that uses past estimates to constrain the 6D pose in the current frame. Specifically, we leverage the popular Transformer architecture [50, 42] with causal masked attention, where each input frame is only allowed to attend to frames that precede it. We train the model to jointly predict the 6D poses while also learning to predict future features that match the true features. Such a setup has been employed in [15], which shows that predicting future features is an effective self-supervised pretext task for learning visual representations.
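To make this concrete, the following is a minimal PyTorch sketch of causal masked temporal attention with the auxiliary future-feature prediction loss. All tensor names, dimensions, layer counts, and the 7-dimensional pose parameterisation (quaternion plus translation) are illustrative assumptions, not the exact implementation of our method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D = 8, 256                                   # frames per clip, feature dimension (assumed)
frame_feats = torch.randn(1, T, D)              # per-frame features from a visual backbone

encoder_layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Causal mask: True entries are blocked, so frame t attends only to frames <= t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
hidden = temporal_encoder(frame_feats, mask=causal_mask)

pose_head = nn.Linear(D, 7)                     # e.g. quaternion (4) + translation (3)
feat_head = nn.Linear(D, D)                     # predicts the next frame's features

poses = pose_head(hidden)                       # one pose estimate per frame
pred_next = feat_head(hidden[:, :-1])           # predictions for frames 1..T-1

gt_poses = torch.randn(1, T, 7)                 # placeholder ground-truth poses
pose_loss = F.mse_loss(poses, gt_poses)         # stand-in for the actual pose loss
future_loss = F.mse_loss(pred_next, frame_feats[:, 1:].detach())
loss = pose_loss + future_loss                  # jointly optimised objective
```

Because the mask is causal, the predictive loss is well-posed: the representation of frame t never sees frame t+1, so matching the true future features is a genuine prediction task.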
While the temporal architecture described above can be applied on top of any visual feature encoder (as discussed in our ablations), we propose a purely transformer-based model that uses a Swin transformer [32] as the backbone. This enables our network to attend not only temporally across frames in the video, but also spatially within each frame.
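A minimal sketch of this per-frame backbone stage, using torchvision's swin_t purely as a stand-in (the exact Swin variant, input resolution, and feature pooling are assumptions for illustration):

```python
import torch
from torchvision.models import swin_t

backbone = swin_t(weights=None)                 # stand-in Swin variant
backbone.head = torch.nn.Identity()             # keep pooled features, drop the classifier

clip = torch.randn(8, 3, 224, 224)              # T frames treated as a batch
frame_feats = backbone(clip).unsqueeze(0)       # (1, T, 768), input to the temporal encoder
```

The Swin backbone supplies the spatial (within-frame) attention, while the causal encoder above supplies the temporal (across-frame) attention.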
In summary, the contributions of our paper are:
• We introduce a video-based 6D object pose estimation framework that is purely attention-based.
• We incorporate self-supervision via a predictive loss for learning better visual representations.
• We evaluate on the challenging YCB-Video dataset [56], where our algorithm achieves improved performance over state-of-the-art single-frame methods such as PoseCNN [56] and PoseRBPF [13], with real-time performance at 33 fps, and over the transformer-based method T6D-Direct [1].
2 Related Work
Estimating the 6-DOF pose of objects in a scene is a widely studied task. Classical methods use either template-based or feature-based approaches. In template-based methods, a template is matched to different locations in the image and a similarity score is computed [18, 17]. However, these template-matching methods can fail to make predictions for textureless objects and in cluttered environments. In feature-based methods, local features are extracted, correspondences are established between known 3D objects and local 2D features, and PnP is used to recover the 6D pose [43, 39] (see the sketch below). However, these methods also require sufficient texture on the object to compute local features, and they face difficulty in generalising to new environments as they are often trained on small datasets.
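For reference, a minimal sketch of this classical feature-based pipeline with OpenCV, using random placeholder correspondences and an assumed camera intrinsic matrix:

```python
import cv2
import numpy as np

object_pts = np.random.rand(6, 3).astype(np.float32)   # 3D keypoints on the known model (placeholder)
image_pts = np.random.rand(6, 2).astype(np.float32)    # matched 2D features in the image (placeholder)
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)      # assumed camera intrinsics

# PnP recovers the rotation (as a Rodrigues vector) and the translation.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
R, _ = cv2.Rodrigues(rvec)                             # (R, tvec) is the 6D pose
```

The quality of the recovered pose hinges entirely on the 2D-3D correspondences, which is why these methods degrade on textureless objects.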
Convolutional Neural Networks (CNNs) have proven to be an effective tool in many computer vision tasks. However, they rely heavily on the availability of large-scale annotated datasets. To address this limitation, the YCB-Video [56], T-LESS [19], and OccludedLINEMOD [26, 40] datasets were introduced. They have enabled the emergence of novel network designs such as PoseCNN [56], DPOD [59], PVNet [40], and others [8, 52, 16, 10]. In this paper, we use the challenging YCB-Video dataset, as it is a popular testbed for many recent methods [1, 2, 13].
Building on those datasets, various CNN architectures have been introduced to learn effective representations of objects and to estimate accurate 6D poses. Kehl et al. [23] extend SSD [31] by adding an additional viewpoint classification branch to the network, whereas [41, 47] predict the 2D projections of estimated 3D bounding boxes. Other methods adopt a hybrid approach where the model learns to perform multiple tasks: e.g., Song et al. [45] enforce consistencies among keypoints, edges, and object symmetries, and Billings et al. [6] predict silhouettes of objects along with object poses. There is also a growing trend of designing model-agnostic features [53] that can handle novel objects. Finally, few-shot, one-shot, and category-level pose estimation have also seen increased interest recently [11, 54, 46].
To refine the predicted poses, several works use additional depth information and apply the standard ICP algorithm [56, 23], learn directly from RGB-D inputs [51, 30, 59], or rely on neural rendering [57, 22, 30, 33]. We argue that since the input signals to robots and/or mobile devices are typically video sequences, estimating poses in videos by exploiting the temporal data can already refine single-frame pose estimates, instead of relying heavily on post-processing refinement with additional depth information and rendering. Recently, several tracking algorithms have utilised videos to estimate object poses. A notable work by Deng et al. [13] introduces the PoseRBPF algorithm, which uses particle filters to track objects in video sequences. However, this state-of-the-art algorithm provides accurate estimates at a high computational cost. Wen et al. [55] also perform tracking, but use a synthetic rendering of the object at the previous time-step.