Video based Object 6D Pose Estimation using
Transformers
Apoorva Beedu
Georgia Institute of Technology
abeedu3@gatech.edu
Huda Alamri
Georgia Institute of Technology
halamri@gatech.edu
Irfan Essa
Georgia Institute of Technology
irfan@gatech.edu
Abstract
We introduce a Transformer based 6D Object Pose Estimation framework, VideoPose, comprising an end-to-end attention based modelling architecture that attends to previous frames in order to estimate accurate 6D Object Poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, while remaining computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason over long-range dependencies efficiently, iteratively refining pose estimates over the video sequence. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art Transformer methods and performs significantly better than CNN based approaches. Further, running at 33 fps, it is also efficient enough for a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://github.com/ApoorvaBeedu/VideoPose.
1 Introduction
Estimating the 3D translation and 3D rotation for every object in an image is a core building block for many applications in robotics [44, 48, 14] and augmented reality [35]. The classical solution for such 6-DOF pose estimation problems utilises a feature point matching mechanism, followed by Perspective-n-Point (PnP) to correct the estimated pose [41, 47, 40, 21]. However, such approaches fail when objects are texture-less or heavily occluded. Typical ways of refining the 6-DOF estimation involve using additional depth data [51, 7, 18, 25] or post-processing methods like Iterative Closest Point (ICP) or other deep learning based rendering methods [56, 23, 30, 45], which increase computational costs. Other approaches treat it as a classification problem [49, 23], resulting in reduced performance as the output space is not continuous.
In robotics, augmented reality, and mobile applications, the input signals are typically videos rather than single images, opening the opportunity for a multi-view framework. Li et al. [28] utilize multiple frames from different viewing angles to estimate single object poses. Wen et al. [55] and Deng et al. [13] use tracking methods to estimate the poses; however, these methods do not explicitly exploit the temporal information in the videos. The idea of using more than one frame to estimate object poses has seen limited exploration. As the object poses in a video sequence are implicitly related to camera transformations and do not change abruptly between frames, and as different viewpoints of the objects aid in the pose estimation [27, 12], we believe that modelling the temporal relationship can only aid in effective performance on the task.
Motivated by this, we introduce a video based 6D pose estimation framework that uses past estimations to bound the 6D pose in the current frame. Specifically, we leverage the popular Transformer architecture [50, 42] with causal masked attention, where each input frame is only allowed to attend to frames that precede it. We train the model to jointly predict the 6D poses while also learning to predict future features that match the true features. Such a setup has been employed in [15], which shows that predicting future features is an effective self-supervised pretext task for learning visual representations.
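To make the causal temporal modelling concrete, the sketch below builds a causal attention mask (so frame t only sees frames up to t) and computes a future-feature prediction loss against the true next-frame features. It is a minimal illustration under our own assumptions: a generic PyTorch encoder with a causal mask stands in for the paper's video decoder, and the layer sizes and names (`temporal_decoder`, `future_predictor`) are hypothetical.

```python
import torch
import torch.nn as nn

D = 768                                             # per-frame feature dimension (assumed)

# Temporal model with a causal mask: frame t may attend only to frames 0..t.
layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
temporal_decoder = nn.TransformerEncoder(layer, num_layers=4)

# Small MLP that predicts the next frame's feature from the causal context.
future_predictor = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

def forward_with_future_loss(frame_feats):
    """frame_feats: (B, T, D) visual features, one vector per frame."""
    T = frame_feats.size(1)
    # Additive mask: -inf above the diagonal blocks attention to future frames.
    causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    ctx = temporal_decoder(frame_feats, mask=causal_mask)        # (B, T, D) causal context
    # Predict the feature of frame t+1 from context up to frame t; supervise with the true feature.
    pred_next = future_predictor(ctx[:, :-1])
    future_loss = nn.functional.mse_loss(pred_next, frame_feats[:, 1:].detach())
    return ctx, future_loss

ctx, loss = forward_with_future_loss(torch.randn(2, 8, D))
print(ctx.shape, loss.item())
```

The causal context `ctx` is what a pose head would consume; the auxiliary future-feature loss would simply be added to the pose losses during training.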
While the temporal architecture described above can be applied on top of any visual feature encoder (as discussed in ablations), we propose a purely transformer-based model that uses a Swin transformer [32] as the backbone. This enables our network to not only attend temporally to frames in the video, but also spatially within the frame.
In summary, the contributions of our paper are:
• We introduce a video based 6D Object pose estimation framework that is purely attention based.
• We incorporate self supervision via a predictive loss for learning better visual representations.
• We perform evaluation on the challenging YCB-Video dataset [56], where our algorithm achieves improved performance over state-of-the-art single frame methods such as PoseCNN [56] and PoseRBPF [13] with real-time performance at 33 fps, and over transformer based methods such as T6D-Direct [1].
2 Related Work
Estimating the 6-DOF pose of objects in a scene is a widely studied task. The classical methods use either template-based or feature-based approaches. In template-based methods, a template is matched to different locations in the image, and a similarity score is computed [18, 17]. However, these template matching methods can fail to make predictions for textureless objects and in cluttered environments. In feature based methods, local features are extracted, and correspondence between known 3D objects and local 2D features is established using PnP to recover 6D poses [43, 39]. However, these methods also require sufficient texture on the object to compute local features, and face difficulty in generalising to new environments as they are often trained on small datasets.
Convolutional Neural Networks (CNNs) have proven to be an effective tool in many computer vision tasks. However, they rely heavily on the availability of large-scale annotated datasets. Due to this limitation, the YCB-Video [56], T-LESS [19], and OccludedLINEMOD [26, 40] datasets were introduced. They have enabled the emergence of novel network designs such as PoseCNN [56], DPOD [59], PVNet [40], and others [8, 52, 16, 10]. In this paper, we use the challenging YCB-Video dataset, as it is a popular dataset that serves as a testbed for many recent methods [1, 2, 13].
Building on those datasets, various CNN architectures have been introduced to learn effective representations of objects and to estimate accurate 6D poses. Kehl et al. [23] extend SSD [31] by adding an additional viewpoint classification branch to the network, whereas [41, 47] predict 2D projections from 3D bounding box estimations. Other methods involve a hybrid approach where the model learns to perform multiple tasks, e.g., Song et al. [45] enforce consistencies among keypoints, edges, and object symmetries, and Billings et al. [6] predict silhouettes of objects along with object poses. There is also a growing trend of designing model agnostic features [53] that can handle novel objects. Finally, few shot, one shot, and category level pose estimation have also seen increased interest recently [11, 54, 46].
To refine the predicted poses, several works use additional depth information and perform the standard ICP algorithm [56, 23], directly learn from RGB-D inputs [51, 30, 59], or rely on neural rendering [57, 22, 30, 33]. We argue that since the input signals to robots and/or mobile devices are typically video sequences, instead of relying heavily on post processing refinement with additional depth information and rendering, estimating poses in videos by exploiting the temporal data could already refine the single frame pose estimations. Recently, several tracking algorithms have utilised videos to estimate object poses. A notable work from Deng et al. [13] introduces the PoseRBPF algorithm, which uses particle filters to track objects in video sequences. However, this state-of-the-art algorithm provides accurate estimations at a high computational cost. Wen et al. [55] also perform tracking, but use a synthetic rendering of the object at the previous time-step.
Figure 1: Overview of our framework for 6D object pose estimation. We use a Swin transformer [32] as the Transformer Encoder and GPT2 [42] as the Video Transformer Decoder. The Future Feature Predictor and Pose Regressor each consist of a 2-layer MLP, further described in Figure 2.
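As a rough illustration of how the blocks in Figure 1 fit together, the sketch below wires a per-frame Swin encoder, a GPT2-style causal video decoder, and the two MLP heads. The concrete libraries (timm, Hugging Face `transformers`), the backbone variant, the layer counts, and the quaternion-plus-translation output parameterisation are our assumptions about one plausible realisation, not the authors' released code.

```python
import torch
import torch.nn as nn
import timm
from transformers import GPT2Config, GPT2Model

class VideoPoseSketch(nn.Module):
    """Hypothetical wiring of the Figure 1 pipeline for a single tracked object."""

    def __init__(self, feat_dim=768):
        super().__init__()
        # Per-frame spatial encoder: a Swin backbone returning pooled 768-d features.
        self.encoder = timm.create_model("swin_tiny_patch4_window7_224", num_classes=0)
        # Causal video decoder: GPT2 attends only to the current and past frames.
        self.decoder = GPT2Model(GPT2Config(n_embd=feat_dim, n_layer=4, n_head=8))
        # Two small heads, each a 2-layer MLP as described in the caption.
        self.future_predictor = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        # Assumed pose output: unit quaternion (4) + translation (3).
        self.pose_regressor = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 7))

    def forward(self, frames):
        """frames: (B, T, 3, 224, 224) crops of one object over T frames."""
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, 768)
        ctx = self.decoder(inputs_embeds=feats).last_hidden_state   # causal temporal context
        return self.pose_regressor(ctx), self.future_predictor(ctx)

model = VideoPoseSketch()
poses, future_feats = model(torch.randn(1, 4, 3, 224, 224))
print(poses.shape, future_feats.shape)    # torch.Size([1, 4, 7]) torch.Size([1, 4, 768])
```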
With the rise of self-attention models and Transformer architectures [50, 42, 58], we have also seen an increased interest in vision based transformers [3, 32, 5]. This has resulted in the application of Transformers to other tasks such as object detection [9, 62], human pose estimation [60, 61, 29, 36, 34], and object pose estimation [37, 2, 1]. T6D-Direct [1] builds on the Detection Transformer (DETR) [9] architecture to directly regress to the pose, while [2] uses transformers to predict keypoints, and subsequently performs keypoint regression. In contrast to these works, we use Transformer models to attend over a set of frames in a video, and directly regress to the 6D pose. As transformers use a self-attention mechanism, our framework is capable of learning and refining 6D poses from previous frames, without needing additional post process refinement.
3 Approach
Given an RGB-D video stream, our goal is to estimate the 3D rotation and 3D translation of all the objects in every frame of the video. We assume the system has access to the 3D model of each object. In the following sections, R denotes the rotation matrix with respect to the annotated canonical object pose, and T is the translation from the object to the camera.
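For concreteness (consistent with this convention, though not quoted from the paper), a point $x_o$ on the object's canonical 3D model is mapped into the camera frame by the pose we wish to estimate:

$$x_c = R\,x_o + T, \qquad R \in SO(3),\ T \in \mathbb{R}^{3},$$

so recovering the 6D pose of an object in a frame amounts to recovering its $(R, T)$.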
3.1 Overview of the network
Our pipeline, as shown in Figure 1, is a two stage network. The first stage comprises a feature extraction module: we use a Swin transformer [32] to learn the visual features for every frame in the video. For a given video sequence and its corresponding depth, the transformer encoder gives us a feature vector of shape b × t × n × 768, where b corresponds to the batch size, t corresponds to the temporal length, and n corresponds to the number of objects in the image.
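To make the tensor bookkeeping explicit, here is a minimal sketch of flattening per-object crops through a backbone and reshaping to the stated b × t × n × 768 layout; the crop size, the specific Swin variant, and the omission of the depth channel are simplifying assumptions.

```python
import torch
import timm

b, t, n = 2, 4, 3                               # batch size, temporal length, objects per frame
crops = torch.randn(b, t, n, 3, 224, 224)       # assumed per-object RGB crops

backbone = timm.create_model("swin_tiny_patch4_window7_224", num_classes=0)
feats = backbone(crops.flatten(0, 2))           # (b*t*n, 768) pooled Swin features
feats = feats.view(b, t, n, -1)                 # (b, t, n, 768), matching the text
print(feats.shape)
```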
Pose estimation relies on accurate object detection, which provides the class ID and Region of Interest (ROI). During training, we use the ground truth bounding box, whereas during testing we use the predictions and bounding boxes from the PoseCNN model. This can potentially be replaced with any lightweight feature extraction model such as MobileNet [20] to make the inference faster, or with DETR [9], a transformer based object detection module. We also use depth as an additional input.
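As an illustration of how detections could feed the feature extractor, the snippet below crops each detected box from the RGB image and depth map and stacks depth as a fourth channel; the detector interface, box format, and RGB-D stacking are our assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def crop_objects(rgb, depth, boxes, out_size=224):
    """rgb: (3, H, W) image, depth: (1, H, W) map, boxes: list of (x1, y1, x2, y2) detections.
    Returns (n, 4, out_size, out_size) RGB-D crops, one per detected object."""
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Stack the depth crop under the RGB crop as a fourth channel.
        patch = torch.cat([rgb[:, y1:y2, x1:x2], depth[:, y1:y2, x1:x2]], dim=0)
        patch = F.interpolate(patch.unsqueeze(0), size=(out_size, out_size),
                              mode="bilinear", align_corners=False)
        crops.append(patch.squeeze(0))
    return torch.stack(crops)                   # (n, 4, out_size, out_size)

# Ground-truth boxes at train time, PoseCNN detections at test time (as described above).
rgb, depth = torch.rand(3, 480, 640), torch.rand(1, 480, 640)
obj_crops = crop_objects(rgb, depth, [(100, 80, 260, 240), (300, 120, 420, 300)])
print(obj_crops.shape)                          # torch.Size([2, 4, 224, 224])
```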