Motivated by this, we introduce a video-based 6D object pose estimation framework that uses past estimates to constrain the 6D pose in the current frame. Specifically, we leverage the popular Transformer architecture [50, 42] with causal masked attention, where each input frame is only allowed to attend to frames that precede it. We train the model to jointly predict the 6D poses while also learning to predict future features that match the true features. Such a setup has been employed in [15], which shows that predicting future features is an effective self-supervised pretext task for learning visual representations.
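To make this concrete, the following is a minimal PyTorch sketch of causal masked temporal attention with the auxiliary future-feature prediction loss. All tensor names, dimensions, layer counts, and the 7-dimensional pose parameterisation (quaternion plus translation) are illustrative assumptions, not the exact implementation of our method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T, D = 8, 256                                   # frames per clip, feature dimension (assumed)
frame_feats = torch.randn(1, T, D)              # per-frame features from a visual backbone

encoder_layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Causal mask: True entries are blocked, so frame t attends only to frames <= t.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
hidden = temporal_encoder(frame_feats, mask=causal_mask)

pose_head = nn.Linear(D, 7)                     # e.g. quaternion (4) + translation (3)
feat_head = nn.Linear(D, D)                     # predicts the next frame's features

poses = pose_head(hidden)                       # one pose estimate per frame
pred_next = feat_head(hidden[:, :-1])           # predictions for frames 1..T-1

gt_poses = torch.randn(1, T, 7)                 # placeholder ground-truth poses
pose_loss = F.mse_loss(poses, gt_poses)         # stand-in for the actual pose loss
future_loss = F.mse_loss(pred_next, frame_feats[:, 1:].detach())
loss = pose_loss + future_loss                  # jointly optimised objective
```

Because the mask is causal, the predictive loss is well-posed: the representation of frame t never sees frame t+1, so matching the true future features is a genuine prediction task.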
While the temporal architecture described above can be applied on top of any visual feature encoder (as discussed in our ablations), we propose a purely transformer-based model that uses a Swin transformer [32] as the backbone. This enables our network to attend not only temporally across frames in the video, but also spatially within each frame.
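A minimal sketch of this per-frame backbone stage, using torchvision's swin_t purely as a stand-in (the exact Swin variant, input resolution, and feature pooling are assumptions for illustration):

```python
import torch
from torchvision.models import swin_t

backbone = swin_t(weights=None)                 # stand-in Swin variant
backbone.head = torch.nn.Identity()             # keep pooled features, drop the classifier

clip = torch.randn(8, 3, 224, 224)              # T frames treated as a batch
frame_feats = backbone(clip).unsqueeze(0)       # (1, T, 768), input to the temporal encoder
```

The Swin backbone supplies the spatial (within-frame) attention, while the causal encoder above supplies the temporal (across-frame) attention.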
In summary, the contributions of our paper are:
• We introduce a video-based 6D object pose estimation framework that is purely attention-based.
• We incorporate self-supervision via a predictive loss for learning better visual representations.
• We evaluate on the challenging YCB-Video dataset [56], where our algorithm achieves improved performance over state-of-the-art single-frame methods such as PoseCNN [56] and PoseRBPF [13], with real-time performance at 33 fps, and over the transformer-based method T6D-Direct [1].
2 Related Work
Estimating the 6-DOF pose of objects in a scene is a widely studied task. Classical methods use either template-based or feature-based approaches. In template-based methods, a template is matched to different locations in the image and a similarity score is computed [18, 17]. However, these template-matching methods can fail to make predictions for textureless objects and in cluttered environments. In feature-based methods, local features are extracted, correspondences are established between known 3D objects and local 2D features, and PnP is used to recover the 6D pose [43, 39] (see the sketch below). However, these methods also require sufficient texture on the object to compute local features, and they face difficulty in generalising to new environments as they are often trained on small datasets.
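For reference, a minimal sketch of this classical feature-based pipeline with OpenCV, using random placeholder correspondences and an assumed camera intrinsic matrix:

```python
import cv2
import numpy as np

object_pts = np.random.rand(6, 3).astype(np.float32)   # 3D keypoints on the known model (placeholder)
image_pts = np.random.rand(6, 2).astype(np.float32)    # matched 2D features in the image (placeholder)
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)      # assumed camera intrinsics

# PnP recovers the rotation (as a Rodrigues vector) and the translation.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
R, _ = cv2.Rodrigues(rvec)                             # (R, tvec) is the 6D pose
```

The quality of the recovered pose hinges entirely on the 2D-3D correspondences, which is why these methods degrade on textureless objects.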
Convolutional Neural Networks (CNNs) have proven to be an effective tool in many computer vision tasks. However, they rely heavily on the availability of large-scale annotated datasets. To address this limitation, the YCB-Video [56], T-LESS [19], and OccludedLINEMOD [26, 40] datasets were introduced. They have enabled the emergence of novel network designs such as PoseCNN [56], DPOD [59], PVNet [40], and others [8, 52, 16, 10]. In this paper, we use the challenging YCB-Video dataset, as it is a popular testbed for many recent methods [1, 2, 13].
Building on those datasets, various CNN architectures have been introduced to learn effective representations of objects and to estimate accurate 6D poses. Kehl et al. [23] extend SSD [31] by adding an additional viewpoint classification branch to the network, whereas [41, 47] predict the 2D projections of estimated 3D bounding boxes. Other methods adopt a hybrid approach where the model learns to perform multiple tasks: e.g., Song et al. [45] enforce consistencies among keypoints, edges, and object symmetries, and Billings et al. [6] predict silhouettes of objects along with object poses. There is also a growing trend of designing model-agnostic features [53] that can handle novel objects. Finally, few-shot, one-shot, and category-level pose estimation have also seen increased interest recently [11, 54, 46].
To refine the predicted poses, several works use additional depth information and apply the standard ICP algorithm [56, 23], learn directly from RGB-D inputs [51, 30, 59], or rely on neural rendering [57, 22, 30, 33]. We argue that since the input signals to robots and/or mobile devices are typically video sequences, estimating poses in videos by exploiting the temporal data can already refine single-frame pose estimates, instead of relying heavily on post-processing refinement with additional depth information and rendering. Recently, several tracking algorithms have utilised videos to estimate object poses. A notable work by Deng et al. [13] introduces the PoseRBPF algorithm, which uses particle filters to track objects in video sequences. However, this state-of-the-art algorithm provides accurate estimates at a high computational cost. Wen et al. [55] also perform tracking, but use a synthetic rendering of the object at the previous time-step.