Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision
Transformers for 3D Object Recognition
Georgios Tziafas1 and Hamidreza Kasaei1
1Department of Artificial Intelligence, University of Groningen, The Netherlands {g.t.tziafas,hamidreza.kasaei}@rug.nl
Abstract— The Vision Transformer (ViT) architecture has
established its place in the computer vision literature; however,
training ViTs for RGB-D object recognition remains an under-
studied topic, viewed in recent literature only through the lens
of multi-task pretraining in multiple vision modalities. Such
approaches are often computationally intensive, relying on the
scale of multiple pretraining datasets to align RGB with 3D
information. In this work, we propose a simple yet strong recipe
for transferring pretrained ViTs to RGB-D domains for 3D
object recognition, focusing on fusing RGB and depth repre-
sentations encoded jointly by the ViT. Compared to previous
works in multimodal Transformers, the key challenge here is
to use the attested flexibility of ViTs to capture cross-modal
interactions at the downstream stage rather than the pretraining stage.
We explore which depth representation is better in terms of
resulting accuracy and compare early and late fusion techniques
for aligning the RGB and depth modalities within the ViT
architecture. Experimental results in the Washington RGB-
D Objects dataset (ROD) demonstrate that in such RGB →
RGB-D scenarios, late fusion techniques work better than the more
popularly employed early fusion. With our transfer baseline,
fusion ViTs score up to 95.4% top-1 accuracy in ROD, achieving
new state-of-the-art results in this benchmark. We further
show the benefits of using our multimodal fusion baseline
over unimodal feature extractors in a synthetic-to-real visual
adaptation setting as well as in an open-ended lifelong learning scenario
in the ROD benchmark, where our model outperforms previous
works by a margin of >8%. Finally, we integrate our method
with a robot framework and demonstrate how it can serve as
a perception utility in an interactive robot learning scenario,
both in simulation and with a real robot.
I. INTRODUCTION
Transfer learning approaches for computer vision have
a long-standing tradition in image classification, most
popularly using Convolutional Neural Networks (CNNs).
More recently, the Vision Transformer (ViT) [13] architecture
and its variants [28, 41, 3] have shown promising transfer
results, providing flexible representations that can be fine-
tuned for downstream tasks, also in few-shot settings [10].
This capability, however, comes at the cost of data inefficiency
[24], as performance gains over CNNs are observed in
Transformers pretrained on large-scale datasets, such
as ImageNet21k [37] and JFT-300M [39]. When moving
from RGB-only to view-based 3D object recognition (RGB-
D), a pretraining dataset of similar magnitude is missing,
leaving RGB-D representation learning as a topic that has yet
to mature. Recent alternative directions include transferring
from models pretrained on collections of multimodal datasets
[17, 27, 16]; however, these focus on scene-level tasks,
are constrained to the early fusion strategy, and
are often computationally intensive to fine-tune.
In this work, we aim to address such limitations by
revisiting the RGB-D object recognition task and studying
recipes for transferring an RGB-only pretrained ViT (i.e.,
pretrained on ImageNet1k [12]) to an RGB-D object-level dataset. We
begin by exploring different representation formats for the
depth modality and design two variants that adapt ViT to fuse
RGB and depth (see Fig. 1), namely: a) Early fusion, where
RGB and depth are fused before the encoder and RGB-D
patches are represented jointly in the sequence, and b) Late
fusion, where we move the fusion operation after the encoder,
leaving the patch embedders intact from their pretraining. Our
hypothesis is that when fine-tuning on small (or moderately)
sized datasets, the late fusion baseline is very likely to perform
better, as it does not change the representation of the input
compared to the pretraining stage, but instead casts the challenge
as a distribution shift in the input images (i.e. both RGB
and depth are processed by the same weights and must be
mapped to the same label).
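To make the distinction concrete, the sketch below contrasts the two variants around a pretrained ViT backbone. It is a minimal illustration rather than the exact implementation: the choice of channel-wise concatenation for early fusion, feature concatenation with a linear head for late fusion, the use of a timm ViT-B/16 backbone, and a depth map rendered as a 3-channel image are assumptions made for the example.

```python
# Minimal sketch (not the exact implementation) contrasting early vs. late
# RGB-D fusion around a pretrained ViT backbone. Assumes the depth map has
# already been rendered as a 3-channel image (e.g., surface normals or a
# colormap); fusion operators are illustrative assumptions.
import torch
import torch.nn as nn
import timm


class EarlyFusionViT(nn.Module):
    """Fuse before the encoder: RGB and depth are stacked channel-wise and the
    patch-embedding projection is widened to 6 input channels, so RGB-D patches
    are represented jointly in one token sequence."""

    def __init__(self, num_classes: int, backbone: str = "vit_base_patch16_224"):
        super().__init__()
        self.encoder = timm.create_model(backbone, pretrained=True, num_classes=0)
        embed_dim = self.encoder.num_features
        # Replace the 3-channel patch projection with a new 6-channel one
        # (copying the pretrained RGB filters into the first 3 channels is
        # another possible initialization choice, not done here).
        old = self.encoder.patch_embed.proj
        self.encoder.patch_embed.proj = nn.Conv2d(
            6, embed_dim, kernel_size=old.kernel_size, stride=old.stride
        )
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, depth], dim=1)      # (B, 6, H, W)
        return self.head(self.encoder(x))       # pooled token -> logits


class LateFusionViT(nn.Module):
    """Fuse after the encoder: RGB and colorized depth are encoded by the same
    pretrained ViT, keeping the patch embedder untouched, and their pooled
    features are concatenated before a lightweight classification head."""

    def __init__(self, num_classes: int, backbone: str = "vit_base_patch16_224"):
        super().__init__()
        self.encoder = timm.create_model(backbone, pretrained=True, num_classes=0)
        self.head = nn.Linear(2 * self.encoder.num_features, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.encoder(rgb)               # (B, D), shared weights
        f_d = self.encoder(depth)               # (B, D), same encoder
        return self.head(torch.cat([f_rgb, f_d], dim=-1))


if __name__ == "__main__":
    rgb = torch.randn(2, 3, 224, 224)
    depth = torch.randn(2, 3, 224, 224)         # depth rendered as 3 channels
    print(LateFusionViT(num_classes=51)(rgb, depth).shape)  # torch.Size([2, 51])
```

Note that in the late fusion variant only the small classification head sees new input statistics, while the encoder processes both modalities with its pretrained weights, which is precisely the property the hypothesis above relies on.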
Experimental results with the Washington RGB-D Objects
dataset [25] positively reinforce our hypothesis, as the late
fusion baseline far outperforms the early variant. More
interestingly, we show that with our late fusion recipe,
ViTs achieve new state-of-the-art results in this benchmark,
surpassing a plethora of methods that specifically study RGB-
D fusion techniques for object recognition. We conduct addi-
tional experiments to further demonstrate the representational
strength of our approach in: a) a synthetic-to-real transfer
scenario, where we show that with late fusion a synthetically
pretrained ViT can surpass the performance of training on real
data with only a few fine-tuning examples, and b) an open-
ended lifelong learning scenario, where we show that our late
fusion encoder outperforms unimodal versions of the same
scale, even without fine-tuning, while outperforming previous
works by a significant margin. Finally, we demonstrate the
applicability of our approach in the robotics domain by
integrating our method with a simulated and real robot
framework. In particular, we illustrate how the robot can be
taught by a human user to recognize new objects and perform
a table-cleaning task. In summary, our main contributions
are:
• We experimentally find that late fusion performs better than early fusion in RGB → RGB-D transfer scenarios.
• We achieve new state-of-the-art results for RGB-D object recognition in the ROD [25] benchmark.
• We show that our method can aid in the SynROD → ROD [31] few-shot visual domain adaptation scenario.