Early or Late Fusion Matters: Efficient RGB-D Fusion in Vision
Transformers for 3D Object Recognition
Georgios Tziafas1 and Hamidreza Kasaei1
Abstract— The Vision Transformer (ViT) architecture has established its place in the computer vision literature; however,
training ViTs for RGB-D object recognition remains an under-
studied topic, viewed in recent literature only through the lens
of multi-task pretraining in multiple vision modalities. Such
approaches are often computationally intensive, relying on the
scale of multiple pretraining datasets to align RGB with 3D
information. In this work, we propose a simple yet strong recipe
for transferring pretrained ViTs in RGB-D domains for 3D
object recognition, focusing on fusing RGB and depth repre-
sentations encoded jointly by the ViT. Compared to previous
works in multimodal Transformers, the key challenge here is
to use the attested flexibility of ViTs to capture cross-modal
interactions at the downstream and not the pretraining stage.
We explore which depth representation is better in terms of
resulting accuracy and compare early and late fusion techniques
for aligning the RGB and depth modalities within the ViT
architecture. Experimental results on the Washington RGB-D Objects dataset (ROD) demonstrate that in such RGB → RGB-D scenarios, late fusion techniques work better than the most popularly employed early fusion. With our transfer baseline,
fusion ViTs score up to 95.4% top-1 accuracy in ROD, achieving
new state-of-the-art results in this benchmark. We further
show the benefits of using our multimodal fusion baseline
over unimodal feature extractors in a synthetic-to-real visual
adaptation as well as in an open-ended lifelong learning scenario
in the ROD benchmark, where our model outperforms previous
works by a margin of >8%. Finally, we integrate our method
with a robot framework and demonstrate how it can serve as
a perception utility in an interactive robot learning scenario,
both in simulation and with a real robot.
I. INTRODUCTION
Transfer learning approaches for computer vision have
a long-standing tradition for image classification, most
popularly using Convolutional Neural Networks (CNNs).
More recently, the Vision Transformer (ViT) [13] architecture
and its variants [28, 41, 3] have shown promising transfer
results, providing flexible representations that can be fine-
tuned for downstream tasks, also in few-shot settings [10].
This capability, however, comes at the cost of data inefficiency [24], as performance gains over CNNs are noticed in Transformers that are pretrained on large-scale datasets, such
as ImageNet21k [37] and JFT-300M [39]. When moving
from RGB-only to view-based 3D object recognition (RGB-
D), a pretraining dataset of similar magnitude is missing, leaving RGB-D representation learning a topic that has yet to grow. Recent alternative directions include transferring from models pretrained on collections of multimodal datasets [17, 27, 16]; however, these works focus on scene-level tasks, are constrained to the use of the early fusion strategy, and are often computationally intensive to fine-tune.

1Department of Artificial Intelligence, University of Groningen, The Netherlands {g.t.tziafas, hamidreza.kasaei}@rug.nl
In this work, we aim to address such limitations by revisiting the RGB-D object recognition task and studying recipes for transferring an RGB-only pretrained ViT (i.e., pretrained on ImageNet1k [12]) to an RGB-D object-level dataset. We
begin by exploring different representation formats for the
depth modality and design two variants that adapt ViT to fuse
RGB and depth (see Fig. 1), namely: a) Early fusion, where
RGB and depth are fused before the encoder and RGB-D
patches are represented jointly in the sequence, and b) Late
fusion, where we move the fusion operation after the encoder,
leaving the patch embedders intact from their pretraining. Our
hypothesis is that when fine-tuning on small (or moderately sized) datasets, the late fusion baseline is very likely to perform better, as it does not change the representation of the input compared to the pretraining stage, but instead casts the challenge as a distribution shift in the input images (i.e., both RGB and depth are processed by the same weights and must be mapped to the same label).
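As a concrete illustration of the two variants, the following PyTorch sketch contrasts early and late fusion around a generic Transformer encoder. It is a minimal sketch under stated assumptions, not the paper's exact implementation: the encoder configuration, the omission of positional embeddings, and the concatenation-based fusion of the two <CLS> tokens are placeholder choices made for brevity.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly project them."""
    def __init__(self, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)      # (B, N, dim)


class EarlyFusionViT(nn.Module):
    """Early fusion: separate patch projections, one joint RGB-D token sequence."""
    def __init__(self, dim=768, num_layers=12, heads=12, num_classes=51):  # 51 categories in ROD
        super().__init__()
        self.rgb_embed = PatchEmbed(dim=dim)
        self.depth_embed = PatchEmbed(dim=dim)    # depth rendered as a 3-channel image
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb, depth):
        # Positional embeddings are omitted for brevity.
        tokens = torch.cat([self.rgb_embed(rgb), self.depth_embed(depth)], dim=1)
        cls = self.cls.expand(rgb.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(z[:, 0])                 # single <CLS> token


class LateFusionViT(nn.Module):
    """Late fusion: shared weights, two forward passes, fuse the two <CLS> tokens."""
    def __init__(self, dim=768, num_layers=12, heads=12, num_classes=51):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)          # shared between RGB and depth
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(2 * dim, num_classes)  # assumes concatenation as the fusion op

    def encode(self, x):
        cls = self.cls.expand(x.size(0), -1, -1)
        z = self.encoder(torch.cat([cls, self.embed(x)], dim=1))
        return z[:, 0]

    def forward(self, rgb, depth):
        fused = torch.cat([self.encode(rgb), self.encode(depth)], dim=-1)
        return self.head(fused)
```

In the late fusion case, note that the patch embedder and encoder stay identical to their RGB-only pretraining; only the small classification head needs to learn from the fused representation.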
Experimental results with the Washington RGB-D Objects
dataset [25] positively reinforce our hypothesis, as the late
fusion baseline far outperforms the early variant. More
interestingly, we show that with our late fusion recipe,
ViTs achieve new state-of-the-art results in this benchmark,
surpassing a plethora of methods that specifically study RGB-
D fusion techniques for object recognition. We conduct addi-
tional experiments to further demonstrate the representational
strength of our approach in: a) a synthetic-to-real transfer
scenario, where we show that with late fusion a synthetically
pretrained ViT can surpass the performance of training on real data with only a few fine-tuning examples, and b) an open-
ended lifelong learning scenario, where we show that our late
fusion encoder outperforms unimodal versions of the same
scale, even without fine-tuning, while outperforming previous
works by a significant margin. Finally, we demonstrate the
applicability of our approach in the robotics domain by
integrating our method with a simulated and real robot
framework. In particular, we illustrate how the robot can be
taught by a human user to recognize new objects and perform
a table-cleaning task. In summary, our main contributions
are:
• We experimentally find that late fusion performs better than early fusion in RGB → RGB-D transfer scenarios.
• We achieve new state-of-the-art results for RGB-D object recognition in the ROD [25] benchmark.
• We show that our method can aid in a SynROD → ROD [31] few-shot visual domain adaptation scenario.
• We show that our method can be applied in an online lifelong robot learning setup, including experimental comparisons with previous works as well as simulation and real robot demonstrations.
Fig. 1: Two different baselines for fusing RGB-D representations in the ViT architecture. In early fusion (left), a separate projection is used for RGB and depth and the fused embeddings are fed to the encoder, providing a single <CLS> token. In late fusion (right), the same weights are used for projecting RGB and depth and the two modalities are fed separately to the encoder. The two final <CLS> tokens are fused to provide the final representation for classification.
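Because the late fusion variant reuses the pretrained RGB patch projection for depth, the single-channel depth map must first be lifted to three channels. The sketch below illustrates one common option, colorizing normalized depth with a jet colormap; the paper compares several depth representations, so this particular encoding (and the helper name colorize_depth) is an illustrative assumption rather than the reported best choice.

```python
import numpy as np
from matplotlib import cm


def colorize_depth(depth: np.ndarray) -> np.ndarray:
    """Map a single-channel depth image (H, W) to a 3-channel uint8 image (H, W, 3).

    Zero readings are treated as invalid and ignored during normalization,
    a common convention for Kinect-style sensors such as those used in ROD.
    """
    norm = np.zeros_like(depth, dtype=np.float32)
    valid = depth > 0
    if valid.any():
        d_min, d_max = depth[valid].min(), depth[valid].max()
        norm[valid] = (depth[valid] - d_min) / max(float(d_max - d_min), 1e-6)
    rgb_like = cm.jet(norm)[..., :3]            # drop the alpha channel
    return (rgb_like * 255).astype(np.uint8)    # ready for the standard RGB ViT pipeline
```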
II. RELATED WORKS
In this section, we discuss previous works on RGB-D fusion
with CNNs for view-based object recognition, multimodal
Transformers, as well as open-ended lifelong learning, which
we include as an evaluation scenario in our experiments.
A. RGB-D Fusion with CNNs
As in RGB image classification, CNN-based approaches have replaced conventional methods [4, 40] when extending to the RGB-D modalities. The focus of such works lies in RGB-D fusion, where deep features extracted from CNNs are fused through a multimodal fusion layer [43] or custom networks [42]. Rahman et al. [14] propose a parallel three-stream CNN which processes two depth encodings in two streams and RGB in the last one. Cheng et al. [11] propose to integrate Gaussian mixture models with CNNs through Fisher kernel encodings. Zia et al. [48] propose mixed 2D/3D CNNs which are initialized with 2D pretrained weights and extended to 3D to also incorporate depth. Such methods study how to inject fusion into the locally-aware CNN architecture. In contrast, in our work, we implement fusion as a pooling operation on multimodal Transformer embeddings and opt to gain cross-modal alignment by transferring from pretrained models.
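To make the contrast with CNN fusion layers concrete, the snippet below sketches fusion as a simple pooling operation over the two <CLS> embeddings produced by the encoder. The set of operators shown (concatenation, element-wise max, mean) is illustrative only; this excerpt does not state which pooling the final model uses.

```python
import torch


def fuse_cls_tokens(z_rgb: torch.Tensor, z_depth: torch.Tensor, mode: str = "concat") -> torch.Tensor:
    """Pool two (B, dim) <CLS> embeddings into a single fused representation."""
    if mode == "concat":                      # (B, 2 * dim)
        return torch.cat([z_rgb, z_depth], dim=-1)
    if mode == "max":                         # element-wise maximum, (B, dim)
        return torch.maximum(z_rgb, z_depth)
    if mode == "mean":                        # element-wise average, (B, dim)
        return 0.5 * (z_rgb + z_depth)
    raise ValueError(f"unknown fusion mode: {mode}")
```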
B. Multimodal Learning with Transformers
In the absence of a large-scale RGB-D dataset for pre-
training, recent works try to alleviate this bottleneck by
pretraining on collections of datasets from multiple modalities
[17, 27, 16] and rely on the flexibility of Transformers to
capture cross-modal interactions. However, such methods
focus on scene/action recognition or semantic segmentation
tasks, leaving the RGB-D object recognition task unexplored.
Furthermore, they employ an early fusion technique for converting heterogeneous modalities (i.e., image, video) into the same sequence representation, leaving open the questions of whether this is the best fusion technique for homogeneous modalities such as RGB-D, and whether it is the best fusion technique for directly transferring from one homogeneous modality to another, without the pretraining step. Finally, they rely heavily on model capacity and specialized Transformer architecture variants (e.g., Swin [28]) in order to enable multimodal pretraining to boost performance in unimodal downstream tasks. Such models set a high computational resource entry point for practitioners, making them not widely accessible for fine-tuning on arbitrary datasets.
C. Open-Ended Lifelong Learning
An emerging topic in deep learning literature, most com-
monly referred to as Lifelong or Continual Learning, studies
the scenario of a learning agent continuously incorporating
new experiences from an online data stream. In the context
of image classification, the challenge is stated as learning
to classify images from an ever-shifting distribution, while
avoiding the effect of catastrophic forgetting [33, 9, 36, 46].
Even though works using Transformers for lifelong learning are starting to grow [44, 15], to the best of our knowledge,
this is the first work that touches on lifelong learning with
Transformers for RGB-D object recognition. However, we
highlight that the focus of this work is not on lifelong learning