Early or Late Fusion Matters Efﬁcient RGB-D Fusion in Vision Transformers for 3D Object Recognition Georgios Tziafas1and Hamidreza Kasaei1

2025-05-02 0 0 7.18MB 8 页 10玖币

侵权投诉

Early or Late Fusion Matters: Efﬁcient RGB-D Fusion in Vision

Transformers for 3D Object Recognition

Georgios Tziafas1and Hamidreza Kasaei1

Abstract— The Vision Transformer (ViT) architecture has

established its place in computer vision literature, however,

training ViTs for RGB-D object recognition remains an under-

studied topic, viewed in recent literature only through the lens

of multi-task pretraining in multiple vision modalities. Such

approaches are often computationally intensive, relying on the

scale of multiple pretraining datasets to align RGB with 3D

information. In this work, we propose a simple yet strong recipe

for transferring pretrained ViTs in RGB-D domains for 3D

object recognition, focusing on fusing RGB and depth repre-

sentations encoded jointly by the ViT. Compared to previous

works in multimodal Transformers, the key challenge here is

to use the attested ﬂexibility of ViTs to capture cross-modal

interactions at the downstream and not the pretraining stage.

We explore which depth representation is better in terms of

resulting accuracy and compare early and late fusion techniques

for aligning the RGB and depth modalities within the ViT

architecture. Experimental results in the Washington RGB-

D Objects dataset (ROD) demonstrate that in such RGB →

RGB-D scenarios, late fusion techniques work better than most

popularly employed early fusion. With our transfer baseline,

fusion ViTs score up to 95.4% top-1 accuracy in ROD, achieving

new state-of-the-art results in this benchmark. We further

show the beneﬁts of using our multimodal fusion baseline

over unimodal feature extractors in a synthetic-to-real visual

adaptation as well as in an open-ended lifelong learning scenario

in the ROD benchmark, where our model outperforms previous

works by a margin of >8%. Finally, we integrate our method

with a robot framework and demonstrate how it can serve as

a perception utility in an interactive robot learning scenario,

both in simulation and with a real robot.

I. INTRODUCTION

Transfer learning approaches for computer vision have

a long-standing tradition for image classiﬁcation, most

popularly using Convolutional Neural Networks (CNNs).

More recently, the Vision Transformer (ViT) [13] architecture

and its variants [28, 41, 3] have shown promising transfer

results, providing ﬂexible representations that can be ﬁne-

tuned for downstream tasks, also in few-shot settings [10].

This capability however comes at the cost of data inefﬁciency

[24], as performance gains over CNNs are noticed in

Transformers that are pretrained in large-scale datasets, such

as ImageNet21k [37] and JFT-300M [39]. When moving

from RGB-only to view-based 3D object recognition (RGB-

D), a dataset of similar magnitude for pretraining is amiss,

granting RGB-D representation learning a topic that has yet

to grow. Recent alternative directions include transferring

from models pretrained on collections of multimodal datasets

[17, 27, 16], although they focus on scene-level tasks, they

Department of Artiﬁcial Intelligence, University of Groningen, The

Netherlands {g.t.tziafas,hamidreza.kasaei}@rug.nl

are constrained to the use of the early fusion strategy and

are often computationally intensive to ﬁne-tune.

In this work, we aim to address such limitations by

revisiting the RGB-D object recognition task and study

recipes for transferring an RGB-only pretrained ViT (i.e.

in ImageNet1k [12]) into an RGB-D object-level dataset. We

begin by exploring different representation formats for the

depth modality and design two variants that adapt ViT to fuse

RGB and depth (see Fig. 1), namely: a) Early fusion, where

RGB and depth are fused before the encoder and RGB-D

patches are represented jointly in the sequence, and b) Late

fusion, where we move the fusion operation after the encoder,

leaving the patch embedders intact from their pretraining. Our

hypothesis is that when ﬁne-tuning in small (or moderate)

sized datasets, the late fusion baseline is very likely to perform

better, as it doesn’t change the representation of the input

compared to the pretraining stage, but casts the challenge

as a distribution shift in the input images (i.e. both RGB

and depth are processed by the same weights and must be

mapped to the same label).

Experimental results with the Washington RGB-D Objects

dataset [25] positively reinforce our hypothesis, as the late

fusion baseline far outperforms the early variant. More

interestingly, we show that with our late fusion recipe,

ViTs achieve new state-of-the-art results in this benchmark,

surpassing a plethora of methods that speciﬁcally study RGB-

D fusion techniques for object recognition. We conduct addi-

tional experiments to further demonstrate the representational

strength of our approach in: a) a synthetic-to-real transfer

scenario, where we show that with late fusion a synthetically

pretrained ViT can surpass the performance of training in real

data with only a few ﬁne-tuning examples, and b) an open-

ended lifelong learning scenario, where we show that our late

fusion encoder outperforms unimodal versions of the same

scale, even without ﬁne-tuning, while outperforming previous

works by a signiﬁcant margin. Finally, we demonstrate the

applicability of our approach in the robotics domain by

integrating our method with a simulated and real robot

framework. In particular, we illustrate how the robot can be

taught by a human user to recognize new objects and perform

a table-cleaning task. In summary, our main contributions

are:

•

We experimentally ﬁnd that late fusion performs better

than early fusion in RGB

→

RGB-D transfer scenarios.

•

We achieve new state-of-the-art results for RGB-D object

recognition in the ROD [25] benchmark.

•

We show that our method can aid in SynROD

→

ROD

[31] few-shot visual domain adaptation scenario.

arXiv:2210.00843v2 [cs.CV] 7 Mar 2023

RGB projection Depth projection

Transformer Encoder

MLP

Head

1234

fusion

: Position

Embedding

*: Trainable <CLS>

Linear projection

Transformer Encoder

MLP

Head

034

fusion

Embedding

...

5 6 7 8 9

...

... ...

...

shared

Fig. 1: Two different baselines for fusing RGB-D representations in the ViT architecture. In early fusion (left), a separate projection is used for RGB and

depth and the fused embeddings are fed to the encoder, providing a single

<CLS>

token. In late fusion (right), the same weights are used for projecting

RGB and depth and the two modalities are fed separately to the encoder. The two ﬁnal

<CLS>

tokens are fused to provide the ﬁnal representation for

classiﬁcation.

•

We show that our method can be applied in an online

lifelong robot learning setup, including experimental

comparisons with previous works as well as simulation

and real robot demonstrations.

II. RELATED WORKS

In this section, we discuss previous works on RGB-D fusion

with CNNs for view-based object recognition, multimodal

Transformers, as well as open-ended lifelong learning, which

we include as an evaluation scenario in our experiments.

A. RGB-D Fusion with CNNs

As in RGB image classiﬁcation, multiple traditional CNN-

based approaches have replaced conventional approaches [4,

40] for extending to the RGB-D modalities. The focus of such

works lies in RGB-D fusion, where deep features extracted

from CNNs are fused through a multimodal fusion layer

[43] or custom networks [42]. Rahman et. al. [14] propose

a parallel three-stream CNN which processes two depth

encodings in two streams and RGB in the last one. Cheng et.

al. [11] proposed to integrate Gaussian mixture models with

CNNs through ﬁsher kernel encodings. Zia et. al. [48] propose

mixed 2D/3D CNNs which are initialized with 2D pretrained

weights and extend to 3D to also incorporate depth. Such

methods study how to inject fusion in the locally-aware CNN

architecture. In contrary, in our work, we implement fusion as

a pooling operation on multimodal Transformer embeddings

and opt to gain cross-modal alignment via transferring from

pretrained models.

B. Multimodal Learning with Transformers

In the absence of a large-scale RGB-D dataset for pre-

training, recent works try to alleviate this bottleneck by

pretraining on collections of datasets from multiple modalities

[17, 27, 16] and rely on the ﬂexibility of Transformers to

capture cross-modal interactions. However, such methods

focus on scene/action recognition or semantic segmentation

tasks, leaving the RGB-D object recognition task unexplored.

Furthermore, they employ an early fusion technique for

converting heterogeneous modalities (i.e., image, video) in

the same sequence representation, leaving open questions of

whether this is the best fusion technique in homogeneous

modalities such as RGB-D, as well as if its the best fusion

technique for directly transferring from one homogeneous

modality to another, without the pretraining step. Finally, they

rely heavily on model capacity and specialized Transformer

architecture variants (e.g Swin [28]) in order to enable

multimodal pretraining to boost performance in unimodal

downstream tasks. Such models set a high computational

resource entry point for practitioners, casting them not widely

accessible for ﬁne-tuning in arbitrary datasets.

C. Open-Ended Lifelong Learning

An emerging topic in deep learning literature, most com-

monly referred to as Lifelong or Continual Learning, studies

the scenario of a learning agent continuously incorporating

new experiences from an online data stream. In the context

of image classiﬁcation, the challenge is stated as learning

to classify images from an ever-shifting distribution, while

avoiding the effect of catastrophic forgetting [33, 9, 36, 46].

Even though works for using Transformers in lifelong learning

are starting to grow [44, 15], to the best of our knowledge,

this is the ﬁrst work that touches on lifelong learning with

Transformers for RGB-D object recognition. However, we

highlight that the focus of this work is not on lifelong learning

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

EarlyorLateFusionMatters:EfcientRGB-DFusioninVisionTransformersfor3DObjectRecognitionGeorgiosTziafas1andHamidrezaKasaei1AbstractTheVisionTransformer(ViT)architecturehasestablisheditsplaceincomputervisionliterature,however,trainingViTsforRGB-Dobjectrecognitionremainsanunder-studiedtopic,viewedinrec...

展开>> 收起<<

Early or Late Fusion Matters Efﬁcient RGB-D Fusion in Vision Transformers for 3D Object Recognition Georgios Tziafas1and Hamidreza Kasaei1.pdf

共8页,预览2页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Early or Late Fusion Matters Efﬁcient RGB-D Fusion in Vision Transformers for 3D Object Recognition Georgios Tziafas1and Hamidreza Kasaei1

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: