
Enhancing Fine-Grained 3D Object Recognition
using Hybrid Multi-Modal Vision Transformer-CNN Models
Songsong Xiong1, Georgios Tziafas1 and Hamidreza Kasaei1
Abstract— Robots operating in human-centered environ-
ments, such as retail stores, restaurants, and households, are
often required to distinguish between similar objects in different
contexts with a high degree of accuracy. However, fine-grained
object recognition remains a challenge in robotics due to the
high intra-category and low inter-category dissimilarities. In
addition, the limited number of fine-grained 3D datasets poses
a significant problem in addressing this issue effectively. In this
paper, we propose a hybrid multi-modal Vision Transformer
(ViT) and Convolutional Neural Networks (CNN) approach to
improve the performance of fine-grained visual classification
(FGVC). To address the shortage of FGVC 3D datasets, we
generated two synthetic datasets. The first dataset consists
of 20 categories related to restaurants with a total of 100
instances, while the second dataset contains 120 shoe instances.
Our approach was evaluated on both datasets, and the results
indicate that it outperforms both CNN-only and ViT-only
baselines, achieving a recognition accuracy of 94.50% and
93.51% on the restaurant and shoe datasets, respectively. Ad-
ditionally, we have made our FGVC RGB-D datasets available
to the research community to enable further experimentation
and advancement. Furthermore, we successfully integrated our
proposed method with a robot framework and demonstrated
its potential as a fine-grained perception tool in both simulated
and real-world robotic scenarios.
I. INTRODUCTION
As labor shortages become increasingly prevalent, robots are
gaining popularity in human-centered environments [1], [2]. To
safely operate in such domains, the robot should be able
to recognize fine-grained objects accurately. For example,
a restaurant robot must be able to categorize drinks with
similar packaging but varying attributes. Similarly, a service
robot in a shoe store must sort shoes with
comparable appearances. Fine-grained visual categorization
(FGVC) has recently received considerable attention [3] as
it aims to identify sub-categories within the same basic-
level classes. However, it remains a challenging task due to
the high intra-category and low inter-category dissimilarity
issues [4]. Furthermore, the performance of FGVC is often
hindered due to limited available datasets [3].
Recently, most studies have focused on RGB fine-grained
recognition. These studies include the
CUB-200-2011 dataset [3], Oxford Flowers dataset [5], Air-
craft dataset [6], and Pets dataset [7]. Additionally, some
studies have employed both RGB and depth sensors to
perform a variety of robotic tasks, including object classification [8]–[12] and action recognition [13]. These studies have demonstrated that using multi-modal object representations enhances recognition accuracy.
1Department of Artificial Intelligence, University of Groningen,
Groningen, The Netherlands
{s.xiong, g.t.tziafas, hamidreza.kasaei}@rug.nl
Fig. 1. An example of a fine-grained object visual classification (FGVC)
scenario: (left) the robot uses an RGB-D camera to perceive the
environment; (right) the distribution of object categories across the
RGB-D feature space, visualized with a t-SNE plot. The plot shows that
distinguishing fine-grained categories such as knife, fork, and spoon
is harder than distinguishing basic-level categories such as mugs and
bottles.
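Fig. 1 visualizes object categories in the RGB-D feature space with a t-SNE plot. A minimal sketch of such a projection is shown below; the feature vectors, dimensionality, and class labels are synthetic placeholders, not the paper's actual features:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-object RGB-D feature vectors (512-D here),
# 30 objects drawn from 3 hypothetical categories, each shifted by a
# class-specific mean so the classes are separable.
class_means = rng.normal(size=(3, 512))
features = rng.normal(size=(30, 512)) + np.repeat(np.eye(3), 10, axis=0) @ class_means
labels = np.repeat(np.arange(3), 10)

# Project to 2-D for plotting; perplexity must be smaller than n_samples.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(
    features.astype(np.float32)
)
print(embedding.shape)  # (30, 2) — one 2-D point per object, ready to scatter-plot
```

Coloring the 2-D points by `labels` then reproduces the kind of cluster plot shown in Fig. 1.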
To the best of our knowledge, only two FGVC RGB-D datasets
are currently available [14], [15]: the first is limited to
hand-grasp classification, while the second, centered on
vegetables and fruits, is not publicly released. Furthermore, due
to their small scale, these datasets impose restrictions on
the performance of deep learning methods for FGVC [16].
Consequently, there is a lack of large-scale RGB-D datasets
that can be utilized for FGVC purposes. To address the
shortage of FGVC 3D datasets, we generated two synthetic
datasets. The first dataset consists of 20 categories related to
restaurants with a total of 100 instances, while the second
dataset contains 120 shoe instances.
Furthermore, we propose a hybrid multi-modal approach
based on ensembles of CNN-ViT networks to enhance the
accuracy of fine-grained 3D object recognition. An overview
of our approach is shown in Fig. 2. We performed an extensive
set of experiments to assess the performance of the proposed
approach. Experimental results show that our multi-modal
approach surpasses corresponding unimodal CNN-only and
ViT-only approaches in recognition accuracy. Additionally,
we have successfully integrated our proposed method with a
robot framework and demonstrated its potential as a fine-
grained perception tool in both simulated and real-world
robotic scenarios.
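This section does not specify how the CNN and ViT branches are combined; a common choice for such ensembles is score-level fusion, averaging the softmax probabilities of each backbone on each modality. The sketch below illustrates that rule with hypothetical logits (the class count, branch names, and values are placeholders, not the paper's actual configuration):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical per-branch logits for one object over 5 fine-grained classes:
# (CNN, ViT) x (RGB, depth) = 4 branch outputs.
branch_logits = {
    "cnn_rgb":   np.array([2.0, 0.1, 0.3, 0.2, 0.1]),
    "cnn_depth": np.array([1.5, 0.4, 0.2, 0.6, 0.1]),
    "vit_rgb":   np.array([1.8, 0.2, 0.5, 0.3, 0.2]),
    "vit_depth": np.array([1.2, 0.3, 0.4, 0.9, 0.2]),
}

# Score-level fusion: average the per-branch softmax probabilities,
# then take the arg-max as the final fine-grained label.
probs = np.mean([softmax(l) for l in branch_logits.values()], axis=0)
prediction = int(np.argmax(probs))
print(prediction)  # 0 — class 0 has the highest logit in every branch
```

Averaging probabilities rather than raw logits keeps each branch's contribution on a comparable scale regardless of how confident its logits are.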
In summary, our key contributions are twofold:
• We propose a hybrid multi-modal approach based on
ViT-CNN networks to enhance fine-grained 3D object
recognition.
arXiv:2210.04613v2 [cs.CV] 6 Mar 2023