Enhancing Fine-Grained 3D Object Recognition
using Hybrid Multi-Modal Vision Transformer-CNN Models
Songsong Xiong1, Georgios Tziafas1, and Hamidreza Kasaei1

1Department of Artificial Intelligence, University of Groningen, Groningen, The Netherlands. {s.xiong, g.t.tziafas, hamidreza.kasaei}@rug.nl
Abstract— Robots operating in human-centered environments, such as retail stores, restaurants, and households, are often required to distinguish between similar objects in different contexts with a high degree of accuracy. However, fine-grained object recognition remains a challenge in robotics due to high intra-category and low inter-category dissimilarities. In addition, the limited number of fine-grained 3D datasets poses a significant obstacle to addressing this issue effectively. In this paper, we propose a hybrid multi-modal Vision Transformer (ViT) and Convolutional Neural Network (CNN) approach to improve the performance of fine-grained visual classification (FGVC). To address the shortage of FGVC 3D datasets, we generated two synthetic datasets: the first consists of 20 restaurant-related categories with a total of 100 instances, while the second contains 120 shoe instances. Our approach was evaluated on both datasets, and the results indicate that it outperforms both CNN-only and ViT-only baselines, achieving recognition accuracies of 94.50% and 93.51% on the restaurant and shoe datasets, respectively. Additionally, we have made our FGVC RGB-D datasets available to the research community to enable further experimentation and advancement. Furthermore, we successfully integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
I. INTRODUCTION
As society continues to grow, labor shortages have become increasingly prevalent. Consequently, robots are gaining popularity in human-centered environments [1], [2]. To operate safely in such domains, a robot should be able to recognize fine-grained objects accurately. For example, a restaurant robot must be able to categorize drinks with similar packaging but varying attributes. Similarly, a service robot in a shoe store is required to sort shoes with comparable appearances. Fine-grained visual categorization (FGVC) has recently received considerable attention [3], as it aims to identify sub-categories within the same basic-level classes. However, it remains a challenging task due to high intra-category and low inter-category dissimilarities [4]. Furthermore, the performance of FGVC is often hindered by limited available datasets [3].
Recently, the majority of studies have focused on RGB fine-grained recognition, including work on the CUB-200-2011 [3], Oxford Flowers [5], Aircraft [6], and Pets [7] datasets. Additionally, some studies have employed both RGB and depth sensors to perform a variety of robotic tasks, including object classification [8]–[12] and action recognition [13]. These studies have demonstrated that using multi-modal object representations enhances recognition accuracy.

Fig. 1. An example of a fine-grained visual classification (FGVC) scenario: (left) the robot leverages an RGB-D camera to perceive the environment; (right) the distribution of various object categories across the RGB-D feature space, displayed using a t-SNE plot. The plot reveals that distinguishing between fine-grained categories such as knife, fork, and spoon is more challenging than distinguishing between basic-level categories such as mugs and bottles.
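As a concrete illustration of the kind of inspection shown in Fig. 1, the following is a minimal sketch of projecting object features with t-SNE; the `features` and `labels` arrays are random placeholders standing in for extracted RGB-D embeddings, not data from the paper.

```python
# Minimal t-SNE visualization sketch (placeholder data, not the paper's).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 128))   # placeholder RGB-D feature vectors
labels = rng.integers(0, 5, size=300)    # placeholder category ids

# Project the high-dimensional features to 2-D for visual inspection;
# fine-grained categories typically form overlapping clusters here.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=8)
plt.title("t-SNE of object features")
plt.show()
```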
To the best of our knowledge, there are only two FGVC
RGB-D datasets currently available. The first dataset is
limited to hand-grasp classification, while the second dataset
is centered around vegetables and fruits, but unfortunately,
it is not publicly available [14], [15]. Furthermore, due
to their small scale, these datasets impose restrictions on
the performance of deep learning methods for FGVC [16].
Consequently, there is a lack of large-scale RGB-D datasets
that can be utilized for FGVC purposes. To address the
shortage of FGVC 3D datasets, we generated two synthetic
datasets. The first dataset consists of 20 categories related to
restaurants with a total of 100 instances, while the second
dataset contains 120 shoe instances.
Furthermore, we propose a hybrid multi-modal approach based on ensembles of CNN-ViT networks to enhance the accuracy of fine-grained 3D object recognition. An overview of our approach is shown in Fig. 2. We performed extensive experiments to assess the performance of the proposed approach. Experimental results show that our multi-modal approach surpasses the corresponding unimodal CNN-only and ViT-only approaches in recognition accuracy. Additionally, we have successfully integrated our proposed method with a robot framework and demonstrated its potential as a fine-grained perception tool in both simulated and real-world robotic scenarios.
In summary, our key contributions are twofold:
• We propose a hybrid multi-modal approach based on ViT-CNN networks to enhance fine-grained 3D object recognition.
• To the best of our knowledge, we are the first group to build publicly available 3D object datasets for fine-grained object classification. The datasets are available online at: https://github.com/github-songsong/Fine-grained-Pointcloud-Object-Dataset

Fig. 2. A dual-arm service robot is used to sort fine-grained shoe objects: (left) to accomplish this task, the robot should first recognize all fine-grained shoe objects and then place similar shoes in the same basket; (center) the proposed hybrid multi-modal approach receives RGB-D images of each object and passes them through a CNN and a ViT simultaneously. The CNN is responsible for capturing local information about the object, while the ViT is used to encode its global features. The obtained representations are then fused and used for object recognition; (right) afterward, the robot detects a proper grasp configuration for the target object [17], and then grasps and manipulates the object into the basket.
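To make the dual-branch design described in the center panel of Fig. 2 concrete, below is a minimal sketch of a CNN+ViT fusion model. The class name, the off-the-shelf torchvision backbones (resnet50 as the CNN branch, vit_b_16 as the ViT branch), and fusion by feature concatenation are illustrative assumptions, not necessarily the authors' exact configuration.

```python
# Minimal sketch of one hybrid CNN+ViT branch; backbones and the
# concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class HybridViTCNN(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # CNN branch: captures local appearance cues.
        cnn = models.resnet50(weights=None)  # pretrained weights in practice
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # drop the fc head
        # ViT branch: encodes global context via self-attention.
        vit = models.vit_b_16(weights=None)
        vit.heads = nn.Identity()  # expose the 768-d class-token embedding
        self.vit = vit
        # Classifier over the fused CNN (2048-d) + ViT (768-d) features.
        self.classifier = nn.Linear(2048 + 768, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.cnn(x).flatten(1)   # (B, 2048)
        global_feat = self.vit(x)             # (B, 768)
        fused = torch.cat([local_feat, global_feat], dim=1)
        return self.classifier(fused)

# Example: one 224x224 RGB (or colorized depth) batch per forward pass.
model = HybridViTCNN(num_classes=20)
logits = model(torch.randn(2, 3, 224, 224))
```

One such branch can be instantiated per modality (RGB and depth), with their predictions combined at the score level, as discussed in Section II.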
II. RELATED WORK
Various studies have been carried out on fine-grained object recognition; based on their approach, they can be classified into three categories, namely localization methods, feature-encoding methods, and transformer methods, as discussed in [4].
Localization-based FGVC methods: These methods focus on identifying discriminative part regions between instances by training a detection model, and then classifying objects using the trained model. For example, Branson et al. [18] and Wei et al. [19] proposed supervised learning of localization via part annotations. However, due to the high cost and limited availability of part annotations, weakly supervised learning using image labels alone has gained more attention. Yang et al. [20] introduced a re-ranking method to enhance region representations for global categorization. Unlike our approach, these methods require specially designed models to identify potential regions, and the selected regions must still undergo classification via a backbone model.
Feature-encoding methods: These methods aim to enhance the object representation to achieve better classification performance; our approach falls into this category. Yu et al. [21] enhanced the object representation with a hierarchical bilinear pooling function that combines multiple cross-layer bilinear features. Zheng et al. [22] proposed a deep bilinear transformation block that can be deeply stacked in convolutional neural networks to learn fine-grained image representations. In particular, they uniformly grouped the input channels into several semantic groups and then generated a compact representation for FGVC. As these methods use a single encoder and only RGB data, their performance is limited [23]. To address these limitations, we consider a multi-modal ensemble of ViT and CNN models to handle the FGVC task.
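For intuition, the following is a minimal sketch of classic bilinear pooling, the basic building block behind feature-encoding approaches such as the hierarchical cross-layer variant of [21]; the function name and placeholder feature maps are ours for illustration, not the cited papers' code.

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Classic bilinear pooling of two conv feature maps of shape (B, C, H, W)."""
    B, C1, H, W = feat_a.shape
    C2 = feat_b.shape[1]
    a = feat_a.reshape(B, C1, H * W)
    b = feat_b.reshape(B, C2, H * W)
    # Outer product of channel descriptors, averaged over spatial locations.
    x = torch.bmm(a, b.transpose(1, 2)) / (H * W)      # (B, C1, C2)
    x = x.reshape(B, C1 * C2)
    # Signed square-root and L2 normalization, as is standard practice.
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-12)
    return F.normalize(x, dim=1)

# Example: pooling a layer's features with themselves (self-bilinear).
fmap = torch.randn(2, 256, 14, 14)
desc = bilinear_pool(fmap, fmap)   # shape (2, 256*256)
```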
Transformer methods: In recent years, transformers have shown remarkable progress in Natural Language Processing [24], [25]. As a result, more studies have applied transformers to computer vision tasks, including object detection [26], [27], segmentation [28], [29], and object tracking [30]. In particular, using ViT models for FGVC has gained increasing popularity [31], [32]. For instance, Dosovitskiy et al. [33] proposed the ViT model, which demonstrated superior performance in image classification. Subsequently, Swin [34], DeiT [35], and MAE [36] were introduced for various computer vision tasks. He et al. [23] extended the ViT-only model to FGVC and evaluated the proposed approach on traditional RGB-only datasets (e.g., various bird species and car models).
Many researchers have recently utilized CNN-only or ViT-only models to tackle FGVC on RGB-only datasets. Ullrich et al. [32] leveraged a multi-CNN network to extract features from RGB and depth images for 3D object recognition. RGB and depth image representations obtained separately from CNN-only models have also been used for single-view FGVC [15]. The performance of CNNs and ViTs, however, is confined by their fixed architectures: as the training dataset grows, their accuracy improves, but it can only approach the maximum of that fixed capacity. To improve the performance of single-view FGVC, we propose a hybrid multi-modal approach based on ViT and CNN networks.
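As a rough sketch of how such a multi-modal ensemble can combine branch predictions, the snippet below averages softmax scores from per-modality classifiers (e.g., instances of the HybridViTCNN sketched earlier); this score-averaging rule is an illustrative assumption, not necessarily the paper's exact fusion scheme.

```python
import torch

@torch.no_grad()
def ensemble_predict(rgb_model, depth_model, rgb, depth):
    # Average softmax scores from the RGB and depth branches,
    # then pick the most likely fine-grained class per object.
    probs = (torch.softmax(rgb_model(rgb), dim=1)
             + torch.softmax(depth_model(depth), dim=1)) / 2.0
    return probs.argmax(dim=1)
```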
Fine-grained object datasets: Among early FGVC datasets, Nilsback [5] contributed a fine-grained flower dataset with 17 different species, followed by the fine-grained Birds dataset containing 11,788 images from 200 bird species [3]. Since then, FGVC has gradually gained more attention; for example, the Stanford Dogs [7] and Cars [37] datasets were subsequently published. Fine-grained VegFru [38], consisting of vegetables and fruits, and Kuzushiji-MNIST [39] have been introduced more recently. Considering the limitations of RGB-only data, RGB-D images rapidly emerged in computer vision tasks because they provide additional rich information. For example, Andreas et al. [40] released an object segmentation dataset, which comprises 111 RGB-D images with stacked and occluding