
Enhancing Fine-Grained 3D Object Recognition
using Hybrid Multi-Modal Vision Transformer-CNN Models
Songsong Xiong1, Georgios Tziafas1 and Hamidreza Kasaei1
Abstract— Robots operating in human-centered environ-
ments, such as retail stores, restaurants, and households, are
often required to distinguish between similar objects in different
contexts with a high degree of accuracy. However, fine-grained
object recognition remains a challenge in robotics due to the
high intra-category and low inter-category dissimilarities. In
addition, the limited number of fine-grained 3D datasets poses
a significant problem in addressing this issue effectively. In this
paper, we propose a hybrid multi-modal Vision Transformer
(ViT) and Convolutional Neural Networks (CNN) approach to
improve the performance of fine-grained visual classification
(FGVC). To address the shortage of FGVC 3D datasets, we
generated two synthetic datasets. The first dataset consists
of 20 categories related to restaurants with a total of 100
instances, while the second dataset contains 120 shoe instances.
Our approach was evaluated on both datasets, and the results
indicate that it outperforms both CNN-only and ViT-only
baselines, achieving a recognition accuracy of 94.50% and
93.51% on the restaurant and shoe datasets, respectively. Ad-
ditionally, we have made our FGVC RGB-D datasets available
to the research community to enable further experimentation
and advancement. Furthermore, we successfully integrated our
proposed method with a robot framework and demonstrated
its potential as a fine-grained perception tool in both simulated
and real-world robotic scenarios.
I. INTRODUCTION
As labor shortages become increasingly prevalent, robots are
gaining popularity in human-centered environments [1], [2]. To
safely operate in such domains, the robot should be able
to recognize fine-grained objects accurately. For example,
a restaurant robot must be able to categorize drinks with
similar packaging but varying attributes. Similarly, a service
robot in a shoe store must sort shoes with
comparable appearances. Fine-grained visual categorization
(FGVC) has recently received considerable attention [3] as
it aims to identify sub-categories within the same basic-
level classes. However, it remains a challenging task due to
the high intra-category and low inter-category dissimilarity
issues [4]. Furthermore, the performance of FGVC is often
hindered due to limited available datasets [3].
Recently, most studies have focused on RGB fine-grained
recognition. These studies include the
CUB-200-2011 dataset [3], Oxford Flowers dataset [5], Air-
craft dataset [6], and Pets dataset [7]. Additionally, some
studies have employed both RGB and depth sensors to
perform a variety of robotic tasks, including object classification [8]–[12] and action recognition [13]. These studies have demonstrated that using multi-modal object representations enhances recognition accuracy.
1Department of Artificial Intelligence, University of Groningen,
Groningen, The Netherlands
{s.xiong, g.t.tziafas, hamidreza.kasaei}@rug.nl
Fig. 1. An example of a fine-grained object visual classification (FGVC)
scenario: (left) the robot uses an RGB-D camera to perceive the
environment; (right) the distribution of object categories across the
RGB-D feature space, visualized with a t-SNE plot. The plot shows that
distinguishing fine-grained categories such as knife, fork, and spoon
is harder than distinguishing basic-level categories such as mugs and
bottles.
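Fig. 1 visualizes object categories in the RGB-D feature space with a t-SNE plot. A minimal sketch of such a projection is shown below; the feature vectors, dimensionality, and class labels are synthetic placeholders, not the paper's actual features:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-object RGB-D feature vectors (512-D here),
# 30 objects drawn from 3 hypothetical categories, each shifted by a
# class-specific mean so the classes are separable.
class_means = rng.normal(size=(3, 512))
features = rng.normal(size=(30, 512)) + np.repeat(np.eye(3), 10, axis=0) @ class_means
labels = np.repeat(np.arange(3), 10)

# Project to 2-D for plotting; perplexity must be smaller than n_samples.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(
    features.astype(np.float32)
)
print(embedding.shape)  # (30, 2) — one 2-D point per object, ready to scatter-plot
```

Coloring the 2-D points by `labels` then reproduces the kind of cluster plot shown in Fig. 1.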
To the best of our knowledge, only two FGVC RGB-D datasets
are currently available [14], [15]: the first is limited to
hand-grasp classification, while the second, centered on
vegetables and fruits, is not publicly released. Furthermore, due
to their small scale, these datasets impose restrictions on
the performance of deep learning methods for FGVC [16].
Consequently, there is a lack of large-scale RGB-D datasets
that can be utilized for FGVC purposes. To address the
shortage of FGVC 3D datasets, we generated two synthetic
datasets. The first dataset consists of 20 categories related to
restaurants with a total of 100 instances, while the second
dataset contains 120 shoe instances.
Furthermore, we propose a hybrid multi-modal approach
based on ensembles of CNN-ViT networks to enhance the
accuracy of fine-grained 3D object recognition. An overview
of our approach is shown in Fig. 2. We performed an extensive
set of experiments to assess the performance of the proposed
approach. Experimental results show that our multi-modal
approach surpasses corresponding unimodal CNN-only and
ViT-only approaches in recognition accuracy. Additionally,
we have successfully integrated our proposed method with a
robot framework and demonstrated its potential as a fine-
grained perception tool in both simulated and real-world
robotic scenarios.
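This section does not specify how the CNN and ViT branches are combined; a common choice for such ensembles is score-level fusion, averaging the softmax probabilities of each backbone on each modality. The sketch below illustrates that rule with hypothetical logits (the class count, branch names, and values are placeholders, not the paper's actual configuration):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical per-branch logits for one object over 5 fine-grained classes:
# (CNN, ViT) x (RGB, depth) = 4 branch outputs.
branch_logits = {
    "cnn_rgb":   np.array([2.0, 0.1, 0.3, 0.2, 0.1]),
    "cnn_depth": np.array([1.5, 0.4, 0.2, 0.6, 0.1]),
    "vit_rgb":   np.array([1.8, 0.2, 0.5, 0.3, 0.2]),
    "vit_depth": np.array([1.2, 0.3, 0.4, 0.9, 0.2]),
}

# Score-level fusion: average the per-branch softmax probabilities,
# then take the arg-max as the final fine-grained label.
probs = np.mean([softmax(l) for l in branch_logits.values()], axis=0)
prediction = int(np.argmax(probs))
print(prediction)  # 0 — class 0 has the highest logit in every branch
```

Averaging probabilities rather than raw logits keeps each branch's contribution on a comparable scale regardless of how confident its logits are.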
In summary, our key contributions are twofold:
• We propose a hybrid multi-modal approach based on
ViT-CNN networks to enhance fine-grained 3D object
recognition.
arXiv:2210.04613v2 [cs.CV] 6 Mar 2023