
Object Goal Navigation Based on Semantics and RGB Ego View
Snehasis Banerjee1,2, Brojeshwar Bhowmick1 and Ruddra Dev Roychoudhury1
Abstract— This paper presents an architecture and methodology to empower a service robot to navigate an indoor environment with semantic decision making, given an RGB ego view. The method leverages knowledge of the robot's actuation capability and of scenes, objects, and their relations, represented in semantic form. The robot navigates based on the GeoSem map, a relational combination of a geometric and a semantic map. The goal given to the robot is to find an object in an unknown environment with no navigational map and only egocentric RGB camera perception. The approach is tested both in a simulation environment and in real-life indoor settings, and was found to outperform human users in gamified evaluations with respect to average completion time.
I. INTRODUCTION
Recent advances in AI and Robotics have led to the emergence of service robots capable of complex navigation tasks, for which there are two well-established paradigms: (a) geometric reconstruction and path-planning based approaches, and (b) end-to-end learning-based methods. To navigate an environment, the robot needs to sense the scene in some way. For such tasks, the camera has been the primary choice for perception, along with relevant distance-measuring sensors such as ultrasonic, infrared, and lidar. However, among the popular sensors, the RGB camera has a lower cost and higher availability. In fact, to make service robots more affordable and mainstream, a system using only RGB perception and relying on software intelligence to enable navigation is highly desired. This paper is an effort in that direction: it uses only RGB perception on a wheeled robot to enable smart navigation through fast semantic inferences.
Humans are generally very good at the task of navigation. If a person is asked to find an object in an unknown scene, their decision making will be based on visual cues in the current scene. Inspired by this human intuition, in this work the robot takes navigation decisions based on an amalgamation of the current zone's probability of containing the object, visible objects that are closely related to the target object, and visible occlusions that may hide the object, along with other factors like free space, obstacles, and risky or restricted areas.
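To make this concrete, below is a minimal sketch of how such cues could be combined into a per-direction navigation score. The factor names, weights, and linear combination are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: combining semantic cues into a navigation score.
# The weights and the linear form below are illustrative assumptions.

def navigation_score(zone_prob, related_objects, occluders,
                     free_space, risk):
    """Score a candidate direction from the current RGB ego view.

    zone_prob       -- prior probability that the current zone
                       contains the target (e.g. 'kitchen' for 'mug')
    related_objects -- detection confidences of objects semantically
                       close to the target (e.g. 'sink' when seeking 'mug')
    occluders       -- confidences of visible occluders that may hide
                       the target (e.g. 'cabinet', 'table')
    free_space      -- fraction of traversable free space in this direction
    risk            -- penalty for risky or restricted areas
    """
    w_zone, w_rel, w_occ, w_free, w_risk = 1.0, 0.8, 0.5, 0.3, 1.5
    return (w_zone * zone_prob
            + w_rel * sum(related_objects)
            + w_occ * sum(occluders)
            + w_free * free_space
            - w_risk * risk)

# Example: a direction toward a sink in a kitchen-like zone
print(navigation_score(0.7, [0.9], [0.4], 0.6, 0.0))
```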
A. Problem Description
The task is to navigate in an unmapped indoor environment from a starting point to a specified object's location; the task is marked as complete if the object is visible to a practical extent. The task needs to be completed based on the egocentric RGB view of the onboard robot camera.
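As an illustration, a success check of this kind could threshold both the detector's confidence and the apparent size of the target in the ego view. The thresholds and the detection tuple format below are assumptions, not the paper's exact criterion.

```python
# Illustrative success check; thresholds and the detection tuple
# format are assumptions for illustration only.

def goal_reached(detections, target_label,
                 min_conf=0.6, min_area_frac=0.02):
    """Return True if the target is visible 'to a practical extent'.

    detections   -- list of (label, confidence, bbox_area_fraction)
                    from an object detector on the current RGB frame
    target_label -- the object category the robot is searching for
    """
    for label, conf, area_frac in detections:
        if (label == target_label and conf >= min_conf
                and area_frac >= min_area_frac):
            return True
    return False

# Example: a confidently detected chair covering 5% of the frame
print(goal_reached([("chair", 0.85, 0.05)], "chair"))  # True
```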
The main contributions of the paper are as follows:
(i) an object goal navigation algorithm using the ‘GeoSem’ map and semantic knowledge, with an RGB ego-view image as input;
(ii) a system architecture to enable semantic navigation;
(iii) successful functional evaluation of the proposed methodology in both simulated and real-world indoor environments.

1TCS Research, Tata Consultancy Services, Kolkata. 2Primary author: snehasis.banerjee@tcs.com

Fig. 1: System Architecture for Semantic Navigation
II. RELATED WORK
Although there has been a considerable amount of work on the general object goal navigation problem, there has been limited work on searching for an object category using only an egocentric RGB camera mounted on a robot in indoor settings with semantics. A number of works [1] use deep neural networks for object-centric navigation policies. The closest work on Semantic Visual Navigation [2] uses scene priors to navigate to known as well as unknown objects in limited settings using Graph Convolutional Networks (GCNs). That work has the following limitations: (a) no concrete decision model when two or more objects are in the same frame; (b) actual motion planning is not formulated; (c) no testing on real-life scenarios; (d) the Deep Reinforcement Learning framework requires a significant amount of training to learn action policies. Our approach addresses these limitations: it tackles multiple objects in a single frame by design; path planning methods are integrated into the decision-making process; it has been tested in limited but real-world scenarios; it does not require extensive training; and it is based on knowledge-based decision making.
III. SYSTEM ARCHITECTURE
Fig. 1 presents the system architecture. For a wheeled robot, the action space is limited to forward or backward movement, rotation towards the left or right with respect to the ego view, and stop. The system takes in perception from an RGB camera mounted on the robot in first-person view and gives out an action control command for the robot to move. The next RGB view is the only feedback the robot receives after actuation.
However, as the action