Embodied Referring Expression for Manipulation
Question Answering in Interactive Environment
Qie Sima1, Sinan Tan1, Huaping Liu1,2,†
1Department of Computer Science and Technology, Tsinghua University
2Beijing National Research Center for Information Science and Technology, China
Abstract—With the progress of Embodied AI in recent years, embodied agents are expected to perform increasingly complicated tasks in interactive environments. Existing embodied tasks, including Embodied Referring Expression (ERE) and other QA-style tasks, mainly focus on interaction in terms of linguistic instruction. Enabling the agent to actively manipulate objects in the environment for exploration has therefore become a challenging problem for the community. To address this problem, we introduce a new embodied task, Remote Embodied Manipulation Question Answering (REMQA), which combines ERE with manipulation tasks. In the REMQA task, the agent needs to navigate to a remote position and manipulate the target object to answer the question. We build a benchmark dataset for the REMQA task in the AI2-THOR simulator. To solve the task, a framework with 3D semantic reconstruction and modular network paradigms is proposed. The evaluation of the proposed framework on the REMQA dataset is presented to validate its effectiveness.
Index Terms—Embodied AI, Referring Expression, Visual Semantics, Question Answering
I. INTRODUCTION
Recently, the AI community has witnessed the prosperity of Embodied AI, where agents are required to perform tasks in various forms with egocentric vision. The success of Embodied AI has sparked interest among researchers in the robotics community in transferring methods from off-the-shelf Embodied AI tasks to robot platforms.
Currently, most works in Embodied AI have revolved around the task of navigation, including position-goal, object-goal, and area-goal navigation [1]. However, the ability to actively manipulate objects and physically interact with the environment becomes crucial in embodied robot tasks, where agents need to perform complex tasks in the real world. As studies on embodied tasks have surged in recent years, a wide variety of embodied tasks has been proposed. However, very few works have looked into a general framework for embodied tasks that involves most of the modules of a real-world robot task: visual perception, language comprehension, active navigation, and manipulation. In an embodied robot task, localizing the target object precisely and effectively has always been a challenge, since many objects in real scenes are similar in shape and appearance (e.g., books on a shelf, cabinets in a kitchen).
This work was supported by the Seed Fund of Tsinghua University
(Department of Computer Science and Technology)-Siemens Ltd., China Joint
Research Center for Industrial Intelligence and Internet of Things.
†Corresponding author: Huaping Liu (hpliu@mail.tsinghua.edu.cn)
Referring Expression (RE) is a cross-modal vision-and-language task widely studied in both the computer vision and natural language processing fields. In an RE task, the agent needs to localize a specific target object in an image in response to a given natural language referring expression. Most current studies on referring expression focus on passive image datasets (e.g., RefCOCO, RefCOCO+ [2], RefCOCOg [3]), where samples do not change with the agent's decisions. Recently, referring expression tasks in embodied scenarios have emerged. In an Embodied Referring Expression (ERE) task, the agent is required to navigate to the position mentioned in the given expression in a 3D environment and complete a referring expression comprehension (REC) task on the final scene. However, in most of the above tasks, the process of navigating to the scene of the target object consists merely of spatial movement, without interaction with the surrounding environment such as opening closed objects or moving occlusions.
Therefore, we introduce a novel embodied task, Remote Embodied Manipulation Question Answering (REMQA), in which the agent is required to navigate to a remote position and manipulate the target object, which can be precisely localized by referring expression comprehension. The agent then infers the answer to the question from the post-manipulation layout of objects. As illustrated in Fig. 1, the input question contains a referring phrase that explicitly refers to the target object (the drawer). After navigating to the goal position (the toaster), the agent needs to localize the target by distinguishing it from other drawers via referring expression comprehension and then perform a manipulation action (opening the drawer) to obtain the answer.
In this work, we focus on the referring expression comprehension problem in the Manipulation Question Answering (MQA) task in a physically interactive environment. The main contributions of this work are listed below:
• Problem. A novel embodied robot task, Remote Embodied Manipulation Question Answering, which combines visual perception, language comprehension, and manipulation in an interactive environment.
• Dataset. A benchmark dataset for the proposed task, containing a set of indoor object arrangements across different rooms in an interactive environment, together with questions involving referring expressions about the objects in the environment.
• Method. A framework to handle the proposed task, in which a Language Attention Network and navigation with a 3D semantic memory prior are implemented. Experi-