Embodied Referring Expression for Manipulation
Question Answering in Interactive Environment
Qie Sima1, Sinan Tan1, Huaping Liu1,2
1Department of Computer Science and Technology, Tsinghua University
2Beijing National Research Center for Information Science and Technology, China
Abstract—With the progress of Embodied AI in recent years, embodied agents are expected to perform more complicated tasks in interactive environments. Existing embodied tasks, including Embodied Referring Expression (ERE) and other QA-form tasks, mainly focus on interaction in terms of linguistic instructions. Therefore, enabling the agent to actively manipulate objects in the environment for exploration has become a challenging problem for the community. To solve this problem, we introduce a new embodied task, Remote Embodied Manipulation Question Answering (REMQA), which combines ERE with manipulation tasks. In the REMQA task, the agent needs to navigate to a remote position and perform manipulation on the target object to answer the question. We build a benchmark dataset for the REMQA task in the AI2-THOR simulator and propose a framework with 3D semantic reconstruction and modular network paradigms to tackle it. An evaluation of the proposed framework on the REMQA dataset is presented to validate its effectiveness.
Index Terms—Embodied AI, Referring Expression, Visual Semantics, Question Answering
I. INTRODUCTION
Recently, the AI community has witnessed the prosperity of Embodied AI, where agents are required to perform tasks in various forms with egocentric vision. The success of Embodied AI has drawn the interest of researchers in the robotics community toward transferring methods from off-the-shelf Embodied AI tasks to robot platforms.
Currently, most works in Embodied AI revolve around the task of navigation, including position-goal, object-goal, and area-goal navigation [1]. However, the ability to actively manipulate objects and physically interact with the environment becomes crucial in embodied robot tasks, where agents need to perform complex tasks in the real world. As studies on embodied tasks have surged in recent years, a wide variety of embodied tasks has been proposed. However, very few works have looked into a general framework for embodied tasks that involves most of the modules of a real-world robot task: visual perception, language comprehension, active navigation, and manipulation. In an embodied robot task, localizing the target object precisely and effectively has always been a challenge, since many objects in real scenes are similar in shape and appearance (e.g., books on a shelf, cabinets in a kitchen).
This work was supported by the Seed Fund of Tsinghua University (Department of Computer Science and Technology)-Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things.
Corresponding author: Huaping Liu (hpliu@mail.tsinghua.edu.cn)
Referring Expression (RE) is a widely studied cross-modal task in both the computer vision and natural language processing fields. In an RE task, the agent needs to localize a specific target object in an image in response to a given natural language referring expression. Most current studies on referring expressions focus on passive image datasets (e.g., RefCOCO, RefCOCO+ [2], RefCOCOg [3]), where samples do not change with the agent's decisions. Recently, referring expression tasks in embodied scenarios have emerged. In an Embodied Referring Expression (ERE) task, the agent is required to navigate to the position mentioned in the given expression in a 3D environment and complete the referring expression comprehension (REC) task on the final scene. However, in most of the above tasks, the process of navigating to the target object's scene merely consists of spatial movements without interaction with the surrounding environment, such as opening closed objects or moving occlusions.
Therefore, we introduce a novel embodied task, Remote Embodied Manipulation Question Answering (REMQA), where the agent is required to navigate to a remote position and manipulate the target object, which it localizes precisely through referring expression comprehension. The agent then infers the answer to the question from the post-manipulation layout of objects. As illustrated in Fig. 1, the input question contains a referring phrase that explicitly refers to the target object (a drawer). After navigating to the goal position (near the toaster), the agent needs to localize the target by distinguishing it from the other drawers with referring expression comprehension and perform a manipulation action (opening the drawer) to obtain the answer.
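To make the episode flow concrete, the following sketch runs one such interaction with the real AI2-THOR Python API (Controller, Teleport, OpenObject; metadata field names follow recent AI2-THOR releases). The scene, question, teleport pose, and the localize_target placeholder are purely illustrative assumptions, not the released benchmark or our model.

    from ai2thor.controller import Controller

    def localize_target(question, frame, candidates):
        # Placeholder for referring expression comprehension: a real agent
        # would score each candidate drawer against the referring phrase;
        # here we simply take the first visible one.
        return candidates[0]

    controller = Controller(scene="FloorPlan1")   # an AI2-THOR kitchen scene
    question = "What is inside the drawer beneath the toaster?"  # illustrative

    # 1) Navigate to the goal position (a hard-coded teleport for brevity;
    #    the actual agent issues MoveAhead/RotateRight/... actions stepwise).
    event = controller.step(action="Teleport",
                            position=dict(x=1.0, y=0.9, z=-1.5))

    # 2) Localize the referred drawer among all visible drawers.
    drawers = [o for o in event.metadata["objects"]
               if o["objectType"] == "Drawer" and o["visible"]]
    target = localize_target(question, event.frame, drawers)

    # 3) Manipulate: open the drawer to reveal its contents.
    event = controller.step(action="OpenObject", objectId=target["objectId"])

    # 4) Answer from the post-manipulation object layout: report objects
    #    whose parent receptacle is the opened drawer.
    contents = [o["objectType"] for o in event.metadata["objects"]
                if target["objectId"] in (o.get("parentReceptacles") or [])]
    print(contents if contents else "empty")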
In this work, we focus on the referring expression comprehension problem for the Manipulation Question Answering (MQA) task in a physically interactive environment. The main contributions of this work are listed below:
• Problem. A novel embodied robot task, Remote Embodied Manipulation Question Answering, consisting of visual perception, language comprehension, and manipulation in an interactive environment.
• Dataset. A benchmark dataset for the proposed task, comprising indoor object arrangements in different rooms of an interactive environment and questions with referring expressions about the objects in the environment.
• Method. A framework to handle the proposed task, in which a Language Attention Network and navigation with a 3D semantic memory prior (sketched below) are implemented. Experimental validation of the proposed model has been conducted in an interactive environment with a physics engine.
Fig. 1: A demonstration of the Remote Embodied Manipulation Question Answering task. The agent needs to navigate to the
goal position, localize the target object and perform manipulation to answer the question.
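As a rough illustration of the 3D semantic memory mentioned in the Method contribution, the sketch below fuses one RGB-D observation into a world-frame voxel grid that counts semantic classes per voxel. The grid extent, voxel size, and pinhole camera model are our own illustrative assumptions rather than the exact design detailed in Section IV.

    import numpy as np

    NUM_CLASSES = 50      # assumed number of semantic classes
    VOXEL_SIZE = 0.25     # meters per voxel (assumed)
    GRID = (64, 16, 64)   # (x, y, z) voxel extent of the map (assumed)

    memory = np.zeros(GRID + (NUM_CLASSES,), dtype=np.int32)

    def update_memory(depth, seg, K, cam_to_world):
        """Fuse one observation: depth (H, W) in meters, seg (H, W) class
        ids, K (3, 3) intrinsics, cam_to_world (4, 4) camera pose."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        z = depth.ravel()
        # Pinhole back-projection of every pixel to a camera-frame 3D point.
        x = (u.ravel() - K[0, 2]) * z / K[0, 0]
        y = (v.ravel() - K[1, 2]) * z / K[1, 1]
        pts = np.stack([x, y, z, np.ones_like(z)])      # (4, H*W)
        world = (cam_to_world @ pts)[:3].T              # (H*W, 3)
        # Voxelize (map origin at the grid center) and accumulate counts.
        idx = np.floor(world / VOXEL_SIZE).astype(int) + np.array(GRID) // 2
        ok = np.all((idx >= 0) & (idx < np.array(GRID)), axis=1) & (z > 0)
        np.add.at(memory,
                  (idx[ok, 0], idx[ok, 1], idx[ok, 2], seg.ravel()[ok]), 1)

    # Navigation can then read a spatial prior over a target class, e.g.
    # voxels ever observed as a drawer:
    # drawer_prior = memory[..., DRAWER_CLASS_ID] > 0

A memory of this form lets the navigation module propose likely target locations before the target is visible, which is the role such a prior plays in the framework.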
In the rest of the paper, Section II presents a review of related works. Section III summarizes the data used for pretraining and introduces the proposed benchmark dataset. Section IV details our proposed model for the Remote Embodied Manipulation Question Answering task in an interactive environment. Section V presents the experimental results, and Section VI concludes this work.
II. RELATED WORK
A. Referring Expression
Referring Expression on Static Datasets: Most works on referring expression comprehension focus on datasets built from classical static visual datasets (COCO, Flickr30k, etc.). Specifically, RE tasks can be categorized into two kinds with respect to the labels used for localization: 1) Referring Expression Comprehension (REC), which outputs a bounding box, and 2) Referring Expression Segmentation (RES), which outputs a segmentation mask. For the REC task, Mao et al. [4] introduce the first CNN-LSTM method, MMI, as a general solution to REC. Yu et al. propose a visual comparative method (Visdif) that distinguishes the target object from the surrounding objects rather than extracting features by CNN alone. Furthermore, Yu et al. [5] propose MAttNet, a Modular Attention Network that decomposes referring expressions into different modular channels for accurate matching. Besides CNN-LSTM methods, some works [6], [7] model the relationship between images and expressions, and others [8] utilize pre-trained vision-and-language models for the REC task. For the RES task, Li et al. [9] propose a multi-modal LSTM for vision-language fusion. To obtain more accurate results on long referring expressions, Shi et al. [10] employ an attention mechanism in their keyword-aware network. Luo et al. [11] introduce the Multi-task Collaborative Network (MCN) as a joint learning framework for RES.
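To ground the REC formulation above, the following is a minimal PyTorch sketch of a generic CNN-LSTM style scorer: an LSTM encodes the referring expression, pre-extracted region features stand in for the CNN side, and a dot product ranks the candidate bounding boxes. Dimensions and the scoring head are illustrative choices, not the exact MMI [4] or MAttNet [5] architecture.

    import torch
    import torch.nn as nn

    class RECScorer(nn.Module):
        def __init__(self, vocab_size, embed_dim=300,
                     hidden_dim=512, visual_dim=2048):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.vis_proj = nn.Linear(visual_dim, hidden_dim)

        def forward(self, tokens, region_feats):
            # tokens: (B, T) word ids; region_feats: (B, R, visual_dim)
            _, (h, _) = self.lstm(self.embed(tokens))
            expr = h[-1]                           # (B, hidden_dim)
            regions = self.vis_proj(region_feats)  # (B, R, hidden_dim)
            # Dot-product score of every region against the expression.
            scores = torch.bmm(regions, expr.unsqueeze(-1)).squeeze(-1)
            return scores                          # (B, R)

    model = RECScorer(vocab_size=10000)
    scores = model(torch.randint(0, 10000, (2, 8)), torch.randn(2, 5, 2048))
    pred_region = scores.argmax(dim=1)  # index of the predicted bounding box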
Embodied Referring Expression: Due to the absence of interaction in conventional referring expression tasks, researchers have recently tried to transplant referring expression tasks to embodied scenarios, and several ERE tasks and datasets have been released in recent years. Most proposed ERE tasks can be classified into two main categories with respect to platform: 1) ERE tasks in manipulator scenarios, e.g., INGRESS [12], and 2) ERE tasks in mobile navigation scenarios, e.g., REVERIE [13], Touchdown-SDR [14], REVE-CE [15], and ALFRED [16]. The community has developed several methods that enable agents to tackle embodied tasks requiring active interaction with the environment. Wu et al. [13] propose a Navigator-Pointer model as a baseline for the REVERIE dataset. Gao et al. employ a room- and object-aware attention mechanism and a transformer architecture on REVERIE. Lin et al. [17] pre-train the agent with cross-modal alignment sub-tasks for the ERE task.
B. Embodied Robot Task
As an intersection of robotics, computer vision and natural
language processing, the study of embodied robot tasks has
gained much attention from all the above fields. A wide variety
of embodied tasks has been formulated in recent years. The
off-shelf embodied robot tasks can be categorized into two
main types: Visual Navigation and Question Answering.
Visual Navigation Tasks: Vision-and-Language Navigation (VLN) [18] and Visual Semantic Navigation (VSN) [19] require the agent to actively navigate to the goal position following linguistic information: language instructions for VLN and semantic object categories for VSN.