Embodied Referring Expression for Manipulation
Question Answering in Interactive Environment
Qie Sima1, Sinan Tan1, Huaping Liu1,2,†
1Department of Computer Science and Technology, Tsinghua University
2Beijing National Research Center for Information Science and Technology, China
Abstract—With the progress of Embodied AI in recent years, embodied agents are expected to perform increasingly complicated tasks in interactive environments. Existing embodied tasks, including Embodied Referring Expression (ERE) and other QA-style tasks, mainly focus on interaction in terms of linguistic instruction. Enabling the agent to actively manipulate objects in the environment for exploration has therefore become a challenging problem for the community. To address this problem, we introduce a new embodied task, Remote Embodied Manipulation Question Answering (REMQA), which combines ERE with manipulation tasks. In the REMQA task, the agent needs to navigate to a remote position and manipulate the target object to answer the question. We build a benchmark dataset for the REMQA task in the AI2-THOR simulator. To solve the task, a framework with 3D semantic reconstruction and modular network paradigms is proposed. The evaluation of the proposed framework on the REMQA dataset is presented to validate its effectiveness.
Index Terms—Embodied AI, Referring Expression, Visual Semantics, Question Answering
I. INTRODUCTION
Recently, the AI community has witnessed the prosperity of Embodied AI, where agents are required to perform tasks in various forms with egocentric vision. The success of Embodied AI has sparked interest among researchers in the robotics community in transferring methods from off-the-shelf Embodied AI tasks to robot platforms.
Currently, most works in Embodied AI have revolved around the task of navigation, including position-goal, object-goal, and area-goal navigation [1]. However, the ability to actively manipulate objects and physically interact with the environment becomes crucial in embodied robot tasks, where agents need to perform complex tasks in the real world. As studies on embodied tasks have surged in recent years, a wide variety of embodied tasks has been proposed. However, very few works have looked into a general framework for embodied tasks that involves most of the modules of a real-world robot task: visual perception, language comprehension, active navigation, and manipulation. In an embodied robot task, localizing the target object precisely and effectively has always been a challenge, since many objects in real scenes are similar in shape and appearance (e.g., books on a shelf, cabinets in a kitchen).
This work was supported by the Seed Fund of Tsinghua University
(Department of Computer Science and Technology)-Siemens Ltd., China Joint
Research Center for Industrial Intelligence and Internet of Things.
†Corresponding author: Huaping Liu (hpliu@mail.tsinghua.edu.cn)
Referring Expression (RE) is a cross-modal vision-and-language task widely studied in both the computer vision and natural language processing fields. In an RE task, the agent needs to localize a specific target object in an image in response to a given natural language referring expression. Most current studies on referring expression focus on passive image datasets (e.g., RefCOCO, RefCOCO+ [2], RefCOCOg [3]), where samples do not change with the agent's decisions. Recently, referring expression tasks in embodied scenarios have emerged. In an Embodied Referring Expression (ERE) task, the agent is required to navigate to the position mentioned in the given expression in a 3D environment and complete a referring expression comprehension (REC) task on the final scene. However, in most of the above tasks, the process of navigating to the scene of the target object consists merely of spatial movement, without interaction with the surrounding environment such as opening closed objects or moving occlusions.
Therefore, we introduce a novel embodied task, Remote Embodied Manipulation Question Answering (REMQA), in which the agent is required to navigate to a remote position and manipulate the target object, which can be precisely localized by referring expression comprehension. The agent then infers the answer to the question from the post-manipulation layout of objects. As illustrated in Fig. 1, the input question contains a referring phrase that explicitly refers to the target object (the drawer). After navigating to the goal position (the toaster), the agent needs to localize the target by distinguishing it from other drawers via referring expression comprehension and then perform a manipulation action (opening the drawer) to obtain the answer.
In this work, we focus on the referring expression comprehension problem in the Manipulation Question Answering (MQA) task in a physically interactive environment. The main contributions of this work are listed below:
• Problem. A novel embodied robot task, Remote Embodied Manipulation Question Answering, which combines visual perception, language comprehension, and manipulation in an interactive environment.
• Dataset. A benchmark dataset for the proposed task, containing a set of indoor object arrangements across different rooms in an interactive environment, together with questions involving referring expressions about the objects in the environment.
• Method. A framework to handle the proposed task, in which a Language Attention Network and navigation with a 3D semantic memory prior are implemented. Experi-