DANLI: Deliberative Agent for Following Natural Language Instructions
Yichi Zhang Jianing Yang Jiayi Pan Shane Storks Nikhil Devraj
Ziqiao Ma Keunwoo Peter Yu Yuwei Bao Joyce Chai
Computer Science and Engineering Division, University of Michigan
zhangyic@umich.edu
Abstract
Recent years have seen an increasing amount of work on embodied AI agents that can perform tasks by following human language instructions. However, most of these agents are reactive, meaning that they simply learn and imitate behaviors encountered in the training data. Such reactive agents are insufficient for long-horizon complex tasks. To address this limitation, we propose a neuro-symbolic deliberative agent that, while following language instructions, proactively applies reasoning and planning based on its neural and symbolic representations acquired from past experience (e.g., natural language and egocentric vision). We show that our deliberative agent achieves greater than 70% improvement over reactive baselines on the challenging TEACh benchmark. Moreover, the underlying reasoning and planning processes, together with our modular framework, offer impressive transparency and explainability for the agent's behaviors. This enables an in-depth understanding of the agent's capabilities, which sheds light on challenges and opportunities for future embodied agents for instruction following. The code is available at https://github.com/sled-group/DANLI.
1 Introduction
Natural language instruction following with embodied AI agents (Chai et al., 2018; Anderson et al., 2018; Thomason et al., 2019; Qi et al., 2020; Shridhar et al., 2020; Padmakumar et al., 2021) is a notoriously difficult problem, in which an agent must interpret human language commands to perform actions in the physical world and achieve a goal. Especially challenging is the hierarchical nature of everyday tasks,[1] which often require reasoning about subgoals and reconciling them with the world state and the overall goal. However, despite recent progress, past approaches are typically reactive (Wooldridge, 1995) in their execution of actions: conditioned on the rich, multimodal inputs from the environment, they perform actions directly, without using an explicit representation of the world to facilitate grounded reasoning and planning (Pashevich et al., 2021; Zhang and Chai, 2021; Sharma et al., 2022). Such an approach is inefficient, as natural language instructions often omit trivial steps that a human may be assumed to already know (Zhou et al., 2021). Moreover, the lack of any explicit symbolic component makes such approaches hard to interpret, especially when the agent makes errors.

[1] For example, making breakfast may require preparing one or more dishes (e.g., toast and coffee), each of which requires several sub-tasks of navigating through the environment and manipulating objects (e.g., finding a knife, slicing bread, cooking it in the toaster), and even more fine-grained primitive actions entailed by them (e.g., walk forward, pick up knife).
Inspired by previous work toward deliberative agents in robotic task planning, which apply long-term action planning over known world and goal states (She et al., 2014; Agia et al., 2022; Srivastava et al., 2021; Wang et al., 2022), we introduce DANLI, a neuro-symbolic Deliberative Agent for following Natural Language Instructions. DANLI combines learned symbolic representations of task subgoals and the surrounding environment with a robust symbolic planning algorithm to execute tasks. First, we build a uniquely rich semantic spatial representation (Section 3.1), acquired online from the surrounding environment and language descriptions, to capture symbolic information about object instances and their physical states. To capture the highest level of the task hierarchy, we propose a neural task monitor (Section 3.2) that learns to extract symbolic information about task progress and upcoming subgoals from the dialog and action history. Using these elements as a planning domain, we lastly apply an online planning algorithm (Section 3.3) to plan low-level actions for subgoals in the environment, taking advantage of DANLI's transparent reasoning and planning pipeline to detect and recover from errors.
Figure 1: An example task in TEACh.
Our empirical results demonstrate that our deliberative DANLI agent outperforms reactive approaches with better success rates and overwhelmingly more efficient policies on the challenging Task-driven Embodied Agents that Chat (TEACh) benchmark (Padmakumar et al., 2021). Importantly, due to its interpretable symbolic representation and explicit reasoning mechanisms, our approach offers detailed insights into the agent's planning, manipulation, and navigation capabilities. This gives the agent a unique self-awareness about the kinds of exceptions that have occurred, and therefore makes it possible to adapt strategies to cope with exceptions and continually strengthen the system.
2 Problem Definition
The challenge of hierarchical tasks is prominent in the recent Task-driven Embodied Agents that Chat (TEACh) benchmark for this problem (Padmakumar et al., 2021). Here, language instructions are instantiated as a task-oriented dialog between the agent and a commander (who has comprehensive knowledge about the task and environment, but cannot perform any actions), with varying granularity and completeness of guidance given. We focus on the Execution from Dialog History (EDH) setting in TEACh, where the agent is given a dialog history as input and is expected to execute a sequence of actions to achieve the goal set out by the commander. This setting allows us to abstract away the problem of dialog generation and focus on the already difficult problem of instruction following from task-oriented dialog.

As shown in Figure 1, a task, e.g., Make a Sandwich, may have several subtasks that the agent must achieve in order to satisfy the overall task goal. The success of a task or subtask is determined by meeting a set of goal conditions, such as slicing bread and toasting two of the slices. At each timestep, the agent receives an egocentric visual observation of the world and the full dialog history up to that time, and may execute a single low-level action. Actions can involve either navigation, e.g., stepping forward, or manipulation, e.g., picking up an object. Manipulation actions additionally require the agent to identify the action's target object by specifying a pixel in its field of view to highlight the object. Execution continues until the agent predicts a Stop action; otherwise, the session terminates after 1000 timesteps or 30 failed actions. At this point, we can evaluate the agent's completion of the task.
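To make this setup concrete, below is a minimal sketch of the EDH episode loop in Python. The env and agent interfaces are hypothetical stand-ins rather than the TEACh API; only the termination rules come from the description above.

```python
# A minimal sketch of the EDH episode loop described above. The env/agent
# interfaces are hypothetical stand-ins, not the TEACh API; only the
# termination rules (Stop, 1000 steps, 30 failed actions) come from the text.
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_STEPS = 1000
MAX_FAILED_ACTIONS = 30

@dataclass
class Action:
    name: str                                       # e.g., "Forward" or "Pickup"
    target_pixel: Optional[Tuple[int, int]] = None  # required for manipulation

def run_edh_episode(env, agent, dialog_history):
    """Run one EDH session until Stop, step limit, or failure limit."""
    failed = 0
    observation = env.reset()
    for _ in range(MAX_STEPS):
        # The agent sees the egocentric observation and full dialog history.
        action = agent.act(observation, dialog_history)
        if action.name == "Stop":
            break
        observation, success = env.step(action)
        failed += 0 if success else 1
        if failed >= MAX_FAILED_ACTIONS:
            break
    # Evaluation checks how many goal conditions the agent satisfied.
    return env.evaluate_goal_conditions()
```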
It is worth noting that while we focus on TEACh, our approach is largely transferable across benchmark datasets and simulation environments, albeit requiring retraining of some components.
3 A Neuro-Symbolic Deliberative Agent
An overview of our neuro-symbolic deliberative agent is shown in Figure 2. We first introduce the symbolic notions used in our system. We use an object-oriented representation (Diuk et al., 2008) to represent the symbolic world state. Each object instance is assigned an instance ID consisting of its canonicalized class name and an ordinal. We define a state, in the form Predicate(Arguments), as an object's physical state or its relation to another object. We define subgoals as particular states that the agent should achieve while completing the task, represented by the symbolic form (Patient, Predicate, Destination)[2], where the Patient and Destination are object classes, and the Predicate is a state that can be applied to the Patient. We define an action in the agent's plan as ActionType(Arguments), where each argument is an object instance.

[2] isPlacedTo is the only predicate with a Destination.
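As a concrete (though hypothetical) encoding, these symbolic notions could be represented with simple Python dataclasses; the paper does not prescribe this particular implementation.

```python
# A minimal sketch of the symbolic notions defined above, using plain
# dataclasses as an assumed encoding.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ObjectInstance:
    cls: str        # canonicalized class name, e.g., "Bread"
    ordinal: int    # e.g., 0, giving instance ID "Bread_0"

    @property
    def instance_id(self) -> str:
        return f"{self.cls}_{self.ordinal}"

@dataclass(frozen=True)
class State:
    predicate: str              # e.g., "isSliced", "isOn"
    arguments: Tuple[str, ...]  # object instance IDs

@dataclass(frozen=True)
class Subgoal:
    patient: str                      # object class, e.g., "BreadSlice"
    predicate: str                    # e.g., "isPlacedTo"
    destination: Optional[str] = None # only used with isPlacedTo

@dataclass(frozen=True)
class PlanAction:
    action_type: str            # e.g., "Place"
    arguments: Tuple[str, ...]  # instance IDs, e.g., ("Knife_1", "CounterTop_0")

# Examples from Figure 2:
slice_bread = Subgoal("Bread", "isSliced")
toast = Subgoal("BreadSlice", "isPlacedTo", "Toaster")
place = PlanAction("Place", ("Knife_1", "CounterTop_0"))
```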
To complete tasks, our agent reasons over a learned spatial-symbolic map representation (Section 3.1) to generate a hierarchical plan. At the high level, it applies a neural language model to the dialog and action history to predict the complete sequence of completed and future subgoals in symbolic form (Section 3.2). For each predicted subgoal, it then plans a sequence of low-level actions online, using both the symbolic subgoal and the world representations, with robustness to various types of planning and execution failures (Section 3.3).
Figure 2: Illustration of our agent's reasoning process behind a single decision step. After receiving the current observation, the agent first updates its internal representation (orange), then checks the current subgoal progress (blue), and plans the next steps (green). Finally, the first action in the plan is popped and grounded to the agent's egocentric view for execution. The pop-up boxes show example object instances with their instance IDs, states, and positions in the 3D map. New instances and state changes are labeled in green. The status of each subgoal and action is labeled in front of it, with arrows denoting status transitions.
Next, we describe how each component works and highlight our key innovations.
3.1 World Representation Construction
The reasoning process of an embodied AI agent relies heavily on a strong internal representation of the world. As shown in Figure 2, we implement the internal representation as a semantic map incorporating rich symbolic information about object instances and their physical states. In this section, we introduce our methods for constructing this representation.
3D Semantic Voxel Map  As the agent moves through the environment while completing a task, it constructs a 3D semantic voxel map to model its spatial layout. Following Blukis et al. (2022), we use a depth estimator to project the pixels of egocentric observation images and detected objects into a 3D point cloud, and bin the points into 0.25 m³ voxels. The resulting map helps the symbolic planner (Section 3.3) break down high-level navigation actions, such as GoTo(Knife_0), into atomic navigation actions such as Forward, TurnLeft, and LookUp.[3]

[3] See Appendix A.4 for more details on path planning.
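As an illustration, the following is a rough sketch of one voxel-map update step, assuming a standard pinhole camera model and a known camera pose; the actual geometry utilities in DANLI may differ.

```python
# A sketch of the voxel-map update: back-project a depth image to a 3D
# point cloud and bin the points into 0.25 m voxels. The pinhole model
# and data layout are assumptions, not DANLI's exact implementation.
import numpy as np

VOXEL_SIZE = 0.25  # meters, matching the 0.25 m^3 bins described above

def update_voxel_map(voxel_map, depth, semantics, intrinsics, cam_to_world):
    """depth: (H, W) estimated depth in meters.
    semantics: (H, W) per-pixel object class IDs from the segmenter.
    intrinsics: 3x3 camera matrix K.
    cam_to_world: 4x4 camera pose in the world frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pixel -> camera-frame point cloud (homogeneous coordinates).
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    # Bin points into voxels; record the observed class per voxel.
    voxels = np.floor(pts_world / VOXEL_SIZE).astype(int)
    for (i, j, k), cls in zip(voxels, semantics.reshape(-1)):
        voxel_map[(int(i), int(j), int(k))] = cls
    return voxel_map
```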
Object Instance Lookup Table  Everyday tasks can involve multiple instances of the same object, and thus modeling only object class information may be insufficient.[4]

[4] For example, when making a sandwich, the agent will likely need to distinguish the top and bottom pieces of bread to make the sandwich complete.
As shown in the internal representation update part of Figure 2, we store object instance information for a single task episode in a symbolic lookup table, where each instance in the environment is assigned a unique ID once observed. These symbols in the lookup table become the planning domain of the symbolic planner (Section 3.3). To collect this symbolic lookup table, we use a panoptic segmentation model[5] to detect all object instances in the current 2D egocentric visual frame. These 2D instance detections are then projected into the 3D map, and we use each instance's 3D centroid and size information to match and update existing object instances' information in the lookup table.[6] As the agent moves through the scene and receives more visual observations, the symbolic lookup table becomes more complete and accurate.

[5] As opposed to a semantic segmentation model as used in prior work (Chaplot et al., 2020; Min et al., 2022; Blukis et al., 2022), which can only detect object class information.
[6] To decide whether a newly detected instance should be merged with an existing instance or added as a new one, we use a matching algorithm described in Appendix A.5.
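A simplified stand-in for this matching step is sketched below; the paper's actual algorithm is in its Appendix A.5, and the distance threshold here is an invented placeholder.

```python
# A simplified stand-in for instance matching: merge a new 3D detection
# with an existing instance of the same class if their centroids are
# close, otherwise register a new instance ID.
import numpy as np

MATCH_DIST = 0.5  # meters; an assumed threshold, not from the paper

def register_detection(lookup, cls, centroid, size):
    """lookup maps instance_id -> {"cls", "centroid", "size"} (np arrays)."""
    best_id, best_dist = None, float("inf")
    for inst_id, info in lookup.items():
        if info["cls"] != cls:
            continue
        d = np.linalg.norm(info["centroid"] - centroid)
        if d < best_dist:
            best_id, best_dist = inst_id, d
    if best_id is not None and best_dist < MATCH_DIST:
        # Merge: refine the stored estimate with the new observation.
        lookup[best_id]["centroid"] = (lookup[best_id]["centroid"] + centroid) / 2
        lookup[best_id]["size"] = np.maximum(lookup[best_id]["size"], size)
        return best_id
    # New instance: assign the next ordinal for this class, e.g., "Bread_1".
    new_id = f"{cls}_{sum(1 for i in lookup.values() if i['cls'] == cls)}"
    lookup[new_id] = {"cls": cls, "centroid": centroid, "size": size}
    return new_id
```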
Physical State Prediction  Additionally, tasks can hinge upon the physical states of particular object instances. For example, when making coffee, the agent should disambiguate dirty and clean coffee mugs and make sure to use the clean one. To recognize the physical state of each object instance, we propose a physical state classification model whose inputs are the image region of a detected object instance and its class identifier, and whose output is a set of physical state labels for the instance. As classifying physical states from visual observation alone can introduce errors, we also incorporate the effects of the agent's actions into the physical state predictions. For example, the isToggledOn attribute is automatically updated after the agent applies the ToggleOn action, overriding the classifier's prediction.
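The interplay between the visual classifier and action effects might look like the following sketch; the effect rules here are illustrative examples rather than the paper's full effect model.

```python
# A sketch of combining the visual state classifier with action effects:
# after an action, its known deterministic effect overrides the
# classifier's prediction for the target instance.
ACTION_EFFECTS = {
    "ToggleOn":  {"isToggledOn": True},
    "ToggleOff": {"isToggledOn": False},
    "Slice":     {"isSliced": True},
}

def update_physical_state(instance_states, inst_id, classifier_pred,
                          last_action=None):
    """instance_states: inst_id -> {attribute: bool}.
    classifier_pred: labels from vision, e.g., {"isDirty": False}.
    last_action: optional (action_type, target_instance_id) tuple."""
    state = instance_states.setdefault(inst_id, {})
    state.update(classifier_pred)
    if last_action is not None:
        action_type, target = last_action
        if target == inst_id:
            # Deterministic action effects take precedence over vision.
            state.update(ACTION_EFFECTS.get(action_type, {}))
    return state
```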
3.2 Subgoal-Based Task Monitoring
Due to the hierarchical nature of tasks, natural language instructions may express a mix of high-level and low-level instructions. In order to monitor and control the completion of a long-horizon task given such complex inputs, we first model the sequence of high-level subgoals, i.e., the key intermediate steps necessary to complete it.

As shown in Figure 3, we apply a sequence-to-sequence approach powered by language models to learn subgoals from the dialog and action history. At the beginning of each session, our agent uses these inputs to predict the sequence of all subgoals. Our key insight is that to better predict the subgoals to do, it is also important to infer what has been done. As such, we propose to additionally predict the completed subgoals, and include the agent's action history as an input to support the prediction.

To take advantage of the power of pre-trained language models for this type of problem, all inputs and outputs are translated into language form. First, we convert the agent's action history into synthetic language (e.g., PickUp(Cup) → "get cup") and feed it, together with the history of dialog utterances, into the encoder. We then decode language expressions for subgoals one by one in an autoregressive manner. As the raw outputs from the decoder can often be noisy due to ambiguous or incomplete language expressions, we add a natural-in, structure-out decoder, which learns to classify each of the subgoal components into its symbolic form and transforms them back into a language phrase used as decoder input to predict the next subgoals.
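The translation logic around the predictor might look like the following sketch; the templates and input formatting are assumptions based on the examples in Figure 3, not the paper's exact implementation.

```python
# A sketch of the input/output translation around the subgoal predictor.
# The templates below are illustrative; the seq2seq model itself (a
# pre-trained encoder-decoder with a structure-out head) is omitted.
ACTION_TEMPLATES = {
    "PickUp": "get {patient}",
    "Place":  "place {patient} to {dest}",
    "Slice":  "slice {patient}",
}

def actions_to_synthetic_language(action_history):
    """E.g., [("PickUp", "Cup", None)] -> 'get cup'."""
    phrases = []
    for action_type, patient, dest in action_history:
        tmpl = ACTION_TEMPLATES[action_type]
        phrases.append(tmpl.format(patient=patient.lower(),
                                   dest=(dest or "").lower()).strip())
    return " ; ".join(phrases)

def build_encoder_input(dialog_history, action_history):
    # Dialog utterances and verbalized actions are concatenated as text.
    return " ".join(dialog_history) + " " + actions_to_synthetic_language(action_history)

# Decoding then alternates between a symbolic classification of each
# subgoal component (Patient, Predicate, Destination) and a re-verbalized
# phrase fed back into the decoder, e.g. (following Figure 3c):
#   "Completed subgoals : get cup ;"            -> (Cup, isPickedUp)
#   "Future subgoals : place cup to table ;"    -> (Cup, isPlacedTo, Table)
```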
3.3 Online Symbolic Planning
Symbolic planners excel at generating reliable and interpretable plans.
Figure 3: Overview of the subgoal learning process. Figure (a) shows the model architecture. Figure (b) shows the different input/output configurations we experiment with. Figure (c) illustrates the decoding process: we first predict completed subgoals and then predict the future subgoals conditioned on them, using different prompts to distinguish the two types of subgoals.
Given predicted subgoals and a constructed spatial-symbolic representation, PDDL (Aeronautiques et al., 1998) planning algorithms can be applied to generate a plan for each subgoal.[7] These short-horizon planning problems reduce the chance of drifting from the plan during execution.
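For illustration, a predicted subgoal might be compiled into a PDDL goal expression as in the sketch below; the predicate naming follows the paper's notation, but the exact domain encoding (described in its Appendix A.7.1) may differ.

```python
# A sketch of turning a predicted subgoal triple into a PDDL goal string.
# The typed-exists encoding is an assumption about the domain design.
def subgoal_to_pddl_goal(subgoal):
    """(Patient, Predicate, Destination) -> a PDDL goal expression."""
    patient, predicate, destination = subgoal
    if destination is not None:  # only isPlacedTo has a Destination
        return (f"(exists (?p - {patient}) "
                f"(exists (?d - {destination}) ({predicate} ?p ?d)))")
    return f"(exists (?p - {patient}) ({predicate} ?p))"

# Example:
#   subgoal_to_pddl_goal(("BreadSlice", "isPlacedTo", "Toaster"))
#   -> "(exists (?p - BreadSlice) (exists (?d - Toaster) (isPlacedTo ?p ?d)))"
```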
Nonetheless, failures are bound to happen during execution. A notable advantage of our approach is the transparency of its reasoning process, which not only allows us to examine the world representation and plan, but also gives the agent some awareness of potential exceptions and enables the development of mechanisms for replanning. In this section, we introduce several new mechanisms to make online symbolic planning feasible and robust in a dynamic physical world.
Finding Unobserved Objects  The agent's partial observability of the environment may cause a situation where, in order to complete a subgoal, the agent needs an object that has not yet been observed. In this case, a traditional symbolic planner cannot propose a plan and thus will fail the task. To circumvent this shortcoming, we extend the planner by letting the agent search for the missing object(s). Specifically, during planning, our agent assumes that all objects relevant to sub-

[7] See Appendix A.7.1 for more details on PDDL planning.