DANLI: Deliberative Agent for Following Natural Language Instructions
Yichi Zhang Jianing Yang Jiayi Pan Shane Storks Nikhil Devraj
Ziqiao Ma Keunwoo Peter Yu Yuwei Bao Joyce Chai
Computer Science and Engineering Division, University of Michigan
zhangyic@umich.edu
Abstract
Recent years have seen an increasing amount of work on embodied AI agents that can perform tasks by following human language instructions. However, most of these agents are reactive, meaning that they simply learn and imitate behaviors encountered in the training data. Such reactive agents are insufficient for long-horizon complex tasks. To address this limitation, we propose a neuro-symbolic deliberative agent that, while following language instructions, proactively applies reasoning and planning based on its neural and symbolic representations acquired from past experience (e.g., natural language and egocentric vision). We show that our deliberative agent achieves greater than 70% improvement over reactive baselines on the challenging TEACh benchmark. Moreover, the underlying reasoning and planning processes, together with our modular framework, offer impressive transparency and explainability for the agent's behaviors. This enables an in-depth understanding of the agent's capabilities, which sheds light on challenges and opportunities for future embodied agents for instruction following. The code is available at https://github.com/sled-group/DANLI.
1 Introduction
Natural language instruction following with embodied AI agents (Chai et al., 2018; Anderson et al., 2018; Thomason et al., 2019; Qi et al., 2020; Shridhar et al., 2020; Padmakumar et al., 2021) is a notoriously difficult problem, in which an agent must interpret human language commands to perform actions in the physical world and achieve a goal. Especially challenging is the hierarchical nature of everyday tasks,[1] which often require reasoning about subgoals and reconciling them with the world state and the overall goal. However, despite recent progress, past approaches are typically reactive (Wooldridge, 1995) in their execution of actions: conditioned on the rich, multimodal inputs from the environment, they perform actions directly, without using an explicit representation of the world to facilitate grounded reasoning and planning (Pashevich et al., 2021; Zhang and Chai, 2021; Sharma et al., 2022). Such an approach is inefficient, as natural language instructions often omit trivial steps that a human may be assumed to already know (Zhou et al., 2021). Moreover, the lack of any explicit symbolic component makes such approaches hard to interpret, especially when the agent makes errors.

[1] For example, making breakfast may require preparing one or more dishes (e.g., toast and coffee), each of which requires several sub-tasks of navigating through the environment and manipulating objects (e.g., finding a knife, slicing bread, cooking it in the toaster), and even more fine-grained primitive actions entailed by them (e.g., walk forward, pick up knife).
Inspired by previous work toward deliberative agents in robotic task planning, which apply long-term action planning over known world and goal states (She et al., 2014; Agia et al., 2022; Srivastava et al., 2021; Wang et al., 2022), we introduce DANLI, a neuro-symbolic Deliberative Agent for following Natural Language Instructions. DANLI combines learned symbolic representations of task subgoals and the surrounding environment with a robust symbolic planning algorithm to execute tasks. First, we build a uniquely rich semantic spatial representation (Section 3.1), acquired online from the surrounding environment and language descriptions, to capture symbolic information about object instances and their physical states. To capture the highest level of the task hierarchy, we propose a neural task monitor (Section 3.2) that learns to extract symbolic information about task progress and upcoming subgoals from the dialog and action history. Using these elements as a planning domain, we lastly apply an online planning algorithm (Section 3.3) to plan low-level actions for subgoals in the environment, taking advantage of DANLI's transparent reasoning and planning pipeline to detect and recover from errors.
Figure 1: An example task in TEACh.
Our empirical results demonstrate that our deliberative DANLI agent outperforms reactive approaches with better success rates and overwhelmingly more efficient policies on the challenging Task-driven Embodied Agents that Chat (TEACh) benchmark (Padmakumar et al., 2021). Importantly, due to its interpretable symbolic representation and explicit reasoning mechanisms, our approach offers detailed insights into the agent's planning, manipulation, and navigation capabilities. This gives the agent a unique self-awareness about the kinds of exceptions that have occurred, and therefore makes it possible to adapt strategies to cope with exceptions and continually strengthen the system.
2 Problem Definition
The challenge of hierarchical tasks is prominent in the recent Task-driven Embodied Agents that Chat (TEACh) benchmark for this problem (Padmakumar et al., 2021). Here, language instructions are instantiated as a task-oriented dialog between the agent and a commander (who has comprehensive knowledge about the task and environment, but cannot perform any actions), with varying granularity and completeness of guidance given. We focus on the Execution from Dialog History (EDH) setting in TEACh, where the agent is given a dialog history as input and is expected to execute a sequence of actions to achieve the goal set out by the commander. This setting allows us to abstract away the problem of dialog generation and focus on the already difficult problem of instruction following from task-oriented dialog.

As shown in Figure 1, a task, e.g., Make a Sandwich, may have several subtasks that the agent must achieve in order to satisfy the overall task goal. The success of a task or subtask is determined by meeting a set of goal conditions, such as slicing bread and toasting two of the slices. At each timestep, the agent receives an egocentric visual observation of the world and the full dialog history up to that time, and may execute a single low-level action. Actions can involve either navigation, e.g., stepping forward, or manipulation, e.g., picking up an object. Manipulation actions additionally require the agent to identify the action's target object by specifying a pixel in its field of view to highlight the object. Execution continues until the agent predicts a Stop action; otherwise, the session terminates after 1000 timesteps or 30 failed actions. At this point, we can evaluate the agent's completion of the task.
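To make this setup concrete, below is a minimal sketch of the EDH episode loop in Python. The env and agent interfaces are hypothetical stand-ins rather than the TEACh API; only the termination rules come from the description above.

```python
# A minimal sketch of the EDH episode loop described above. The env/agent
# interfaces are hypothetical stand-ins, not the TEACh API; only the
# termination rules (Stop, 1000 steps, 30 failed actions) come from the text.
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_STEPS = 1000
MAX_FAILED_ACTIONS = 30

@dataclass
class Action:
    name: str                                       # e.g., "Forward" or "Pickup"
    target_pixel: Optional[Tuple[int, int]] = None  # required for manipulation

def run_edh_episode(env, agent, dialog_history):
    """Run one EDH session until Stop, step limit, or failure limit."""
    failed = 0
    observation = env.reset()
    for _ in range(MAX_STEPS):
        # The agent sees the egocentric observation and full dialog history.
        action = agent.act(observation, dialog_history)
        if action.name == "Stop":
            break
        observation, success = env.step(action)
        failed += 0 if success else 1
        if failed >= MAX_FAILED_ACTIONS:
            break
    # Evaluation checks how many goal conditions the agent satisfied.
    return env.evaluate_goal_conditions()
```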
It is worth noting that while we focus on TEACh, our approach is largely transferable across benchmark datasets and simulation environments, albeit requiring retraining of some components.
3 A Neuro-Symbolic Deliberative Agent
An overview of our neuro-symbolic deliberative agent is shown in Figure 2. We first introduce the symbolic notions used in our system. We use an object-oriented representation (Diuk et al., 2008) to represent the symbolic world state. Each object instance is assigned an instance ID consisting of its canonicalized class name and an ordinal. We define a state, in the form Predicate(Arguments), as an object's physical state or its relation to another object. We define subgoals as particular states that the agent should achieve while completing the task, represented by the symbolic form (Patient, Predicate, Destination)[2], where the Patient and Destination are object classes, and the Predicate is a state that can be applied to the Patient. We define an action in the agent's plan as ActionType(Arguments), where each argument is an object instance.

[2] isPlacedTo is the only predicate with a Destination.
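As a concrete (though hypothetical) encoding, these symbolic notions could be represented with simple Python dataclasses; the paper does not prescribe this particular implementation.

```python
# A minimal sketch of the symbolic notions defined above, using plain
# dataclasses as an assumed encoding.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ObjectInstance:
    cls: str        # canonicalized class name, e.g., "Bread"
    ordinal: int    # e.g., 0, giving instance ID "Bread_0"

    @property
    def instance_id(self) -> str:
        return f"{self.cls}_{self.ordinal}"

@dataclass(frozen=True)
class State:
    predicate: str              # e.g., "isSliced", "isOn"
    arguments: Tuple[str, ...]  # object instance IDs

@dataclass(frozen=True)
class Subgoal:
    patient: str                      # object class, e.g., "BreadSlice"
    predicate: str                    # e.g., "isPlacedTo"
    destination: Optional[str] = None # only used with isPlacedTo

@dataclass(frozen=True)
class PlanAction:
    action_type: str            # e.g., "Place"
    arguments: Tuple[str, ...]  # instance IDs, e.g., ("Knife_1", "CounterTop_0")

# Examples from Figure 2:
slice_bread = Subgoal("Bread", "isSliced")
toast = Subgoal("BreadSlice", "isPlacedTo", "Toaster")
place = PlanAction("Place", ("Knife_1", "CounterTop_0"))
```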
To complete tasks, our agent reasons over a learned spatial-symbolic map representation (Section 3.1) to generate a hierarchical plan. At the high level, it applies a neural language model to the dialog and action history to predict the complete sequence of completed and future subgoals in symbolic form (Section 3.2). For each predicted subgoal, it then plans a sequence of low-level actions online, using both the symbolic subgoal and the world representations, with robustness to various types of planning and execution failures (Section 3.3).
Figure 2: Illustration of our agent's reasoning process behind a single decision step. After receiving the current observation, the agent first updates its internal representation (orange), then checks the current subgoal progress (blue), and plans the next steps (green). Finally, the first action in the plan is popped and grounded to the agent's egocentric view for execution. The pop-up boxes show example object instances with their instance IDs, states, and positions in the 3D map. New instances and state changes are labeled in green. The status of each subgoal and action is labeled in front of it, with arrows denoting status transitions.
Next, we describe how each component works and highlight our key innovations.
3.1 World Representation Construction
The reasoning process of an embodied AI agent relies heavily on a strong internal representation of the world. As shown in Figure 2, we implement the internal representation as a semantic map incorporating rich symbolic information about object instances and their physical states. In this section, we introduce our methods for constructing this representation.
3D Semantic Voxel Map  As the agent moves through the environment while completing a task, it constructs a 3D semantic voxel map to model its spatial layout. Following Blukis et al. (2022), we use a depth estimator to project the pixels of egocentric observation images and detected objects into a 3D point cloud, and bin the points into 0.25 m³ voxels. The resulting map helps the symbolic planner (Section 3.3) break down high-level navigation actions, such as GoTo(Knife_0), into atomic navigation actions such as Forward, TurnLeft, and LookUp.[3]

[3] See Appendix A.4 for more details on path planning.
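As an illustration, the following is a rough sketch of one voxel-map update step, assuming a standard pinhole camera model and a known camera pose; the actual geometry utilities in DANLI may differ.

```python
# A sketch of the voxel-map update: back-project a depth image to a 3D
# point cloud and bin the points into 0.25 m voxels. The pinhole model
# and data layout are assumptions, not DANLI's exact implementation.
import numpy as np

VOXEL_SIZE = 0.25  # meters, matching the 0.25 m^3 bins described above

def update_voxel_map(voxel_map, depth, semantics, intrinsics, cam_to_world):
    """depth: (H, W) estimated depth in meters.
    semantics: (H, W) per-pixel object class IDs from the segmenter.
    intrinsics: 3x3 camera matrix K.
    cam_to_world: 4x4 camera pose in the world frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pixel -> camera-frame point cloud (homogeneous coordinates).
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]
    # Bin points into voxels; record the observed class per voxel.
    voxels = np.floor(pts_world / VOXEL_SIZE).astype(int)
    for (i, j, k), cls in zip(voxels, semantics.reshape(-1)):
        voxel_map[(int(i), int(j), int(k))] = cls
    return voxel_map
```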
Object Instance Lookup Table  Everyday tasks can involve multiple instances of the same object, and thus modeling only object class information may be insufficient.[4]

[4] For example, when making a sandwich, the agent will likely need to distinguish the top and bottom pieces of bread to make the sandwich complete.
As shown in the internal representation update part of Figure 2, we store object instance information for a single task episode in a symbolic lookup table, where each instance in the environment is assigned a unique ID once observed. These symbols in the lookup table become the planning domain of the symbolic planner (Section 3.3). To collect this symbolic lookup table, we use a panoptic segmentation model[5] to detect all object instances in the current 2D egocentric visual frame. These 2D instance detections are then projected into the 3D map, and we use each instance's 3D centroid and size information to match and update existing object instances' information in the lookup table.[6] As the agent moves through the scene and receives more visual observations, the symbolic lookup table becomes more complete and accurate.

[5] As opposed to a semantic segmentation model as used in prior work (Chaplot et al., 2020; Min et al., 2022; Blukis et al., 2022), which can only detect object class information.
[6] To decide whether a newly detected instance should be merged with an existing instance or added as a new one, we use a matching algorithm described in Appendix A.5.
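A simplified stand-in for this matching step is sketched below; the paper's actual algorithm is in its Appendix A.5, and the distance threshold here is an invented placeholder.

```python
# A simplified stand-in for instance matching: merge a new 3D detection
# with an existing instance of the same class if their centroids are
# close, otherwise register a new instance ID.
import numpy as np

MATCH_DIST = 0.5  # meters; an assumed threshold, not from the paper

def register_detection(lookup, cls, centroid, size):
    """lookup maps instance_id -> {"cls", "centroid", "size"} (np arrays)."""
    best_id, best_dist = None, float("inf")
    for inst_id, info in lookup.items():
        if info["cls"] != cls:
            continue
        d = np.linalg.norm(info["centroid"] - centroid)
        if d < best_dist:
            best_id, best_dist = inst_id, d
    if best_id is not None and best_dist < MATCH_DIST:
        # Merge: refine the stored estimate with the new observation.
        lookup[best_id]["centroid"] = (lookup[best_id]["centroid"] + centroid) / 2
        lookup[best_id]["size"] = np.maximum(lookup[best_id]["size"], size)
        return best_id
    # New instance: assign the next ordinal for this class, e.g., "Bread_1".
    new_id = f"{cls}_{sum(1 for i in lookup.values() if i['cls'] == cls)}"
    lookup[new_id] = {"cls": cls, "centroid": centroid, "size": size}
    return new_id
```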
Physical State Prediction  Additionally, tasks can hinge upon the physical states of particular object instances. For example, when making coffee, the agent should disambiguate dirty and clean coffee mugs and make sure to use the clean one. To recognize the physical state of each object instance, we propose a physical state classification model whose inputs are the image region of a detected object instance and its class identifier, and whose output is a set of physical state labels for the instance. As classifying physical states from visual observation alone can introduce errors, we also incorporate the effects of the agent's actions into the physical state predictions. For example, the isToggledOn attribute is automatically updated after the agent applies the ToggleOn action, overriding the classifier's prediction.
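The interplay between the visual classifier and action effects might look like the following sketch; the effect rules here are illustrative examples rather than the paper's full effect model.

```python
# A sketch of combining the visual state classifier with action effects:
# after an action, its known deterministic effect overrides the
# classifier's prediction for the target instance.
ACTION_EFFECTS = {
    "ToggleOn":  {"isToggledOn": True},
    "ToggleOff": {"isToggledOn": False},
    "Slice":     {"isSliced": True},
}

def update_physical_state(instance_states, inst_id, classifier_pred,
                          last_action=None):
    """instance_states: inst_id -> {attribute: bool}.
    classifier_pred: labels from vision, e.g., {"isDirty": False}.
    last_action: optional (action_type, target_instance_id) tuple."""
    state = instance_states.setdefault(inst_id, {})
    state.update(classifier_pred)
    if last_action is not None:
        action_type, target = last_action
        if target == inst_id:
            # Deterministic action effects take precedence over vision.
            state.update(ACTION_EFFECTS.get(action_type, {}))
    return state
```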
3.2 Subgoal-Based Task Monitoring
Due to the hierarchical nature of tasks, natural language instructions may express a mix of high-level and low-level instructions. In order to monitor and control the completion of a long-horizon task given such complex inputs, we first model the sequence of high-level subgoals, i.e., the key intermediate steps necessary to complete it.

As shown in Figure 3, we apply a sequence-to-sequence approach powered by language models to learn subgoals from the dialog and action history. At the beginning of each session, our agent uses these inputs to predict the sequence of all subgoals. Our key insight is that to better predict the subgoals to do, it is also important to infer what has been done. As such, we propose to additionally predict the completed subgoals, and include the agent's action history as an input to support the prediction.

To take advantage of the power of pre-trained language models for this type of problem, all inputs and outputs are translated into language form. First, we convert the agent's action history into synthetic language (e.g., PickUp(Cup) → "get cup") and feed it, together with the history of dialog utterances, into the encoder. We then decode language expressions for subgoals one by one in an autoregressive manner. As the raw outputs from the decoder can often be noisy due to ambiguous or incomplete language expressions, we add a natural-in, structure-out decoder, which learns to classify each of the subgoal components into its symbolic form and transforms them back into a language phrase used as decoder input to predict the next subgoals.
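The translation logic around the predictor might look like the following sketch; the templates and input formatting are assumptions based on the examples in Figure 3, not the paper's exact implementation.

```python
# A sketch of the input/output translation around the subgoal predictor.
# The templates below are illustrative; the seq2seq model itself (a
# pre-trained encoder-decoder with a structure-out head) is omitted.
ACTION_TEMPLATES = {
    "PickUp": "get {patient}",
    "Place":  "place {patient} to {dest}",
    "Slice":  "slice {patient}",
}

def actions_to_synthetic_language(action_history):
    """E.g., [("PickUp", "Cup", None)] -> 'get cup'."""
    phrases = []
    for action_type, patient, dest in action_history:
        tmpl = ACTION_TEMPLATES[action_type]
        phrases.append(tmpl.format(patient=patient.lower(),
                                   dest=(dest or "").lower()).strip())
    return " ; ".join(phrases)

def build_encoder_input(dialog_history, action_history):
    # Dialog utterances and verbalized actions are concatenated as text.
    return " ".join(dialog_history) + " " + actions_to_synthetic_language(action_history)

# Decoding then alternates between a symbolic classification of each
# subgoal component (Patient, Predicate, Destination) and a re-verbalized
# phrase fed back into the decoder, e.g. (following Figure 3c):
#   "Completed subgoals : get cup ;"            -> (Cup, isPickedUp)
#   "Future subgoals : place cup to table ;"    -> (Cup, isPlacedTo, Table)
```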
3.3 Online Symbolic Planning
Symbolic planners excel at generating reliable and interpretable plans.
Figure 3: Overview of the subgoal learning process. Figure (a) shows the model architecture. Figure (b) shows the different input/output configurations we experiment with. Figure (c) illustrates the decoding process: we first predict completed subgoals and then predict the future subgoals conditioned on them, using different prompts to distinguish the two types of subgoals.
Given predicted subgoals and a constructed spatial-symbolic representation, PDDL (Aeronautiques et al., 1998) planning algorithms can be applied to generate a plan for each subgoal.[7] These short-horizon planning problems reduce the chance of drifting from the plan during execution.
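For illustration, a predicted subgoal might be compiled into a PDDL goal expression as in the sketch below; the predicate naming follows the paper's notation, but the exact domain encoding (described in its Appendix A.7.1) may differ.

```python
# A sketch of turning a predicted subgoal triple into a PDDL goal string.
# The typed-exists encoding is an assumption about the domain design.
def subgoal_to_pddl_goal(subgoal):
    """(Patient, Predicate, Destination) -> a PDDL goal expression."""
    patient, predicate, destination = subgoal
    if destination is not None:  # only isPlacedTo has a Destination
        return (f"(exists (?p - {patient}) "
                f"(exists (?d - {destination}) ({predicate} ?p ?d)))")
    return f"(exists (?p - {patient}) ({predicate} ?p))"

# Example:
#   subgoal_to_pddl_goal(("BreadSlice", "isPlacedTo", "Toaster"))
#   -> "(exists (?p - BreadSlice) (exists (?d - Toaster) (isPlacedTo ?p ?d)))"
```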
Nonetheless, failures are bound to happen during execution. A notable advantage of our approach is the transparency of its reasoning process, which not only allows us to examine the world representation and plan, but also gives the agent some awareness of potential exceptions and enables the development of mechanisms for replanning. In this section, we introduce several new mechanisms to make online symbolic planning feasible and robust in a dynamic physical world.
Finding Unobserved Objects  The agent's partial observability of the environment may cause a situation where, in order to complete a subgoal, the agent needs an object that has not yet been observed. In this case, a traditional symbolic planner cannot propose a plan and thus will fail the task. To circumvent this shortcoming, we extend the planner by letting the agent search for the missing object(s). Specifically, during planning, our agent assumes that all objects relevant to sub-

[7] See Appendix A.7.1 for more details on PDDL planning.