
DANLI: Deliberative Agent for Following Natural Language Instructions
Yichi Zhang Jianing Yang Jiayi Pan Shane Storks Nikhil Devraj
Ziqiao Ma Keunwoo Peter Yu Yuwei Bao Joyce Chai
Computer Science and Engineering Division, University of Michigan
zhangyic@umich.edu
Abstract
Recent years have seen an increasing amount of work on embodied AI agents that can perform tasks by following human language instructions. However, most of these agents are reactive, meaning that they simply learn and imitate behaviors encountered in the training data. These reactive agents are insufficient for long-horizon complex tasks. To address this limitation, we propose a neuro-symbolic deliberative agent that, while following language instructions, proactively applies reasoning and planning based on its neural and symbolic representations acquired from past experience (e.g., natural language and egocentric vision). We show that our deliberative agent achieves greater than 70% improvement over reactive baselines on the challenging TEACh benchmark. Moreover, the underlying reasoning and planning processes, together with our modular framework, offer impressive transparency and explainability to the behaviors of the agent. This enables an in-depth understanding of the agent's capabilities, which sheds light on challenges and opportunities for future embodied agents for instruction following. The code is available at https://github.com/sled-group/DANLI.
1 Introduction
Natural language instruction following with embodied AI agents (Chai et al., 2018; Anderson et al., 2018; Thomason et al., 2019; Qi et al., 2020; Shridhar et al., 2020; Padmakumar et al., 2021) is a notoriously difficult problem, where an agent must interpret human language commands to perform actions in the physical world and achieve a goal. Especially challenging is the hierarchical nature of everyday tasks,[1] which often require reasoning about subgoals and reconciling them with the world state and overall goal.

[1] For example, making breakfast may require preparing one or more dishes (e.g., toast and coffee), each of which requires several sub-tasks of navigating through the environment and manipulating objects (e.g., finding a knife, slicing bread, cooking it in the toaster), and even more fine-grained primitive actions entailed by them (e.g., walk forward, pick up knife).

However, despite recent progress,
past approaches are typically reactive (Wooldridge, 1995) in their execution of actions: conditioned on the rich, multimodal inputs from the environment, they perform actions directly, without using an explicit representation of the world to facilitate grounded reasoning and planning (Pashevich et al., 2021; Zhang and Chai, 2021; Sharma et al., 2022). Such an approach is inefficient, as natural language instructions often omit trivial steps that a human may be assumed to already know (Zhou et al., 2021). Moreover, the lack of any explicit symbolic component makes such approaches hard to interpret, especially when the agent makes errors.
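The hierarchical structure of everyday tasks described in the footnote above (an overall goal decomposed into dishes, sub-tasks, and primitive actions) can be rendered as a toy nested structure. This is purely illustrative; the task names and action strings below are invented for the example and are not the paper's actual task representation:

```python
# A toy rendering of the breakfast example as a task hierarchy:
# goal -> sub-tasks -> primitive actions (all names are illustrative).
BREAKFAST = {
    "MakeBreakfast": {
        "MakeToast": [
            "Navigate(Knife)", "PickUp(Knife)", "Slice(Bread)",
            "PlaceIn(Bread, Toaster)", "ToggleOn(Toaster)",
        ],
        "MakeCoffee": [
            "Navigate(Mug)", "PickUp(Mug)",
            "PlaceIn(Mug, CoffeeMachine)", "ToggleOn(CoffeeMachine)",
        ],
    }
}

def primitive_count(task: dict) -> int:
    """Count the primitive actions entailed by a hierarchical goal,
    recursing through nested sub-tasks down to action lists."""
    total = 0
    for sub in task.values():
        total += primitive_count(sub) if isinstance(sub, dict) else len(sub)
    return total
```

Even this tiny example entails nine primitive actions for a two-dish goal, which is why instructions that mention only the high-level steps leave so much for the agent to infer.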
Inspired by previous work toward deliberative agents in robotic task planning, which apply long-term action planning over known world and goal states (She et al., 2014; Agia et al., 2022; Srivastava et al., 2021; Wang et al., 2022), we introduce DANLI, a neuro-symbolic Deliberative Agent for following Natural Language Instructions. DANLI combines learned symbolic representations of task subgoals and the surrounding environment with a robust symbolic planning algorithm to execute tasks. First, we build a uniquely rich semantic spatial representation (Section 3.1), acquired online from the surrounding environment and language descriptions, to capture symbolic information about object instances and their physical states. To capture the highest level of hierarchy in tasks, we propose a neural task monitor (Section 3.2) that learns to extract symbolic information about task progress and upcoming subgoals from the dialog and action history. Using these elements as a planning domain, we lastly apply an online planning algorithm (Section 3.3) to plan low-level actions for subgoals in the environment, taking advantage of DANLI's transparent reasoning and planning pipeline to detect and recover from errors.
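As a rough sketch of how these pieces fit together (not the paper's actual implementation), the deliberative loop might look as follows. `SemanticMap`, `Subgoal`, and `plan` are simplified stand-ins for the semantic spatial representation, task monitor output, and symbolic planner of Sections 3.1-3.3:

```python
from dataclasses import dataclass, field

# Hypothetical symbolic subgoal: a desired object state, e.g. ("Bread", "isSliced").
@dataclass(frozen=True)
class Subgoal:
    obj: str
    state: str

@dataclass
class SemanticMap:
    """Toy stand-in for the semantic spatial representation:
    maps each object instance to the set of physical states it is known to hold."""
    states: dict = field(default_factory=dict)

    def satisfied(self, sg: Subgoal) -> bool:
        return sg.state in self.states.get(sg.obj, set())

    def apply(self, sg: Subgoal) -> None:
        self.states.setdefault(sg.obj, set()).add(sg.state)

def plan(sg: Subgoal) -> list:
    # Placeholder for the symbolic planner: return the primitive actions
    # that would bring about the subgoal's state (invented action names).
    return [("Navigate", sg.obj), (sg.state, sg.obj)]

def deliberative_loop(subgoals, world: SemanticMap) -> list:
    """Plan only for subgoals the world model does not already satisfy,
    updating the symbolic world state after each one."""
    trace = []
    for sg in subgoals:
        if world.satisfied(sg):  # skip steps the instructions left implicit
            continue
        trace.extend(plan(sg))
        world.apply(sg)
    return trace
```

The key contrast with a reactive policy is visible in the `satisfied` check: because the world state is explicit and symbolic, the agent can skip already-achieved subgoals, and every action in `trace` can be traced back to the subgoal that produced it, which is what makes error detection and recovery inspectable.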
arXiv:2210.12485v1 [cs.AI] 22 Oct 2022

Our empirical results demonstrate that our de-