Generating Executable Action Plans with Environmentally-Aware
Language Models
Maitrey Gramopadhye and Daniel Szafir
University of North Carolina at Chapel Hill, United States
Abstract— Large Language Models (LLMs) trained using
massive text datasets have recently shown promise in generating
action plans for robotic agents from high-level text queries.
However, these models typically do not consider the robot’s
environment, resulting in generated plans that may not actually
be executable, due to ambiguities in the planned actions or en-
vironmental constraints. In this paper, we propose an approach
to generate environmentally-aware action plans that agents
are better able to execute. Our approach involves integrating
environmental objects and object relations as additional inputs
into LLM action plan generation to provide the system with
an awareness of its surroundings, resulting in plans where
each generated action is mapped to objects present in the
scene. We also design a novel scoring function that, along with
generating the action steps and associating them with objects,
helps the system disambiguate among object instances and take
into account their states. We evaluated our approach using the
VirtualHome simulator and the ActivityPrograms knowledge
base and found that action plans generated by our system had
a 310% improvement in executability and a 147% improvement
in correctness over prior work. The complete code and a demo
of our method are publicly available at https://github.com/hri-ironlab/scene_aware_language_planner.
I. INTRODUCTION
Recent work in the natural language processing (NLP) and
machine learning (ML) communities has made tremendous
breakthroughs in several core aspects of computational lin-
guistics and language modeling driven by advances in deep
learning, data, hardware, and techniques. These advance-
ments have led to the release of pretrained large (million and billion+ parameter) language models (LLMs) that have achieved state-of-the-art performance across a variety of tasks, such as text classification, generation, summarization, question answering, and machine translation, and that demonstrate some ability to meaningfully understand the real world [1], [2], [3], [4], [5], [6], [7], [8]. LLMs also demonstrate cross-domain and cross-modal generalization, such as retrieving videos from text, visual question answering, and task planning [9]. In particular, recent works have explored using LLMs to convert high-level natural language commands into actionable steps (e.g., “bring water” → “grab glass”, “fill glass with water”, “walk to table”, “put glass on table”) for intelligent agents [10], [11], [12], [13], [14], [15]. Trained on diverse and extensive data, LLMs have the distinct ability to form action plans for varied high-level tasks.
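To make this prompting setup concrete, the following minimal Python sketch shows few-shot prompting of an off-the-shelf causal language model to expand a high-level command into step-by-step actions; the model choice (GPT-2 via the Hugging Face transformers library), the in-context example, and the decoding settings are illustrative assumptions rather than the configuration used in the works cited above.

from transformers import pipeline

# Few-shot prompting for action-plan generation; the model and settings are placeholders.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Task: throw away paper\n"
    "Step 1: walk to home office\n"
    "Step 2: grab paper\n"
    "Step 3: walk to trash can\n"
    "Step 4: put paper in trash can\n\n"
    "Task: bring water\n"
    "Step 1:"
)

# Sample a few continuations; the first line of each is a candidate next step.
outputs = generator(
    prompt,
    max_new_tokens=15,
    num_return_sequences=3,
    do_sample=True,
    temperature=0.7,
)
for out in outputs:
    continuation = out["generated_text"][len(prompt):].strip()
    print("Step 1 candidate:", continuation.splitlines()[0] if continuation else "")

A text-only prompt of this kind has no awareness of the robot's environment; the approach described in this paper additionally encodes the objects present in the scene into plan generation and scoring.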
While promising, the action steps generated by LLMs in
prior work are not always executable by a robot platform.
For instance, for the task “clean the room,” an LLM might generate the step “call cleaning agency on phone”; while correct, this action plan might not be executable, since the agent might not grasp the concept of “call” or have a “phone” object in its environment. This limitation arises
because LLMs are trained solely on large text corpora and
have essentially never had any interaction with an embodied
environment. As a result, the action steps they generate lack
context on the robot’s surroundings and capabilities.
To address this issue, prior works have explored grounding
LLMs by fine-tuning models using human interactions [15],
[16], [17] or training models for downstream tasks using
pretrained LLMs as frozen backbones [18], [19], [20], [21],
[22], [23], [24], [25]. However, these methods often require
training on extensive annotated data, which can be expensive
or infeasible to obtain, or can lead to loss of generalized
knowledge from the LLM. Instead, recent research has investigated biasing LLM output without altering model weights, either through prompt engineering [10], [11], [26] or by constraining LLM output to a corpus of available action steps defined a priori that are known to be within a robot’s capabilities [10], [11]. This line of research focuses on methods that utilize the capabilities of LLMs while preserving their generality and requiring substantially less additional annotated data.
While these systems effectively perform common-sense grounding by extracting knowledge from an LLM, they employ a one-size-fits-all approach that does not consider variations in the actionable environment. As a result, executing the action plans generated by these systems requires either approximations of the agent’s environment or time-consuming and costly pretraining to produce an affordance score, i.e., the probability that an action will succeed or yield a favorable outcome toward task completion given the current agent and environment states. Additionally, because prior systems are environment-agnostic, they cannot be used to generate executable action plans for tasks that require object disambiguation. For example, to generate correct action plans for tasks that involve interacting with multiple objects sharing the same name, the system must be able to distinguish among object instances.
We propose a novel method to address these issues while generating low-level action plans from high-level task specifications. Our approach is an extension of Huang et al., 2022 [10]. From an Example set (see §IV) built using the ActivityPrograms knowledge base collected by Puig et al., 2018 [27], we sample an example similar to the query task and environment and use it to design a prompt for an LLM (details in §III-A). We then use the LLM to autoregressively generate candidates for each action step, as sketched below.
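As a rough, self-contained illustration of this generation loop (a minimal sketch, not the released implementation: the embedding model, language model, toy Example set, and scene description below are assumptions made for demonstration), one can pair a sentence-embedding model for selecting a similar example with a causal language model for sampling candidate steps.

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Illustrative models; the released code may use different ones.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline("text-generation", model="gpt2")

# Toy stand-in for an Example set drawn from ActivityPrograms.
example_set = [
    "Task: watch tv\nStep 1: walk to living room\nStep 2: switch on television\nStep 3: sit on couch",
    "Task: wash hands\nStep 1: walk to bathroom\nStep 2: turn on faucet\nStep 3: rinse hands",
]

query_task = "relax on sofa"
scene_objects = "living room, couch, television, remote control"

# 1) Select the example whose task is most similar to the query task.
sims = util.cos_sim(
    embedder.encode(query_task, convert_to_tensor=True),
    embedder.encode(example_set, convert_to_tensor=True),
)
best_example = example_set[int(sims.argmax())]

# 2) Build a prompt that exposes the objects in the scene to the LLM.
prompt = (
    f"{best_example}\n\n"
    f"Objects in scene: {scene_objects}\n"
    f"Task: {query_task}\nStep 1:"
)

# 3) Autoregressively sample several candidates for the next action step.
candidates = generator(
    prompt,
    max_new_tokens=20,
    num_return_sequences=5,
    do_sample=True,
    temperature=0.8,
)
for cand in candidates:
    continuation = cand["generated_text"][len(prompt):].strip()
    print(continuation.splitlines()[0] if continuation else "")

In the full system, these candidates are then ranked with the scores introduced next, which account for the objects present in the scene and their states.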
To rank the generated candidates, we design multiple scores