Generating Executable Action Plans with Environmentally-Aware Language Models

Maitrey Gramopadhye1 and Daniel Szafir1
Abstract Large Language Models (LLMs) trained using
massive text datasets have recently shown promise in generating
action plans for robotic agents from high level text queries.
However, these models typically do not consider the robot’s
environment, resulting in generated plans that may not actually
be executable, due to ambiguities in the planned actions or en-
vironmental constraints. In this paper, we propose an approach
to generate environmentally-aware action plans that agents
are better able to execute. Our approach involves integrating
environmental objects and object relations as additional inputs
into LLM action plan generation to provide the system with
an awareness of its surroundings, resulting in plans where
each generated action is mapped to objects present in the
scene. We also design a novel scoring function that, along with
generating the action steps and associating them with objects,
helps the system disambiguate among object instances and take
into account their states. We evaluated our approach using the
VirtualHome simulator and the ActivityPrograms knowledge
base and found that action plans generated from our system had
a 310% improvement in executability and a 147% improvement
in correctness over prior work. The complete code and a demo of our method are publicly available at https://github.com/hri-ironlab/scene_aware_language_planner.
I. INTRODUCTION
Recent work in the natural language processing (NLP) and
machine learning (ML) communities has made tremendous
breakthroughs in several core aspects of computational lin-
guistics and language modeling driven by advances in deep
learning, data, hardware, and techniques. These advance-
ments have led to the release of pretrained large (million
and billion+ parameter) language models (LLMs) that have
achieved the state-of-the-art across a variety of tasks such as
text classification, generation, summarization, question answering, and machine translation, and that demonstrate some ability to meaningfully understand the real world [1], [2], [3], [4], [5], [6], [7], [8]. LLMs also demonstrate cross-domain and cross-modal generalizations, such as retrieving
videos from text, visual question answering and task plan-
ning [9]. In particular, recent works have explored using
LLMs to convert high-level natural language commands to
actionable steps (e.g., “bring water” → “grab glass”, “fill glass with water”, “walk to table”, “put glass on table”) for
intelligent agents [10], [11], [12], [13], [14], [15]. Trained
on diverse and extensive data, LLMs have the distinct ability
to form action plans for varied high-level tasks.
While promising, the action steps generated by LLMs in
prior work are not always executable by a robot platform.
1University of North Carolina at Chapel Hill, United States
For instance, for the task “clean the room,” an LLM might generate the output “call cleaning agency on phone”; while correct, this action plan might not be executable since the agent might not grasp the concept of “call” or have a “phone” object in its environment. This limitation arises
because LLMs are trained solely on large text corpora and
have essentially never had any interaction with an embodied
environment. As a result, the action steps they generate lack
context on the robot’s surroundings and capabilities.
To address this issue, prior works have explored grounding
LLMs by fine-tuning models using human interactions [15],
[16], [17] or training models for downstream tasks using
pretrained LLMs as frozen backbones [18], [19], [20], [21],
[22], [23], [24], [25]. However, these methods often require
training on extensive annotated data, which can be expensive
or infeasible to obtain, or can lead to loss of generalized
knowledge from the LLM. Instead, recent research has in-
vestigated biasing LLM output without altering their weights
by using prompt engineering [10], [11], [26] or constraining
LLM output to a corpus of available action steps defined a
priori that are known to be within a robot’s capabilities [10],
[11]. This line of research focuses on methods that can utilise
the capabilities of LLMs while preserving their generality
and with substantially less additional annotated data.
While these systems effectively perform common sense
grounding by extracting knowledge from an LLM, they
employ a one-size-fits-all approach without considering the variations possible in the actionable environment. As a result,
executing the action plans generated by these systems either
requires approximations to the agent’s environment or time-
consuming and costly pretraining to generate an affordance
score to determine the probability that an action will succeed
or produce a favourable outcome towards task completion,
given the current agent and environment states. Additionally,
since prior systems are environment agnostic, it is not
possible to use them to generate executable action plans
for tasks requiring object disambiguation. For example, to
generate correct action plans for tasks that require interaction
with multiple objects with the same name, the system needs
to be able to distinguish among object instances.
We propose a novel method to address these issues
while generating low-level action plans from high-level task
specifications. Our approach is an extension to Huang et
al., 2022 [10]. From an Example set (see §IV) using the
ActivityPrograms knowledge base collected by Puig et al.,
2018 [27], we sample an example similar to the query task
and environment and use it to design a prompt for an LLM
(details of which are given in §III-A). We then use the LLM
to autoregressively generate candidates for each action step.
To rank the generated candidates, we design multiple scores
arXiv:2210.04964v2 [cs.RO] 2 May 2023
Fig. 1. Visualization of an example action plan being executed in VirtualHome. Within the virtual home environment a simulated humanoid agent carries
out the robot task sequences generated by our environmentally-aware language model.
for the actions and their associated objects (see §III-B and
§III-C). After the top candidate is selected, we append it to
the action plan and repeat the process until the entire action
plan is generated.
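The generate-score-append loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `propose_steps` and `score_step` are hypothetical stand-ins for LLM candidate sampling and our combined scoring functions.

```python
def generate_plan(task, propose_steps, score_step, max_steps=20):
    """Greedy autoregressive plan generation: at each step, score the
    candidate actions and append the best one until no candidates remain.

    propose_steps(task, plan) -> list of candidate next steps (stand-in
    for sampling the planning LM); score_step(step, plan) -> float
    (stand-in for the action/object scores of Sections III-B and III-C).
    """
    plan = []
    for _ in range(max_steps):
        candidates = propose_steps(task, plan)
        if not candidates:  # the model signals the plan is complete
            break
        best = max(candidates, key=lambda s: score_step(s, plan))
        plan.append(best)
    return plan
```

The loop is deliberately greedy: only the top-scoring candidate is appended before the next step is sampled.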
To evaluate our action plans, we use the recently released
VirtualHome interface [27] (Figure 1 shows a visualization
of an example action plan running in VirtualHome). We use
several metrics (details in §IV-A), including executability,
Longest Common Sub-sequence (LCS), and final graph
correctness to autonomously test generated action plans on
VirtualHome. Overall, we found that our method increased
action plan executability and correctness by 310% and 147%
respectively over a state-of-the-art baseline.
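The LCS metric compares a generated plan against a ground-truth program step by step. A minimal normalized-LCS sketch (our own illustration of the standard dynamic-programming formulation; the paper's evaluation harness may normalize differently):

```python
def normalized_lcs(plan_a: list, plan_b: list) -> float:
    """Longest common subsequence of two action-step lists, normalized
    by the length of the longer plan so the score lies in [0, 1]."""
    m, n = len(plan_a), len(plan_b)
    # dp[i][j] = LCS length of plan_a[:i] and plan_b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if plan_a[i - 1] == plan_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n) if max(m, n) else 1.0
```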
II. RELATED WORK
Our work builds upon recent efforts in robotics to lever-
age the potential of LLMs. For instance, researchers are
beginning to explore LLMs in the context of applying com-
monsense reasoning to natural language instructions [28],
providing robotic agents with zero-shot action plans [10], and
supplying high-level semantic knowledge about robot tasks
[11]. Below, we review related research in task planning,
LLMs, and action plan grounding.
A. Task Planning
The problem of task planning involves generating a series
of steps to accomplish a goal in a constrained environment.
Historically, this problem has been widely studied in robotics
[29], [30], [31], with most approaches solving it by opti-
mizing the generated plan given environment constraints [32], [33] and using symbolic planning [29], [31]. Recently,
machine learning methods have been employed to relax the
constraints on the environment and allow higher-level task
specifications by leveraging techniques such as reinforcement
learning or graph learning to learn task hierarchy [34], [35],
[36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46],
[47], [48], [49]. However, most of these methods require
extensive training from demonstrations, or explicitly encoded
environmental knowledge and may not generalize to unseen
environments and tasks. The use of LLMs, which encapsulate
generalized world knowledge, may help plan for novel tasks
and new environments.
B. Large Language Models
Large language models (LLMs) are language models,
usually inspired by the transformer architecture [50], tens
of gigabytes in size and trained on enormous amounts
of unstructured text data. Recent advances in the field of
natural language processing have shown that LLMs are
useful for several downstream applications including interac-
tive dialogue, essay generation, creating websites from text
descriptions, automatic code completion, etc. [1], [2], [4],
[3]. During their pretraining, LLMs can accumulate diverse
and extensive knowledge [51], [52], [53] that enables their
use in applications beyond NLP, such as retrieving visual
features [54] and solving mathematical problems [55], [56]
or as pretrained models for other modalities [57], [58]. In
robotics, knowledge embedded in LLMs can be utilised to
generate actionable plans for agents from high-level queries.
However, in order for a plan to be executable by a robot, the
outputs from the LLMs need to be grounded in the context
of the robot’s environment and capabilities.
C. Grounding Natural Language in Action Plans
There has been considerable work towards grounding nat-
ural language in actionable steps. Prior research has focused
on parsing natural language or analysing it as a series of lexical
tokens to remove ambiguity and map language commands
to admissible actions [59], [60], [61], [62]. However, these
methods usually require extensive, manually coded rules and
thus fail to generalize to novel environments and tasks. More
relevant to our approach, recent work has explored grounding
language models using additional environment elements [63],
[64], [65], [66], [67]. Techniques include prompting [10],
[26] and constraining language model outputs to admissible
actions [11], [12], [13], [14]. To also ground the output
of language models in the environment of the agent, prior
works have tried using LLMs as fixed backbones [18], [20], [21], [22], [23], [24], [25], [68], or fine-tuning or ranking model outputs through interactions with the environment [15], [16],
[17]. Our work extends such approaches, where we use
additional inputs from the environment (i.e., objects and their
properties) to condition the model output without any fine-
tuning of the LLM or extra training to learn value functions
for ranking LLM outputs.
Fig. 2. An overview of our approach. We generate action plans by first
selecting an example that has a similar task and environment to the query.
We use this example to autoregressively prompt the Planning LM to generate
an action plan and map the output to admissible actions and objects using
the Translation LM.
III. APPROACH
In this section, we discuss our proposed method to gen-
erate directly executable action plans from high-level tasks
(Figure 2 provides a visual overview). Motivated by Huang
et al., 2022 [10], our approach uses two language models,
a planning LM (LM_P) to generate the action plan and calculate a score for the similarity of an object with the other objects associated with the action plan; and a translation LM (LM_T) to calculate embeddings for objects and actions.
A. LLM Action Plan Prompt Generation
Large language models have the ability to learn from con-
text during inference, i.e., when autoregressively sampled,
LLMs can generate meaningful text to complete or extend
a given textual prompt [1]. We leverage this capability in
designing prompts for LLM sampling that generate action
plans. Specifically, we select an example from an Example
set of task and action plans synthesized from the Activi-
tyPrograms dataset (see §IV) and construct a prompt for the
LLM by prepending the example task and action plan to the
current task.
We dynamically select the example during inference to
design a prompt similar to the query. As in Huang et al., 2022
[10], we use the query task to select the example. However,
one of our novel extensions is to also use the environment
associated with the query to construct a prompt, keeping in
mind the objects (and their states) the agent can currently
interact with. For a query (Q) with task “Play video games”,
an example (Ex1) with task “play board games” may be
chosen considering just the task similarity. However, another
example (Ex2) with task “Use the computer” may be more
relevant because the action plan for both Q and Ex2 would
have actions involving similar objects, such as “switch on
computer”, “type on keyboard”, “push mouse”, etc., which
may not be present in the action plan for Ex1. Considering
the environment in selecting the example may also help in
disambiguating between examples with high task similarity
but different objects. For example, a “clean room” action
plan, which uses a rag, and a “clean floor” plan, which uses
a mop, may both have high task similarity to a “clean the
house” query task. Considering the objects present in the
environment (E) of the query (Q) (e.g., a rag is present, but
not a mop) can help determine the better example.
We start by selecting N_e examples {Q_e^i} (i = 1, ..., N_e) whose tasks {T_e^i} are similar to the task T of the query Q. Here N_e is a hyperparameter. We use the cosine similarity C of task embeddings to calculate task similarity, given by:

S_M(T, T_e) = C(LM_T(T), LM_T(T_e))
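Concretely, this score is the cosine similarity between the two embedding vectors. A minimal sketch (the vectors stand in for LM_T outputs; any sentence-embedding model could produce them):

```python
import numpy as np

def task_similarity(emb_query: np.ndarray, emb_example: np.ndarray) -> float:
    """S_M(T, T_e): cosine similarity between two task embeddings.

    emb_query and emb_example play the roles of LM_T(T) and LM_T(T_e).
    """
    return float(
        np.dot(emb_query, emb_example)
        / (np.linalg.norm(emb_query) * np.linalg.norm(emb_example))
    )
```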
We then compare the environments {E_e^i} (i = 1, ..., N_e) of the selected examples with the environment E of the query Q. An environment from a sample in our dataset is structured as a graph, with the graph nodes representing the available objects. The nodes also carry information about object properties (e.g., grabbable, openable, movable) and the current states of the objects (e.g., clean, closed). The edges in the graph represent the relations between objects (e.g., inside, on, facing, close to). We calculate the environment similarity as the mean of the intersection over union of the nodes and edges respectively:

S_G(E, E_e) = (1/2) · (IoU(nodes(E), nodes(E_e)) + IoU(edges(E), edges(E_e)))
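A minimal sketch of this graph-similarity score, assuming a simplified scene-graph representation (plain dicts with `nodes` as object names and `edges` as (object, relation, object) triples; the actual VirtualHome graphs carry more structure):

```python
def environment_similarity(env_q: dict, env_e: dict) -> float:
    """S_G(E, E_e): mean IoU of the node sets and edge sets of two
    scene graphs, each given as {'nodes': [...], 'edges': [...]}."""
    def iou(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return 0.5 * (
        iou(set(env_q["nodes"]), set(env_e["nodes"]))
        + iou(set(env_q["edges"]), set(env_e["edges"]))
    )
```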
Finally, from the selected N_e examples, we select the one example Q_e* that maximises the example score, given by:

S_M(T_e*, T) + W_s · S_G(E_e*, E)

where W_s is a hyperparameter. With the example task T_e*, action plan A_e*, and query task T, we form a prompt (Pr_a = T_e* + A_e* + T) for generating the action plan, and a set of objects (Pr_o) associated with the action plan A_e*. We use Pr_o to calculate the similarity scores between any new objects and the objects already associated with the action plan (see §III-C).
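The two-stage selection (task-similarity shortlist, then combined re-ranking) and the prompt assembly can be sketched as follows. This is an illustrative sketch only: `task_sim` and `env_sim` are hypothetical callables implementing S_M and S_G, `examples` is a toy (task, environment, action plan) list, and the exact prompt template of the paper may differ.

```python
def select_example(query_task, query_env, examples, task_sim, env_sim,
                   n_e=10, w_s=1.0):
    """Pick the example maximising S_M + W_s * S_G and build Pr_a.

    examples: list of (task, environment, action_plan) tuples.
    task_sim(a, b) and env_sim(a, b): similarity scores in the spirit
    of S_M and S_G (passed in so this sketch stays self-contained).
    """
    # Stage 1: shortlist the N_e examples with the most similar tasks.
    candidates = sorted(examples,
                        key=lambda ex: task_sim(query_task, ex[0]),
                        reverse=True)[:n_e]
    # Stage 2: re-rank the shortlist by the combined example score.
    best = max(candidates,
               key=lambda ex: (task_sim(query_task, ex[0])
                               + w_s * env_sim(query_env, ex[1])))
    task_e, _, plan_e = best
    # Pr_a = T_e* + A_e* + T (illustrative template)
    prompt = f"Task: {task_e}\n{plan_e}\nTask: {query_task}\n"
    return best, prompt
```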
B. Action Step Generation
As in Huang et al., 2022 [10], we sample the LM_P multiple times using prompt Pr_a to get k samples for each action step, along with the LLM generation probability associated with each sampled step (P_a). P_a gives a score for how relevant the planning LM thinks the sample is to the current action plan and prompt. However, since the output of the language model is unconstrained, it can include infeasible steps that the agent cannot actually execute. To make sure the actions generated are executable, we map each sample to its closest