Generating Executable Action Plans with Environmentally-Aware
Language Models
Maitrey Gramopadhye and Daniel Szafir
University of North Carolina at Chapel Hill, United States
Abstract— Large Language Models (LLMs) trained using
massive text datasets have recently shown promise in generating
action plans for robotic agents from high-level text queries.
However, these models typically do not consider the robot’s
environment, resulting in generated plans that may not actually
be executable, due to ambiguities in the planned actions or en-
vironmental constraints. In this paper, we propose an approach
to generate environmentally-aware action plans that agents
are better able to execute. Our approach involves integrating
environmental objects and object relations as additional inputs
into LLM action plan generation to provide the system with
an awareness of its surroundings, resulting in plans where
each generated action is mapped to objects present in the
scene. We also design a novel scoring function that, along with
generating the action steps and associating them with objects,
helps the system disambiguate among object instances and take
into account their states. We evaluated our approach using the
VirtualHome simulator and the ActivityPrograms knowledge
base and found that action plans generated by our system had
a 310% improvement in executability and a 147% improvement
in correctness over prior work. The complete code and a demo
of our method are publicly available at https://github.com/hri-ironlab/scene_aware_language_planner.
I. INTRODUCTION
Recent work in the natural language processing (NLP) and
machine learning (ML) communities has made tremendous
breakthroughs in several core aspects of computational lin-
guistics and language modeling driven by advances in deep
learning, data, hardware, and techniques. These advance-
ments have led to the release of pretrained large (million and billion+ parameter) language models (LLMs) that have achieved state-of-the-art performance across a variety of tasks, such as text classification, generation, summarization, question answering, and machine translation, and that demonstrate some ability to meaningfully understand the real world [1], [2], [3], [4], [5], [6], [7], [8]. LLMs also demonstrate cross-domain and cross-modal generalization, such as retrieving videos from text, visual question answering, and task planning [9]. In particular, recent works have explored using LLMs to convert high-level natural language commands into actionable steps (e.g., “bring water” → “grab glass”, “fill glass with water”, “walk to table”, “put glass on table”) for intelligent agents [10], [11], [12], [13], [14], [15]. Trained on diverse and extensive data, LLMs have the distinct ability to form action plans for varied high-level tasks.
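To make this prompting setup concrete, the following minimal Python sketch shows few-shot prompting of an off-the-shelf causal language model to expand a high-level command into step-by-step actions; the model choice (GPT-2 via the Hugging Face transformers library), the in-context example, and the decoding settings are illustrative assumptions rather than the configuration used in the works cited above.

from transformers import pipeline

# Few-shot prompting for action-plan generation; the model and settings are placeholders.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Task: throw away paper\n"
    "Step 1: walk to home office\n"
    "Step 2: grab paper\n"
    "Step 3: walk to trash can\n"
    "Step 4: put paper in trash can\n\n"
    "Task: bring water\n"
    "Step 1:"
)

# Sample a few continuations; the first line of each is a candidate next step.
outputs = generator(
    prompt,
    max_new_tokens=15,
    num_return_sequences=3,
    do_sample=True,
    temperature=0.7,
)
for out in outputs:
    continuation = out["generated_text"][len(prompt):].strip()
    print("Step 1 candidate:", continuation.splitlines()[0] if continuation else "")

A text-only prompt of this kind has no awareness of the robot's environment; the approach described in this paper additionally encodes the objects present in the scene into plan generation and scoring.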
While promising, the action steps generated by LLMs in
prior work are not always executable by a robot platform.
For instance, for the task “clean the room,” an LLM might generate the step “call cleaning agency on phone”; while correct, this action plan might not be executable, since the agent might not grasp the concept of “call” or have a “phone” object in its environment. This limitation arises
because LLMs are trained solely on large text corpora and
have essentially never had any interaction with an embodied
environment. As a result, the action steps they generate lack
context on the robot’s surroundings and capabilities.
To address this issue, prior works have explored grounding
LLMs by fine-tuning models using human interactions [15],
[16], [17] or training models for downstream tasks using
pretrained LLMs as frozen backbones [18], [19], [20], [21],
[22], [23], [24], [25]. However, these methods often require
training on extensive annotated data, which can be expensive
or infeasible to obtain, or can lead to loss of generalized
knowledge from the LLM. Instead, recent research has investigated biasing LLM output without altering model weights, either through prompt engineering [10], [11], [26] or by constraining LLM output to a corpus of available action steps defined a priori that are known to be within a robot’s capabilities [10], [11]. This line of research focuses on methods that utilize the capabilities of LLMs while preserving their generality and requiring substantially less additional annotated data.
While these systems effectively perform common-sense grounding by extracting knowledge from an LLM, they employ a one-size-fits-all approach that does not consider variations in the actionable environment. As a result, executing the action plans generated by these systems requires either approximations of the agent’s environment or time-consuming and costly pretraining to produce an affordance score, i.e., the probability that an action will succeed or yield a favorable outcome toward task completion given the current agent and environment states. Additionally, because prior systems are environment-agnostic, they cannot be used to generate executable action plans for tasks that require object disambiguation. For example, to generate correct action plans for tasks that involve interacting with multiple objects sharing the same name, the system must be able to distinguish among object instances.
We propose a novel method to address these issues while generating low-level action plans from high-level task specifications. Our approach is an extension of Huang et al., 2022 [10]. From an Example set (see §IV) built using the ActivityPrograms knowledge base collected by Puig et al., 2018 [27], we sample an example similar to the query task and environment and use it to design a prompt for an LLM (details in §III-A). We then use the LLM to autoregressively generate candidates for each action step, as sketched below.
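As a rough, self-contained illustration of this generation loop (a minimal sketch, not the released implementation: the embedding model, language model, toy Example set, and scene description below are assumptions made for demonstration), one can pair a sentence-embedding model for selecting a similar example with a causal language model for sampling candidate steps.

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Illustrative models; the released code may use different ones.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline("text-generation", model="gpt2")

# Toy stand-in for an Example set drawn from ActivityPrograms.
example_set = [
    "Task: watch tv\nStep 1: walk to living room\nStep 2: switch on television\nStep 3: sit on couch",
    "Task: wash hands\nStep 1: walk to bathroom\nStep 2: turn on faucet\nStep 3: rinse hands",
]

query_task = "relax on sofa"
scene_objects = "living room, couch, television, remote control"

# 1) Select the example whose task is most similar to the query task.
sims = util.cos_sim(
    embedder.encode(query_task, convert_to_tensor=True),
    embedder.encode(example_set, convert_to_tensor=True),
)
best_example = example_set[int(sims.argmax())]

# 2) Build a prompt that exposes the objects in the scene to the LLM.
prompt = (
    f"{best_example}\n\n"
    f"Objects in scene: {scene_objects}\n"
    f"Task: {query_task}\nStep 1:"
)

# 3) Autoregressively sample several candidates for the next action step.
candidates = generator(
    prompt,
    max_new_tokens=20,
    num_return_sequences=5,
    do_sample=True,
    temperature=0.8,
)
for cand in candidates:
    continuation = cand["generated_text"][len(prompt):].strip()
    print(continuation.splitlines()[0] if continuation else "")

In the full system, these candidates are then ranked with the scores introduced next, which account for the objects present in the scene and their states.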
To rank the generated candidates, we design multiple scores