Large Language Models
Can Implement Policy Iteration
Ethan Brooks1, Logan Walls2, Richard L. Lewis2, Satinder Singh1
1Computer Science and Engineering, University of Michigan
2Department of Psychology, University of Michigan
{ethanbro,logwalls,rickl,baveja}@umich.edu
Abstract
In this work, we demonstrate a method for implementing policy iteration using
a large language model. While the application of foundation models to RL has
received considerable attention, most approaches rely on either (1) the curation of
expert demonstrations (either through manual design or task-specific pretraining)
or (2) adaptation to the task of interest using gradient methods (either fine-tuning
or training of adapter layers). Both of these techniques have drawbacks. Collecting
demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient
techniques are inherently slow, sacrificing the “few-shot” quality that makes in-
context learning attractive to begin with. Our method demonstrates that a large
language model can be used to implement policy iteration using the machinery
of in-context learning, enabling it to learn to perform RL tasks without expert
demonstrations or gradients. Our approach iteratively updates the contents of the
prompt from which it derives its policy through trial-and-error interaction with an
RL environment. In order to eliminate the role of in-weights learning (on which
approaches like Decision Transformer rely heavily), we demonstrate our method
using Codex (M. Chen et al. 2021b), a language model with no prior knowledge of
the domains on which we evaluate it.
1 Introduction
In many settings, models implemented using a transformer or recurrent architecture improve their performance as information accumulates in their context or memory. We refer to this phenomenon as “in-context learning.” Brown et al. (2020b) demonstrated a technique for inducing this phenomenon
by prompting a large language model with a small number of input/output exemplars. An interesting
property of in-context learning in the case of large pre-trained models (or “foundation models”) is
that the models are not directly trained to optimize a meta-learning objective, but demonstrate an
emergent capacity to generalize (or at least specialize) to diverse downstream task-distributions (Wei
et al. 2022b).
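As a minimal illustration of this prompting technique (the exemplars and the completion call below are hypothetical, not drawn from the paper), a few-shot prompt simply concatenates input/output pairs ahead of a new query and asks the model to continue the pattern:

# A minimal sketch of few-shot prompting; the exemplars are invented for illustration.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: dog -> French:"  # the model is expected to continue with " chien"
)
# completion = llm.complete(prompt)  # placeholder for any text-completion API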
A large body of existing work has explored methods for applying this remarkable capability to downstream tasks (see Related Work), including Reinforcement Learning (RL). Most work in this area either (1) assumes access to expert demonstrations, collected either from human experts (Huang et al. 2022b; Baker et al. 2022) or from domain-specific pre-trained RL agents (L. Chen et al. 2021; Lee et al. 2022; Janner et al. 2021; Reed et al. 2022; Xu et al. 2022), or (2) relies on gradient-based methods, e.g. fine-tuning of the foundation model's parameters as a whole (Lee et al. 2022; Reed et al. 2022; Baker et al. 2022) or training a new adapter layer or prefix vectors while keeping the original foundation model frozen (X. L. Li et al. 2021; A. K. Singh et al. 2022; Karimi Mahabadi et al. 2022).
Preprint. Under review.
arXiv:2210.03821v2 [cs.LG] 13 Aug 2023
Figure 1: For each possible action A(1), ..., A(n), the LLM generates a rollout by alternately predicting transitions and selecting actions. Q-value estimates are discounted sums of rewards, Q(s_t, A(i)) = Σ_u γ^u r_u. The action a_t = arg max_a Q(s_t, a) is chosen greedily with respect to Q-values and executed in the environment. Both state/reward prediction and next-action selection use trajectories from D to create prompts for the LLM. Changes to the content of D change the prompts that the LLM receives, allowing the model to improve its behavior over time.
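To make the procedure in Figure 1 concrete, the Python sketch below mirrors it under stated assumptions: llm.predict_transition and llm.select_action stand in for the prompt-based transition predictions and rollout action selections described in the caption, and buffer plays the role of D. These names are illustrative placeholders, not the paper's implementation.

# Illustrative sketch of the Figure 1 computation (helper names are hypothetical).
def q_estimate(llm, buffer, state, action, gamma=0.8, horizon=8):
    """Estimate Q(state, action) as the discounted return of one LLM-generated rollout."""
    ret, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(horizon):
        # The LLM predicts the next state, reward, and termination flag from a prompt
        # assembled out of trajectories stored in the buffer D.
        s, r, done = llm.predict_transition(buffer, s, a)
        ret += discount * r
        discount *= gamma
        if done:
            break
        # The LLM then selects the rollout policy's next action, again prompted from D.
        a = llm.select_action(buffer, s)
    return ret

def greedy_action(llm, buffer, state, actions):
    """Choose the action whose rollout-based Q-value estimate is largest."""
    return max(actions, key=lambda a: q_estimate(llm, buffer, state, a))

Acting greedily with respect to these Q estimates is the policy-improvement step; as the buffer grows, the prompts (and hence the rollouts) change, which is what allows the policy to improve.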
Our work demonstrates an approach to in-context learning which relaxes these assumptions. Our
method, In-Context Policy Iteration (ICPI), implements policy iteration using the prompt content,
instead of the model parameters, as the locus of learning, thereby avoiding gradient methods. Furthermore, the use of policy iteration frees us from expert demonstrations because suboptimal prompts can be improved over the course of training.
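For intuition about how the prompt, rather than the weights, serves as the locus of learning, a prompt can be assembled by serializing transitions from the replay buffer into a fixed text format; the format below is a hypothetical example, not the paper's actual encoding of observations, actions, and rewards.

# Hypothetical serialization of buffered transitions into a prompt for the LLM.
def build_prompt(buffer, query_state, query_action, k=8):
    lines = []
    for (s, a, r, s_next) in buffer[-k:]:  # the k most recent transitions in D
        lines.append(f"state: {s} action: {a} reward: {r} next state: {s_next}")
    # Ask the model to continue the pattern for the query state/action pair.
    lines.append(f"state: {query_state} action: {query_action} reward:")
    return "\n".join(lines)

Because the exemplar transitions come from the agent's own early, suboptimal behavior, initial prompts induce a suboptimal policy; but every new trajectory added to the buffer changes subsequent prompts, which is how the policy improves without any gradient update.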
We evaluate the method empirically on six small illustrative RL tasks (chain, distractor-chain, maze, mini-catch, mini-invaders, and point-mass), in which the method very quickly finds good policies. We also compare five pretrained Large Language Models (LLMs): two models of different sizes trained on natural language (OPT-30B and GPT-J) and three models of different sizes trained on program code (two sizes of Codex as well as InCoder). On our six domains, we find that only the largest model (the code-davinci-001 variant of Codex) consistently demonstrates learning.
2 Related Work
A common application of foundation models to RL involves tasks that have language input, for
example natural language instructions/goals (D. Garg et al. 2022; Hill et al. 2020) or text-based games
(Peng et al. 2021; I. Singh et al. 2021; Majumdar et al. 2020; Ammanabrolu et al. 2021). Another approach encodes RL trajectories into token sequences, processes them with a foundation model, and passes the resulting model representations as input to deep RL architectures (S. Li et al. 2022; Tarasov et al. 2022; Tam et al. 2022). Finally, a recent set of approaches treats RL as a sequence modeling problem and uses the foundation model itself to predict states or actions. This section focuses on this last category.
2.1 Learning from demonstrations
Many recent sequence-based approaches to reinforcement learning use demonstrations that come
either from human experts or pretrained RL agents. For example, Huang et al. 2022b use a frozen
LLM as a planner for everyday household tasks by constructing a prefix from human-generated
task instructions, and then using the LLM to generate instructions for new tasks. This work is
extended by Huang et al. 2022a. Similarly, Ahn et al. 2022 use a value function that is trained on
human demonstrations to rank candidate actions produced by an LLM. Baker et al. 2022 use human
demonstrations to train the foundation model itself: they use video recordings of human Minecraft
players to train a foundation model that plays Minecraft. Works that rely on pretrained RL agents
include Janner et al. 2021 who train a “Trajectory Transformer” to predict trajectory sequences in
continuous control tasks by using trajectories generated by pretrained agents, and L. Chen et al. 2021,
who use a dataset of offline trajectories to train a “Decision Transformer” that predicts actions from
state-action-reward sequences in RL environments like Atari. Two approaches build on this method
to improve generalization: Lee et al. 2022 use trajectories generated by a DQN agent to train a single
Decision Transformer that can play many Atari games, and Xu et al. 2022 use a combination of
human and artificial trajectories to train a Decision Transformer that achieves few-shot generalization
on continuous control tasks. Reed et al. 2022 take task-generality a step further and use datasets
generated by pretrained agents to train a multi-modal agent that performs a wide array of RL (e.g.
Atari, continuous control) and non-RL (e.g. image captioning, chat) tasks.
Some of the above works include non-expert demonstrations as well. L. Chen et al. 2021 include
experiments with trajectories generated by random (as opposed to expert) policies. Lee et al. 2022
and Xu et al. 2022 also use datasets that include trajectories generated by partially trained agents in
addition to fully trained agents. Like these works, our proposed method (ICPI) does not rely on expert
demonstrations—but we note two key differences between our approach and existing approaches.
Firstly, ICPI only consumes self-generated trajectories, so it does not require any demonstrations
(like L. Chen et al. 2021 with random trajectories, but unlike Lee et al. 2022, Xu et al. 2022, and the
other approaches reviewed above). Secondly, ICPI relies primarily on in-context learning rather than
in-weights learning to achieve generalization (like Xu et al. 2022, but unlike L. Chen et al. 2021 &
Lee et al. 2022). For discussion about in-weights vs. in-context learning see Chan et al. 2022.
2.2 Gradient-based Training & Finetuning on RL Tasks
Most approaches involve training or fine-tuning foundation models on RL tasks. For example, Janner
et al. 2021; L. Chen et al. 2021; Lee et al. 2022; Xu et al. 2022; Baker et al. 2022; Reed et al. 2022
all use models that are trained from scratch on tasks of interest, and A. K. Singh et al. 2022; Ahn
et al. 2022; Huang et al. 2022a combine frozen foundation models with trainable components or
adapters. In contrast, Huang et al. 2022b use frozen foundation models for planning, without training
or fine-tuning on RL tasks. Like Huang et al. 2022b, ICPI does not update the parameters of the
foundation model, but relies on the frozen model’s in-context learning abilities. However, ICPI
gradually builds and improves the prompts within the space defined by the given fixed text-format for
observations, actions, and rewards (in contrast to Huang et al. 2022b, which uses the frozen model to
select good prompts from a given fixed library of goal/plan descriptions).
2.3 In-Context Learning
Several recent papers have specifically studied in-context learning. Laskin et al. 2022 demonstrate an approach to in-context reinforcement learning that trains a model on complete RL learning histories, showing that the model distills the improvement operator of the source algorithm. Chan et al. 2022 and S. Garg et al. 2022 provide analyses of the properties that drive in-context learning, the first in the context of image classification, the second in the context of regression onto a continuous function. These papers identify various properties on which in-context learning depends, including “burstiness,” model size, and model architecture. Y. Chen et al. 2022 study the sensitivity of in-context learning to small perturbations of the context and propose a novel method that uses sensitivity as a proxy for model certainty.
Algorithm 1 Training Loop
1: function TRAIN(environment)
2:     initialize D                                    ▷ replay buffer containing full history of behavior
3:     while training do
4:         o_0 ← Reset environment.
5:         while episode is not done do
6:             a_t ← arg max_a Q(o_t, a, D)            ▷ policy improvement
7:             o_{t+1}, r_t, b_t ← Execute a_t in environment.
8:             t ← t + 1
9:         end while
10:        D ← D ∪ (o_0, a_0, r_0, b_0, o_1, ..., o_t, a_t, r_t, b_t, o_{t+1})    ▷ add trajectory to buffer
11:    end while
12: end function
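A rough Python rendering of Algorithm 1 is sketched below; it assumes a classic Gym-style reset/step interface and reuses the hypothetical greedy_action helper from the Figure 1 sketch, so it is illustrative rather than the paper's implementation.

# Illustrative sketch of Algorithm 1 (assumes a classic Gym-style environment API).
def train(env, llm, actions, num_episodes=100):
    buffer = []  # D: the full history of the agent's behavior
    for _ in range(num_episodes):
        obs, trajectory, done = env.reset(), [], False
        while not done:
            act = greedy_action(llm, buffer, obs, actions)  # policy improvement
            next_obs, reward, done, _ = env.step(act)
            trajectory.append((obs, act, reward, next_obs))
            obs = next_obs
        buffer.extend(trajectory)  # add the completed trajectory to D
    return buffer

Note that no value function or policy parameters are ever updated; the only thing that changes between episodes is the contents of the buffer, and therefore the prompts the LLM receives.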