Large Language Models
Can Implement Policy Iteration
Ethan Brooks1, Logan Walls2, Richard L. Lewis2, Satinder Singh1
1Computer Science and Engineering, University of Michigan
2Department of Psychology, University of Michigan
{ethanbro,logwalls,rickl,baveja}@umich.edu
Abstract
In this work, we demonstrate a method for implementing policy iteration using
a large language model. While the application of foundation models to RL has
received considerable attention, most approaches rely on either (1) the curation of
expert demonstrations (either through manual design or task-specific pretraining)
or (2) adaptation to the task of interest using gradient methods (either fine-tuning
or training of adapter layers). Both of these techniques have drawbacks. Collecting
demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient
techniques are inherently slow, sacrificing the “few-shot” quality that makes in-
context learning attractive to begin with. Our method demonstrates that a large
language model can be used to implement policy iteration using the machinery
of in-context learning, enabling it to learn to perform RL tasks without expert
demonstrations or gradients. Our approach iteratively updates the contents of the
prompt from which it derives its policy through trial-and-error interaction with an
RL environment. In order to eliminate the role of in-weights learning (on which
approaches like Decision Transformer rely heavily), we demonstrate our method
using Codex (M. Chen et al. 2021b), a language model with no prior knowledge of
the domains on which we evaluate it.
1 Introduction
In many settings, models implemented using a transformer or recurrent architecture improve their performance as information accumulates in their context or memory. We refer to this phenomenon as “in-context learning.” Brown et al. (2020b) demonstrated a technique for inducing this phenomenon
by prompting a large language model with a small number of input/output exemplars. An interesting
property of in-context learning in the case of large pre-trained models (or “foundation models”) is
that the models are not directly trained to optimize a meta-learning objective, but demonstrate an
emergent capacity to generalize (or at least specialize) to diverse downstream task-distributions (Wei
et al. 2022b).
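As a minimal illustration of this prompting technique (the exemplars and the completion call below are hypothetical, not drawn from the paper), a few-shot prompt simply concatenates input/output pairs ahead of a new query and asks the model to continue the pattern:

# A minimal sketch of few-shot prompting; the exemplars are invented for illustration.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: dog -> French:"  # the model is expected to continue with " chien"
)
# completion = llm.complete(prompt)  # placeholder for any text-completion API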
A large body of existing work has explored methods for applying this remarkable capability to downstream tasks (see Related Work), including Reinforcement Learning (RL). Most work in this area either (1) assumes access to expert demonstrations, collected either from human experts (Huang et al. 2022b; Baker et al. 2022) or from domain-specific pre-trained RL agents (L. Chen et al. 2021; Lee et al. 2022; Janner et al. 2021; Reed et al. 2022; Xu et al. 2022), or (2) relies on gradient-based methods, e.g. fine-tuning of the foundation model's parameters as a whole (Lee et al. 2022; Reed et al. 2022; Baker et al. 2022) or training a new adapter layer or prefix vectors while keeping the original foundation model frozen (X. L. Li et al. 2021; A. K. Singh et al. 2022; Karimi Mahabadi et al. 2022).
Preprint. Under review.
arXiv:2210.03821v2 [cs.LG] 13 Aug 2023
Figure 1: For each possible action A(1), ..., A(n), the LLM generates a rollout by alternately predicting transitions and selecting actions. Q-value estimates are discounted sums of rewards, Q(s_t, A(i)) = Σ_u γ^u r_u. The action a_t = arg max_a Q(s_t, a) is chosen greedily with respect to Q-values and executed in the environment. Both state/reward prediction and next-action selection use trajectories from D to create prompts for the LLM. Changes to the content of D change the prompts that the LLM receives, allowing the model to improve its behavior over time.
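To make the procedure in Figure 1 concrete, the Python sketch below mirrors it under stated assumptions: llm.predict_transition and llm.select_action stand in for the prompt-based transition predictions and rollout action selections described in the caption, and buffer plays the role of D. These names are illustrative placeholders, not the paper's implementation.

# Illustrative sketch of the Figure 1 computation (helper names are hypothetical).
def q_estimate(llm, buffer, state, action, gamma=0.8, horizon=8):
    """Estimate Q(state, action) as the discounted return of one LLM-generated rollout."""
    ret, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(horizon):
        # The LLM predicts the next state, reward, and termination flag from a prompt
        # assembled out of trajectories stored in the buffer D.
        s, r, done = llm.predict_transition(buffer, s, a)
        ret += discount * r
        discount *= gamma
        if done:
            break
        # The LLM then selects the rollout policy's next action, again prompted from D.
        a = llm.select_action(buffer, s)
    return ret

def greedy_action(llm, buffer, state, actions):
    """Choose the action whose rollout-based Q-value estimate is largest."""
    return max(actions, key=lambda a: q_estimate(llm, buffer, state, a))

Acting greedily with respect to these Q estimates is the policy-improvement step; as the buffer grows, the prompts (and hence the rollouts) change, which is what allows the policy to improve.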
Our work demonstrates an approach to in-context learning which relaxes these assumptions. Our
method, In-Context Policy Iteration (ICPI), implements policy iteration using the prompt content,
instead of the model parameters, as the locus of learning, thereby avoiding gradient methods. Furthermore, the use of policy iteration frees us from expert demonstrations because suboptimal prompts can be improved over the course of training.
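For intuition about how the prompt, rather than the weights, serves as the locus of learning, a prompt can be assembled by serializing transitions from the replay buffer into a fixed text format; the format below is a hypothetical example, not the paper's actual encoding of observations, actions, and rewards.

# Hypothetical serialization of buffered transitions into a prompt for the LLM.
def build_prompt(buffer, query_state, query_action, k=8):
    lines = []
    for (s, a, r, s_next) in buffer[-k:]:  # the k most recent transitions in D
        lines.append(f"state: {s} action: {a} reward: {r} next state: {s_next}")
    # Ask the model to continue the pattern for the query state/action pair.
    lines.append(f"state: {query_state} action: {query_action} reward:")
    return "\n".join(lines)

Because the exemplar transitions come from the agent's own early, suboptimal behavior, initial prompts induce a suboptimal policy; but every new trajectory added to the buffer changes subsequent prompts, which is how the policy improves without any gradient update.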
We evaluate the method empirically on six small illustrative RL tasks (chain, distractor-chain, maze, mini-catch, mini-invaders, and point-mass), in which the method very quickly finds good policies. We also compare five pretrained Large Language Models (LLMs): two models of different sizes trained on natural language (OPT-30B and GPT-J) and three models of different sizes trained on program code (two sizes of Codex as well as InCoder). On our six domains, we find that only the largest model (the code-davinci-001 variant of Codex) consistently demonstrates learning.
2 Related Work
A common application of foundation models to RL involves tasks that have language input, for
example natural language instructions/goals (D. Garg et al. 2022; Hill et al. 2020) or text-based games
(Peng et al. 2021; I. Singh et al. 2021; Majumdar et al. 2020; Ammanabrolu et al. 2021). Another approach encodes RL trajectories into token sequences, processes them with a foundation model, and passes the resulting model representations as input to deep RL architectures (S. Li et al. 2022; Tarasov et al. 2022; Tam et al. 2022). Finally, a recent set of approaches treats RL as a sequence modeling problem and uses the foundation model itself to predict states or actions. This section focuses on this last category.
2.1 Learning from demonstrations
Many recent sequence-based approaches to reinforcement learning use demonstrations that come
either from human experts or pretrained RL agents. For example, Huang et al. 2022b use a frozen
LLM as a planner for everyday household tasks by constructing a prefix from human-generated
task instructions, and then using the LLM to generate instructions for new tasks. This work is
extended by Huang et al. 2022a. Similarly, Ahn et al. 2022 use a value function that is trained on
human demonstrations to rank candidate actions produced by an LLM. Baker et al. 2022 use human
demonstrations to train the foundation model itself: they use video recordings of human Minecraft
players to train a foundation model that plays Minecraft. Works that rely on pretrained RL agents
include Janner et al. 2021 who train a “Trajectory Transformer” to predict trajectory sequences in
continuous control tasks by using trajectories generated by pretrained agents, and L. Chen et al. 2021,
who use a dataset of offline trajectories to train a “Decision Transformer” that predicts actions from
state-action-reward sequences in RL environments like Atari. Two approaches build on this method
to improve generalization: Lee et al. 2022 use trajectories generated by a DQN agent to train a single
Decision Transformer that can play many Atari games, and Xu et al. 2022 use a combination of
human and artificial trajectories to train a Decision Transformer that achieves few-shot generalization
on continuous control tasks. Reed et al. 2022 take task-generality a step further and use datasets
generated by pretrained agents to train a multi-modal agent that performs a wide array of RL (e.g.
Atari, continuous control) and non-RL (e.g. image captioning, chat) tasks.
Some of the above works include non-expert demonstrations as well. L. Chen et al. 2021 include
experiments with trajectories generated by random (as opposed to expert) policies. Lee et al. 2022
and Xu et al. 2022 also use datasets that include trajectories generated by partially trained agents in
addition to fully trained agents. Like these works, our proposed method (ICPI) does not rely on expert
demonstrations—but we note two key differences between our approach and existing approaches.
Firstly, ICPI only consumes self-generated trajectories, so it does not require any demonstrations
(like L. Chen et al. 2021 with random trajectories, but unlike Lee et al. 2022, Xu et al. 2022, and the
other approaches reviewed above). Secondly, ICPI relies primarily on in-context learning rather than
in-weights learning to achieve generalization (like Xu et al. 2022, but unlike L. Chen et al. 2021 &
Lee et al. 2022). For discussion about in-weights vs. in-context learning see Chan et al. 2022.
2.2 Gradient-based Training & Finetuning on RL Tasks
Most approaches involve training or fine-tuning foundation models on RL tasks. For example, Janner
et al. 2021; L. Chen et al. 2021; Lee et al. 2022; Xu et al. 2022; Baker et al. 2022; Reed et al. 2022
all use models that are trained from scratch on tasks of interest, and A. K. Singh et al. 2022; Ahn
et al. 2022; Huang et al. 2022a combine frozen foundation models with trainable components or
adapters. In contrast, Huang et al. 2022b use frozen foundation models for planning, without training
or fine-tuning on RL tasks. Like Huang et al. 2022b, ICPI does not update the parameters of the
foundation model, but relies on the frozen model’s in-context learning abilities. However, ICPI
gradually builds and improves the prompts within the space defined by the given fixed text-format for
observations, actions, and rewards (in contrast to Huang et al. 2022b, which uses the frozen model to
select good prompts from a given fixed library of goal/plan descriptions).
2.3 In-Context Learning
Several recent papers have specifically studied in-context learning. Laskin et al. 2022 demonstrate an approach to in-context reinforcement learning that trains a model on complete RL learning histories, showing that the model distills the improvement operator of the source algorithm. Chan et al. 2022 and S. Garg et al. 2022 provide analyses of the properties that drive in-context learning, the first in the context of image classification, the second in the context of regression onto a continuous function. These papers identify various properties on which in-context learning depends, including “burstiness,” model size, and model architecture. Y. Chen et al. 2022 study the sensitivity of in-context learning to small perturbations of the context and propose a novel method that uses sensitivity as a proxy for model certainty.
Algorithm 1 Training Loop
1: function TRAIN(environment)
2:     initialize D                                    ▷ replay buffer containing full history of behavior
3:     while training do
4:         o_0 ← Reset environment.
5:         while episode is not done do
6:             a_t ← arg max_a Q(o_t, a, D)            ▷ policy improvement
7:             o_{t+1}, r_t, b_t ← Execute a_t in environment.
8:             t ← t + 1
9:         end while
10:        D ← D ∪ (o_0, a_0, r_0, b_0, o_1, ..., o_t, a_t, r_t, b_t, o_{t+1})    ▷ add trajectory to buffer
11:    end while
12: end function
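A rough Python rendering of Algorithm 1 is sketched below; it assumes a classic Gym-style reset/step interface and reuses the hypothetical greedy_action helper from the Figure 1 sketch, so it is illustrative rather than the paper's implementation.

# Illustrative sketch of Algorithm 1 (assumes a classic Gym-style environment API).
def train(env, llm, actions, num_episodes=100):
    buffer = []  # D: the full history of the agent's behavior
    for _ in range(num_episodes):
        obs, trajectory, done = env.reset(), [], False
        while not done:
            act = greedy_action(llm, buffer, obs, actions)  # policy improvement
            next_obs, reward, done, _ = env.step(act)
            trajectory.append((obs, act, reward, next_obs))
            obs = next_obs
        buffer.extend(trajectory)  # add the completed trajectory to D
    return buffer

Note that no value function or policy parameters are ever updated; the only thing that changes between episodes is the contents of the buffer, and therefore the prompts the LLM receives.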