
state-action-reward sequences in RL environments like Atari. Two approaches build on this method
to improve generalization: Lee et al. 2022 use trajectories generated by a DQN agent to train a single
Decision Transformer that can play many Atari games, and Xu et al. 2022 use a combination of
human and artificial trajectories to train a Decision Transformer that achieves few-shot generalization
on continuous control tasks. Reed et al. 2022 take task generality a step further and use datasets
generated by pretrained agents to train a multi-modal agent that performs a wide array of RL (e.g.
Atari, continuous control) and non-RL (e.g. image captioning, chat) tasks.
Some of the above works include non-expert demonstrations as well. L. Chen et al. 2021 include
experiments with trajectories generated by random (as opposed to expert) policies. Lee et al. 2022
and Xu et al. 2022 also use datasets that include trajectories generated by partially trained agents in
addition to fully trained agents. Like these works, our proposed method (ICPI) does not rely on expert
demonstrations—but we note two key differences between our approach and existing approaches.
Firstly, ICPI only consumes self-generated trajectories, so it does not require any demonstrations
(like L. Chen et al. 2021 with random trajectories, but unlike Lee et al. 2022, Xu et al. 2022, and the
other approaches reviewed above). Secondly, ICPI relies primarily on in-context learning rather than
in-weights learning to achieve generalization (like Xu et al. 2022, but unlike L. Chen et al. 2021 &
Lee et al. 2022). For a discussion of in-weights vs. in-context learning, see Chan et al. 2022.
2.2 Gradient-based Training & Finetuning on RL Tasks
Most approaches that apply foundation models to RL involve training or fine-tuning the models on RL tasks. For example, Janner
et al. 2021; L. Chen et al. 2021; Lee et al. 2022; Xu et al. 2022; Baker et al. 2022; Reed et al. 2022
all use models that are trained from scratch on tasks of interest, and A. K. Singh et al. 2022; Ahn
et al. 2022; Huang et al. 2022a combine frozen foundation models with trainable components or
adapters. In contrast, Huang et al. 2022b use frozen foundation models for planning, without training
or fine-tuning on RL tasks. Like Huang et al. 2022b, ICPI does not update the parameters of the
foundation model, but relies on the frozen model’s in-context learning abilities. However, ICPI
gradually builds and improves the prompts within the space defined by the given fixed text format for
observations, actions, and rewards (in contrast to Huang et al. 2022b, which uses the frozen model to
select good prompts from a given fixed library of goal/plan descriptions).
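To make this concrete, below is a minimal sketch of what serializing buffered transitions into one such fixed text format might look like; the field names and layout are illustrative assumptions, not the actual prompt format used by ICPI.

    # Illustrative sketch only: one possible fixed text format for
    # (observation, action, reward) transitions; ICPI's actual format may differ.
    from typing import List, Tuple

    Transition = Tuple[str, str, float]  # (observation, action, reward) rendered as text


    def build_prompt(transitions: List[Transition], query_obs: str, query_action: str) -> str:
        """Serialize past transitions as in-context examples, then append the queried
        observation/action pair so a frozen model can complete the reward field."""
        lines = [f"observation: {o}. action: {a}. reward: {r}." for o, a, r in transitions]
        lines.append(f"observation: {query_obs}. action: {query_action}. reward:")
        return "\n".join(lines)

Under this view, improving the prompt amounts to changing which transitions from the replay buffer are selected and serialized, rather than choosing from a fixed library of goal/plan descriptions.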
2.3 In-Context Learning
Several recent papers have specifically studied in-context learning. Laskin et al. 2022 demonstrate
an approach to in-context reinforcement learning that trains a model on complete RL learning
histories, showing that the model distills the improvement operator of the source algorithm. Chan
et al. 2022 and S. Garg et al. 2022 analyze the properties that drive in-context learning, the former
in the setting of image classification and the latter in the setting of regression onto a continuous
function. These papers identify several properties on which in-context learning depends, including
“burstiness,” model size, and model architecture. Y. Chen et al. 2022 study the sensitivity of
in-context learning to small perturbations of the context and propose a novel method that uses
sensitivity as a proxy for model certainty.
Algorithm 1 Training Loop
function TRAIN(environment)
    initialize D                                        ▷ replay buffer containing full history of behavior
    while training do
        o_0 ← Reset environment;  t ← 0
        while episode is not done do
            a_t ← argmax_a Q(o_t, a, D)                 ▷ policy improvement
            o_{t+1}, r_t, b_t ← Execute a_t in environment.
            t ← t + 1
        end while
        D ← D ∪ (o_0, a_0, r_0, b_0, o_1, ..., o_t, a_t, r_t, b_t, o_{t+1})   ▷ add trajectory to buffer
    end while
end function
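As a reading aid, the following Python sketch mirrors the structure of Algorithm 1. It assumes a Gym-style environment interface (reset/step), treats b_t as the episode-termination flag, and uses a hypothetical compute_q(o, a, D) helper as a stand-in for the in-context estimate of Q; these names are assumptions, not the paper's implementation.

    # Sketch of Algorithm 1's training loop. compute_q is a hypothetical stand-in
    # for the in-context estimate Q(o, a, D); env is assumed to follow the classic
    # Gym API (reset() -> obs, step(a) -> (obs, reward, done, info)).
    from typing import Callable, List, Sequence, Tuple

    Transition = Tuple[object, int, float, bool]  # (o_t, a_t, r_t, b_t)


    def train(env, actions: Sequence[int],
              compute_q: Callable[[object, int, List[List[Transition]]], float],
              num_episodes: int) -> List[List[Transition]]:
        D: List[List[Transition]] = []  # replay buffer: full history of behavior
        for _ in range(num_episodes):
            o = env.reset()
            trajectory: List[Transition] = []
            done = False
            while not done:
                # Policy improvement: act greedily with respect to the Q values
                # estimated in-context from prompts built out of D.
                a = max(actions, key=lambda act: compute_q(o, act, D))
                o_next, r, done, _ = env.step(a)
                trajectory.append((o, a, r, done))
                o = o_next
            D.append(trajectory)  # add the completed trajectory to the buffer
        return D

The sketch runs for a fixed number of episodes for simplicity, whereas Algorithm 1 loops for as long as training continues.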