Learning General World Models in a Handful of Reward-Free Deployments

Yingchen Xu (UCL, Meta AI), Jack Parker-Holder (University of Oxford), Aldo Pacchiano (Microsoft Research), Philip J. Ball (University of Oxford), Oleh Rybkin (UPenn), Stephen J. Roberts (University of Oxford), Tim Rocktäschel (UCL), Edward Grefenstette (UCL, Cohere)
Abstract
Building generally capable agents is a grand challenge for deep reinforcement
learning (RL). To approach this challenge practically, we outline two key desiderata:
1) to facilitate generalization, exploration should be task agnostic; 2) to facilitate
scalability, exploration policies should collect large quantities of data without
costly centralized retraining. Combining these two properties, we introduce the
reward-free deployment efficiency setting, a new paradigm for RL research. We
then present CASCADE, a novel approach for self-supervised exploration in this
new setting. CASCADE seeks to learn a world model by collecting data with a
population of agents, using an information theoretic objective inspired by Bayesian
Active Learning. CASCADE achieves this by specifically maximizing the diversity
of trajectories sampled by the population through a novel cascading objective.
We provide theoretical intuition for CASCADE which we show in a tabular setting
improves upon naïve approaches that do not account for population diversity.
We then demonstrate that CASCADE collects diverse task-agnostic datasets and
learns agents that generalize zero-shot to novel, unseen downstream tasks on Atari,
MiniGrid, Crafter and the DM Control Suite. Code and videos are available at
https://ycxuyingchen.github.io/cascade/
1 Introduction
Reinforcement learning (RL, [105]) has achieved a number of impressive feats over the past decade, with successes in games [69, 13, 100], robotics [48, 77], and the emergence of real world applications [11, 25]. Indeed, now that RL has successfully mastered a host of individual tasks, the community has begun to focus on the grand challenge of building generally capable agents [90, 109, 68, 5].
In this work, we take steps towards building generalist agents at scale, where we outline two key desiderata. First, for agents to become generalists that can adapt to novel tasks, we eschew the notion of restricting agent learning to task-specific reward functions and focus on the reward-free problem setting instead [83, 28], whereby agents must discover novel skills and behaviors without supervision.² Consider the problem of learning to control robotic arms, where we may already have some expert offline data to learn from. In many cases this data will cover only a subset of the entire range of possible behaviors. Therefore, to learn additional general skills, it is imperative to collect additional novel and diverse data, and to do so without a pre-specified reward function.
Second, to ensure scalability, we should have access to a large fleet of robots that we can deploy to gather this data for a large number of timesteps [60], without costly and lengthy centralized retraining during this crucial phase [67]. This has recently been referred to as deployment efficiency, falling between the typical online/offline RL dichotomy. Limiting deployments not only reduces the overhead in retraining exploration policies, but also limits the potential costs and risks present when deploying new policies [108, 53], an important consideration in many real world settings, such as robotics [48], education [65] and healthcare [29].
Equal contribution. Correspondence to ycxu@meta.com.
² Indeed, designing reward functions to learn behaviors can have unintended consequences [74, 6].
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.12719v1 [cs.LG] 23 Oct 2022
[Figure 1 graphic: panels labeled “Deploy Explorers / Collect Data”, “Train World Model”, “Train Explorers” (population diversity via latent disagreement, per-state info gain), and “Zero-shot Generalization to New Tasks”.]
Figure 1: Overview. Left: CASCADE trains a population of diverse explorers and uses them to collect large batches of reward-free trajectories for learning a general world model that facilitates zero-shot generalization to novel tasks. Right: To train $B$ exploration agents in parallel, at each training step $t$, CASCADE first infers a latent state $s_t$ from image observation $o_t$. It then rolls out latent trajectories $\tau_1, \dots, \tau_B$ in imagination using the current exploration policies $\pi_1, \dots, \pi_B$. The training objective for each policy $\pi_i$ is to optimize 1) the population diversity, estimated by the disagreement of the final states of imagined trajectories sampled from policies $\pi_1, \dots, \pi_i$; 2) the expected per-state information gain over all future timesteps $t+1, \dots, T$, computed as the disagreement of an ensemble of dynamics models.
Combining these two desiderata, we introduce the reward-free deployment efficiency setting, a new paradigm for deep RL research. To tackle this new problem we train a population of exploration policies to collect large quantities of useful data via world models [20, 34, 39]. World models allow agents to plan and/or train policies without interacting with the true environment. They have already been shown to be highly effective for deployment efficiency [67], offline RL [118, 49, 8, 88] and self-supervised exploration [99, 97]. Furthermore, world models offer the potential for increasing agent generalization capabilities [7, 37, 68, 16, 8, 21, 10, 58], one of the frontiers of RL research [80, 51]. However, since existing self-supervised methods for learning world models are designed to collect only a few transitions with a single exploration policy, they likely produce a homogeneous dataset when deployed at scale, which does not optimally improve the model.
Figure 2: Motivation for CASCADE: Green areas represent high expected information gain. If we train a population of agents independently, at deployment time they will all follow the trajectory to #1, producing a homogeneous dataset. However, if we consider the diversity of the data then we will produce agents that reach both #1 and #2.
Instead, drawing analogies from Bayesian Active Learning [42, 52], we introduce a new information theoretic objective that maximizes the information gain from an entire dataset collected by a population of exploration agents (see Figure 2). We call our method Coordinated Active Sample Collection via Diverse Explorers, or CASCADE (Figure 1). We provide theoretical justification for CASCADE, which emphasizes the importance of collecting data with diverse agents. In addition, we provide a rigorous empirical evaluation across four challenging domains that shows CASCADE can discover a rich dataset from a handful of deployments. We see that CASCADE produces general exploration strategies that are equally adept at both “deep” exploration problems and diverse behavior discovery. This makes it possible to train agents capable of zero-shot transfer when rewards are provided at test time in a variety of different settings.
To summarize, our contributions are as follows: 1) We introduce a novel problem setting, Reward-Free Deployment Efficiency, designed to train generalist agents in a scalable fashion; 2) We propose CASCADE, a theoretically motivated model-based RL agent designed to gather diverse, highly informative data, inspired by Bayesian Active Learning; 3) We provide analysis that shows CASCADE theoretically improves sample efficiency over other naïve methods that do not ensure sample diversity, and demonstrate that CASCADE is capable of improved zero-shot transfer in four distinct settings, ranging from procedurally generated worlds to continuous control from pixels.
2 Problem Statement
Reinforcement learning (RL) considers training an agent to solve a Markov Decision Process (MDP), represented as a tuple $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, P, R, \rho, \gamma\}$, where $s \in \mathcal{S}$ and $a \in \mathcal{A}$ are the sets of states and actions respectively, $P(s' \mid s, a)$ is a probability distribution over next states given a previous state and action, $R(s, a, s') \mapsto r$ is a reward function mapping a transition to a scalar reward, $\rho$ is an initial state distribution and $\gamma$ is a discount factor. A policy $\pi$ acting in the environment produces a trajectory $\tau = \{s_1, a_1, \dots, s_H, a_H\}$ for an episode with horizon $H$. Since actions in the trajectory are sampled from a policy, we can then define the RL problem as finding a policy $\pi$ that maximizes expected returns in the environment, i.e. $\pi^\star = \arg\max_\pi \mathbb{E}_{\tau \sim \pi}[R(\tau)]$.
We seek to learn policies that can transfer to any MDP within a family of MDPs. This can be formalized as a Contextual MDP [51], where observations, dynamics and rewards can vary given a context. In this paper we consider settings where only the reward varies; thus, if the test-time context is unknown at training time we must collect data that sufficiently covers the space of possible reward functions. Finally, to facilitate scalability, we operate in the deployment efficient paradigm [67], whereby policy learning and exploration are completely separate, and during a given deployment, we gather a large quantity of data without further policy retraining (c.f. online approaches like DER [112], which take multiple gradient steps per exploration timestep in the real environment). Taken together, we consider the reward-free deployment efficiency problem. This differs from previous work as follows: 1) unlike previous deployment efficiency work, our exploration is task agnostic; 2) unlike previous reward-free RL work, we cannot update our exploration policy $\pi_{\mathrm{EXP}}$ during deployment. Thus, the focus of our work is on how to train $\pi_{\mathrm{EXP}}$ offline such that it gathers heterogeneous and informative data which facilitates zero-shot transfer to unknown tasks.
In this paper we make use of model-based RL (MBRL), where the goal is to learn a model of the environment (or world model [96]) and then use it to subsequently train policies to solve downstream tasks. To do this, the world model needs to approximate both $P$ and $R$. Typically, the model will be a neural network, parameterized by $\psi$, hence we denote the approximate dynamics and reward functions as $P_\psi$ and $R_\psi$, which produces a new “imaginary” MDP, $\mathcal{M}_\psi = (\mathcal{S}, \mathcal{A}, P_\psi, R_\psi, \rho)$. We focus on Dyna-style MBRL [104], whereby we train a policy ($\pi_\theta$, parameterized by $\theta$) with model-free RL solely using “imagined” transitions inside $\mathcal{M}_\psi$. Furthermore, we can train the policy on a single GPU with parallelized rollouts since the simulator is a neural network [54]. The general form of all methods in this paper is shown in Algorithm 1, with the key difference being step 5: we aim to update $\pi_{\mathrm{EXP}}$ in the new imaginary MDP $\mathcal{M}_\psi$ such that it continues to collect a large, diverse quantity of reward-free data. Note that $\pi_{\mathrm{EXP}}$ need not be a single policy, but could also refer to a collection of policies that we can deploy (either in parallel or in series), such that $\pi \in \pi_{\mathrm{EXP}}$.
Algorithm 1 Reward-Free Deployment Efficiency via World Models
1: Input: Initial exploration policy πEXP
2: for each deployment do
3: Deploy πEXP to collect a large quantity of reward-free data.
4: Train world model on all existing data.
5: Update πEXP in new imaginary MDP Mψ.
6: end for
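To make the control flow concrete, here is a minimal Python sketch of this loop; the helper names (`collect_reward_free_data`, `train_world_model`, `update_explorers_in_imagination`) are hypothetical placeholders standing in for steps 3-5, not the authors' released implementation.

```python
# Minimal sketch of Algorithm 1: reward-free deployment efficiency via world models.
# All helper functions are hypothetical placeholders for the corresponding steps.

def reward_free_deployment_loop(explorers, num_deployments, steps_per_deployment):
    dataset = []          # reward-free transitions accumulated across deployments
    world_model = None
    for _ in range(num_deployments):
        # Step 3: deploy the (fixed) exploration policies to gather a large batch of data.
        dataset += collect_reward_free_data(explorers, steps_per_deployment)
        # Step 4: (re)train the world model on all data collected so far.
        world_model = train_world_model(dataset)
        # Step 5: update the explorers purely inside the imaginary MDP, never online.
        explorers = update_explorers_in_imagination(explorers, world_model)
    return world_model, dataset
```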
We focus on learning world models from high dimensional sensory inputs such as pixels [34, 76, 47], where at each timestep we are given access to an observation $o_t$ rather than a state $s_t$. A series of recent works have shown tremendous success by mapping the observation to a compact latent state $z_t$ [39, 38, 40]. In this paper we will make use of the model from DreamerV2 [40], which has been shown to produce highly effective policies in a variety of high dimensional environments. The primary component of DreamerV2 is a Recurrent State Space Model (RSSM) that uses a learned latent state to predict the image reconstruction, reward $r_t$ and discount factor $\gamma_t$. Aside from the reward head, all components of the model are trained jointly, in similar fashion to variational autoencoders (VAEs, [50, 92]). For zero-shot evaluation, we follow [97] and only train the reward head at test time when provided with labels for our pre-collected data, which is then used to train a behavior policy offline. Thus, it is critical that our dataset is sufficiently diverse to enable learning novel, unseen behaviors.
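As a rough sketch of this zero-shot protocol (assuming hypothetical helpers `fit_reward_head` and `train_policy_in_imagination`, and reward labels supplied at test time; this is not the Plan2Explore or CASCADE API):

```python
# Hedged sketch of zero-shot evaluation: only the reward head is trained at test time,
# then a behavior policy is trained fully offline inside the frozen world model.
# All helper names are illustrative placeholders.

def zero_shot_evaluate(world_model, dataset, test_reward_fn):
    # Label the pre-collected, reward-free transitions with the test-time reward.
    labelled = [(obs, act, test_reward_fn(obs, act)) for obs, act in dataset]
    # Fit the reward head; dynamics and representation stay frozen.
    reward_head = fit_reward_head(world_model, labelled)
    # Train the downstream behavior policy purely in imagination.
    return train_policy_in_imagination(world_model, reward_head)
```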
3 Coordinated Active Sample Collection
The aim of this work is to train a population of $B$ exploration policies $\{\pi^{(i)}_{\mathrm{EXP}}\}_{i=1}^{B}$ such that they collectively acquire data which maximally improves the accuracy of a world model. To achieve this, we take inspiration from the information theoretic approach in Plan2Explore [97], but crucially focus on maximizing information gain over entire trajectories rather than per state-action, and hence drop the conditional dependence on state and action (see App. C.1 for why this distinction is important):

$$\pi_{\mathrm{EXP}} = \arg\max_{\pi} \; \mathcal{I}\big(d^{\pi}_{\mathcal{M}_\psi};\, \mathcal{M}_\psi\big) = \mathcal{H}\big(d^{\pi}_{\mathcal{M}_\psi}\big) - \mathcal{H}\big(d^{\pi}_{\mathcal{M}_\psi} \,\big|\, \mathcal{M}_\psi\big) \tag{1}$$
where $d^{\pi}_{\mathcal{M}_\psi}$ is the distribution of states visited by the policy $\pi$ in the imaginary MDP $\mathcal{M}_\psi$. This objective produces $\pi_{\mathrm{EXP}}$, a policy whose visitation distribution has a high entropy when computed over model samples, but has low entropy for each individual MDP model (i.e., high epistemic/reducible uncertainty). We think of each model $\mathcal{M}_\psi$ as sampled from a posterior distribution over models given the data. A good exploration policy has low entropy on individual models but large entropy across models, i.e. it is intent on visiting regions of the space where there is large uncertainty about the model transitions. To make this objective more general, we represent the trajectory data collected by a policy with a “summary” embedding space [79, 72]. Let $\Phi$ be a summary function mapping trajectories $\tau \in \Gamma$ into this embedding space, and let $P^{\Phi}_{\pi}[\mathcal{M}_\psi]$ denote the embedding distribution generated by policy $\pi$ in imaginary MDP $\mathcal{M}_\psi$. We can now write the objective from Eq. 1 as follows:
$$\pi_{\mathrm{EXP}} = \arg\max_{\pi} \; \mathcal{I}\big(P^{\Phi}_{\pi}[\mathcal{M}_\psi];\, \mathcal{M}_\psi\big) = \mathcal{H}\big(P^{\Phi}_{\pi}[\mathcal{M}_\psi]\big) - \mathcal{H}\big(P^{\Phi}_{\pi}[\mathcal{M}_\psi] \,\big|\, \mathcal{M}_\psi\big) \tag{2}$$
This more general framework allows us to consider multiple representations for trajectories, in a similar fashion to behavioral characterizations in Quality Diversity algorithms [85]. For the rest of this discussion, we will use the final state embedding as our summary representation, whereby $\Phi(\tau) = h_H$, since in the case of the RSSM, the final latent is a compact representation of the entire trajectory collected by the policy, analogous to the final hidden state in an RNN [94, 103].
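For intuition, one way to realize this summary is simply to roll a trajectory through the recurrent model and keep its final latent; the `rssm.initial_state`/`rssm.step` interface below is an assumed placeholder, not the DreamerV2 code.

```python
# Illustrative sketch of the summary function Phi(tau) = h_H: embed a trajectory by its
# final recurrent latent state. The RSSM interface here is an assumed placeholder.

def trajectory_embedding(rssm, observations, actions):
    h = rssm.initial_state()
    for obs, act in zip(observations, actions):
        h = rssm.step(h, obs, act)   # fold each transition into the recurrent latent
    return h                         # final latent h_H summarizes the whole trajectory
```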
3.1 A Cascading Objective with Diverse Explorers
We now consider a population-based version of Equation 2, using $B$ agents:

$$\{\pi^{(i)}_{\mathrm{EXP}}\}_{i=1}^{B} = \arg\max_{\pi^B \in \Pi^B} \; \mathcal{I}\Big(\prod_{i=1}^{B} P^{\Phi}_{\pi^{(i)}}[\mathcal{M}_\psi];\, \mathcal{M}_\psi\Big) = \mathcal{H}\Big(\prod_{i=1}^{B} P^{\Phi}_{\pi^{(i)}}[\mathcal{M}_\psi]\Big) - \mathcal{H}\Big(\prod_{i=1}^{B} P^{\Phi}_{\pi^{(i)}}[\mathcal{M}_\psi] \,\Big|\, \mathcal{M}_\psi\Big) \tag{3}$$

where $\pi^B = \big(\pi^{(1)}, \cdots, \pi^{(B)}\big)$ and $\prod_{i=1}^{B} P^{\Phi}_{\pi^{(i)}}[\mathcal{M}_\psi]$ is the product measure of the policies’ embedding distributions in $\mathcal{M}_\psi$. By definition, the conditional entropy factorizes as:

$$\mathcal{H}\Big(\prod_{i=1}^{B} P^{\Phi}_{\pi^{(i)}}[\mathcal{M}_\psi] \,\Big|\, \mathcal{M}_\psi\Big) = \sum_{i=1}^{B} \mathcal{H}\Big(P^{\Phi}_{\pi^{(i)}}[\mathcal{M}_\psi] \,\Big|\, \mathcal{M}_\psi\Big). \tag{4}$$
It is now possible to show that maximum information gain is achieved with a diverse set of agents:
Lemma 1. When all models $\mathcal{M}_\psi$ in the support of the model posterior are deterministic and tabular, and the space of policies $\Pi$ consists only of deterministic policies, there always exists a solution $\{\pi^{(i)}_{\mathrm{EXP}}\}_{i=1}^{B}$ satisfying $\pi^{(i)}_{\mathrm{EXP}} \neq \pi^{(j)}_{\mathrm{EXP}}$ for all $i \neq j$. Moreover, there exists a family of tabular MDP models such that the maximum cannot be achieved by setting $\pi^{(i)}_{\mathrm{EXP}} = \pi$ for a fixed $\pi$.
The proof of Lemma 1 is in Appendix C.1.1. Since the mutual information objective in Eq. 3 is submodular, a greedy algorithm yields a $(1 - \tfrac{1}{e})$ approximation of the optimum (where $e$ is Euler’s number) [73]. Leveraging this insight, let us assume that we already have a set of policies $\pi^{(1)}, \cdots, \pi^{(i-1)}$; we then select the next policy $\pi^{(i)}$ based on the following greedy objective:
$$\begin{aligned} \pi^{(i)} &= \arg\max_{\tilde{\pi}^{(i)} \in \Pi} \; \mathcal{I}\Big(\prod_{j=1}^{i} P^{\Phi}_{\tilde{\pi}^{(j)}}[\mathcal{M}_\psi];\, \mathcal{M}_\psi \,\Big|\, \tilde{\pi}^{(j)} = \pi^{(j)} \;\forall j \leq i-1\Big) \\ &= \mathcal{H}\Big(\prod_{j=1}^{i} P^{\Phi}_{\tilde{\pi}^{(j)}}[\mathcal{M}_\psi] \,\Big|\, \tilde{\pi}^{(j)} = \pi^{(j)} \;\forall j \leq i-1\Big) - \mathcal{H}\Big(\prod_{j=1}^{i} P^{\Phi}_{\tilde{\pi}^{(j)}}[\mathcal{M}_\psi] \,\Big|\, \mathcal{M}_\psi,\, \tilde{\pi}^{(j)} = \pi^{(j)} \;\forall j \leq i-1\Big) \end{aligned}$$

which can be factorized in similar fashion to Equation 4 (see Appendix C).
3.2 A Tractable Objective for Deep RL
Inspired by [97], we make a couple of approximations to derive a tractable objective for $\{\pi^{(i)}_{\mathrm{EXP}}\}_{i=1}^{B}$ in the deep RL setting. First, we assume that the final state embedding distributions are Gaussian with means that depend on the policies and sampled worlds, and variances that depend on the worlds, i.e. $P^{\Phi}_{\pi}[\mathcal{M}_\psi = w] = \mathcal{N}\big(\mu(w, \pi), \Sigma(w)\big)$. In this case, $\mathcal{H}\big(P^{\Phi}_{\pi}[\mathcal{M}_\psi] \mid \mathcal{M}_\psi = w\big) = \rho(w)$, and Eq. 11 reduces to solving $\pi^{(i)} = \arg\max_{\tilde{\pi}^{(i)} \in \Pi} \mathcal{H}\big(\prod_{j=1}^{i} P^{\Phi}_{\tilde{\pi}^{(j)}}[\mathcal{M}_\psi] \,\big|\, \tilde{\pi}^{(j)} = \pi^{(j)} \;\forall j \leq i-1\big)$ for a policy that maximizes the resulting joint entropy of the embedding distribution when added to the policy population. This produces the following surrogate objective, maximizing a quadratic cascading disagreement:
$$\mathrm{PopDiv}_{\Phi}\big(\pi \,\big|\, \pi^{(1)}, \cdots, \pi^{(i-1)}\big) = \mathbb{E}_{\tau \sim P_{\pi}[\mathcal{M}_\psi]}\Bigg[\frac{1}{|\mathcal{D}^{(i-1)}| - 1}\sum_{\tilde{\tau} \in \mathcal{D}^{(i-1)}} \big\|\Phi(\tau) - \Phi(\tilde{\tau})\big\|^2\Bigg]$$

where $\mathcal{D}^{(i-1)}$ is a dataset of imagined trajectories sampled from policies $\pi^{(1)}, \cdots, \pi^{(i-1)}$ in the model, and $\mathrm{PopDiv}$ is short for Population Diversity. Finally, following [97, 9], we also add a per-state information gain component to each policy’s reward to encourage a richer landscape for data acquisition: $\mathrm{InfoGain}(\pi) = \mathbb{E}_{\tau \sim P_{\pi}[\mathcal{M}_\psi]}\big[\sum_{(s,a) \in \tau} \sigma(s, a)\big]$, where $\sigma(\cdot, \cdot)$ is the variance across the ensemble's latent state predictions (for details see App. B).
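To make these two terms concrete, the NumPy sketch below computes a cascading population-diversity bonus and an ensemble-disagreement information-gain bonus; the array shapes, the mean over latent dimensions, and the helper names are assumptions for illustration, not the released CASCADE code.

```python
import numpy as np

# new_embeds:     (N, d)    final-state embeddings Phi(tau) for the candidate policy pi^(i)
# prev_embeds:    (M, d)    embeddings of imagined trajectories from policies pi^(1..i-1)
# ensemble_preds: (K, T, d) next-latent predictions from K ensemble members along one trajectory

def pop_div(new_embeds, prev_embeds):
    # Mean (over the candidate's trajectories) of the summed squared distances to the
    # previous explorers' embeddings, normalized as in the PopDiv definition above.
    diffs = new_embeds[:, None, :] - prev_embeds[None, :, :]   # (N, M, d)
    sq_dists = (diffs ** 2).sum(axis=-1)                       # (N, M)
    denom = max(prev_embeds.shape[0] - 1, 1)
    return sq_dists.sum(axis=1).mean() / denom

def info_gain(ensemble_preds):
    # Per-state disagreement: variance across ensemble members (one choice: averaged
    # over latent dimensions), then summed over the trajectory.
    per_state = ensemble_preds.var(axis=0).mean(axis=-1)       # (T,)
    return per_state.sum()

# The i-th explorer's reward then mixes the two terms with a weight lambda (see Eq. 5 below):
#   reward_i = lam * pop_div(new_embeds, prev_embeds) + (1 - lam) * info_gain(ensemble_preds)
```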
Taken together, these objectives form our approach, which we call Coordinated Active Sample Collection via Diverse Explorers, or CASCADE. CASCADE trains agents to optimize: 1) a diversity term ($\mathrm{PopDiv}$) that takes into account the behaviors of the other agents in the population; 2) an information gain term ($\mathrm{InfoGain}$) that encourages an individual agent to sample states that maximally improve the model:

$$\pi^{(i)} = \arg\max_{\pi \in \Pi} \Big[\lambda \,\mathrm{PopDiv}_{\Phi}\big(\pi \,\big|\, \{\pi^{(j)}_{\mathrm{EXP}}\}_{j=1}^{i-1}\big) + (1 - \lambda)\,\mathrm{InfoGain}(\pi)\Big] \tag{5}$$
where $\lambda$ is a weighting hyperparameter that trades off whether we favor individual model information gain or population diversity. Finally, we train the $B$ agents in parallel using (policy) gradient descent over $\theta$, which makes it possible to achieve the same wall clock time as training a single agent [23].
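A coarse sketch of how one imagination-training step for the population might be organized is shown below, reusing the `pop_div`/`info_gain` helpers above; `imagine_trajectories`, `final_state_embeddings`, `ensemble_predictions`, and `policy_gradient_update` are hypothetical placeholders, and in practice the information gain is assigned per state rather than as a single scalar.

```python
import numpy as np

# Hedged sketch of one population update in imagination. Explorer i's diversity bonus
# cascades over explorers 1..i-1, pushing later explorers away from earlier ones.
# All helpers are placeholder names, not the actual CASCADE implementation.

def train_population_step(world_model, explorers, lam):
    rollouts = [imagine_trajectories(world_model, pi) for pi in explorers]
    for i, pi in enumerate(explorers):
        new_embeds = final_state_embeddings(rollouts[i])
        diversity = 0.0
        if i > 0:
            prev = np.concatenate([final_state_embeddings(r) for r in rollouts[:i]])
            diversity = pop_div(new_embeds, prev)
        curiosity = info_gain(ensemble_predictions(world_model, rollouts[i]))
        reward = lam * diversity + (1.0 - lam) * curiosity       # Eq. 5
        policy_gradient_update(pi, rollouts[i], reward)
    return explorers
```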
3.3 Theoretical Motivation
We now seek to provide a tabular analogue to CASCADE which provides a theoretical grounding for our approach. In App. D we outline the pseudo-code of CASCADE-TS, a greedy Thompson Sampling algorithm [4] that produces the $i$-th exploration policy in a tabular environment using “imaginary” data gathered by running policies $\pi^{(1)}, \cdots, \pi^{(i-1)}$ in the model. We can then show the following:
Lemma 2. For the class of Binary Tree MDPs, the CASCADE-TS algorithm satisfies

$$T(\epsilon, \mathrm{Sequential}) \;\leq\; T(\epsilon, \text{CASCADE-TS}) \;\leq\; T(\epsilon, \mathrm{SinglePolicyBatch})$$

where $T(\epsilon, \cdot)$ is the expected number of rounds of deploying a population of $B$ policies necessary to learn the true model up to $\epsilon$ accuracy; $\mathrm{SinglePolicyBatch}$ plays a fixed policy $B$ times in each round; $\mathrm{Sequential}$ does not have a population, and instead interleaves updates and executions of a single policy $B$ times within each round.
The proof is in App. D. Indeed, we see that CASCADE-TS achieves provable efficiency gains over a naïve sampling approach that does not ensure diversity in its deployed agents. This provable gain is achieved by steering the $i$-th policy away from imaginary state-action pairs sampled by the previous $i-1$ policies, using imaginary counts.
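As a rough illustration of the imaginary-count idea (a sketch only; the actual CASCADE-TS pseudo-code is in App. D / Alg. 2), a count-based bonus in a sampled tabular model could look like this:

```python
import numpy as np

# Hedged sketch of an imaginary-count bonus for a tabular CASCADE-TS-style step.
# imaginary_counts[s, a] tallies how often policies pi^(1..i-1) visited (s, a) in
# imagination; the i-th policy is then planned against a reward that down-weights
# pairs the previous explorers already covered. Illustrative only.

def imaginary_count_bonus(imaginary_counts, scale=1.0):
    return scale / np.sqrt(1.0 + imaginary_counts)   # high where previous explorers were absent

# Example: a 3-state, 2-action model where (s=0, a=0) has already been visited twice.
counts = np.zeros((3, 2))
counts[0, 0] = 2
bonus = imaginary_count_bonus(counts)   # (0, 0) now gets a smaller bonus than unvisited pairs
```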
Now returning to CASCADE, we can see the importance of leveraging the imaginary data gathered in the model by the previous $i-1$ policies when training the $i$-th exploration policy. Concretely, encouraging policy $\pi^{(i)}$ to induce high disagreement with the embeddings produced by $\{\pi^{(1)}, \cdots, \pi^{(i-1)}\}$ (i.e. the $\mathrm{PopDiv}$ term) is analogous to the imaginary count bonus term of CASCADE-TS in Line 10 of Alg. 2, which avoids redundant data collection during deployment.