
2 Problem Statement
Reinforcement learning (RL) considers training an agent to solve a Markov Decision Process (MDP), represented as a tuple M = (S, A, P, R, ρ, γ), where S and A are the sets of states and actions respectively, P(s′ | s, a) is a probability distribution over next states given the current state and action, R(s, a, s′) → r is a reward function mapping a transition to a scalar reward, ρ is an initial state distribution and γ is a discount factor. A policy π acting in the environment produces a trajectory τ = {s1, a1, . . . , sH, aH} for an episode with horizon H. Since actions in the trajectory are sampled from a policy, the RL problem is to find a policy π that maximizes expected returns in the environment, i.e., π∗ = arg maxπ Eτ∼π[R(τ)].
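For concreteness, taking R(τ) to denote the discounted return of a trajectory (the standard choice given the discount factor γ), the objective can be written out as

\[
R(\tau) = \sum_{t=1}^{H} \gamma^{t-1}\, r_t, \qquad \pi^{\ast} = \arg\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big],
\]

with rt the reward received at timestep t.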
We seek to learn policies that can transfer to any MDP within a family of MDPs. This can be formalized as a Contextual MDP [51], where observations, dynamics and rewards can vary given a context. In this paper we consider settings where only the reward varies; thus, if the test-time context is unknown at training time we must collect data that sufficiently covers the space of possible reward functions. Finally, to facilitate scalability, we operate in the deployment-efficient paradigm [67], whereby policy learning and exploration are completely separate, and during a given deployment we gather a large quantity of data without further policy retraining (cf. online approaches like DER [112], which take multiple gradient steps per exploration timestep in the real environment). Taken together, we consider the reward-free deployment efficiency problem. This differs from previous work as follows: 1) unlike previous deployment efficiency work, our exploration is task agnostic; 2) unlike previous reward-free RL work, we cannot update our exploration policy πEXP during deployment. Thus, the focus of our work is on how to train πEXP offline such that it gathers heterogeneous and informative data which facilitate zero-shot transfer to unknown tasks.
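To illustrate the reward-only contextual structure, the following is a minimal sketch (assuming a Gymnasium-style environment interface) in which dynamics are shared across tasks while the reward is computed from a context resampled each episode; the names ContextualRewardWrapper, sample_context and reward_fn are illustrative, not part of the method.

import gymnasium as gym  # assumed dependency, for illustration only


class ContextualRewardWrapper(gym.Wrapper):
    """Contextual MDP in which only the reward depends on a hidden context."""

    def __init__(self, env, sample_context, reward_fn):
        super().__init__(env)
        self.sample_context = sample_context  # e.g. samples a goal or task id
        self.reward_fn = reward_fn            # maps (context, obs, action, next_obs) -> r
        self.context = None

    def reset(self, **kwargs):
        self.context = self.sample_context()  # unknown to the agent at training time
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        next_obs, _, terminated, truncated, info = self.env.step(action)
        # The dynamics P are untouched; only the reward varies with the context.
        reward = self.reward_fn(self.context, self._last_obs, action, next_obs)
        self._last_obs = next_obs
        return next_obs, reward, terminated, truncated, info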
In this paper we make use of model-based RL (MBRL), where the goal is to learn a model of the environment (or world model [96]) and then use it to subsequently train policies to solve downstream tasks. To do this, the world model needs to approximate both P and R. Typically, the model will be a neural network, parameterized by ψ, hence we denote the approximate dynamics and reward functions as Pψ and Rψ, which produces a new “imaginary” MDP, Mψ = (S, A, Pψ, Rψ, ρ). We focus on Dyna-style MBRL [104], whereby we train a policy (πθ, parameterized by θ) with model-free RL solely using “imagined” transitions inside Mψ. Furthermore, we can train the policy on a single GPU with parallelized rollouts since the simulator is a neural network [54]. The general form of all methods in this paper is shown in Algorithm 1, with the key difference being step 5: we aim to update πEXP in the new imaginary MDP Mψ such that it continues to collect a large, diverse quantity of reward-free data. Note that πEXP need not be a single policy, but could also refer to a collection of policies that we can deploy (either in parallel or in series), such that π ∈ πEXP.
Algorithm 1 Reward-Free Deployment Efficiency via World Models
1: Input: Initial exploration policy πEXP
2: for each deployment do
3: Deploy πEXP to collect a large quantity of reward-free data.
4: Train world model on all existing data.
5: Update πEXP in new imaginary MDP Mψ.
6: end for
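To make Algorithm 1 concrete, the following is a minimal Python sketch of the loop; the callables deploy, fit_world_model, imagine_rollouts and policy_update are assumed placeholders standing in for the actual implementation, and the hyperparameters are arbitrary defaults. Step 5 is carried out purely in imagination, i.e. on transitions sampled from Pψ rather than from the real environment.

# Sketch of Algorithm 1. The callables passed in (deploy, fit_world_model,
# imagine_rollouts, policy_update) are illustrative placeholders, not the paper's API.
def reward_free_deployment_loop(env, pi_exp, deploy, fit_world_model,
                                imagine_rollouts, policy_update,
                                num_deployments=5, steps_per_deployment=100_000,
                                policy_updates=10_000):
    dataset = []  # reward-free transitions (o, a, o') accumulated over all deployments
    world_model = None
    for _ in range(num_deployments):
        # Step 3: deploy the frozen exploration policy; no retraining during deployment.
        dataset += deploy(env, pi_exp, num_steps=steps_per_deployment)

        # Step 4: train the world model (latent dynamics P_psi) on all existing data.
        world_model = fit_world_model(dataset)

        # Step 5: update pi_exp purely inside the imaginary MDP M_psi (Dyna-style),
        # so that the next deployment gathers a large, diverse quantity of
        # reward-free data.
        for _ in range(policy_updates):
            imagined = imagine_rollouts(world_model, pi_exp, horizon=15)
            pi_exp = policy_update(pi_exp, imagined)
    return dataset, world_model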
We focus on learning world models from high dimensional sensory inputs such as pixels [34, 76, 47], where at each timestep we are given access to an observation ot rather than a state st. A series of recent works have shown tremendous success by mapping the observation to a compact latent state zt [39, 38, 40]. In this paper we will make use of the model from DreamerV2 [40], which has been shown to produce highly effective policies in a variety of high dimensional environments. The primary component of DreamerV2 is a Recurrent State Space Model (RSSM) that uses a learned latent state to predict the image reconstruction, reward rt and discount factor γt. Aside from the reward head, all components of the model are trained jointly, in similar fashion to variational autoencoders (VAEs, [50, 92]). For zero-shot evaluation, we follow [97] and only train the reward head at test time, when provided with reward labels for our pre-collected data; this reward head is then used to train a behavior policy offline. Thus, it is critical that our dataset is sufficiently diverse to enable learning novel, unseen behaviors.
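A minimal sketch of this zero-shot protocol is given below, assuming a PyTorch reward head and the same placeholder routines imagine_rollouts and policy_update as before (the names, architecture and hyperparameters are illustrative, not DreamerV2's actual interface): the world model stays frozen, the reward head is regressed on the newly labelled data, and a task policy is then trained offline in imagination.

import torch
import torch.nn as nn

def zero_shot_evaluation(world_model, pi_task, labelled_data, imagine_rollouts,
                         policy_update, latent_dim=256, reward_epochs=100,
                         policy_updates=10_000):
    """Test-time procedure: fit a reward head on labelled latents, then train a task
    policy offline in imagination. The world model itself stays frozen throughout."""
    # A simple reward head mapping latent states to scalar rewards (illustrative architecture).
    reward_head = nn.Sequential(nn.Linear(latent_dim, 256), nn.ELU(), nn.Linear(256, 1))
    optim = torch.optim.Adam(reward_head.parameters(), lr=3e-4)

    # 1) Regress the reward head on the pre-collected data, now labelled with task rewards.
    #    labelled_data is assumed to yield (latents, rewards) tensor batches.
    for _ in range(reward_epochs):
        for latents, rewards in labelled_data:
            loss = ((reward_head(latents).squeeze(-1) - rewards) ** 2).mean()
            optim.zero_grad()
            loss.backward()
            optim.step()

    # 2) Train a behavior policy purely in imagination, using the fitted reward head
    #    (playing the role of R_psi) to label imagined rollouts from the frozen model.
    for _ in range(policy_updates):
        imagined = imagine_rollouts(world_model, pi_task, horizon=15)
        pi_task = policy_update(pi_task, imagined, reward_head)
    return pi_task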