
2 Problem Statement
Reinforcement learning (RL) considers training an agent to solve a Markov Decision Process (MDP), represented as a tuple M = (S, A, P, R, ρ, γ), where S and A are the sets of states and actions respectively, P(s′ | s, a) is a probability distribution over next states given the current state and action, R(s, a, s′) → r is a reward function mapping a transition to a scalar reward, ρ is an initial state distribution and γ is a discount factor. A policy π acting in the environment produces a trajectory τ = {s1, a1, . . . , sH, aH} for an episode with horizon H. Since actions in the trajectory are sampled from a policy, the RL problem is to find a policy π that maximizes expected returns in the environment, i.e., π∗ = arg maxπ Eτ∼π[R(τ)].
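For concreteness, taking R(τ) to denote the discounted return of a trajectory (the standard choice given the discount factor γ), the objective can be written out as

\[
R(\tau) = \sum_{t=1}^{H} \gamma^{t-1}\, r_t, \qquad \pi^{\ast} = \arg\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\big[R(\tau)\big],
\]

with rt the reward received at timestep t.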
We seek to learn policies that can transfer to any MDP within a family of MDPs. This can be formalized as a Contextual MDP [51], where observations, dynamics and rewards can vary given a context. In this paper we consider settings where only the reward varies; thus, if the test-time context is unknown at training time we must collect data that sufficiently covers the space of possible reward functions. Finally, to facilitate scalability, we operate in the deployment-efficient paradigm [67], whereby policy learning and exploration are completely separate, and during a given deployment we gather a large quantity of data without further policy retraining (cf. online approaches like DER [112], which take multiple gradient steps per exploration timestep in the real environment). Taken together, we consider the reward-free deployment efficiency problem. This differs from previous work as follows: 1) unlike previous deployment efficiency work, our exploration is task agnostic; 2) unlike previous reward-free RL work, we cannot update our exploration policy πEXP during deployment. Thus, the focus of our work is on how to train πEXP offline such that it gathers heterogeneous and informative data which facilitate zero-shot transfer to unknown tasks.
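To illustrate the reward-only contextual structure, the following is a minimal sketch (assuming a Gymnasium-style environment interface) in which dynamics are shared across tasks while the reward is computed from a context resampled each episode; the names ContextualRewardWrapper, sample_context and reward_fn are illustrative, not part of the method.

import gymnasium as gym  # assumed dependency, for illustration only


class ContextualRewardWrapper(gym.Wrapper):
    """Contextual MDP in which only the reward depends on a hidden context."""

    def __init__(self, env, sample_context, reward_fn):
        super().__init__(env)
        self.sample_context = sample_context  # e.g. samples a goal or task id
        self.reward_fn = reward_fn            # maps (context, obs, action, next_obs) -> r
        self.context = None

    def reset(self, **kwargs):
        self.context = self.sample_context()  # unknown to the agent at training time
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, action):
        next_obs, _, terminated, truncated, info = self.env.step(action)
        # The dynamics P are untouched; only the reward varies with the context.
        reward = self.reward_fn(self.context, self._last_obs, action, next_obs)
        self._last_obs = next_obs
        return next_obs, reward, terminated, truncated, info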
In this paper we make use of model-based RL (MBRL), where the goal is to learn a model of the environment (or world model [96]) and then use it to subsequently train policies to solve downstream tasks. To do this, the world model needs to approximate both P and R. Typically, the model will be a neural network, parameterized by ψ, hence we denote the approximate dynamics and reward functions as Pψ and Rψ, which produces a new “imaginary” MDP, Mψ = (S, A, Pψ, Rψ, ρ). We focus on Dyna-style MBRL [104], whereby we train a policy (πθ, parameterized by θ) with model-free RL solely using “imagined” transitions inside Mψ. Furthermore, we can train the policy on a single GPU with parallelized rollouts since the simulator is a neural network [54]. The general form of all methods in this paper is shown in Algorithm 1, with the key difference being step 5: we aim to update πEXP in the new imaginary MDP Mψ such that it continues to collect a large, diverse quantity of reward-free data. Note that πEXP need not be a single policy, but could also refer to a collection of policies that we can deploy (either in parallel or in series), such that π ∈ πEXP.
Algorithm 1 Reward-Free Deployment Efficiency via World Models
1: Input: Initial exploration policy πEXP
2: for each deployment do
3: Deploy πEXP to collect a large quantity of reward-free data.
4: Train world model on all existing data.
5: Update πEXP in new imaginary MDP Mψ.
6: end for
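To make Algorithm 1 concrete, the following is a minimal Python sketch of the loop; the callables deploy, fit_world_model, imagine_rollouts and policy_update are assumed placeholders standing in for the actual implementation, and the hyperparameters are arbitrary defaults. Step 5 is carried out purely in imagination, i.e. on transitions sampled from Pψ rather than from the real environment.

# Sketch of Algorithm 1. The callables passed in (deploy, fit_world_model,
# imagine_rollouts, policy_update) are illustrative placeholders, not the paper's API.
def reward_free_deployment_loop(env, pi_exp, deploy, fit_world_model,
                                imagine_rollouts, policy_update,
                                num_deployments=5, steps_per_deployment=100_000,
                                policy_updates=10_000):
    dataset = []  # reward-free transitions (o, a, o') accumulated over all deployments
    world_model = None
    for _ in range(num_deployments):
        # Step 3: deploy the frozen exploration policy; no retraining during deployment.
        dataset += deploy(env, pi_exp, num_steps=steps_per_deployment)

        # Step 4: train the world model (latent dynamics P_psi) on all existing data.
        world_model = fit_world_model(dataset)

        # Step 5: update pi_exp purely inside the imaginary MDP M_psi (Dyna-style),
        # so that the next deployment gathers a large, diverse quantity of
        # reward-free data.
        for _ in range(policy_updates):
            imagined = imagine_rollouts(world_model, pi_exp, horizon=15)
            pi_exp = policy_update(pi_exp, imagined)
    return dataset, world_model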
We focus on learning world models from high dimensional sensory inputs such as pixels [34, 76, 47], where at each timestep we are given access to an observation ot rather than a state st. A series of recent works have shown tremendous success by mapping the observation to a compact latent state zt [39, 38, 40]. In this paper we will make use of the model from DreamerV2 [40], which has been shown to produce highly effective policies in a variety of high dimensional environments. The primary component of DreamerV2 is a Recurrent State Space Model (RSSM) that uses a learned latent state to predict the image reconstruction, reward rt and discount factor γt. Aside from the reward head, all components of the model are trained jointly, in similar fashion to variational autoencoders (VAEs, [50, 92]). For zero-shot evaluation, we follow [97] and only train the reward head at test time, when provided with reward labels for our pre-collected data; this reward head is then used to train a behavior policy offline. Thus, it is critical that our dataset is sufficiently diverse to enable learning novel, unseen behaviors.
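A minimal sketch of this zero-shot protocol is given below, assuming a PyTorch reward head and the same placeholder routines imagine_rollouts and policy_update as before (the names, architecture and hyperparameters are illustrative, not DreamerV2's actual interface): the world model stays frozen, the reward head is regressed on the newly labelled data, and a task policy is then trained offline in imagination.

import torch
import torch.nn as nn

def zero_shot_evaluation(world_model, pi_task, labelled_data, imagine_rollouts,
                         policy_update, latent_dim=256, reward_epochs=100,
                         policy_updates=10_000):
    """Test-time procedure: fit a reward head on labelled latents, then train a task
    policy offline in imagination. The world model itself stays frozen throughout."""
    # A simple reward head mapping latent states to scalar rewards (illustrative architecture).
    reward_head = nn.Sequential(nn.Linear(latent_dim, 256), nn.ELU(), nn.Linear(256, 1))
    optim = torch.optim.Adam(reward_head.parameters(), lr=3e-4)

    # 1) Regress the reward head on the pre-collected data, now labelled with task rewards.
    #    labelled_data is assumed to yield (latents, rewards) tensor batches.
    for _ in range(reward_epochs):
        for latents, rewards in labelled_data:
            loss = ((reward_head(latents).squeeze(-1) - rewards) ** 2).mean()
            optim.zero_grad()
            loss.backward()
            optim.step()

    # 2) Train a behavior policy purely in imagination, using the fitted reward head
    #    (playing the role of R_psi) to label imagined rollouts from the frozen model.
    for _ in range(policy_updates):
        imagined = imagine_rollouts(world_model, pi_task, horizon=15)
        pi_task = policy_update(pi_task, imagined, reward_head)
    return pi_task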