
In this paper, we argue that learning an accurate value function on multiple training environments is
more challenging than on a single training environment and requires sufficient regularization. We
demonstrate that a value network trained on multiple environments is more prone to memorizing the
training data and fails to generalize to unvisited states within the training environments, which can be
detrimental not only to training performance but also to test performance on unseen environments. In
addition, we find that regularization techniques that penalize large estimates of the value network,
originally developed for preventing memorization in the single-environment setting, are also beneficial
for improving both training and test performance in the multi-environment setting. However, this
benefit comes at the cost of premature convergence, which hinders further performance enhancement.
To address this, we propose a new model-free policy gradient algorithm named Delayed-Critic Policy
Gradient (DCPG), which trains the value network with lower update frequency but with more training
data than the policy network. We find that the value network with delayed updates suffers less from
the memorization problem and significantly improves training and test performance. In addition, we
demonstrate that it provides better state representations to the policy network using a single unified
network architecture, unlike prior methods. Moreover, we introduce a simple self-supervised task
that learns the forward and inverse dynamics of environments using a single discriminator on top of
DCPG. Our algorithms achieve state-of-the-art observational generalization performance and sample
efficiency compared to prior model-free methods on the Procgen benchmark [10].
2 Preliminaries
2.1 Observational Generalization in RL
We consider a collection of environments $\mathcal{M}$ formulated as Markov Decision Processes (MDPs). Each environment $m \in \mathcal{M}$ is described as a tuple $(\mathcal{S}_m, \mathcal{A}, T_m, r_m, \rho_m, \gamma)$, where $\mathcal{S}_m$ is the image-based state space, $\mathcal{A}$ is the action space shared across all environments, $T_m: \mathcal{S}_m \times \mathcal{A} \to P(\mathcal{S}_m)$ is the transition function, $r_m: \mathcal{S}_m \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\rho_m$ is the initial state distribution, and $\gamma \in [0, 1]$ is the discount factor. We assume that the state space exhibits visual variations across environments. While the transition and reward functions are specific to each environment, we assume that they share some common structure across all environments. A policy $\pi: \mathcal{S} \to P(\mathcal{A})$ is trained on a finite number of training environments $\mathcal{M}_{\text{train}} = \{m_i\}_{i=1}^{n}$, where $\mathcal{S}$ is the set of all possible states in $\mathcal{M}$. Our goal is to learn a generalizable policy that maximizes the expected return on unseen test environments $\mathcal{M}_{\text{test}} = \mathcal{M} \setminus \mathcal{M}_{\text{train}}$.
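Concretely, this goal can be summarized as maximizing the expected return over the unseen test environments; the formulation below is a sketch in our own notation (the symbol $J_{\text{test}}$ is introduced here only for illustration):
\[
J_{\text{test}}(\pi) = \mathbb{E}_{m \sim \mathcal{M}_{\text{test}}} \, \mathbb{E}_{s_0 \sim \rho_m, \, a_t \sim \pi(\cdot \mid s_t), \, s_{t+1} \sim T_m(\cdot \mid s_t, a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t \, r_m(s_t, a_t) \right],
\]
while only the environments in $\mathcal{M}_{\text{train}}$ are available for collecting training data.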
In this paper, we utilize the Procgen benchmark as a testbed for observational generalization [10]. It is a collection of 16 video games with high diversity comparable to the ALE benchmark [5]. Each game consists of procedurally generated environment instances, also called levels, with visually different layouts, backgrounds, and game entities (e.g., the spawn locations and times of enemies). The standard evaluation protocol on the Procgen benchmark is to train a policy on a finite set of training levels and evaluate its performance on held-out test levels [10].
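As a concrete illustration of this protocol, a minimal sketch using the publicly released `procgen` package (via its Gym registration) is shown below; the game, level counts, and difficulty mode are illustrative rather than the exact settings used in our experiments.

```python
# Minimal sketch of the Procgen train/test split, assuming the public
# `procgen` package registered through Gym. Level counts are illustrative.
import gym

# Training environments: a finite set of levels, e.g. levels [0, 200).
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=200,            # number of unique training levels
    start_level=0,             # seed offset of the first level
    distribution_mode="easy",
)

# Test environments: unseen levels, approximated by sampling from the
# full (practically unbounded) level distribution via num_levels=0.
test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,              # 0 means "sample from the full level distribution"
    start_level=0,
    distribution_mode="easy",
)

obs = train_env.reset()
obs, reward, done, info = train_env.step(train_env.action_space.sample())
```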
2.2 Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a powerful model-free policy gradient algorithm that learns a policy $\pi_\theta$ and value function $V_\phi$ parameterized by deep neural networks [39]. For training, PPO first collects trajectories $\tau$ using the old policy network $\pi_{\theta_{\text{old}}}$ right before the update. Then, the policy network is trained with the collected trajectories for several epochs to maximize the following clipped surrogate policy objective $J_\pi$, designed to constrain the size of the policy update:
\[
J_\pi(\theta) = \mathbb{E}_{s_t, a_t \sim \tau} \left[ \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t,\; \mathrm{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right],
\]
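For reference, a minimal PyTorch-style sketch of this clipped objective is given below; the function and variable names (`ppo_clip_loss`, `log_probs`, `old_log_probs`, `advantages`) are our own and not part of the original PPO implementation.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective (to be minimized).

    log_probs:     log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t), detached from the graph
    advantages:    advantage estimates A_t
    """
    ratio = torch.exp(log_probs - old_log_probs)  # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing J_pi is equivalent to minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```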
where $\hat{A}_t$ is an estimate of the advantage function at timestep $t$. Concurrently, the value network is trained with the collected trajectories to minimize the following value objective $J_V$:
\[
J_V(\phi) = \mathbb{E}_{s_t \sim \tau} \left[ \frac{1}{2} \left( V_\phi(s_t) - \hat{R}_t \right)^2 \right],
\]
where $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$ is the value function target. The advantage estimates $\hat{A}_t$ are computed from the collected trajectories and the value network via the generalized advantage estimator (GAE) [38].
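As an illustration, a short sketch of how the advantage estimates and value targets can be computed from a collected rollout with GAE is shown below; the hyperparameter values and the truncated-rollout bootstrapping detail are illustrative assumptions, not the exact settings used in our experiments.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.999, gae_lambda=0.95):
    """Compute GAE advantages A_t and value targets R_t = A_t + V(s_t).

    rewards, values, dones: float tensors of shape [T] from a collected rollout
    last_value:             V(s_T) used to bootstrap the truncated rollout
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_nonterminal * next_value - values[t]
        gae = delta + gamma * gae_lambda * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values  # value targets R_t
    return advantages, returns
```

The value objective $J_V$ is then the mean of $\frac{1}{2}(V_\phi(s_t) - \hat{R}_t)^2$ over the rollout, with `returns` playing the role of $\hat{R}_t$.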