Rethinking Value Function Learning for
Generalization in Reinforcement Learning
Seungyong Moon1,2, JunYeong Lee1,2, Hyun Oh Song1,2,3
1Seoul National University, 2Neural Processing Research Center, 3DeepMetrics
{symoon11,mascheroni99,hyunoh}@mllab.snu.ac.kr
Abstract
Our work focuses on training RL agents on multiple visually diverse environments
to improve observational generalization performance. In prior methods, policy and
value networks are separately optimized using a disjoint network architecture to
avoid interference and obtain a more accurate value function. We identify that a
value network in the multi-environment setting is more challenging to optimize and
prone to memorizing the training data than in the conventional single-environment
setting. In addition, we find that appropriate regularization on the value network is
necessary to improve both training and test performance. To this end, we propose
Delayed-Critic Policy Gradient (DCPG), a policy gradient algorithm that implicitly
penalizes value estimates by optimizing the value network less frequently with
more training data than the policy network. This can be implemented using a single
unified network architecture. Furthermore, we introduce a simple self-supervised
task that learns the forward and inverse dynamics of environments using a single
discriminator, which can be jointly optimized with the value network. Our proposed
algorithms significantly improve observational generalization performance and
sample efficiency on the Procgen Benchmark.
1 Introduction
In recent years, deep reinforcement learning (RL) has achieved remarkable success in various domains, such as robotic control and games [31, 21, 37]. To apply RL algorithms to more practical scenarios, such as autonomous vehicles or healthcare systems, they must be robust to the non-stationarity of real-world environments and capable of performing well in unseen situations at deployment time. However, current state-of-the-art RL algorithms often fail to generalize to unseen test environments with visual variations, even if they achieve high performance in their training environments [16, 46, 9].
This problem is referred to as observational overfitting [41].
Training RL agents on a finite number of visually diverse environments and testing them on unseen environments is the standard protocol for evaluating observational generalization in RL [10]. Several methods have attempted to improve generalization in this framework by adopting regularization techniques that originate from supervised learning or by training robust state representations via self-supervised learning [9, 22, 35, 30]. However, these methods have mainly focused on developing new auxiliary objectives on top of existing RL algorithms intended for the conventional single-environment setting, such as PPO [39]. Some recent works have investigated the interference between policy and value function optimization arising from the multiple training environments and proposed new training schemes that decouple the policy and value network training with a separate network architecture to obtain an accurate value function [11, 34].
Corresponding author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
In this paper, we argue that learning an accurate value function on multiple training environments is more challenging than on a single training environment and requires sufficient regularization. We demonstrate that a value network trained on multiple environments is more likely to memorize the training data and cannot generalize to unvisited states within the training environments, which is detrimental not only to training performance but also to test performance on unseen environments. In addition, we find that regularization techniques that penalize large estimates of the value network, originally developed to prevent memorization in the single-environment setting, are also beneficial for improving both training and test performance in the multi-environment setting. However, this benefit comes at the cost of premature convergence, which hinders further performance improvement.
To address this, we propose a new model-free policy gradient algorithm named Delayed-Critic Policy
Gradient (DCPG), which trains the value network with lower update frequency but with more training
data than the policy network. We find that the value network with delayed updates suffers less from
the memorization problem and significantly improves training and test performance. In addition, we
demonstrate that it provides better state representations to the policy network using a single unified
network architecture, unlike the prior methods. Moreover, we introduce a simple self-supervised task
that learns the forward and inverse dynamics of environments using a single discriminator on top of
DCPG. Our algorithms achieve state-of-the-art observational generalization performance and sample
efficiency compared to prior model-free methods on the Procgen benchmark [10].
2 Preliminaries
2.1 Observational Generalization in RL
We consider a collection of environments $\mathcal{M}$ formulated as Markov Decision Processes (MDPs). Each environment $m \in \mathcal{M}$ is described as a tuple $(\mathcal{S}_m, \mathcal{A}, T_m, r_m, \rho_m, \gamma)$, where $\mathcal{S}_m$ is the image-based state space, $\mathcal{A}$ is the action space shared across all environments, $T_m : \mathcal{S}_m \times \mathcal{A} \to \mathcal{P}(\mathcal{S}_m)$ is the transition function, $r_m : \mathcal{S}_m \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\rho_m$ is the initial state distribution, and $\gamma \in [0, 1]$ is the discount factor. We assume that the state space has visual variations between different environments. While the transition and reward functions are defined specific to each environment, we assume that they exhibit some common structures across all environments. A policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ is trained on a finite number of training environments $\mathcal{M}_{\text{train}} = \{m_i\}_{i=1}^{n}$, where $\mathcal{S}$ is the set of all possible states in $\mathcal{M}$. Our goal is to learn a generalizable policy that maximizes the expected return on unseen test environments $\mathcal{M}_{\text{test}} = \mathcal{M} \setminus \mathcal{M}_{\text{train}}$.
In this paper, we utilize the Procgen benchmark as a testbed for observational generalization [10]. It is a collection of 16 video games with high diversity comparable to the ALE benchmark [5]. Each game consists of procedurally generated environment instances with visually different layouts, backgrounds, and game entities (e.g., the spawn locations and times for enemies), also called levels. The standard evaluation protocol on the Procgen benchmark is to train a policy on a finite set of training levels and evaluate its performance on held-out test levels [10].
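For concreteness, the sketch below shows how such a level split can be constructed with the publicly released `procgen` package; the game name, level counts, and keyword arguments follow the package's documented gym registration and are illustrative choices rather than details taken from this paper (the classic `gym` reset/step API is assumed).

```python
import gym

# Training environments: a finite set of 200 procedurally generated levels.
train_env = gym.make(
    "procgen:procgen-bigfish-v0",
    num_levels=200,          # restrict level generation to 200 fixed level seeds
    start_level=0,
    distribution_mode="easy",
)

# Test environments: the full (unrestricted) level distribution, so that
# held-out layouts, backgrounds, and entities appear at evaluation time.
test_env = gym.make(
    "procgen:procgen-bigfish-v0",
    num_levels=0,            # 0 means "sample from the entire level distribution"
    start_level=0,
    distribution_mode="easy",
)

obs = train_env.reset()
obs, reward, done, info = train_env.step(train_env.action_space.sample())
```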
2.2 Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a powerful model-free policy gradient algorithm that learns a policy $\pi_\theta$ and a value function $V_\phi$ parameterized by deep neural networks [39]. For training, PPO first collects trajectories $\tau$ using the old policy network $\pi_{\theta_{\text{old}}}$ right before the update. Then, the policy network is trained with the collected trajectories for several epochs to maximize the following clipped surrogate policy objective $J_\pi$, designed to constrain the size of the policy update:
$$J_\pi(\theta) = \mathbb{E}_{s_t, a_t \sim \tau}\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t,\ \operatorname{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}_t\right)\right],$$
where $\hat{A}_t$ is an estimate of the advantage function at timestep $t$. Concurrently, the value network is trained with the collected trajectories to minimize the following value objective $J_V$:
$$J_V(\phi) = \mathbb{E}_{s_t \sim \tau}\left[\frac{1}{2}\left(V_\phi(s_t) - \hat{R}_t\right)^2\right],$$
where $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$ is the value function target, and the advantage estimates are computed via the generalized advantage estimator (GAE) [38].
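As a reference for the two objectives above, the following PyTorch sketch computes the clipped surrogate policy loss and the value loss for a mini-batch; the tensor names and the clipping coefficient $\epsilon = 0.2$ are illustrative defaults, not values taken from this paper.

```python
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages, values, value_targets, clip_eps=0.2):
    """Clipped surrogate policy loss (negated J_pi for minimization) and value loss J_V."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()         # minimize -J_pi, i.e. maximize J_pi
    value_loss = 0.5 * (values - value_targets).pow(2).mean()   # J_V
    return policy_loss, value_loss
```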
Figure 1: Network architectures for (a) PPO, (b) PPG, and (c) DDCPG. The objectives $J_\pi$, $J_V$, $J_{\text{aux}}$, and $J_f$ denote the policy, value, auxiliary value, and dynamics objectives, respectively. The regularizers $C_\pi$ and $C_V$ denote the policy and value regularizers, respectively. The blue and red terms represent optimization problems during the policy and auxiliary phases, respectively.
In practice, the policy and value networks are jointly optimized with shared parameters (i.e., $\theta = \phi$), especially in image-based RL [14, 44]. For example, they can be implemented as a shared encoder followed by separate linear heads, as shown in Figure 1a. Sharing parameters is advantageous in that representations learned by each objective can benefit the other; it also reduces memory costs and accelerates training. However, a shared network architecture complicates the optimization, as a single encoder must be optimized over multiple objectives whose gradients may have varying scales and directions. It also constrains the policy and value networks to be optimized under the same training hyperparameter setting, such as batch size and number of epochs, severely limiting the flexibility of PPO.
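The sketch below illustrates such a shared architecture in PyTorch; the encoder is a small placeholder CNN rather than the encoder actually used in the paper, and all layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Shared image encoder followed by separate linear policy and value heads (cf. Figure 1a)."""

    def __init__(self, num_actions, feature_dim=256):
        super().__init__()
        # Placeholder encoder; the actual architecture on Procgen is a larger CNN.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(feature_dim), nn.ReLU(),
        )
        self.policy_head = nn.Linear(feature_dim, num_actions)  # logits of pi_theta(.|s)
        self.value_head = nn.Linear(feature_dim, 1)             # V_theta(s)

    def forward(self, obs):
        features = self.encoder(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)
```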
2.3 Phasic Policy Gradient
Phasic Policy Gradient (PPG) is an algorithm built upon PPO that significantly improves observational generalization by addressing the problems of sharing parameters [11]. More specifically, PPG employs separate encoders for the policy and value networks, as shown in Figure 1b. In addition, it introduces an auxiliary value head $V_\theta$ on top of the policy encoder in order to distill useful representations from the value network into the encoder. For training, PPG alternates between policy and auxiliary phases. During the policy phase, which is repeated $N_\pi$ times, the policy and value networks are trained with newly collected trajectories to optimize the policy and value objectives from PPO, respectively. Then, all states and value function targets in the trajectories are stored in a buffer $\mathcal{B}$. During the auxiliary phase, the auxiliary value head and the policy network are jointly trained with all data in the buffer to optimize the following auxiliary value objective $J_{\text{aux}}$ and policy regularizer $C_\pi$:
$$J_{\text{aux}}(\theta) = \mathbb{E}_{s_t \sim \mathcal{B}}\left[\frac{1}{2}\left(V_\theta(s_t) - \hat{R}_t\right)^2\right], \quad C_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{B}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\ \|\ \pi_\theta(\cdot \mid s_t)\right)\right],$$
where $\pi_{\theta_{\text{old}}}$ is the policy network right before the auxiliary phase and $D_{\mathrm{KL}}$ denotes the KL divergence. In other words, the value network is distilled into the policy encoder while keeping the outputs of the policy network unchanged. Moreover, the value network is additionally trained with all data in the buffer to optimize the value objective from PPO, yielding a more accurate value function. It is worth noting that the amount of training data in the auxiliary phase is $N_\pi$ times larger than in the policy phase. It has been claimed that distilling a better-trained value network with a separate architecture, together with the additional training for a more accurate value function, can improve observational generalization performance and sample efficiency [11].
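To make the auxiliary phase concrete, the sketch below forms the joint auxiliary objective for one buffer mini-batch, assuming a policy network that returns both action logits and the auxiliary value head output; the weighting coefficient `beta_clone` and the function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ppg_auxiliary_loss(policy_net, obs, value_targets, old_logits, beta_clone=1.0):
    """Auxiliary value distillation J_aux plus the policy regularizer C_pi."""
    logits, aux_values = policy_net(obs)            # policy logits and auxiliary value head
    aux_value_loss = 0.5 * (aux_values - value_targets).pow(2).mean()   # J_aux
    # C_pi: KL(pi_old || pi_theta), keeping the policy outputs unchanged during distillation.
    kl = F.kl_div(
        F.log_softmax(logits, dim=-1),              # log pi_theta
        F.log_softmax(old_logits, dim=-1),          # log pi_old (target distribution)
        log_target=True,
        reduction="batchmean",
    )
    return aux_value_loss + beta_clone * kl
```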
Figure 2: Average stiffness of the value networks for PPG and DCPG on 4 Procgen games (BigFish, Chaser, Climber, and StarPilot) while varying the number of training levels (1, 2, 10, 50, and 200).
3 Motivation
3.1 Difficulty of Training Value Network on Multiple Training Environments
We begin by investigating the difficulty of obtaining an accurate value network across multiple training environments. Indeed, learning a value network that better approximates the true value function on the given training environments can result in improved training performance [42]. However, even in a simple setting where an agent is trained on a single environment, it has been shown that a value network is likely to memorize the training data and fails to extrapolate well to unseen states even within the same training environment [26, 23, 12, 17]. This problem can be exacerbated when the number of training environments increases: intuitively, given a fixed number of environment steps, the value network is provided fewer training samples per environment and must rely more on memorization.

To corroborate this claim, we measure the stiffness of the value network between states $(s, s')$ while varying the number of training environments [18, 6], which is defined by
$$\rho(s, s') = \frac{\nabla_\phi J_V(\phi; s)^\top \nabla_\phi J_V(\phi; s')}{\|\nabla_\phi J_V(\phi; s)\|_2 \, \|\nabla_\phi J_V(\phi; s')\|_2}.$$
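The following sketch shows one way this quantity could be estimated for a pair of states, using the per-state value objective $J_V$ from Section 2.2; `value_net` and the function names are placeholders, not code from the paper.

```python
import torch

def value_grad(value_net, state, value_target):
    """Flattened gradient of the per-state value objective J_V(phi; s)."""
    loss = 0.5 * (value_net(state) - value_target).pow(2).sum()
    grads = torch.autograd.grad(loss, list(value_net.parameters()))
    return torch.cat([g.flatten() for g in grads])

def stiffness(value_net, s1, target1, s2, target2, eps=1e-8):
    """Cosine similarity between the value-loss gradients at two states (the stiffness rho)."""
    g1 = value_grad(value_net, s1, target1)
    g2 = value_grad(value_net, s2, target2)
    return torch.dot(g1, g2) / (g1.norm() * g2.norm() + eps)
```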
Low stiffness indicates that updating the network parameters to minimize the value objective for one state has a negative effect on minimizing the value objective for other states [6]. That is, the value network is less able to adjust its parameters to predict the true value function across different states and instead tends to memorize only the states it has encountered. More specifically, we train PPG agents on the Procgen games while increasing the number of training levels from 1 to 200 and compute the average stiffness across all state pairs in a mini-batch of size $2^{14}$ (= 16,384) throughout training. The detailed experimental settings and results can be found in Appendix A.

The green lines in Figure 2 show that the stiffness of the value network decreases as the number of training environments increases, as expected. This implies that a value network trained on multiple environments is more likely to memorize the training data and cannot accurately predict the values of unvisited states within the training environments. This memorization problem motivates us to train the value network with sufficient regularization.
3.2 Training Value Network with Explicit Regularization
Next, we examine the effectiveness of value network regularization in the multi-environment setting. We consider applying two existing regularization techniques developed to prevent the memorization problem in the single-environment setting, especially when training data is limited. The first method is discount regularization (DR), which trains a myopic value network with a lower discount factor $\gamma'$ [33]. The second method is activation regularization (AR), which optimizes the value network with an $L_2$ penalty on its outputs:
$$J_V^{\text{reg}}(\phi) = \mathbb{E}_{s_t \sim \tau}\left[\frac{1}{2}\left(V_\phi(s_t) - \hat{R}_t\right)^2 + \frac{\alpha}{2} V_\phi(s_t)^2\right],$$
where $\alpha > 0$ is the regularization coefficient [3].
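A minimal sketch of how the two regularizers could be applied is given below: DR only changes the discount factor used when computing value targets, while AR adds the $L_2$ penalty above to the value loss. The default coefficient mirrors the $\alpha = 0.05$ used in the experiments; the function name is illustrative.

```python
import torch

def ar_value_loss(values, value_targets, alpha=0.05):
    """Value objective with activation regularization: J_V plus an L2 penalty on V_phi(s)."""
    td_loss = 0.5 * (values - value_targets).pow(2)
    activation_penalty = 0.5 * alpha * values.pow(2)
    return (td_loss + activation_penalty).mean()

# Discount regularization (DR) needs no extra loss term: the value targets (and GAE
# advantages) are simply computed with a lower discount, e.g. gamma' = 0.995 instead
# of gamma = 0.999.
```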
Figure 3: (a) Training and test performance curves (average return vs. environment steps, in units of $10^6$) of PPG, PPG+DR, and PPG+AR on BigFish. (b) True and predicted values measured at the initial states of the training environments for PPG+DR and PPG+AR on BigFish. The mean is computed over 10 different runs.
We train PPG agents with each of these two methods using 200 training levels on the Procgen games. We reduce the discount factor from $\gamma = 0.999$ to $\gamma' = 0.995$ for PPG+DR and use $\alpha = 0.05$ for PPG+AR. We measure the average training and test returns to evaluate the training performance and its transferability to unseen test environments.

As shown in Figure 3a, value network regularization improves the training and test performance of PPG on BigFish to some extent. This implies that explicitly suppressing the value network also helps mitigate memorization in the multi-environment setting. We also observe that these regularization methods improve the training and test performance across all Procgen games on average. The detailed experimental settings and results can be found in Appendix B.
Despite its effectiveness, explicit value network regularization can lead to a suboptimal solution as the number of environment steps increases. Figure 3b shows the true and predicted values measured at the initial states of the training environments for PPG+DR and PPG+AR on BigFish. The predicted values under explicit regularization reach a plateau too quickly, suggesting that excessive regularization eventually hinders the value network from learning an accurate value function. This motivates us to develop a more flexible regularization method that boosts training and test performance while still allowing the value network to converge to the true values.
4 Delayed-Critic Policy Gradient
In this section, we present a novel model-free policy gradient algorithm called Delayed-Critic Policy
Gradient (DCPG), which effectively addresses the memorization problem of the value network in a
simple and flexible manner. The key idea is that the value network should be optimized with a larger
amount of training data to avoid memorizing a small number of recently visited states, based on the
stiffness analysis in Section 3.1. Furthermore, the value network should be optimized with a delay
compared to the policy network to implicitly suppress the value estimate, based on the regularization
analysis in Section 3.2.
4.1 Algorithm
DCPG follows a similar procedure to PPG by alternating policy and auxiliary phases. Still, it employs
a shared network architecture in the same way as PPO and does not require any additional auxiliary
head, as shown in Figure 1c. During the policy phase, which occurs more frequently but with less
training data than the auxiliary phase, the policy network is trained with newly-collected trajectories
to optimize the policy objective from PPO. In contrast, the value network is constrained to preserve
its outputs by optimizing the following value regularizer CV:
CV(θ) = Estτ1
2(Vθ(st)Vθold (st))2,
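A minimal sketch of this policy-phase value regularizer is shown below, assuming that the value outputs $V_{\theta_{\text{old}}}(s_t)$ were recorded when the trajectories were collected; the function name is illustrative.

```python
import torch

def dcpg_value_regularizer(values, old_values):
    """C_V: keep the current value outputs close to those recorded before the update."""
    return 0.5 * (values - old_values.detach()).pow(2).mean()
```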