
In this paper, we argue that learning an accurate value function on multiple training environments is
more challenging than on a single training environment and requires sufficient regularization. We
demonstrate that a value network trained on multiple environments is more prone to memorizing the
training data and fails to generalize to unvisited states within the training environments, which can be
detrimental not only to training performance but also to test performance on unseen environments. In
addition, we find that regularization techniques that penalize large estimates of the value network,
originally developed for preventing memorization in the single-environment setting, are also beneficial
for improving both training and test performance in the multi-environment setting. However, this
benefit comes at the cost of premature convergence, which hinders further performance enhancement.
To address this, we propose a new model-free policy gradient algorithm named Delayed-Critic Policy
Gradient (DCPG), which trains the value network with lower update frequency but with more training
data than the policy network. We find that the value network with delayed updates suffers less from
the memorization problem and significantly improves training and test performance. In addition, we
demonstrate that it provides better state representations to the policy network using a single unified
network architecture, unlike prior methods. Moreover, we introduce a simple self-supervised task
that learns the forward and inverse dynamics of environments using a single discriminator on top of
DCPG. Our algorithms achieve state-of-the-art observational generalization performance and sample
efficiency compared to prior model-free methods on the Procgen benchmark [10].
2 Preliminaries
2.1 Observational Generalization in RL
We consider a collection of environments $\mathcal{M}$ formulated as Markov Decision Processes (MDPs). Each environment $m \in \mathcal{M}$ is described as a tuple $(\mathcal{S}_m, \mathcal{A}, T_m, r_m, \rho_m, \gamma)$, where $\mathcal{S}_m$ is the image-based state space, $\mathcal{A}$ is the action space shared across all environments, $T_m: \mathcal{S}_m \times \mathcal{A} \to P(\mathcal{S}_m)$ is the transition function, $r_m: \mathcal{S}_m \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\rho_m$ is the initial state distribution, and $\gamma \in [0, 1]$ is the discount factor. We assume that the state space exhibits visual variations across environments. While the transition and reward functions are specific to each environment, we assume that they share some common structure across all environments. A policy $\pi: \mathcal{S} \to P(\mathcal{A})$ is trained on a finite number of training environments $\mathcal{M}_{\text{train}} = \{m_i\}_{i=1}^{n}$, where $\mathcal{S}$ is the set of all possible states in $\mathcal{M}$. Our goal is to learn a generalizable policy that maximizes the expected return on unseen test environments $\mathcal{M}_{\text{test}} = \mathcal{M} \setminus \mathcal{M}_{\text{train}}$.
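Concretely, this goal can be summarized as maximizing the expected return over the unseen test environments; the formulation below is a sketch in our own notation (the symbol $J_{\text{test}}$ is introduced here only for illustration):
\[
J_{\text{test}}(\pi) = \mathbb{E}_{m \sim \mathcal{M}_{\text{test}}} \, \mathbb{E}_{s_0 \sim \rho_m, \, a_t \sim \pi(\cdot \mid s_t), \, s_{t+1} \sim T_m(\cdot \mid s_t, a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t \, r_m(s_t, a_t) \right],
\]
while only the environments in $\mathcal{M}_{\text{train}}$ are available for collecting training data.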
In this paper, we utilize the Procgen benchmark as a testbed for observational generalization [10]. It is a collection of 16 video games with high diversity comparable to the ALE benchmark [5]. Each game consists of procedurally generated environment instances, also called levels, with visually different layouts, backgrounds, and game entities (e.g., the spawn locations and times of enemies). The standard evaluation protocol on the Procgen benchmark is to train a policy on a finite set of training levels and evaluate its performance on held-out test levels [10].
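As a concrete illustration of this protocol, a minimal sketch using the publicly released `procgen` package (via its Gym registration) is shown below; the game, level counts, and difficulty mode are illustrative rather than the exact settings used in our experiments.

```python
# Minimal sketch of the Procgen train/test split, assuming the public
# `procgen` package registered through Gym. Level counts are illustrative.
import gym

# Training environments: a finite set of levels, e.g. levels [0, 200).
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=200,            # number of unique training levels
    start_level=0,             # seed offset of the first level
    distribution_mode="easy",
)

# Test environments: unseen levels, approximated by sampling from the
# full (practically unbounded) level distribution via num_levels=0.
test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,              # 0 means "sample from the full level distribution"
    start_level=0,
    distribution_mode="easy",
)

obs = train_env.reset()
obs, reward, done, info = train_env.step(train_env.action_space.sample())
```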
2.2 Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a powerful model-free policy gradient algorithm that learns a policy $\pi_\theta$ and value function $V_\phi$ parameterized by deep neural networks [39]. For training, PPO first collects trajectories $\tau$ using the old policy network $\pi_{\theta_{\text{old}}}$ right before the update. Then, the policy network is trained with the collected trajectories for several epochs to maximize the following clipped surrogate policy objective $J_\pi$, designed to constrain the size of the policy update:
\[
J_\pi(\theta) = \mathbb{E}_{s_t, a_t \sim \tau} \left[ \min\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t,\; \mathrm{clip}\left( \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}_t \right) \right],
\]
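For reference, a minimal PyTorch-style sketch of this clipped objective is given below; the function and variable names (`ppo_clip_loss`, `log_probs`, `old_log_probs`, `advantages`) are our own and not part of the original PPO implementation.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective (to be minimized).

    log_probs:     log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t), detached from the graph
    advantages:    advantage estimates A_t
    """
    ratio = torch.exp(log_probs - old_log_probs)  # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing J_pi is equivalent to minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```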
where $\hat{A}_t$ is an estimate of the advantage function at timestep $t$. Concurrently, the value network is trained with the collected trajectories to minimize the following value objective $J_V$:
\[
J_V(\phi) = \mathbb{E}_{s_t \sim \tau} \left[ \frac{1}{2} \left( V_\phi(s_t) - \hat{R}_t \right)^2 \right],
\]
where $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$ is the value function target. The advantage estimates $\hat{A}_t$ are computed from the collected trajectories and the value network via the generalized advantage estimator (GAE) [38].
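As an illustration, a short sketch of how the advantage estimates and value targets can be computed from a collected rollout with GAE is shown below; the hyperparameter values and the truncated-rollout bootstrapping detail are illustrative assumptions, not the exact settings used in our experiments.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.999, gae_lambda=0.95):
    """Compute GAE advantages A_t and value targets R_t = A_t + V(s_t).

    rewards, values, dones: float tensors of shape [T] from a collected rollout
    last_value:             V(s_T) used to bootstrap the truncated rollout
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_nonterminal * next_value - values[t]
        gae = delta + gamma * gae_lambda * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values  # value targets R_t
    return advantages, returns
```

The value objective $J_V$ is then the mean of $\frac{1}{2}(V_\phi(s_t) - \hat{R}_t)^2$ over the rollout, with `returns` playing the role of $\hat{R}_t$.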