
The corollary above is a specific case of Theorem 3.3. When a contextual bandit problem is assumed, the covariances are equal to zero and the optimality condition is vastly simplified. As follows from the definition of variance, the LHS of Equation 9 is greater than or equal to $0$. However, the RHS becomes negative when $\delta N > 1$. Since $N \geq 1$, it follows that MA is never optimal for bandits if $\delta \geq 1$ (i.e. the cost of acquiring an additional action sample is equal to or greater than the cost of acquiring an additional state sample). Whereas the efficiency of MA for contextual bandits is restricted, Theorem 3.3 shows that MA can be a preferable strategy for gradient estimation in MDPs. We leave researching the optimality condition for settings with sampled Q-values or deterministic policy gradients for future work.
4. Model-Based Many-Actions SPG
Given a fixed number of interactions with the environment, our theoretical analysis relates to two notions in on-policy SPG algorithms: achieving better-quality gradients through MA via a Q-network (QMA) (Asadi et al., 2017; Petit et al., 2019; Ciosek & Whiteson, 2020); and achieving better-quality gradients by simulating additional transitions via a dynamics model in model-based SPG (MB-SPG) (Janner et al., 2019). Building on these theoretical insights, we propose Model-Based Many-Actions (MBMA), an approach that bridges the two themes described above. MBMA leverages a learned dynamics model in the context of MA-SPG. As such, MBMA allows for MA estimation by calculating Q-values of additional action samples via a critic-bootstrapped trajectory simulated within a dynamics model consisting of transition and reward networks (Ha & Schmidhuber, 2018; Hafner et al., 2019; Kaiser et al., 2019; Gelada et al., 2019; Schrittwieser et al., 2020), which we explain in Appendix E. MBMA can be used in conjunction with any on-policy SPG algorithm.
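As a concrete illustration, the following Python sketch shows one way such a critic-bootstrapped, model-based Q-value estimate could feed a many-actions estimator at a real state. It is a minimal sketch, not the paper's implementation: the interfaces `transition_net`, `reward_net`, `critic`, `policy`, and the horizon are assumed placeholders.

```python
# Minimal MBMA-style sketch (assumed interfaces, not the paper's code).
# transition_net(s, a) -> next state, reward_net(s, a) -> reward,
# critic(s) -> state value, policy(s) -> sampled action are placeholder callables.

def simulate_q_value(s, a, transition_net, reward_net, policy, critic,
                     horizon=5, gamma=0.99):
    """Estimate Q(s, a) by unrolling the learned dynamics model for `horizon`
    steps and bootstrapping the tail of the return with the critic."""
    q, discount = 0.0, 1.0
    state, action = s, a
    for _ in range(horizon):
        q += discount * reward_net(state, action)   # simulated reward
        state = transition_net(state, action)       # simulated next state
        action = policy(state)                      # on-policy action inside the model
        discount *= gamma
    return q + discount * critic(state)             # critic bootstrap at the last simulated state


def mbma_policy_gradient(s, policy, log_prob_grad, q_sim, n_actions=4):
    """Many-actions SPG estimate at a *real* state s: average the score-weighted
    simulated Q-values over several sampled actions (MA variance reduction)."""
    grads = []
    for _ in range(n_actions):
        a = policy(s)
        grads.append(log_prob_grad(s, a) * q_sim(s, a))  # f_s * Q_hat(s, a)
    return sum(grads) / n_actions
```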
4.1. MBMA and MA-SPG
In contrast to existing implementations of MA-SPG, MBMA does not require a Q-network for MA estimation. Using a Q-network to approximate the Q-values of additional action samples introduces bias. Whereas this bias can theoretically be reduced to zero, the conditions required for such bias annihilation are unrealistic (Petit et al., 2019). A Q-network learns a non-stationary target (Van Hasselt et al., 2016) that depends on the current policy. Furthermore, generating informative samples for multiple actions is challenging given single-action supervision. This results in unstable training when a Q-network is used to bootstrap the policy gradient (Mnih et al., 2015; Van Hasselt et al., 2016; Gu et al., 2017; Haarnoja et al., 2018). The advantage of MBMA compared to QMA is that both the reward and transition networks learn stationary targets throughout training, thus offering better convergence properties and lower bias. Such bias reduction comes at the cost of additional computation: whereas QMA approximates Q-values within a single forward pass, MBMA sequentially unrolls the dynamics model for a fixed number of steps.
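For contrast with the MBMA sketch above, a QMA-style estimate of the same quantity amounts to a single network call per additional action sample (again a hedged sketch; `q_network` is an assumed placeholder interface):

```python
def qma_q_value(s, a, q_network):
    """QMA-style estimate: one forward pass through a learned Q-network per
    additional action sample."""
    return q_network(s, a)

# Rough per-action-sample compute under the assumptions of these sketches:
#   QMA  : 1 Q-network call.
#   MBMA : `horizon` transition-network calls + `horizon` reward-network calls
#          + 1 critic call (see simulate_q_value), i.e. roughly 2*horizon + 1 calls.
```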
4.2. MBMA and MB-SPG
From the perspective of model-based on-policy SPG, MBMA builds upon on-policy Model-Based Policy Optimization (MBPO) (Janner et al., 2019) but introduces a distinction between two roles for simulated transitions: whereas MBPO calculates gradients at simulated states, we propose to use information from the dynamics model by backpropagating from real states with simulated actions (i.e. simulating the Q-values of those actions). As such, the defining idea of MBMA is that we do not calculate gradients at simulated states, but instead use the dynamics model to refine the SPG estimator through MA variance reduction. Not calculating gradients at simulated states greatly affects the resulting SPG bias. When backpropagating SPG through simulated states, the SPG is biased by two approximations: the Q-value of the simulated action; and the log-probability calculated at the output of the transition network. The accumulated error of state prediction anchors the gradient on log-probabilities that should be associated with different states. MBPO tries to reduce the detrimental effect of compounded dynamics bias by simulating short-horizon trajectories starting from real states. In contrast, by calculating gradients at real states, MBMA biases the SPG only through its Q-value approximations, allowing it to avoid the effects of biased log-probabilities. This perspective is supported by Lipschitz continuity analysis of approximate MDP models (Asadi et al., 2018; Gelada et al., 2019). We investigate the bias stemming from the strategies employed by QMA, MBMA, and MBPO in the table below. In light of the above arguments and our theoretical analysis, we hypothesize that, given a fixed simulation budget, using the dynamics model for MA estimation might yield a more favorable bias-variance tradeoff than using the dynamics model to sample additional states.
Table 2. SPG per-parameter bias associated with action (MA) and state (MS) sample simulation. $Q$ and $\hat{Q}$ denote the true and approximate Q-value of a given state-action pair, respectively; $\hat{s}$ denotes the output of the transition model; and $K$ denotes the Lipschitz norm of $f_s = \nabla \log \pi(a|s)$. For MS the bias is an upper bound. We include extended calculations in Appendix A.4.

    Bias $\nabla J(s, a) - \nabla \hat{J}(s, a)$:
    MA:  $= f_s (Q - \hat{Q})$
    MS:  $\leq f_s (Q - \hat{Q}) + \sqrt{(K(s - \hat{s}))^2 + f_s^2 (Q^2 - Q)}$
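As a brief sketch of the reasoning summarized in Table 2 (using the table's notation; this is our reading, with extended calculations deferred to Appendix A.4 as the caption notes): because MA evaluates the score function at a real state, the score is exact and only the Q-value approximation contributes to the bias,
\[
\nabla J(s, a) - \nabla \hat{J}(s, a) = f_s\, Q(s, a) - f_s\, \hat{Q}(s, a) = f_s \bigl(Q(s, a) - \hat{Q}(s, a)\bigr),
\]
whereas MS takes the gradient at a simulated state, so the score itself is perturbed; its error is controlled through the Lipschitz norm $K$ of $f_s$, which is the source of the additional $K(s - \hat{s})$ term in the MS upper bound.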