On Many-Actions Policy Gradient
Michal Nauman 1,2   Marek Cygan 1,3
Abstract

We study the variance of stochastic policy gradients (SPGs) with many action samples per state. We derive a many-actions optimality condition, which determines when many-actions SPG yields lower variance as compared to a single-action agent with a proportionally extended trajectory. We propose Model-Based Many-Actions (MBMA), an approach leveraging dynamics models for many-actions sampling in the context of SPG. MBMA addresses issues associated with existing implementations of many-actions SPG and yields lower bias and comparable variance to SPG estimated from states in model-simulated rollouts. We find that MBMA's bias and variance structure matches that predicted by theory. As a result, MBMA achieves improved sample efficiency and higher returns on a range of continuous action environments as compared to model-free, many-actions, and model-based on-policy SPG baselines.
1. Introduction
Stochastic policy gradient (SPG) is a method of optimizing a stochastic policy through gradient ascent in the context of reinforcement learning (RL) (Williams, 1992; Sutton et al., 1999; Peters & Schaal, 2006). When paired with powerful function approximators, SPG-based algorithms have proven to be among the most effective methods for achieving optimal performance in Markov Decision Processes (MDPs) with unknown transition dynamics (Schulman et al., 2017). Unfortunately, exact calculation of the gradient is infeasible and thus the objective has to be estimated (Sutton et al., 1999). The resulting variance is known to impact learning speed, as well as the performance of the trained agent (Konda & Tsitsiklis, 1999; Tucker et al., 2018).
1 Informatics Institute, University of Warsaw  2 Ideas National Centre for Research and Development  3 Nomagic. Correspondence to: Michal Nauman <nauman.mic@gmail.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
On-policy sample efficiency (i.e. the number of environment interactions needed to achieve a certain performance level) is particularly affected by variance, as the gradient must be evaluated over long sequences in order to produce a sufficient quality of the SPG estimate (Mnih et al., 2016). As such, a variety of methods for SPG variance reduction have been proposed. The most widely used is baseline variance reduction, which has been shown to improve algorithm stability and has become indispensable in contemporary SPG implementations (Peters & Schaal, 2006; Schulman et al., 2015b). Alternative approaches include Q-value bootstrapping (Gu et al., 2017), reducing the effect of long-horizon stochasticity via a small discount (Baxter & Bartlett, 2001), increasing the number of samples via parallel agents (Mnih et al., 2016), or using a many-actions estimator (Asadi et al., 2017; Kool et al., 2019b; Petit et al., 2019; Ciosek & Whiteson, 2020).
In many-actions SPG (MA), the gradient is calculated using more than one action sample per state, without including the follow-up states of the additional actions. The method builds upon conditional Monte Carlo and yields variance that is smaller than or equal to that of single-action SPG given a fixed trajectory length (Bratley et al., 2011). The additional action samples can be drawn with (Ciosek & Whiteson, 2020) or without replacement (Kool et al., 2019b), and can be generated by rewinding the environment (Schulman et al., 2015a) or by using a parametrized Q-value approximator (Asadi et al., 2017). However, drawing additional action samples from the environment is unacceptable in certain settings, while using a Q-network may introduce bias into the gradient estimate. Furthermore, extending the trajectory yields a diminishing variance reduction effect. This leads to the following questions:
1. Given a fixed trajectory length and cost of sampling actions, is SPG variance more favorable when sampling additional actions or when extending the trajectory?

2. Given that more samples translate to smaller variance, what is the bias associated with simulating such additional samples via neural networks?
[Figure 1: (a) Expected steps to solve (thousands); (b) Expected reward gain after update. Variance reduction leads to better sample efficiency. We train a CartPole Actor-Critic agent with different batch sizes and numbers of action samples per state (denoted as $N$). In Figures 1a and 1b, the x-axis denotes the batch size (i.e. trajectory length) and the y-axis denotes thousands of steps and the average performance gain resulting from a single policy update, respectively. Increasing the batch size leads to better gradient quality at the cost of fewer updates during training. Sampling more actions yields better gradient quality with fewer environment steps.]

The contributions of this paper are twofold. Firstly, we analyze SPG variance theoretically. We quantify the variance reduction stemming from sampling multiple actions per state as compared to extending the trajectory of a single-action agent. We derive conditions under which adopting MA estimation leads to greater variance reduction than extending the trajectory length. We show that these conditions are often met in RL, but are impossible for contextual bandits. Secondly, we propose an implementation of MA, which we refer to as the Model-Based Many-Actions module (MBMA). The module leverages a learned dynamics model to sample state-action gradients and can be used in conjunction with any on-policy SPG algorithm. MBMA yields a favorable bias/variance structure as compared to learning from states simulated in dynamics model rollouts (Janner et al., 2019; Kaiser et al., 2019; Hafner et al., 2019) in the context of on-policy SPG. We validate our approach and show empirically that using MBMA alongside PPO (Schulman et al., 2017) yields better sample efficiency and higher reward sums on a variety of continuous action environments as compared to many-actions, model-based and model-free PPO baselines.
2. Background

A Markov Decision Process (MDP) (Puterman, 2014) is a tuple $(S, A, R, p, \gamma)$, where $S$ is a countable set of states, $A$ is a countable set of actions, $R(s, a)$ is the state-action reward, $p(s'|s, a)$ is a transition kernel (with the initial state distribution denoted as $p_0$), and $\gamma \in (0, 1]$ is a discount factor. A policy $\pi(a|s)$ is a state-conditioned action distribution. Given a policy $\pi$, the MDP becomes a Markov reward process with a transition kernel $p^\pi(s'|s) = \int_A \pi(a|s)\, p(s'|s, a)\, da$, which we refer to as the underlying Markov chain. The underlying Markov chain is assumed to have finite variance, a unique stationary distribution denoted as $p^\pi_0$ (Ross et al., 1996; Konda & Tsitsiklis, 1999), a $t$-step stationary transition kernel $p^\pi_t$, and a unique discounted stationary distribution denoted as $p^\pi$. Interactions with the MDP according to some policy $\pi$ are called trajectories and are denoted as $\tau^\pi_T(s_t) = ((s_t, a_t, r_t), \ldots, (s_{t+T}, a_{t+T}, r_{t+T}))$, where $a_t \sim \pi(a_t|s_t)$, $r_t \sim R(s_t, a_t)$, and $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$. The value function $V^\pi(s) = \mathbb{E}_{\tau^\pi(s)}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\big]$ and the Q-value function $Q^\pi(s, a) = \mathbb{E}_{\tau^\pi(s|a)}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\big] = R(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s'|s,a)}[V^\pi(s')]$ sample $a_t$ according to some fixed policy $\pi$. The state-action advantage is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. An optimal policy is a policy that maximizes the discounted total return $J = \int_{s_0} V^\pi(s_0)\, ds_0$.
2.1. On-policy SPG

Given a policy parametrized by $\theta$, the values of $\theta$ can be updated via SPG $\theta \leftarrow \theta + \nabla_\theta J$. Since we are interested only in the gradient w.r.t. $\theta$, we drop it from the gradient notation in further uses. The SPG is given by (Sutton & Barto, 2018):

$$\nabla J = \mathbb{E}_{s \sim p^\pi}\, \mathbb{E}_{a \sim \pi} \big[ Q^\pi(s, a)\, \nabla \log \pi(a|s) \big] \quad (1)$$

As such, SPG is proportional to a double expectation of $Q^\pi(s, a)\, \nabla_\theta \log \pi(a|s)$, with the outer expectation taken w.r.t. the discounted stationary distribution $p^\pi$ and the inner expectation taken w.r.t. the policy $\pi$. The gradient can be estimated via a trajectory sampled according to the policy (Nota & Thomas, 2020; Wu et al., 2022). We denote $\hat{\nabla} J$ as the estimator and $\nabla J(s_t, a_t) = Q^\pi(s_t, a_t)\, \nabla \log \pi(a_t|s_t)$ with $s_t, a_t \sim p^\pi_t, \pi$. Then, SPG can be calculated as:

$$\hat{\nabla} J = \frac{1}{T} \sum_{t=0}^{T-1} \gamma^t\, \nabla J(s_t, a_t) \quad (2)$$

In the setup above, the outer expectation of Equation 1 is estimated via Monte Carlo (Metropolis & Ulam, 1949) with $T$ state samples drawn from the non-discounted stationary distribution $p^\pi_0$; and the inner expectation is estimated with a single action per state drawn from the policy $\pi(a|s)$. The resulting variance can be reduced to a certain degree by a control variate, with the state value being a popular choice for such a baseline (Schulman et al., 2015b). Then, the Q-value from Equation 1 is replaced by $A^\pi(s_t, a_t)$. If the state value is learned by a parametrized approximator, it is referred to as the critic. Critic bootstrapping (Gu et al., 2017) is defined as $Q^\pi(s, a) = R(s, a) + \gamma V^\pi(s')$ with $s' \sim p(s'|s, a)$ and can be used to balance the bias-variance tradeoff of Q-value approximations.
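As a concrete illustration of Equation 2, the sketch below computes the single-action estimate with a bootstrapped advantage baseline for a diagonal-Gaussian policy, whose score function is available in closed form. This is a minimal sketch under those assumptions, not the implementation used in the paper; all array names (means, log_stds, actions, advantages) are hypothetical placeholders.

import numpy as np

def gaussian_score(mean, log_std, action):
    # Gradient of log N(action | mean, exp(log_std)^2) w.r.t. (mean, log_std).
    std = np.exp(log_std)
    d_mean = (action - mean) / std ** 2
    d_log_std = ((action - mean) ** 2) / std ** 2 - 1.0
    return np.concatenate([d_mean, d_log_std])

def spg_estimate(means, log_stds, actions, advantages, gamma=0.99):
    # Single-action SPG estimate (Equation 2) with an advantage baseline:
    # (1/T) * sum_t gamma^t * A(s_t, a_t) * grad log pi(a_t | s_t),
    # where e.g. A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t) (critic bootstrapping).
    T, action_dim = actions.shape
    grad = np.zeros(2 * action_dim)
    for t in range(T):
        grad += (gamma ** t) * advantages[t] * gaussian_score(means[t], log_stds[t], actions[t])
    return grad / T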
2.2. On-policy Many-Actions SPG

Given a control variate, the variance of the policy gradient can be further reduced by approximating the inner integral of Equation 2 with a quadrature of $N > 1$ action samples. Then, $\hat{\nabla} J$ is equal to:

$$\hat{\nabla} J = \underbrace{\frac{1}{T} \sum_{t=0}^{T-1} \gamma^t \underbrace{\frac{1}{N} \sum_{n=0}^{N-1} \nabla J(s_t, a^n_t)}_{N \text{ actions per state}}}_{T \text{ state samples in a trajectory}} \quad (3)$$

where $a^n_t$ denotes the $n$-th action sampled at state $s_t$. Furthermore, MDP transitions are conditioned only on the first action performed (i.e. $p^\pi(s_{t+1}|s_t, a^n_t) = p^\pi(s_{t+1}|s_t)$ for all $n \neq 0$). Implementations of such an approach have been called all-action policy gradient or expected policy gradient (Asadi et al., 2017; Petit et al., 2019; Ciosek & Whiteson, 2020). As follows from the law of iterated expectations, the many-actions (MA) estimator is unbiased and yields lower or equal variance as compared to single-action SPG with equal trajectory length (Petit et al., 2019). Since the policy log probabilities are known, using MA requires approximating the Q-values of the additional action samples. As such, MA is often implemented by performing rollouts in a rewound environment (Schulman et al., 2015a; Kool et al., 2019a;b) or by leveraging a Q-network at the cost of bias (Asadi et al., 2017; Petit et al., 2019; Ciosek & Whiteson, 2020). The variance reduction stemming from using MA has been shown to increase both the performance and the sample efficiency of SPG algorithms (Schulman et al., 2015a; Kool et al., 2019b).
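For completeness, a sketch of the many-actions estimator in Equation 3, reusing gaussian_score from the previous sketch. The Q-values (or advantages) of the additional action samples must be supplied by some approximator, e.g. a rewound environment, a Q-network, or, in the case of MBMA introduced in Section 4, a model rollout; all names are illustrative placeholders.

def many_actions_spg_estimate(means, log_stds, sampled_actions, q_values, gamma=0.99):
    # Many-actions SPG estimate (Equation 3):
    # (1/T) * sum_t gamma^t * (1/N) * sum_n Q(s_t, a^n_t) * grad log pi(a^n_t | s_t).
    # sampled_actions: shape (T, N, action_dim); q_values: shape (T, N).
    T, N, action_dim = sampled_actions.shape
    grad = np.zeros(2 * action_dim)
    for t in range(T):
        inner = np.zeros(2 * action_dim)
        for n in range(N):
            inner += q_values[t, n] * gaussian_score(means[t], log_stds[t], sampled_actions[t, n])
        grad += (gamma ** t) * inner / N
    return grad / T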
3. Variance of Stochastic Policy Gradients

Throughout the section, we assume no stochasticity induced by learning Q-values and we treat Q-values as known. Furthermore, when referring to SPG variance, we refer to the diagonal of the policy parameter variance-covariance matrix. Finally, to unburden the notation, we define $\Upsilon_t = \gamma^t\, \nabla J(s_t, a_t)$ and $\bar{\Upsilon}_t = \gamma^t\, \mathbb{E}_{a \sim \pi}\, \nabla J(s_t, a_t)$, where we drop the subscript when $t = 0$. Similarly, we use $O_a(\cdot) = O_{a \sim \pi}(\cdot)$, $O_s(\cdot) = O_{s \sim p^\pi_0}(\cdot)$, and $O_{s,a}(\cdot) = O_{s_t, a_t \sim p^\pi_t, \pi}(\cdot)$, where $O$ denotes the expected value, variance, and covariance operators. As shown by Petit et al. (2019) and Ciosek & Whiteson (2020), given a fixed trajectory length $T$, MA-SPG variance is smaller than or equal to the variance of a single-action agent. However, approximating the inner expectation of SPG always uses resources (i.e. compute or environment interactions), which could instead be used to reduce the SPG variance through other means (e.g. extending the trajectory length). To this end, we extend existing results (Petit et al., 2019; Ciosek & Whiteson, 2020) by comparing the variance reduction stemming from employing MA as opposed to using regular single-action SPG with an extended trajectory length. If the underlying Markov chain is ergodic, the variance of SPG, denoted as $\mathbb{V}$, can be calculated via the Markov chain Central Limit Theorem (Jones, 2004; Brooks et al., 2011):

$$\mathbb{V} = \frac{1}{T}\, \mathrm{Var}_{s,a}\!\left(\Upsilon\right) + 2 \sum_{t=1}^{T-1} \frac{T-t}{T^2}\, \mathrm{Cov}_{s,a}\!\left(\Upsilon, \Upsilon_t\right) \quad (4)$$

The states underlying $\Upsilon$ and $\Upsilon_t$ are sampled from the undiscounted stationary distribution $p^\pi_0$ and the $t$-step stationary transition kernel $p^\pi_t$, respectively. As follows from the ergodic theorem (Norris & Norris, 1998), the conditional probability of visiting state $s_t$, given starting in state $s_0$ with action $a_0$, approaches the undiscounted stationary distribution $p^\pi_0$ exponentially fast as $t$ grows, i.e. $\lim_{t \to \infty} p(s_t|s_0, a_0) = p^\pi_0(s_t)$. Therefore, $\mathrm{Cov}_t \geq \mathrm{Cov}_{t+1}$, as well as $\lim_{t \to \infty} \mathrm{Cov}_t = 0$. Equation 4 shows the well-known result that increasing the trajectory length $T$ decreases $\mathbb{V}$. This result contextualizes the success of parallel SPG (Mnih et al., 2016). Unfortunately, the form above assumes a single action per state.
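To make Equation 4 concrete, the sketch below estimates $\mathbb{V}$ for a single gradient coordinate from empirical samples of $\Upsilon_t$ collected along one trajectory. It is an illustrative plug-in estimator under the stated ergodicity assumption, not the procedure used for our experiments; the input array is a hypothetical placeholder.

def spg_variance_clt(upsilon):
    # Estimate Equation 4 for one gradient coordinate.
    # upsilon: array of shape (T,), samples of Upsilon_t = gamma^t * grad J(s_t, a_t).
    T = len(upsilon)
    centered = upsilon - upsilon.mean()
    total = np.mean(centered ** 2) / T                    # (1/T) * Var_{s,a}(Upsilon)
    for t in range(1, T):
        lag_cov = np.mean(centered[:-t] * centered[t:])   # empirical Cov(Upsilon, Upsilon_t)
        total += 2.0 * (T - t) / T ** 2 * lag_cov
    # In practice, a lag window would truncate the noisy high-lag covariance terms.
    return total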
3.1. Variance Decomposition

To quantify the variance reduction stemming from many action samples, we decompose $\mathbb{V}$ into sub-components. We include derivations in Appendix A.1.

Lemma 3.1. Given an ergodic MDP and SPG with $N$ action samples per state and $T$ states, $\mathbb{V}$ can be decomposed into:

$$\mathrm{Var}_{s,a}\!\left(\Upsilon\right) = \mathrm{Var}_{s}\!\left(\bar{\Upsilon}\right) + \frac{1}{N}\, \mathbb{E}_{s}\, \mathrm{Var}_{a}\!\left(\Upsilon\right)$$
$$\mathrm{Cov}_{s,a}\!\left(\Upsilon, \Upsilon_t\right) = \mathrm{Cov}_{s,a}\!\left(\bar{\Upsilon}, \bar{\Upsilon}_t\right) + \frac{1}{N}\, \mathbb{E}_{s}\, \mathrm{Cov}_{s,a}\!\left(\Upsilon, \Upsilon_t\right) \quad (5)$$

Combining Lemma 3.1 with Equation 4 yields an expression for the decomposed SPG variance, where we group the components according to their dependence on $N$:

$$T\mathbb{V} = \underbrace{\mathrm{Var}_{s}\!\left(\bar{\Upsilon}\right) + 2 \sum_{t=1}^{T-1} \frac{T-t}{T}\, \mathrm{Cov}_{s,a}\!\left(\bar{\Upsilon}, \bar{\Upsilon}_t\right)}_{\text{Marginalized policy variance}} + \underbrace{\frac{1}{N} \Big( \mathbb{E}_{s}\, \mathrm{Var}_{a}\!\left(\Upsilon\right) + 2 \sum_{t=1}^{T-1} \frac{T-t}{T}\, \mathrm{Cov}_{s,a}\!\left(\Upsilon, \Upsilon_t\right) \Big)}_{\text{Policy-dependent variance}} \quad (6)$$
Table 1. Decomposed trace of the variance-covariance matrix divided by the number of parameters. The components were estimated by marginalizing Q-values, with Equation 3 and Lemma 3.1, using 125,000 non-baselined interactions. The last two columns record the best performance during 500k environment steps (average performance shown in brackets). The performance of the SPG variants was measured during 500k training steps with additional action samples drawn from the environment. With most variance depending on the policy, MA often yields better performance than single-action agents with extended trajectories. We detail the setting in Appendix B.

                   VARIANCE COMPONENT                         PERFORMANCE
TASK               MARGINALIZED POLICY   POLICY-DEPENDENT     (T, N) = (1024, 2)   (T, N) = (2048, 1)
BALL CATCH         0.026 (3%)            0.819 (97%)          905 (708)            920 (715)
CART SWINGUP       0.006 (1%)            5.736 (99%)          837 (670)            801 (669)
CHEETAH RUN        0.006 (1%)            1.615 (99%)          208 (131)            204 (126)
FINGER SPIN        0.026 (18%)           0.122 (82%)          304 (187)            281 (179)
REACHER EASY       2.269 (39%)           3.565 (61%)          428 (262)            776 (488)
WALKER WALK        0.081 (1%)            11.786 (99%)         509 (315)            465 (287)
Given $N = 1$, the variance simplifies to the single-action case. The statement shows that SPG variance can be decomposed into: marginalized policy variance, which stems from the underlying Markov chain and is decreased only by the trajectory length ($T$); and policy-dependent variance, which is reduced both by sampling more actions per state ($N$) and by increasing the trajectory length ($T$). Table 1 shows estimated variance components and the performance of two SPG estimators ($T = 1024, N = 2$ and $T = 2048, N = 1$) for 6 DeepMind Control Suite (DMC) environments. In particular, the table shows that with Q-values marginalized, the policy is responsible for around 90% of the SPG variance in the tested environments.
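The components reported in Table 1 can be estimated along the following lines: draw $N$ actions per visited state, evaluate $\Upsilon$ for each, and split the empirical variance according to Lemma 3.1. The sketch below shows only the variance term (the covariance terms are handled analogously) and is a simplified illustration; the array name is a hypothetical placeholder.

def decompose_spg_variance(upsilon_samples):
    # upsilon_samples: array of shape (T, N); Upsilon evaluated for N actions
    # drawn at each of T visited states, for a single gradient coordinate.
    per_state_mean = upsilon_samples.mean(axis=1)            # estimate of bar{Upsilon} per state
    marginalized = per_state_mean.var()                      # plug-in estimate of Var_s(bar{Upsilon})
    policy_dependent = upsilon_samples.var(axis=1).mean()    # estimate of E_s Var_a(Upsilon)
    # Per Lemma 3.1, the N-action estimator's per-state variance is approximately
    # marginalized + policy_dependent / N (the plug-in estimate of the first
    # component itself carries an O(1/N) contamination from using sample means).
    return marginalized, policy_dependent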
3.2. Measuring Variance Reduction

We proceed with an analytical analysis of the variance reduction stemming from increasing $N$ and $T$.

Lemma 3.2. Given an ergodic MDP and SPG with $N$ action samples per state and $T$ states, the variance reduction stemming from increasing $N$ by $1$ (denoted as $\Delta_N$) and the variance reduction stemming from increasing the trajectory length to $T + \delta T$ with $\delta \in (0, \infty)$ (denoted as $\Delta_T$) are equal to:

$$\frac{\Delta_N}{\alpha_N} = \mathbb{E}_{s}\, \mathrm{Var}_{a}\!\left(\Upsilon\right) + 2 \sum_{t=1}^{T-1} \frac{T-t}{T}\, \mathrm{Cov}_{s,a}\!\left(\Upsilon, \Upsilon_t\right)$$
$$\frac{\Delta_T}{\alpha_T} = \mathrm{Var}_{s,a}\!\left(\Upsilon\right) + 2 \sum_{t=1}^{T-1} \Big( \frac{T-t}{T} - \frac{T-t}{T + \delta T} \Big)\, \mathrm{Cov}_{s,a}\!\left(\Upsilon, \Upsilon_t\right)$$
$$\alpha_N = \frac{1}{T(N^2 + N)} \quad \text{and} \quad \alpha_T = \frac{\delta}{T + \delta T} \quad (7)$$

The derivation of Lemma 3.2 is detailed in Appendix A.2. Lemma 3.2 shows the diminishing variance reduction stemming from increasing $N$ by $1$ or $T$ by $\delta T$. Incorporating $\delta$ captures the notion of the relative costs of increasing $N$ and $T$. If $\delta = 1$, then the cost of increasing $N$ by $1$ (sampling one more action per state in the trajectory) is equal to that of doubling the trajectory length. It follows that many actions yield better variance reduction than increasing the trajectory length only if $\Delta_N \geq \Delta_T$ for given values of $N$, $T$, and $\delta$.
Theorem 3.3. Given an ergodic MDP and SPG with $N$ action samples per state and $T$ states, the variance reduction stemming from increasing $N$ by $1$ is bigger than the variance reduction stemming from increasing $T$ by $\delta T$ for $\delta = 1$ and $N = 1$ when:

$$\sum_{t=1}^{T-1} \frac{t}{T}\, \mathrm{Cov}_{s,a}\!\left(\Upsilon, \Upsilon_t\right) \;\geq\; \mathrm{Var}_{s}\!\left(\bar{\Upsilon}\right) + 2 \sum_{t=1}^{T-1} \frac{T-t}{T}\, \mathrm{Cov}_{s,a}\!\left(\bar{\Upsilon}, \bar{\Upsilon}_t\right) \quad (8)$$

For the derivation with $N \geq 1$ and $\delta \in (0, \infty)$, see Equation 14 in Appendix A.3. The theorem represents a condition under which it is optimal to switch from regular SPG (MA-SPG with $N = 1$) to MA-SPG with $N = 2$. Surprisingly, the optimality condition for $\delta = 1$ and $N = 1$ depends solely on the covariance structure of the data. As follows from Theorem 3.3, MA is optimal when the weighted sum of MDP covariances exceeds the variance of the Markov chain underlying the MDP. It follows that MA is most effective in problems where action-dependent covariance constitutes a sizeable portion of the total SPG variance (i.e. problems where future outcomes largely depend on actions taken in the past and, consequently, $\nabla_\theta J(s_{t+k}, a_{t+k})$ largely depends upon $a_t$).
Corollary 3.4. Given an ergodic MDP and SPG with $N$ action samples per state and $T$ states, the SPG variance reduction from increasing $N$ by $1$ is bigger than the SPG variance reduction from increasing $T$ by $\delta T$ when:

$$\frac{\mathrm{Var}_{s}\!\left(\bar{\Upsilon}\right)}{\mathbb{E}_{s}\, \mathrm{Var}_{a}\!\left(\Upsilon\right)} \;\leq\; \frac{1 - \delta N}{\delta (N^2 + N)} \quad (9)$$
The corollary above is a specific case of Theorem 3.3. By assuming a contextual bandit problem, the covariances are equal to zero and the optimality condition is vastly simplified. As follows from the definition of variance, the LHS of Equation 9 is greater than or equal to $0$. However, the RHS becomes negative when $\delta N > 1$. Since $N \geq 1$, it follows that MA is never optimal for bandits if $\delta \geq 1$ (i.e. if the cost of acquiring an additional action sample is equal to or greater than the cost of acquiring an additional state sample). Whereas the efficiency of MA for contextual bandits is restricted, Theorem 3.3 shows that MA can be a preferable strategy for gradient estimation in MDPs. We leave researching the optimality condition for settings with sampled Q-values or deterministic policy gradients for future work.
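Given estimates of the two variance components (e.g. as reported in Table 1), the bandit-case condition of Corollary 3.4 can be checked directly. The function below is a trivial illustration with hypothetical inputs, following our reading of Equation 9.

def many_actions_preferable_bandit(var_s_mean, e_s_var_a, N=1, delta=1.0):
    # Corollary 3.4 (contextual-bandit case, zero covariances): adding an action
    # sample beats extending the trajectory by delta * T steps when
    # Var_s(bar{Upsilon}) / E_s Var_a(Upsilon) <= (1 - delta * N) / (delta * (N^2 + N)).
    lhs = var_s_mean / e_s_var_a
    rhs = (1.0 - delta * N) / (delta * (N ** 2 + N))
    return lhs <= rhs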
4. Model-Based Many-Actions SPG

Given a fixed number of interactions with the environment, our theoretical analysis relates to two notions in on-policy SPG algorithms: achieving better-quality gradients through MA via a Q-network (QMA) (Asadi et al., 2017; Petit et al., 2019; Ciosek & Whiteson, 2020); and achieving better-quality gradients by simulating additional transitions via a dynamics model in model-based SPG (MB-SPG) (Janner et al., 2019). Building on these theoretical insights, we propose Model-Based Many-Actions (MBMA), an approach that bridges the two themes described above. MBMA leverages a learned dynamics model in the context of MA-SPG. As such, MBMA allows for MA estimation by calculating the Q-values of additional action samples by simulating a critic-bootstrapped trajectory within a dynamics model consisting of transition and reward networks (Ha & Schmidhuber, 2018; Hafner et al., 2019; Kaiser et al., 2019; Gelada et al., 2019; Schrittwieser et al., 2020), which we explain in Appendix E. MBMA can be used in conjunction with any on-policy SPG algorithm.
4.1. MBMA and MA-SPG

In contrast to existing implementations of MA-SPG, MBMA does not require a Q-network for MA estimation. Using a Q-network to approximate the Q-values of additional action samples yields bias. Whereas this bias can theoretically be reduced to zero, the conditions required for such bias annihilation are unrealistic (Petit et al., 2019). A Q-network learns a non-stationary target (Van Hasselt et al., 2016) that depends on the current policy. Furthermore, generating informative samples for multiple actions is challenging given single-action supervision. This results in unstable training when a Q-network is used to bootstrap the policy gradient (Mnih et al., 2015; Van Hasselt et al., 2016; Gu et al., 2017; Haarnoja et al., 2018). The advantage of MBMA compared to QMA is that both the reward and transition networks learn stationary targets throughout training, thus offering better convergence properties and lower bias. Such bias reduction comes at the cost of additional computation: whereas QMA approximates Q-values within a single forward pass, MBMA sequentially unrolls the dynamics model for a fixed number of steps.
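A minimal sketch of the MBMA Q-value estimate described above: the Q-value of an extra action sample at a real state is obtained by unrolling learned reward and transition networks for a fixed horizon and bootstrapping with the critic at the final simulated state. The callables (reward_model, transition_model, policy, critic) and the horizon are assumed interfaces for illustration, not the paper's implementation.

def mbma_q_estimate(state, action, policy, transition_model, reward_model,
                    critic, horizon=5, gamma=0.99):
    # Critic-bootstrapped model rollout used to score one (state, action) pair.
    q, discount = 0.0, 1.0
    s, a = state, action
    for _ in range(horizon):
        q += discount * reward_model(s, a)   # learned reward network
        s = transition_model(s, a)           # learned transition network
        a = policy(s)                        # on-policy action at the simulated state
        discount *= gamma
    return q + discount * critic(s)          # critic bootstrap terminates the rollout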
4.2. MBMA and MB-SPG

From the perspective of model-based on-policy SPG, MBMA builds upon on-policy Model-Based Policy Optimization (MBPO) (Janner et al., 2019) but introduces a distinction between two roles for simulated transitions: whereas MBPO calculates gradients at simulated states, we propose to use the information from the dynamics model by backpropagating from real states with simulated actions (i.e. simulating the Q-values of those actions). As such, we define MBMA as the idea that we do not calculate gradients at simulated states, but instead use the dynamics model to refine the SPG estimator through MA variance reduction. Not calculating gradients at simulated states greatly affects the resulting SPG bias. When backpropagating SPG through simulated states, SPG is biased by two approximations: the Q-value of the simulated action; and the log-probability calculated at the output of the transition network. The accumulated error of state prediction anchors the gradient on log probabilities which should be associated with different states. MBPO tries to reduce the detrimental effect of compounded dynamics bias by simulating short-horizon trajectories starting from real states. In contrast, by calculating gradients at real states, MBMA biases the SPG only through its Q-value approximations, allowing it to omit the effects of biased log probabilities. This perspective is supported by the Lipschitz continuity analysis of approximate MDP models (Asadi et al., 2018; Gelada et al., 2019). We investigate the bias stemming from the strategies employed by QMA, MBMA, and MBPO in Table 2. In light of the above arguments and our theoretical analysis, we hypothesize that using the dynamics model for MA estimation might yield a more favorable bias-variance tradeoff as compared to using the dynamics model to sample additional states, given a fixed simulation budget.
Table 2. SPG per-parameter bias associated with action (MA) and state (MS) sample simulation. $Q$ and $\hat{Q}$ denote the true and approximate Q-value of a given state-action pair, respectively; $\hat{s}$ denotes the output of the transition model; and $K$ denotes the Lipschitz norm of $f_s = \nabla \log \pi(a|s)$. For MS the bias is an upper bound. We include extended calculations in Appendix A.4.

       $\nabla J(s, a) - \nabla \hat{J}(s, a)$
MA     $f_s (Q - \hat{Q})$
MS     $f_s (Q - \hat{Q}) + \sqrt{(K(s - \hat{s}))^2 + f_s^2 (Q^2 - \hat{Q}^2)}$