
The corollary above is a specific case of Theorem 3.3. When a contextual bandit problem is assumed, the covariances are equal to zero and the optimality condition is vastly simplified. As follows from the definition of variance, the LHS of Equation 9 is greater than or equal to $0$. However, the RHS becomes negative when $\delta N > 1$. Since $N \geq 1$, it follows that MA is never optimal for bandits if $\delta \geq 1$ (i.e. the cost of acquiring an additional action sample is equal to or greater than the cost of acquiring an additional state sample). Whereas the efficiency of MA for contextual bandits is restricted, Theorem 3.3 shows that MA can be a preferable strategy for gradient estimation in MDPs. We leave researching the optimality condition for settings with sampled Q-values or deterministic policy gradients for future work.
4. Model-Based Many-Actions SPG
Given a fixed number of interactions with the environment, our theoretical analysis relates to two notions in on-policy SPG algorithms: achieving better-quality gradients through MA via a Q-network (QMA) (Asadi et al., 2017; Petit et al., 2019; Ciosek & Whiteson, 2020); and achieving better-quality gradients by simulating additional transitions via a dynamics model in model-based SPG (MB-SPG) (Janner et al., 2019). Building on these theoretical insights, we propose Model-Based Many-Actions (MBMA), an approach that bridges the two themes described above. MBMA leverages a learned dynamics model in the context of MA-SPG. As such, MBMA allows for MA estimation by calculating Q-values of additional action samples via a critic-bootstrapped trajectory simulated within a dynamics model consisting of transition and reward networks (Ha & Schmidhuber, 2018; Hafner et al., 2019; Kaiser et al., 2019; Gelada et al., 2019; Schrittwieser et al., 2020), which we explain in Appendix E. MBMA can be used in conjunction with any on-policy SPG algorithm.
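As a concrete illustration, the following Python sketch shows one way such a critic-bootstrapped, model-based Q-value estimate could feed a many-actions estimator at a real state. It is a minimal sketch, not the paper's implementation: the interfaces `transition_net`, `reward_net`, `critic`, `policy`, and the horizon are assumed placeholders.

```python
# Minimal MBMA-style sketch (assumed interfaces, not the paper's code).
# transition_net(s, a) -> next state, reward_net(s, a) -> reward,
# critic(s) -> state value, policy(s) -> sampled action are placeholder callables.

def simulate_q_value(s, a, transition_net, reward_net, policy, critic,
                     horizon=5, gamma=0.99):
    """Estimate Q(s, a) by unrolling the learned dynamics model for `horizon`
    steps and bootstrapping the tail of the return with the critic."""
    q, discount = 0.0, 1.0
    state, action = s, a
    for _ in range(horizon):
        q += discount * reward_net(state, action)   # simulated reward
        state = transition_net(state, action)       # simulated next state
        action = policy(state)                      # on-policy action inside the model
        discount *= gamma
    return q + discount * critic(state)             # critic bootstrap at the last simulated state


def mbma_policy_gradient(s, policy, log_prob_grad, q_sim, n_actions=4):
    """Many-actions SPG estimate at a *real* state s: average the score-weighted
    simulated Q-values over several sampled actions (MA variance reduction)."""
    grads = []
    for _ in range(n_actions):
        a = policy(s)
        grads.append(log_prob_grad(s, a) * q_sim(s, a))  # f_s * Q_hat(s, a)
    return sum(grads) / n_actions
```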
4.1. MBMA and MA-SPG
In contrast to existing implementations of MA-SPG, MBMA does not require a Q-network for MA estimation. Using a Q-network to approximate the Q-values of additional action samples introduces bias. Whereas this bias can theoretically be reduced to zero, the conditions required for such bias annihilation are unrealistic (Petit et al., 2019). A Q-network learns a non-stationary target (Van Hasselt et al., 2016) that depends on the current policy. Furthermore, generating informative samples for multiple actions is challenging given single-action supervision. This results in unstable training when a Q-network is used to bootstrap the policy gradient (Mnih et al., 2015; Van Hasselt et al., 2016; Gu et al., 2017; Haarnoja et al., 2018). The advantage of MBMA compared to QMA is that both the reward and transition networks learn stationary targets throughout training, thus offering better convergence properties and lower bias. Such bias reduction comes at the cost of additional computation: whereas QMA approximates Q-values within a single forward pass, MBMA sequentially unrolls the dynamics model for a fixed number of steps.
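For contrast with the MBMA sketch above, a QMA-style estimate of the same quantity amounts to a single network call per additional action sample (again a hedged sketch; `q_network` is an assumed placeholder interface):

```python
def qma_q_value(s, a, q_network):
    """QMA-style estimate: one forward pass through a learned Q-network per
    additional action sample."""
    return q_network(s, a)

# Rough per-action-sample compute under the assumptions of these sketches:
#   QMA  : 1 Q-network call.
#   MBMA : `horizon` transition-network calls + `horizon` reward-network calls
#          + 1 critic call (see simulate_q_value), i.e. roughly 2*horizon + 1 calls.
```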
4.2. MBMA and MB-SPG
From the perspective of model-based on-policy SPG, MBMA builds upon on-policy Model-Based Policy Optimization (MBPO) (Janner et al., 2019) but introduces a distinction between two roles for simulated transitions: whereas MBPO calculates gradients at simulated states, we propose to use information from the dynamics model by backpropagating from real states with simulated actions (i.e. simulating the Q-values of those actions). As such, the defining idea of MBMA is that we do not calculate gradients at simulated states, but instead use the dynamics model to refine the SPG estimator through MA variance reduction. Not calculating gradients at simulated states greatly affects the resulting SPG bias. When backpropagating SPG through simulated states, the SPG is biased by two approximations: the Q-value of the simulated action; and the log-probability calculated at the output of the transition network. The accumulated error of state prediction anchors the gradient on log-probabilities that should be associated with different states. MBPO tries to reduce the detrimental effect of compounded dynamics bias by simulating short-horizon trajectories starting from real states. In contrast, by calculating gradients at real states, MBMA biases the SPG only through its Q-value approximations, allowing it to avoid the effects of biased log-probabilities. This perspective is supported by Lipschitz continuity analysis of approximate MDP models (Asadi et al., 2018; Gelada et al., 2019). We investigate the bias stemming from the strategies employed by QMA, MBMA, and MBPO in the table below. In light of the above arguments and our theoretical analysis, we hypothesize that, given a fixed simulation budget, using the dynamics model for MA estimation might yield a more favorable bias-variance tradeoff than using the dynamics model to sample additional states.
Table 2. SPG per-parameter bias associated with action (MA) and state (MS) sample simulation. $Q$ and $\hat{Q}$ denote the true and approximate Q-value of a given state-action pair, respectively; $\hat{s}$ denotes the output of the transition model; and $K$ denotes the Lipschitz norm of $f_s = \nabla \log \pi(a|s)$. For MS the bias is an upper bound. We include extended calculations in Appendix A.4.

    Bias $\nabla J(s, a) - \nabla \hat{J}(s, a)$:
    MA:  $= f_s (Q - \hat{Q})$
    MS:  $\leq f_s (Q - \hat{Q}) + \sqrt{(K(s - \hat{s}))^2 + f_s^2 (Q^2 - Q)}$
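As a brief sketch of the reasoning summarized in Table 2 (using the table's notation; this is our reading, with extended calculations deferred to Appendix A.4 as the caption notes): because MA evaluates the score function at a real state, the score is exact and only the Q-value approximation contributes to the bias,
\[
\nabla J(s, a) - \nabla \hat{J}(s, a) = f_s\, Q(s, a) - f_s\, \hat{Q}(s, a) = f_s \bigl(Q(s, a) - \hat{Q}(s, a)\bigr),
\]
whereas MS takes the gradient at a simulated state, so the score itself is perturbed; its error is controlled through the Lipschitz norm $K$ of $f_s$, which is the source of the additional $K(s - \hat{s})$ term in the MS upper bound.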