Actor-Critic or Critic-Actor? A Tale of Two Time Scales
Shalabh Bhatnagar1, Vivek S. Borkar2, and Soumyajit Guin1
Abstract
We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation with the value function computed on a faster time-scale and the policy computed on a slower time-scale. This emulates policy iteration. We observe that a reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
Keywords: Reinforcement learning; approximate dynamic programming; critic-actor algorithm; two time-scale
stochastic approximation.
I. INTRODUCTION
The actor-critic algorithm of Barto et al. [1] is one of the foremost reinforcement learning algorithms
for data-driven approximate dynamic programming for Markov decision processes. Its rigorous analysis
as a two time-scale stochastic approximation was initiated in [14] and [15], in tabular form and with linear function approximation, respectively. The algorithm has a faster time-scale component for value function evaluation (the ‘critic’), with policy updates on a slower time scale (the ‘actor’). By the two time-scale philosophy, the former sees the latter as quasi-static, i.e., varying slowly enough on the time-scale of the critic that the critic can treat the actor’s output essentially as a constant. In turn, the actor sees the critic as quasi-equilibrated, i.e., tracking the value function corresponding to the current policy
estimate of the actor. Thus the scheme emulates policy iteration, wherein one alternates between value
function computation for a fixed policy and policy update based on this value function by minimization
of the associated ‘Q-value’. Another interpretation is that of a Stackelberg game between the actor and
the critic, with the actor leading and being followed by the critic. That is, the actor tunes the policy
slowly and the critic reacts to it rapidly, so that the actor can factor in the critic’s reaction in her update.
This raises the issue of what would happen if the time scales of the actor and the critic are reversed, rendering the critic the leader. We observe below that in this case, the scheme emulates value iteration,
making it a valid reinforcement learning scheme. We call it the ‘critic-actor’ algorithm. This structure as such is not novel and was explored in the context of Q-learning in [6], [8] with a different motivation. Its incorporation into the actor-critic facilitates a dimensionality reduction and a clean convergence analysis. This is so because we work with value and not Q-value functions.

SB was supported by a J. C. Bose Fellowship, Project No. DFTM/02/3125/M/04/AIR-04 from DRDO under DIA-RCOE, a project from DST-ICPS, and the RBCCPS, IISc. VSB was supported by the S. S. Bhatnagar Fellowship from the Council of Scientific and Industrial Research, Government of India.
1 Department of Computer Science and Automation, Indian Institute of Science, Bengaluru 560012, India. shalabh@iisc.ac.in, gsoumyajit@iisc.ac.in
2 Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India. borkar.vs@gmail.com
arXiv:2210.04470v6 [cs.LG] 14 Jun 2024
While there are multiple variations of the actor-critic based on the exact scheme used for policy
updates, we stick to one of the three proposed in [14] to make our point. It may be recalled that one of
the purported advantages of the actor-critic algorithm is that, because of the slow time scale of the policy update, which renders it quasi-static, the value function update is ‘essentially’ a linear operation, and therefore one can legitimately use schemes such as TD(λ) [21], [23] that are based on linear function approximation. This is in contrast with, e.g., Q-learning [24], where the nonlinearity of the iterate is not friendly to linear function approximation. This problem returns in the actor-critic if we interchange the time
scales. Nevertheless, given the current trend towards using neural networks for function approximation,
this theoretical advantage is no longer there. In particular, this puts the actor-critic and critic-actor
schemes a priori on equal footing as far as nonlinear function approximation is concerned.
II. THE BASIC FRAMEWORK
Our Markov decision process (MDP) is a random process $X_n, n \geq 0$, in a finite state space $S$ satisfying, a.s.,
$$P(X_{n+1} = j \mid X_k, A_k, k \leq n) = p(X_n, A_n, j), \quad n \geq 0.$$
Here $A_n \in U(X_n)$ is the action at time $n$ when the state is $X_n$, where $U(i) :=$ the finite set of actions admissible in state $i$. Also, $p(i, a, j)$ denotes the transition probability from state $i$ to $j$ when a feasible action $a$ is chosen in state $i$. $\phi := \{A_n\}$ with each $A_n$ as above will be said to be admissible. For simplicity, we take $U(\cdot) \equiv U$. A stationary deterministic policy (SDP) $f$ is one where $A_n = f(X_n)$ for some $f : S \to U$. Similarly, if given $X_n$, $A_n$ is conditionally independent of $\{X_m, A_m, m < n\}$ and has the same conditional law $\pi : S \to \mathcal{P}(U)$ for all $n$, then $\pi$ is called a stationary randomised policy (SRP). Here $\mathcal{P}(U) :=$ the space of probability vectors on $U$. We shall denote the probability vector $\pi(i)$ as $\pi(i) = (\pi(i, a), a \in U(i))^T$. Let $g : S \times U \times S \to \mathbb{R}$ denote the single-stage cost function and $\gamma \in (0, 1)$ the discount factor.
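For concreteness, the objects above can be encoded directly; the following minimal Python sketch builds a small randomly generated MDP instance (the sizes, the random seed and the names n_states, n_actions, p, g, gamma are illustrative and not taken from the paper):

```python
# Minimal encoding of the finite MDP above: transition kernel p(i, a, j),
# single-stage cost g(i, a, j), and discount factor gamma in (0, 1).
# The random instance below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3        # |S| and |U| (we take U(.) = U)
gamma = 0.9                       # discount factor

# p[i, a, :] is a probability vector over next states j
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)

# g[i, a, j] is the cost of the transition (i, a) -> j
g = rng.random((n_states, n_actions, n_states))
```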
For a given admissible sequence $\phi = \{A_n\}$, consider the infinite horizon discounted cost
$$V_\phi(i) = E\left[\sum_{n=0}^{\infty} \gamma^n g(X_n, A_n, X_{n+1}) \,\Big|\, X_0 = i\right], \quad i \in S. \qquad (1)$$
The function $V_\pi := V_\phi$ for $\phi$ an SRP $\pi$ is called the value function under the SRP $\pi$. The corresponding (linear) Bellman equation takes the form
$$V_\pi(i) = \sum_{a \in U(i)} \pi(i, a) \sum_{j \in S} p(i, a, j)\big(g(i, a, j) + \gamma V_\pi(j)\big). \qquad (2)$$
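Since (2) is linear in $V_\pi$, it can be solved exactly for a fixed SRP. The sketch below does this on the illustrative arrays p, g, gamma introduced above by forming the policy-averaged transition matrix and one-stage cost; the function name evaluate_policy is our own.

```python
import numpy as np

def evaluate_policy(pi, p, g, gamma):
    """Solve the linear Bellman equation (2) for a fixed SRP pi, where
    pi[i, a] = pi(i, a), p[i, a, j] = p(i, a, j), g[i, a, j] = g(i, a, j)."""
    n_states = p.shape[0]
    # Policy-averaged transition matrix: P_pi[i, j] = sum_a pi(i, a) p(i, a, j)
    P_pi = np.einsum('ia,iaj->ij', pi, p)
    # Expected one-stage cost: r_pi[i] = sum_{a, j} pi(i, a) p(i, a, j) g(i, a, j)
    r_pi = np.einsum('ia,iaj,iaj->i', pi, p, g)
    # Solve (I - gamma P_pi) V_pi = r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

For instance, the value function of the uniform SRP on this illustrative instance is evaluate_policy(np.full((n_states, n_actions), 1.0 / n_actions), p, g, gamma).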
Define the ‘value function’
$$V(i) = \min_{\phi} V_\phi(i). \qquad (3)$$
This satisfies the Bellman equation
$$V(i) = \min_{a \in U(i)} \left( \sum_{j \in S} p(i, a, j)\big(g(i, a, j) + \gamma V(j)\big) \right). \qquad (4)$$
Define the Q-value function $Q(\cdot, \cdot) : S \times U \to \mathbb{R}$ so that $Q(i, a) :=$ the total cost incurred if, starting in state $i$, action $a$ is chosen and the optimal action is chosen in each state visited subsequently, i.e.,
$$Q(i, a) = \sum_{j \in S} p(i, a, j)\big(g(i, a, j) + \gamma V(j)\big). \qquad (5)$$
Then an SRP $\pi$ would be optimal if $\mathrm{support}(\pi(i)) \subseteq \arg\min Q(i, \cdot)$. In particular, an optimal SDP (hence SRP) exists and is given by any choice of minimiser of $Q(i, \cdot)$ for each $i$. See [20] for an extensive treatment.
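Equation (5) and the greedy selection of a minimiser can likewise be written in a few lines; again, this is only a sketch on the illustrative arrays above, not code from the paper.

```python
import numpy as np

def q_from_v(V, p, g, gamma):
    """Q(i, a) = sum_j p(i, a, j) (g(i, a, j) + gamma V(j)), as in (5)."""
    return np.einsum('iaj,iaj->ia', p, g + gamma * V[None, None, :])

def greedy_sdp(V, p, g, gamma):
    """An optimal SDP chooses, in each state i, a minimiser of Q(i, .)."""
    return q_from_v(V, p, g, gamma).argmin(axis=1)
```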
Policy iteration (PI) and value iteration (VI) are two of the numerical procedures for solving the Bellman equation. Whereas actor-critic algorithms [14] emulate PI, the algorithm with the time scales of the actor-critic reversed will be seen to mimic VI.
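To make the contrast concrete, here are bare-bones tabular versions of PI and VI, reusing the evaluate_policy and q_from_v helpers sketched above; the stopping rules and iteration limits are arbitrary illustrative choices.

```python
import numpy as np

def policy_iteration(p, g, gamma, max_iter=100):
    """PI: alternate exact policy evaluation with greedy policy improvement.
    Reuses evaluate_policy and q_from_v from the sketches above."""
    n_states, n_actions = p.shape[0], p.shape[1]
    f = np.zeros(n_states, dtype=int)            # current SDP, as action indices
    for _ in range(max_iter):
        pi = np.eye(n_actions)[f]                # represent the SDP as an SRP matrix
        V = evaluate_policy(pi, p, g, gamma)     # value of the current policy
        f_new = q_from_v(V, p, g, gamma).argmin(axis=1)
        if np.array_equal(f_new, f):             # policy is greedy w.r.t. its own value
            break
        f = f_new
    return f, V

def value_iteration(p, g, gamma, tol=1e-8, max_iter=100_000):
    """VI: iterate the Bellman operator of (4) until (approximate) convergence."""
    V = np.zeros(p.shape[0])
    for _ in range(max_iter):
        V_new = q_from_v(V, p, g, gamma).min(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return V_new
```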
III. THE PROPOSED CRITIC-ACTOR ALGORITHM
Let $V_n(\cdot)$ and $\pi_n(\cdot, \cdot)$ denote the estimates at instant $n$ of the value function and the optimal SRP.
We consider here an off-policy setting where states and state-action tuples are sampled from a priori
given distributions in order to decide the particular state whose value is updated next, as well as the
action-probability estimate (in the SRP) of the sampled state-action tuple to be updated. This is more
general than the on-policy setting commonly considered in standard actor-critic algorithms such as [15],
[10], as the latter turns out to be a special case of the off-policy setting we consider.
Let $\{Y_n\}$ and $\{Z_n\}$ be $S$- and $S \times U$-valued processes such that if $Y_n = i \in S$, then the value of state $i$, denoted by $V_n(i)$, is updated at the $n$th iterate. Likewise, if $Z_n = (i, a)$, then the policy component corresponding to the $(i, a)$-tuple is updated. Define $\{\nu_1(i, n)\}, \{\nu_2(i, a, n)\}$ by (for $n > 0$):
$$\nu_1(i, n) = \sum_{m=0}^{n-1} I\{Y_m = i\}, \quad \text{with } \nu_1(i, 0) = 0,$$
$$\nu_2(i, a, n) = \sum_{m=0}^{n-1} I\{Z_m = (i, a)\}, \quad \text{with } \nu_2(i, a, 0) = 0.$$
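As an illustration of this bookkeeping, the sampling of $\{Y_n\}, \{Z_n\}$ and the local clocks $\nu_1, \nu_2$ could be maintained as below; the uniform sampling distributions are an assumption made only for this sketch, since the text leaves the a priori sampling distributions general.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3

# Local clocks: nu1[i] counts past updates of V(i),
#               nu2[i, a] counts past updates of the policy component (i, a).
nu1 = np.zeros(n_states, dtype=int)
nu2 = np.zeros((n_states, n_actions), dtype=int)

def sample_Y():
    """Sample Y_n, the state whose value estimate is updated (uniform here)."""
    return rng.integers(n_states)

def sample_Z():
    """Sample Z_n = (i, a), the policy component that is updated (uniform here)."""
    return rng.integers(n_states), rng.integers(n_actions)

# One step of bookkeeping: only the sampled components advance their clocks,
# exactly as in the definitions of nu1(i, n) and nu2(i, a, n).
i = sample_Y()
nu1[i] += 1
i, a = sample_Z()
nu2[i, a] += 1
```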
As with policy gradient schemes [22], we parameterise the policy. Specifically, we consider a parameterised Boltzmann or Gibbs form for the SRP as below:
$$\pi_\theta(i, a) = \frac{\exp(\theta(i, a))}{\sum_b \exp(\theta(i, b))}, \quad i \in S, \ a \in U. \qquad (6)$$
For a given $\theta_0 > 0$, let $\Gamma_{\theta_0} : \mathbb{R} \to [-\theta_0, \theta_0]$ denote the projection map. Let $\{a(n)\}, \{b(n)\}$ be positive step-size sequences satisfying conditions we specify later.
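A minimal sketch of the Gibbs policy (6), the projection $\Gamma_{\theta_0}$, and one illustrative pair of step-size sequences with $a(n) = o(b(n))$ follows; the particular bound and exponents are assumptions for illustration and are not meant to reproduce the paper's step-size conditions.

```python
import numpy as np

theta0 = 10.0                         # projection bound for the actor parameters

def gibbs_policy(theta_i):
    """pi_theta(i, .) as in (6): a softmax of theta(i, .) over actions."""
    z = theta_i - theta_i.max()       # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def project(theta):
    """Gamma_{theta_0}: componentwise projection onto [-theta0, theta0]."""
    return np.clip(theta, -theta0, theta0)

# One illustrative pair of step-size sequences with a(n) = o(b(n)),
# so that the value (critic) update runs on the slower timescale:
def a(n):
    return 1.0 / (n + 1)              # critic step sizes

def b(n):
    return 1.0 / (n + 1) ** 0.55      # actor step sizes (asymptotically larger)
```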
The Critic-Actor Algorithm

The proposed critic-actor algorithm has a similar set of updates to Algorithm 3 of [14], but with the timescales reversed, i.e., $a(n) = o(b(n))$. Let $\{\xi_n(i, a)\}$ and $\{\eta_n(i, a)\}$, $i \in S$, $a \in U$, be independent families of i.i.d. random variables with law $p(i, a, \cdot)$. Let $\{\phi_n(i)\}$ be a sequence of $U$-valued i.i.d. random variables with the conditional law of $\phi_n(i)$, given the sigma-algebra $\sigma(V_m(\cdot), \pi_m(\cdot), \xi_m(\cdot, \cdot), \eta_m(\cdot, \cdot), m \leq n)$ generated by the random variables realised till $n$, being $\pi_n(i, \cdot)$. The critic and actor recursions now take the following form:
$$V_{n+1}(i) = V_n(i) + a(\nu_1(i, n))\big[g(i, \phi_n(i), \xi_n(i, \phi_n(i))) + \gamma V_n(\xi_n(i, \phi_n(i))) - V_n(i)\big]\, I\{Y_n = i\}, \qquad (7)$$
$$\theta_{n+1}(i, a) = \Gamma_{\theta_0}\Big(\theta_n(i, a) + b(\nu_2(i, a, n))\big[V_n(i) - g(i, a, \eta_n(i, a)) - \gamma V_n(\eta_n(i, a))\big]\, I\{Z_n = (i, a)\}\Big). \qquad (8)$$
The $V_n$ update is governed by the step-size sequence $a(n), n \geq 0$, while the $\theta_n$ update is governed by $b(n), n \geq 0$. From Assumption 1 below, this implies that the $V_n$ update proceeds on a slower timescale in comparison to the $\theta_n$ update.
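Putting the pieces together, a self-contained simulation of the tabular critic-actor recursions (7)-(8) might look as follows. The MDP instance, the uniform sampling of $Y_n$ and $Z_n$, and the step sizes are all illustrative assumptions; this is a sketch of the updates, not the authors' experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, theta0 = 5, 3, 0.9, 10.0

# Illustrative random MDP: p[i, a, :] is a distribution over next states, g the costs.
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((nS, nA, nS))

V = np.zeros(nS)                        # critic estimate V_n
theta = np.zeros((nS, nA))              # actor parameters theta_n
nu1 = np.zeros(nS, dtype=int)           # local clock for V updates
nu2 = np.zeros((nS, nA), dtype=int)     # local clock for theta updates

def softmax(x):
    w = np.exp(x - x.max())
    return w / w.sum()

def a_step(n):                          # slower timescale (critic), a(n) = o(b(n))
    return 1.0 / (n + 1)

def b_step(n):                          # faster timescale (actor)
    return 1.0 / (n + 1) ** 0.55

for n in range(100_000):
    # Critic update (7): sample Y_n = i, then phi_n(i) ~ pi_n(i, .) and a next state.
    i = rng.integers(nS)
    act = rng.choice(nA, p=softmax(theta[i]))
    j = rng.choice(nS, p=p[i, act])
    V[i] += a_step(nu1[i]) * (g[i, act, j] + gamma * V[j] - V[i])
    nu1[i] += 1

    # Actor update (8): sample Z_n = (i, a), then eta_n(i, a) ~ p(i, a, .).
    i, act = rng.integers(nS), rng.integers(nA)
    j = rng.choice(nS, p=p[i, act])
    delta = V[i] - g[i, act, j] - gamma * V[j]
    theta[i, act] = np.clip(theta[i, act] + b_step(nu2[i, act]) * delta,
                            -theta0, theta0)
    nu2[i, act] += 1

# The learned policy assigns the highest probability to the action with the
# largest theta(i, .), which should track the minimiser of the current Q-values.
learned_sdp = theta.argmax(axis=1)
```

Note that the critic step size is indexed by the local clock nu1[i] and the actor step size by nu2[i, a], mirroring $a(\nu_1(i, n))$ and $b(\nu_2(i, a, n))$ in (7)-(8).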
When compared to Q-learning, the actor here replaces the exact minimisation in each Q-learning iterate, which is not always easy, by a recursive scheme for the same, and can be seen as a subroutine for