Actor-Critic or Critic-Actor? A Tale of Two Time Scales
Shalabh Bhatnagar1, Vivek S. Borkar2, and Soumyajit Guin1
Abstract
We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation with the value function computed on a faster time-scale and the policy computed on a slower time-scale. This emulates policy iteration. We observe that a reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We provide a proof of convergence and compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
Keywords: Reinforcement learning; approximate dynamic programming; critic-actor algorithm; two time-scale
stochastic approximation.
I. INTRODUCTION
The actor-critic algorithm of Barto et al. [1] is one of the foremost reinforcement learning algorithms
for data-driven approximate dynamic programming for Markov decision processes. Its rigorous analysis
as a two time-scale stochastic approximation was initiated in [14] and [15], in tabular form and with linear function approximation, respectively. The algorithm has a faster time-scale component for value function evaluation (the ‘critic’), with policy updates on a slower time scale (the ‘actor’). By the two time-scale philosophy, the former sees the latter as quasi-static, i.e., varying slowly enough on the time-scale of the critic that the critic can treat the actor’s output essentially as a constant. In turn, the actor sees the critic as quasi-equilibrated, i.e., tracking the value function corresponding to the current policy
estimate of the actor. Thus the scheme emulates policy iteration, wherein one alternates between value
function computation for a fixed policy and policy update based on this value function by minimization
of the associated ‘Q-value’. Another interpretation is that of a Stackelberg game between the actor and
the critic, with the actor leading and being followed by the critic. That is, the actor tunes the policy
slowly and the critic reacts to it rapidly, so that the actor can factor in the critic’s reaction in her update.
This raises the issue of what would happen if the time scales of the actor and the critic are reversed, rendering the critic the leader. We observe below that in this case, the scheme emulates value iteration,
making it a valid reinforcement learning scheme. We call it the ‘critic-actor’ algorithm. This structure as such is not novel and was explored in the context of Q-learning in [6], [8] with a different motivation. Its incorporation into the actor-critic facilitates a dimensionality reduction and a clean convergence analysis. This is so because we work with value and not Q-value functions.

SB was supported by a J. C. Bose Fellowship, Project No. DFTM/02/3125/M/04/AIR-04 from DRDO under DIA-RCOE, a project from DST-ICPS, and the RBCCPS, IISc. VSB was supported by the S. S. Bhatnagar Fellowship from the Council of Scientific and Industrial Research, Government of India.
1 Department of Computer Science and Automation, Indian Institute of Science, Bengaluru 560012, India. shalabh@iisc.ac.in, gsoumyajit@iisc.ac.in
2 Department of Electrical Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India. borkar.vs@gmail.com
arXiv:2210.04470v6 [cs.LG] 14 Jun 2024
While there are multiple variations of the actor-critic based on the exact scheme used for policy
updates, we stick to one of the three proposed in [14] to make our point. It may be recalled that one of
the purported advantages of the actor-critic algorithm is that, because of the slow time scale of the policy update, which renders it quasi-static, the value function update is ‘essentially’ a linear operation, and therefore one can legitimately use schemes such as TD(λ) [21], [23] that are based on linear function approximation. This is in contrast with, e.g., Q-learning [24], where the nonlinearity of the iterate is not friendly to linear function approximation. This problem returns in the actor-critic if we interchange the time
scales. Nevertheless, given the current trend towards using neural networks for function approximation,
this theoretical advantage is no longer there. In particular, this puts the actor-critic and critic-actor
schemes a priori on equal footing as far as nonlinear function approximation is concerned.
II. THE BASIC FRAMEWORK
Our Markov decision process (MDP) is a random process $X_n, n \geq 0$, in a finite state space $S$ satisfying, a.s.,
$$P(X_{n+1} = j \mid X_k, A_k, k \leq n) = p(X_n, A_n, j), \quad n \geq 0.$$
Here $A_n \in U(X_n)$ is the action at time $n$ when the state is $X_n$, where $U(i) :=$ the finite set of actions admissible in state $i$. Also, $p(i, a, j)$ denotes the transition probability from state $i$ to $j$ when a feasible action $a$ is chosen in state $i$. $\phi := \{A_n\}$ with each $A_n$ as above will be said to be admissible. For simplicity, we take $U(\cdot) \equiv U$. A stationary deterministic policy (SDP) $f$ is one where $A_n = f(X_n)$ for some $f : S \to U$. Similarly, if given $X_n$, $A_n$ is conditionally independent of $\{X_m, A_m, m < n\}$ and has the same conditional law $\pi : S \to \mathcal{P}(U)$ for all $n$, then $\pi$ is called a stationary randomised policy (SRP). Here $\mathcal{P}(U) :=$ the space of probability vectors on $U$. We shall denote the probability vector $\pi(i)$ as $\pi(i) = (\pi(i, a), a \in U(i))^T$. Let $g : S \times U \times S \to \mathbb{R}$ denote the single-stage cost function and $\gamma \in (0, 1)$ the discount factor.
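For concreteness, the objects above can be encoded directly; the following minimal Python sketch builds a small randomly generated MDP instance (the sizes, the random seed and the names n_states, n_actions, p, g, gamma are illustrative and not taken from the paper):

```python
# Minimal encoding of the finite MDP above: transition kernel p(i, a, j),
# single-stage cost g(i, a, j), and discount factor gamma in (0, 1).
# The random instance below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3        # |S| and |U| (we take U(.) = U)
gamma = 0.9                       # discount factor

# p[i, a, :] is a probability vector over next states j
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)

# g[i, a, j] is the cost of the transition (i, a) -> j
g = rng.random((n_states, n_actions, n_states))
```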
For a given admissible sequence $\phi = \{A_n\}$, consider the infinite horizon discounted cost
$$V_\phi(i) = E\left[\sum_{n=0}^{\infty} \gamma^n g(X_n, A_n, X_{n+1}) \,\Big|\, X_0 = i\right], \quad i \in S. \qquad (1)$$
The function $V_\pi := V_\phi$ for $\phi$ an SRP $\pi$ is called the value function under the SRP $\pi$. The corresponding (linear) Bellman equation takes the form
$$V_\pi(i) = \sum_{a \in U(i)} \pi(i, a) \sum_{j \in S} p(i, a, j)\big(g(i, a, j) + \gamma V_\pi(j)\big). \qquad (2)$$
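Since (2) is linear in $V_\pi$, it can be solved exactly for a fixed SRP. The sketch below does this on the illustrative arrays p, g, gamma introduced above by forming the policy-averaged transition matrix and one-stage cost; the function name evaluate_policy is our own.

```python
import numpy as np

def evaluate_policy(pi, p, g, gamma):
    """Solve the linear Bellman equation (2) for a fixed SRP pi, where
    pi[i, a] = pi(i, a), p[i, a, j] = p(i, a, j), g[i, a, j] = g(i, a, j)."""
    n_states = p.shape[0]
    # Policy-averaged transition matrix: P_pi[i, j] = sum_a pi(i, a) p(i, a, j)
    P_pi = np.einsum('ia,iaj->ij', pi, p)
    # Expected one-stage cost: r_pi[i] = sum_{a, j} pi(i, a) p(i, a, j) g(i, a, j)
    r_pi = np.einsum('ia,iaj,iaj->i', pi, p, g)
    # Solve (I - gamma P_pi) V_pi = r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

For instance, the value function of the uniform SRP on this illustrative instance is evaluate_policy(np.full((n_states, n_actions), 1.0 / n_actions), p, g, gamma).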
Define the ‘value function’
$$V(i) = \min_{\phi} V_\phi(i). \qquad (3)$$
This satisfies the Bellman equation
$$V(i) = \min_{a \in U(i)} \left( \sum_{j \in S} p(i, a, j)\big(g(i, a, j) + \gamma V(j)\big) \right). \qquad (4)$$
Define the Q-value function $Q(\cdot, \cdot) : S \times U \to \mathbb{R}$ so that $Q(i, a) :=$ the total cost incurred if, starting in state $i$, action $a$ is chosen and the optimal action is chosen in each state visited subsequently, i.e.,
$$Q(i, a) = \sum_{j \in S} p(i, a, j)\big(g(i, a, j) + \gamma V(j)\big). \qquad (5)$$
Then an SRP $\pi$ would be optimal if $\mathrm{support}(\pi(i)) \subseteq \arg\min Q(i, \cdot)$. In particular, an optimal SDP (hence SRP) exists and is given by any choice of minimiser of $Q(i, \cdot)$ for each $i$. See [20] for an extensive treatment.
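Equation (5) and the greedy selection of a minimiser can likewise be written in a few lines; again, this is only a sketch on the illustrative arrays above, not code from the paper.

```python
import numpy as np

def q_from_v(V, p, g, gamma):
    """Q(i, a) = sum_j p(i, a, j) (g(i, a, j) + gamma V(j)), as in (5)."""
    return np.einsum('iaj,iaj->ia', p, g + gamma * V[None, None, :])

def greedy_sdp(V, p, g, gamma):
    """An optimal SDP chooses, in each state i, a minimiser of Q(i, .)."""
    return q_from_v(V, p, g, gamma).argmin(axis=1)
```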
Policy iteration (PI) and value iteration (VI) are two of the numerical procedures for solving the Bellman equation. Whereas actor-critic algorithms [14] emulate PI, the algorithm with the time scales of the actor-critic reversed will be seen to mimic VI.
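To make the contrast concrete, here are bare-bones tabular versions of PI and VI, reusing the evaluate_policy and q_from_v helpers sketched above; the stopping rules and iteration limits are arbitrary illustrative choices.

```python
import numpy as np

def policy_iteration(p, g, gamma, max_iter=100):
    """PI: alternate exact policy evaluation with greedy policy improvement.
    Reuses evaluate_policy and q_from_v from the sketches above."""
    n_states, n_actions = p.shape[0], p.shape[1]
    f = np.zeros(n_states, dtype=int)            # current SDP, as action indices
    for _ in range(max_iter):
        pi = np.eye(n_actions)[f]                # represent the SDP as an SRP matrix
        V = evaluate_policy(pi, p, g, gamma)     # value of the current policy
        f_new = q_from_v(V, p, g, gamma).argmin(axis=1)
        if np.array_equal(f_new, f):             # policy is greedy w.r.t. its own value
            break
        f = f_new
    return f, V

def value_iteration(p, g, gamma, tol=1e-8, max_iter=100_000):
    """VI: iterate the Bellman operator of (4) until (approximate) convergence."""
    V = np.zeros(p.shape[0])
    for _ in range(max_iter):
        V_new = q_from_v(V, p, g, gamma).min(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return V_new
```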
III. THE PROPOSED CRITIC-ACTOR ALGORITHM
Let $V_n(\cdot)$ and $\pi_n(\cdot, \cdot)$ denote the estimates at instant $n$ of the value function and the optimal SRP.
We consider here an off-policy setting where states and state-action tuples are sampled from a priori
given distributions in order to decide the particular state whose value is updated next, as well as the
action-probability estimate (in the SRP) of the sampled state-action tuple to be updated. This is more
general than the on-policy setting commonly considered in standard actor-critic algorithms such as [15],
[10], as the latter turns out to be a special case of the off-policy setting we consider.
Let $\{Y_n\}$ and $\{Z_n\}$ be $S$- and $S \times U$-valued processes such that if $Y_n = i \in S$, then the value of state $i$, denoted by $V_n(i)$, is updated at the $n$th iterate. Likewise, if $Z_n = (i, a)$, then the policy component corresponding to the $(i, a)$-tuple is updated. Define $\{\nu_1(i, n)\}, \{\nu_2(i, a, n)\}$ by (for $n > 0$):
$$\nu_1(i, n) = \sum_{m=0}^{n-1} I\{Y_m = i\}, \quad \text{with } \nu_1(i, 0) = 0,$$
$$\nu_2(i, a, n) = \sum_{m=0}^{n-1} I\{Z_m = (i, a)\}, \quad \text{with } \nu_2(i, a, 0) = 0.$$
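As an illustration of this bookkeeping, the sampling of $\{Y_n\}, \{Z_n\}$ and the local clocks $\nu_1, \nu_2$ could be maintained as below; the uniform sampling distributions are an assumption made only for this sketch, since the text leaves the a priori sampling distributions general.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3

# Local clocks: nu1[i] counts past updates of V(i),
#               nu2[i, a] counts past updates of the policy component (i, a).
nu1 = np.zeros(n_states, dtype=int)
nu2 = np.zeros((n_states, n_actions), dtype=int)

def sample_Y():
    """Sample Y_n, the state whose value estimate is updated (uniform here)."""
    return rng.integers(n_states)

def sample_Z():
    """Sample Z_n = (i, a), the policy component that is updated (uniform here)."""
    return rng.integers(n_states), rng.integers(n_actions)

# One step of bookkeeping: only the sampled components advance their clocks,
# exactly as in the definitions of nu1(i, n) and nu2(i, a, n).
i = sample_Y()
nu1[i] += 1
i, a = sample_Z()
nu2[i, a] += 1
```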
As with policy gradient schemes [22], we parameterise the policy. Specifically, we consider a parameterised Boltzmann or Gibbs form for the SRP as below:
$$\pi_\theta(i, a) = \frac{\exp(\theta(i, a))}{\sum_b \exp(\theta(i, b))}, \quad i \in S, \ a \in U. \qquad (6)$$
For a given $\theta_0 > 0$, let $\Gamma_{\theta_0} : \mathbb{R} \to [-\theta_0, \theta_0]$ denote the projection map. Let $\{a(n)\}, \{b(n)\}$ be positive step-size sequences satisfying conditions we specify later.
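A minimal sketch of the Gibbs policy (6), the projection $\Gamma_{\theta_0}$, and one illustrative pair of step-size sequences with $a(n) = o(b(n))$ follows; the particular bound and exponents are assumptions for illustration and are not meant to reproduce the paper's step-size conditions.

```python
import numpy as np

theta0 = 10.0                         # projection bound for the actor parameters

def gibbs_policy(theta_i):
    """pi_theta(i, .) as in (6): a softmax of theta(i, .) over actions."""
    z = theta_i - theta_i.max()       # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def project(theta):
    """Gamma_{theta_0}: componentwise projection onto [-theta0, theta0]."""
    return np.clip(theta, -theta0, theta0)

# One illustrative pair of step-size sequences with a(n) = o(b(n)),
# so that the value (critic) update runs on the slower timescale:
def a(n):
    return 1.0 / (n + 1)              # critic step sizes

def b(n):
    return 1.0 / (n + 1) ** 0.55      # actor step sizes (asymptotically larger)
```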
The Critic-Actor Algorithm

The proposed critic-actor algorithm has a similar set of updates to Algorithm 3 of [14], but with the timescales reversed, i.e., $a(n) = o(b(n))$. Let $\{\xi_n(i, a)\}$ and $\{\eta_n(i, a)\}$, $i \in S$, $a \in U$, be independent families of i.i.d. random variables with law $p(i, a, \cdot)$. Let $\{\phi_n(i)\}$ be a sequence of $U$-valued i.i.d. random variables with the conditional law of $\phi_n(i)$, given the sigma-algebra $\sigma(V_m(\cdot), \pi_m(\cdot), \xi_m(\cdot, \cdot), \eta_m(\cdot, \cdot), m \leq n)$ generated by the random variables realised till $n$, being $\pi_n(i, \cdot)$. The critic and actor recursions now take the following form:
$$V_{n+1}(i) = V_n(i) + a(\nu_1(i, n))\big[g(i, \phi_n(i), \xi_n(i, \phi_n(i))) + \gamma V_n(\xi_n(i, \phi_n(i))) - V_n(i)\big]\, I\{Y_n = i\}, \qquad (7)$$
$$\theta_{n+1}(i, a) = \Gamma_{\theta_0}\Big(\theta_n(i, a) + b(\nu_2(i, a, n))\big[V_n(i) - g(i, a, \eta_n(i, a)) - \gamma V_n(\eta_n(i, a))\big]\, I\{Z_n = (i, a)\}\Big). \qquad (8)$$
The $V_n$ update is governed by the step-size sequence $a(n), n \geq 0$, while the $\theta_n$ update is governed by $b(n), n \geq 0$. From Assumption 1 below, this implies that the $V_n$ update proceeds on a slower timescale in comparison to the $\theta_n$ update.
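Putting the pieces together, a self-contained simulation of the tabular critic-actor recursions (7)-(8) might look as follows. The MDP instance, the uniform sampling of $Y_n$ and $Z_n$, and the step sizes are all illustrative assumptions; this is a sketch of the updates, not the authors' experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, theta0 = 5, 3, 0.9, 10.0

# Illustrative random MDP: p[i, a, :] is a distribution over next states, g the costs.
p = rng.random((nS, nA, nS))
p /= p.sum(axis=2, keepdims=True)
g = rng.random((nS, nA, nS))

V = np.zeros(nS)                        # critic estimate V_n
theta = np.zeros((nS, nA))              # actor parameters theta_n
nu1 = np.zeros(nS, dtype=int)           # local clock for V updates
nu2 = np.zeros((nS, nA), dtype=int)     # local clock for theta updates

def softmax(x):
    w = np.exp(x - x.max())
    return w / w.sum()

def a_step(n):                          # slower timescale (critic), a(n) = o(b(n))
    return 1.0 / (n + 1)

def b_step(n):                          # faster timescale (actor)
    return 1.0 / (n + 1) ** 0.55

for n in range(100_000):
    # Critic update (7): sample Y_n = i, then phi_n(i) ~ pi_n(i, .) and a next state.
    i = rng.integers(nS)
    act = rng.choice(nA, p=softmax(theta[i]))
    j = rng.choice(nS, p=p[i, act])
    V[i] += a_step(nu1[i]) * (g[i, act, j] + gamma * V[j] - V[i])
    nu1[i] += 1

    # Actor update (8): sample Z_n = (i, a), then eta_n(i, a) ~ p(i, a, .).
    i, act = rng.integers(nS), rng.integers(nA)
    j = rng.choice(nS, p=p[i, act])
    delta = V[i] - g[i, act, j] - gamma * V[j]
    theta[i, act] = np.clip(theta[i, act] + b_step(nu2[i, act]) * delta,
                            -theta0, theta0)
    nu2[i, act] += 1

# The learned policy assigns the highest probability to the action with the
# largest theta(i, .), which should track the minimiser of the current Q-values.
learned_sdp = theta.argmax(axis=1)
```

Note that the critic step size is indexed by the local clock nu1[i] and the actor step size by nu2[i, a], mirroring $a(\nu_1(i, n))$ and $b(\nu_2(i, a, n))$ in (7)-(8).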
When compared to Q-learning, the actor here replaces the exact minimisation in each Q-learning iterate, which is not always easy, by a recursive scheme for the same, and can be seen as a subroutine for