Policy Gradient With Serial Markov Chain Reasoning
Edoardo Cetin
Department of Engineering
King’s College London
edoardo.cetin@kcl.ac.uk
Oya Celiktutan
Department of Engineering
King’s College London
oya.celiktutan@kcl.ac.uk
Abstract
We introduce a new framework that performs decision-making in reinforcement
learning (RL) as an iterative reasoning process. We model agent behavior as
the steady-state distribution of a parameterized reasoning Markov chain (RMC),
optimized with a new tractable estimate of the policy gradient. We perform action
selection by simulating the RMC for enough reasoning steps to approach its steady-
state distribution. We show our framework has several useful properties that are
inherently missing from traditional RL. For instance, it allows agent behavior to
approximate any continuous distribution over actions by parameterizing the RMC
with a simple Gaussian transition function. Moreover, the number of reasoning
steps to reach convergence can scale adaptively with the difficulty of each action
selection decision and can be accelerated by re-using past solutions. Our resulting
algorithm achieves state-of-the-art performance in popular Mujoco and DeepMind
Control benchmarks, both for proprioceptive and pixel-based tasks.
1 Introduction
Reinforcement learning (RL) has the potential to provide a general and effective solution to many
modern challenges. Recently, this class of methods achieved numerous impressive milestones in
different problem domains, such as games [1–3], robotics [4–6], and other meaningful real-world applications [7–9]. However, all these achievements relied on massive amounts of data, controlled environments, and domain-specific tuning. These commonalities highlight some of the current practical limitations that prevent RL from being widely applicable [10].
Figure 1: Depiction of agent decision-making with serial Markov chain reasoning.
In the deep RL framework, practitioners train agents with the end goal of obtaining optimal behavior. Traditionally, agent behavior is modeled with feed-forward policies regressing from any state to a corresponding distribution over actions. Such a formulation yields practical training objectives in both off-policy [11–13] and on-policy settings [14–16]. However, we identify three inherent properties of this rigid representation of behavior that could considerably impact expressivity and efficiency in continuous control tasks. First, agent behavior is restricted to a class of tractable distributions, which might fail to capture the necessary complexity and multi-modality of a task. Second, the policy performs a fixed reasoning process with a feed-forward computation, whose potency cannot adapt to the varying complexity of individual action selection problems. Third, decision-making is performed every time from scratch, without re-using any past information that might still inform and facilitate the current action selection problem.
Unlike RL policies, human reasoning does not appear to follow a rigid feed-forward structure. In fact, a range of popular psychological models characterize human decision-making as a sequential process with adaptive temporal dynamics [17–20]. Many of these models have found empirical groundings in neuroscience [21–24] and have been shown to effectively complement RL for capturing human behavior in experimental settings [25, 26]. Partly inspired by these works, we attempt to reframe the deep RL framework by making use of a similar flexible model of agent behavior, in order to counteract its aforementioned limitations.
We introduce serial Markov chain reasoning - a new powerful framework for representing agent
behavior. Our framework treats decision-making as an adaptive reasoning process, where the agent
sequentially updates its beliefs regarding which action to execute in a series of reasoning steps. We
model this process by replacing the traditional policy with a parameterized transition function, which
defines a reasoning Markov chain (RMC). The steady-state distribution of the RMC represents the
distribution of agent behavior after performing enough reasoning for decision-making. Our framework
naturally overcomes the aforementioned limitations of traditional RL. In particular, we show that our
agent’s behavior can approximate any arbitrary distribution even with simple parameterized transition
functions. Moreover, the required number of reasoning steps adaptively scales with the difficulty of
individual action selection problems and can be accelerated by re-using samples from similar RMCs.
To optimize behavior modeled by the steady-state distribution of the RMC, we derive a new tractable
method to estimate the policy gradient. Hence, we implement a new effective off-policy algorithm
for maximum entropy reinforcement learning (MaxEnt RL) [27, 28], named Steady-State Policy Gradient (SSPG). Using SSPG, we empirically validate the conceptual properties of our framework
over traditional MaxEnt RL. Moreover, we obtain state-of-the-art results for popular benchmarks
from the OpenAI Gym Mujoco suite [29] and the DeepMind Control suite from pixels [30].
In summary, this work makes the following key contributions:
1. We propose serial Markov chain reasoning, a framework for representing agent behavior that can overcome expressivity and efficiency limitations inherent to traditional reinforcement learning.
2. Based on our framework, we derive SSPG, a new tractable off-policy algorithm for MaxEnt RL.
3. We provide experimental results validating the theorized properties of serial Markov chain reasoning and displaying state-of-the-art performance on the Mujoco and DeepMind Control suites.
2 Background
2.1 Reinforcement learning problem
We consider the classical formulation of the reinforcement learning (RL) problem setting as a Markov Decision Process (MDP) [31], defined by the tuple $(\mathcal{S}, \mathcal{A}, P, p_0, r, \gamma)$. In particular, at each discrete time step $t$ the agent experiences a state from the environment's state space, $s_t \in \mathcal{S}$, based on which it selects an action from its own action space, $a_t \in \mathcal{A}$. In continuous control problems (considered in this work), the action space is typically a compact subset of a Euclidean space $\mathbb{R}^{\dim(\mathcal{A})}$. The evolution of the environment's state through time is determined by the transition dynamics and initial state distribution, $P$ and $p_0$. Lastly, the reward function $r$ represents the immediate level of progress for any state-action tuple towards solving a target task. The agent's behavior is represented by a state-conditioned parameterized policy distribution $\pi_\theta$. Hence, its interaction with the environment produces trajectories, $\tau = (s_0, a_0, s_1, \ldots, s_T, a_T)$, according to the factored joint distribution $p_{\pi_\theta}(\tau) = p_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)$. The RL objective is to optimize agent behavior so as to maximize the discounted sum of expected future rewards: $\arg\max_\theta \, \mathbb{E}_{p_{\pi_\theta}(\tau)}\big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\big]$.
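To make the objective concrete, the short sketch below computes its inner quantity, the discounted return of a single sampled trajectory; the function name and plain-Python setting are illustrative assumptions rather than anything from the paper.

```python
# Discounted return of one trajectory's reward sequence: sum_t gamma^t * r_t.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards so each step applies one factor of gamma
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.0, 2.0], gamma=0.99) == 1.0 + 0.99**2 * 2.0
```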
2.2 Maximum entropy reinforcement learning and inference
Maximum entropy reinforcement learning (MaxEnt RL) [32] considers optimizing agent behavior for a different objective that naturally arises when formulating action selection as an inference problem [33–36]. Following Levine [28], we consider modeling a set of binary optimality random variables with realization probability proportional to the exponentiated rewards scaled by the temperature $\alpha$, $p(\mathcal{O}_t|s_t, a_t) \propto \exp\big(\frac{1}{\alpha} r(s_t, a_t)\big)$. The goal of MaxEnt RL is to minimize the KL-divergence between trajectories stemming from agent behavior, $p_{\pi_\theta}(\tau)$, and the inferred optimal behavior, $p(\tau|\mathcal{O}_{0:T})$:

$$D_{KL}\big(p_{\pi_\theta}(\tau) \,\|\, p(\tau|\mathcal{O}_{0:T})\big) = \mathbb{E}_{p_{\pi_\theta}(\tau)}\left[\log \frac{p_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)}{p_0(s_0) \prod_{t=0}^{T} \exp\big(\frac{1}{\alpha} r(s_t, a_t)\big) P(s_{t+1}|s_t, a_t)}\right] \propto -\mathbb{E}_{p_{\pi_\theta}(\tau)}\left[\sum_{t=0}^{T} r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot|s_t)\big)\right]. \quad (1)$$
The resulting entropy-regularized objective introduces an explicit trade-off between exploitation and exploration, regulated by the temperature parameter $\alpha$ scaling the policy's entropy. An effective choice to optimize this objective is to learn an auxiliary parameterized soft Q-function [37]:

$$Q^\pi_\phi(s_t, a_t) = \mathbb{E}_{p_{\pi_\theta}(\tau|s_t, a_t)}\left[r(s_t, a_t) + \sum_{t'=t+1}^{T} r(s_{t'}, a_{t'}) + \alpha \mathcal{H}\big(\pi(a_{t'}|s_{t'})\big)\right]. \quad (2)$$
Given some state, $Q^\pi_\phi(s, \cdot)$ represents an energy function based on the expected immediate reward and the agent's future likelihood of optimality from performing any action. Thus, we can locally optimize the MaxEnt objective by reducing the KL-divergence between $\pi$ and the canonical distribution of its current soft Q-function. This is equivalent to maximizing the expected soft Q-function's value corrected by the policy's entropy, resembling a regularized policy gradient objective [11, 12]:

$$\arg\max_\theta \; \mathbb{E}_{s,\, a \sim \pi_\theta(\cdot|s)}\left[Q^\pi_\phi(s, a) + \alpha \mathcal{H}\big(\pi_\theta(a|s)\big)\right]. \quad (3)$$
The policy is usually modeled with a neural network outputting the parameters of some tractable distribution, such as a factorized Gaussian, $\pi_\theta(\cdot|s) = \mathcal{N}(\mu_\theta(s); \Sigma_\theta(s))$. This practice allows the gradients from Eqn. 3 to be efficiently approximated via the reparameterization trick [38]. We consider the off-policy RL setting, where the agent alternates learning with storing new experience in a data buffer, $\mathcal{D}$. We refer the reader to Haarnoja et al. [13, 39] for further derivation and practical details.
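For concreteness, below is a minimal PyTorch-style sketch of this standard setup: a factorized Gaussian policy with reparameterized sampling and the entropy-regularized actor objective of Eq. 3. All class, function, and hyperparameter names are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Feed-forward policy outputting the parameters of a factorized Gaussian."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * action_dim))

    def rsample(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.clamp(-5.0, 2.0).exp())
        a = dist.rsample()                      # reparameterized sample (pathwise gradients)
        return a, dist.log_prob(a).sum(-1)      # action and log pi_theta(a|s)

def actor_loss(policy, q_net, states, alpha):
    """Negated Eq. 3 objective: maximize E[Q(s, a) + alpha * H(pi_theta(.|s))]."""
    a, logp = policy.rsample(states)
    return (alpha * logp - q_net(states, a)).mean()
```

The belief transition policy introduced in Section 3 keeps this simple Gaussian form, but additionally conditions on the previous action-belief rather than on the state alone.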
3 Policy Gradient with serial reasoning
3.1 Reasoning as a Markov chain
We introduce Serial Markov Chain Reasoning, a new framework to model agent behavior, based on conceptualizing action selection as an adaptive, sequential process which we refer to as reasoning. Instead of using a traditional policy, the agent selects which action to execute by maintaining an internal action-belief and a belief transition (BT-) policy, $\pi^b(a'|a, s)$. During the reasoning process, the agent updates its action-belief for a series of reasoning steps by sampling a new action with the BT-policy $\pi^b$, taking both the environment state and the previous action-belief as input. We naturally represent this process with a reasoning Markov chain (RMC), a discrete-time Markov chain over different action-beliefs, with transition dynamics given by the BT-policy. Hence, for any input environment state $s$ and initial action-belief $a_0$, the n-step transition probabilities of the RMC for future reasoning steps $n = 1, 2, 3, \ldots$ are defined as:

$$\pi^b_n(a|a_0, s) = \int_{\mathcal{A}} \pi^b(a|a', s)\, \pi^b_{n-1}(a'|a_0, s)\, da', \;\; \text{for } n > 1, \quad \text{and} \quad \pi^b_1 = \pi^b. \quad (4)$$
Given a compact action space and a BT-policy with a non-zero infimum density, we can ensure that as the number of reasoning steps grows, the probability of any action-belief in the RMC converges to some steady-state probability which is independent of the initial action-belief.¹ We denote this implicit probability distribution as the steady-state (SS-) policy, symbolized by $\pi^s(a|s)$:
Lemma 3.1. Steady-state convergence. For any environment state $s$, consider a reasoning Markov chain (RMC) defined on a compact action space $\mathcal{A}$ with transition probabilities given by $\pi^b(a'|a, s)$. Suppose that $\inf\{\pi^b(a'|a, s) : a', a \in \mathcal{A}\} > 0$. Then there exists a steady-state probability distribution function $\pi^s(\cdot|s)$ such that:

$$\lim_{n \to \infty} \pi^b_n(a|a_0, s) = \pi^s(a|s) \quad \text{for all } a \in \mathcal{A}. \quad (5)$$
Proof. See Appendix A.
Figure 2: (Left) BT-policy transition probabilities and quantized RMC state diagram. (Right) Sample approximation of related steady-state distribution as compared to canonical distribution of the soft Q-function.
The RMC's steady-state probabilities can be interpreted as representing the distribution of the agent's behavior after an appropriate number of reasoning steps are performed. In this work, we strive to optimize the agent's behavior following the MaxEnt RL framework described in Section 2. In particular, we consider learning a parameterized BT-policy, $\pi^b_\theta$, to produce appropriate transition probabilities for each environment state such that the SS-policy, $\pi^s_\theta$, from the resulting RMC optimizes:

$$\arg\max_\theta \; J(\theta) = \mathbb{E}_{s,\, a \sim \pi^s_\theta(\cdot|s)}\left[Q^s_\phi(s, a) + \alpha \mathcal{H}\big(\pi^s_\theta(a|s)\big)\right]. \quad (6)$$
Here, $Q^s_\phi$ is a parameterized soft Q-function for the agent's behavior from $\pi^s$, which we learn by minimizing a squared soft Bellman loss utilizing delayed parameters $\phi'$ and samples from $\pi^s_\theta$:

$$\arg\min_\phi \; J(\phi) = \mathbb{E}_{s, a, s'}\left[\Big(Q^s_\phi(s, a) - r(s, a) - \gamma\, \mathbb{E}_{a' \sim \pi^s_\theta(\cdot|s')}\big[Q^s_{\phi'}(s', a') + \alpha \mathcal{H}\big(\pi^s_\theta(a'|s')\big)\big]\Big)^2\right]. \quad (7)$$
In Fig. 2, we illustrate the relationship between a learned BT-policy, the corresponding SS-policy,
and the soft Q-function in a 1-dimensional toy task (see App. C for details). In this example, the
BT-policy is parameterized as a simple squashed Gaussian distribution, with unimodal transitions
between consecutive action beliefs (Fig. 2, Left). We obtain samples of agent behavior (the SS-policy)
by performing a series of reasoning steps, using the BT-policy to simulate the RMC until we approach
steady-state convergence. By plotting the resulting empirical distribution of agent behavior, we see
it closely matches the multi-modal, non-Gaussian canonical distribution from its soft Q-function
(Fig. 2, Right). This example shows how the expressive power of agent behavior in our framework
can go far beyond the BT-policy's simple parameterization, enabling the effective maximization
of complex and multi-modal MaxEnt objectives.
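The action-selection procedure described above can be sketched in a few lines; the interface of the hypothetical bt_policy object and the fixed step budget are assumptions for illustration (in practice the number of steps can be chosen adaptively and the chain can be warm-started from the previously executed action, as discussed in the introduction).

```python
import torch

@torch.no_grad()
def select_action(bt_policy, state, prev_action=None, n_steps=20):
    """Simulate the RMC for n_steps and return the final action-belief.

    Under the conditions of Lemma 3.1, the result is an approximate sample
    from the steady-state policy pi^s(.|state).
    """
    # Warm-start from the previously executed action when available.
    a = prev_action if prev_action is not None else torch.zeros(bt_policy.action_dim)
    for _ in range(n_steps):
        a = bt_policy.sample(a, state)   # one reasoning step: a_k ~ pi^b(.|a_{k-1}, s)
    return a
```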
3.2 Learning the belief transition policy
We propose a new method to estimate the policy gradient of the BT-policy, $\pi^b_\theta$, for optimizing the steady-state MaxEnt objective described in Section 3.1. We note that the gradient from Eq. 6 involves differentiating through an expectation over the steady-state policy, $\pi^s_\theta$. However, $\pi^s_\theta$ is only implicitly defined, and its connection with the actual BT-policy or its parameters does not have a tractable closed-form expression. To approach this problem, we introduce a family of n-step extensions to the soft Q-function, $Q^s_n : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ for $n = 0, 1, 2, \ldots$, defined as:

$$Q^s_n(s, a) = \int_{\mathcal{A}} \pi^b_n(a'|a, s)\, Q^s_\phi(s, a')\, da', \quad \text{with } \nabla_\theta Q^s_n(s, a) = 0. \quad (8)$$
Intuitively, each n-step soft Q-function $Q^s_n(s, a)$ outputs the expected soft Q-value after performing $n$ reasoning steps in the RMC from the initial action-belief $a$. However, we treat the output of each n-step soft Q-function as being independent of the actual parameters of the BT-policy, $\theta$. Hence, we can interpret computing $Q^s_n(s, a)$ as simulating the RMC with a fixed and immutable copy of the current $\pi^b_\theta$. We use this definition to provide a convenient notation in the following new theorem that expresses the policy gradient without differentiating through $\pi^s_\theta$:
¹This is unrelated to the steady-state distribution for infinite-horizon MDPs considered in prior work [40].
Theorem 3.2. Steady-state policy gradient. Let $\pi^b_\theta(\cdot|a, s)$ be a parameterized belief transition policy which defines a reasoning Markov chain with a stationary distribution given by the steady-state policy $\pi^s_\theta(\cdot|s)$. Let $Q^s$ be a real function defined on $\mathcal{S} \times \mathcal{A}$, with a family of n-step extensions $\{Q^s_n\}$ as defined in Eq. 8. Suppose $\pi^b$, $Q^s$ and their gradients with respect to $\theta$ (denoted $\nabla_\theta$) are continuous and bounded functions. Then

$$\nabla_\theta\, \mathbb{E}_{a \sim \pi^s_\theta(\cdot|s)}\left[Q^s(s, a)\right] = \mathbb{E}_{a \sim \pi^s_\theta(\cdot|s)}\left[\lim_{N \to \infty} \sum_{n=0}^{N} \nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}\left[Q^s_n(s, a')\right]\right]. \quad (9)$$
Proof. See Appendix A.
Using Lemma 3.1 (steady-state convergence), we can approximate the policy gradient expression in Eq. 9 with an arbitrarily small expected error using a finite number of n-step soft Q-functions, i.e., $N$ (see App. A). An intuition for this property follows from the fact that for large enough $n$, Lemma 3.1 implies that $\pi^b_n(a|a', s) \approx \pi^s_\theta(a|s)$ and, thus, $Q^s_n(s, a') \approx \int_{\mathcal{A}} \pi^s_\theta(a|s)\, Q^s_\phi(s, a)\, da$. Therefore, the value of each $Q^s_n(s, a')$ will be independent of the BT-policy's action $a'$, such that $\nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}[Q^s_n(s, a')] \approx 0$. In other words, each subsequent step in the RMC introduces additional randomness that is independent of $a'$, causing a warranted vanishing gradient phenomenon [41] which culminates with convergence to $\pi^s_\theta$. Using a similar notation as Haarnoja et al. [39], we apply the reparameterization trick [38] to express the BT-policy in terms of a deterministic function $f^b_\theta(a, s, \epsilon)$, taking as input a Gaussian noise vector $\epsilon$. This allows us to rewrite the gradient in each inner expectation in the sum from Eq. 9 as:

$$\nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}\left[Q^s_n(s, a')\right] = \mathbb{E}_{\epsilon_0 \sim \mathcal{N}(0, I)}\left[\nabla_{a'} Q^s_n(s, a')\, \nabla_\theta f^b_\theta(a, s, \epsilon_0)\right], \quad (10)$$

where $a' = f^b_\theta(a, s, \epsilon_0)$. We can apply the same reparameterization for all n-step soft Q-functions, to establish a new relationship between the gradient terms $\nabla_{a_0} Q^s_n(s, a_0)$:

$$\nabla_{a_0} Q^s_n(s, a_0) = \nabla_{a_0} \int_{\mathcal{A}} \pi^b_n(a_n|a_0, s)\, Q^s_\phi(s, a_n)\, da_n = \nabla_{a_0} \int_{\mathcal{A}} \pi^b(a_1|a_0, s)\, Q^s_{n-1}(s, a_1)\, da_1 = \mathbb{E}_{\epsilon_1}\left[\nabla_{a_1} Q^s_{n-1}(s, a_1)\, \nabla_{a_0} f^b(a_0, s, \epsilon_1)\right], \quad \text{where } a_1 = f^b(a_0, s, \epsilon_1). \quad (11)$$

In Eq. 11, we purposefully omit the dependence of $f^b$ and $\pi^b$ on $\theta$, since each $Q^s_n$ term is a local approximation of the RMC that does not depend on $\theta$ (as defined in Eq. 8). By recursively applying this relationship (Eq. 11) to $\nabla_{a_1} Q^s_{n-1}(s, a_1)$ and all subsequent gradient terms, we obtain:

$$\nabla_{a_0} Q^s_n(s, a_0) = \mathbb{E}_{\epsilon_1, \ldots, \epsilon_n}\left[\nabla_{a_n} Q^s_\phi(s, a_n) \prod_{i=0}^{n-1} \nabla_{a_i} f^b(a_i, s, \epsilon_{i+1})\right], \quad (12)$$

where $a_i = f^b(a_{i-1}, s, \epsilon_i)$ for $i = 1, \ldots, n$. By combining Eq. 10 and Eq. 12, we can thus reparameterize and express the whole sum in Eq. 9 as:

$$\nabla_\theta\, \mathbb{E}_{a \sim \pi^s_\theta(\cdot|s)}\left[Q^s(s, a)\right] \approx \sum_{n=0}^{N} \nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}\left[Q^s_n(s, a')\right] = \mathbb{E}_{\epsilon_0, \ldots, \epsilon_N}\left[\sum_{n=0}^{N} \nabla_{a_n} Q^s_\phi(s, a_n) \left(\prod_{i=0}^{n-1} \nabla_{a_i} f^b(a_i, s, \epsilon_{i+1})\right) \nabla_\theta f^b_\theta(a, s, \epsilon_0)\right]. \quad (13)$$
Eq. 13 intuitively corresponds to differentiating through each $Q^s_n(s, a')$ term by reparameterizing the RMC. Hence, to get a sample estimate of the policy gradient we can simulate the reparameterized RMC for $N$ reasoning steps to obtain $a_1, \ldots, a_N$, compute each $Q^s_\phi(s, a_n)$ term, and backpropagate (e.g., with autodifferentiation). Following Haarnoja et al. [13, 39], we can apply Theorem 3.2 and easily extend the same methodology to estimate the MaxEnt policy gradient from Eq. 6, which also involves an extra entropy term. We include this alternative derivation in App. A for completeness.
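To make the estimator concrete, here is a minimal PyTorch-style sketch of an actor update consistent with Eq. 13 (entropy term omitted). It assumes a hypothetical bt_policy.rsample interface returning a reparameterized next action-belief, and it is only an illustrative sketch under these assumptions, not the paper's actual SSPG implementation.

```python
import copy
import torch

def sspg_actor_loss(bt_policy, q_net, states, a_init, n_steps):
    """Sample estimate of the Eq. 13 policy gradient via backpropagation.

    a_init holds approximate samples from pi^s_theta(.|s), e.g., obtained by
    simulating the RMC without gradients beforehand.
    """
    # Later reasoning steps must not contribute direct parameter gradients
    # (each Q^s_n treats the chain as theta-independent), so they use a frozen
    # copy of the BT-policy; gradients still flow through their action inputs.
    bt_frozen = copy.deepcopy(bt_policy)
    for p in bt_frozen.parameters():
        p.requires_grad_(False)

    a = bt_policy.rsample(a_init, states)      # a_0 = f_theta(a, s, eps_0), carries d/d_theta
    q_sum = q_net(states, a)                   # Q^s_phi(s, a_0)

    for _ in range(n_steps):                   # reparameterized rollout a_1, ..., a_N
        a = bt_frozen.rsample(a, states)       # a_i = f(a_{i-1}, s, eps_i)
        q_sum = q_sum + q_net(states, a)       # accumulate Q^s_phi(s, a_i)

    # Minimizing the negated sum and backpropagating reproduces the Eq. 13 estimator.
    return -q_sum.mean()
```

The frozen copy reflects the stop-gradient convention of Eq. 8: each further reasoning step contributes only pathwise gradients through its action input, matching the product of Jacobians in Eq. 13.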