Policy Gradient With Serial Markov Chain Reasoning
Edoardo Cetin
Department of Engineering
King’s College London
edoardo.cetin@kcl.ac.uk
Oya Celiktutan
Department of Engineering
King’s College London
oya.celiktutan@kcl.ac.uk
Abstract
We introduce a new framework that performs decision-making in reinforcement
learning (RL) as an iterative reasoning process. We model agent behavior as
the steady-state distribution of a parameterized reasoning Markov chain (RMC),
optimized with a new tractable estimate of the policy gradient. We perform action
selection by simulating the RMC for enough reasoning steps to approach its steady-
state distribution. We show our framework has several useful properties that are
inherently missing from traditional RL. For instance, it allows agent behavior to
approximate any continuous distribution over actions by parameterizing the RMC
with a simple Gaussian transition function. Moreover, the number of reasoning
steps to reach convergence can scale adaptively with the difficulty of each action
selection decision and can be accelerated by re-using past solutions. Our resulting
algorithm achieves state-of-the-art performance in popular Mujoco and DeepMind
Control benchmarks, both for proprioceptive and pixel-based tasks.
1 Introduction
Reinforcement learning (RL) has the potential to provide a general and effective solution to many
modern challenges. Recently, this class of methods achieved numerous impressive milestones in
different problem domains, such as games [1–3], robotics [4–6], and other meaningful real-world applications [7–9]. However, all these achievements relied on massive amounts of data, controlled environments, and domain-specific tuning. These commonalities highlight some of the current practical limitations that prevent RL from being widely applicable [10].
Figure 1: Depiction of agent decision-making with serial Markov chain reasoning.
In the deep RL framework, practitioners train agents with the end goal of obtaining optimal behavior. Traditionally, agent behavior is modeled with feed-forward policies regressing from any state to a corresponding distribution over actions. Such a formulation yields practical training objectives in both off-policy [11–13] and on-policy settings [14–16]. However, we identify three inherent properties of this rigid representation of behavior that could considerably impact expressivity and efficiency in continuous control tasks. First, agent behavior is restricted to a class of tractable distributions, which might fail to capture the necessary complexity and multi-modality of a task. Second, the policy performs a fixed reasoning process with a feed-forward computation, whose potency cannot adapt to the varying complexity of individual action selection problems. Third, decision-making is performed every time from scratch, without re-using any past information that might still inform and facilitate the current action selection problem.
Unlike RL policies, human reasoning does not appear to follow a rigid feed-forward structure. In fact, a range of popular psychological models characterize human decision-making as a sequential process with adaptive temporal dynamics [17–20]. Many of these models have found empirical groundings in neuroscience [21–24] and have been shown to effectively complement RL for capturing human behavior in experimental settings [25, 26]. Partly inspired by these works, we attempt to reframe the deep RL framework by making use of a similar flexible model of agent behavior, in order to counteract its aforementioned limitations.
We introduce serial Markov chain reasoning - a new powerful framework for representing agent
behavior. Our framework treats decision-making as an adaptive reasoning process, where the agent
sequentially updates its beliefs regarding which action to execute in a series of reasoning steps. We
model this process by replacing the traditional policy with a parameterized transition function, which
defines a reasoning Markov chain (RMC). The steady-state distribution of the RMC represents the
distribution of agent behavior after performing enough reasoning for decision-making. Our framework
naturally overcomes the aforementioned limitations of traditional RL. In particular, we show that our
agent’s behavior can approximate any arbitrary distribution even with simple parameterized transition
functions. Moreover, the required number of reasoning steps adaptively scales with the difficulty of
individual action selection problems and can be accelerated by re-using samples from similar RMCs.
To optimize behavior modeled by the steady-state distribution of the RMC, we derive a new tractable
method to estimate the policy gradient. Hence, we implement a new effective off-policy algorithm
for maximum entropy reinforcement learning (MaxEnt RL) [27, 28], named Steady-State Policy Gradient (SSPG). Using SSPG, we empirically validate the conceptual properties of our framework
over traditional MaxEnt RL. Moreover, we obtain state-of-the-art results for popular benchmarks
from the OpenAI Gym Mujoco suite [29] and the DeepMind Control suite from pixels [30].
In summary, this work makes the following key contributions:
1. We propose serial Markov chain reasoning, a framework for representing agent behavior that can overcome expressivity and efficiency limitations inherent to traditional reinforcement learning.
2. Based on our framework, we derive SSPG, a new tractable off-policy algorithm for MaxEnt RL.
3. We provide experimental results validating the theorized properties of serial Markov chain reasoning and displaying state-of-the-art performance on the Mujoco and DeepMind Control suites.
2 Background
2.1 Reinforcement learning problem
We consider the classical formulation of the reinforcement learning (RL) problem setting as a Markov Decision Process (MDP) [31], defined by the tuple $(\mathcal{S}, \mathcal{A}, P, p_0, r, \gamma)$. In particular, at each discrete time step $t$ the agent experiences a state from the environment's state space, $s_t \in \mathcal{S}$, based on which it selects an action from its own action space, $a_t \in \mathcal{A}$. In continuous control problems (considered in this work), the action space is typically a compact subset of a Euclidean space $\mathbb{R}^{\dim(\mathcal{A})}$. The evolution of the environment's state through time is determined by the transition dynamics and initial state distribution, $P$ and $p_0$. Lastly, the reward function $r$ represents the immediate level of progress for any state-action tuple towards solving a target task. The agent's behavior is represented by a state-conditioned parameterized policy distribution $\pi_\theta$. Hence, its interaction with the environment produces trajectories, $\tau = (s_0, a_0, s_1, \ldots, s_T, a_T)$, according to the factored joint distribution $p_{\pi_\theta}(\tau) = p_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)$. The RL objective is to optimize agent behavior so as to maximize the discounted sum of expected future rewards: $\arg\max_\theta \, \mathbb{E}_{p_{\pi_\theta}(\tau)}\big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\big]$.
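To make the objective concrete, the short sketch below computes its inner quantity, the discounted return of a single sampled trajectory; the function name and plain-Python setting are illustrative assumptions rather than anything from the paper.

```python
# Discounted return of one trajectory's reward sequence: sum_t gamma^t * r_t.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards so each step applies one factor of gamma
        g = r + gamma * g
    return g

# Example: discounted_return([1.0, 0.0, 2.0], gamma=0.99) == 1.0 + 0.99**2 * 2.0
```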
2.2 Maximum entropy reinforcement learning and inference
Maximum entropy reinforcement learning (MaxEnt RL) [32] considers optimizing agent behavior for a different objective that naturally arises when formulating action selection as an inference problem [33–36]. Following Levine [28], we consider modeling a set of binary optimality random variables with realization probability proportional to the exponentiated rewards scaled by the temperature $\alpha$, $p(\mathcal{O}_t|s_t, a_t) \propto \exp\big(\frac{1}{\alpha} r(s_t, a_t)\big)$. The goal of MaxEnt RL is to minimize the KL-divergence between trajectories stemming from agent behavior, $p_{\pi_\theta}(\tau)$, and the inferred optimal behavior, $p(\tau|\mathcal{O}_{0:T})$:

$$D_{KL}\big(p_{\pi_\theta}(\tau) \,\|\, p(\tau|\mathcal{O}_{0:T})\big) = \mathbb{E}_{p_{\pi_\theta}(\tau)}\left[\log \frac{p_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)}{p_0(s_0) \prod_{t=0}^{T} \exp\big(\frac{1}{\alpha} r(s_t, a_t)\big) P(s_{t+1}|s_t, a_t)}\right] \propto -\mathbb{E}_{p_{\pi_\theta}(\tau)}\left[\sum_{t=0}^{T} r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot|s_t)\big)\right]. \quad (1)$$
The resulting entropy-regularized objective introduces an explicit trade-off between exploitation and exploration, regulated by the temperature parameter $\alpha$ scaling the policy's entropy. An effective choice to optimize this objective is to learn an auxiliary parameterized soft Q-function [37]:

$$Q^\pi_\phi(s_t, a_t) = \mathbb{E}_{p_{\pi_\theta}(\tau|s_t, a_t)}\left[r(s_t, a_t) + \sum_{t'=t+1}^{T} r(s_{t'}, a_{t'}) + \alpha \mathcal{H}\big(\pi(a_{t'}|s_{t'})\big)\right]. \quad (2)$$
Given some state, $Q^\pi_\phi(s, \cdot)$ represents an energy function based on the expected immediate reward and the agent's future likelihood of optimality from performing any action. Thus, we can locally optimize the MaxEnt objective by reducing the KL-divergence between $\pi$ and the canonical distribution of its current soft Q-function. This is equivalent to maximizing the expected soft Q-function's value corrected by the policy's entropy, resembling a regularized policy gradient objective [11, 12]:

$$\arg\max_\theta \; \mathbb{E}_{s,\, a \sim \pi_\theta(\cdot|s)}\left[Q^\pi_\phi(s, a) + \alpha \mathcal{H}\big(\pi_\theta(a|s)\big)\right]. \quad (3)$$
The policy is usually modeled with a neural network outputting the parameters of some tractable distribution, such as a factorized Gaussian, $\pi_\theta(\cdot|s) = \mathcal{N}(\mu_\theta(s); \Sigma_\theta(s))$. This practice allows the gradients from Eqn. 3 to be efficiently approximated via the reparameterization trick [38]. We consider the off-policy RL setting, where the agent alternates learning with storing new experience in a data buffer, $\mathcal{D}$. We refer the reader to Haarnoja et al. [13, 39] for further derivation and practical details.
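For concreteness, below is a minimal PyTorch-style sketch of this standard setup: a factorized Gaussian policy with reparameterized sampling and the entropy-regularized actor objective of Eq. 3. All class, function, and hyperparameter names are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Feed-forward policy outputting the parameters of a factorized Gaussian."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * action_dim))

    def rsample(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, log_std.clamp(-5.0, 2.0).exp())
        a = dist.rsample()                      # reparameterized sample (pathwise gradients)
        return a, dist.log_prob(a).sum(-1)      # action and log pi_theta(a|s)

def actor_loss(policy, q_net, states, alpha):
    """Negated Eq. 3 objective: maximize E[Q(s, a) + alpha * H(pi_theta(.|s))]."""
    a, logp = policy.rsample(states)
    return (alpha * logp - q_net(states, a)).mean()
```

The belief transition policy introduced in Section 3 keeps this simple Gaussian form, but additionally conditions on the previous action-belief rather than on the state alone.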
3 Policy Gradient with serial reasoning
3.1 Reasoning as a Markov chain
We introduce Serial Markov Chain Reasoning, a new framework to model agent behavior, based on conceptualizing action selection as an adaptive, sequential process which we refer to as reasoning. Instead of using a traditional policy, the agent selects which action to execute by maintaining an internal action-belief and a belief transition (BT-) policy, $\pi^b(a'|a, s)$. During the reasoning process, the agent updates its action-belief for a series of reasoning steps by sampling a new action with the BT-policy $\pi^b$, taking both the environment state and the previous action-belief as input. We naturally represent this process with a reasoning Markov chain (RMC), a discrete-time Markov chain over different action-beliefs, with transition dynamics given by the BT-policy. Hence, for any input environment state $s$ and initial action-belief $a_0$, the n-step transition probabilities of the RMC for future reasoning steps $n = 1, 2, 3, \ldots$ are defined as:

$$\pi^b_n(a|a_0, s) = \int_{\mathcal{A}} \pi^b(a|a', s)\, \pi^b_{n-1}(a'|a_0, s)\, da', \;\; \text{for } n > 1, \quad \text{and} \quad \pi^b_1 = \pi^b. \quad (4)$$
Given a compact action space and a BT-policy with a non-zero infimum density, we can ensure that as the number of reasoning steps grows, the probability of any action-belief in the RMC converges to some steady-state probability which is independent of the initial action-belief.¹ We denote this implicit probability distribution as the steady-state (SS-) policy, symbolized by $\pi^s(a|s)$:
Lemma 3.1. Steady-state convergence. For any environment state $s$, consider a reasoning Markov chain (RMC) defined on a compact action space $\mathcal{A}$ with transition probabilities given by $\pi^b(a'|a, s)$. Suppose that $\inf\{\pi^b(a'|a, s) : a', a \in \mathcal{A}\} > 0$. Then there exists a steady-state probability distribution function $\pi^s(\cdot|s)$ such that:

$$\lim_{n \to \infty} \pi^b_n(a|a_0, s) = \pi^s(a|s) \quad \text{for all } a \in \mathcal{A}. \quad (5)$$
Proof. See Appendix A.
Figure 2: (Left) BT-policy transition probabilities and quantized RMC state diagram. (Right) Sample approximation of related steady-state distribution as compared to canonical distribution of the soft Q-function.
The RMC's steady-state probabilities can be interpreted as representing the distribution of the agent's behavior after an appropriate number of reasoning steps are performed. In this work, we strive to optimize the agent's behavior following the MaxEnt RL framework described in Section 2. In particular, we consider learning a parameterized BT-policy, $\pi^b_\theta$, to produce appropriate transition probabilities for each environment state such that the SS-policy, $\pi^s_\theta$, from the resulting RMC optimizes:

$$\arg\max_\theta \; J(\theta) = \mathbb{E}_{s,\, a \sim \pi^s_\theta(\cdot|s)}\left[Q^s_\phi(s, a) + \alpha \mathcal{H}\big(\pi^s_\theta(a|s)\big)\right]. \quad (6)$$
Here, $Q^s_\phi$ is a parameterized soft Q-function for the agent's behavior from $\pi^s$, which we learn by minimizing a squared soft Bellman loss utilizing delayed parameters $\phi'$ and samples from $\pi^s_\theta$:

$$\arg\min_\phi \; J(\phi) = \mathbb{E}_{s, a, s'}\left[\Big(Q^s_\phi(s, a) - r(s, a) - \gamma\, \mathbb{E}_{a' \sim \pi^s_\theta(\cdot|s')}\big[Q^s_{\phi'}(s', a') + \alpha \mathcal{H}\big(\pi^s_\theta(a'|s')\big)\big]\Big)^2\right]. \quad (7)$$
In Fig. 2, we illustrate the relationship between a learned BT-policy, the corresponding SS-policy,
and the soft Q-function in a 1-dimensional toy task (see App. C for details). In this example, the
BT-policy is parameterized as a simple squashed Gaussian distribution, with unimodal transitions
between consecutive action beliefs (Fig. 2, Left). We obtain samples of agent behavior (the SS-policy)
by performing a series of reasoning steps, using the BT-policy to simulate the RMC until we approach
steady-state convergence. By plotting the resulting empirical distribution of agent behavior, we see
it closely matches the multi-modal, non-Gaussian canonical distribution from its soft Q-function
(Fig. 2, Right). This example shows how the expressive power of agent behavior in our framework
can go far beyond the BT-policy's simple parameterization, enabling the effective maximization
of complex and multi-modal MaxEnt objectives.
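The action-selection procedure described above can be sketched in a few lines; the interface of the hypothetical bt_policy object and the fixed step budget are assumptions for illustration (in practice the number of steps can be chosen adaptively and the chain can be warm-started from the previously executed action, as discussed in the introduction).

```python
import torch

@torch.no_grad()
def select_action(bt_policy, state, prev_action=None, n_steps=20):
    """Simulate the RMC for n_steps and return the final action-belief.

    Under the conditions of Lemma 3.1, the result is an approximate sample
    from the steady-state policy pi^s(.|state).
    """
    # Warm-start from the previously executed action when available.
    a = prev_action if prev_action is not None else torch.zeros(bt_policy.action_dim)
    for _ in range(n_steps):
        a = bt_policy.sample(a, state)   # one reasoning step: a_k ~ pi^b(.|a_{k-1}, s)
    return a
```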
3.2 Learning the belief transition policy
We propose a new method to estimate the policy gradient of the BT-policy, $\pi^b_\theta$, for optimizing the steady-state MaxEnt objective described in Section 3.1. We note that the gradient from Eq. 6 involves differentiating through an expectation over the steady-state policy, $\pi^s_\theta$. However, $\pi^s_\theta$ is only implicitly defined, and its connection with the actual BT-policy or its parameters does not have a tractable closed-form expression. To approach this problem, we introduce a family of n-step extensions to the soft Q-function, $Q^s_n : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ for $n = 0, 1, 2, \ldots$, defined as:

$$Q^s_n(s, a) = \int_{\mathcal{A}} \pi^b_n(a'|a, s)\, Q^s_\phi(s, a')\, da', \quad \text{with } \nabla_\theta Q^s_n(s, a) = 0. \quad (8)$$
Intuitively, each n-step soft Q-function $Q^s_n(s, a)$ outputs the expected soft Q-value after performing $n$ reasoning steps in the RMC from the initial action-belief $a$. However, we treat the output of each n-step soft Q-function as being independent of the actual parameters of the BT-policy, $\theta$. Hence, we can interpret computing $Q^s_n(s, a)$ as simulating the RMC with a fixed and immutable copy of the current $\pi^b_\theta$. We use this definition to provide a convenient notation in the following new theorem that expresses the policy gradient without differentiating through $\pi^s_\theta$:
¹This is unrelated to the steady-state distribution for infinite-horizon MDPs considered in prior work [40].
Theorem 3.2. Steady-state policy gradient. Let $\pi^b_\theta(\cdot|a, s)$ be a parameterized belief transition policy which defines a reasoning Markov chain with a stationary distribution given by the steady-state policy $\pi^s_\theta(\cdot|s)$. Let $Q^s$ be a real function defined on $\mathcal{S} \times \mathcal{A}$, with a family of n-step extensions $\{Q^s_n\}$ as defined in Eq. 8. Suppose $\pi^b$, $Q^s$ and their gradients with respect to $\theta$ (denoted $\nabla_\theta$) are continuous and bounded functions. Then

$$\nabla_\theta\, \mathbb{E}_{a \sim \pi^s_\theta(\cdot|s)}\left[Q^s(s, a)\right] = \mathbb{E}_{a \sim \pi^s_\theta(\cdot|s)}\left[\lim_{N \to \infty} \sum_{n=0}^{N} \nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}\left[Q^s_n(s, a')\right]\right]. \quad (9)$$
Proof. See Appendix A.
Using Lemma 3.1 (steady-state convergence), we can approximate the policy gradient expression in Eq. 9 with an arbitrarily small expected error using a finite number of n-step soft Q-functions, i.e., $N$ (see App. A). An intuition for this property follows from the fact that for large enough $n$, Lemma 3.1 implies that $\pi^b_n(a|a', s) \approx \pi^s_\theta(a|s)$ and, thus, $Q^s_n(s, a') \approx \int_{\mathcal{A}} \pi^s_\theta(a|s)\, Q^s_\phi(s, a)\, da$. Therefore, the value of each $Q^s_n(s, a')$ will be independent of the BT-policy's action $a'$, such that $\nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}[Q^s_n(s, a')] \approx 0$. In other words, each subsequent step in the RMC introduces additional randomness that is independent of $a'$, causing a warranted vanishing gradient phenomenon [41] which culminates with convergence to $\pi^s_\theta$. Using a similar notation as Haarnoja et al. [39], we apply the reparameterization trick [38] to express the BT-policy in terms of a deterministic function $f^b_\theta(a, s, \epsilon)$, taking as input a Gaussian noise vector $\epsilon$. This allows us to rewrite the gradient in each inner expectation in the sum from Eq. 9 as:

$$\nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}\left[Q^s_n(s, a')\right] = \mathbb{E}_{\epsilon_0 \sim \mathcal{N}(0, I)}\left[\nabla_{a'} Q^s_n(s, a')\, \nabla_\theta f^b_\theta(a, s, \epsilon_0)\right], \quad (10)$$

where $a' = f^b_\theta(a, s, \epsilon_0)$. We can apply the same reparameterization for all n-step soft Q-functions, to establish a new relationship between the gradient terms $\nabla_{a_0} Q^s_n(s, a_0)$:

$$\nabla_{a_0} Q^s_n(s, a_0) = \nabla_{a_0} \int_{\mathcal{A}} \pi^b_n(a_n|a_0, s)\, Q^s_\phi(s, a_n)\, da_n = \nabla_{a_0} \int_{\mathcal{A}} \pi^b(a_1|a_0, s)\, Q^s_{n-1}(s, a_1)\, da_1 = \mathbb{E}_{\epsilon_1}\left[\nabla_{a_1} Q^s_{n-1}(s, a_1)\, \nabla_{a_0} f^b(a_0, s, \epsilon_1)\right], \quad \text{where } a_1 = f^b(a_0, s, \epsilon_1). \quad (11)$$

In Eq. 11, we purposefully omit the dependence of $f^b$ and $\pi^b$ on $\theta$, since each $Q^s_n$ term is a local approximation of the RMC that does not depend on $\theta$ (as defined in Eq. 8). By recursively applying this relationship (Eq. 11) to $\nabla_{a_1} Q^s_{n-1}(s, a_1)$ and all subsequent gradient terms, we obtain:

$$\nabla_{a_0} Q^s_n(s, a_0) = \mathbb{E}_{\epsilon_1, \ldots, \epsilon_n}\left[\nabla_{a_n} Q^s_\phi(s, a_n) \prod_{i=0}^{n-1} \nabla_{a_i} f^b(a_i, s, \epsilon_{i+1})\right], \quad (12)$$

where $a_i = f^b(a_{i-1}, s, \epsilon_i)$ for $i = 1, \ldots, n$. By combining Eq. 10 and Eq. 12, we can thus reparameterize and express the whole sum in Eq. 9 as:

$$\nabla_\theta\, \mathbb{E}_{a \sim \pi^s_\theta(\cdot|s)}\left[Q^s(s, a)\right] \approx \sum_{n=0}^{N} \nabla_\theta\, \mathbb{E}_{a' \sim \pi^b_\theta(\cdot|a, s)}\left[Q^s_n(s, a')\right] = \mathbb{E}_{\epsilon_0, \ldots, \epsilon_N}\left[\sum_{n=0}^{N} \nabla_{a_n} Q^s_\phi(s, a_n) \left(\prod_{i=0}^{n-1} \nabla_{a_i} f^b(a_i, s, \epsilon_{i+1})\right) \nabla_\theta f^b_\theta(a, s, \epsilon_0)\right]. \quad (13)$$
Eq. 13 intuitively corresponds to differentiating through each $Q^s_n(s, a')$ term by reparameterizing the RMC. Hence, to get a sample estimate of the policy gradient we can simulate the reparameterized RMC for $N$ reasoning steps to obtain $a_1, \ldots, a_N$, compute each $Q^s_\phi(s, a_n)$ term, and backpropagate (e.g., with autodifferentiation). Following Haarnoja et al. [13, 39], we can apply Theorem 3.2 and easily extend the same methodology to estimate the MaxEnt policy gradient from Eq. 6, which also involves an extra entropy term. We include this alternative derivation in App. A for completeness.
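To make the estimator concrete, here is a minimal PyTorch-style sketch of an actor update consistent with Eq. 13 (entropy term omitted). It assumes a hypothetical bt_policy.rsample interface returning a reparameterized next action-belief, and it is only an illustrative sketch under these assumptions, not the paper's actual SSPG implementation.

```python
import copy
import torch

def sspg_actor_loss(bt_policy, q_net, states, a_init, n_steps):
    """Sample estimate of the Eq. 13 policy gradient via backpropagation.

    a_init holds approximate samples from pi^s_theta(.|s), e.g., obtained by
    simulating the RMC without gradients beforehand.
    """
    # Later reasoning steps must not contribute direct parameter gradients
    # (each Q^s_n treats the chain as theta-independent), so they use a frozen
    # copy of the BT-policy; gradients still flow through their action inputs.
    bt_frozen = copy.deepcopy(bt_policy)
    for p in bt_frozen.parameters():
        p.requires_grad_(False)

    a = bt_policy.rsample(a_init, states)      # a_0 = f_theta(a, s, eps_0), carries d/d_theta
    q_sum = q_net(states, a)                   # Q^s_phi(s, a_0)

    for _ in range(n_steps):                   # reparameterized rollout a_1, ..., a_N
        a = bt_frozen.rsample(a, states)       # a_i = f(a_{i-1}, s, eps_i)
        q_sum = q_sum + q_net(states, a)       # accumulate Q^s_phi(s, a_i)

    # Minimizing the negated sum and backpropagating reproduces the Eq. 13 estimator.
    return -q_sum.mean()
```

The frozen copy reflects the stop-gradient convention of Eq. 8: each further reasoning step contributes only pathwise gradients through its action input, matching the product of Jacobians in Eq. 13.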