
Unlike RL policies, human reasoning does not appear to follow a rigid feed-forward structure. In fact,
a range of popular psychological models characterize human decision-making as a sequential process
with adaptive temporal dynamics [17–20]. Many of these models have found empirical grounding in
neuroscience [21–24] and have been shown to effectively complement RL in capturing human behavior
in experimental settings [25, 26]. Partly inspired by these works, we attempt to reframe the deep RL
framework with a similarly flexible model of agent behavior, in order to counteract its
aforementioned limitations.
We introduce serial Markov chain reasoning, a new and powerful framework for representing agent
behavior. Our framework treats decision-making as an adaptive reasoning process, in which the agent
sequentially updates its beliefs about which action to execute over a series of reasoning steps. We
model this process by replacing the traditional policy with a parameterized transition function, which
defines a reasoning Markov chain (RMC). The steady-state distribution of the RMC represents the
distribution of agent behavior after performing enough reasoning for decision-making. Our framework
naturally overcomes the aforementioned limitations of traditional RL. In particular, we show that our
agent's behavior can approximate arbitrary distributions even with simple parameterized transition
functions. Moreover, the required number of reasoning steps adaptively scales with the difficulty of
each individual action-selection problem, and reasoning can be accelerated by re-using samples from similar RMCs.
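To make this process concrete, below is a minimal sketch of action selection under serial Markov chain reasoning. It reflects our own simplifying assumptions rather than the implementation described in this paper: `transition` is a hypothetical stand-in for the parameterized transition function of the RMC, and steady-state convergence is detected with a crude fixed-tolerance heuristic.

```python
import numpy as np

def reason_action(transition, state, action_dim, tol=1e-3, max_steps=100, rng=None):
    """Sample an action by simulating the reasoning Markov chain (RMC) until it
    appears to mix, i.e. consecutive reasoning steps stop changing the action belief.
    `transition(state, action)` is a hypothetical stand-in for the parameterized
    transition function producing a_{k+1} from (s, a_k)."""
    rng = rng or np.random.default_rng()
    a = rng.uniform(-1.0, 1.0, size=action_dim)   # initial action belief
    for step in range(1, max_steps + 1):
        a_next = transition(state, a)             # one reasoning step
        if np.linalg.norm(a_next - a) < tol:      # crude steady-state heuristic
            return a_next, step                   # harder decisions take more steps
        a = a_next
    return a, max_steps
```

The per-state step count returned by this sketch mirrors the adaptive scaling of reasoning effort described above.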
To optimize behavior modeled by the steady-state distribution of the RMC, we derive a new tractable
method to estimate the policy gradient. Hence, we implement a new effective off-policy algorithm
for maximum entropy reinforcement learning (MaxEnt RL) [27, 28], named Steady-State Policy
Gradient (SSPG). Using SSPG, we empirically validate the conceptual properties of our framework
over traditional MaxEnt RL. Moreover, we obtain state-of-the-art results for popular benchmarks
from the OpenAI Gym Mujoco suite [29] and the DeepMind Control suite from pixels [30].
In summary, this work makes the following key contributions:
1. We propose serial Markov chain reasoning, a framework to represent agent behavior that can
overcome expressivity and efficiency limitations inherent to traditional reinforcement learning.
2. Based on our framework, we derive SSPG, a new tractable off-policy algorithm for MaxEnt RL.
3. We provide experimental results validating the theorized properties of serial Markov chain reasoning
and displaying state-of-the-art performance on the Mujoco and DeepMind Control suites.
2 Background
2.1 Reinforcement learning problem
We consider the classical formulation of the reinforcement learning (RL) problem setting as a Markov
Decision Process (MDP) [31], defined by the tuple $(\mathcal{S}, \mathcal{A}, P, p_0, r, \gamma)$. In particular, at each discrete
time step $t$ the agent experiences a state from the environment's state space, $s_t \in \mathcal{S}$, based on which
it selects an action from its own action space, $a_t \in \mathcal{A}$. In continuous control problems (considered
in this work), the action space is typically a compact subset of a Euclidean space $\mathbb{R}^{\dim(\mathcal{A})}$. The
evolution of the environment's state through time is determined by the transition dynamics and initial
state distribution, $P$ and $p_0$. Lastly, the reward function $r$ represents the immediate level of progress
for any state-action tuple towards solving a target task. The agent's behavior is represented by a
state-conditioned parameterized policy distribution $\pi_\theta$. Hence, its interaction with the environment
produces trajectories, $\tau = (s_0, a_0, s_1, \ldots, s_T, a_T)$, according to the factored joint distribution
$p_{\pi_\theta}(\tau) = p_0(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)$. The RL objective is to optimize agent behavior so as to maximize
the discounted sum of expected future rewards: $\arg\max_\theta \, \mathbb{E}_{p_{\pi_\theta}(\tau)}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$.
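For illustration, the following sketch estimates this objective with simple Monte Carlo rollouts. The `env` and `policy` interfaces are hypothetical stand-ins, not components defined in this paper, for the MDP and the parameterized policy $\pi_\theta$.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a single trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, num_trajectories=16, horizon=1000, gamma=0.99):
    """Average discounted return over trajectories sampled from p_pi_theta(tau).
    `env` and `policy` are hypothetical interfaces for illustration only."""
    returns = []
    for _ in range(num_trajectories):
        s, rewards = env.reset(), []
        for _ in range(horizon):
            a = policy.sample(s)       # a_t ~ pi_theta(.|s_t)
            s, r, done = env.step(a)   # s_{t+1} ~ P(.|s_t, a_t), r = r(s_t, a_t)
            rewards.append(r)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / len(returns)
```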
2.2 Maximum entropy reinforcement learning and inference
Maximum entropy reinforcement learning (MaxEnt RL) [32] considers optimizing agent behavior for
a different objective that naturally arises when formulating action selection as an inference problem
[33–36]. Following Levine [28], we consider modeling a set of binary optimality random variables
with realization probability proportional to the exponentiated rewards scaled by the temperature $\alpha$,
$p(\mathcal{O}_t \mid s_t, a_t) \propto \exp\left(\frac{1}{\alpha} r(s_t, a_t)\right)$. The goal of MaxEnt RL is to minimize the KL-divergence between