
to the original RTDP algorithm for MDPs, we'll start by introducing some of those systems.
RTDP
Real-Time Dynamic Programming (RTDP) (Barto, Bradtke, and Singh 1995) is an early DP algorithm for solving MDPs that combines heuristic search with asynchronous value function updates. This approach provided a significant benefit over existing DP algorithms by performing value updates only on the portion of the state space that is relevant to solving the problem (assuming a good initial heuristic). RTDP is an anytime algorithm that generally produces good policies quickly, but its convergence time can be slow. Several extensions to the original RTDP algorithm have been proposed to improve its convergence time, all of which attempt to better focus the exploration and value updates on the most "fruitful" parts of the state space.
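To make this concrete, the following is a minimal sketch (not taken from the original paper) of a single RTDP trial on a discounted reward MDP, assuming hypothetical containers T, R, and a value table V pre-initialized from a heuristic. The key point is that Bellman backups are applied only to the states visited along simulated trajectories.

```python
import random

def rtdp_trial(s0, A, T, R, V, gamma=0.95, horizon=100):
    """One RTDP trial: simulate greedily from s0, backing up visited states.

    T[s][a] is assumed to be a dict {s_next: prob}, R[s][a] a scalar reward,
    and V a dict pre-initialized with an admissible heuristic (all hypothetical).
    """
    s = s0
    for _ in range(horizon):
        # Greedy Bellman backup at the current state (asynchronous update).
        q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
             for a in A}
        best_a = max(q, key=q.get)
        V[s] = q[best_a]  # update only the visited state
        # Sample the next state from the greedy action's outcome distribution.
        succs, probs = zip(*T[s][best_a].items())
        s = random.choices(succs, weights=probs)[0]
    return V
```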
Labeled RTDP (LRTDP) (Bonet and Geffner 2003) introduced a method of marking states as solved once their values, and those of all states reachable from them, had converged. Solved states are subsequently avoided in RTDP's exploration of the state space.
Bounded RTDP (BRTDP) (McMahan, Likhachev, and Gordon 2005) introduced the use of a bounded value function, with both an upper and a lower value attached to each state. BRTDP takes advantage of these bounds in two ways: to terminate search iterations and to guide search exploration. The BRTDP exploration strategy is designed to explore areas with large gaps, or uncertainty, in the value function so as to quickly "collapse" its bounds onto the optimal value V*. Several other algorithms have been proposed to improve RTDP using a bounded value function, each providing different search exploration heuristics and iteration termination criteria (Sanner et al. 2009; Smith and Simmons 2006).
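As a rough illustration only (details differ from the published algorithms), the outcome-sampling rule at the heart of this style of exploration can be sketched as follows: successor states are sampled in proportion to their transition probability weighted by the gap between their upper and lower bounds, so that trials are drawn toward the parts of the state space where the bounds have converged the least.

```python
import random

def sample_successor(s, a, T, V_upper, V_lower):
    """Sample a next state biased toward large bound gaps (BRTDP-style sketch).

    T[s][a] is a hypothetical dict {s_next: prob}; V_upper / V_lower hold the
    bounded value function. Weighting by prob * gap steers exploration toward
    states whose values are most uncertain.
    """
    succs = list(T[s][a].keys())
    weights = [T[s][a][s2] * (V_upper[s2] - V_lower[s2]) for s2 in succs]
    if sum(weights) == 0:  # all successors converged; fall back to T alone
        weights = [T[s][a][s2] for s2 in succs]
    return random.choices(succs, weights=weights)[0]
```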
POMDPs
The RTDP algorithm solves MDP problems, but MDPs are limited to modeling uncertainty only in the transition function and assume that the agent knows its state with certainty at all times. This is clearly not the case for agents in the real world, and we therefore turn to a more expressive model, namely the POMDP.
A fully specified discrete POMDP is defined by: a finite set of states and actions S and A, a transition function that defines the transition probabilities T(s, a, s′), a set of observations O, an observation function Ω(a, s′, o), a reward function R(s, a), and a discount factor γ. The transition function T encodes the dynamics of the environment. It specifies how each action will (possibly stochastically) transition the current state to the next one. As an example, for a wheeled mobile robot this might model the uncertainty about how its wheels slip on different surfaces. The observation function models the uncertainty in the agent's perception; for the wheeled mobile robot this could be used to model the uncertainty in its camera-based localization. The reward function R can be used to encode the goal of the planning task or, more generally, to specify states and/or actions that are desirable.
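For concreteness, a discrete POMDP of this form might be captured in code roughly as follows (a minimal sketch; the container and all field names are our own, not part of any standard library):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DiscretePOMDP:
    """Container for the tuple (S, A, T, O, Omega, R, gamma) described above."""
    states: Sequence[str]                    # S
    actions: Sequence[str]                   # A
    T: Callable[[str, str, str], float]      # T(s, a, s') -> probability
    observations: Sequence[str]              # O
    Omega: Callable[[str, str, str], float]  # Omega(a, s', o) -> probability
    R: Callable[[str, str], float]           # R(s, a) -> reward
    gamma: float                             # discount factor
```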
In a partially observable domain, the agent cannot directly observe its state and therefore needs to perform state estimation simultaneously with planning. A planning agent represents its uncertainty about its current state as a probability distribution over possible states, which we will refer to as a belief b, where b(s) := Pr(s | b). Given the transition and observation functions, a Bayesian belief update can be calculated in which the agent determines which next belief $b^o_a$ to hold if it originally holds belief b, takes action a, and receives observation o.
\[
b^o_a(s') = \frac{\Omega(a, s', o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid b, a)} \tag{1}
\]
\[
\Pr(o \mid b, a) = \sum_{s' \in S} \Omega(a, s', o) \sum_{s \in S} T(s, a, s')\, b(s) \tag{2}
\]
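A direct transcription of equations 1 and 2 might look like the sketch below (for illustration only; the `DiscretePOMDP` container above is our own construction):

```python
def belief_update(pomdp, b, a, o):
    """Bayesian belief update b -> b_a^o per equations 1 and 2.

    b is a dict mapping each state to its probability under the current belief.
    Returns the updated belief and Pr(o | b, a).
    """
    # Unnormalized update: Omega(a, s', o) * sum_s T(s, a, s') b(s)
    unnorm = {
        s2: pomdp.Omega(a, s2, o) * sum(pomdp.T(s, a, s2) * p for s, p in b.items())
        for s2 in pomdp.states
    }
    pr_o = sum(unnorm.values())  # Pr(o | b, a), equation 2
    if pr_o == 0:
        raise ValueError("observation o has zero probability under (b, a)")
    return {s2: v / pr_o for s2, v in unnorm.items()}, pr_o
```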
Using equations 1 and 2, we can restate the Bellman DP value update equations for POMDPs as follows:
\[
Q(b, a) := \sum_{s \in S} b(s) R(s, a) + \gamma \sum_{o \in O} \Pr(o \mid b, a)\, V(b^o_a) \tag{3}
\]
\[
V(b) := \max_{a \in A} Q(b, a) \tag{4}
\]
These equations recursively define the value of a belief, V(b), and the value of taking an action in a belief, Q(b, a). The latter is defined as the expected reward of taking that action plus the discounted expectation over successor belief values, and the former as the maximum Q(b, a) over all actions a. The optimal action policy can therefore be defined as always choosing the action with the highest Q value, or π*(b) := argmax_{a∈A} Q(b, a).
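Equations 3 and 4 translate almost line for line into code; the sketch below reuses the hypothetical `belief_update` helper above and treats V as any callable approximation of the belief value function:

```python
def q_value(pomdp, V, b, a):
    """Q(b, a) from equation 3, where V maps beliefs to values."""
    expected_reward = sum(p * pomdp.R(s, a) for s, p in b.items())
    future = 0.0
    for o in pomdp.observations:
        try:
            b_next, pr_o = belief_update(pomdp, b, a, o)
        except ValueError:  # Pr(o | b, a) = 0 contributes nothing
            continue
        future += pr_o * V(b_next)
    return expected_reward + pomdp.gamma * future

def value(pomdp, V, b):
    """V(b) from equation 4: the best Q value over all actions."""
    return max(q_value(pomdp, V, b, a) for a in pomdp.actions)
```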
RTDP-Bel
Geffner and Bonet presented an extension of RTDP called RTDP-Bel, which is able to handle partially observable domains (Geffner and Bonet 1998). The most significant difference between RTDP and RTDP-Bel (in addition to the fact that the former searches over state-action graphs and the latter over belief-action-observation graphs) is the way in which the value function is represented.
Value functions over beliefs are much more challenging to implement than those over discrete states, since for any finite set of states there is an infinite set of beliefs. One of the most commonly used representations (attributed to Sondik (Sondik 1971)) maintains a set of α-vectors, each of dimension |S|, where V(b) = max_α α · b. RTDP-Bel instead uses a function-approximation scheme that discretizes the beliefs and stores their values in a hash table that uses the discretized belief as its key.
\[
\hat{b}(s) = \lceil D \cdot b(s) \rceil \tag{5}
\]
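A minimal sketch of this scheme (our own illustration, not the authors' code): discretize the belief per equation 5, use the result as a hash key, and fall back to a heuristic value when a belief has not been visited before.

```python
import math

def discretize(b, D):
    """Map a belief to a hashable key per equation 5: b_hat(s) = ceil(D * b(s))."""
    return tuple(sorted((s, math.ceil(D * p)) for s, p in b.items() if p > 0))

class BeliefValueTable:
    """Hash-table value function keyed by discretized beliefs (RTDP-Bel style)."""
    def __init__(self, D, heuristic):
        self.D = D
        self.table = {}
        self.heuristic = heuristic  # hypothetical fallback for unseen beliefs

    def __call__(self, b):
        return self.table.get(discretize(b, self.D), self.heuristic(b))

    def update(self, b, value):
        self.table[discretize(b, self.D)] = value
```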
The discretization parameter D thus determines how "close" two beliefs must be in their assignment of probabilities to states for their values to be considered the same. The value function for RTDP-Bel is therefore defined as follows: