B3RTDP: A Belief Branch and Bound Real-Time
Dynamic Programming Approach to Solving POMDPs
Sigurdur Orn Adalgeirsson
MIT Media Lab
Cynthia Breazeal
MIT Media Lab
Abstract
Partially Observable Markov Decision Processes
(POMDPs) offer a promising world representation for
autonomous agents, as they can model both transitional
and perceptual uncertainties. Calculating the optimal
solution to POMDP problems can be computationally
expensive as they require reasoning over the (possibly
infinite) space of beliefs. Several approaches have
been proposed to overcome this difficulty, such as
discretizing the belief space, point-based belief sam-
pling, and Monte Carlo tree search. The Real-Time
Dynamic Programming approach of the RTDP-Bel
algorithm approximates the value function by storing it
in a hashtable with discretized belief keys. We propose
an extension to the RTDP-Bel algorithm which we
call Belief Branch and Bound RTDP (B3RTDP). Our
algorithm uses a bounded value function representation
and takes advantage of this in two novel ways: a
search-bounding technique based on action selection
convergence probabilities, and a method for leveraging
early action convergence called the Convergence
Frontier. Lastly, we empirically demonstrate that
B3RTDP can achieve greater returns in less time than
the state-of-the-art SARSOP solver on known POMDP
problems.
Introduction
As autonomous agents and robots face increasingly chal-
lenging planning and reasoning tasks, it is important that
they can adequately represent and reason about their ca-
pabilities and surroundings. Markov Decision Processes
(MDPs) (Bellman 1957) have been heavily used as a core
representation for probabilistic planning problems and have
proven sufficiently powerful for many domains, but lack ex-
pressivity to represent uncertainty about one’s state or per-
ception. Partially Observable MDPs (POMDPs) (Kaelbling,
Littman, and Cassandra 1998) allow for state uncertainty
by reasoning about beliefs (probability distributions over
states), at their core and using the concept of observations
and observation functions.
The major obstacle to using POMDPs for practical
real-world planning domains has been that, because of the
expressivity of the model, finding an optimal solution can be
very computationally expensive. This is in large part due to
the "curse of dimensionality," as POMDP algorithms must
reason about the (possibly infinite) space of beliefs. Several
approaches have been proposed for making this problem
more tractable, such as point-based belief sampling (Pineau
et al. 2003) (Kurniawati, Hsu, and Lee 2008), Monte-Carlo
tree search (Silver and Veness 2010), and belief discretization
(Geffner and Bonet 1998) (Bonet and Geffner 2009).
We had been experimenting with Real-Time Dynamic
Programming (RTDP) algorithms for solving MDP problems
and believed there was great potential in applying a
similar technique to solving POMDPs. The RTDP-Bel
algorithm (Geffner and Bonet 1998) was proposed early on for
applying the RTDP technique to POMDPs, but it had not
received much attention from the community, despite a more
recent demonstration by the original authors (Bonet and
Geffner 2009) showing that it was competitive with the
state of the art at the time. We therefore believed there was
great potential for extending and improving the RTDP-Bel
algorithm with techniques similar to those that had been
used to improve the convergence time of the original RTDP
algorithm.
In this paper we present our extension to RTDP-Bel, which we call
Belief Branch and Bound RTDP (B3RTDP). This algorithm
employs a bounded value function representation and
emphasizes exploration towards areas of higher value
uncertainty to speed up convergence. We also introduce a novel
method of pruning action selection by calculating the
probability of action convergence and pruning when that probability
exceeds a threshold. Lastly, we experiment with a novel
concept of a Convergence Frontier (CF), which takes advantage
of action convergence near the root of the search tree and
effectively shortens search trials.
We demonstrate empirically on the known POMDP benchmark
problems Rocksample (Smith and Simmons 2004)
and Tag (Pineau et al. 2003) that B3RTDP can acquire higher
Average Discounted Reward (ADR) and converge faster than
the state-of-the-art SARSOP planner (Kurniawati, Hsu, and
Lee 2008), and furthermore that its parameters allow the user
flexibility in trading solution quality for convergence time.
Background
Since the proposed B3RTDP algorithm is based on the
RTDP algorithm, and incorporates ideas that were either
adapted from, or inspired by, advancements that were made
to the original RTDP algorithm for MDPs, we’ll start by in-
troducing some of those systems.
RTDP
Real-Time Dynamic Programming (RTDP) (Barto, Bradtke,
and Singh 1995) is an early DP algorithm for solving MDPs
that combines heuristic search with asynchronous value
function updates. This approach provided a significant benefit
over existing DP algorithms by performing value updates
only on the portion of the state space that is relevant to solving
the problem (assuming a good initial heuristic). RTDP is
an anytime algorithm that generally produces good policies
fast, but its convergence time can be slow. Several extensions
to the original RTDP algorithm have been proposed to
improve its convergence time, all of which attempt to better
focus the exploration and value updates towards the most
"fruitful" parts of the state space.
Labeled-RTDP (Bonet and Geffner 2003) (LRTDP) intro-
duced a method of marking states as solved once their values
and those of all states reachable from them had converged.
Solved states would subsequently be avoided in the RTDP’s
exploration of the state space.
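
To make the labeling idea concrete, here is a simplified Python sketch of our own (it is not the LRTDP CheckSolved procedure itself, which also performs value updates on non-converged states): a state is labeled solved only when its Bellman residual is small and everything reachable from it under the greedy policy is likewise converged. The callables greedy_action, transition, and residual are hypothetical placeholders.

    def label_if_solved(s, greedy_action, transition, residual, solved, epsilon):
        """Simplified labeling sketch: mark s and its greedy descendants solved
        when every visited state's Bellman residual is below epsilon.
        `transition(s, a)` is assumed to return a dict {s_next: probability}."""
        stack, visited, all_converged = [s], set(), True
        while stack:
            cur = stack.pop()
            if cur in visited or cur in solved:
                continue
            visited.add(cur)
            if residual(cur) > epsilon:
                all_converged = False
                continue  # do not expand past a non-converged state
            a = greedy_action(cur)
            stack.extend(transition(cur, a).keys())
        if all_converged:
            solved.update(visited)
        return all_converged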
Bounded-RTDP (BRTDP) (McMahan, Likhachev, and
Gordon 2005) introduced the use of a bounded value function,
with both an upper and a lower value attached to each state.
BRTDP takes advantage of this in two ways: for search
iteration termination and for search exploration guidance. The
BRTDP exploration strategy is motivated to explore areas
of large value gaps, or uncertainty in the value function, to
quickly "collapse" its boundaries onto the optimal value V*.
Several other algorithms have been proposed to improve
RTDP using a bounded value function, each providing
different search exploration heuristics and iteration
termination criteria (Sanner et al. 2009) (Smith and Simmons 2006).
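
To illustrate the gap-driven exploration that BRTDP popularized, the following is a small Python sketch of our own (not code from the cited papers): the successor state to explore is sampled in proportion to its transition probability times the remaining gap between its upper and lower value bounds. The names transition, V_hi, and V_lo are hypothetical placeholders.

    import random

    def sample_successor(s, a, transition, V_hi, V_lo):
        """Pick the next state to explore, weighted by transition probability
        times the remaining bound gap (V_hi - V_lo).
        `transition(s, a)` is assumed to return a dict {s_next: probability}."""
        succs = transition(s, a)
        weights = {sp: p * max(V_hi[sp] - V_lo[sp], 0.0) for sp, p in succs.items()}
        total = sum(weights.values())
        if total == 0.0:        # all successors have converged bounds
            return None         # caller can terminate this trial
        r = random.uniform(0.0, total)
        acc = 0.0
        for sp, w in weights.items():
            acc += w
            if r <= acc:
                return sp
        return sp               # numerical fallback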
POMDPs
The RTDP algorithm solves MDP problems, but MDPs are
limited to modeling uncertainty only in the transition function
and assume that the agent knows its state with certainty
at all times. This is clearly not the case for agents in the
real world, and we therefore turn to a more expressive model,
namely the POMDP.
A fully specified discrete POMDP is defined by: a finite
set of states S and a finite set of actions A, a transition function that
defines the transition probabilities T(s, a, s′), a set of observations
O, an observation function Ω(a, s′, o), a reward function
R(s, a), and a discount factor γ. The transition function
T encodes the dynamics of the environment. It specifies
how each action will (possibly stochastically) transition the
current state to the next one. As an example, for a wheeled
mobile robot this might model the uncertainty about how
its wheels might slip on different surfaces. The observation
function models the uncertainty in the agent's perception;
for the wheeled mobile robot this could be used to model the
uncertainty about its camera-based localization. The reward
function R can be used to encode the goal of the planning
task or, more generally, to specify states and/or actions that
are desirable.
In a partially observable domain, the agent cannot directly
observe its state and therefore needs to perform state estimation
simultaneously with planning. A planning agent represents
its uncertainty about its current state as a probability
distribution over possible states, which we will refer to as a belief
b, where b(s) := Pr(s | b). Given the transition and observation
functions, a Bayesian belief update can be calculated
where the agent determines what next belief b_a^o to hold if it
originally holds belief b, takes action a, and receives observation o.

    b_a^o(s') = \frac{\Omega(a, s', o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid b, a)}    (1)

    \Pr(o \mid b, a) = \sum_{s' \in S} \Omega(a, s', o) \sum_{s \in S} T(s, a, s')\, b(s)    (2)
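
As a concrete illustration of equations 1 and 2, here is a minimal Python sketch of the Bayesian belief update; the dictionary-based belief representation and the helper names (T, Omega, obs_prob, belief_update) are assumptions of our own rather than part of any cited algorithm.

    # A belief is a dict {state: probability}; T(s, a) returns {s_next: prob};
    # Omega(a, s_next) returns {observation: prob}.

    def obs_prob(b, a, o, T, Omega):
        """Pr(o | b, a) from equation 2."""
        total = 0.0
        for s, p_s in b.items():
            for s_next, p_t in T(s, a).items():
                total += Omega(a, s_next).get(o, 0.0) * p_t * p_s
        return total

    def belief_update(b, a, o, T, Omega):
        """Return the successor belief b_a^o from equation 1."""
        denom = obs_prob(b, a, o, T, Omega)
        if denom == 0.0:
            raise ValueError("observation o has zero probability under (b, a)")
        # Accumulate the predicted belief: sum_s T(s, a, s') b(s).
        predicted = {}
        for s, p_s in b.items():
            for s_next, p_t in T(s, a).items():
                predicted[s_next] = predicted.get(s_next, 0.0) + p_t * p_s
        # Weight by the observation likelihood and normalize.
        return {s_next: Omega(a, s_next).get(o, 0.0) * mass / denom
                for s_next, mass in predicted.items()}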
Using equations 1 and 2, we can restate the Bellman DP
value update equations for POMDPs as follows:

    Q(b, a) := \sum_{s \in S} b(s) R(s, a) + \gamma \sum_{o \in O} \Pr(o \mid b, a)\, V(b_a^o)    (3)

    V(b) := \max_{a \in A} Q(b, a)    (4)

These equations recursively define the value of a belief,
V(b), and the value of taking an action in a belief, Q(b, a).
The latter is defined as the expected reward of taking that
action plus the discounted expectation over successor belief
values, and the former as the maximum Q(b, a) over all
actions a. The optimal action policy can therefore be defined
as always choosing the action with the highest Q value, or
π(b) := argmax_{a ∈ A} Q(b, a).
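
Below is a minimal sketch of the backup in equations 3 and 4, reusing the hypothetical obs_prob and belief_update helpers from the previous sketch; V stands for any callable that returns a value estimate for a belief.

    # Sketch of the POMDP Bellman backup (equations 3 and 4).
    # Reuses obs_prob and belief_update from the previous sketch.

    def q_value(b, a, R, T, Omega, observations, V, gamma):
        """Q(b, a): expected immediate reward plus discounted successor values."""
        expected_reward = sum(p_s * R(s, a) for s, p_s in b.items())
        future = 0.0
        for o in observations:
            p_o = obs_prob(b, a, o, T, Omega)
            if p_o > 0.0:
                future += p_o * V(belief_update(b, a, o, T, Omega))
        return expected_reward + gamma * future

    def value(b, actions, R, T, Omega, observations, V, gamma):
        """V(b) = max over actions of Q(b, a)."""
        return max(q_value(b, a, R, T, Omega, observations, V, gamma)
                   for a in actions)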
RTDP-Bel
Geffner and Bonet presented an extension to RTDP
called RTDP-Bel, which is able to handle partially observable
domains (Geffner and Bonet 1998). The most significant
difference between RTDP and RTDP-Bel (in addition
to the fact that the former searches over state-action graphs
and the latter over belief-action-observation graphs) is the
way in which the value function is represented.
Value functions over beliefs are much more challenging
to implement than value functions over discrete states, since for any finite
set of states there is an infinite set of beliefs. One of the
most commonly used representations (attributed to Sondik
(Sondik 1971)) maintains a set of α-vectors, each of dimension
|S|, where V(b) = max_α α · b. RTDP-Bel instead uses
a function-approximation scheme which discretizes beliefs
and stores their values in a hashtable, using the
discretized belief as the key.
    \hat{b}(s) = \lceil D \cdot b(s) \rceil    (5)
The discretization parameter D therefore determines how
"close" two beliefs must be in their assignment of probabilities
to states for their values to be considered
the same. The value function for RTDP-Bel is therefore
defined as follows:
    \hat{V}(b) = \begin{cases} h(b) & \text{if } \hat{b} \notin \text{HASHTABLE} \\ \text{HASHTABLE}(\hat{b}) & \text{otherwise} \end{cases}    (6)

where h(b) is defined as Σ_{s∈b} b(s) h(s) and h is an admissible heuristic for this problem.
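
The following is a minimal Python sketch of this discretized-belief value table (equations 5 and 6); the class name, the tuple-based key, and the heuristic argument are illustrative choices of our own, and states are assumed to be hashable and mutually comparable.

    import math

    class DiscretizedValueFunction:
        """Hashtable value function keyed by discretized beliefs (eqs. 5 and 6)."""

        def __init__(self, D, heuristic):
            self.D = D             # discretization parameter
            self.h = heuristic     # admissible state heuristic h(s)
            self.table = {}        # discretized belief key -> value

        def key(self, b):
            """Equation 5: discretize each probability and use the result as a key."""
            return tuple(sorted((s, math.ceil(self.D * p))
                                for s, p in b.items() if p > 0.0))

        def value(self, b):
            """Equation 6: heuristic value if unseen, stored value otherwise."""
            k = self.key(b)
            if k not in self.table:
                return sum(p * self.h(s) for s, p in b.items())
            return self.table[k]

        def update(self, b, v):
            self.table[self.key(b)] = v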
Algorithm 1: The RTDP-Bel algorithm

    RTDP-BEL(b_I : Belief)
        while not TERMINATE do
            b := b_I;  s ∼ b
            while b ≠ GOAL do
                a := argmin_{a′ ∈ A(b)} Q̂(b, a′)
                V̂(b) := Q̂(b, a)
                s′ ∼ T(s, a, ·)
                o ∼ Ω(a, s′, ·)
                b := b_a^o;  s := s′
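
Below is a minimal Python sketch of one trial of Algorithm 1, phrased in cost-minimization terms (lower Q̂ is better, matching the argmin above) and reusing the hypothetical helpers from the earlier sketches (belief_update, q_value, DiscretizedValueFunction); sample_from and is_goal are placeholders of our own.

    import random

    def sample_from(dist):
        """Sample a key from a {key: probability} dict."""
        r, acc = random.random(), 0.0
        for k, p in dist.items():
            acc += p
            if r <= acc:
                return k
        return k

    def rtdp_bel_trial(b_I, actions, observations, R, T, Omega, vf, gamma, is_goal):
        """One RTDP-Bel trial: greedy (cost-minimizing) action choice, value
        update of the current belief, then simulation of a successor state,
        observation, and belief. R is interpreted here as a cost."""
        b = dict(b_I)
        s = sample_from(b)
        while not is_goal(b):
            a = min(actions, key=lambda a_: q_value(b, a_, R, T, Omega,
                                                    observations, vf.value, gamma))
            vf.update(b, q_value(b, a, R, T, Omega, observations, vf.value, gamma))
            s_next = sample_from(T(s, a))
            o = sample_from(Omega(a, s_next))
            b = belief_update(b, a, o, T, Omega)
            s = s_next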
One critical issue with RTDP-Bel is that it offers no reliable
termination criterion: one can terminate the algorithm at
any time and extract an action policy, but without any guarantee
that it has converged.
Transforming Discounted POMDPs to Goal POMDPs
One limitation of the RTDP algorithm is that it is only applicable
to so-called Stochastic Shortest Path MDP problems.
Similarly, RTDP-Bel is only applicable to Goal POMDPs,
which satisfy the following criteria: all action rewards are
strictly negative, and there exists a set of fully observable goal states
that are absorbing and terminal.
These constraints seem at first sight quite restrictive and
would threaten to limit RTDP-Bel to only
a small subset of all possible POMDP problems. But this
is not the case, as Bonet and Geffner (2009) demonstrate that
any general discounted POMDP can be
transformed into a goal POMDP.
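
As a rough sketch of the idea (our own paraphrase of the standard construction, not a restatement of the exact details in Bonet and Geffner 2009): the discount factor is absorbed into the dynamics by adding an absorbing, fully observable goal state g that is reached from every state with probability 1 − γ, while rewards are turned into strictly positive costs by shifting with a constant.

    T'(s, a, s') = \gamma\, T(s, a, s'), \qquad
    T'(s, a, g) = 1 - \gamma, \qquad
    c(s, a) = R_{\max} - R(s, a) + \epsilon

Minimizing expected total cost to the goal in the transformed problem then corresponds, up to a constant offset, to maximizing expected discounted reward in the original problem; the precise statement and proof are given in the cited paper.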
Other POMDP Planning Approaches
Algorithms exist to solve for the optimal value function of
POMDPs exactly (Sondik 1971), but this is rarely practical, as
the belief space is infinitely large and only a small subset
of it is relevant to the planning problem. A discovery by
Sondik about the value function's piece-wise linear and convex
properties led to a popular value function representation
which consists of maintaining a set of |S|-dimensional α-vectors,
each representing a hyperplane over the state space.
This representation is named after its author, Sondik, and
many algorithms, both exact and approximate, take advantage
of it.
The Heuristic Search Value Iteration (HSVI) algorithm
(Smith and Simmons 2004) extends the idea of employing
heuristic search from (Geffner and Bonet 1998) and combines
it with the Sondik value function representation, but
with upper and lower bound estimates. It employs an
information-seeking observation sampling technique akin to that of
BRTDP (introduced above), but aimed at minimizing excess
uncertainty. The Point-based Value Iteration (PBVI) algorithm
(Pineau et al. 2003) doesn't use a bounded value function,
but introduced the novel concept of maintaining a finite
set of relevant belief points and only performing value updates
for those sampled beliefs.
Lastly the SARSOP algorithm (Kurniawati, Hsu, and Lee
2008) combines the techniques of HSVI and PBVI, perform-
ing updates to its bounded Sondik-style value function over
a finite set of sampled belief points. Additionally, SARSOP
uses a novel observation sampling method which uses a sim-
ple learning technique to predict which beliefs should have
higher sampling probabilities to close the value gap on the
initial belief faster.
Belief Branch and Bound RTDP
In this section we present a novel planning algorithm for
POMDPs called Belief Branch and Bound Real-Time Dynamic
Programming (B3RTDP). The algorithm extends the
RTDP-Bel system with a bounded value function, Branch
and Bound style pruning of the search tree, and influences from
existing extensions to the original RTDP algorithm for MDPs.
Bounded Belief Value Function
As was previously mentioned, B3RTDP maintains a
bounded value function over beliefs. In the following
discussion, we will refer to these boundaries as separate value
functions V̂_L(b) and V̂_H(b), but the implementation actually
stores a two-dimensional vector of values for each discretized
belief in the hashtable (see equation 6) and so only
requires a single lookup operation to retrieve
both boundaries.
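
A minimal sketch of that storage layout, extending the hypothetical DiscretizedValueFunction from the earlier sketch so that each discretized belief maps to a [lower, upper] pair and both bounds come back from one lookup (the heuristic arguments are placeholders of our own).

    class BoundedValueFunction(DiscretizedValueFunction):
        """Stores [V_L, V_H] per discretized belief; one lookup returns both bounds."""

        def __init__(self, D, lower_heuristic, upper_heuristic):
            super().__init__(D, lower_heuristic)   # self.h underestimates cost-to-goal
            self.h_hi = upper_heuristic            # overestimates cost-to-goal

        def bounds(self, b):
            """Return (V_L(b), V_H(b)) with a single hashtable lookup."""
            k = self.key(b)
            if k not in self.table:
                lo = sum(p * self.h(s) for s, p in b.items())
                hi = sum(p * self.h_hi(s) for s, p in b.items())
                return lo, hi
            return tuple(self.table[k])

        def update(self, b, lo, hi):
            self.table[self.key(b)] = [lo, hi]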
It is desirable to initialize the lower bound of the value
function to an admissible heuristic for the planning problem;
this requirement must hold for an optimality guarantee.
It is easy to convince oneself of why: imagine that at
belief b1, all successor beliefs of the optimal action a1
have been improperly assigned inadmissible heuristic values
(that are too high). This results in an artificially high
Q(b1, a1) value, causing the search algorithm to choose
a different action a2 instead. If we assume that the successor beliefs of action
a2 were initialized to an admissible heuristic, then after some
number of iterations we can expect to have learned the true
value of Q(b1, a2); but if that value is still lower than the
(incorrect) Q(b1, a1), then we will never actually choose to
explore a1 and never learn that it is in fact the optimal action
to take.
It is equally desirable to initialize the upper boundary
V̂_H(b) to a value that overestimates the cost of getting to
the goal. This becomes evident when we discuss the
bounding nature of the B3RTDP algorithm, namely that
it prunes actions whose values are dominated with a certain
confidence threshold. For that calculation, we require
that the upper boundary be an admissible upper heuristic for
the problem.
In the previous section, we discussed a method for
transforming discounted POMDPs into goal POMDPs. This
transformation is particularly useful because, when working with
goal POMDPs, we get theoretical boundaries on the value