
to the original RTDP algorithm for MDPs, we'll start by introducing some of those systems.
RTDP
Real-Time Dynamic Programming (RTDP) (Barto, Bradtke, and Singh 1995) is an early DP algorithm for solving MDPs that combines heuristic search with asynchronous value function updates. This approach provided a significant benefit over existing DP algorithms by performing value updates only on the portion of the state space that is relevant to solving the problem (assuming a good initial heuristic). RTDP is an anytime algorithm that generally produces good policies quickly, but its convergence time can be slow. Several extensions to the original RTDP algorithm have been proposed to improve its convergence time, all of which attempt to better focus the exploration and value updates on the most "fruitful" parts of the state space.
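To make this concrete, the following is a minimal sketch (not taken from the original paper) of a single RTDP trial on a discounted reward MDP, assuming hypothetical containers T, R, and a value table V pre-initialized from a heuristic. The key point is that Bellman backups are applied only to the states visited along simulated trajectories.

```python
import random

def rtdp_trial(s0, A, T, R, V, gamma=0.95, horizon=100):
    """One RTDP trial: simulate greedily from s0, backing up visited states.

    T[s][a] is assumed to be a dict {s_next: prob}, R[s][a] a scalar reward,
    and V a dict pre-initialized with an admissible heuristic (all hypothetical).
    """
    s = s0
    for _ in range(horizon):
        # Greedy Bellman backup at the current state (asynchronous update).
        q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
             for a in A}
        best_a = max(q, key=q.get)
        V[s] = q[best_a]  # update only the visited state
        # Sample the next state from the greedy action's outcome distribution.
        succs, probs = zip(*T[s][best_a].items())
        s = random.choices(succs, weights=probs)[0]
    return V
```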
Labeled RTDP (LRTDP) (Bonet and Geffner 2003) introduced a method of marking states as solved once their values, and those of all states reachable from them, had converged. Solved states are subsequently avoided in RTDP's exploration of the state space.
Bounded RTDP (BRTDP) (McMahan, Likhachev, and Gordon 2005) introduced the use of a bounded value function, with both an upper and a lower value attached to each state. BRTDP takes advantage of these bounds in two ways: to terminate search iterations and to guide search exploration. The BRTDP exploration strategy is designed to explore areas with large gaps, or uncertainty, in the value function so as to quickly "collapse" its bounds onto the optimal value V*. Several other algorithms have been proposed to improve RTDP using a bounded value function, each providing different search exploration heuristics and iteration termination criteria (Sanner et al. 2009; Smith and Simmons 2006).
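As a rough illustration only (details differ from the published algorithms), the outcome-sampling rule at the heart of this style of exploration can be sketched as follows: successor states are sampled in proportion to their transition probability weighted by the gap between their upper and lower bounds, so that trials are drawn toward the parts of the state space where the bounds have converged the least.

```python
import random

def sample_successor(s, a, T, V_upper, V_lower):
    """Sample a next state biased toward large bound gaps (BRTDP-style sketch).

    T[s][a] is a hypothetical dict {s_next: prob}; V_upper / V_lower hold the
    bounded value function. Weighting by prob * gap steers exploration toward
    states whose values are most uncertain.
    """
    succs = list(T[s][a].keys())
    weights = [T[s][a][s2] * (V_upper[s2] - V_lower[s2]) for s2 in succs]
    if sum(weights) == 0:  # all successors converged; fall back to T alone
        weights = [T[s][a][s2] for s2 in succs]
    return random.choices(succs, weights=weights)[0]
```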
POMDPs
The RTDP algorithm solves MDP problems, but MDPs are limited to modeling uncertainty only in the transition function and assume that the agent knows its state with certainty at all times. This is clearly not the case for agents in the real world, and we therefore turn to a more expressive model, namely the POMDP.
A fully specified discrete POMDP is defined by: a finite set of states and actions S and A, a transition function that defines the transition probabilities T(s, a, s′), a set of observations O, an observation function Ω(a, s′, o), a reward function R(s, a), and a discount factor γ. The transition function T encodes the dynamics of the environment. It specifies how each action will (possibly stochastically) transition the current state to the next one. As an example, for a wheeled mobile robot this might model the uncertainty about how its wheels slip on different surfaces. The observation function models the uncertainty in the agent's perception; for the wheeled mobile robot this could be used to model the uncertainty in its camera-based localization. The reward function R can be used to encode the goal of the planning task or, more generally, to specify states and/or actions that are desirable.
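For concreteness, a discrete POMDP of this form might be captured in code roughly as follows (a minimal sketch; the container and all field names are our own, not part of any standard library):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DiscretePOMDP:
    """Container for the tuple (S, A, T, O, Omega, R, gamma) described above."""
    states: Sequence[str]                    # S
    actions: Sequence[str]                   # A
    T: Callable[[str, str, str], float]      # T(s, a, s') -> probability
    observations: Sequence[str]              # O
    Omega: Callable[[str, str, str], float]  # Omega(a, s', o) -> probability
    R: Callable[[str, str], float]           # R(s, a) -> reward
    gamma: float                             # discount factor
```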
In a partially observable domain, the agent cannot directly observe its state and therefore needs to perform state estimation simultaneously with planning. A planning agent represents its uncertainty about its current state as a probability distribution over possible states, which we will refer to as a belief b, where b(s) := Pr(s | b). Given the transition and observation functions, a Bayesian belief update can be calculated in which the agent determines which next belief $b^o_a$ to hold if it originally holds belief b, takes action a, and receives observation o.
\[
b^o_a(s') = \frac{\Omega(a, s', o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid b, a)} \tag{1}
\]
\[
\Pr(o \mid b, a) = \sum_{s' \in S} \Omega(a, s', o) \sum_{s \in S} T(s, a, s')\, b(s) \tag{2}
\]
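A direct transcription of equations 1 and 2 might look like the sketch below (for illustration only; the `DiscretePOMDP` container above is our own construction):

```python
def belief_update(pomdp, b, a, o):
    """Bayesian belief update b -> b_a^o per equations 1 and 2.

    b is a dict mapping each state to its probability under the current belief.
    Returns the updated belief and Pr(o | b, a).
    """
    # Unnormalized update: Omega(a, s', o) * sum_s T(s, a, s') b(s)
    unnorm = {
        s2: pomdp.Omega(a, s2, o) * sum(pomdp.T(s, a, s2) * p for s, p in b.items())
        for s2 in pomdp.states
    }
    pr_o = sum(unnorm.values())  # Pr(o | b, a), equation 2
    if pr_o == 0:
        raise ValueError("observation o has zero probability under (b, a)")
    return {s2: v / pr_o for s2, v in unnorm.items()}, pr_o
```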
Using equations 1 and 2, we can restate the Bellman DP value update equations for POMDPs as follows:
\[
Q(b, a) := \sum_{s \in S} b(s) R(s, a) + \gamma \sum_{o \in O} \Pr(o \mid b, a)\, V(b^o_a) \tag{3}
\]
\[
V(b) := \max_{a \in A} Q(b, a) \tag{4}
\]
These equations recursively define the value of a belief, V(b), and the value of taking an action in a belief, Q(b, a). The latter is defined as the expected reward of taking that action plus the discounted expectation over successor belief values, and the former as the maximum Q(b, a) over all actions a. The optimal action policy can therefore be defined as always choosing the action with the highest Q value, or π*(b) := argmax_{a∈A} Q(b, a).
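Equations 3 and 4 translate almost line for line into code; the sketch below reuses the hypothetical `belief_update` helper above and treats V as any callable approximation of the belief value function:

```python
def q_value(pomdp, V, b, a):
    """Q(b, a) from equation 3, where V maps beliefs to values."""
    expected_reward = sum(p * pomdp.R(s, a) for s, p in b.items())
    future = 0.0
    for o in pomdp.observations:
        try:
            b_next, pr_o = belief_update(pomdp, b, a, o)
        except ValueError:  # Pr(o | b, a) = 0 contributes nothing
            continue
        future += pr_o * V(b_next)
    return expected_reward + pomdp.gamma * future

def value(pomdp, V, b):
    """V(b) from equation 4: the best Q value over all actions."""
    return max(q_value(pomdp, V, b, a) for a in pomdp.actions)
```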
RTDP-Bel
Geffner and Bonet presented an extension of RTDP called RTDP-Bel, which is able to handle partially observable domains (Geffner and Bonet 1998). The most significant difference between RTDP and RTDP-Bel (in addition to the fact that the former searches over state-action graphs and the latter over belief-action-observation graphs) is the way in which the value function is represented.
Value functions over beliefs are much more challenging to implement than those over discrete states, since for any finite set of states there is an infinite set of beliefs. One of the most commonly used representations (attributed to Sondik (Sondik 1971)) maintains a set of α-vectors, each of dimension |S|, where V(b) = max_α α · b. RTDP-Bel instead uses a function-approximation scheme that discretizes the beliefs and stores their values in a hash table that uses the discretized belief as its key.
\[
\hat{b}(s) = \lceil D \cdot b(s) \rceil \tag{5}
\]
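A minimal sketch of this scheme (our own illustration, not the authors' code): discretize the belief per equation 5, use the result as a hash key, and fall back to a heuristic value when a belief has not been visited before.

```python
import math

def discretize(b, D):
    """Map a belief to a hashable key per equation 5: b_hat(s) = ceil(D * b(s))."""
    return tuple(sorted((s, math.ceil(D * p)) for s, p in b.items() if p > 0))

class BeliefValueTable:
    """Hash-table value function keyed by discretized beliefs (RTDP-Bel style)."""
    def __init__(self, D, heuristic):
        self.D = D
        self.table = {}
        self.heuristic = heuristic  # hypothetical fallback for unseen beliefs

    def __call__(self, b):
        return self.table.get(discretize(b, self.D), self.heuristic(b))

    def update(self, b, value):
        self.table[discretize(b, self.D)] = value
```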
The discretization parameter D thus determines how "close" two beliefs must be in their assignment of probabilities to states for their values to be considered the same. The value function for RTDP-Bel is therefore defined as follows: