itative planning in MDPs with preferences over a set of
outcomes, representable by reachability objectives. We first introduce new semantics to interpret an incomplete, combinative preference model over infinite plays of a stochastic system. We observe that uncertainties in the planning environment, combined with infinite plays, might give rise to opportunities to improve the outcomes achieved by the agent. Thus, analogous to the idea of an improving flip [16], we define a notion of improvement that compares two prefixes of an infinite play to determine which one is more preferred, based on their different prospects regarding the set of possible, achievable objectives. Based on whether a strategy exists to enforce an improvement with positive probability or with probability one, we introduce two solution concepts, called safe and positively improving (SPI) and safe and almost-surely improving (SASI) strategies, which ensure that an improvement can be made with positive probability and with probability one, respectively. We synthesize SPI and SASI strategies through a construction called the improvement MDP, which reduces the synthesis problem to computing positive and almost-sure winning strategies for reachability objectives in the improvement MDP. In the case of almost-sure improvement, we also provide an algorithm to determine the maximum number of improvements achievable from any given state.
The correctness of the proposed algorithms is demonstrated
through a robot motion planning example.
II. PRELIMINARIES
Notation. Given a finite set X, the powerset of X is denoted as ℘(X). The set of all finite (resp., infinite) ordered sequences of elements from X is denoted by X∗ (resp., Xω). The set of all finite ordered sequences of length > 0 is denoted by X+. We write D(X) to denote the set of probability distributions over X. The support of a distribution D ∈ D(X) is denoted by Supp(D) = {x ∈ X | D(x) > 0}.
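As a small illustration of this notation, the following sketch assumes a dictionary-based encoding of a distribution D ∈ D(X) (mapping each x to D(x)); the function name is illustrative.

```python
def support(dist):
    """Supp(D) = {x in X | D(x) > 0}, for a distribution given as a dict x -> D(x)."""
    return {x for x, p in dist.items() if p > 0}

assert support({"x1": 0.7, "x2": 0.3, "x3": 0.0}) == {"x1", "x2"}
```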
In this paper, we consider a class of decision-making problems in stochastic systems modeled as an MDP without a reward function. We then introduce a preference model over the set of infinite plays in the MDP.
Definition 1 (MDP). An MDP is a tuple M = ⟨S, A, T, ι⟩, where S and A are finite state and action sets, ι ∈ S is an initial state, and T : S × A → D(S) is the transition probability function such that T(s, a, s′) is the probability of reaching the state s′ ∈ S when action a ∈ A is chosen at the state s ∈ S.
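For concreteness, the following is a minimal sketch of Definition 1, assuming a dictionary-based encoding in which trans[s][a] maps each successor s2 to T(s, a, s2); the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: set    # S, a finite state set
    actions: set   # A, a finite action set
    trans: dict    # T: trans[s][a] is a dict mapping each successor s2 to T(s, a, s2)
    init: object   # iota, the initial state

    def prob(self, s, a, s2):
        """T(s, a, s2): probability of reaching s2 when action a is chosen at s."""
        return self.trans.get(s, {}).get(a, {}).get(s2, 0.0)

# A toy two-state instance (purely illustrative).
toy = MDP(
    states={"s0", "s1"},
    actions={"a"},
    trans={"s0": {"a": {"s0": 0.5, "s1": 0.5}}, "s1": {"a": {"s1": 1.0}}},
    init="s0",
)
assert toy.prob("s0", "a", "s1") == 0.5
```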
A play in an MDP M is an infinite sequence of states ρ = s₀s₁… ∈ Sω such that, for every integer i ≥ 0, there exists an action a ∈ A such that T(sᵢ, a, sᵢ₊₁) > 0. We denote the set of all plays starting from s ∈ S in the MDP by Plays(M, s), and the set of all plays in M is denoted by Plays(M) = ⋃_{s∈S} Plays(M, s). The set of states occurring in a play is given by Occ(ρ) = {s ∈ S | ∃i ≥ 0, sᵢ = s}. A prefix of a play ρ is a finite sub-sequence of states ν = s₀s₁…sₖ, k ≥ 0, whose length is |ν| = k + 1. The set of all prefixes of a play ρ is denoted by Pref(ρ). The set of all prefixes in M is denoted by PrefPlays(M) = ⋃_{ρ∈Plays(M)} Pref(ρ). Given a prefix ν = s₀s₁…sₖ ∈ PrefPlays(M), the sequence of states ρ′ = sₖ₊₁sₖ₊₂… ∈ Sω is called a suffix of ν if the play νρ′ = s₀s₁…sₖsₖ₊₁sₖ₊₂… is an element of Plays(M).
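As an illustration, the sketch below (reusing the hypothetical MDP class above) samples a finite prefix of a play by repeatedly drawing successors from the transition distributions; the uniform action choice is an arbitrary assumption made only to generate some play.

```python
import random

def sample_prefix(mdp, length):
    """Sample a prefix s0 s1 ... of a play of the given length, starting at iota."""
    prefix = [mdp.init]
    while len(prefix) < length:
        s = prefix[-1]
        a = random.choice(sorted(mdp.trans[s]))            # pick any enabled action
        succs, probs = zip(*mdp.trans[s][a].items())
        prefix.append(random.choices(succs, weights=probs)[0])  # draw s' ~ T(s, a)
    return prefix

# All prefixes of the sampled fragment, i.e., Pref restricted to what was sampled.
nu = sample_prefix(toy, 5)
prefixes = [nu[: i + 1] for i in range(len(nu))]
```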
In this MDP, we consider reachability objectives for the agent. Given a set F ⊆ S, a reachability objective is characterized by the set Reach(F) = {ρ ∈ Plays(M) | Occ(ρ) ∩ F ≠ ∅}, which contains all plays in M that visit F. Any play ρ ∈ Plays(M) that satisfies a reachability objective Reach(F) has a good prefix ν ∈ Pref(ρ) such that the last state of ν is in F [17].
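A good prefix can be detected on finite data: a prefix witnesses Reach(F) as soon as some visited state lies in F. A minimal sketch (function name assumed for illustration):

```python
def is_good_prefix(prefix, F):
    """True iff Occ(prefix) intersects F, i.e., the prefix already witnesses Reach(F)."""
    return any(s in F for s in prefix)
```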
A finite-memory (resp., memoryless), non-deterministic strategy in the MDP is a mapping π : S+ → ℘(A) (resp., π : S → ℘(A)) from a prefix to a subset of actions that can be taken from that prefix. The set of all finite-memory, nondeterministic strategies is denoted Π. Given a prefix ν = s₀…sₖ ∈ PrefPlays(M), a suffix ρ = sₖ₊₁sₖ₊₂… ∈ Sω is consistent with π if, for all i ≥ 0, there exists an action a ∈ π(s₀…sₖ…sₖ₊ᵢ) such that T(sₖ₊ᵢ, a, sₖ₊ᵢ₊₁) > 0. Given an MDP M, a prefix ν ∈ PrefPlays(M), and a strategy π, the cone is defined as the set of consistent suffixes of ν w.r.t. π, that is,

Cone(M, ν, π) = {ρ ∈ Sω | νρ is consistent with π}.
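To illustrate consistency, the sketch below specializes to a memoryless nondeterministic strategy (an assumed encoding as a dict from states to sets of allowed actions) and checks whether each step of a finite state sequence can be produced by some allowed action with positive probability; a finite-memory strategy would instead take the whole prefix as its argument.

```python
def consistent(mdp, strategy, states):
    """Step-wise consistency of a finite state sequence with a memoryless
    nondeterministic strategy given as a dict: state -> set of actions."""
    for s, s_next in zip(states, states[1:]):
        if not any(mdp.prob(s, a, s_next) > 0 for a in strategy.get(s, set())):
            return False
    return True

# Example with the toy MDP: the strategy that always allows action "a".
always_a = {"s0": {"a"}, "s1": {"a"}}
assert consistent(toy, always_a, ["s0", "s0", "s1", "s1"])
```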
Given a prefix ν ∈ PrefPlays(M) and a reachability objective Reach(F), a (finite-memory/memoryless) strategy π_PWin(F) is said to be positive winning if Cone(M, ν, π) ∩ Reach(F) ≠ ∅. Similarly, a (finite-memory/memoryless) strategy π_ASWin(F) is said to be almost-sure winning if Cone(M, ν, π) ⊆ Reach(F).
The set of states in the MDP M from which the agent has an almost-sure (resp., positive) winning strategy to satisfy a reachability objective F ∈ ℱ is called the almost-sure (resp., positive) winning region and is denoted by ASWin(F) (resp., PWin(F)). The almost-sure and positive winning strategies in the product MDP are known to be memoryless. The almost-sure winning region and strategies can be synthesized in polynomial time and linear time, respectively [18].
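For reachability objectives, the positive winning region admits a simple backward fixed-point computation over the underlying graph of the MDP: a state is positively winning if some enabled action reaches an already-winning state with positive probability. The sketch below follows that standard idea as an illustration under the assumed encoding (it is not the algorithm of [18], and the almost-sure region, which needs the usual iterative pruning, is omitted).

```python
def positive_winning_region(mdp, F):
    """PWin(F): states from which the agent can reach F with positive probability."""
    win = set(F)
    changed = True
    while changed:
        changed = False
        for s in mdp.states - win:
            # s joins PWin(F) if some action has a successor already in the region.
            if any(any(s2 in win for s2 in dist) for dist in mdp.trans.get(s, {}).values()):
                win.add(s)
                changed = True
    return win

assert positive_winning_region(toy, {"s1"}) == {"s0", "s1"}
```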
III. PREFERENCE MODEL
Definition 2. A preference model is a tuple ⟨U, ⪰⟩, where U is a countable set of outcomes and ⪰ is a reflexive and transitive binary relation on U.

Given u₁, u₂ ∈ U, we write u₁ ⪰ u₂ if u₁ is weakly preferred to (i.e., is at least as good as) u₂; and u₁ ∼ u₂ if u₁ ⪰ u₂ and u₂ ⪰ u₁, that is, u₁ and u₂ are indifferent. We write u₁ ≻ u₂ to mean that u₁ is strictly preferred to u₂, i.e., u₁ ⪰ u₂ and u₂ ⋡ u₁. We write u₁ ∦ u₂ if u₁ and u₂ are incomparable, i.e., neither u₁ ⪰ u₂ nor u₂ ⪰ u₁. Since Def. 2 allows outcomes to be incomparable, it models incomplete preferences [19].
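As a small illustration of the derived relations, the sketch below takes the weak preference relation as a set of ordered pairs (assumed to be reflexive and transitive) and classifies a pair of outcomes as strictly preferred, indifferent, or incomparable; the encoding and names are assumptions for illustration.

```python
def compare(weak, u1, u2):
    """Classify (u1, u2) under a weak preference relation given as a set of pairs."""
    ge12, ge21 = (u1, u2) in weak, (u2, u1) in weak
    if ge12 and ge21:
        return "indifferent"            # u1 ~ u2
    if ge12:
        return "u1 strictly preferred"  # u1 > u2
    if ge21:
        return "u2 strictly preferred"  # u2 > u1
    return "incomparable"               # u1 || u2

# Example: a weakly prefers b, c is unrelated to a and b.
weak = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "b")}
assert compare(weak, "a", "b") == "u1 strictly preferred"
assert compare(weak, "a", "c") == "incomparable"
```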
We consider planning objectives specified as preferences
over reachability objectives.
Definition 3. A preference model over reachability objectives in an MDP M is a tuple ⟨ℱ, ⊵⟩, where ℱ =