Opportunistic Qualitative Planning in Stochastic Systems with Incomplete
Preferences over Reachability Objectives
Abhishek N. Kulkarni and Jie Fu

A. Kulkarni (corresponding author) and J. Fu are with the Dept. of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611 USA. {a.kulkarni2,fujie}@ufl.edu. This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0085.
Abstract—Preferences play a key role in determining what goals/constraints to satisfy when not all constraints can be satisfied simultaneously. In this paper, we study how to synthesize preference-satisfying plans in stochastic systems, modeled as a Markov Decision Process (MDP), given a (possibly incomplete) combinative preference model over temporally extended goals. We start by introducing new semantics to interpret preferences over infinite plays of the stochastic system. Then, we introduce a new notion of improvement to enable comparison between two prefixes of an infinite play. Based on this, we define two solution concepts called Safe and Positive Improving (SPI) and Safe and Almost-Sure Improving (SASI) that enforce improvements with a positive probability and with probability one, respectively. We construct a model called an improvement MDP, in which the synthesis of SPI and SASI strategies that guarantee at least one improvement reduces to computing positive and almost-sure winning strategies in an MDP. We present an algorithm to synthesize the SPI and SASI strategies that induce multiple sequential improvements. We demonstrate the proposed approach using a robot motion planning problem.
I. INTRODUCTION
With the rise of artificial intelligence, robotics and autonomous systems are being designed to make complex decisions by reasoning about multiple goals at the same time. Preference-based planning (PBP) allows the systems to decide which goals to satisfy when not all of them can be achieved [1]. Even though PBP has been studied since the early 1950s, most works on preference-based temporal planning (c.f. [2]) fall into at least one of the following categories: (a) those which assume that all outcomes are pairwise comparable, that is, the preference relation is complete [3], [4]; (b) those which study exclusionary preferences, that is, the set of outcomes is mutually exclusive (see [5] and the references within); (c) those which are interpreted over finite traces [6]. In this work, we study the PBP problem for the class of systems in which the preference model is incomplete, combinative (as opposed to exclusionary), and is interpreted over infinite plays of the stochastic system.
The motivation to study incomplete, combinative preferences comes from two well-known facts: the assumption of completeness is strong and, in many cases, unrealistic [7], and combinative preferences are more expressive than exclusionary preferences [8]. In many control applications, preferences may need to admit incompleteness because of (a) Inescapability: an agent has to make decisions under time limits but with incomplete information about preferences because, for example, it lost communication with the server; and (b) Incommensurability: some situations, for instance, comparing the quality of an apple to that of a banana, are fundamentally incomparable since they lack a common basis for comparison. Similarly, common preferences in robotics, such as "visiting A is preferred to visiting B", are better interpreted under a combinative model because a path visiting A may pass through B, which means that the outcomes (sets of plays of the stochastic model satisfying a certain property) are not mutually exclusive.
Preference-based planning problems over temporal goals have been well studied for deterministic planning given both complete and incomplete preferences (see [2] for a survey). For preferences specified over temporal goals, several works [9], [10], [11] proposed minimum-violation planning methods that decide which low-priority constraints should be violated in a deterministic system. Mehdipour et al. [12] associate weights with Boolean and temporal operators in signal temporal logic to specify the importance of satisfying each sub-formula and the priority in the timing of satisfaction. This reduces the PBP problem to that of maximizing the weighted satisfaction in deterministic dynamical systems. However, the solutions to the PBP problem for deterministic systems cannot be applied to stochastic systems. This is because, in stochastic systems, even a deterministic strategy yields a distribution over outcomes. Hence, to determine a better strategy, we need to compare distributions, a task a deterministic planner cannot do.
Several works have studied the PBP problem for stochastic systems. Lahijanian and Kwiatkowska [13] considered the problem of revising a given specification to improve its probability of satisfaction. They formulated the problem as a multi-objective MDP problem that trades off minimizing the cost of revision and maximizing the probability of satisfying the revised formula. Li et al. [14] solve a preference-based probabilistic planning problem by reducing it to a multi-objective model checking problem. However, all these works assume the preference relation to be complete. To the best of our knowledge, [15] is the only work that studies the problem of probabilistic planning with incomplete preferences. The authors introduce the notion of the value of preference satisfaction for planning within a pre-defined finite time duration and develop a mixed-integer linear program to maximize the satisfaction value for a subset of preference relations.
The aforementioned works study preference-based quantitative planning. In comparison, this work focuses on qualitative planning in MDPs with preferences over a set of outcomes, representable by reachability objectives. We first introduce new semantics to interpret an incomplete, combinative preference model over infinite plays of a stochastic system. We observe that uncertainties in the planning environment combined with infinite plays might give rise to opportunities to improve the outcomes achieved by the agent. Thus, analogous to the idea of an improving flip [16], we define the notion of improvement that compares two prefixes of an infinite play to determine which one is more preferred, based on their different prospects regarding the set of possible, achievable objectives. Based on whether a strategy exists to enforce an improvement with a positive probability or with probability one, we introduce two solution concepts called safe and positively improving (SPI) and safe and almost-surely improving (SASI), which ensure that an improvement can be made with positive probability and with probability one, respectively. The synthesis of SPI and SASI strategies is carried out through a construction called the improvement MDP and a reduction to computing positive and almost-sure winning strategies for certain reachability objectives of the improvement MDP. In the case of almost-sure improvement, we also provide an algorithm to determine the maximum number of improvements achievable from any given state. The correctness of the proposed algorithms is demonstrated through a robot motion planning example.
II. PRELIMINARIES
Notation. Given a finite set $X$, the powerset of $X$ is denoted as $2^X$. The set of all finite (resp., infinite) ordered sequences of elements from $X$ is denoted by $X^*$ (resp., $X^\omega$). The set of all finite ordered sequences of length $>0$ is denoted by $X^+$. We write $\mathcal{D}(X)$ to denote the set of probability distributions over $X$. The support of a distribution $D \in \mathcal{D}(X)$ is denoted by $\mathrm{Supp}(D) = \{x \in X \mid D(x) > 0\}$.
In this paper, we consider a class of decision-making problems in stochastic systems modeled as an MDP without a reward function. We then introduce a preference model over the set of infinite plays in the MDP.
Definition 1 (MDP). An MDP is a tuple $M = \langle S, A, T, \iota \rangle$, where $S$ and $A$ are finite state and action sets, $\iota \in S$ is an initial state, and $T : S \times A \to \mathcal{D}(S)$ is the transition probability function such that $T(s, a, s')$ is the probability of reaching the state $s' \in S$ when action $a \in A$ is chosen at the state $s \in S$.
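For concreteness, such an MDP can be stored as a small data structure. The following Python sketch is our own illustration (the class name and fields are not from the paper); it keeps the transition function as a nested dictionary so that $T(s, a, s')$ and $\mathrm{Supp}(T(s, a))$ can be queried directly.

from dataclasses import dataclass, field

@dataclass
class MDP:
    states: set          # S
    actions: set         # A
    init: object         # iota, the initial state
    trans: dict = field(default_factory=dict)  # trans[s][a] = {s': T(s, a, s')}

    def prob(self, s, a, s_next):
        # T(s, a, s')
        return self.trans.get(s, {}).get(a, {}).get(s_next, 0.0)

    def successors(self, s, a):
        # Supp(T(s, a, .)): states reachable from s under a with positive probability
        return {t for t, p in self.trans.get(s, {}).get(a, {}).items() if p > 0}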
A play in an MDP $M$ is an infinite sequence of states $\rho = s_0 s_1 \ldots \in S^\omega$ such that, for every integer $i \geq 0$, there exists an action $a \in A$ such that $T(s_i, a, s_{i+1}) > 0$. We denote the set of all plays starting from $s \in S$ in the MDP by $\mathsf{Plays}(M, s)$, and the set of all plays in $M$ is denoted by $\mathsf{Plays}(M) = \bigcup_{s \in S} \mathsf{Plays}(M, s)$. The set of states occurring in a play is given by $\mathsf{Occ}(\rho) = \{s \in S \mid \exists i \geq 0, s_i = s\}$. A prefix of a play $\rho$ is a finite sub-sequence of states $\nu = s_0 s_1 \ldots s_k$, $k \geq 0$, whose length is $|\nu| = k + 1$. The set of all prefixes of a play $\rho$ is denoted by $\mathsf{Pref}(\rho)$. The set of all prefixes in $M$ is denoted by $\mathsf{PrefPlays}(M) = \bigcup_{\rho \in \mathsf{Plays}(M)} \mathsf{Pref}(\rho)$. Given a prefix $\nu = s_0 s_1 \ldots s_k \in \mathsf{PrefPlays}(M)$, the sequence of states $s_{k+1} s_{k+2} \ldots \in S^\omega$ is called a suffix of $\nu$ if the play $\nu\rho' = s_0 s_1 \ldots s_k s_{k+1} s_{k+2} \ldots$ is an element of $\mathsf{Plays}(M)$.
In this MDP, we consider reachability objectives for the agent. Given a set $F \subseteq S$, a reachability objective is characterized by the set $\mathsf{Reach}(F) = \{\rho \in \mathsf{Plays}(M) \mid \mathsf{Occ}(\rho) \cap F \neq \emptyset\}$, which contains all the plays in $M$ that visit $F$. Any play $\rho \in \mathsf{Plays}(M)$ that satisfies a reachability objective $\mathsf{Reach}(F)$ has a good prefix $\nu \in \mathsf{Pref}(\rho)$ such that the last state of $\nu$ is in $F$ [17].
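As an illustration of these definitions, the following helper functions (our own, building on the MDP sketch above; not part of the paper) check whether a finite state sequence is a valid prefix of some play of $M$, compute its occurrence set, and test whether it is a good prefix for $\mathsf{Reach}(F)$.

def is_prefix(mdp, nu):
    # s0...sk is a prefix of a play if every consecutive pair of states is
    # connected by some action with positive transition probability
    return all(
        any(t in mdp.successors(s, a) for a in mdp.actions)
        for s, t in zip(nu, nu[1:])
    )

def occ(nu):
    # Occ of a finite sequence: the set of states occurring in it
    return set(nu)

def is_good_prefix(nu, F):
    # a good prefix for Reach(F) ends in F; every play extending it satisfies Reach(F)
    return len(nu) > 0 and nu[-1] in F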
A finite-memory (resp., memoryless), non-deterministic strategy in the MDP is a mapping $\pi : S^+ \to 2^A$ (resp., $\pi : S \to 2^A$) from a prefix to a subset of actions that can be taken from that prefix. The set of all finite-memory, non-deterministic strategies is denoted $\Pi$. Given a prefix $\nu = s_0 \ldots s_k \in \mathsf{PrefPlays}(M)$, a suffix $\rho = s_{k+1} s_{k+2} \ldots \in S^\omega$ is consistent with $\pi$ if, for all $i \geq 0$, there exists an action $a \in \pi(s_0 \ldots s_k \ldots s_{k+i})$ such that $T(s_{k+i}, a, s_{k+i+1}) > 0$. Given an MDP $M$, a prefix $\nu \in \mathsf{PrefPlays}(M)$, and a strategy $\pi$, the cone is defined as the set of consistent suffixes of $\nu$ w.r.t. $\pi$, that is,
$$\mathsf{Cone}(M, \nu, \pi) = \{\rho \in S^\omega \mid \nu\rho \text{ is consistent with } \pi\}.$$
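A non-deterministic strategy can be represented simply as a function from prefixes to sets of actions. The sketch below (our own formulation, reusing the MDP class above) gives a permissive memoryless strategy and a check that a single next state is consistent with a given strategy, in the sense used in the consistency definition above.

def permissive_strategy(mdp):
    # memoryless strategy allowing every action that is enabled at the current state
    def pi(history):
        s = history[-1]
        return {a for a in mdp.actions if mdp.successors(s, a)}
    return pi

def consistent_step(mdp, pi, history, s_next):
    # the next state is consistent with pi if some action allowed at the current
    # prefix reaches it with positive probability
    s = history[-1]
    return any(s_next in mdp.successors(s, a) for a in pi(history))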
Given a prefix $\nu \in \mathsf{PrefPlays}(M)$ and a reachability objective $\mathsf{Reach}(F)$, a (finite-memory/memoryless) strategy $\pi$ is said to be positive winning if $\mathsf{Cone}(M, \nu, \pi) \cap \mathsf{Reach}(F) \neq \emptyset$. Similarly, a (finite-memory/memoryless) strategy $\pi$ is said to be almost-sure winning if $\mathsf{Cone}(M, \nu, \pi) \subseteq \mathsf{Reach}(F)$.
The set of states in the MDP $M$, starting from which the agent has an almost-sure (resp., positive) winning strategy to satisfy a reachability objective $\mathsf{Reach}(F)$, is called the almost-sure (resp., positive) winning region and is denoted by $\mathsf{ASWin}(F)$ (resp., $\mathsf{PWin}(F)$). The almost-sure and positive winning strategies for reachability objectives in an MDP are known to be memoryless. The almost-sure winning region and strategies can be synthesized in polynomial time and linear time, respectively [18].
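For intuition, the positive winning region for $\mathsf{Reach}(F)$ can be computed by backward graph reachability, and the almost-sure winning region by a standard greatest-fixpoint iteration that restricts attention to actions whose support stays inside the current candidate set. The following Python sketch is our own illustrative implementation of these textbook algorithms (it reuses the MDP class sketched after Def. 1 and is not taken from the paper or from [18]).

def pwin(mdp, F, within=None, safe_actions=False):
    # states from which F is reachable with positive probability; if `within`
    # is given, only those states are considered, and if `safe_actions` is True,
    # an action may be used only if its entire support stays inside `within`
    region = set(mdp.states) if within is None else set(within)
    win = set(F) & region
    changed = True
    while changed:
        changed = False
        for s in region - win:
            for a in mdp.actions:
                succ = mdp.successors(s, a)
                if not (succ & win):
                    continue
                if safe_actions and not succ <= region:
                    continue
                win.add(s)
                changed = True
                break
    return win

def aswin(mdp, F):
    # greatest fixpoint: repeatedly shrink the candidate set to the states that
    # can reach F with positive probability without ever risking leaving the set
    X = set(mdp.states)
    while True:
        Y = pwin(mdp, F, within=X, safe_actions=True)
        if Y == X:
            return X
        X = Y

Here pwin(mdp, F) returns $\mathsf{PWin}(F)$ and aswin(mdp, F) returns $\mathsf{ASWin}(F)$; both run in time polynomial in the size of the MDP.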
III. PREFERENCE MODEL
Definition 2. A preference model is a tuple $\langle U, \succeq \rangle$, where $U$ is a countable set of outcomes and $\succeq$ is a reflexive and transitive binary relation on $U$.
Given $u_1, u_2 \in U$, we write $u_1 \succeq u_2$ if $u_1$ is weakly preferred to (i.e., is at least as good as) $u_2$; and $u_1 \sim u_2$ if $u_1 \succeq u_2$ and $u_2 \succeq u_1$, that is, $u_1$ and $u_2$ are indifferent. We write $u_1 \succ u_2$ to mean that $u_1$ is strictly preferred to $u_2$, i.e., $u_1 \succeq u_2$ and $u_2 \not\succeq u_1$. We write $u_1 \parallel u_2$ if $u_1$ and $u_2$ are incomparable. Since the preference model allows outcomes to be incomparable, it models incomplete preferences [19].
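As a small illustration (our own encoding, not from the paper), such a preference model can be stored as the set of weak-preference pairs, from which the indifference, strict-preference, and incomparability relations above are derived.

class PreferenceModel:
    # <U, weak preference>: `weak` holds the pairs (u1, u2) with u1 weakly preferred to u2;
    # the relation is assumed reflexive and transitive, but need not be complete
    def __init__(self, outcomes, weak_pairs):
        self.U = set(outcomes)
        self.weak = set(weak_pairs) | {(u, u) for u in self.U}  # reflexive closure

    def weakly_preferred(self, u1, u2):
        return (u1, u2) in self.weak

    def indifferent(self, u1, u2):
        return self.weakly_preferred(u1, u2) and self.weakly_preferred(u2, u1)

    def strictly_preferred(self, u1, u2):
        return self.weakly_preferred(u1, u2) and not self.weakly_preferred(u2, u1)

    def incomparable(self, u1, u2):
        return not self.weakly_preferred(u1, u2) and not self.weakly_preferred(u2, u1)

For example, with outcomes {"A", "B", "C"} and weak_pairs = {("A", "B")}, A is strictly preferred to B, while C is incomparable to both, reflecting an incomplete preference model.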
We consider planning objectives specified as preferences
over reachability objectives.
Definition 3. A preference model over reachability objectives in an MDP $M$ is a tuple $\langle \mathcal{F}, \trianglerighteq \rangle$, where $\mathcal{F} =$