itative planning in MDPs with preferences over a set of
outcomes, representable by reachability objectives. We first introduce new semantics to interpret an incomplete, combinative preference model over infinite plays of a stochastic system. We observe that uncertainties in the planning environment, combined with infinite plays, might give rise to opportunities to improve the outcomes achieved by the agent. Thus, analogous to the idea of an improving flip [16], we define a notion of improvement that compares two prefixes of an infinite play to determine which one is more preferred, based on their different prospects regarding the set of possible, achievable objectives. Based on whether a strategy exists to enforce an improvement with positive probability or with probability one, we introduce two solution concepts, called safe and positively improving (SPI) and safe and almost-surely improving (SASI) strategies, which ensure that an improvement can be made with positive probability and with probability one, respectively. We synthesize SPI and SASI strategies through a construction called the improvement MDP, which reduces the synthesis problem to computing positive and almost-sure winning strategies for reachability objectives in the improvement MDP. In the case of almost-sure improvement, we also provide an algorithm to determine the maximum number of improvements achievable from any given state.
The correctness of the proposed algorithms is demonstrated
through a robot motion planning example.
II. PRELIMINARIES
Notation. Given a finite set X, the powerset of X is denoted as ℘(X). The set of all finite (resp., infinite) ordered sequences of elements from X is denoted by X∗ (resp., Xω). The set of all finite ordered sequences of length > 0 is denoted by X+. We write D(X) to denote the set of probability distributions over X. The support of a distribution D ∈ D(X) is denoted by Supp(D) = {x ∈ X | D(x) > 0}.
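As a small illustration of this notation, the following sketch assumes a dictionary-based encoding of a distribution D ∈ D(X) (mapping each x to D(x)); the function name is illustrative.

```python
def support(dist):
    """Supp(D) = {x in X | D(x) > 0}, for a distribution given as a dict x -> D(x)."""
    return {x for x, p in dist.items() if p > 0}

assert support({"x1": 0.7, "x2": 0.3, "x3": 0.0}) == {"x1", "x2"}
```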
In this paper, we consider a class of decision-making problems in stochastic systems modeled as an MDP without a reward function. We then introduce a preference model over the set of infinite plays in the MDP.
Definition 1 (MDP). An MDP is a tuple M = ⟨S, A, T, ι⟩, where S and A are finite state and action sets, ι ∈ S is an initial state, and T : S × A → D(S) is the transition probability function such that T(s, a, s′) is the probability of reaching the state s′ ∈ S when action a ∈ A is chosen at the state s ∈ S.
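For concreteness, the following is a minimal sketch of Definition 1, assuming a dictionary-based encoding in which trans[s][a] maps each successor s2 to T(s, a, s2); the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: set    # S, a finite state set
    actions: set   # A, a finite action set
    trans: dict    # T: trans[s][a] is a dict mapping each successor s2 to T(s, a, s2)
    init: object   # iota, the initial state

    def prob(self, s, a, s2):
        """T(s, a, s2): probability of reaching s2 when action a is chosen at s."""
        return self.trans.get(s, {}).get(a, {}).get(s2, 0.0)

# A toy two-state instance (purely illustrative).
toy = MDP(
    states={"s0", "s1"},
    actions={"a"},
    trans={"s0": {"a": {"s0": 0.5, "s1": 0.5}}, "s1": {"a": {"s1": 1.0}}},
    init="s0",
)
assert toy.prob("s0", "a", "s1") == 0.5
```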
A play in an MDP M is an infinite sequence of states ρ = s₀s₁… ∈ Sω such that, for every integer i ≥ 0, there exists an action a ∈ A such that T(sᵢ, a, sᵢ₊₁) > 0. We denote the set of all plays starting from s ∈ S in the MDP by Plays(M, s), and the set of all plays in M is denoted by Plays(M) = ⋃_{s∈S} Plays(M, s). The set of states occurring in a play is given by Occ(ρ) = {s ∈ S | ∃i ≥ 0, sᵢ = s}. A prefix of a play ρ is a finite sub-sequence of states ν = s₀s₁…sₖ, k ≥ 0, whose length is |ν| = k + 1. The set of all prefixes of a play ρ is denoted by Pref(ρ). The set of all prefixes in M is denoted by PrefPlays(M) = ⋃_{ρ∈Plays(M)} Pref(ρ). Given a prefix ν = s₀s₁…sₖ ∈ PrefPlays(M), the sequence of states ρ′ = sₖ₊₁sₖ₊₂… ∈ Sω is called a suffix of ν if the play νρ′ = s₀s₁…sₖsₖ₊₁sₖ₊₂… is an element of Plays(M).
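As an illustration, the sketch below (reusing the hypothetical MDP class above) samples a finite prefix of a play by repeatedly drawing successors from the transition distributions; the uniform action choice is an arbitrary assumption made only to generate some play.

```python
import random

def sample_prefix(mdp, length):
    """Sample a prefix s0 s1 ... of a play of the given length, starting at iota."""
    prefix = [mdp.init]
    while len(prefix) < length:
        s = prefix[-1]
        a = random.choice(sorted(mdp.trans[s]))            # pick any enabled action
        succs, probs = zip(*mdp.trans[s][a].items())
        prefix.append(random.choices(succs, weights=probs)[0])  # draw s' ~ T(s, a)
    return prefix

# All prefixes of the sampled fragment, i.e., Pref restricted to what was sampled.
nu = sample_prefix(toy, 5)
prefixes = [nu[: i + 1] for i in range(len(nu))]
```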
In this MDP, we consider reachability objectives for the agent. Given a set F ⊆ S, a reachability objective is characterized by the set Reach(F) = {ρ ∈ Plays(M) | Occ(ρ) ∩ F ≠ ∅}, which contains all plays in M that visit F. Any play ρ ∈ Plays(M) that satisfies a reachability objective Reach(F) has a good prefix ν ∈ Pref(ρ) such that the last state of ν is in F [17].
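A good prefix can be detected on finite data: a prefix witnesses Reach(F) as soon as some visited state lies in F. A minimal sketch (function name assumed for illustration):

```python
def is_good_prefix(prefix, F):
    """True iff Occ(prefix) intersects F, i.e., the prefix already witnesses Reach(F)."""
    return any(s in F for s in prefix)
```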
A finite-memory (resp., memoryless), non-deterministic strategy in the MDP is a mapping π : S+ → ℘(A) (resp., π : S → ℘(A)) from a prefix to a subset of actions that can be taken from that prefix. The set of all finite-memory, nondeterministic strategies is denoted Π. Given a prefix ν = s₀…sₖ ∈ PrefPlays(M), a suffix ρ = sₖ₊₁sₖ₊₂… ∈ Sω is consistent with π if, for all i ≥ 0, there exists an action a ∈ π(s₀…sₖ…sₖ₊ᵢ) such that T(sₖ₊ᵢ, a, sₖ₊ᵢ₊₁) > 0. Given an MDP M, a prefix ν ∈ PrefPlays(M), and a strategy π, the cone is defined as the set of consistent suffixes of ν w.r.t. π, that is,

Cone(M, ν, π) = {ρ ∈ Sω | νρ is consistent with π}.
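To illustrate consistency, the sketch below specializes to a memoryless nondeterministic strategy (an assumed encoding as a dict from states to sets of allowed actions) and checks whether each step of a finite state sequence can be produced by some allowed action with positive probability; a finite-memory strategy would instead take the whole prefix as its argument.

```python
def consistent(mdp, strategy, states):
    """Step-wise consistency of a finite state sequence with a memoryless
    nondeterministic strategy given as a dict: state -> set of actions."""
    for s, s_next in zip(states, states[1:]):
        if not any(mdp.prob(s, a, s_next) > 0 for a in strategy.get(s, set())):
            return False
    return True

# Example with the toy MDP: the strategy that always allows action "a".
always_a = {"s0": {"a"}, "s1": {"a"}}
assert consistent(toy, always_a, ["s0", "s0", "s1", "s1"])
```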
Given a prefix ν ∈ PrefPlays(M) and a reachability objective Reach(F), a (finite-memory/memoryless) strategy π_PWin(F) is said to be positive winning if Cone(M, ν, π) ∩ Reach(F) ≠ ∅. Similarly, a (finite-memory/memoryless) strategy π_ASWin(F) is said to be almost-sure winning if Cone(M, ν, π) ⊆ Reach(F).
The set of states in the MDP M from which the agent has an almost-sure (resp., positive) winning strategy to satisfy a reachability objective F ∈ ℱ is called the almost-sure (resp., positive) winning region and is denoted by ASWin(F) (resp., PWin(F)). The almost-sure and positive winning strategies in the product MDP are known to be memoryless. The almost-sure winning region and strategies can be synthesized in polynomial time and linear time, respectively [18].
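For reachability objectives, the positive winning region admits a simple backward fixed-point computation over the underlying graph of the MDP: a state is positively winning if some enabled action reaches an already-winning state with positive probability. The sketch below follows that standard idea as an illustration under the assumed encoding (it is not the algorithm of [18], and the almost-sure region, which needs the usual iterative pruning, is omitted).

```python
def positive_winning_region(mdp, F):
    """PWin(F): states from which the agent can reach F with positive probability."""
    win = set(F)
    changed = True
    while changed:
        changed = False
        for s in mdp.states - win:
            # s joins PWin(F) if some action has a successor already in the region.
            if any(any(s2 in win for s2 in dist) for dist in mdp.trans.get(s, {}).values()):
                win.add(s)
                changed = True
    return win

assert positive_winning_region(toy, {"s1"}) == {"s0", "s1"}
```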
III. PREFERENCE MODEL
Definition 2. A preference model is a tuple ⟨U, ⪰⟩, where U is a countable set of outcomes and ⪰ is a reflexive and transitive binary relation on U.

Given u₁, u₂ ∈ U, we write u₁ ⪰ u₂ if u₁ is weakly preferred to (i.e., is at least as good as) u₂; and u₁ ∼ u₂ if u₁ ⪰ u₂ and u₂ ⪰ u₁, that is, u₁ and u₂ are indifferent. We write u₁ ≻ u₂ to mean that u₁ is strictly preferred to u₂, i.e., u₁ ⪰ u₂ and u₂ ⋡ u₁. We write u₁ ∦ u₂ if u₁ and u₂ are incomparable, i.e., neither u₁ ⪰ u₂ nor u₂ ⪰ u₁. Since Def. 2 allows outcomes to be incomparable, it models incomplete preferences [19].
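As a small illustration of the derived relations, the sketch below takes the weak preference relation as a set of ordered pairs (assumed to be reflexive and transitive) and classifies a pair of outcomes as strictly preferred, indifferent, or incomparable; the encoding and names are assumptions for illustration.

```python
def compare(weak, u1, u2):
    """Classify (u1, u2) under a weak preference relation given as a set of pairs."""
    ge12, ge21 = (u1, u2) in weak, (u2, u1) in weak
    if ge12 and ge21:
        return "indifferent"            # u1 ~ u2
    if ge12:
        return "u1 strictly preferred"  # u1 > u2
    if ge21:
        return "u2 strictly preferred"  # u2 > u1
    return "incomparable"               # u1 || u2

# Example: a weakly prefers b, c is unrelated to a and b.
weak = {("a", "a"), ("b", "b"), ("c", "c"), ("a", "b")}
assert compare(weak, "a", "b") == "u1 strictly preferred"
assert compare(weak, "a", "c") == "incomparable"
```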
We consider planning objectives specified as preferences
over reachability objectives.
Definition 3. A preference model over reachability objectives in an MDP M is a tuple ⟨ℱ, ⊵⟩, where ℱ =