
where $T$ is the length of a single episode, $L_G$ is the Lipschitz constant for the weighting function $G$, $|\mathcal{S}|$ is the number of states in the MDP, $|\mathcal{A}|$ is the number of actions, and $K$ is the number of episodes (Theorem 4.1). Importantly, it achieves the optimal rate $\sqrt{K}$ achievable for typical expected return objectives (which is a lower bound in our setting, since expected return is an objective in the class we consider, taking $G(\tau) = \tau$). For CVaR objectives, we have $L_G = 1/\alpha$, where $\alpha$ is the size of the tail considered; e.g., when $\alpha$ is small, it averages over outliers with particularly small return.
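For concreteness, the following toy sketch (our own example; the weighting $G(\tau) = \min(\tau/\alpha, 1)$ is the standard CVaR choice and is assumed here rather than quoted from (1)) illustrates how the $1/\alpha$-Lipschitz weighting concentrates the objective on the worst outcomes as $\alpha$ shrinks.

```python
import numpy as np

# Toy sketch (our own example; the exact form of objective (1) is not
# reproduced here). The standard CVaR weighting G(tau) = min(tau / alpha, 1)
# is (1/alpha)-Lipschitz, matching L_G = 1/alpha, and the lower-tail average
# it induces moves toward the worst outcomes as alpha -> 0.
rng = np.random.default_rng(0)
returns = rng.normal(loc=5.0, scale=2.0, size=100_000)  # hypothetical return samples

def cvar_lower_tail(z, alpha):
    """Average of the worst alpha-fraction of returns (lower-tail CVaR)."""
    cutoff = np.quantile(z, alpha)
    return z[z <= cutoff].mean()

for alpha in (1.0, 0.5, 0.1, 0.01):
    L_G = 1.0 / alpha  # Lipschitz constant of G(tau) = min(tau / alpha, 1)
    print(f"alpha = {alpha:5.2f}: L_G = {L_G:6.1f}, CVaR = {cvar_lower_tail(returns, alpha):.3f}")
```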
The main challenge behind proving our result is bounding the gap between the objective value for the
estimated MDP and the true MDP. In particular, even if we have a uniform bound
$\|F_{\hat{Z}(\pi)} - F_{Z(\pi)}\|_\infty$ on the CDFs of the estimated return $\hat{Z}(\pi)$ and the true return $Z(\pi)$, we need to translate this to a bound on the corresponding objective values. To do so, we prove that, equivalently, we have
$$\Phi(\pi) = 2T - \int_{\mathbb{R}} G(F_{Z(\pi)}(x)) \, dx.$$
This equivalent expression for $\Phi$ follows by variable substitution and integration by parts when $F_{Z(\pi)}$ is invertible (so $F^{\dagger}_{Z(\pi)}(\tau) = F^{-1}_{Z(\pi)}(\tau)$), but the general case requires significantly more care. We show that it holds for an arbitrary CDF $F_{Z(\pi)}$.
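The payoff of the alternative expression can be seen numerically; the following sketch (our own toy CDFs and an assumed CVaR-style weighting, not the paper's construction) checks that a sup-norm gap of $\epsilon$ between two CDFs translates into a gap of at most $L_G \epsilon T$ between the corresponding integrals.

```python
import numpy as np

# Toy sketch (our own CDFs; not the paper's construction): a uniform CDF bound
# ||F1 - F2||_inf <= eps translates into a gap of at most L_G * eps * T between
# the integral terms in the alternative expression, since G is L_G-Lipschitz
# and the toy returns lie in [0, T].
T, eps, alpha = 10.0, 0.02, 0.1
L_G = 1.0 / alpha
G = lambda u: np.minimum(u / alpha, 1.0)       # assumed CVaR-style weighting

xs = np.linspace(0.0, T, 100_001)
dx = xs[1] - xs[0]
F1 = xs / T                                    # CDF of a uniform return on [0, T]
F2 = np.clip(F1 + eps * np.sin(xs), 0.0, 1.0)  # perturbed CDF, still nondecreasing

assert np.max(np.abs(F1 - F2)) <= eps + 1e-12  # uniform bound on the CDF gap
gap = abs(np.sum(G(F1)) * dx - np.sum(G(F2)) * dx)
print(f"integral gap = {gap:.4f} <= L_G * eps * T = {L_G * eps * T:.4f}")
```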
In addition to our regret bound, we provide several other useful results for MDPs with risk-sensitive
objectives. In particular, optimal policies for risk-sensitive objectives may be non-Markov. For
CVaR objectives, it is known that the optimal policy only needs to depend on the cumulative return
accrued so far [13]. We prove that this holds in general for objectives of the form (1) (Theorem 3.1).
Furthermore, the cumulative return so far is a continuous component; we prove that discretizing this
component yields an arbitrarily close approximation of the true MDP (Theorem 3.2).
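For intuition, here is a minimal sketch (hypothetical naming, not the paper's code) of the state augmentation suggested by Theorems 3.1 and 3.2: the policy conditions on the current state together with the cumulative return accrued so far, with the latter discretized into bins of width $\delta$.

```python
import numpy as np

# Minimal sketch (hypothetical names) of the augmented state suggested by
# Theorems 3.1 and 3.2: condition the policy on (current state, discretized
# cumulative return). With rewards in [0, 1] and horizon T, the cumulative
# return lies in [0, T], so there are at most ceil(T / delta) + 1 bins.
def augment_state(s, cumulative_return, delta):
    """Map (state, cumulative return so far) to a discrete augmented state."""
    return_bin = int(np.floor(cumulative_return / delta))
    return (s, return_bin)

# Example: with delta = 0.5, returns 1.2 and 1.3 land in the same bin,
# so a policy over augmented states treats them identically.
print(augment_state(3, 1.2, 0.5))  # (3, 2)
print(augment_state(3, 1.3, 0.5))  # (3, 2)
```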
Related work.
To the best of our knowledge, the only prior work on regret bounds for risk-sensitive
reinforcement learning is specific to the entropic risk objective [7, 8]:
$$J(\pi) = \frac{1}{\beta} \log \mathbb{E}_{Z(\pi)}\!\left[ e^{\beta Z(\pi)} \right],$$
where $\beta \in \mathbb{R} \setminus \{0\}$ is a hyperparameter. As $\beta \to 0$, this objective recovers the expected return objective; for $\beta < 0$, it encourages risk aversion by upweighting negative returns; and for $\beta > 0$, it encourages risk-seeking behavior by upweighting positive returns. This objective is amenable to theoretical analysis since the value function satisfies a variant of the Bellman equation called the exponential Bellman equation; however, it is a narrow family of risk measures and is not widely used in practice.
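As a quick illustration (a toy sketch with made-up numbers, not taken from [7, 8]), the entropic risk of a return distribution can be estimated from samples; negative $\beta$ pulls the value toward the worst outcomes and positive $\beta$ toward the best.

```python
import numpy as np

# Toy illustration (not from the paper): entropic risk of sampled returns,
# J = (1 / beta) * log E[exp(beta * Z)]. Negative beta is risk-averse (closer
# to the minimum return); positive beta is risk-seeking (closer to the maximum).
rng = np.random.default_rng(0)
returns = rng.uniform(0.0, 10.0, size=100_000)  # hypothetical return samples

def entropic_risk(z, beta):
    # logsumexp-style computation for numerical stability
    m = np.max(beta * z)
    return (m + np.log(np.mean(np.exp(beta * z - m)))) / beta

for beta in (-2.0, -0.1, 0.1, 2.0):
    print(f"beta = {beta:+.1f}: J = {entropic_risk(returns, beta):.3f}")
print(f"mean return (beta -> 0 limit): {np.mean(returns):.3f}")
```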
In contrast, we focus on a much broader class of risk measures, including the popular CVaR objective [1], which is used to minimize tail losses. To the best of our knowledge, we provide the first regret bounds for the CVaR objective and for the wide range of objectives given by (1).
2 Problem Formulation
Markov decision process.
We consider a Markov decision process (MDP) $M = (\mathcal{S}, \mathcal{A}, D, P, P_R, T)$, with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, initial state distribution $D(s)$, finite time horizon $T$, transition probabilities $P(s' \mid s, a)$, and reward measure $P_{R(s,a)}$; without loss of generality, we assume $r \in [0, 1]$ with probability one. A history is a sequence
$$\xi \in \mathcal{Z} = \bigcup_{t=1}^{T} \mathcal{Z}_t, \qquad \text{where} \qquad \mathcal{Z}_t = (\mathcal{S} \times \mathcal{A} \times \mathbb{R})^{t-1} \times \mathcal{S}.$$
Intuitively, a history captures the interaction between an agent and $M$ up to step $t$. We consider stochastic, time-varying, history-dependent policies $\pi_t(a_t \mid \xi_t)$, where $t$ is the time step. Given $\pi$, the history $\Xi^{(\pi)}_t$ generated by $\pi$ up to step $t$ is a random variable with probability measure
$$
P_{\Xi^{(\pi)}_t}(\xi_t) =
\begin{cases}
D(s_1) & \text{if } t = 1, \\
P_{\Xi^{(\pi)}_{t-1}}(\xi_{t-1}) \cdot \pi_{t-1}(a_{t-1} \mid \xi_{t-1}) \cdot P_{R(s_{t-1}, a_{t-1})}(r_{t-1}) \cdot P(s_t \mid s_{t-1}, a_{t-1}) & \text{otherwise},
\end{cases}
$$
where for all $\tau \in [T]$ we use the notation $\xi_\tau = ((s_1, a_1, r_1), \ldots, (s_{\tau-1}, a_{\tau-1}, r_{\tau-1}), s_\tau)$.
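To make the generative process concrete, the following sketch (a toy MDP with our own naming, not the paper's code) samples a history $\xi_T$ from this measure under a history-dependent policy.

```python
import numpy as np

# Minimal sketch (toy MDP and our own naming, not the paper's code) of sampling
# a history from the measure above: draw s_1 ~ D, then at each step draw
# a_t ~ pi_t(. | xi_t), r_t ~ P_{R(s_t, a_t)}, and s_{t+1} ~ P(. | s_t, a_t).
rng = np.random.default_rng(0)
n_states, n_actions, T = 3, 2, 5
D = np.full(n_states, 1.0 / n_states)                              # initial state distribution
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition probabilities P(s' | s, a)
reward_mean = rng.uniform(size=(n_states, n_actions))              # Bernoulli rewards, so r in [0, 1]

def policy(t, triples, s_t):
    """Toy history-dependent policy: favor action 1 once the return so far exceeds 1."""
    return_so_far = sum(r for (_, _, r) in triples)
    probs = np.array([0.2, 0.8]) if return_so_far > 1.0 else np.array([0.8, 0.2])
    return rng.choice(n_actions, p=probs)

def sample_history():
    """Return xi_T = ((s_1, a_1, r_1), ..., (s_{T-1}, a_{T-1}, r_{T-1}), s_T)."""
    triples = []
    s = rng.choice(n_states, p=D)
    for t in range(1, T):
        a = policy(t, triples, s)
        r = float(rng.random() < reward_mean[s, a])   # r_t ~ P_{R(s_t, a_t)}
        s_next = rng.choice(n_states, p=P[s, a])      # s_{t+1} ~ P(. | s_t, a_t)
        triples.append((s, a, r))
        s = s_next
    return triples, s

print(sample_history())
```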