Regret Bounds for Risk-Sensitive
Reinforcement Learning
Osbert Bastani
University of Pennsylvania
obastani@seas.upenn.edu
Yecheng Jason Ma
University of Pennsylvania
jasonyma@seas.upenn.edu
Estelle Shen
University of Pennsylvania
pixna@sas.upenn.edu
Wanqiao Xu
Stanford University
wanqiaox@stanford.edu
Abstract
In safety-critical applications of reinforcement learning such as healthcare and
robotics, it is often desirable to optimize risk-sensitive objectives that account for
tail outcomes rather than expected reward. We prove the first regret bounds for
reinforcement learning under a general class of risk-sensitive objectives including
the popular CVaR objective. Our theory is based on a novel characterization of the
CVaR objective as well as a novel optimistic MDP construction.
1 Introduction
There has been recent interest in risk-sensitive reinforcement learning, which replaces the usual
expected reward objective with one that accounts for variation in possible outcomes. One of the
most popular risk-sensitive objectives is the conditional value-at-risk (CVaR) objective [1, 2, 3, 4],
which is the average risk at some tail of the distribution of returns (i.e., cumulative rewards) under a
given policy [5, 6]. More generally, we consider a broad class of objectives in the form of a weighted
integral of quantiles of the return distribution, of which CVaR is a special case.
A key question is providing regret bounds for risk-sensitive reinforcement learning. While there has
been some work studying this question, it has focused on a specific objective called the entropic risk
measure [7, 8], leaving open the question of bounds for more general risk-sensitive objectives. There
has also been work on optimistic exploration for CVaR [9], but without any regret bounds.
We provide the first regret bounds for risk-sensitive reinforcement learning with objectives of the form
$$\Phi(\pi) = \int_0^1 F^{\dagger}_{Z(\pi)}(\tau) \cdot dG(\tau), \qquad (1)$$
where $Z(\pi)$ is the random variable encoding the return of policy $\pi$, $F^{\dagger}_{Z(\pi)}$ is its quantile function
(roughly speaking, the inverse CDF), and $G$ is a weighting function over the quantiles. This class
captures a broad range of useful objectives, and has been studied in prior work [10, 4].
We focus on the episodic setting, where the agent interacts with the environment, modeled by a
Markov decision process (MDP), over a fixed sequence of episodes. Its goal is to minimize the
regret, i.e., the gap between the objective value it achieves and that of the optimal policy. Our
approach is based on the upper confidence bound strategy [11, 12], which makes decisions according
to an optimistic estimate of the MDP. We prove that this algorithm (denoted $\mathcal{A}$) has regret
$$\mathrm{regret}(\mathcal{A}) = \tilde{O}\big(T^2 \cdot L_G \cdot |S|^{3/2} \cdot |A| \cdot \sqrt{K}\big),$$
where $T$ is the length of a single episode, $L_G$ is the Lipschitz constant of the weighting function $G$,
$|S|$ is the number of states in the MDP, $|A|$ is the number of actions, and $K$ is the number of episodes
(Theorem 4.1). Importantly, it achieves the optimal rate $\sqrt{K}$ achievable for typical expected return
objectives (which is a lower bound in our setting, since expected return is an objective in the class we
consider, obtained by taking $G(\tau) = \tau$). For CVaR objectives, we have $L_G = 1/\alpha$, where $\alpha$ is the size of the tail
considered; e.g., when $\alpha$ is small, the objective averages over outliers with particularly small return.
The main challenge in proving our result is bounding the gap between the objective value for the
estimated MDP and for the true MDP. In particular, even if we have a uniform bound $\|F_{\hat{Z}(\pi)} - F_{Z(\pi)}\|_\infty$
on the distance between the CDFs of the estimated return $\hat{Z}(\pi)$ and the true return $Z(\pi)$, we need to translate this into a
bound on the corresponding objective values. To do so, we prove that equivalently, we have
$$\Phi(\pi) = 2T - \int_0^{2T} G(F_{Z(\pi)}(x)) \cdot dx.$$
This equivalent expression for $\Phi$ follows by variable substitution and integration by parts when $F_{Z(\pi)}$
is invertible (so $F^{\dagger}_{Z(\pi)}(\tau) = F^{-1}_{Z(\pi)}(\tau)$), but the general case requires significantly more care. We
show that it holds for an arbitrary CDF $F_{Z(\pi)}$.
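As a concrete check, the following sketch numerically compares the two expressions for $\Phi(\pi)$ on a toy return distribution bounded in $[0, T]$, using the CVaR weighting $G(\tau) = \min\{\tau/\alpha, 1\}$; it is a minimal illustration only, and the sample distribution and grid sizes are arbitrary choices rather than part of the analysis.

```python
# Numerically compare the quantile-integral and CDF-integral forms of Phi
# for a toy return sample bounded in [0, T] (illustrative sketch only).
import numpy as np

T = 5.0
rng = np.random.default_rng(0)
returns = np.clip(rng.normal(3.0, 1.0, 200_000), 0.0, T)   # toy sample of Z(pi)

alpha = 0.25
G = lambda tau: np.minimum(tau / alpha, 1.0)                # alpha-CVaR weighting

# Phi = int_0^1 F_dagger(tau) dG(tau): Riemann-Stieltjes sum over quantiles.
taus = np.linspace(0.0, 1.0, 10_001)
quantiles = np.quantile(returns, taus)                      # empirical quantile function
lhs = np.sum(quantiles[:-1] * np.diff(G(taus)))

# Phi = 2T - int_0^{2T} G(F(x)) dx: Riemann sum over the empirical CDF.
xs = np.linspace(0.0, 2 * T, 20_001)
cdf = np.searchsorted(np.sort(returns), xs, side="right") / len(returns)
rhs = 2 * T - np.sum(G(cdf[:-1]) * np.diff(xs))

print(lhs, rhs)   # the two estimates agree up to discretization error
```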
In addition to our regret bound, we provide several other useful results for MDPs with risk-sensitive
objectives. In particular, optimal policies for risk-sensitive objectives may be non-Markov. For
CVaR objectives, it is known that the optimal policy only needs to depend on the cumulative return
accrued so far [13]. We prove that this holds in general for objectives of the form (1) (Theorem 3.1).
Furthermore, the cumulative return so far is a continuous component; we prove that discretizing this
component yields an arbitrarily close approximation of the true MDP (Theorem 3.2).
Related work. To the best of our knowledge, the only prior work on regret bounds for risk-sensitive
reinforcement learning is specific to the entropic risk objective [7, 8]:
$$J(\pi) = \frac{1}{\beta} \log \mathbb{E}_{Z(\pi)}\left[e^{\beta Z(\pi)}\right],$$
where $\beta \in \mathbb{R} \setminus \{0\}$ is a hyperparameter. As $\beta \to 0$, this objective recovers the expected return objective;
for $\beta < 0$, it encourages risk aversion by upweighting negative returns; and for $\beta > 0$, it encourages
risk-seeking behavior by upweighting positive returns. This objective is amenable to theoretical
analysis since its value function satisfies a variant of the Bellman equation called the exponential
Bellman equation; however, it is a narrow family of risk measures and is not widely used in practice.

In contrast, we focus on a much broader class of risk measures including the popular CVaR objective [1],
which is used to minimize tail losses. To the best of our knowledge, we provide the first
regret bounds for the CVaR objective and for the wide range of objectives given by (1).
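For reference, the entropic risk objective can be estimated directly from return samples; the short sketch below (with an arbitrary toy distribution, purely for illustration) shows how the sign of $\beta$ trades off risk aversion against risk seeking.

```python
# Estimate the entropic risk J = (1/beta) log E[exp(beta * Z)] from samples,
# computed stably via a log-sum-exp shift (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(10.0, 3.0, 100_000)   # toy return sample

def entropic_risk(z, beta):
    m = np.max(beta * z)
    return (m + np.log(np.mean(np.exp(beta * z - m)))) / beta

for beta in (-0.5, -0.1, 1e-6, 0.1, 0.5):
    # beta < 0 lands below the sample mean (risk averse); beta > 0 lands above it.
    print(beta, entropic_risk(z, beta))
```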
2 Problem Formulation
Markov decision process. We consider a Markov decision process (MDP) $M = (S, A, D, P, P_R, T)$,
with finite state space $S$, finite action space $A$, initial state distribution $D(s)$, finite time horizon $T$,
transition probabilities $P(s' \mid s, a)$, and reward measure $P_{R(s,a)}$; without loss of generality, we
assume $r \in [0, 1]$ with probability one. A history is a sequence
$$\xi \in \mathcal{Z} = \bigcup_{t=1}^{T} \mathcal{Z}_t, \qquad \text{where} \quad \mathcal{Z}_t = (S \times A \times \mathbb{R})^{t-1} \times S.$$
Intuitively, a history captures the interaction between an agent and $M$ up to step $t$. We consider
stochastic, time-varying, history-dependent policies $\pi_t(a_t \mid \xi_t)$, where $t$ is the time step. Given $\pi$, the
history $\Xi^{(\pi)}_t$ generated by $\pi$ up to step $t$ is a random variable with probability measure
$$P_{\Xi^{(\pi)}_t}(\xi_t) = \begin{cases} D(s_1) & \text{if } t = 1 \\ P_{\Xi^{(\pi)}_{t-1}}(\xi_{t-1}) \cdot \pi_{t-1}(a_{t-1} \mid \xi_{t-1}) \cdot P_{R(s_{t-1}, a_{t-1})}(r_{t-1}) \cdot P(s_t \mid s_{t-1}, a_{t-1}) & \text{otherwise}, \end{cases}$$
where for all $\tau \in [T]$ we use the notation $\xi_\tau = ((s_1, a_1, r_1), \ldots, (s_{\tau-1}, a_{\tau-1}, r_{\tau-1}), s_\tau)$.
Finally, an episode (or rollout) is a history $\xi \in \mathcal{Z}_T$ of length $T$ generated by a given policy $\pi$.
Bellman equation. The return of $\pi$ at step $t$ is the random variable $(Z^{(\pi)}_t(\xi_t))(\xi_T) = \sum_{\tau=t}^{T} r_\tau$,
where $\xi_T \sim P_{\Xi^{(\pi)}_T}(\cdot \mid \Xi^{(\pi)}_t = \xi_t)$, i.e., it is the reward accumulated from step $t$ onward given that the
current history is $\xi_t$. Defining $Z^{(\pi)}_{T+1}(\xi) = 0$, the distributional Bellman equation [14, 9] is
$$F_{Z^{(\pi)}_t(\xi)}(x) = \sum_{a \in A} \pi_t(a \mid \xi) \sum_{s' \in S} P(s' \mid S(\xi), a) \int F_{Z^{(\pi)}_{t+1}(\xi \oplus (a, r, s'))}(x - r) \cdot dP_{R(S(\xi), a)}(r),$$
where $S(\xi) = s$ for $\xi = (\ldots, s)$ is the current state in history $\xi$, $\xi \oplus (a, r, s')$ denotes the history $\xi$
extended by the transition $(a, r, s')$, and $F_X$ is the cumulative distribution function (CDF) of random
variable $X$. Finally, the cumulative return of $\pi$ is $Z^{(\pi)} = Z^{(\pi)}_1(\xi)$, where $\xi = (s) \in \mathcal{Z}_1$ for $s \sim D$ is
the initial history; in particular, we have
$$F_{Z^{(\pi)}}(x) = \int F_{Z^{(\pi)}_1((s))}(x) \cdot dD(s).$$
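To make the recursion concrete, the sketch below performs one distributional Bellman backup for a tabular MDP with integer-valued rewards and a Markov (state-indexed) policy, representing each return distribution by a probability mass function over $\{0, \ldots, T\}$ rather than a CDF; it is an illustration of the update, not the implementation used in our analysis.

```python
# One distributional Bellman backup on a discrete return support (sketch).
import numpy as np

def distributional_backup(pmf_next, policy_t, P, R_pmf, T):
    """pmf_next[s']   : PMF of Z_{t+1} from state s', shape (S, T+1)
       policy_t[s, a] : pi_t(a | s), shape (S, A)
       P[s, a, s']    : transition probabilities, shape (S, A, S)
       R_pmf[s, a, r] : PMF of the reward over {0, ..., r_max}, shape (S, A, r_max+1)
    """
    S, A, _ = P.shape
    pmf_t = np.zeros((S, T + 1))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for r, pr in enumerate(R_pmf[s, a]):
                    if pr == 0.0:
                        continue
                    # Z_t = r + Z_{t+1}: shift the next-step PMF by r, clip overflow at T.
                    shifted = np.zeros(T + 1)
                    shifted[r:] = pmf_next[s2][: T + 1 - r]
                    shifted[T] += pmf_next[s2][T + 1 - r:].sum()
                    pmf_t[s] += policy_t[s, a] * P[s, a, s2] * pr * shifted
    return pmf_t
```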
Risk-sensitive objective. The quantile function of a random variable $X$ is
$$F^{\dagger}_X(\tau) = \inf\{x \in \mathbb{R} \mid F_X(x) \ge \tau\}.$$
Note that if $F_X$ is strictly monotone, then it is invertible and we have $F^{\dagger}_X(\tau) = F^{-1}_X(\tau)$. Now, our
objective is given by the Riemann–Stieltjes integral
$$\Phi_M(\pi) = \int_0^1 F^{\dagger}_{Z(\pi)}(\tau) \cdot dG(\tau),$$
where $G(\tau)$ is a given CDF over quantiles $\tau \in [0, 1]$. This objective was originally studied in [15]
for the reinforcement learning setting. For example, choosing $G(\tau) = \min\{\tau/\alpha, 1\}$ (i.e., the CDF
of the distribution $\mathrm{Uniform}([0, \alpha])$) for $\alpha \in (0, 1]$ yields the $\alpha$-conditional value at risk (CVaR)
objective; furthermore, taking $\alpha = 1$ yields the usual expected cumulative reward objective. In
addition, choosing $G(\tau) = \mathbb{1}(\tau \ge \alpha)$ for $\alpha \in [0, 1]$ yields the $\alpha$ value at risk (VaR) objective. Other
risk-sensitive objectives can also be captured in this form, for example the Wang measure [16] and
the cumulative probability weighting (CPW) metric [17]. We call any policy
$$\pi^*_M \in \arg\max_\pi \Phi_M(\pi)$$
an optimal policy, i.e., it maximizes the given objective for $M$.
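For intuition, the sketch below evaluates $\Phi$ on an empirical sample of returns for several choices of $G$; the helper names and the toy sample are illustrative assumptions, not part of the formal setup.

```python
# Approximate Phi = int_0^1 F_dagger(tau) dG(tau) by a Riemann-Stieltjes sum
# over the empirical quantile function, for several weightings G (sketch).
import numpy as np

def phi(returns, G, n_grid=10_000):
    taus = np.linspace(0.0, 1.0, n_grid + 1)
    quantiles = np.quantile(returns, taus)        # empirical quantile function
    return float(np.sum(quantiles[:-1] * np.diff(G(taus))))

alpha = 0.1
cvar_G = lambda t: np.minimum(t / alpha, 1.0)     # alpha-CVaR
var_G  = lambda t: (t >= alpha).astype(float)     # alpha-VaR (step at alpha)
mean_G = lambda t: t                              # expected return (alpha = 1)

rng = np.random.default_rng(0)
z = rng.exponential(5.0, 100_000)                 # toy return sample
print(phi(z, cvar_G), phi(z, var_G), phi(z, mean_G), z.mean())
```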
Assumptions. First, we have the following assumption on the quantile function of $Z(\pi)$:

Assumption 2.1. $F^{\dagger}_{Z(\pi)}(1) = T$.

Since $T$ is the maximum cumulative reward attainable in an episode, this assumption says that the maximum
reward is attained with some nontrivial probability. This assumption is very minor; for any given
MDP $M$, we can modify $M$ to include a path achieving reward $T$ with arbitrarily low probability.

Assumption 2.2. $G$ is $L_G$-Lipschitz continuous for some $L_G \in \mathbb{R}_{>0}$, and $G(0) = 0$.

For example, for the $\alpha$-CVaR objective, we have $L_G = 1/\alpha$.

Assumption 2.3. We are given an algorithm for computing $\pi^*_M$ for a given MDP $M$.

For CVaR objectives, existing algorithms [13] can compute $\pi^*_M$ with any desired approximation error.
For completeness, we give a formal description of the procedure in Appendix D. When unambiguous,
we drop the dependence on $M$ and simply write $\pi^*$.
Finally, our goal is to learn while interacting with the MDP $M$ across a fixed number of episodes $K$.
In particular, at the beginning of each episode $k \in [K]$, our algorithm chooses a policy $\pi^{(k)} = \mathcal{A}(H_k)$,
where $H_k = \{\xi^{(\kappa)}_T\}_{\kappa=1}^{k-1}$ is the random set of episodes observed so far, to use for the duration of
episode $k$. Then, our goal is to design an algorithm $\mathcal{A}$ that minimizes the regret, which measures
the expected sub-optimality with respect to $\pi^*$:
$$\mathrm{regret}(\mathcal{A}) = \mathbb{E}\left[\sum_{k \in [K]} \left(\Phi(\pi^*) - \Phi(\pi^{(k)})\right)\right].$$
Finally, for simplicity, we assume that the initial state distribution $D$ is known; in practice, we can
remove this assumption using a standard strategy.
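The interaction protocol and the regret bookkeeping can be summarized by the following skeleton; it is a sketch only, the `alg`, `env`, and `phi` interfaces are placeholders, and the realized sum below is the quantity whose expectation defines the regret.

```python
# Episodic interaction protocol with per-episode regret accumulation (sketch).
def run(alg, env, K, phi, phi_opt):
    """alg.plan(history) returns pi^(k); env.rollout(policy) returns one episode;
       phi(policy) evaluates Phi(pi); phi_opt = Phi(pi*)."""
    history, regret = [], 0.0
    for k in range(K):
        policy_k = alg.plan(history)        # pi^(k) = A(H_k)
        episode = env.rollout(policy_k)     # act with pi^(k) for episode k
        history.append(episode)             # H_{k+1} adds the new episode xi_T
        regret += phi_opt - phi(policy_k)   # realized per-episode sub-optimality
    return regret
```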
3 Optimal Risk-Sensitive Policies
In this section, we characterize properties of the optimal risk-sensitive policy $\pi^*_M$. First, we show
that it suffices to consider policies that depend on the current state and the cumulative reward
obtained so far, rather than the entire history. Second, the cumulative reward is a continuous quantity,
making it difficult to compute the optimal policy; we prove that discretizing this component does
not significantly reduce the objective value. For CVaR objectives, these results imply that existing
algorithms can be used to compute the optimal risk-sensitive policy [13].
Augmented state space. We show there exists an optimal policy $\pi^*_t(a_t \mid y_t, s_t)$ that only depends
on the current state $s_t$ and the cumulative reward $y_t = J(\xi_t) = \sum_{\tau=1}^{t-1} r_\tau$ obtained so far. To this end, let
$$\mathcal{Z}_t(y, s) = \{\xi \in \mathcal{Z}_t \mid J(\xi) \le y \wedge S(\xi) = s\}$$
be the set of length-$t$ histories $\xi$ with cumulative reward at most $y$ so far and current state $s$. For any
history-dependent policy $\pi$, define the alternative policy $\tilde{\pi}$ by
$$\tilde{\pi}_t(a_t \mid \xi_t) = \mathbb{E}_{\Xi^{(\pi)}_t}\left[\pi_t(a_t \mid \Xi^{(\pi)}_t) \,\middle|\, \Xi^{(\pi)}_t \in \mathcal{Z}_t(J(\xi_t), s_t)\right].$$
Since $\tilde{\pi}_t$ depends on $\xi_t$ only through $y_t = J(\xi_t)$ and $s_t$, we can write $\tilde{\pi}_t(a_t \mid \xi_t) = \tilde{\pi}_t(a_t \mid y_t, s_t)$.
Theorem 3.1. For any policy $\pi$, we have $\Phi(\tilde{\pi}) = \Phi(\pi)$.
We give a proof in Appendix A. In particular, given any optimal policy $\pi^*$, we have $\Phi(\tilde{\pi}^*) = \Phi(\pi^*)$;
thus, we have $\tilde{\pi}^* \in \arg\max_\pi \Phi(\pi)$. Finally, we note that this result has already been shown for
CVaR objectives [13]; our theorem generalizes the existing result to any risk-sensitive objective that
can be expressed as a weighted integral of the quantile function.
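For intuition, $\tilde{\pi}$ can also be estimated from logged rollouts of $\pi$ by Monte Carlo: average the action distribution of $\pi$ over histories that share the same step, current state, and (bucketed) cumulative reward. The sketch below illustrates this averaging; the exact-match bucketing of $y$ is a simplification of the conditioning event used in the proof, and the data layout is an assumption for illustration.

```python
# Monte Carlo sketch of collapsing a history-dependent policy onto (y, s).
from collections import defaultdict
import numpy as np

def collapse_policy(logs, n_actions, y_bucket=0.1):
    """logs: iterable of (t, s_t, y_t, probs) with probs = pi_t(. | xi_t)."""
    sums = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(int)
    for t, s, y, probs in logs:
        key = (t, s, round(y / y_bucket))      # bucket the cumulative reward
        sums[key] += np.asarray(probs)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}   # ~ pi_tilde_t(. | y, s)
```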
Augmented MDP. As a consequence of Theorem 3.1, it suffices to consider the augmented MDP
$\tilde{M} = (\tilde{S}, A, \tilde{D}, \tilde{P}, \tilde{P}_R, T)$. First, $\tilde{S} = S \times \mathbb{R}$ is the augmented state space; for a state $(s, y) \in \tilde{S}$, the
first component encodes the current state and the second encodes the cumulative reward so far. The
initial state distribution is the probability measure
$$\tilde{D}((s, y)) = D(s) \cdot \delta_0(y),$$
where $\delta_0$ is the Dirac delta measure placing all probability mass on $y = 0$ (i.e., the cumulative reward
so far is initially zero). The transitions are given by the product measure
$$\tilde{P}((s', y') \mid (s, y), a) = P(s' \mid s, a) \cdot P_{R(s,a)}(y' - y),$$
i.e., the second component of the state is incremented as $y' = y + r$, where $r$ is the reward
achieved in the original MDP. Finally, rewards are now only provided on the final step:
$$\tilde{P}_{R_t((s,y),a)}(r) = \begin{cases} \delta_y(r) & \text{if } t = T \\ \delta_0(r) & \text{otherwise}, \end{cases}$$
i.e., the reward at the end of a rollout is simply the cumulative reward so far, as encoded by the
second component of the state, and all other rewards are zero. By Theorem 3.1, it suffices to compute the optimal policy for $\tilde{M}$ over
history-independent policies $\pi_t(a_t \mid \tilde{s}_t)$:
$$\max_{\pi \in \Pi_{\mathrm{ind}}} \Phi_{\tilde{M}}(\pi) = \max_\pi \Phi_M(\pi),$$
where $\Pi_{\mathrm{ind}}$ is the set of history-independent policies. Once we have $\pi^*_{\tilde{M}}$, we can use it in $M$ by
defining $\pi^*_M(a \mid \xi) = \pi^*_{\tilde{M}}(a \mid J(\xi), S(\xi))$.
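The augmentation itself is straightforward to realize as a wrapper around a simulator: track the cumulative reward alongside the state and emit the return as the only (terminal) reward. The sketch below assumes a generic `reset`/`step` interface and emits the terminal reward when the episode ends, a mild simplification of the measure-theoretic definition above.

```python
# State augmentation wrapper: states become (s, y) and the only reward is the
# cumulative return, emitted when the episode terminates (illustrative sketch).
class AugmentedEnv:
    def __init__(self, env):
        self.env = env

    def reset(self):
        self.y = 0.0
        return (self.env.reset(), self.y)       # augmented initial state (s, 0)

    def step(self, action):
        s_next, r, done = self.env.step(action)
        self.y += r                             # y' = y + r
        out_r = self.y if done else 0.0         # reward only at the final step
        return (s_next, self.y), out_r, done
```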
Discretized augmented MDP. Planning over $\tilde{M}$ is complicated by the fact that the second component
of its state space is continuous. Thus, we consider an $\eta$-discretization of $\tilde{M}$, for some $\eta \in \mathbb{R}_{>0}$.
To this end, we modify the reward function so that it only produces rewards in $\eta \cdot \mathbb{N} = \{\eta \cdot n \mid n \in \mathbb{N}\}$,
by always rounding the reward up. Then, sums of these rewards are contained in $\eta \cdot \mathbb{N}$, so we can
replace the second component of $\tilde{S}$ with $\eta \cdot \mathbb{N}$. In particular, we consider the discretized MDP
$\hat{M} = (\hat{S}, A, \tilde{D}, \hat{P}, \tilde{P}_R, T)$, where $\hat{S} = S \times (\eta \cdot \mathbb{N})$, and the transition probability measure is
$$\hat{P}((s', y') \mid (s, y), a) = P(s' \mid s, a) \cdot (P_{R(s,a)} \circ \phi^{-1})(y' - y),$$
where $\phi(r) = \eta \cdot \lceil r/\eta \rceil$. That is, $P_{R(s,a)}$ is replaced with the pushforward measure $P_{R(s,a)} \circ \phi^{-1}$,
which gives reward $\eta \cdot i$ with probability $P_{R(s,a)}[\eta \cdot (i-1) < r \le \eta \cdot i]$.
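Concretely, the discretization replaces each reward by its rounded-up value on the grid $\eta \cdot \mathbb{N}$; for a discrete reward distribution, the pushforward can be formed directly, as in the sketch below (helper names are illustrative).

```python
# Round rewards up to the grid eta*N and push a discrete reward PMF forward
# through the rounding map phi (illustrative sketch).
import math
from collections import defaultdict

def round_up(r, eta):
    return eta * math.ceil(r / eta)             # phi(r) = eta * ceil(r / eta)

def pushforward(reward_pmf, eta):
    """reward_pmf: dict {reward: probability}; returns the discretized PMF."""
    out = defaultdict(float)
    for r, p in reward_pmf.items():
        out[round_up(r, eta)] += p
    return dict(out)

# e.g. rewards 0.03, 0.26, 1.0 map to grid points 0.1, 0.3, 1.0
print(pushforward({0.03: 0.5, 0.26: 0.3, 1.0: 0.2}, eta=0.1))
```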
Now, we prove that the optimal policy $\pi^*_{\hat{M}}$ for the discretized augmented MDP $\hat{M}$ achieves objective
value close to that of the optimal policy $\pi^*_M$ for the original MDP $M$. Importantly, we measure the
performance of both policies using the objective $\Phi_M$ of the original MDP $M$. To do so, we need a
way to use $\pi^*_{\hat{M}}$ in $M$. Note that $\pi^*_{\hat{M}}$ depends only on the state $\hat{s} = (s, y)$, where $s \in S$ is a state of
the original MDP $M$, and $y \in \eta \cdot \mathbb{N}$ is a discretized version of the cumulative reward obtained so far.
Thus, we can run $\pi^*_{\hat{M}}$ in $M$ by simply rounding the reward $r_t$ at each step $t$ up to the nearest value
$\hat{r}_t \in \eta \cdot \mathbb{N}$, i.e., $\hat{r}_t = \phi(r_t)$; then, we increment the internal state as $y_t = y_{t-1} + \hat{r}_t$. We call the
resulting policy $\bar{\pi}_M$ the version of $\pi^*_{\hat{M}}$ adapted to $M$. Then, our next result says that the performance
of $\bar{\pi}_M$ is not much worse than that of $\pi^*_M$.
Theorem 3.2. Let $\pi^*_{\hat{M}}$ be the optimal policy for the discretized augmented MDP $\hat{M}$, let $\bar{\pi}_M$ be the
policy $\pi^*_{\hat{M}}$ adapted to the original MDP $M$, and let $\pi^*_M$ be the optimal (history-dependent) policy
for the original MDP $M$. Then, we have
$$\Phi_M(\bar{\pi}_M) \ge \Phi_M(\pi^*_M) - \eta T.$$
We give a proof in Appendix B. Note that we can set $\eta$ to be sufficiently small to achieve any desired
error level (i.e., choose $\eta = \epsilon/T$, where $\epsilon$ is the desired error). The only cost is in computation time.
Note that the number of states in $\hat{M}$ is still infinite; however, since the cumulative return satisfies
$y \in [0, T]$, it suffices to take $\hat{S} = S \times (\eta \cdot [\lceil T/\eta \rceil])$; then, $\hat{M}$ has $|\hat{S}| = |S| \cdot \lceil T/\eta \rceil$ states.
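Adapting $\pi^*_{\hat{M}}$ to $M$ amounts to the bookkeeping described above: round each observed reward up to the $\eta$-grid and feed the rounded cumulative reward to the policy, while the true (unrounded) return accumulates. The sketch below assumes placeholder `policy_hat` and `env` interfaces.

```python
# Run the discretized-MDP policy in the original MDP by rounding rewards up
# to the eta-grid for the policy's internal state (illustrative sketch).
import math

def run_adapted(policy_hat, env, T, eta):
    s = env.reset()
    y = 0.0                                     # rounded cumulative reward
    true_return = 0.0
    for t in range(1, T + 1):
        a = policy_hat(t, s, y)                 # pi*_Mhat acts on the pair (s, y)
        s, r, _ = env.step(a)
        true_return += r                        # objective is evaluated on raw rewards
        y += eta * math.ceil(r / eta)           # y_t = y_{t-1} + phi(r_t)
    return true_return
```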
4 Upper Confidence Bound Algorithm
Here, we present our upper confidence bound (UCB) algorithm (summarized in Algorithm 1). At a
high level, for each episode, our algorithm constructs an estimate $M^{(k)}$ of the underlying MDP $M$
based on the prior episodes $i \in [k-1]$; to ensure exploration, it optimistically inflates the estimates of
the transition probabilities $P$ and the reward probability measure $P_R$. Then, it plans in $M^{(k)}$ to obtain
an optimistic policy $\pi^{(k)} = \pi^*_{M^{(k)}}$, and uses this policy to act in the MDP for episode $k$.
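The overall loop has the familiar optimism-then-plan structure; the skeleton below is a sketch in which `build_empirical`, `make_optimistic`, and `plan` stand in for the constructions below and the planner of Assumption 2.3.

```python
# High-level UCB loop: estimate, optimistically inflate, plan, act (sketch).
def ucb_risk_sensitive(env, K, build_empirical, make_optimistic, plan):
    data = []                                                # episodes observed so far
    for k in range(K):
        mdp_hat = make_optimistic(build_empirical(data), k)  # optimistic M^(k)
        policy_k = plan(mdp_hat)                             # pi^(k) = pi*_{M^(k)}
        data.append(env.rollout(policy_k))                   # act for episode k
    return data
```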
Optimistic MDP. We define $M^{(k)}$ as follows. Without loss of generality, we assume $S$ includes a distinguished
state $s^\star$ with rewards $F_{R(s^\star, a)}(r) = \mathbb{1}(r \ge 1)$ (i.e., $s^\star$ achieves the maximum reward $r = 1$ with
probability one), and transitions $P(s^\star \mid s, a) = \mathbb{1}(s = s^\star)$ and $P(s' \mid s^\star, a) = \mathbb{1}(s' = s^\star)$ (i.e., $s^\star$ is
inaccessible from other states and only transitions to itself). Our construction of $\hat{M}^{(k)}$ uses $s^\star$ for
optimism. Now, let $\tilde{M}^{(k)}$ be the MDP using the empirical estimates of the transitions and rewards:
$$\tilde{P}^{(k)}(s' \mid s, a) = \frac{N_k(s, a, s')}{N_k(s, a)}, \qquad F_{\tilde{R}^{(k)}(s,a)}(r) = \frac{1}{N_k(s, a)} \sum_{i=1}^{k-1} \sum_{t=1}^{T} \mathbb{1}(r \ge r_{i,t}) \cdot \mathbb{1}(s_{i,t} = s \wedge a_{i,t} = a),$$
where $N_k(s, a, s')$ and $N_k(s, a)$ denote the number of times $(s, a, s')$ and $(s, a)$, respectively, have
been observed during the first $k - 1$ episodes.
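The empirical estimates are standard counts together with an empirical CDF; the sketch below computes them from logged episodes represented as lists of $(s, a, r, s')$ tuples (an illustrative data layout, not the paper's notation).

```python
# Empirical transition probabilities and reward CDF from past episodes (sketch).
import numpy as np

def empirical_estimates(episodes, S, A):
    n_sas = np.zeros((S, A, S))
    rewards = [[[] for _ in range(A)] for _ in range(S)]
    for ep in episodes:
        for s, a, r, s_next in ep:
            n_sas[s, a, s_next] += 1
            rewards[s][a].append(r)
    n_sa = n_sas.sum(axis=2)
    P_tilde = n_sas / np.maximum(n_sa[:, :, None], 1)    # empirical P^(k)(s' | s, a)

    def F_tilde(s, a, r):                                # empirical reward CDF
        obs = rewards[s][a]
        return sum(x <= r for x in obs) / max(len(obs), 1)

    return P_tilde, F_tilde, n_sa
```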
Then, let $\hat{M}^{(k)}$ be the optimistic MDP; in particular, its transitions
$$\hat{P}^{(k)}(s' \mid s, a) = \begin{cases} \mathbb{1}(s' = s^\star) & \text{if } s = s^\star \\ 1 - \sum_{s'' \in S \setminus \{s^\star\}} \hat{P}^{(k)}(s'' \mid s, a) & \text{if } s' = s^\star \\ \max\left\{\tilde{P}^{(k)}(s' \mid s, a) - \epsilon^{(k)}_P(s, a),\, 0\right\} & \text{otherwise} \end{cases}$$
transition to the optimistic state $s^\star$ when uncertain, and its rewards
$$F_{\hat{R}^{(k)}(s,a)}(r) = \begin{cases} \mathbb{1}(r \ge 1) & \text{if } s = s^\star \\ 1 & \text{if } r \ge 1 \\ \max\left\{F_{\tilde{R}^{(k)}(s,a)}(r) - \epsilon^{(k)}_R(s, a),\, 0\right\} & \text{otherwise} \end{cases}$$
optimistically shift the reward CDF downwards. Here, $\epsilon^{(k)}_P(s, a)$ and $\epsilon^{(k)}_R(s, a)$ are defined in
Section 5; intuitively, they are high-probability upper bounds on the errors of the empirical estimates
$\tilde{P}^{(k)}(\cdot \mid s, a)$ and $F_{\tilde{R}^{(k)}(s,a)}$ of the transitions and rewards, respectively.
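The two optimistic shifts can be written down directly; the sketch below shows a single $(s, a)$ pair, with `eps_P` and `eps_R` standing in for the confidence widths of Section 5 (a sketch of the construction under these assumptions, not a verified implementation).

```python
# Optimistic inflation for a single (s, a) pair: shrink each transition
# probability by eps_P and send the leftover mass to s_star; shift the
# empirical reward CDF down by eps_R and clip at zero (illustrative sketch).
import numpy as np

def optimistic_transitions(P_tilde_sa, s_star, eps_P):
    P_hat = np.maximum(P_tilde_sa - eps_P, 0.0)              # shrink other states
    P_hat[s_star] = 1.0 - np.sum(np.delete(P_hat, s_star))   # leftover mass -> s_star
    return P_hat

def optimistic_reward_cdf(F_tilde_sa, eps_R):
    """F_tilde_sa: callable r -> empirical CDF value; returns the shifted CDF."""
    return lambda r: 1.0 if r >= 1.0 else max(F_tilde_sa(r) - eps_R, 0.0)
```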