Regret Bounds for Risk-Sensitive
Reinforcement Learning
Osbert Bastani
University of Pennsylvania
obastani@seas.upenn.edu
Yecheng Jason Ma
University of Pennsylvania
jasonyma@seas.upenn.edu
Estelle Shen
University of Pennsylvania
pixna@sas.upenn.edu
Wanqiao Xu
Stanford University
wanqiaox@stanford.edu
Abstract
In safety-critical applications of reinforcement learning such as healthcare and
robotics, it is often desirable to optimize risk-sensitive objectives that account for
tail outcomes rather than expected reward. We prove the first regret bounds for
reinforcement learning under a general class of risk-sensitive objectives including
the popular CVaR objective. Our theory is based on a novel characterization of the
CVaR objective as well as a novel optimistic MDP construction.
1 Introduction
There has been recent interest in risk-sensitive reinforcement learning, which replaces the usual
expected reward objective with one that accounts for variation in possible outcomes. One of the
most popular risk-sensitive objectives is the conditional value-at-risk (CVaR) objective [1, 2, 3, 4],
which is the average risk at some tail of the distribution of returns (i.e., cumulative rewards) under a
given policy [5, 6]. More generally, we consider a broad class of objectives in the form of a weighted
integral of quantiles of the return distribution, of which CVaR is a special case.
A key question is providing regret bounds for risk-sensitive reinforcement learning. While there has
been some work studying this question, it has focused on a specific objective called the entropic risk
measure [7, 8], leaving open the question of bounds for more general risk-sensitive objectives. There
has also been work on optimistic exploration for CVaR [9], but without any regret bounds.
We provide the first regret bounds for risk-sensitive reinforcement learning with objectives of the form
$$\Phi(\pi) = \int_0^1 F^{\dagger}_{Z(\pi)}(\tau) \cdot dG(\tau), \qquad (1)$$
where $Z(\pi)$ is the random variable encoding the return of policy $\pi$, $F^{\dagger}_{Z(\pi)}$ is its quantile function
(roughly speaking, the inverse CDF), and $G$ is a weighting function over the quantiles. This class
captures a broad range of useful objectives, and has been studied in prior work [10, 4].
We focus on the episodic setting, where the agent interacts with the environment, modeled by a
Markov decision process (MDP), over a fixed sequence of episodes. Its goal is to minimize the
regret, i.e., the gap between the objective value it achieves and that of the optimal policy. Our
approach is based on the upper confidence bound strategy [11, 12], which makes decisions according
to an optimistic estimate of the MDP. We prove that this algorithm (denoted $\mathcal{A}$) has regret
$$\mathrm{regret}(\mathcal{A}) = \tilde{O}\big(T^2 \cdot L_G \cdot |S|^{3/2} \cdot |A| \cdot \sqrt{K}\big),$$
where $T$ is the length of a single episode, $L_G$ is the Lipschitz constant of the weighting function $G$,
$|S|$ is the number of states in the MDP, $|A|$ is the number of actions, and $K$ is the number of episodes
(Theorem 4.1). Importantly, it achieves the optimal rate $\sqrt{K}$ achievable for typical expected return
objectives (which is a lower bound in our setting, since expected return is an objective in the class we
consider, obtained by taking $G(\tau) = \tau$). For CVaR objectives, we have $L_G = 1/\alpha$, where $\alpha$ is the size of the tail
considered; e.g., when $\alpha$ is small, the objective averages over outliers with particularly small return.
The main challenge in proving our result is bounding the gap between the objective value for the
estimated MDP and for the true MDP. In particular, even if we have a uniform bound $\|F_{\hat{Z}(\pi)} - F_{Z(\pi)}\|_\infty$
on the distance between the CDFs of the estimated return $\hat{Z}(\pi)$ and the true return $Z(\pi)$, we need to translate this into a
bound on the corresponding objective values. To do so, we prove that equivalently, we have
$$\Phi(\pi) = 2T - \int_0^{2T} G(F_{Z(\pi)}(x)) \cdot dx.$$
This equivalent expression for $\Phi$ follows by variable substitution and integration by parts when $F_{Z(\pi)}$
is invertible (so $F^{\dagger}_{Z(\pi)}(\tau) = F^{-1}_{Z(\pi)}(\tau)$), but the general case requires significantly more care. We
show that it holds for an arbitrary CDF $F_{Z(\pi)}$.
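As a concrete check, the following sketch numerically compares the two expressions for $\Phi(\pi)$ on a toy return distribution bounded in $[0, T]$, using the CVaR weighting $G(\tau) = \min\{\tau/\alpha, 1\}$; it is a minimal illustration only, and the sample distribution and grid sizes are arbitrary choices rather than part of the analysis.

```python
# Numerically compare the quantile-integral and CDF-integral forms of Phi
# for a toy return sample bounded in [0, T] (illustrative sketch only).
import numpy as np

T = 5.0
rng = np.random.default_rng(0)
returns = np.clip(rng.normal(3.0, 1.0, 200_000), 0.0, T)   # toy sample of Z(pi)

alpha = 0.25
G = lambda tau: np.minimum(tau / alpha, 1.0)                # alpha-CVaR weighting

# Phi = int_0^1 F_dagger(tau) dG(tau): Riemann-Stieltjes sum over quantiles.
taus = np.linspace(0.0, 1.0, 10_001)
quantiles = np.quantile(returns, taus)                      # empirical quantile function
lhs = np.sum(quantiles[:-1] * np.diff(G(taus)))

# Phi = 2T - int_0^{2T} G(F(x)) dx: Riemann sum over the empirical CDF.
xs = np.linspace(0.0, 2 * T, 20_001)
cdf = np.searchsorted(np.sort(returns), xs, side="right") / len(returns)
rhs = 2 * T - np.sum(G(cdf[:-1]) * np.diff(xs))

print(lhs, rhs)   # the two estimates agree up to discretization error
```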
In addition to our regret bound, we provide several other useful results for MDPs with risk-sensitive
objectives. In particular, optimal policies for risk-sensitive objectives may be non-Markov. For
CVaR objectives, it is known that the optimal policy only needs to depend on the cumulative return
accrued so far [13]. We prove that this holds in general for objectives of the form (1) (Theorem 3.1).
Furthermore, the cumulative return so far is a continuous component; we prove that discretizing this
component yields an arbitrarily close approximation of the true MDP (Theorem 3.2).
Related work. To the best of our knowledge, the only prior work on regret bounds for risk-sensitive
reinforcement learning is specific to the entropic risk objective [7, 8]:
$$J(\pi) = \frac{1}{\beta} \log \mathbb{E}_{Z(\pi)}\left[e^{\beta Z(\pi)}\right],$$
where $\beta \in \mathbb{R} \setminus \{0\}$ is a hyperparameter. As $\beta \to 0$, this objective recovers the expected return objective;
for $\beta < 0$, it encourages risk aversion by upweighting negative returns; and for $\beta > 0$, it encourages
risk-seeking behavior by upweighting positive returns. This objective is amenable to theoretical
analysis since its value function satisfies a variant of the Bellman equation called the exponential
Bellman equation; however, it is a narrow family of risk measures and is not widely used in practice.

In contrast, we focus on a much broader class of risk measures including the popular CVaR objective [1],
which is used to minimize tail losses. To the best of our knowledge, we provide the first
regret bounds for the CVaR objective and for the wide range of objectives given by (1).
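For reference, the entropic risk objective can be estimated directly from return samples; the short sketch below (with an arbitrary toy distribution, purely for illustration) shows how the sign of $\beta$ trades off risk aversion against risk seeking.

```python
# Estimate the entropic risk J = (1/beta) log E[exp(beta * Z)] from samples,
# computed stably via a log-sum-exp shift (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(10.0, 3.0, 100_000)   # toy return sample

def entropic_risk(z, beta):
    m = np.max(beta * z)
    return (m + np.log(np.mean(np.exp(beta * z - m)))) / beta

for beta in (-0.5, -0.1, 1e-6, 0.1, 0.5):
    # beta < 0 lands below the sample mean (risk averse); beta > 0 lands above it.
    print(beta, entropic_risk(z, beta))
```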
2 Problem Formulation
Markov decision process. We consider a Markov decision process (MDP) $M = (S, A, D, P, P_R, T)$,
with finite state space $S$, finite action space $A$, initial state distribution $D(s)$, finite time horizon $T$,
transition probabilities $P(s' \mid s, a)$, and reward measure $P_{R(s,a)}$; without loss of generality, we
assume $r \in [0, 1]$ with probability one. A history is a sequence
$$\xi \in \mathcal{Z} = \bigcup_{t=1}^{T} \mathcal{Z}_t, \qquad \text{where} \quad \mathcal{Z}_t = (S \times A \times \mathbb{R})^{t-1} \times S.$$
Intuitively, a history captures the interaction between an agent and $M$ up to step $t$. We consider
stochastic, time-varying, history-dependent policies $\pi_t(a_t \mid \xi_t)$, where $t$ is the time step. Given $\pi$, the
history $\Xi^{(\pi)}_t$ generated by $\pi$ up to step $t$ is a random variable with probability measure
$$P_{\Xi^{(\pi)}_t}(\xi_t) = \begin{cases} D(s_1) & \text{if } t = 1 \\ P_{\Xi^{(\pi)}_{t-1}}(\xi_{t-1}) \cdot \pi_{t-1}(a_{t-1} \mid \xi_{t-1}) \cdot P_{R(s_{t-1}, a_{t-1})}(r_{t-1}) \cdot P(s_t \mid s_{t-1}, a_{t-1}) & \text{otherwise}, \end{cases}$$
where for all $\tau \in [T]$ we use the notation $\xi_\tau = ((s_1, a_1, r_1), \ldots, (s_{\tau-1}, a_{\tau-1}, r_{\tau-1}), s_\tau)$.
Finally, an episode (or rollout) is a history $\xi \in \mathcal{Z}_T$ of length $T$ generated by a given policy $\pi$.
Bellman equation. The return of $\pi$ at step $t$ is the random variable $(Z^{(\pi)}_t(\xi_t))(\xi_T) = \sum_{\tau=t}^{T} r_\tau$,
where $\xi_T \sim P_{\Xi^{(\pi)}_T}(\cdot \mid \Xi^{(\pi)}_t = \xi_t)$, i.e., it is the reward accumulated from step $t$ onward given that the
current history is $\xi_t$. Defining $Z^{(\pi)}_{T+1}(\xi) = 0$, the distributional Bellman equation [14, 9] is
$$F_{Z^{(\pi)}_t(\xi)}(x) = \sum_{a \in A} \pi_t(a \mid \xi) \sum_{s' \in S} P(s' \mid S(\xi), a) \int F_{Z^{(\pi)}_{t+1}(\xi \oplus (a, r, s'))}(x - r) \cdot dP_{R(S(\xi), a)}(r),$$
where $S(\xi) = s$ for $\xi = (\ldots, s)$ is the current state in history $\xi$, $\xi \oplus (a, r, s')$ denotes the history $\xi$
extended by the transition $(a, r, s')$, and $F_X$ is the cumulative distribution function (CDF) of random
variable $X$. Finally, the cumulative return of $\pi$ is $Z^{(\pi)} = Z^{(\pi)}_1(\xi)$, where $\xi = (s) \in \mathcal{Z}_1$ for $s \sim D$ is
the initial history; in particular, we have
$$F_{Z^{(\pi)}}(x) = \int F_{Z^{(\pi)}_1((s))}(x) \cdot dD(s).$$
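To make the recursion concrete, the sketch below performs one distributional Bellman backup for a tabular MDP with integer-valued rewards and a Markov (state-indexed) policy, representing each return distribution by a probability mass function over $\{0, \ldots, T\}$ rather than a CDF; it is an illustration of the update, not the implementation used in our analysis.

```python
# One distributional Bellman backup on a discrete return support (sketch).
import numpy as np

def distributional_backup(pmf_next, policy_t, P, R_pmf, T):
    """pmf_next[s']   : PMF of Z_{t+1} from state s', shape (S, T+1)
       policy_t[s, a] : pi_t(a | s), shape (S, A)
       P[s, a, s']    : transition probabilities, shape (S, A, S)
       R_pmf[s, a, r] : PMF of the reward over {0, ..., r_max}, shape (S, A, r_max+1)
    """
    S, A, _ = P.shape
    pmf_t = np.zeros((S, T + 1))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for r, pr in enumerate(R_pmf[s, a]):
                    if pr == 0.0:
                        continue
                    # Z_t = r + Z_{t+1}: shift the next-step PMF by r, clip overflow at T.
                    shifted = np.zeros(T + 1)
                    shifted[r:] = pmf_next[s2][: T + 1 - r]
                    shifted[T] += pmf_next[s2][T + 1 - r:].sum()
                    pmf_t[s] += policy_t[s, a] * P[s, a, s2] * pr * shifted
    return pmf_t
```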
Risk-sensitive objective. The quantile function of a random variable $X$ is
$$F^{\dagger}_X(\tau) = \inf\{x \in \mathbb{R} \mid F_X(x) \ge \tau\}.$$
Note that if $F_X$ is strictly monotone, then it is invertible and we have $F^{\dagger}_X(\tau) = F^{-1}_X(\tau)$. Now, our
objective is given by the Riemann–Stieltjes integral
$$\Phi_M(\pi) = \int_0^1 F^{\dagger}_{Z(\pi)}(\tau) \cdot dG(\tau),$$
where $G(\tau)$ is a given CDF over quantiles $\tau \in [0, 1]$. This objective was originally studied in [15]
for the reinforcement learning setting. For example, choosing $G(\tau) = \min\{\tau/\alpha, 1\}$ (i.e., the CDF
of the distribution $\mathrm{Uniform}([0, \alpha])$) for $\alpha \in (0, 1]$ yields the $\alpha$-conditional value at risk (CVaR)
objective; furthermore, taking $\alpha = 1$ yields the usual expected cumulative reward objective. In
addition, choosing $G(\tau) = \mathbb{1}(\tau \ge \alpha)$ for $\alpha \in [0, 1]$ yields the $\alpha$ value at risk (VaR) objective. Other
risk-sensitive objectives can also be captured in this form, for example the Wang measure [16] and
the cumulative probability weighting (CPW) metric [17]. We call any policy
$$\pi^*_M \in \arg\max_\pi \Phi_M(\pi)$$
an optimal policy, i.e., it maximizes the given objective for $M$.
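For intuition, the sketch below evaluates $\Phi$ on an empirical sample of returns for several choices of $G$; the helper names and the toy sample are illustrative assumptions, not part of the formal setup.

```python
# Approximate Phi = int_0^1 F_dagger(tau) dG(tau) by a Riemann-Stieltjes sum
# over the empirical quantile function, for several weightings G (sketch).
import numpy as np

def phi(returns, G, n_grid=10_000):
    taus = np.linspace(0.0, 1.0, n_grid + 1)
    quantiles = np.quantile(returns, taus)        # empirical quantile function
    return float(np.sum(quantiles[:-1] * np.diff(G(taus))))

alpha = 0.1
cvar_G = lambda t: np.minimum(t / alpha, 1.0)     # alpha-CVaR
var_G  = lambda t: (t >= alpha).astype(float)     # alpha-VaR (step at alpha)
mean_G = lambda t: t                              # expected return (alpha = 1)

rng = np.random.default_rng(0)
z = rng.exponential(5.0, 100_000)                 # toy return sample
print(phi(z, cvar_G), phi(z, var_G), phi(z, mean_G), z.mean())
```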
Assumptions. First, we have the following assumption on the quantile function of $Z(\pi)$:

Assumption 2.1. $F^{\dagger}_{Z(\pi)}(1) = T$.

Since $T$ is the maximum cumulative reward attainable in an episode, this assumption says that the maximum
reward is attained with some nontrivial probability. This assumption is very minor; for any given
MDP $M$, we can modify $M$ to include a path achieving reward $T$ with arbitrarily low probability.

Assumption 2.2. $G$ is $L_G$-Lipschitz continuous for some $L_G \in \mathbb{R}_{>0}$, and $G(0) = 0$.

For example, for the $\alpha$-CVaR objective, we have $L_G = 1/\alpha$.

Assumption 2.3. We are given an algorithm for computing $\pi^*_M$ for a given MDP $M$.

For CVaR objectives, existing algorithms [13] can compute $\pi^*_M$ with any desired approximation error.
For completeness, we give a formal description of the procedure in Appendix D. When unambiguous,
we drop the dependence on $M$ and simply write $\pi^*$.
Finally, our goal is to learn while interacting with the MDP $M$ across a fixed number of episodes $K$.
In particular, at the beginning of each episode $k \in [K]$, our algorithm chooses a policy $\pi^{(k)} = \mathcal{A}(H_k)$,
where $H_k = \{\xi^{(\kappa)}_T\}_{\kappa=1}^{k-1}$ is the random set of episodes observed so far, to use for the duration of
episode $k$. Then, our goal is to design an algorithm $\mathcal{A}$ that minimizes the regret, which measures
the expected sub-optimality with respect to $\pi^*$:
$$\mathrm{regret}(\mathcal{A}) = \mathbb{E}\left[\sum_{k \in [K]} \left(\Phi(\pi^*) - \Phi(\pi^{(k)})\right)\right].$$
Finally, for simplicity, we assume that the initial state distribution $D$ is known; in practice, we can
remove this assumption using a standard strategy.
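The interaction protocol and the regret bookkeeping can be summarized by the following skeleton; it is a sketch only, the `alg`, `env`, and `phi` interfaces are placeholders, and the realized sum below is the quantity whose expectation defines the regret.

```python
# Episodic interaction protocol with per-episode regret accumulation (sketch).
def run(alg, env, K, phi, phi_opt):
    """alg.plan(history) returns pi^(k); env.rollout(policy) returns one episode;
       phi(policy) evaluates Phi(pi); phi_opt = Phi(pi*)."""
    history, regret = [], 0.0
    for k in range(K):
        policy_k = alg.plan(history)        # pi^(k) = A(H_k)
        episode = env.rollout(policy_k)     # act with pi^(k) for episode k
        history.append(episode)             # H_{k+1} adds the new episode xi_T
        regret += phi_opt - phi(policy_k)   # realized per-episode sub-optimality
    return regret
```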
3 Optimal Risk-Sensitive Policies
In this section, we characterize properties of the optimal risk-sensitive policy $\pi^*_M$. First, we show
that it suffices to consider policies that depend on the current state and the cumulative reward
obtained so far, rather than the entire history. Second, the cumulative reward is a continuous quantity,
making it difficult to compute the optimal policy; we prove that discretizing this component does
not significantly reduce the objective value. For CVaR objectives, these results imply that existing
algorithms can be used to compute the optimal risk-sensitive policy [13].
Augmented state space. We show there exists an optimal policy $\pi^*_t(a_t \mid y_t, s_t)$ that only depends
on the current state $s_t$ and the cumulative reward $y_t = J(\xi_t) = \sum_{\tau=1}^{t-1} r_\tau$ obtained so far. To this end, let
$$\mathcal{Z}_t(y, s) = \{\xi \in \mathcal{Z}_t \mid J(\xi) \le y \wedge S(\xi) = s\}$$
be the set of length-$t$ histories $\xi$ with cumulative reward at most $y$ so far and current state $s$. For any
history-dependent policy $\pi$, define the alternative policy $\tilde{\pi}$ by
$$\tilde{\pi}_t(a_t \mid \xi_t) = \mathbb{E}_{\Xi^{(\pi)}_t}\left[\pi_t(a_t \mid \Xi^{(\pi)}_t) \,\middle|\, \Xi^{(\pi)}_t \in \mathcal{Z}_t(J(\xi_t), s_t)\right].$$
Since $\tilde{\pi}_t$ depends on $\xi_t$ only through $y_t = J(\xi_t)$ and $s_t$, we can write $\tilde{\pi}_t(a_t \mid \xi_t) = \tilde{\pi}_t(a_t \mid y_t, s_t)$.
Theorem 3.1. For any policy $\pi$, we have $\Phi(\tilde{\pi}) = \Phi(\pi)$.
We give a proof in Appendix A. In particular, given any optimal policy $\pi^*$, we have $\Phi(\tilde{\pi}^*) = \Phi(\pi^*)$;
thus, we have $\tilde{\pi}^* \in \arg\max_\pi \Phi(\pi)$. Finally, we note that this result has already been shown for
CVaR objectives [13]; our theorem generalizes the existing result to any risk-sensitive objective that
can be expressed as a weighted integral of the quantile function.
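For intuition, $\tilde{\pi}$ can also be estimated from logged rollouts of $\pi$ by Monte Carlo: average the action distribution of $\pi$ over histories that share the same step, current state, and (bucketed) cumulative reward. The sketch below illustrates this averaging; the exact-match bucketing of $y$ is a simplification of the conditioning event used in the proof, and the data layout is an assumption for illustration.

```python
# Monte Carlo sketch of collapsing a history-dependent policy onto (y, s).
from collections import defaultdict
import numpy as np

def collapse_policy(logs, n_actions, y_bucket=0.1):
    """logs: iterable of (t, s_t, y_t, probs) with probs = pi_t(. | xi_t)."""
    sums = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(int)
    for t, s, y, probs in logs:
        key = (t, s, round(y / y_bucket))      # bucket the cumulative reward
        sums[key] += np.asarray(probs)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}   # ~ pi_tilde_t(. | y, s)
```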
Augmented MDP. As a consequence of Theorem 3.1, it suffices to consider the augmented MDP
$\tilde{M} = (\tilde{S}, A, \tilde{D}, \tilde{P}, \tilde{P}_R, T)$. First, $\tilde{S} = S \times \mathbb{R}$ is the augmented state space; for a state $(s, y) \in \tilde{S}$, the
first component encodes the current state and the second encodes the cumulative reward so far. The
initial state distribution is the probability measure
$$\tilde{D}((s, y)) = D(s) \cdot \delta_0(y),$$
where $\delta_0$ is the Dirac delta measure placing all probability mass on $y = 0$ (i.e., the cumulative reward
so far is initially zero). The transitions are given by the product measure
$$\tilde{P}((s', y') \mid (s, y), a) = P(s' \mid s, a) \cdot P_{R(s,a)}(y' - y),$$
i.e., the second component of the state is incremented as $y' = y + r$, where $r$ is the reward
achieved in the original MDP. Finally, rewards are now only provided on the final step:
$$\tilde{P}_{R_t((s,y),a)}(r) = \begin{cases} \delta_y(r) & \text{if } t = T \\ \delta_0(r) & \text{otherwise}, \end{cases}$$
i.e., the reward at the end of a rollout is simply the cumulative reward so far, as encoded by the
second component of the state, and all other rewards are zero. By Theorem 3.1, it suffices to compute the optimal policy for $\tilde{M}$ over
history-independent policies $\pi_t(a_t \mid \tilde{s}_t)$:
$$\max_{\pi \in \Pi_{\mathrm{ind}}} \Phi_{\tilde{M}}(\pi) = \max_\pi \Phi_M(\pi),$$
where $\Pi_{\mathrm{ind}}$ is the set of history-independent policies. Once we have $\pi^*_{\tilde{M}}$, we can use it in $M$ by
defining $\pi^*_M(a \mid \xi) = \pi^*_{\tilde{M}}(a \mid J(\xi), S(\xi))$.
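The augmentation itself is straightforward to realize as a wrapper around a simulator: track the cumulative reward alongside the state and emit the return as the only (terminal) reward. The sketch below assumes a generic `reset`/`step` interface and emits the terminal reward when the episode ends, a mild simplification of the measure-theoretic definition above.

```python
# State augmentation wrapper: states become (s, y) and the only reward is the
# cumulative return, emitted when the episode terminates (illustrative sketch).
class AugmentedEnv:
    def __init__(self, env):
        self.env = env

    def reset(self):
        self.y = 0.0
        return (self.env.reset(), self.y)       # augmented initial state (s, 0)

    def step(self, action):
        s_next, r, done = self.env.step(action)
        self.y += r                             # y' = y + r
        out_r = self.y if done else 0.0         # reward only at the final step
        return (s_next, self.y), out_r, done
```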
Discretized augmented MDP. Planning over $\tilde{M}$ is complicated by the fact that the second component
of its state space is continuous. Thus, we consider an $\eta$-discretization of $\tilde{M}$, for some $\eta \in \mathbb{R}_{>0}$.
To this end, we modify the reward function so that it only produces rewards in $\eta \cdot \mathbb{N} = \{\eta \cdot n \mid n \in \mathbb{N}\}$,
by always rounding the reward up. Then, sums of these rewards are contained in $\eta \cdot \mathbb{N}$, so we can
replace the second component of $\tilde{S}$ with $\eta \cdot \mathbb{N}$. In particular, we consider the discretized MDP
$\hat{M} = (\hat{S}, A, \tilde{D}, \hat{P}, \tilde{P}_R, T)$, where $\hat{S} = S \times (\eta \cdot \mathbb{N})$, and the transition probability measure is
$$\hat{P}((s', y') \mid (s, y), a) = P(s' \mid s, a) \cdot (P_{R(s,a)} \circ \phi^{-1})(y' - y),$$
where $\phi(r) = \eta \cdot \lceil r/\eta \rceil$. That is, $P_{R(s,a)}$ is replaced with the pushforward measure $P_{R(s,a)} \circ \phi^{-1}$,
which gives reward $\eta \cdot i$ with probability $P_{R(s,a)}[\eta \cdot (i-1) < r \le \eta \cdot i]$.
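Concretely, the discretization replaces each reward by its rounded-up value on the grid $\eta \cdot \mathbb{N}$; for a discrete reward distribution, the pushforward can be formed directly, as in the sketch below (helper names are illustrative).

```python
# Round rewards up to the grid eta*N and push a discrete reward PMF forward
# through the rounding map phi (illustrative sketch).
import math
from collections import defaultdict

def round_up(r, eta):
    return eta * math.ceil(r / eta)             # phi(r) = eta * ceil(r / eta)

def pushforward(reward_pmf, eta):
    """reward_pmf: dict {reward: probability}; returns the discretized PMF."""
    out = defaultdict(float)
    for r, p in reward_pmf.items():
        out[round_up(r, eta)] += p
    return dict(out)

# e.g. rewards 0.03, 0.26, 1.0 map to grid points 0.1, 0.3, 1.0
print(pushforward({0.03: 0.5, 0.26: 0.3, 1.0: 0.2}, eta=0.1))
```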
Now, we prove that the optimal policy $\pi^*_{\hat{M}}$ for the discretized augmented MDP $\hat{M}$ achieves objective
value close to that of the optimal policy $\pi^*_M$ for the original MDP $M$. Importantly, we measure the
performance of both policies using the objective $\Phi_M$ of the original MDP $M$. To do so, we need a
way to use $\pi^*_{\hat{M}}$ in $M$. Note that $\pi^*_{\hat{M}}$ depends only on the state $\hat{s} = (s, y)$, where $s \in S$ is a state of
the original MDP $M$, and $y \in \eta \cdot \mathbb{N}$ is a discretized version of the cumulative reward obtained so far.
Thus, we can run $\pi^*_{\hat{M}}$ in $M$ by simply rounding the reward $r_t$ at each step $t$ up to the nearest value
$\hat{r}_t \in \eta \cdot \mathbb{N}$, i.e., $\hat{r}_t = \phi(r_t)$; then, we increment the internal state as $y_t = y_{t-1} + \hat{r}_t$. We call the
resulting policy $\bar{\pi}_M$ the version of $\pi^*_{\hat{M}}$ adapted to $M$. Then, our next result says that the performance
of $\bar{\pi}_M$ is not much worse than that of $\pi^*_M$.
Theorem 3.2. Let $\pi^*_{\hat{M}}$ be the optimal policy for the discretized augmented MDP $\hat{M}$, let $\bar{\pi}_M$ be the
policy $\pi^*_{\hat{M}}$ adapted to the original MDP $M$, and let $\pi^*_M$ be the optimal (history-dependent) policy
for the original MDP $M$. Then, we have
$$\Phi_M(\bar{\pi}_M) \ge \Phi_M(\pi^*_M) - \eta T.$$
We give a proof in Appendix B. Note that we can set $\eta$ to be sufficiently small to achieve any desired
error level (i.e., choose $\eta = \epsilon/T$, where $\epsilon$ is the desired error). The only cost is in computation time.
Note that the number of states in $\hat{M}$ is still infinite; however, since the cumulative return satisfies
$y \in [0, T]$, it suffices to take $\hat{S} = S \times (\eta \cdot [\lceil T/\eta \rceil])$; then, $\hat{M}$ has $|\hat{S}| = |S| \cdot \lceil T/\eta \rceil$ states.
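Adapting $\pi^*_{\hat{M}}$ to $M$ amounts to the bookkeeping described above: round each observed reward up to the $\eta$-grid and feed the rounded cumulative reward to the policy, while the true (unrounded) return accumulates. The sketch below assumes placeholder `policy_hat` and `env` interfaces.

```python
# Run the discretized-MDP policy in the original MDP by rounding rewards up
# to the eta-grid for the policy's internal state (illustrative sketch).
import math

def run_adapted(policy_hat, env, T, eta):
    s = env.reset()
    y = 0.0                                     # rounded cumulative reward
    true_return = 0.0
    for t in range(1, T + 1):
        a = policy_hat(t, s, y)                 # pi*_Mhat acts on the pair (s, y)
        s, r, _ = env.step(a)
        true_return += r                        # objective is evaluated on raw rewards
        y += eta * math.ceil(r / eta)           # y_t = y_{t-1} + phi(r_t)
    return true_return
```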
4 Upper Confidence Bound Algorithm
Here, we present our upper confidence bound (UCB) algorithm (summarized in Algorithm 1). At a
high level, for each episode, our algorithm constructs an estimate $M^{(k)}$ of the underlying MDP $M$
based on the prior episodes $i \in [k-1]$; to ensure exploration, it optimistically inflates the estimates of
the transition probabilities $P$ and the reward probability measure $P_R$. Then, it plans in $M^{(k)}$ to obtain
an optimistic policy $\pi^{(k)} = \pi^*_{M^{(k)}}$, and uses this policy to act in the MDP for episode $k$.
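The overall loop has the familiar optimism-then-plan structure; the skeleton below is a sketch in which `build_empirical`, `make_optimistic`, and `plan` stand in for the constructions below and the planner of Assumption 2.3.

```python
# High-level UCB loop: estimate, optimistically inflate, plan, act (sketch).
def ucb_risk_sensitive(env, K, build_empirical, make_optimistic, plan):
    data = []                                                # episodes observed so far
    for k in range(K):
        mdp_hat = make_optimistic(build_empirical(data), k)  # optimistic M^(k)
        policy_k = plan(mdp_hat)                             # pi^(k) = pi*_{M^(k)}
        data.append(env.rollout(policy_k))                   # act for episode k
    return data
```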
Optimistic MDP. We define $M^{(k)}$ as follows. Without loss of generality, we assume $S$ includes a distinguished
state $s^\star$ with rewards $F_{R(s^\star, a)}(r) = \mathbb{1}(r \ge 1)$ (i.e., $s^\star$ achieves the maximum reward $r = 1$ with
probability one), and transitions $P(s^\star \mid s, a) = \mathbb{1}(s = s^\star)$ and $P(s' \mid s^\star, a) = \mathbb{1}(s' = s^\star)$ (i.e., $s^\star$ is
inaccessible from other states and only transitions to itself). Our construction of $\hat{M}^{(k)}$ uses $s^\star$ for
optimism. Now, let $\tilde{M}^{(k)}$ be the MDP using the empirical estimates of the transitions and rewards:
$$\tilde{P}^{(k)}(s' \mid s, a) = \frac{N_k(s, a, s')}{N_k(s, a)}, \qquad F_{\tilde{R}^{(k)}(s,a)}(r) = \frac{1}{N_k(s, a)} \sum_{i=1}^{k-1} \sum_{t=1}^{T} \mathbb{1}(r \ge r_{i,t}) \cdot \mathbb{1}(s_{i,t} = s \wedge a_{i,t} = a),$$
where $N_k(s, a, s')$ and $N_k(s, a)$ denote the number of times $(s, a, s')$ and $(s, a)$, respectively, have
been observed during the first $k - 1$ episodes.
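The empirical estimates are standard counts together with an empirical CDF; the sketch below computes them from logged episodes represented as lists of $(s, a, r, s')$ tuples (an illustrative data layout, not the paper's notation).

```python
# Empirical transition probabilities and reward CDF from past episodes (sketch).
import numpy as np

def empirical_estimates(episodes, S, A):
    n_sas = np.zeros((S, A, S))
    rewards = [[[] for _ in range(A)] for _ in range(S)]
    for ep in episodes:
        for s, a, r, s_next in ep:
            n_sas[s, a, s_next] += 1
            rewards[s][a].append(r)
    n_sa = n_sas.sum(axis=2)
    P_tilde = n_sas / np.maximum(n_sa[:, :, None], 1)    # empirical P^(k)(s' | s, a)

    def F_tilde(s, a, r):                                # empirical reward CDF
        obs = rewards[s][a]
        return sum(x <= r for x in obs) / max(len(obs), 1)

    return P_tilde, F_tilde, n_sa
```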
Then, let $\hat{M}^{(k)}$ be the optimistic MDP; in particular, its transitions
$$\hat{P}^{(k)}(s' \mid s, a) = \begin{cases} \mathbb{1}(s' = s^\star) & \text{if } s = s^\star \\ 1 - \sum_{s'' \in S \setminus \{s^\star\}} \hat{P}^{(k)}(s'' \mid s, a) & \text{if } s' = s^\star \\ \max\left\{\tilde{P}^{(k)}(s' \mid s, a) - \epsilon^{(k)}_P(s, a),\, 0\right\} & \text{otherwise} \end{cases}$$
transition to the optimistic state $s^\star$ when uncertain, and its rewards
$$F_{\hat{R}^{(k)}(s,a)}(r) = \begin{cases} \mathbb{1}(r \ge 1) & \text{if } s = s^\star \\ 1 & \text{if } r \ge 1 \\ \max\left\{F_{\tilde{R}^{(k)}(s,a)}(r) - \epsilon^{(k)}_R(s, a),\, 0\right\} & \text{otherwise} \end{cases}$$
optimistically shift the reward CDF downwards. Here, $\epsilon^{(k)}_P(s, a)$ and $\epsilon^{(k)}_R(s, a)$ are defined in
Section 5; intuitively, they are high-probability upper bounds on the errors of the empirical estimates
$\tilde{P}^{(k)}(\cdot \mid s, a)$ and $F_{\tilde{R}^{(k)}(s,a)}$ of the transitions and rewards, respectively.
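The two optimistic shifts can be written down directly; the sketch below shows a single $(s, a)$ pair, with `eps_P` and `eps_R` standing in for the confidence widths of Section 5 (a sketch of the construction under these assumptions, not a verified implementation).

```python
# Optimistic inflation for a single (s, a) pair: shrink each transition
# probability by eps_P and send the leftover mass to s_star; shift the
# empirical reward CDF down by eps_R and clip at zero (illustrative sketch).
import numpy as np

def optimistic_transitions(P_tilde_sa, s_star, eps_P):
    P_hat = np.maximum(P_tilde_sa - eps_P, 0.0)              # shrink other states
    P_hat[s_star] = 1.0 - np.sum(np.delete(P_hat, s_star))   # leftover mass -> s_star
    return P_hat

def optimistic_reward_cdf(F_tilde_sa, eps_R):
    """F_tilde_sa: callable r -> empirical CDF value; returns the shifted CDF."""
    return lambda r: 1.0 if r >= 1.0 else max(F_tilde_sa(r) - eps_R, 0.0)
```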