Bridging Distributional and Risk-sensitive Reinforcement
Learning with Provable Regret Bounds
Hao Liang haoliang1@link.cuhk.edu.cn
School of Science and Engineering
The Chinese University of Hong Kong, Shenzhen
Zhi-Quan Luo luozq@cuhk.edu.cn
School of Science and Engineering
The Chinese University of Hong Kong, Shenzhen
Abstract
We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distri-
butional reinforcement learning (DRL) methods. In particular, we consider finite episodic
Markov decision processes whose objective is the entropic risk measure (EntRM) of re-
turn. By leveraging a key property of the EntRM, the independence property, we establish
the risk-sensitive distributional dynamic programming framework. We then propose two
novel DRL algorithms that implement optimism through two different schemes: a model-free one and a model-based one.
We prove that they both attain an $\tilde{O}\!\left(\frac{\exp(|\beta|H)-1}{|\beta|H}\sqrt{S^2AK}\right)$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the time horizon, respectively. It matches the bound of RSVI2 proposed in Fei et al. (2021), with a novel distributional
analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL
and RSRL in terms of sample complexity.
Acknowledging the computational inefficiency associated with the model-free DRL al-
gorithm, we propose an alternative DRL algorithm with distribution representation. This
approach not only maintains the established regret bounds but also substantially improves computational efficiency.
We also prove a tighter minimax lower bound of $\Omega\!\left(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT}\right)$ for the $\beta > 0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
Keywords: distributional reinforcement learning, risk-sensitive reinforcement learning,
regret bounds, episodic MDP, entropic risk measure
1. Introduction
Standard reinforcement learning (RL) seeks to find an optimal policy that maximizes the
expected return (Sutton and Barto, 2018). This approach is often referred to as risk-neutral
RL, as it focuses on the mean functional of the return distribution. However, in high-stakes
applications, such as finance (Davis and Lleo, 2008; Bielecki et al., 2000), medical treatment
(Ernst et al., 2006), and operations research (Delage and Mannor, 2010), decision-makers
are often risk-sensitive and aim to maximize a risk measure of the return distribution.
Since the pioneering work of Howard and Matheson (1972), risk-sensitive reinforcement
learning (RSRL) based on the entropic risk measure (EntRM) has been applied to a wide
range of domains (Shen et al., 2014; Nass et al., 2019; Hansen and Sargent, 2011). EntRM
offers a trade-off between the expected return and its variance and allows for the adjustment
of risk-sensitivity through a risk parameter. However, existing approaches typically require
complicated algorithmic designs to handle the non-linearity of EntRM.
Distributional reinforcement learning (DRL) has demonstrated superior performance
over traditional methods in some challenging tasks under a risk-neutral setting (Bellemare
et al., 2017; Dabney et al., 2018b,a). Unlike value-based approaches, DRL learns the en-
tire return distribution instead of a real-valued value function. Given the distributional
information, it is natural to leverage it to optimize a risk measure other than expectation
(Dabney et al., 2018a; Singh et al., 2020; Ma et al., 2020). Despite the intrinsic connec-
tion between DRL and RSRL, existing works on RSRL via DRL approaches lack regret
analysis (Dabney et al., 2018a; Ma et al., 2021; Achab and Neu, 2021). Consequently, it
is challenging to evaluate and improve these DRL algorithms in terms of sample efficiency.
Additionally, DRL can be computationally demanding as return distributions are typically
infinite-dimensional objects. This complexity raises a pertinent question:
Is it feasible for DRL to attain near-optimal regret in RSRL while preserving
computational efficiency?
In this work, we answer this question positively by providing computationally efficient DRL
algorithms with regret guarantees. We develop two types of DRL algorithms, both
designed to be computationally efficient, and equipped with principled exploration strategies
tailored for tabular EntRM-MDP. Notably, these proposed algorithms apply the principle
of optimism in the face of uncertainty (OFU) at a distributional level, effectively balancing
the exploration-exploitation dilemma. By conducting the first regret analysis in the field
of DRL, we bridge the gap between computationally efficient DRL and RSRL, especially
in terms of sample complexity. Our work paves the way for a deeper understanding of RSRL and for improving its efficiency through the lens of distributional approaches.
1.1 Related Work
Work on DRL has grown rapidly since Bellemare et al. (2017), with numerous studies
aiming to improve performance in the risk-neutral setting (see Rowland et al., 2018; Dabney
et al., 2018b,a; Barth-Maron et al., 2018; Yang et al., 2019; Lyle et al., 2019; Zhang et al.,
2021). However, only a few works have considered risk-sensitive behavior, including Dabney
et al. (2018a); Ma et al. (2021); Achab and Neu (2021). None of these works have addressed
sample complexity.
A large body of work has investigated RSRL using the EntRM in various settings
(Borkar, 2001, 2002; Borkar and Meyn, 2002; Borkar, 2010; Bäuerle and Rieder, 2014; Di Masi et al., 2000; Di Masi and Stettner, 2007; Cavazos-Cadena and Hernández-Hernández, 2011; Jaśkiewicz, 2007; Ma et al., 2020; Mihatsch and Neuneier, 2002; Osogami, 2012; Patek,
2001; Shen et al., 2013, 2014). However, these works either assume known transition and
reward or consider infinite-horizon settings without considering sample complexity.
Our work is related to two recent studies by Fei et al. (2020) and Fei et al. (2021) in
the same setting. Fei et al. (2020) introduced the first regret-guaranteed algorithms for
risk-sensitive episodic Markov decision processes (MDPs), but their regret upper bounds
contain an unnecessary factor of $\exp(|\beta|H^2)$ and their lower bound proof contains errors, leading to a weaker bound. While Fei et al. (2021) refined their algorithm by introducing a doubly decaying bonus that effectively removes the $\exp(|\beta|H^2)$ factor (a detailed comparison with Fei et al. (2021) is given in Section 8), the issue with the
the lower bound was not resolved. Achab and Neu (2021) proposed a risk-sensitive deep
deterministic policy gradient framework, but their work is fundamentally different from ours
as they consider the conditional value at risk and focus on the infinite-horizon discounted MDP setting. Moreover, Achab and Neu (2021) assume that the model is known.
1.2 Contributions
This paper makes the following primary contributions:
1. Formulation of a Risk-Sensitive Distributional Dynamic Programming (RS-DDP) frame-
work. This framework introduces a distributional Bellman optimality equation tailored for
EntRM-MDP, leveraging the independence property of the EntRM.
2. Proposal of computationally efficient DRL algorithms that enforce the OFU principle in a distributional fashion, along with regret upper bounds of $\tilde{O}\!\left(\frac{\exp(|\beta|H)-1}{|\beta|H}\sqrt{S^2AK}\right)$. The DRL algorithms not only outperform existing methods empirically but are also supported by theoretical justifications. Furthermore, this marks the first instance of analyzing a DRL algorithm within a finite episodic EntRM-MDP setting.
3. Filling of gaps in the lower bound of Fei et al. (2020), resulting in a tight lower bound of $\Omega\!\left(\frac{\exp(\beta H/6)-1}{\beta}\sqrt{SAT}\right)$ for $\beta > 0$. This lower bound depends on $S$ and $A$ and recovers the tight lower bound in the risk-neutral setting (as $\beta \to 0$).
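To see how this recovers the risk-neutral bound, expand the exponential for small $\beta > 0$ (a routine Taylor expansion, spelled out here only for convenience):
\[
\frac{\exp(\beta H/6)-1}{\beta}\sqrt{SAT} \;=\; \frac{\beta H/6 + O(\beta^2 H^2)}{\beta}\sqrt{SAT} \;\longrightarrow\; \frac{H}{6}\sqrt{SAT} \quad \text{as } \beta \to 0^+,
\]
which matches the $\Omega(H\sqrt{SAT})$ lower bound of the risk-neutral setting up to a constant factor.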
Algorithm   | Regret bound                                                                            | Time        | Space
RSVI        | $\tilde{O}\!\left(\exp(|\beta|H^2)\frac{\exp(|\beta|H)-1}{|\beta|H}\sqrt{S^2AT}\right)$ | $O(TS^2A)$  | $O(HSA + T)$
RSVI2       | $\tilde{O}\!\left(\frac{\exp(|\beta|H)-1}{|\beta|H}\sqrt{S^2AT}\right)$                 |             |
RODI-Rep    | $\tilde{O}\!\left(\frac{\exp(|\beta|H)-1}{|\beta|H}\sqrt{S^2AT}\right)$                 |             |
RODI-MF     | $\tilde{O}\!\left(\frac{\exp(|\beta|H)-1}{|\beta|H}\sqrt{S^2AT}\right)$                 | $O(KS^H)$   | $O(S^H)$
RODI-MB     | $\tilde{O}\!\left(\frac{\exp(|\beta|H)-1}{|\beta|H}\sqrt{S^2AT}\right)$                 | $O(TS^2A)$  | $O(HS^2A)$
lower bound | $\Omega\!\left(\frac{\exp(\beta H/6)-1}{\beta}\sqrt{SAT}\right)$                        | -           | -

Table 1: Regret bounds and computational complexity comparisons.
2. Preliminaries
We provide the technical background in this section.
2.1 Notations
We write $[M:N] \triangleq \{M, M+1, \ldots, N\}$ and $[N] \triangleq [1:N]$ for any positive integers $M \le N$. We adopt the convention that $\sum_{i=n}^{m} a_i \triangleq 0$ if $n > m$ and $\prod_{i=n}^{m} a_i \triangleq 1$ if $n > m$. We use $\mathbb{I}\{\cdot\}$ to denote the indicator function. For any $x \in \mathbb{R}$, we define $[x]_+ \triangleq \max\{x, 0\}$. We denote by $\delta_c$ the Dirac measure at $c$. We denote by $\mathcal{D}(a, b)$, $\mathcal{D}_M$, and $\mathcal{D}$ the set of distributions
supported on $[a, b]$, the set of distributions supported on $[0, M]$, and the set of all distributions, respectively. We use $(x_1, x_2; p)$ to denote a binary random variable taking values $x_1$ and $x_2$ with probabilities $1 - p$ and $p$. For a discrete set $x = \{x_1, \cdots, x_n\}$ and a probability vector $p = (p_1, \cdots, p_n)$, the notation $(x, p)$ represents the discrete distribution with $\mathbb{P}(X = x_i) = p_i$. For a discrete distribution $\eta = (x, p)$, we use $|\eta| = |x|$ to denote the number of atoms of $\eta$. We use $\tilde{O}(\cdot)$ to denote $O(\cdot)$ omitting logarithmic factors. A table of notation is provided in Appendix A.
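To make the discrete-distribution notation concrete, the following minimal Python sketch (our own illustration, not code from the paper) represents a distribution $(x, p)$ by its atoms and probabilities; the class and function names are hypothetical.

import numpy as np

# Minimal sketch: a discrete distribution (x, p) with atoms x_1, ..., x_n and
# probabilities p_1, ..., p_n; |eta| is its number of atoms.
class DiscreteDist:
    def __init__(self, atoms, probs):
        self.atoms = np.asarray(atoms, dtype=float)
        self.probs = np.asarray(probs, dtype=float)
        assert np.isclose(self.probs.sum(), 1.0), "probabilities must sum to one"

    def __len__(self):
        # |eta|: number of atoms of the distribution
        return len(self.atoms)

    def mean(self):
        return float(self.atoms @ self.probs)

def binary(x1, x2, p):
    # The binary random variable (x1, x2; p): x1 with probability 1 - p, x2 with probability p.
    return DiscreteDist([x1, x2], [1.0 - p, p])

eta = binary(0.0, 1.0, 0.3)
print(len(eta), eta.mean())   # 2 atoms, mean 0.3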
2.2 Episodic MDP
An episodic MDP is identified by $\mathcal{M} \triangleq (\mathcal{S}, \mathcal{A}, (P_h)_{h\in[H]}, (r_h)_{h\in[H]}, H)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P_h: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ the probability transition kernel at step $h$, $r_h: \mathcal{S} \times \mathcal{A} \to [0, 1]$ the reward function at step $h$, and $H$ the length of one episode. The agent interacts with the environment for $K$ episodes. At the beginning of episode $k$, Nature selects an initial state $s_1^k$ arbitrarily. At step $h$, the agent takes action $a_h^k$, observes the reward $r_h(s_h^k, a_h^k)$, and reaches the next state $s_{h+1}^k \sim P_h(\cdot \mid s_h^k, a_h^k)$. The episode terminates at step $H+1$ with $r_{H+1} = 0$, after which the agent proceeds to the next episode.
For each $(k, h) \in [K] \times [H]$, we denote by $\mathcal{H}_h^k \triangleq \left(s_1^1, a_1^1, s_2^1, a_2^1, \ldots, s_H^1, a_H^1, \ldots, s_h^k, a_h^k\right)$ the (random) history up to step $h$ of episode $k$. We define $\mathcal{F}_k \triangleq \mathcal{H}_H^{k-1}$ as the history up to episode $k-1$. We describe the interaction between the algorithm and the MDP at two levels. At the level of episodes, we define an algorithm as a sequence of functions $\mathfrak{A} \triangleq (\mathfrak{A}_k)_{k\in[K]}$, each mapping $\mathcal{F}_k$ to a policy $\mathfrak{A}_k(\mathcal{F}_k) \in \Pi$. We denote by $\pi^k \triangleq \mathfrak{A}_k(\mathcal{F}_k)$ the policy used in episode $k$. At the level of steps, a (deterministic) policy $\pi$ is a sequence of functions $\pi = (\pi_h)_{h\in[H]}$ with $\pi_h: \mathcal{S} \to \mathcal{A}$.
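The interaction protocol above can be sketched in code. The snippet below is only a schematic illustration under assumed interfaces (env.reset, env.step, and an algorithm callable that returns a per-episode policy); these names are ours, not the paper's.

# Schematic sketch of the K-episode interaction protocol (assumed interfaces).
def run_episodes(env, algorithm, K, H):
    history = []                          # grows into the history of visited (s, a) pairs
    for k in range(K):                    # episodes k = 1, ..., K
        policy = algorithm(history)       # pi^k: a map (h, s) -> a, chosen from the history so far
        s = env.reset()                   # Nature picks the initial state s_1^k arbitrarily
        for h in range(1, H + 1):         # steps h = 1, ..., H
            a = policy(h, s)
            r, s_next = env.step(h, s, a) # reward r_h(s, a) and next state s_{h+1} ~ P_h(.|s, a)
            history.append((k, h, s, a, r))
            s = s_next
        # the episode terminates at step H + 1 with r_{H+1} = 0
    return history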
2.3 Risk Measure
Consider two random variables $X \sim F$ and $Y \sim G$. We say that $Y$ dominates $X$, and correspondingly that $G$ dominates $F$, denoted $Y \succeq X$ and $G \succeq F$, if and only if $F(x) \ge G(x)$ for every real number $x$. A risk measure $\rho$ is a function mapping a set of random variables, denoted $\mathcal{X}$, to the real numbers. The following properties are of particular interest:
Monotonicity (M): $X \preceq Y \Rightarrow \rho(X) \le \rho(Y)$ for all $X, Y \in \mathcal{X}$;
Translation-invariance (TI): $\rho(X + c) = \rho(X) + c$ for all $X \in \mathcal{X}$ and $c \in \mathbb{R}$;
Distribution-invariance (DI): $F_{X_1} = F_{X_2} \Rightarrow \rho(X_1) = \rho(X_2)$.
A mapping $\rho: \mathcal{X} \to \mathbb{R}$ qualifies as a risk measure if it satisfies both (M) and (TI). Additionally, a risk measure that also adheres to (DI) is termed a distribution-invariant risk measure. This paper focuses exclusively on distribution-invariant risk measures, allowing us to simplify notation by writing $\rho(F_X)$ for $\rho(X)$.
We direct our attention to the EntRM, a prominent risk measure in domains requiring risk-sensitive decision-making, such as mathematical finance (Föllmer and Schied, 2016) and Markovian decision processes (Bäuerle and Rieder, 2014). For a random variable $X \sim F$ and a non-zero coefficient $\beta$, the EntRM is defined as
\[
U_\beta(X) \triangleq \frac{1}{\beta}\log \mathbb{E}_{X \sim F}\!\left[e^{\beta X}\right] = \frac{1}{\beta}\log \int_{\mathbb{R}} e^{\beta x}\, dF(x).
\]
Given that EntRM satisfies (DI), we can denote $U_\beta(F) \triangleq U_\beta(X)$ for $X \sim F$. When $\beta$ possesses a small absolute value, employing Taylor's expansion yields
\[
U_\beta(X) = \mathbb{E}[X] + \frac{\beta}{2}\mathbb{V}[X] + O(\beta^2). \tag{1}
\]
Therefore, a decision-maker aiming to maximize the EntRM value demonstrates risk-seeking behavior (preferring higher uncertainty in $X$) when $\beta > 0$, and risk-averse behavior (preferring lower uncertainty in $X$) when $\beta < 0$. The absolute value of $\beta$ dictates the sensitivity to risk, with the measure converging to the mean functional as $\beta$ approaches zero.
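As a quick numerical illustration (our own sketch, not code from the paper), the EntRM of a discrete distribution can be computed directly from its definition and compared with the mean-variance expansion in (1):

import numpy as np

# EntRM of a discrete distribution: U_beta(X) = (1/beta) * log E[exp(beta * X)],
# computed in log-sum-exp form for numerical stability.
def entrm(atoms, probs, beta):
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    z = beta * atoms + np.log(probs)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / beta

atoms, probs = np.array([0.0, 1.0]), np.array([0.5, 0.5])   # a Bernoulli(1/2) return
mean = float(atoms @ probs)
var = float(((atoms - mean) ** 2) @ probs)
for beta in (-2.0, -0.1, 0.1, 2.0):
    print(beta, entrm(atoms, probs, beta), mean + 0.5 * beta * var)
# beta < 0: EntRM below the mean (risk-averse); beta > 0: above the mean (risk-seeking);
# for small |beta| the value is close to the expansion E[X] + (beta/2) V[X] in (1).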
2.4 Risk-neutral Distributional Dynamic Programming Revisited
Bellemare et al. (2017); Rowland et al. (2018) have discussed infinite-horizon distributional dynamic programming in the risk-neutral setting, which will be referred to as the classical DDP. Now we adapt their results to the finite-horizon setting. We define the return for a policy $\pi$ starting from state-action pair $(s, a)$ at step $h$ as
\[
Z_h^\pi(s, a) \triangleq \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}), \qquad s_h = s,\; a_h = a,\; a_{h'} = \pi_{h'}(s_{h'}) \text{ for } h' > h,\; s_{h'+1} \sim P_{h'}(\cdot \mid s_{h'}, a_{h'}).
\]
Define $Y_h^\pi(s) \triangleq Z_h^\pi(s, \pi_h(s))$; then it is immediate that
\[
Z_h^\pi(s, a) = r_h(s, a) + Y_{h+1}^\pi(S), \qquad S \sim P_h(\cdot \mid s, a).
\]
There are two sources of randomness in $Z_h^\pi(s, a)$: the transition $P_h(\cdot \mid s, a)$ and the next-state return $Y_{h+1}^\pi$. Denote by $\nu_h^\pi(s)$ and $\eta_h^\pi(s, a)$ the cumulative distribution functions (CDFs) corresponding to $Y_h^\pi(s)$ and $Z_h^\pi(s, a)$, respectively. Rewriting the random-variable recursion in the form of CDFs, we have the distributional Bellman equation
\[
\eta_h^\pi(s, a) = \sum_{s'} P_h(s' \mid s, a)\, \nu_{h+1}^\pi(s')(\cdot - r_h(s, a)), \qquad \nu_h^\pi(s) = \eta_h^\pi(s, \pi_h(s)).
\]
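For intuition, this backward recursion can be carried out exactly on a small tabular MDP by representing each return distribution by its atoms and probabilities. The sketch below is our illustration only (not one of the paper's algorithms); it stores probability mass functions rather than CDFs, which encode the same distributions, and the data layout for P, r, and pi is assumed.

from collections import defaultdict

# Exact distributional policy evaluation by backward recursion.
# Assumed layout: P[h][s][a] is a dict {s_next: probability}, r[h][s][a] a scalar
# reward, and pi[h][s] the action of the fixed policy, for h = 1, ..., H.
def distributional_policy_evaluation(S, A, H, P, r, pi):
    nu = {H + 1: {s: {0.0: 1.0} for s in range(S)}}   # Y_{H+1} is identically zero
    eta = {}
    for h in range(H, 0, -1):
        eta[h], nu[h] = {}, {}
        for s in range(S):
            for a in range(A):
                dist = defaultdict(float)
                for s_next, p in P[h][s][a].items():
                    for y, q in nu[h + 1][s_next].items():
                        # shift the next-state return distribution by r_h(s, a)
                        # and mix over the transition kernel P_h(.|s, a)
                        dist[r[h][s][a] + y] += p * q
                eta[h][(s, a)] = dict(dist)            # distribution of Z_h^pi(s, a)
            nu[h][s] = eta[h][(s, pi[h][s])]           # nu_h^pi(s) = eta_h^pi(s, pi_h(s))
    return eta, nu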
The distributional Bellman equation outlines the backward recursion of the return distribution under a fixed policy. Our focus is primarily on risk-neutral control, aiming to maximize the mean value of the return, as represented by
\[
\pi^*(s) \in \arg\max_{(\pi_1, \ldots, \pi_H) \in \Pi} \mathbb{E}\left[Z_1^\pi(s)\right].
\]
Here, $\pi = (\pi_1, \ldots, \pi_H)$ signifies that this is a multi-stage maximization problem. An exhaustive search approach is impractical due to its exponential computational complexity. However, the principle of optimality applies, suggesting that the optimal policy for any tail sub-problem coincides with the tail of the optimal policy, as discussed in Bertsekas et al. (2000). This principle enables the reduction of the multi-stage maximization problem into several single-stage maximization problems. Let $Z^* = Z^{\pi^*}$ and $Y^* = Y^{\pi^*}$ denote the optimal returns. The risk-neutral Bellman optimality equation can be expressed as follows:
\[
Z_h^*(s, a) = r_h(s, a) + Y_{h+1}^*(S), \qquad S \sim P_h(\cdot \mid s, a),
\]
\[
\pi_h^*(s) = \arg\max_{a} \mathbb{E}\left[Z_h^*(s, a)\right], \qquad Y_h^*(s) = Z_h^*(s, \pi_h^*(s)).
\]
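As a point of reference, the risk-neutral optimality equation can be solved by standard backward induction when the model is known. The short sketch below is again our own (not one of the paper's algorithms) and assumes P[h] is an (S, A, S) transition array and r[h] an (S, A) reward array, indexable for h = 1, ..., H (e.g., lists with a dummy entry at index 0).

import numpy as np

# Standard risk-neutral backward induction (dynamic programming) on a known model.
def risk_neutral_backward_induction(P, r, S, A, H):
    V = np.zeros((H + 2, S))               # V[h, s] = E[Y_h^*(s)]; V[H + 1] = 0
    pi = np.zeros((H + 1, S), dtype=int)
    for h in range(H, 0, -1):
        Q = r[h] + P[h] @ V[h + 1]         # Q[s, a] = r_h(s, a) + sum_s' P_h(s'|s, a) E[Y_{h+1}^*(s')]
        pi[h] = Q.argmax(axis=1)           # pi_h^*(s) = argmax_a E[Z_h^*(s, a)]
        V[h] = Q.max(axis=1)               # E[Y_h^*(s)] = E[Z_h^*(s, pi_h^*(s))]
    return V, pi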