Bridging Distributional and Risk-sensitive Reinforcement Learning Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

2025-04-30 0 0 722.09KB 54 页 10玖币
侵权投诉
Bridging Distributional and Risk-sensitive Reinforcement Learning
Bridging Distributional and Risk-sensitive Reinforcement
Learning with Provable Regret Bounds
Hao Liang haoliang1@link.cuhk.edu.cn
School of Science and Engineering
The Chinese University of Hong Kong, Shenzhen
Zhi-Quan Luo luozq@cuhk.edu.cn
School of Science and Engineering
The Chinese University of Hong Kong, Shenzhen
Abstract
We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distri-
butional reinforcement learning (DRL) methods. In particular, we consider finite episodic
Markov decision processes whose objective is the entropic risk measure (EntRM) of re-
turn. By leveraging a key property of the EntRM, the independence property, we establish
the risk-sensitive distributional dynamic programming framework. We then propose two
novel DRL algorithms that implement optimism through two different schemes, including
a model-free one and a model-based one.
We prove that they both attain ˜
Oexp(|β|H)1
|β|HS2AKregret upper bound, where
S,A,K, and Hrepresent the number of states, actions, episodes, and the time horizon,
respectively. It matches RSVI2 proposed in Fei et al. (2021), with novel distributional
analysis. To the best of our knowledge, this is the first regret analysis that bridges DRL
and RSRL in terms of sample complexity.
Acknowledging the computational inefficiency associated with the model-free DRL al-
gorithm, we propose an alternative DRL algorithm with distribution representation. This
approach not only maintains the established regret bounds but also significantly amplifies
computational efficiency.
We also prove a tighter minimax lower bound of Ω exp(βH/6)1
βH HSAT for the β > 0
case, which recovers the tight lower bound Ω(HSAT ) in the risk-neutral setting.
Keywords: distributional reinforcement learning, risk-sensitive reinforcement learning,
regret bounds, episodic MDP, entropic risk measure
1. Introduction
Standard reinforcement learning (RL) seeks to find an optimal policy that maximizes the
expected return (Sutton and Barto, 2018). This approach is often referred to as risk-neutral
RL, as it focuses on the mean functional of the return distribution. However, in high-stakes
applications, such as finance (Davis and Lleo, 2008; Bielecki et al., 2000), medical treatment
(Ernst et al., 2006), and operations research (Delage and Mannor, 2010), decision-makers
are often risk-sensitive and aim to maximize a risk measure of the return distribution.
Since the pioneering work of Howard and Matheson (1972), risk-sensitive reinforcement
learning (RSRL) based on the exponential risk measure (EntRM) has been applied to a wide
1
arXiv:2210.14051v3 [cs.LG] 25 Jan 2024
Hao Liang and Zhi-Quan Luo
range of domains (Shen et al., 2014; Nass et al., 2019; Hansen and Sargent, 2011). EntRM
offers a trade-off between the expected return and its variance and allows for the adjustment
of risk-sensitivity through a risk parameter. However, existing approaches typically require
complicated algorithmic designs to handle the non-linearity of EntRM.
Distributional reinforcement learning (DRL) has demonstrated superior performance
over traditional methods in some challenging tasks under a risk-neutral setting (Bellemare
et al., 2017; Dabney et al., 2018b,a). Unlike value-based approaches, DRL learns the en-
tire return distribution instead of a real-valued value function. Given the distributional
information, it is natural to leverage it to optimize a risk measure other than expectation
(Dabney et al., 2018a; Singh et al., 2020; Ma et al., 2020). Despite the intrinsic connec-
tion between DRL and RSRL, existing works on RSRL via DRL approaches lack regret
analysis (Dabney et al., 2018a; Ma et al., 2021; Achab and Neu, 2021). Consequently, it
is challenging to evaluate and improve these DRL algorithms in terms of sample efficiency.
Additionally, DRL can be computationally demanding as return distributions are typically
infinite-dimensional objects. This complexity raises a pertinent question:
Is it feasible for DRL to attain near-optimal regret in RSRL while preserving
computational efficiency?
In this work, we answer this question positively by providing computationally efficient DRL
algorithms with regret guarantee. We have developed two types of DRL algorithms, both
designed to be computationally efficient, and equipped with principled exploration strategies
tailored for tabular EntRM-MDP. Notably, these proposed algorithms apply the principle
of optimism in the face of uncertainty (OFU) at a distributional level, effectively balancing
the exploration-exploitation dilemma. By conducting the first regret analysis in the field
of DRL, we bridge the gap between computationally efficient DRL and RSRL, especially
in terms of sample complexity. Our work paves the way for deeper understanding and
improving the efficiency of RSRL through the lens of distributional approaches.
1.1 Related Work
Related work in DRL has rapidly grown since Bellemare et al. (2017), with numerous studies
aiming to improve performance in the risk-neutral setting (see Rowland et al., 2018; Dabney
et al., 2018b,a; Barth-Maron et al., 2018; Yang et al., 2019; Lyle et al., 2019; Zhang et al.,
2021). However, only a few works have considered risk-sensitive behavior, including Dabney
et al. (2018a); Ma et al. (2021); Achab and Neu (2021). None of these works have addressed
sample complexity.
A large body of work has investigated RSRL using the EntRM in various settings
(Borkar, 2001, 2002; Borkar and Meyn, 2002; Borkar, 2010; B¨auerle and Rieder, 2014;
Di Masi et al., 2000; Di Masi and Stettner, 2007; Cavazos-Cadena and Hern´andez-Hern´andez,
2011; Ja´skiewicz, 2007; Ma et al., 2020; Mihatsch and Neuneier, 2002; Osogami, 2012; Patek,
2001; Shen et al., 2013, 2014). However, these works either assume known transition and
reward or consider infinite-horizon settings without considering sample complexity.
Our work is related to two recent studies by Fei et al. (2020) and Fei et al. (2021) in
the same setting. Fei et al. (2020) introduced the first regret-guaranteed algorithms for
risk-sensitive episodic Markov decision processes (MDPs), but their regret upper bounds
2
Bridging Distributional and Risk-sensitive Reinforcement Learning
contain an unnecessary factor of exp(|β|H2) and their lower bound proof contains errors,
leading to a weaker bound. While Fei et al. (2021) refined their algorithm by introducing
a doubly decaying bonus that effectively removes the exp(|β|H2) factor 1, the issue with
the lower bound was not resolved. Achab and Neu (2021) proposed a risk-sensitive deep
deterministic policy gradient framework, but their work is fundamentally different from ours
as they consider the conditional value at risk and focus on discounted MDP with infinite
horizon settings. Moreover, Achab and Neu (2021) assumes that the model is known.
1.2 Contributions
This paper makes the following primary contributions:
1. Formulation of a Risk-Sensitive Distributional Dynamic Programming (RS-DDP) frame-
work. This framework introduces a distributional Bellman optimality equation tailored for
EntRM-MDP, leveraging the independence property of the EntRM.
2. Proposal of computationally efficient DRL algorithms that enforce the OFU principle in
a distributional fashion, along with regret upper bounds of ˜
Oexp(|β|H)1
|β|HS2AK. The
DRL algorithms not only outperform existing methods empirically but are also supported
by theoretical justifications. Furthermore, this marks the first instance of analyzing a DRL
algorithm within a finite episodic EntRM-MDP setting.
3. Filling of gaps in the lower bound in Fei et al. (2020), resulting in a tight lower bound of
exp(βH/6)1
βSAT for β > 0. This lower bound is dependent of Sand Aand recovers
the tight lower bound in the risk-neutral setting (as β0).
Algorithm Regret bound Time Space
RSVI ˜
Oexp(|β|H2)exp(|β|H)1
|β|HS2AT OT S2AO(HSA +T)
RSVI2
˜
Oexp(|β|H)1
|β|HS2AT
RODI-Rep
RODI-MF O(KSH)O(SH)
RODI-MB OT S2AO(HS2A)
lower bound exp(βH/6)1
βSAT - -
Table 1: Regret bounds and computational complexity comparisons.
2. Preliminaries
We provide the technical background in this section.
2.1 Notations
We write [M:N]{M, M + 1, ..., N}and [N][1 : N] for any positive integers MN.
We adopt the convention that Pm
i=nai0 if n>mand Qm
i=nai1 if n>m. We use I{·}
to denote the indicator function. For any xR, we define [x]+max{x, 0}. We denote
by δcthe Dirac measure at c. We denote by D(a, b), DMand Dthe set of distributions
1. A detailed comparison with Fei et al. (2021) is given in Section 8.
3
Hao Liang and Zhi-Quan Luo
supported on [a, b], [0, M] and the set of all distributions respectively. We use (x1, x2;p) to
denote a binary r.v. taking values x1and x2with probability 1 pand p. For a discrete set
x={x1,··· , xn}and a probability vector p= (p1,··· , pn), the notation (x, p) represents
the discrete distribution with P(X=xi) = pi. For a discrete distribution η= (x, p), we use
|η|=|x|to denote the number of atoms of the distribution η. We use ˜
O(·) to denote O(·)
omitting logarithmic factors. A table of notation is provided in Appendix A.
2.2 Episodic MDP
An episodic MDP is identified by M(S,A,(Ph)h[H],(rh)h[H], H), where Sis the state
space, Athe action space, Ph:S × A ∆(S) the probability transition kernel at step h,
rh:S × A [0,1] the collection of reward functions at step hand Hthe length of one
episode. The agent interacts with the environment for Kepisodes. At the beginning of
episode k, Nature selects an initial state sk
1arbitrarily. In step h, the agent takes action ak
h
and observes random reward rh(sk
h, ak
h) and reaches the next state sk
h+1 Ph(·|sk
h, ak
h). The
episode terminates at H+ 1 with rH+1 = 0, then the agent proceeds to next episode.
For each (k, h)[K]×[H], we denote by Hk
hs1
1, a1
1, s1
2, a1
2, . . . , s1
H, a1
H, . . . , sk
h, ak
h
the (random) history up to step hof episode k. We define FkHk1
Has the history up to
episode k1. We describe the interaction between the algorithm and MDP in two levels.
In the level of episode, we define an algorithm as a sequence of function A(Ak)k[K],
each mapping Fkto a policy Ak(Fk)Π. We denote by πkAk(Fk) the policy at episode
k. In the level of step, a (deterministic) policy πis a sequence of functions π= (πh)h[H]
with πh:S → A.
2.3 Risk Measure
Consider two random variables XFand YG. We assert that Ydominates X, and
correspondingly, Gdominates F, denoted as YXand GF, if and only if for every real
number x, the inequality F(x)G(x) holds true. A risk measure, ρ, is a function mapping
a set of random variables, denoted as X, to the real numbers. This mapping adheres to
several crucial properties:
Monotonicity (M): XYρ(X)ρ(Y),X, Y X,
Translation-invariance (TI): ρ(X+c) = ρ(X) + c, XX,cR,
Distribution-invariance (DI): FX1=FX2ρ(X1) = ρ(X2).
A mapping ρ:XRqualifies as a risk measure if it satisfies both (M) and (TI).
Additionally, a risk measure that also adheres to (DI) is termed a distribution-invariant risk
measure. This paper focuses exclusively on distribution-invariant risk measures, allowing
us to simplify notation by denoting ρ(FX) as ρ(X).
We direct our attention to EntRM, a prominent risk measure in domains requiring
risk-sensitive decision-making, such as mathematical finance (F¨ollmer and Schied, 2016),
Markovian decision processes (B¨auerle and Rieder, 2014). For a random variable XF
and a non-zero coefficient β, the EntRM is defined as:
Uβ(X)1
βlog EXFheβX i=1
βlog ZR
eβX dF (x).
4
Bridging Distributional and Risk-sensitive Reinforcement Learning
Given that EntRM satisfies (DI), we can denote Uβ(F) as Uβ(X) for XF. When β
possesses a small absolute value, employing Taylor’s expansion yields
Uβ(X) = E[X] + β
2V[X] + O(β2).(1)
Therefore, a decision-maker aiming to maximize the EntRM value demonstrates risk-seeking
behavior (preferring higher uncertainty in X) when β > 0, and risk-averse behavior (prefer-
ring lower uncertainty in X) when β < 0. The absolute value of βdictates the sensitivity
to risk, with the measure converging to the mean functional as βapproaches zero.
2.4 Risk-neutral Distributional Dynamic Programming Revisited
Bellemare et al. (2017); Rowland et al. (2018) have discussed the infinite-horizon distribu-
tional dynamic programming in the risk-neutral setting, which will be referred to as the
classical DDP. Now we adapt their results to the finite horizon setting. We define the return
for a policy πstarting from state-action pair (s, a) at step h
Zπ
h(s, a)
H
X
h=h
rh(sh, ah), sh=s, ah=πh(sh), sh+1 Ph(·|sh, ah).
Define Yπ
h(s)Zπ
h(s, πh(s)), then it is immediate that
Zπ
h(s, a) = rh(s, a) + Yπ
h+1(S), SPh(·|s, a).
There are two sources of randomness in Zπ
h(s, a): the transition Pπ
hand the next-state
return Yπ
h+1. Denote by νπ
h(s) and ηπ
h(s, a) the cumulative distribution function (CDF)
corresponding to Yπ
h(s) and Zπ
h(s, a) respectively. Rewriting the random variable in the
form of CDF, we have the distributional Bellman equation
ηπ
h(s, a) = X
s
Ph(s|s, a)νπ
h+1(s)(· − rh(s, a)), νπ
h(s) = ηπ
h(s, πh(s)).
The distributional Bellman equation outlines the backward recursion of the return distribu-
tion under a fixed policy. Our focus is primarily on risk-neutral control, aiming to maximize
the mean value of the return, as represented by:
π(s)arg max
(π1,...,πH)Π
E[Zπ
1(s)].
Here, π= (π1, ..., πH) signifies that this is a multi-stage maximization problem. An ex-
haustive search approach is impractical due to its exponential computational complexity.
However, the principle of optimality applies, suggesting that the optimal policy for any
tail sub-problem coincides with the tail of the optimal policy, as discussed in (Bertsekas
et al., 2000). This principle enables the reduction of the multi-stage maximization problem
into several single-stage maximization problems. Let Z=Zπand Y=Yπdenote the
optimal return. The risk-neutral Bellman optimality equation can be expressed as follows:
Z
h(s, a) = rh(s, a) + Y
h+1(S), SPh(·|s, a)
π
h(s) = arg max
a
E[Z
h(s, a)], Y
h(s) = Z
h(s, π
h(s)).
5
摘要:

BridgingDistributionalandRisk-sensitiveReinforcementLearningBridgingDistributionalandRisk-sensitiveReinforcementLearningwithProvableRegretBoundsHaoLianghaoliang1@link.cuhk.edu.cnSchoolofScienceandEngineeringTheChineseUniversityofHongKong,ShenzhenZhi-QuanLuoluozq@cuhk.edu.cnSchoolofScienceandEngineer...

展开>> 收起<<
Bridging Distributional and Risk-sensitive Reinforcement Learning Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds.pdf

共54页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:54 页 大小:722.09KB 格式:PDF 时间:2025-04-30

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 54
客服
关注