range of domains (Shen et al., 2014; Nass et al., 2019; Hansen and Sargent, 2011). EntRM
offers a trade-off between the expected return and its variance and allows for the adjustment
of risk-sensitivity through a risk parameter. However, existing approaches typically require
complicated algorithmic designs to handle the non-linearity of EntRM.
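For concreteness, a common form of the entropic risk measure of a random return Z with risk parameter β ≠ 0 (conventions differ in sign and scaling, so this expression is illustrative and may not match the exact definition adopted later) is
\[
\mathrm{EntRM}_{\beta}(Z) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{\beta Z}\right] \;=\; \mathbb{E}[Z] + \frac{\beta}{2}\,\mathrm{Var}(Z) + O(\beta^{2}),
\]
where the second equality is the second-order Taylor expansion around β = 0. A negative β penalizes variance (risk-averse behavior), a positive β rewards it (risk-seeking behavior), and β → 0 recovers the risk-neutral expected return. The logarithm of an exponential moment is precisely the non-linearity that prevents a direct application of standard, expectation-based Bellman arguments.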
Distributional reinforcement learning (DRL) has demonstrated superior performance
over traditional methods in some challenging tasks under a risk-neutral setting (Bellemare
et al., 2017; Dabney et al., 2018b,a). Unlike value-based approaches, DRL learns the en-
tire return distribution rather than a single real-valued value function. With this distributional information in hand, it is natural to optimize a risk measure other than the expectation (Dabney et al., 2018a; Singh et al., 2020; Ma et al., 2020). Despite this intrinsic connection between DRL and RSRL, existing works that approach RSRL through DRL lack regret analysis (Dabney et al., 2018a; Ma et al., 2021; Achab and Neu, 2021). Consequently, it is difficult to evaluate or improve these algorithms in terms of sample efficiency.
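As a minimal illustration of how a learned return distribution can be used for risk-sensitive evaluation (a sketch under the EntRM form above, with hypothetical names, not an algorithm from the works cited), a categorical approximation of the return suffices to read off EntRM:

import numpy as np

def entropic_risk(atoms, probs, beta):
    # EntRM_beta(Z) = (1/beta) * log E[exp(beta * Z)] for a categorical
    # distribution with support `atoms` and probabilities `probs`;
    # beta -> 0 recovers the ordinary expectation.
    if abs(beta) < 1e-12:
        return float(np.dot(probs, atoms))
    shifted = beta * np.asarray(atoms, dtype=float)
    m = shifted.max()  # log-sum-exp shift for numerical stability
    return float((m + np.log(np.dot(probs, np.exp(shifted - m)))) / beta)

atoms = np.array([0.0, 1.0, 2.0])    # hypothetical support of the return
probs = np.array([0.25, 0.5, 0.25])  # learned probabilities
entropic_risk(atoms, probs, beta=0.0)   # 1.0, the risk-neutral value
entropic_risk(atoms, probs, beta=-2.0)  # below 1.0, risk-averse evaluation
entropic_risk(atoms, probs, beta=2.0)   # above 1.0, risk-seeking evaluation

The same recipe applies to any risk measure expressible as a functional of the return distribution.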
Additionally, DRL can be computationally demanding as return distributions are typically
infinite-dimensional objects. This complexity raises a pertinent question:
Is it feasible for DRL to attain near-optimal regret in RSRL while preserving
computational efficiency?
In this work, we answer this question affirmatively by providing computationally efficient DRL algorithms with regret guarantees. We develop two types of DRL algorithms, both computationally efficient and equipped with principled exploration strategies tailored to the tabular EntRM-MDP. Notably, the proposed algorithms apply the principle of optimism in the face of uncertainty (OFU) at the distributional level, effectively balancing exploration and exploitation. By conducting the first regret analysis in the field of DRL, we bridge the gap between computationally efficient DRL and RSRL, particularly in terms of sample complexity. Our work paves the way for a deeper understanding of RSRL and for improving its efficiency through the lens of distributional approaches.
1.1 Related Work
The body of work on DRL has grown rapidly since Bellemare et al. (2017), with numerous studies
aiming to improve performance in the risk-neutral setting (see Rowland et al., 2018; Dabney
et al., 2018b,a; Barth-Maron et al., 2018; Yang et al., 2019; Lyle et al., 2019; Zhang et al.,
2021). However, only a few works have considered risk-sensitive behavior, including Dabney
et al. (2018a); Ma et al. (2021); Achab and Neu (2021). None of these works have addressed
sample complexity.
A large body of work has investigated RSRL using the EntRM in various settings
(Borkar, 2001, 2002; Borkar and Meyn, 2002; Borkar, 2010; Bäuerle and Rieder, 2014;
Di Masi et al., 2000; Di Masi and Stettner, 2007; Cavazos-Cadena and Hernández-Hernández,
2011; Jaśkiewicz, 2007; Ma et al., 2020; Mihatsch and Neuneier, 2002; Osogami, 2012; Patek,
2001; Shen et al., 2013, 2014). However, these works either assume known transitions and rewards or study infinite-horizon settings, and they do not address sample complexity.
Our work is related to two recent studies by Fei et al. (2020) and Fei et al. (2021) in
the same setting. Fei et al. (2020) introduced the first regret-guaranteed algorithms for
risk-sensitive episodic Markov decision processes (MDPs), but their regret upper bounds