Elastic Step DQN: A novel multi-step algorithm to alleviate
overestimation in Deep Q-Networks
Adrian Ly1, Richard Dazeley2, Peter Vamplew3, Francisco Cruz4, and Sunil Aryal5
1,2,5 Deakin University
3 Federation University
4 UNSW
Abstract
The Deep Q-Network (DQN) algorithm was the first reinforcement learning algorithm to use
deep neural networks to successfully surpass human-level performance in a number of Atari
learning environments. However, divergent and unstable behaviour have been long-standing
issues in DQNs. The unstable behaviour is often characterised by overestimation in the Q-
values, commonly referred to as the overestimation bias. To address the overestimation bias and
the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-
step updates have been shown to drastically reduce unstable behaviour while improving an agent’s
training performance. However, agents are often highly sensitive to the selection of the multi-
step update horizon (n), and our empirical experiments show that a poorly chosen static value
for n can in many cases lead to worse performance than single-step DQN. Inspired by the success
of n-step DQN and the effects that multi-step updates have on overestimation bias, this paper
proposes a new algorithm that we call ‘Elastic Step DQN’ (ES-DQN). It dynamically varies the
step size horizon in multi-step updates based on the similarity of states visited. Our empirical
evaluation shows that ES-DQN outperforms n-step DQN with a fixed n, Double DQN and
Average DQN in several OpenAI Gym environments while at the same time alleviating the
overestimation bias.
arXiv:2210.03325v1 [cs.LG] 7 Oct 2022
1 Introduction
One of the most prominent reinforcement learning algorithms is Q-learning (Watkins and Dayan,
1992), which has been shown to converge to optimal policies when used with lookup tables. However,
tabular Q-learning is limited to small, toy environments, and lookup tables are compu-
tationally inefficient when environments have large state-action spaces. To address the scalability
of Q-learning, deep neural networks became a viable alternative to lookup tables to approximate
state-action values in large continuous spaces. One of the most popular algorithms is the Deep Q-
Network (DQN) (Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra and Riedmiller, 2013;
Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski
et al., 2015) which uses neural networks and introduced the concept of a target network and replay
memory with great success in the Atari games environment. In spite of the success, training insta-
bility and divergent behaviour are regularly observed in DQN (Sutton and Barto, 2018; Van Hasselt,
Guez and Silver, 2016). The divergent behaviour itself is not specific to DQN, and has also
been heavily investigated for function approximators in the past (Sutton and Barto, 2018; Baird,
1995; Tsitsiklis and Van Roy, 1997) with a number of linear solutions having been proposed to
address the problem (Maei, Szepesvari, Bhatnagar, Precup, Silver and Sutton, 2009; Baird, 1995).
Overestimation of the Q-values has frequently been identified as a key cause of
sub-optimal learning and divergent behaviour – this was thoroughly investigated by Thrun and
Schwartz (1993), who attributed the issue to noise generated through function approximation. On
the other hand, Hasselt (2010) theorised that the overestimation originates from the max operator
used in the Q-value update, which biases estimates towards larger values and a more optimistic
outlook. Van Hasselt, Doron, Strub, Hessel, Sonnerat and Modayil (2018) and Van Hasselt et al.
(2016) have suggested that by correcting for overestimation, an agent is much less susceptible to
divergent behaviours.
Empirical observations by Van Hasselt et al. (2018) and Hessel, Modayil, Van Hasselt, Schaul,
Ostrovski, Dabney, Horgan, Piot, Azar and Silver (2018) characterised a number of plausible archi-
tectural and algorithmic mechanisms that may lead to divergent behaviour, as well as suggestions that
may alleviate divergence. Van Hasselt et al. (2016) also hypothesised that multi-step returns are
likely to reduce the occurrence of divergence. This idea that multi-step DQN updates can regulate
and reduce divergent behaviour while at the same time return stronger training performance is
not without merit. Multi-step implementations like the Rainbow agent (Hessel et al., 2018) and
Mixed Multi-step DDPG (Meng, Gorbet and Kulić, 2021) have shown empirically that multi-step
updates under certain conditions are more stable, and in many circumstances can circumvent the
divergence problem that exists in single-step DQN. However, studies have also shown that multi-
step DQN updates are highly sensitive to the selection of the value of n, which if poorly chosen
can be detrimental to learning (Hessel et al., 2018; Chiang, Yang, Hong and Lee, 2020; Horgan,
Quan, Budden, Barth-Maron, Hessel, Van Hasselt and Silver, 2018; Deng, Yin, Deng and Li, 2020;
Fedus, Ramachandran, Agarwal, Bengio, Larochelle, Rowland and Dabney, 2020). This raises the
question: is a static value of n the best approach to multi-step updates, or is it possible to
dynamically select the n parameter and take advantage of the special properties that multi-step
updates provide?
One possible approach is inspired by the work of Dazeley, Vamplew and Bignold (2015), who
identified an issue where agents may diverge in a grid world setting under linear approximation due
to a self-referential learning loop problem. It was suggested that the consolidation of consecutive
states, and the treatment of them as sub-states of a larger state, can encourage more stable algo-
rithmic performance. However, the ideas were limited to a grid world setting under linear function
approximation. In this paper, we extend the ideas of Dazeley et al. (2015) into a deep reinforcement
learning context by leveraging a multi-step update.
As a combination of these ideas, this paper will introduce Elastic Step DQN (ES-DQN) – a
novel multi-step update that dynamically selects the step size horizon based on consecutive state
similarity. The intuition behind consolidating similar states is broadly inspired by spatial
information theory, which has identified that people have a natural inclination to use
landmarks when providing route directions (Michon and Denis, 2001; Klippel and Winter, 2005). When
providing directions, individuals tend to summarise the instructions around broader objectives and
use landmarks to anchor and help generate mental images (e.g. ‘walk down to the end of the hallway
and turn left once you see the toilet’, in comparison to more detailed step-by-step descriptions
– ‘take one step, then another step, until you reach the end of the hallway’). Our experimental
results indicate that the systematic accumulation of similar experiences achieves performance
comparable to well-established DQN extensions and superior to fixed n-step updates.
Furthermore, to achieve statistical significance in the experiments, this paper
primarily uses three small environments as the test bed for this investigation.
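To make this intuition concrete, the sketch below shows one way an elastic multi-step target could be formed by accumulating rewards while consecutive states remain similar. The similarity test is_similar, the cap n_max and the transition layout are hypothetical placeholders for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of an "elastic" multi-step return: rewards are accumulated
# while consecutive states are judged similar, so the bootstrap horizon n is not
# fixed in advance. `is_similar` is a placeholder for whatever similarity test is used.
def elastic_step_target(transitions, q_max_next, gamma, is_similar, n_max=8):
    """Build a multi-step target whose horizon grows while states stay similar.

    transitions: list of (state, reward, next_state, done) starting at time t.
    q_max_next: callable returning max_a Q(state, a) from the target network.
    """
    g, discount, n = 0.0, 1.0, 0
    for state, reward, next_state, done in transitions:
        g += discount * reward          # accumulate gamma^i * R_{t+i+1}
        discount *= gamma
        n += 1
        # Stop extending the horizon once the episode ends, the states stop
        # being similar, or an (assumed) safety cap on n is reached.
        if done or n >= n_max or not is_similar(state, next_state):
            break
    last_next_state, episode_done = transitions[n - 1][2], transitions[n - 1][3]
    target = g if episode_done else g + discount * q_max_next(last_next_state)
    return target, n  # n is the elastic step size actually used
```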
The following sections will investigate the efficacy of multi-step updates against single-step
DQN, as well as introduce a novel multi-step DQN update that automates the selection of the step
size horizon. The algorithm will also be compared against Double DQN and Average DQN, which
are well-established methods for reducing overestimation. This paper makes four key contributions:
• the idea of dynamic multi-step updates based on state similarity,
• the introduction of ES-DQN to select the n-step horizon dynamically,
• an in-depth empirical exploration of n-step DQN against ES-DQN, and
• an in-depth empirical analysis of algorithmic stability when grouping together similar states.
2 Related works
2.1 Background
The training of a reinforcement learning agent is typically unguided and is characterised by a
Markov decision process where, at each time step t, the agent observes a particular state S_t
before taking an action A_t. The agent then receives a reward R_{t+1} and enters a
new state S_{t+1} – the main objective is for the agent to learn an optimal policy that maximises the
expected discounted accumulation of rewards G_t at time t over k steps into the future, with discount
factor γ:
G_t = \sum_{i=0}^{k-1} \gamma^{i} R_{t+i+1}
Classical reinforcement learning algorithms rely largely on lookup tables to store state-action
values to learn an optimal policy – one such algorithm is Q-learning, which was first introduced by
Watkins and Dayan (1992) and has since been a focal point of value-based reinforcement learning
approaches. As a Q-learning agent navigates its environment and receives responses from the
environment, the lookup table is updated at every time step, where Q is the state-action value
function, α is the learning rate, R is the reward, and A′ is an action selected from a policy derived
from Q (Sutton and Barto, 2018):
Q_{new}(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \big]
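A minimal dictionary-backed sketch of this tabular update is shown below; the action set, learning rate and discount factor are illustrative choices rather than values taken from the paper.

```python
from collections import defaultdict

# Lookup table mapping (state, action) pairs to Q-values, defaulting to 0.0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update matching the equation above."""
    # max_{A'} Q(S', A') over the actions available in the next state
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
```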
Multi-step updates (Sutton and Barto, 2018) expand on the single-step update by temporally extending the
step size horizon of the target by the value of n. Traditionally, multi-step updates were infrequently
used because they were inconvenient to implement and served mainly as an intermediate step towards
eligibility traces. However, multi-step updates have been shown to allow faster convergence (Sutton
and Barto, 2018; Ceron and Castro, 2021) when the value of n is well-tuned.
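For a fixed horizon n, the corresponding n-step Q-learning target can be written as follows (shown here for reference):

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \max_{A'} Q(S_{t+n}, A')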
2.2 Related literature
Conceptually, the multi-step update proposed in this paper extends the ideas introduced in
Options (Sutton, Precup and Singh, 1999) and Coarse Q-learning (Dazeley et al., 2015). Options
were first formalised as a framework that allows agents to take sub-actions under a single policy until
a termination condition is activated. The termination condition can come in the form of a sub-
goal being achieved or a certain temporal horizon being reached. Similarly, Coarse Q-learning builds
on the Options framework and explores the issue of a wandering agent in a coarsely quantised
environment. Dazeley et al. (2015) showed that divergence under quantisation can be explained by
a wandering phenomenon where an agent, when trapped in the same state for multiple updates, may
wander aimlessly and exhibit unstable learning. To address this behaviour, Dazeley et al. (2015)
introduced clamped actions, which force the agent to take the same action until the quantised state
is exited, and the Q-function is only updated once this occurs. While conceptually both Options
and Coarse Q-learning are similar to our approach, both frameworks were developed largely for linear function
approximators.
Interestingly, while our idea is relatively simple, the concept of consolidating experiences based
on state similarity to address value-based divergence remains largely unexplored in deep rein-
forcement learning. Early works like the Rainbow agent incorporated multi-step updates into DQN
(Hessel et al., 2018; Van Hasselt et al., 2018), surpassing the super-human performance bench-
marks set by the DQN and Double DQN agents. While the study indicated that the incorporation
of multi-step methods had a strong positive impact on overall agent performance, it was limited
in sample size due to the computational intensity of the Atari games. Ceron and Castro (2021)
further investigated this with a large exploratory set of experiments and were able to show in an
ablation study that the inclusion of multi-step updates in DQN yielded strong performance gains,
but the study was limited, as it only compared single-step DQN and n-step DQN based on the same
set of hyper-parameters. As single-step and n-step DQN are different algorithms, they may have a
different set of optimal hyper-parameters. Meng et al. (2021) showed empirically that a step size
where n = 1 was comparatively more prone to inaccurate estimations of the Q-value in comparison
to values where n > 1. The authors also observed an improvement in training performance when
the target was abstracted by an aggregate function over a set time horizon. A similar conclusion was
again reached by Chiang et al. (2020) who experimented with a mixture of different step sizes and
achieved stronger agent performance across a series of Atari test beds than single-step DQN. The
authors hypothesised that the incremental improvement in agent performance resulted from an
increased level of heterogeneity in the samples collected in the replay memory. While the study
was, in general, limited in sample size, having used only 3 random seeds as empirical evidence, it
provided interesting insights into the potential of mixing step sizes to address both the performance
and estimation deficiencies experienced with single-step DQN. Other papers, like Deng et al. (2020),
also recognised the value of multi-step updates, but their approach focused on discount rates that
further regularise the accumulated reward to moderate the overestimation potential in multi-step
updates.
Moreover, there also exists a wide body of work (Van Hasselt et al., 2016; Hasselt, 2010;
Anschel, Baram and Shimkin, 2017; Wang, Schaul, Hessel, Hasselt, Lanctot and Freitas, 2016;
Colas, Sigaud and Oudeyer, 2018; Dabney, Rowland, Bellemare and Munos, 2018; Lan, Pan, Fyshe
and White, 2019) that, while only indirectly related to our investigation, should be noted, as these are
widely used approaches to address the divergent behaviours exhibited by DQN and serve as good
benchmarks for our analysis. Of this large body of work, Double DQN and Average DQN will be
used as benchmark algorithms in this paper; both algorithms have shown efficacy in tempering
overestimation in DQN. Double DQN argues that the max operator used in DQN to both select
and evaluate actions inadvertently leads the agent to be over-optimistic in its value estimations.
To address the problem of overestimation induced by the max operator, Double DQN decouples the
use of the max operator in action selection and evaluation. In contrast, Average DQN was
able to achieve a much more stable agent by taking the average of the k previous state-action value
estimates. This allowed the agent to produce more conservative and moderate state-action values
over time, thus reducing the overestimation. It is also possible, as explored by Sabry and Khalifa
(2019), that overestimation can be addressed through altering the neural network structure; their
empirical investigations showed that the use of dropout layers
may help alleviate overestimation.
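Concretely, the decoupling in Double DQN is commonly written as a target in which the online network (parameters θ) selects the action and the target network (parameters θ⁻) evaluates it:

Y_t = R_{t+1} + \gamma \, Q\big(S_{t+1}, \arg\max_{a} Q(S_{t+1}, a; \theta_t); \theta_t^{-}\big)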
Another area related to our approach comprises algorithmic methods that use contrastive
learning and representation learning to remove noise and extract low-dimensional features from
the observational space (Laskin, Srinivas and Abbeel, 2020; Misra, Henaff, Krishnamurthy and
Langford, 2020; Pathak, Agrawal, Efros and Darrell, 2017; Efroni, Misra, Krishnamurthy, Agarwal
and Langford, 2021). A major difference is that contrastive and representation learning
approaches applied in a reinforcement learning context, such as CURL (Laskin
et al., 2020), act as preprocessing techniques that extract features from high-dimensional observation
spaces and project them into a more enriched, lower-dimensional space. Our algorithm, on the other
hand, builds a mechanism into the DQN to extract intermediate feature outputs from feed-forward layers
to be used as training observations for an unsupervised learning algorithm; this component, while
non-trivial, represents a small portion of the algorithm.
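As an illustration of how such a mechanism could look (the network shape, layer choice and the clustering step mentioned in the comments are assumptions for this sketch, not the paper's exact configuration), intermediate activations from a DQN's feed-forward layers can be exposed for an unsupervised learner:

```python
import torch
import torch.nn as nn

class SmallDQN(nn.Module):
    """Toy DQN whose hidden-layer activations can be reused as state features."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_actions)

    def features(self, obs):
        # Intermediate feed-forward output, usable as a compact state representation.
        with torch.no_grad():
            return self.body(obs)

    def forward(self, obs):
        return self.head(self.body(obs))

# The feature vectors returned by `features` could then be grouped by an
# unsupervised learner (e.g. k-means), and two consecutive states treated as
# "similar" when they fall into the same group.
```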
Across the literature, overestimation is typically measured in terms of how far the Q-value
diverges from the realisable true value of Q. In Fujimoto, Hoof and Meger (2018), the true value is
calculated as the average discounted return when the agent follows the current policy. In contrast,
Hasselt (2010) and Van Hasselt et al. (2016) measured overestimation based on how far it deviated
from the realisable range of the true absolute Q-value (|Q|) – this was possible because the
rewards in the environment were bounded between [-1, 1] and the discount factor γ was 0.99.
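Under these conditions, the realisable range follows directly from the geometric series bound:

|Q| \le \sum_{i=0}^{\infty} \gamma^{i} \cdot 1 = \frac{1}{1 - \gamma} = \frac{1}{1 - 0.99} = 100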
3 Elastic Step Deep Q-Networks
The objective of Elastic Step DQN is to incorporate the ideas from Coarse Q-learning and multi-step
DQN to reduce overestimation and improve the overall performance of DQN. To achieve this, there
needs to be a method of identifying whether two states are similar or dissimilar that is agnostic to