Elastic Step DQN: A novel multi-step algorithm to alleviate
overestimation in Deep Q-Networks
Adrian Ly1, Richard Dazeley2, Peter Vamplew3, Francisco Cruz4, and Sunil Aryal5
1,2,5 Deakin University
3 Federation University
4 UNSW
Abstract
The Deep Q-Network (DQN) algorithm was the first reinforcement learning algorithm to use
deep neural networks to successfully surpass human-level performance in a number of Atari
learning environments. However, divergent and unstable behaviour have been long-standing
issues in DQNs. The unstable behaviour is often characterised by overestimation in the Q-
values, commonly referred to as the overestimation bias. To address the overestimation bias and
the divergent behaviour, a number of heuristic extensions have been proposed. Notably, multi-
step updates have been shown to drastically reduce unstable behaviour while improving an agent’s
training performance. However, agents are often highly sensitive to the selection of the multi-
step update horizon (n), and our empirical experiments show that a poorly chosen static value
for n can in many cases lead to worse performance than single-step DQN. Inspired by the success
of n-step DQN and the effects that multi-step updates have on overestimation bias, this paper
proposes a new algorithm that we call ‘Elastic Step DQN’ (ES-DQN). It dynamically varies the
step size horizon in multi-step updates based on the similarity of states visited. Our empirical
evaluation shows that ES-DQN outperforms n-step DQN with a fixed n, Double DQN and
Average DQN in several OpenAI Gym environments while at the same time alleviating the
overestimation bias.
arXiv:2210.03325v1 [cs.LG] 7 Oct 2022
1 Introduction
One of the most prominent reinforcement learning algorithms is Q-learning (Watkins and Dayan,
1992), which has been shown to converge to optimal policies when used with lookup tables. However,
tabular Q-learning is limited to small, toy environments, and lookup tables are compu-
tationally inefficient when environments have large state-action spaces. To address the scalability
of Q-learning, deep neural networks became a viable alternative to lookup tables to approximate
state-action values in large continuous spaces. One of the most popular algorithms is the Deep Q-
Network (DQN) (Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra and Riedmiller, 2013;
Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski
et al., 2015) which uses neural networks and introduced the concept of a target network and replay
memory with great success in the Atari games environment. In spite of the success, training insta-
bility and divergent behaviour are regularly observed in DQN (Sutton and Barto, 2018; Van Hasselt,
Guez and Silver, 2016). The divergent behaviour itself is not specific to DQN, and has also
been heavily investigated for function approximators in the past (Sutton and Barto, 2018; Baird,
1995; Tsitsiklis and Van Roy, 1997) with a number of linear solutions having been proposed to
address the problem (Maei, Szepesvari, Bhatnagar, Precup, Silver and Sutton, 2009; Baird, 1995).
Overestimation of the Q-values has frequently been identified as a key cause of
sub-optimal learning and divergent behaviour – this was thoroughly investigated by Thrun and
Schwartz (1993), who attributed the issue to noise generated through function approximation. On
the other hand, Hasselt (2010) theorised that the overestimation originates from the max operator
used in the Q-value update, which biases estimates towards larger values and a more optimistic
outlook. Van Hasselt, Doron, Strub, Hessel, Sonnerat and Modayil (2018) and Van Hasselt et al.
(2016) have suggested that by correcting for overestimation, an agent is much less susceptible to
divergent behaviours.
Empirical observations by Van Hasselt et al. (2018) and Hessel, Modayil, Van Hasselt, Schaul,
Ostrovski, Dabney, Horgan, Piot, Azar and Silver (2018) characterised a number of plausible archi-
tectural and algorithmic mechanisms that may lead to divergent behaviour, as well as suggestions that
may alleviate divergence. Van Hasselt et al. (2016) also hypothesised that multi-step returns are
likely to reduce the occurrence of divergence. This idea that multi-step DQN updates can regulate
and reduce divergent behaviour while at the same time return stronger training performance is
not without merit. Multi-step implementations like the Rainbow agent (Hessel et al., 2018) and
Mixed Multi-step DDPG (Meng, Gorbet and Kulić, 2021) have shown empirically that multi-step
updates under certain conditions are more stable, and in many circumstances can circumvent the
divergence problem that exists in single-step DQN. However, studies have also shown that multi-
step DQN updates are highly sensitive to the selection of the value of n, which if poorly chosen
can be detrimental to learning (Hessel et al., 2018; Chiang, Yang, Hong and Lee, 2020; Horgan,
Quan, Budden, Barth-Maron, Hessel, Van Hasselt and Silver, 2018; Deng, Yin, Deng and Li, 2020;
Fedus, Ramachandran, Agarwal, Bengio, Larochelle, Rowland and Dabney, 2020). This raises the
question: is a static value of n the best approach to multi-step updates, or is it possible to
dynamically select the n parameter and take advantage of the special properties that multi-step
updates provide?
One possible approach is inspired by the work of Dazeley, Vamplew and Bignold (2015), who
identified an issue where agents may diverge in a grid world setting under linear approximation due
to a self-referential learning loop problem. It was suggested that the consolidation of consecutive
states, and the treatment of them as sub-states of a larger state, can encourage more stable algo-
rithmic performance. However, the ideas were limited to a grid world setting under linear function
approximation. In this paper, we extend the ideas of Dazeley et al. (2015) into a deep reinforcement
learning context by leveraging a multi-step update.
As a combination of these ideas, this paper will introduce Elastic Step DQN (ES-DQN) – a
novel multi-step update that dynamically selects the step size horizon based on consecutive state
similarity. The intuition behind consolidating similar states is broadly inspired by spatial
information theory, which has identified that people have a natural inclination to use
landmarks when providing route directions (Michon and Denis, 2001; Klippel and Winter, 2005). When
providing directions, individuals tend to summarise the instructions around broader objectives and
use landmarks to anchor and help generate mental images (e.g. ‘walk down to the end of the hallway
and turn left once you see the toilet’, in comparison to more detailed step-by-step descriptions
– ‘take one step, then another step, until you reach the end of the hallway’). Our experimental
results indicate that the systematic accumulation of similar experiences achieves performance
comparable to well-established DQN extensions and superior to fixed n-step updates.
Furthermore, to achieve statistical significance in the experiments, this paper
primarily uses three small environments as the test bed for this investigation.
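To make this intuition concrete, the sketch below shows one way an elastic multi-step target could be formed by accumulating rewards while consecutive states remain similar. The similarity test is_similar, the cap n_max and the transition layout are hypothetical placeholders for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of an "elastic" multi-step return: rewards are accumulated
# while consecutive states are judged similar, so the bootstrap horizon n is not
# fixed in advance. `is_similar` is a placeholder for whatever similarity test is used.
def elastic_step_target(transitions, q_max_next, gamma, is_similar, n_max=8):
    """Build a multi-step target whose horizon grows while states stay similar.

    transitions: list of (state, reward, next_state, done) starting at time t.
    q_max_next: callable returning max_a Q(state, a) from the target network.
    """
    g, discount, n = 0.0, 1.0, 0
    for state, reward, next_state, done in transitions:
        g += discount * reward          # accumulate gamma^i * R_{t+i+1}
        discount *= gamma
        n += 1
        # Stop extending the horizon once the episode ends, the states stop
        # being similar, or an (assumed) safety cap on n is reached.
        if done or n >= n_max or not is_similar(state, next_state):
            break
    last_next_state, episode_done = transitions[n - 1][2], transitions[n - 1][3]
    target = g if episode_done else g + discount * q_max_next(last_next_state)
    return target, n  # n is the elastic step size actually used
```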
The following sections will investigate the efficacy of multi-step updates against single-step
DQN, as well as introduce a novel multi-step DQN update that automates the selection of the step
size horizon. The algorithm will also be compared against Double DQN and Average DQN, which
are well-established methods for reducing overestimation. This paper makes four key contributions:
• the idea of dynamic multi-step updates based on state similarity,
• the introduction of ES-DQN to select the n-step horizon dynamically,
• an in-depth empirical exploration of n-step DQN against ES-DQN, and
• an in-depth empirical analysis of algorithmic stability when grouping together similar states.
2 Related works
2.1 Background
The training of a reinforcement learning agent is typically unguided and is characterised by a
Markov decision process where, at each time step t, the agent observes a particular state S_t
before taking an action A_t. The agent then receives a reward R_{t+1} and enters a
new state S_{t+1} – the main objective is for the agent to learn an optimal policy that maximises the
expected discounted accumulation of rewards G_t at time t over k steps into the future, with discount
factor γ:
G_t = \sum_{i=0}^{k-1} \gamma^{i} R_{t+i+1}
Classical reinforcement learning algorithms rely largely on lookup tables to store state-action
values to learn an optimal policy – one such algorithm is Q-learning, which was first introduced by
Watkins and Dayan (1992) and has since been a focal point of value-based reinforcement learning
approaches. As a Q-learning agent navigates its environment and receives responses from the
environment, the lookup table is updated at every time step, where Q is the state-action value
function, α is the learning rate, R is the reward, and A′ is an action selected from a policy derived
from Q (Sutton and Barto, 2018):
Q_{new}(S, A) \leftarrow Q(S, A) + \alpha \big[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \big]
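A minimal dictionary-backed sketch of this tabular update is shown below; the action set, learning rate and discount factor are illustrative choices rather than values taken from the paper.

```python
from collections import defaultdict

# Lookup table mapping (state, action) pairs to Q-values, defaulting to 0.0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update matching the equation above."""
    # max_{A'} Q(S', A') over the actions available in the next state
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
```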
Multi-step updates (Sutton and Barto, 2018) expand on the single-step update by temporally extending the
step size horizon of the target by the value of n. Traditionally, multi-step updates were infrequently
used because they were inconvenient to implement and served mainly as an intermediate step towards
eligibility traces. However, multi-step updates have been shown to allow faster convergence (Sutton
and Barto, 2018; Ceron and Castro, 2021) when the value of n is well-tuned.
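For a fixed horizon n, the corresponding n-step Q-learning target can be written as follows (shown here for reference):

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \max_{A'} Q(S_{t+n}, A')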
2.2 Related literature
Conceptually, the multi-step update proposed in this paper extends the ideas introduced in
Options (Sutton, Precup and Singh, 1999) and Coarse Q-learning (Dazeley et al., 2015). Options
were first formalised as a framework that allows agents to take sub-actions under a single policy until
a termination condition is activated. The termination condition can come in the form of a sub-
goal being achieved or a certain temporal horizon being reached. Similarly, Coarse Q-learning builds
on the Options framework and explores the issue of a wandering agent in a coarsely quantised
environment. Dazeley et al. (2015) showed that divergence under quantisation can be explained by
a wandering phenomenon where an agent, when trapped in the same state for multiple updates, may
wander aimlessly and exhibit unstable learning. To address this behaviour, Dazeley et al. (2015)
introduced clamped actions, which force the agent to take the same action until the quantised state
is exited, and the Q-function is only updated once this occurs. While conceptually both Options
and Coarse Q-learning are similar to our approach, both frameworks were developed largely for linear function
approximators.
Interestingly, while our idea is relatively simple, the concept of consolidating experiences based
on state similarity to address value-based divergence remains largely unexplored in deep rein-
forcement learning. Early works like the Rainbow agent incorporated multi-step updates into DQN
(Hessel et al., 2018; Van Hasselt et al., 2018), surpassing the super-human performance bench-
marks set by the DQN and Double DQN agents. While the study indicated that the incorporation
of multi-step methods had a strong positive impact on overall agent performance, it was limited
in sample size due to the computational intensity of the Atari games. Ceron and Castro (2021)
further investigated this with a large exploratory set of experiments and were able to show in an
ablation study that the inclusion of multi-step updates in DQN yielded strong performance gains,
but the study was limited, as it only compared single-step DQN and n-step DQN based on the same
set of hyper-parameters. As single-step and n-step DQN are different algorithms, they may have a
different set of optimal hyper-parameters. Meng et al. (2021) showed empirically that a step size
where n = 1 was comparatively more prone to inaccurate estimations of the Q-value in comparison
to values where n > 1. The authors also observed an improvement in training performance when
the target was abstracted by an aggregate function over a set time horizon. A similar conclusion was
again reached by Chiang et al. (2020) who experimented with a mixture of different step sizes and
achieved stronger agent performance across a series of Atari test beds than single-step DQN. The
authors hypothesised that the incremental improvement in agent performance resulted from an
increased level of heterogeneity in the samples collected in the replay memory. While the study
was, in general, limited in sample size, having used only 3 random seeds as empirical evidence, it
provided interesting insights into the potential of mixing step sizes to address both the performance
and estimation deficiencies experienced with single-step DQN. Other papers, like Deng et al. (2020),
also recognised the value of multi-step updates, but their approach focused on discount rates that
further regularise the accumulated reward to moderate the overestimation potential in multi-step
updates.
Moreover, there also exists a wide body of work (Van Hasselt et al., 2016; Hasselt, 2010;
Anschel, Baram and Shimkin, 2017; Wang, Schaul, Hessel, Hasselt, Lanctot and Freitas, 2016;
Colas, Sigaud and Oudeyer, 2018; Dabney, Rowland, Bellemare and Munos, 2018; Lan, Pan, Fyshe
and White, 2019) that, while only indirectly related to our investigation, should be noted, as these are
widely used approaches to address the divergent behaviours exhibited by DQN and serve as good
benchmarks for our analysis. Of this large body of work, Double DQN and Average DQN will be
used as benchmark algorithms in this paper; both algorithms have shown efficacy in tempering
overestimation in DQN. Double DQN argues that the max operator used in DQN to both select
and evaluate actions inadvertently leads the agent to be over-optimistic in its value estimations.
To address the problem of overestimation induced by the max operator, Double DQN decouples the
use of the max operator in action selection and evaluation. In contrast, Average DQN was
able to achieve a much more stable agent by taking the average of the k previous state-action value
estimates. This allowed the agent to produce more conservative and moderate state-action values
over time, thus reducing the overestimation. It is also possible, as explored by Sabry and Khalifa
(2019), that overestimation can be addressed through altering the neural network structure; their
empirical investigations showed that the use of dropout layers
may help alleviate overestimation.
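Concretely, the decoupling in Double DQN is commonly written as a target in which the online network (parameters θ) selects the action and the target network (parameters θ⁻) evaluates it:

Y_t = R_{t+1} + \gamma \, Q\big(S_{t+1}, \arg\max_{a} Q(S_{t+1}, a; \theta_t); \theta_t^{-}\big)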
Another area related to our approach comprises algorithmic methods that use contrastive
learning and representation learning to remove noise and extract low-dimensional features from
the observational space (Laskin, Srinivas and Abbeel, 2020; Misra, Henaff, Krishnamurthy and
Langford, 2020; Pathak, Agrawal, Efros and Darrell, 2017; Efroni, Misra, Krishnamurthy, Agarwal
and Langford, 2021). A major difference is that contrastive and representation learning
approaches applied in a reinforcement learning context, such as CURL (Laskin
et al., 2020), act as preprocessing techniques that extract features from high-dimensional observation
spaces and project them into a more enriched, lower-dimensional space. Our algorithm, on the other
hand, builds a mechanism into the DQN to extract intermediate feature outputs from feed-forward layers
to be used as training observations for an unsupervised learning algorithm; this component, while
non-trivial, represents a small portion of the algorithm.
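As an illustration of how such a mechanism could look (the network shape, layer choice and the clustering step mentioned in the comments are assumptions for this sketch, not the paper's exact configuration), intermediate activations from a DQN's feed-forward layers can be exposed for an unsupervised learner:

```python
import torch
import torch.nn as nn

class SmallDQN(nn.Module):
    """Toy DQN whose hidden-layer activations can be reused as state features."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_actions)

    def features(self, obs):
        # Intermediate feed-forward output, usable as a compact state representation.
        with torch.no_grad():
            return self.body(obs)

    def forward(self, obs):
        return self.head(self.body(obs))

# The feature vectors returned by `features` could then be grouped by an
# unsupervised learner (e.g. k-means), and two consecutive states treated as
# "similar" when they fall into the same group.
```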
Across the literature, overestimation is typically measured in terms of how far the Q-value
diverges from the realisable true value of Q. In Fujimoto, Hoof and Meger (2018), the true value is
calculated as the average discounted return when the agent follows the current policy. In contrast,
Hasselt (2010) and Van Hasselt et al. (2016) measured overestimation based on how far it deviated
from the realisable range of the true absolute Q-value (|Q|) – this was possible because the
rewards in the environment were bounded between [-1, 1] and the discount factor γ was 0.99.
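Under these conditions, the realisable range follows directly from the geometric series bound:

|Q| \le \sum_{i=0}^{\infty} \gamma^{i} \cdot 1 = \frac{1}{1 - \gamma} = \frac{1}{1 - 0.99} = 100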
3 Elastic Step Deep Q-Networks
The objective of Elastic Step DQN is to incorporate the ideas from Coarse Q-learning and multi-step
DQN to reduce overestimation and improve the overall performance of DQN. To achieve this, there
needs to be a method of identifying whether two states are similar or dissimilar that is agnostic to