1 Introduction
One of the most prominent reinforcement learning algorithms is Q-learning (Watkins and Dayan,
1992), which has been shown to converge to optimal policies when used with lookup tables. However,
tabular Q-learning is limited to small, toy environments, and lookup tables are computationally inefficient when environments have large state-action spaces. To address the scalability
of Q-learning, deep neural networks became a viable alternative to lookup tables to approximate
state-action values in large continuous spaces. One of the most popular algorithms is the Deep Q-
Network (DQN) (Mnih, Kavukcuoglu, Silver, Graves, Antonoglou, Wierstra and Riedmiller, 2013;
Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski
et al., 2015), which uses neural networks to approximate Q-values and introduced the concepts of a target network and replay memory, with great success in the Atari games environment. In spite of this success, training instability and divergent behaviour are regularly observed in DQN (Sutton and Barto, 2018; Van Hasselt, Guez and Silver, 2016). The divergent behaviour itself is not specific to DQN and has also
been heavily investigated for function approximators in the past (Sutton and Barto, 2018; Baird,
1995; Tsitsiklis and Van Roy, 1997), with a number of linear solutions proposed to address the problem (Maei, Szepesvari, Bhatnagar, Precup, Silver and Sutton, 2009; Baird, 1995).
Overestimation of the Q-values has frequently been identified as one of the key causes of sub-optimal learning and divergent behaviour; this was thoroughly investigated by Thrun and Schwartz (1993), who attributed the issue to noise introduced through function approximation. On the other hand, Hasselt (2010) theorised that the overestimation originates from the max operator used in the Q-value update, which biases estimates towards larger values and a more optimistic
outlook. Van Hasselt, Doron, Strub, Hessel, Sonnerat and Modayil (2018) and Van Hasselt et al.
(2016) have suggested that by correcting for overestimation, an agent is much less susceptible to
divergent behaviours.
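To make the source of this bias concrete, the single-step targets involved can be sketched as follows (notation introduced here purely for illustration, with \(\theta\) denoting the online-network parameters and \(\theta^{-}\) the target-network parameters). Standard DQN uses the max operator both to select and to evaluate the bootstrap action, whereas Double DQN decouples the two:
\[
y_t^{\mathrm{DQN}} = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta^{-}),
\qquad
y_t^{\mathrm{Double}} = r_{t+1} + \gamma\, Q\!\left(s_{t+1}, \operatorname*{arg\,max}_{a} Q(s_{t+1}, a; \theta); \theta^{-}\right).
\]
Since \(\mathbb{E}[\max_a Q(s,a)] \ge \max_a \mathbb{E}[Q(s,a)]\) whenever the estimates are noisy, the first target is biased upwards, which is the optimistic tendency described above.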
Empirical observations by Van Hasselt et al. (2018) and Hessel, Modayil, Van Hasselt, Schaul,
Ostrovski, Dabney, Horgan, Piot, Azar and Silver (2018) characterised a number of plausible architectural and algorithmic mechanisms that may lead to divergent behaviour, as well as suggestions that
may alleviate divergence. Van Hasselt et al. (2016) also hypothesised that multi-step returns are
likely to reduce the occurrence of divergence. The idea that multi-step DQN updates can regulate and reduce divergent behaviour while also delivering stronger training performance is not without merit. Multi-step implementations such as the Rainbow agent (Hessel et al., 2018) and
Mixed Multi-step DDPG (Meng, Gorbet and Kulić, 2021) have shown empirically that multi-step
updates under certain conditions are more stable, and in many circumstances can circumvent the
divergence problem that exists in single-step DQN. However, studies have also shown that multi-
step DQN updates are highly sensitive to the selection of the value of n, which, if chosen poorly,
can be detrimental to learning (Hessel et al., 2018; Chiang, Yang, Hong and Lee, 2020; Horgan,
Quan, Budden, Barth-Maron, Hessel, Van Hasselt and Silver, 2018; Deng, Yin, Deng and Li, 2020;
Fedus, Ramachandran, Agarwal, Bengio, Larochelle, Rowland and Dabney, 2020). This raises the question: is a static value of n the best approach to multi-step updates, or is it possible to select the n parameter dynamically and take advantage of the special properties that multi-step updates provide?
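For reference, one common formulation of the n-step DQN target (again with illustrative notation, \(\theta^{-}\) denoting the target-network parameters) is
\[
y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} \max_{a} Q(s_{t+n}, a; \theta^{-}),
\]
which reduces to the single-step DQN target at n = 1. Larger values of n propagate observed rewards over more transitions before bootstrapping, reducing the reliance on a potentially overestimated bootstrap value at the cost of higher variance in the accumulated return; this trade-off underlies the sensitivity to the choice of n noted above.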
One possible approach is inspired by the work of Dazeley, Vamplew and Bignold (2015), who
identified an issue where agents may diverge in a grid world setting under linear approximation due
to a self-referential learning loop problem. It was suggested that consolidating consecutive states and treating them as sub-states of a larger state can encourage more stable algo-