
instead of its maximum expected value, an issue which, as discussed by Van Hasselt (2011), dates back to research in order statistics (Clark, 1961). As a result, most recent work has aimed to reduce Q-Learning's overestimation bias by replacing its max operator: Maxmin Q-Learning (Lan et al., 2020) controls it by trying to reduce the estimated variance
of the different state-action values; whereas Variation Resistant Q-Learning (Pentaliotis & Wiering, 2021) does so by keeping track of past state-action value estimates that can then be used when constructing the TD-target of the algorithm. On a similar note, Karimpanal et al. (2021) also define a novel TD-target which is a convex combination of a pessimistic and an optimistic term. A relatively older approach is that of Van Hasselt (2010), who introduced the double estimator approach, where one estimator is used for choosing the maximizing action while the other is used for determining its value. This approach plays a central role in his Double Q-Learning algorithm as well as in the more recent Weighted Double Q-Learning (Zhang et al., 2017), Double Delayed Q-Learning (Abed-alguni & Ottom, 2018) and Self-Correcting Q-Learning (Zhu & Rigotti, 2021) algorithms. The recent rise of Deep Reinforcement Learning has shown that the overestimation bias of Q-Learning plays an even more important role when model-free RL algorithms are combined with deep neural networks, which in turn has resulted in a large body of work studying this phenomenon outside the tabular RL setting (Van Hasselt et al., 2016; Fujimoto et al., 2018; Kim et al., 2019; Cini et al., 2020; Sabatelli et al., 2020; Peer et al., 2021).
2. Methods
While, as explained earlier, reducing Q-Learning's overestimation bias has mainly been done by focusing on its max operator, in this paper we take a different approach. Before introducing it, let us recall that Q-Learning learns Q∗(s, a) as follows:

$$Q(s_t, a_t) := Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (5)$$
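For concreteness, the update in Eq. (5) can be sketched in a few lines of Python. This is an illustrative tabular implementation, not code from the paper; the table sizes and hyperparameter values are placeholders.

```python
import numpy as np

# Illustrative tabular Q-Learning update of Eq. (5).
# n_states, n_actions, alpha and gamma are placeholder values.
n_states, n_actions = 9, 4
alpha, gamma = 0.1, 0.95
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, terminal):
    """Move Q(s, a) towards the TD-target r_t + gamma * max_a' Q(s_{t+1}, a')."""
    td_target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The `np.max` over the next state's action values is the maximization estimator that the works discussed above seek to replace.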
2.1. Factors of Influence
Instead of replacing the maximization estimator, we investi-
gate whether overestimation can be prevented by tuning the
following parameters:
1. The learning rate α, also denoted as the step-size parameter, controls the extent to which a certain state-action tuple gets updated with respect to the TD-target. Typically, small values imply slow convergence, while larger values may lead to divergence (Pirotta et al., 2013). It is well known that Q-Learning's maximization estimator exacerbates the divergence of the algorithm; we therefore investigate whether its overestimation can be controlled by adopting learning rates which are low and fixed, instead of linearly or exponentially decaying as done by Van Hasselt (2010).
2. The discount factor γ controls the trade-off between immediate and long-term rewards. While for many years it was considered best practice to set γ to a constant value as close as possible to 1, more recent research has demonstrated that this might not always be the best approach (Van Seijen et al., 2019). In fact, a constant value of γ has proven to yield time-inconsistent behaviours (Lattimore & Hutter, 2014), failures in modelling an agent's preferences (Pitis, 2019) and sub-optimal exploration (François-Lavet et al., 2015). As mentioned by Fedus et al. (2019), there seems to be a growing tension between the original γ formulation and current RL research, which, however, has not yet been studied from an overestimation-bias perspective, a limitation which we start addressing in this work.
3. The reward signal r_t causes overestimation in environments where rewards are stochastic. The larger the variance of the stochastic rewards, the higher the potential for overoptimistic values to accumulate and propagate through the system. However, if one averages the reward observed for a certain state-action pair over time, these averaged values deviate from the true mean with a smaller variance. Therefore, we examine if overestimation can be reduced by using an exponential moving average in Q-Learning's TD-target, which is computed as follows (see the code sketch after this list):

$$\hat{r}(s) \mathrel{+}= \frac{1}{x}\big(r(t) - \hat{r}(s)\big), \quad (6)$$

where x is a static hyperparameter determining the degree of weighting decrease.
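As a concrete illustration, the averaged reward of Eq. (6) could be maintained as sketched below. Keying the average by state follows the equation; the variable names, the placeholder value of x, and the final comment showing how r̂ would replace r_t in the TD-target are our own illustrative assumptions.

```python
from collections import defaultdict

x = 10.0                    # static weighting-decrease hyperparameter (placeholder value)
r_hat = defaultdict(float)  # exponential moving average of the reward, keyed by state

def averaged_reward(s, r):
    """Update and return the exponential moving average of Eq. (6) for state s."""
    r_hat[s] += (1.0 / x) * (r - r_hat[s])
    return r_hat[s]

# In the Q-Learning update of Eq. (5), r_t would then be replaced by r_hat(s_t), e.g.:
# td_target = averaged_reward(s, r) + gamma * np.max(Q[s_next])
```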
2.2. Experimental Setup
We examine the effect on overestimation and performance of keeping α low and static, lowering γ, and using an averaged reward signal r̂ instead of r_t in three different environments, and compare the performance of Q-Learning (QL) to that of Double Q-Learning (Van Hasselt, 2010) (DQL) and Self-Correcting Q-Learning (Zhu & Rigotti, 2021) (SCQL).
For the tabular setting, we use the Gridworld environment initially proposed by Van Hasselt (2010). The environment is a 3×3 grid with stochastic rewards in non-terminal states drawn from a Bernoulli distribution over {−12, +10}, and a fixed reward of +5 in the terminal state.
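A minimal sketch of such an environment is given below; only the grid size and the reward structure follow the description above, while the start and goal positions, the action set and the deterministic transitions are our own assumptions.

```python
import numpy as np

class Gridworld3x3:
    """Sketch of a 3x3 grid world with Bernoulli rewards (-12 or +10 with equal
    probability) in non-terminal states and a fixed reward of +5 at the goal.
    Start/goal placement and deterministic moves are illustrative assumptions."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.goal = (0, 2)   # assumed terminal cell
        self.reset()

    def reset(self):
        self.pos = (2, 0)    # assumed start cell
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.pos
        self.pos = (min(max(r + dr, 0), 2), min(max(c + dc, 0), 2))
        if self.pos == self.goal:
            return self.pos, 5.0, True                       # fixed terminal reward
        reward = 10.0 if self.rng.random() < 0.5 else -12.0  # stochastic reward
        return self.pos, reward, False
```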
We also test the effect of α, γ and r̂ on the OpenAI gym (Brockman et al., 2016) Blackjack-v0 environment, which simulates Blackjack including its stochastic state transitions and stochastic rewards. Lastly, we consider the function approximator case