Factors of Influence of the Overestimation Bias of Q-Learning
Julius Wagenbach 1   Matthia Sabatelli 1

*Equal contribution. 1 Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, The Netherlands. Correspondence to: Matthia Sabatelli <m.sabatelli@rug.nl>.
Abstract
We study whether the learning rate $\alpha$, the discount factor $\gamma$ and the reward signal $r$ have an influence on the overestimation bias of the Q-Learning algorithm. Our preliminary results in environments which are stochastic and that require the use of neural networks as function approximators show that all three parameters influence overestimation significantly. By carefully tuning $\alpha$ and $\gamma$, and by using an exponential moving average of $r$ in Q-Learning's temporal difference target, we show that the algorithm can learn value estimates that are more accurate than the ones of several other popular model-free methods that have addressed its overestimation bias in the past.
1. Introduction and Preliminaries
Reinforcement Learning (RL) is a machine learning paradigm that aims to train agents such that they can interact with an environment and maximize a numerical reward signal. While there exist numerous ways of learning from interaction, in model-free RL this is achieved by learning value functions that estimate how good it is for an agent to be in a certain state, or how good it is for the agent to take a certain action in a particular state (Sutton & Barto, 2018). The goodness/badness of state-action pairs is typically expressed in terms of expected future rewards: the higher the expected value of a state-action tuple, the better it is for the RL agent to perform a certain action in a given state. Estimating state-action values accurately is therefore key when it comes to model-free RL, as it is in fact the agent's value functions that define its actions and, as a result, allow it to interact optimally with its environment.
To express such concepts more formally, let us define the RL setting as a Markov Decision Process (MDP) represented by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ (Puterman, 2014). Its components are a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition probability distribution $P$, which defines the probability $p(s_{t+1}\,|\,s_t, a_t)$ of visiting state $s_{t+1}$ given state $s_t$ and action $a_t$ at time step $t$, a reward signal $r$, coming from the reward function $\Re(s_t, a_t, s_{t+1})$, and a discount factor $\gamma \in [0, 1)$. The actions of the RL agent are selected based on its policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$, which maps each state to an action. For every state $s \in \mathcal{S}$, under policy $\pi$ the agent's state-value function $V^{\pi}: \mathcal{S} \rightarrow \mathbb{R}$ is defined as

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, \pi\right], \qquad (1)$$

while its state-action value function $Q^{\pi}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is defined as

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, a_t = a, \pi\right]. \qquad (2)$$

The main goal for the agent is to find a policy $\pi$ that realizes the optimal expected return

$$V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \text{for all } s \in \mathcal{S}, \qquad (3)$$

and the optimal $Q$ value function

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) \quad \text{for all } s \in \mathcal{S} \text{ and } a \in \mathcal{A}. \qquad (4)$$
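As a concrete illustration of the quantity that Eqs. (1) and (2) take the expectation of, the following minimal Python sketch (our own illustration, not code from the paper) computes the discounted sum $\sum_{k} \gamma^{k} r_{t+k}$ for a finite sequence of observed rewards:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of a finite reward sequence: sum_k gamma^k * r_{t+k}.

    `rewards` holds r_t, r_{t+1}, ... observed after (s_t, a_t); V^pi and Q^pi
    in Eqs. (1)-(2) are the expectations of this quantity over trajectories.
    """
    g = 0.0
    # Iterate backwards so each step only needs one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Example: three rewards observed along a trajectory.
print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*5.0 = 5.05
```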
Learning these value functions is a well-studied problem in RL (Szepesvári, 2010; Sutton & Barto, 2018), and several algorithms have been proposed to do so. Arguably the most popular one is Q-Learning (Watkins & Dayan, 1992), which keeps track of an estimate of the optimal state-action value function $Q^{*}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ and, given an RL trajectory $\langle s_t, a_t, r_t, s_{t+1}\rangle$, updates $Q(s_t, a_t)$ with respect to the greedy target $r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a)$. Despite guaranteeing convergence to $Q^{*}(s, a)$ with probability $1$, Q-Learning is characterized by some biases that can prevent the agent from learning (Thrun & Schwartz, 1993; Van Hasselt, 2010; Lu et al., 2018).
1.1. The Overestimation Bias of Q-Learning
Among such biases, the arguably most studied one is its overestimation bias: due to the maximization operator in its Temporal Difference (TD) target $\max_{a \in \mathcal{A}} Q(s_{t+1}, a)$, Q-Learning estimates the expected maximum value of a state instead of its maximum expected value, an issue which, as discussed by Van Hasselt (2011), dates back to research in order statistics (Clark, 1961). As a result, most recent work has aimed to reduce Q-Learning's overestimation bias by replacing its $\max$ operator: Maxmin Q-Learning (Lan et al., 2020) controls it by trying to reduce the estimated variance of the different state-action values, whereas Variation Resistant Q-Learning (Pentaliotis & Wiering, 2021) does so by keeping track of past state-action value estimates that can then be used when constructing the TD-target of the algorithm. On a similar note, Karimpanal et al. (2021) also define a novel TD-target which is a convex combination of a pessimistic and an optimistic term. A relatively older approach is that of Van Hasselt (2010), who introduced the double estimator approach, where one estimator is used for choosing the maximizing action while the other is used for determining its value. This approach plays a central role in his Double Q-Learning algorithm as well as in the more recent Weighted Double Q-Learning (Zhang et al., 2017), Double Delayed Q-Learning (Abed-alguni & Ottom, 2018) and Self-Correcting Q-Learning (Zhu & Rigotti, 2021) algorithms. The recent rise of Deep Reinforcement Learning has shown that the overestimation bias of Q-Learning plays an even more important role when model-free RL algorithms are combined with deep neural networks, which in turn has resulted in a large body of work studying this phenomenon outside the tabular RL setting (Van Hasselt et al., 2016; Fujimoto et al., 2018; Kim et al., 2019; Cini et al., 2020; Sabatelli et al., 2020; Peer et al., 2021).
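To make the estimator-replacement idea concrete, here is a minimal tabular sketch of the double estimator used in Double Q-Learning (Van Hasselt, 2010): one table selects the greedy action, the other evaluates it. The dictionary-based tables, the function name and the coin flip deciding which table to update are our own illustrative choices, not details taken from the original paper.

```python
import random
from collections import defaultdict

def double_q_update(qa, qb, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One Double Q-Learning step: decouple action selection from evaluation.

    qa, qb: dict-like tables mapping (state, action) -> value estimate.
    With probability 0.5 we update qa, using qb to evaluate qa's greedy action,
    and vice versa, which removes the single-estimator maximization bias.
    """
    if random.random() < 0.5:
        a_star = max(actions, key=lambda act: qa[(s_next, act)])  # select with QA
        target = r + gamma * qb[(s_next, a_star)]                 # evaluate with QB
        qa[(s, a)] += alpha * (target - qa[(s, a)])
    else:
        b_star = max(actions, key=lambda act: qb[(s_next, act)])  # select with QB
        target = r + gamma * qa[(s_next, b_star)]                 # evaluate with QA
        qb[(s, a)] += alpha * (target - qb[(s, a)])


qa, qb = defaultdict(float), defaultdict(float)
double_q_update(qa, qb, s=0, a=1, r=-12.0, s_next=1, actions=[0, 1, 2, 3])
```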
2. Methods
While, as explained earlier, reducing Q-Learning's overestimation bias has mainly been done by focusing on its $\max$ operator, in this paper we take a different approach. Before introducing it, let us recall that Q-Learning learns $Q^{*}(s, a)$ as follows:

$$Q(s_t, a_t) := Q(s_t, a_t) + \alpha \Big[ r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) \Big]. \qquad (5)$$
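For reference, a minimal tabular implementation of the update in Eq. (5); the dictionary-based table and the example transition are our own illustrative choices and not prescribed by the paper:

```python
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> value estimate, initialized to 0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Apply Eq. (5): move Q(s, a) toward the greedy TD-target."""
    td_target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Example transition <s_t, a_t, r_t, s_{t+1}> with four available actions.
q_learning_update(s=0, a=2, r=10.0, s_next=1, actions=[0, 1, 2, 3])
```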
2.1. Factors of Influence
Instead of replacing the maximization estimator, we investigate whether overestimation can be prevented by tuning the following parameters:
1. The learning rate $\alpha$, also denoted as the step-size parameter, controls the extent to which a certain state-action tuple gets updated with respect to the TD-target. Typically, small values imply slow convergence, while larger values may lead to divergence (Pirotta et al., 2013). It is well known that Q-Learning's maximization estimator exacerbates the divergence of the algorithm; we therefore investigate whether its overestimation can be controlled by adopting learning rates which are low and fixed, instead of linearly or exponentially decaying as done by Van Hasselt (2010).
2. The discount factor $\gamma$ controls the trade-off between immediate and long-term rewards. While for many years it was considered best practice to set $\gamma$ to a constant value as close as possible to $1$, more recent research has demonstrated that this might not always be the best approach (Van Seijen et al., 2019). In fact, a constant value of $\gamma$ has been shown to yield time-inconsistent behaviours (Lattimore & Hutter, 2014), failures in modelling an agent's preferences (Pitis, 2019) and sub-optimal exploration (François-Lavet et al., 2015). As mentioned by Fedus et al. (2019), there seems to be a growing tension between the original $\gamma$ formulation and current RL research, which, however, has not yet been studied from an overestimation-bias perspective, a limitation which we start addressing in this work.
3. The reward signal $r_t$ causes overestimation in environments where rewards are stochastic. The larger the variance of the stochastic rewards, the higher the potential for overoptimistic values to accumulate and propagate through the system. However, if one averages the reward observed for a certain state-action pair over time, the averaged values deviate from the true mean with a smaller variance. Therefore, we examine whether overestimation can be reduced by using an exponential moving average in Q-Learning's TD-target, computed as follows (see the sketch after this list):

$$\hat{r}(s) \mathrel{+}= \frac{1}{x}\big(r(t) - \hat{r}(s)\big), \qquad (6)$$

where $x$ is a static hyperparameter determining the degree of weighting decrease.
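As promised above, a short sketch of how Eq. (6) can be folded into the update of Eq. (5). We read Eq. (6) as replacing $r_t$ with the per-state running average $\hat{r}(s)$ in the TD-target; the dictionary layout, the function name and the default $x = 10$ are our own illustrative assumptions, and the exact bookkeeping in the paper's experiments may differ.

```python
from collections import defaultdict

Q = defaultdict(float)      # (state, action) -> value estimate
r_hat = defaultdict(float)  # state -> exponential moving average of observed rewards

def q_learning_update_ema(s, a, r, s_next, actions, alpha=0.1, gamma=0.95, x=10.0):
    """Q-Learning step that uses the averaged reward r_hat(s) in its TD-target.

    Eq. (6): r_hat(s) += (1 / x) * (r - r_hat(s)); a larger x gives a slower,
    smoother average and hence a lower-variance reward term in the target.
    """
    r_hat[s] += (1.0 / x) * (r - r_hat[s])
    td_target = r_hat[s] + gamma * max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Example: a stochastic reward of -12 observed for the transition s=0 -> s=1.
q_learning_update_ema(s=0, a=1, r=-12.0, s_next=1, actions=[0, 1, 2, 3])
```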
2.2. Experimental Setup
We examine the effect on overestimation and performance of keeping $\alpha$ low and static, lowering $\gamma$, and using an averaged reward signal $\hat{r}$ instead of $r_t$ in three different environments, and compare the performance of Q-Learning (QL) to that of Double Q-Learning (DQL) (Van Hasselt, 2010) and Self-Correcting Q-Learning (SCQL) (Zhu & Rigotti, 2021). For the tabular setting, we use the Gridworld environment initially proposed by Van Hasselt (2010). The environment is a $3 \times 3$ grid with stochastic rewards in non-terminal states, drawn from a Bernoulli distribution over $\{-12, +10\}$, and a fixed reward of $5$ in the terminal state. We also test the effect of $\alpha$, $\gamma$ and $\hat{r}$ on the OpenAI gym (Brockman et al., 2016) Blackjack-v0 environment, which simulates Blackjack, including its stochastic state transitions and stochastic rewards. Lastly, we consider the function approximator case