
instead of its maximum expected value, an issue which, as discussed by Van Hasselt (2011), dates back to research in order statistics (Clark, 1961). As a result, most recent work has aimed to reduce Q-Learning's overestimation bias by replacing its max operator: Maxmin Q-Learning (Lan et al., 2020) controls it by trying to reduce the estimated variance
of the different state-action values; whereas Variation Resistant Q-Learning (Pentaliotis & Wiering, 2021) does so by keeping track of past state-action value estimates that can then be used when constructing the TD-target of the algorithm. On a similar note, Karimpanal et al. (2021) also define a novel TD-target which is a convex combination of a pessimistic and an optimistic term. A relatively older approach is that of Van Hasselt (2010), who introduced the double estimator approach, where one estimator is used for choosing the maximizing action while the other is used for determining its value. This approach plays a central role in his Double Q-Learning algorithm as well as in the more recent Weighted Double Q-Learning (Zhang et al., 2017), Double Delayed Q-Learning (Abed-alguni & Ottom, 2018) and Self-Correcting Q-Learning (Zhu & Rigotti, 2021) algorithms. The recent rise of Deep Reinforcement Learning has shown that the overestimation bias of Q-Learning plays an even more important role when model-free RL algorithms are combined with deep neural networks, which in turn has resulted in a large body of work studying this phenomenon outside the tabular RL setting (Van Hasselt et al., 2016; Fujimoto et al., 2018; Kim et al., 2019; Cini et al., 2020; Sabatelli et al., 2020; Peer et al., 2021).
2. Methods
While, as explained earlier, reducing Q-Learning's overestimation bias has mainly been done by focusing on its max operator, in this paper we take a different approach. Before introducing it, let us recall that Q-Learning learns Q∗(s, a) as follows:

$$Q(s_t, a_t) := Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (5)$$
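For concreteness, the update in Eq. (5) can be sketched in a few lines of Python. This is an illustrative tabular implementation, not code from the paper; the table sizes and hyperparameter values are placeholders.

```python
import numpy as np

# Illustrative tabular Q-Learning update of Eq. (5).
# n_states, n_actions, alpha and gamma are placeholder values.
n_states, n_actions = 9, 4
alpha, gamma = 0.1, 0.95
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, terminal):
    """Move Q(s, a) towards the TD-target r_t + gamma * max_a' Q(s_{t+1}, a')."""
    td_target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The `np.max` over the next state's action values is the maximization estimator that the works discussed above seek to replace.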
2.1. Factors of Influence
Instead of replacing the maximization estimator, we investi-
gate whether overestimation can be prevented by tuning the
following parameters:
1. The learning rate α, also denoted as the step-size parameter, controls the extent to which a certain state-action tuple gets updated with respect to the TD-target. Typically, small values imply slow convergence, while larger values may lead to divergence (Pirotta et al., 2013). It is well known that Q-Learning's maximization estimator exacerbates the divergence of the algorithm; we therefore investigate whether its overestimation can be controlled by adopting learning rates which are low and fixed, instead of linearly or exponentially decaying as done by Van Hasselt (2010).
2. The discount factor γ controls the trade-off between immediate and long-term rewards. While for many years it was considered best practice to set γ to a constant value as close as possible to 1, more recent research has demonstrated that this might not always be the best approach (Van Seijen et al., 2019). In fact, a constant value of γ has proven to yield time-inconsistent behaviours (Lattimore & Hutter, 2014), failures in modelling an agent's preferences (Pitis, 2019) and sub-optimal exploration (François-Lavet et al., 2015). As mentioned by Fedus et al. (2019), there seems to be a growing tension between the original γ formulation and current RL research, which, however, has not yet been studied from an overestimation-bias perspective, a limitation which we start addressing in this work.
3. The reward signal r_t causes overestimation in environments where rewards are stochastic. The larger the variance of the stochastic rewards, the higher the potential for overoptimistic values to accumulate and propagate through the system. However, if one averages the reward observed for a certain state-action pair over time, these averaged values deviate from the true mean with a smaller variance. Therefore, we examine if overestimation can be reduced by using an exponential moving average in Q-Learning's TD-target, which is computed as follows (see the code sketch after this list):

$$\hat{r}(s) \mathrel{+}= \frac{1}{x}\big(r(t) - \hat{r}(s)\big), \quad (6)$$

where x is a static hyperparameter determining the degree of weighting decrease.
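As a concrete illustration, the averaged reward of Eq. (6) could be maintained as sketched below. Keying the average by state follows the equation; the variable names, the placeholder value of x, and the final comment showing how r̂ would replace r_t in the TD-target are our own illustrative assumptions.

```python
from collections import defaultdict

x = 10.0                    # static weighting-decrease hyperparameter (placeholder value)
r_hat = defaultdict(float)  # exponential moving average of the reward, keyed by state

def averaged_reward(s, r):
    """Update and return the exponential moving average of Eq. (6) for state s."""
    r_hat[s] += (1.0 / x) * (r - r_hat[s])
    return r_hat[s]

# In the Q-Learning update of Eq. (5), r_t would then be replaced by r_hat(s_t), e.g.:
# td_target = averaged_reward(s, r) + gamma * np.max(Q[s_next])
```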
2.2. Experimental Setup
We examine the effect on overestimation and performance of keeping α low and static, lowering γ, and using an averaged reward signal r̂ instead of r_t in three different environments, and compare the performance of Q-Learning (QL) to that of Double Q-Learning (Van Hasselt, 2010) (DQL) and Self-Correcting Q-Learning (Zhu & Rigotti, 2021) (SCQL).
For the tabular setting, we use the Gridworld environment initially proposed by Van Hasselt (2010). The environment is a 3×3 grid with stochastic rewards in non-terminal states drawn from a Bernoulli distribution over {−12, +10}, and a fixed reward of +5 in the terminal state.
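A minimal sketch of such an environment is given below; only the grid size and the reward structure follow the description above, while the start and goal positions, the action set and the deterministic transitions are our own assumptions.

```python
import numpy as np

class Gridworld3x3:
    """Sketch of a 3x3 grid world with Bernoulli rewards (-12 or +10 with equal
    probability) in non-terminal states and a fixed reward of +5 at the goal.
    Start/goal placement and deterministic moves are illustrative assumptions."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.goal = (0, 2)   # assumed terminal cell
        self.reset()

    def reset(self):
        self.pos = (2, 0)    # assumed start cell
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r, c = self.pos
        self.pos = (min(max(r + dr, 0), 2), min(max(c + dc, 0), 2))
        if self.pos == self.goal:
            return self.pos, 5.0, True                       # fixed terminal reward
        reward = 10.0 if self.rng.random() < 0.5 else -12.0  # stochastic reward
        return self.pos, reward, False
```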
We also test the effect of α, γ and r̂ on the OpenAI gym (Brockman et al., 2016) Blackjack-v0 environment, which simulates Blackjack including its stochastic state transitions and stochastic rewards. Lastly, we consider the function approximator case