Under the assumptions of the GEKF framework, we derive an iterative forward procedure for computing Q-values, which consists of a Q-value-update step, as in the Q-learning algorithm, and an expert correction step. In the correction step, our update rule weights expert information according to the agent's uncertainty about its own learning result, measured by the estimated posterior variance of the learned Q-values. Thus, the larger the variance of the Q-values, the more the expert correction encourages expert behaviours by increasing the Q-values of expert actions and decreasing those of non-expert actions. This mechanism reduces the influence of uninformative expert data and avoids excessive guided exploration.
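As a rough tabular sketch of this two-step update (the scaling factor lam, the per-action variance weighting, and the tabular setting below are illustrative assumptions, not the exact GEKF-derived rule):

import numpy as np

def q_update_with_expert_correction(q, var, s, a, r, s_next, expert_action,
                                    alpha=0.1, gamma=0.99, lam=1.0):
    # Q-value-update step, as in standard tabular Q-learning.
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])

    # Expert correction step: the correction is weighted by the estimated
    # posterior variance var[s, b], so the more uncertain the agent is about
    # its own learned Q-values, the more they are nudged towards the expert.
    for b in range(q.shape[1]):
        if b == expert_action:
            q[s, b] += lam * var[s, b]   # raise the expert action's Q-value
        else:
            q[s, b] -= lam * var[s, b]   # lower non-expert actions' Q-values
    return q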
Furthermore, we propose a computationally efficient deep algorithm, Bayesian Q-learning from Demonstrations (BQfD), built on top of the Deep Q-Network (DQN) [Mnih et al., 2013]. The algorithm embeds reliable expert knowledge into the Q-values and leads to more efficient exploration, as shown on the sparse-reward chain environment DeepSea and on six randomly chosen Atari games. In most environments, our algorithm learns faster than Deep Q-learning from Demonstrations (DQfD) [Hester et al., 2018] and Prioritized Double Duelling (PDD) DQN [Wang et al., 2016].
2 Related Work
Reinforcement learning from demonstrations (RLfD) has drawn attention in recent years. These methods require only a small number of offline expert demonstrations and can noticeably improve performance. Deep Q-learning from Demonstrations (DQfD) [Hester et al., 2018] includes an additional margin classification loss to ensure that the $Q$-values of expert actions are higher than those of other actions.
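For reference, this large-margin loss has (up to notation) the form
$$J_E(Q) = \max_{a \in \mathcal{A}}\big[\,Q(s, a) + l(a_E, a)\,\big] - Q(s, a_E),$$
where $a_E$ denotes the demonstrated action and $l(a_E, a)$ is a margin function that is zero when $a = a_E$ and positive otherwise.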
Reinforcement learning from demonstrations through shaping [Brys et al., 2015] assigns a high potential to a state-action pair $(s, a)$ when the action $a$ is used by the expert in a neighbourhood of the state $s$. Wu et al. [2020] further leverage reward potentials computed by generative models; however, this adds the complexity of training generative models. Expert data has also been used implicitly, by first learning a reward function through inverse reinforcement learning [Abbeel and Ng, 2004, Brown et al., 2019].
However, these approaches do not consider the case where the expert is imperfect and often have difficulty exceeding expert performance. In contrast, Nair et al. [2018] handle suboptimal demonstrations by cloning expert actions only when their currently estimated Q-values are higher than those of other actions. Zhang et al. [2022] utilize expert information when the cumulative Q-values on expert state-action pairs are larger than the cumulative value functions. However, the learned Q-values are often unreliable and change frequently, which makes them a poor judge of expert data quality, whereas our measure, based on the posterior variance of the estimated Q-values, provides a more reliable signal. Jing et al. [2020] model the suboptimal expert policy as a local optimum of the expected return and then constrain the agent to learn within a region around the expert policy, measured by the KL divergence between occupancy measures. However, the local-optimality condition is hard to satisfy, and restricting learning to a region around the expert policy cannot cope with mistaken expert actions. In contrast, our assumption of a Boltzmann-distributed expert policy is much weaker.
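For concreteness, by a Boltzmann-distributed expert policy we mean the standard softmax form
$$\pi_E(a \mid s) = \frac{\exp\big(\beta\, q(s, a)\big)}{\sum_{a' \in \mathcal{A}} \exp\big(\beta\, q(s, a')\big)}, \qquad \beta > 0,$$
so that better actions are only more probable rather than always chosen; the inverse temperature $\beta$ is a generic parameter of this form and not a quantity defined in this section.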
3 Background
A Markov decision process [Sutton and Barto, 2018] is a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, \gamma, R, \rho \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the time-homogeneous transition probability matrix with elements $P(s' \mid s, a)$, $\gamma \in (0, 1)$ is the discount factor, $R$ is the reward function, with $R(s, a)$ denoting a random vector describing the rewards received after the state-action pair $(s, a)$, and $\rho$ is the distribution of the initial state. At each time step $h$, the agent samples an action $A_h$ from a policy $\mu$, and then transits to the next state $S_{h+1}$ and gains a reward $R(S_h, A_h)$. Our paper focuses on finite-horizon cases with horizon $H$, where an agent stops at time step $H$. Also, our paper works with finite action and state spaces.
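Purely as an illustration of these components (the container and field names below are ours, not part of any interface used in this paper), a finite-horizon MDP can be represented as:

from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteHorizonMDP:
    # Illustrative container for M = <S, A, P, gamma, R, rho> with horizon H.
    num_states: int      # |S|, the number of states
    num_actions: int     # |A|, the number of actions
    P: np.ndarray        # transition probabilities, shape (|S|, |A|, |S|)
    R_mean: np.ndarray   # expected rewards E[R(s, a)], shape (|S|, |A|)
    gamma: float         # discount factor in (0, 1)
    rho: np.ndarray      # initial-state distribution, shape (|S|,)
    horizon: int         # H, the time step at which the agent stops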
The expected discounted cumulative return starting from each state-action pair $(s, a)$ at time step $h$ and following the policy $\mu$ is called the Q-value, denoted by $q^{\mu}_h(s, a)$, and is defined as follows:
$$q^{\mu}_h(s, a) = \mathbb{E}_{\tau}\left[\sum_{t=h}^{H} \gamma^{t} R(S_t, A_t)\right],$$