Bayesian Q-learning With Imperfect Expert
Demonstrations
Fengdi Che
University of Alberta
Xiru Zhu
McGill University
Doina Precup
McGill University, MILA, DeepMind
David Meger
McGill University
Gregory Dudek
McGill University, Samsung
Abstract
Guided exploration with expert demonstrations improves data efficiency for reinforcement learning, but current algorithms often overuse expert information. We propose a novel algorithm to speed up Q-learning with the help of a limited amount of imperfect expert demonstrations. The algorithm avoids excessive reliance on expert data by relaxing the optimal expert assumption and gradually reducing the usage of uninformative expert data. Experimentally, we evaluate our approach on a sparse-reward chain environment and six more complicated Atari games with delayed rewards. With the proposed methods, we achieve better results than Deep Q-learning from Demonstrations (Hester et al., 2017) in most environments.
1 Introduction
Reinforcement learning (RL) trains an agent to maximize expected cumulative rewards through online interactions with an environment [Sutton and Barto, 2018]. To speed up learning, online RL can be combined with offline expert demonstrations [Hester et al., 2018], which guide the agent toward high-rewarding behaviours and thus improve data efficiency. Most existing expert demonstration methods [Brys et al., 2015, Abbeel and Ng, 2004] either clone behaviours from the expert or reward expert actions with a bonus. In practice, however, agents are often provided with imperfect expert data and thus receive distracting guidance. Some works have begun to handle imperfect expert data but still lack a solid theoretical foundation [Nair et al., 2018, Zhang et al., 2022].
Our paper adopts a Bayesian framework [Ghavamzadeh et al., 2015, Dearden et al., 1998], which provides an analytical way to incorporate sub-optimal expert information into learning. In the probabilistic model used for Bayesian inference, sub-optimal expert decisions are assumed to follow a Boltzmann distribution over the optimal expected returns of state-action pairs, called the optimal Q-values. Intuitively, the expert prefers actions with higher optimal Q-values. Based on this relationship, the agent can infer the optimal Q-values from the given expert data during online learning and correct its estimated Q-values to better match the expert behaviours. This inference is equivalent to maximizing the posterior distribution of Q-values conditioned on expert data, but computing the maximum of a posterior probability is a difficult task [Hoffman et al., 2013, Diaconis and Ylvisaker, 1979, West et al., 1985]. Therefore, our paper proposes to utilize the generalized extended Kalman filter (GEKF) [Fahrmeir, 1992] to derive a posterior maximum, requiring fewer restrictions on the Q-value functions than other Bayesian RL frameworks [Dearden et al., 1998, Engel et al., 2003, Osband et al., 2019].
equal contribution, fengdi@ualberta.ca
equal contribution
Preprint. Under review.
arXiv:2210.01800v1 [cs.LG] 1 Oct 2022
Under the assumptions of the GEKF framework, we derive an iterative forward procedure for computing Q-values, which consists of a Q-value update step, as in the Q-learning algorithm, and an expert correction step. In the correction step, our update rule weighs expert information according to the agent's uncertainty about its own learning result, measured by the estimated posterior variance of the learned Q-values. The larger the variance of the Q-values, the more the expert correction encourages expert behaviours by increasing the Q-values of expert actions and decreasing those of non-expert actions. This mechanism reduces the influence of uninformative expert data and avoids excessive guided exploration.
Furthermore, we propose a computationally efficient deep algorithm, Bayesian Q-learning from Demonstrations (BQfD), built on top of the Deep Q-learning Network (DQN) [Mnih et al., 2013]. The algorithm embeds reliable expert knowledge into the Q-values and leads to more efficient exploration, as shown on the sparse-reward chain environment DeepSea and six randomly chosen Atari games. In most environments, our algorithm learns faster than Deep Q-learning from Demonstrations (DQfD) [Hester et al., 2018] and Prioritized Double Duelling (PDD) DQN [Wang et al., 2016].
2 Related Work
Reinforcement learning from demonstrations (RLfD) has drawn attention in recent years. This approach requires only a small number of offline expert demonstrations and can noticeably improve performance. Deep Q-learning from demonstrations (DQfD) [Hester et al., 2018] includes an additional margin classification loss to ensure that the Q-values of expert actions are higher than those of other actions. Reinforcement learning from demonstrations through shaping [Brys et al., 2015] assigns a high potential to a state-action pair $(s, a)$ when the action $a$ is used by the expert in the neighbourhood of the state $s$. Wu et al. [2020] further leverage reward potentials computed by generative models; however, this adds the complexity of training generative models. Expert data has also been used implicitly to learn a reward function through inverse reinforcement learning before training [Abbeel and Ng, 2004, Brown et al., 2019].
However, these approaches do not consider the case where the expert is imperfect and often have difficulty exceeding expert performance. In contrast, Nair et al. [2018] handle suboptimal demonstrations by cloning expert actions only when their current estimated Q-values are higher than those of other actions. Zhang et al. [2022] utilize expert information when the cumulative Q-values on expert state-action pairs are larger than the cumulative value functions. However, the learned Q-values are often unreliable and change frequently, which makes them a poor judge of expert data quality; our measure, based on the posterior variance of the estimated Q-values, is more reasonable. Jing et al. [2020] model the suboptimal expert policy as a local optimum of the expected returns and then constrain the agent to learn within a region around the expert policy, measured by the KL divergence between occupancy measures. However, the local optimum condition is hard to satisfy, and restricting learning to a region around the expert policy cannot cope with mistaken expert actions. In contrast, our assumption of a Boltzmann-distributed expert policy is much weaker.
3 Background
A Markov decision process [Sutton and Barto, 2018] is a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, \gamma, R, \rho \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the time-homogeneous transition probability matrix with elements $P(s'|s, a)$, $\gamma \in (0, 1)$ is the discount factor, $R$ is the reward function, with $R(s, a)$ denoting a random vector describing the rewards received after state-action pair $(s, a)$, and $\rho$ is the distribution of the initial state. At each time step $h$, the agent samples an action $A_h$ from a policy $\mu$, then transits to the next state $S_{h+1}$ and gains a reward $R(S_h, A_h)$. Our paper focuses on finite-horizon problems with horizon $H$, where the agent stops at time step $H$, and works with finite action and state spaces.
The expected discounted cumulative return obtained by starting from a state-action pair $(s, a)$ at time step $h$ and following the policy $\mu$ is called the Q-value, denoted by $q^{\mu}_h(s, a)$ and defined as
$$q^{\mu}_h(s, a) = \mathbb{E}_{\tau}\Big[\sum_{t=h}^{H} \gamma^{t} R(S_t, A_t)\Big],$$
where $\tau$ denotes trajectories with $S_h = s$, $A_h = a$, $S_t \sim P(\cdot|S_{t-1}, A_{t-1})$, $A_t \sim \mu(\cdot|S_t)$ and rewards $R(S_t, A_t)$. The unique optimal Q-value function at time $h$ is defined as
$$q^{*}_h(s, a) = q^{\mu^{*}}_h(s, a) = \sup_{\mu} q^{\mu}_h(s, a).$$
The optimal Q-values also satisfy the optimal Bellman equation for all states and actions:
$$q^{*}_h(s, a) = \mathbb{E}[R(s, a)] + \gamma\, \mathbb{E}_{S' \sim P(\cdot|s, a)}\big[\max_{a'} q^{*}_{h+1}(S', a')\big] =: T^{*}_{h+1} q^{*}_{h+1}(s, a), \quad h = 0, \dots, H-1,$$
$$q^{*}_H(s, a) = 0, \quad \forall (s, a). \tag{1}$$
The optimal Bellman operator $T^{*}$ is defined by this equation. Moreover, the optimal Q-values at each time $h$ for all state-action pairs can be combined into one vector, $q^{*}_h \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$, with each element representing the optimal Q-value of a state-action pair.
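When the transition matrix and mean rewards are known, equation (1) can be evaluated directly by backward induction over the horizon. The following is a minimal NumPy sketch of that recursion; the tabular inputs `P`, `R_mean`, and the horizon `H` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def backward_induction(P, R_mean, gamma, H):
    """Compute optimal finite-horizon Q-values via the Bellman equation (1).

    P: array of shape (S, A, S), where P[s, a, s'] is a transition probability.
    R_mean: array of shape (S, A) holding the expected immediate rewards E[R(s, a)].
    Returns q_star of shape (H + 1, S, A) with the terminal condition q_star[H] = 0.
    """
    S, A, _ = P.shape
    q_star = np.zeros((H + 1, S, A))
    for h in range(H - 1, -1, -1):
        # Value of the next step: max over actions at each next state.
        v_next = q_star[h + 1].max(axis=1)            # shape (S,)
        # Bellman backup: expected reward plus discounted expected next value.
        q_star[h] = R_mean + gamma * P @ v_next       # shape (S, A)
    return q_star
```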
Q-learning
Optimal Q-values can be computed by Q-learning [Watkins and Dayan, 1992, Jin et al., 2018], which estimates Q-values directly and is based on the following update rule:
$$Q_h(s, a) = (1 - \alpha) Q_h(s, a) + \alpha\big[R(s, a) + \gamma \max_{a'} Q_{h+1}(S', a')\big],$$
where $Q$ is an estimate of the Q-values, $S' \sim P(\cdot|s, a)$ is a sampled next state, and $\alpha \in (0, 1)$ is the learning rate. A common choice of adaptive learning rate assigns each state-action pair its own learning rate and decays it with the number of visits to that pair. We use $n_l(s, a)$ to denote the number of visits to state-action pair $(s, a)$ up to and including episode $l$.
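As a concrete illustration, a single tabular Q-learning step with a visit-count-based learning rate might look like the sketch below. The names `Q`, `counts`, and the transition tuple are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def q_learning_update(Q, counts, h, s, a, r, s_next, gamma):
    """One tabular Q-learning step with a count-decayed learning rate.

    Q: array of shape (H + 1, S, A) holding the current Q-value estimates.
    counts: array of shape (S, A) with visit counts n(s, a).
    """
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]                     # decays with visits to (s, a)
    target = r + gamma * Q[h + 1, s_next].max()    # sampled Bellman target
    Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
    return Q, counts
```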
Bayesian Model-free Framework
Bayesian reinforcement learning assumes an initial guess, or prior probability, over the model parameters, denoted by $P(\mathcal{M})$. In the model-based case, the agent gradually learns a posterior probability over the model parameters conditioned on observed trajectories [Duff, 2003]. The agent can then learn a policy based on the most likely MDP or by sampling MDPs from the posterior. In the model-free case, we treat Q-values as random variables and implicitly embed the parametric uncertainty over different MDPs. A prior distribution over the optimal Q-values at time $h$ is defined as
$$P(q^{*}_h(s, a) \le c) = P(\{\mathcal{M} : q^{*}_{\mathcal{M},h}(s, a) \le c\}).$$
Algorithms then compute the posterior probability of the Q-values [Dearden et al., 1998, Osband et al., 2018].
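The induced prior in the display above can be made concrete by Monte-Carlo sampling: draw MDPs from a prior over models, solve each one, and the spread of the resulting optimal Q-values is the prior over $q^{*}_h(s, a)$. The sketch below reuses the hypothetical `backward_induction` helper from the Background section and assumes a Dirichlet/Gaussian model prior chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_q_prior(num_models, S, A, gamma, H, h, s, a):
    """Monte-Carlo approximation of the induced prior over q*_h(s, a)."""
    samples = []
    for _ in range(num_models):
        # A hypothetical prior over MDPs: Dirichlet transitions, Gaussian mean rewards.
        P = rng.dirichlet(np.ones(S), size=(S, A))        # shape (S, A, S)
        R_mean = rng.normal(0.0, 1.0, size=(S, A))        # shape (S, A)
        q_star = backward_induction(P, R_mean, gamma, H)  # solve the sampled MDP
        samples.append(q_star[h, s, a])
    # Empirical prior distribution of the optimal Q-value of (s, a) at time h.
    return np.array(samples)
```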
4 Model
In this section, we present the probabilistic model used for Bayesian inference and then derive the update rule for the Q-values by maximizing the posterior probability.
4.1 Probabilistic Model of Suboptimal Expert Actions
We start by modeling the relationship between expert actions and the optimal Q-values. The expert
chooses the optimal actions most frequently but may contain mistakes and select low probability
actions. The Boltzmann distribution can describe this behavior with the optimal Q-values as parame-
ters. Under this distribution, expert actions are sampled proportional to the exponential of optimal
Q-values up to a constant multiplier. As known, this assumed expert policy maximizes the expected
returns regularized by the entropy [Haarnoja et al., 2018], describing almost optimal but sometimes
mistaken expert behaviors. The above assumption is formally presented as follows and shown on the
left of figure 1, where the expert action at time hand state sis denoted by Aexp,h(s).
Assumption 1. Assume that expert demonstrations are drawn from a policy $\pi_{\text{expert}}$ that depends on the optimal Q-values and follows the Boltzmann distribution:
$$\pi_{\text{expert}}(a|s) = \frac{e^{\eta q^{*}(s, a)}}{\sum_{b \in \mathcal{A}} e^{\eta q^{*}(s, b)}}, \tag{2}$$
where $\eta$ is any positive constant.
Figure 1: (Left) The figure demonstrates the relationship between optimal Q-values and expert
actions. Expert actions are sampled proportional to the exponential of optimal Q-values up to a
constant multiplier. (Right) The figure also describes the Bellman update rule for the Q-values along
a trajectory.
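A minimal sketch of the expert model in equation (2): actions are drawn from a softmax over the optimal Q-values of the current state with inverse temperature $\eta$. The input array `q_star_s`, holding $q^{*}(s, \cdot)$, is an assumed name for illustration.

```python
import numpy as np

def expert_policy(q_star_s, eta, rng=None):
    """Sample an expert action from the Boltzmann distribution of equation (2).

    q_star_s: array of shape (A,) with optimal Q-values q*(s, a) for one state s.
    eta: positive inverse temperature; larger eta means fewer expert mistakes.
    """
    rng = rng or np.random.default_rng()
    logits = eta * q_star_s
    logits = logits - logits.max()      # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_star_s), p=probs)
```

As $\eta \to \infty$ this expert becomes greedy with respect to $q^{*}$, while a small $\eta$ yields nearly uniform, frequently mistaken behaviour, matching the intuition that the expert prefers actions with higher optimal Q-values.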
Moreover, our probabilistic model captures the relationship between the optimal Q-values at different time steps through the Bellman equation (1). However, the Bellman equation cannot be computed without knowledge of the environment dynamics. Therefore, our model relies on samples of the reward and the next state, as is widely accepted in the field [Watkins and Dayan, 1992]. We re-write the Bellman equation in a stochastic form as
$$q^{*}(s, a) = R(s, a) + \gamma \max_{a' \in \mathcal{A}} q^{*}(S', a') + \nu, \tag{3}$$
where $R(s, a)$ and the next state $S'$ are sampled according to the underlying MDP. The random noise $\nu$ is defined as
$$\nu = \mathbb{E}\big[R(s, a) + \gamma \max_{a' \in \mathcal{A}} q^{*}(S', a')\big] - R(s, a) - \gamma \max_{a' \in \mathcal{A}} q^{*}(S', a'),$$
which captures the randomness of the dynamics. The full model of the relationships among rewards, optimal Q-values, and expert actions along a trajectory is shown on the right of Figure 1.
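To make equation (3) concrete, the sketch below draws one reward and next-state sample and forms the noisy Bellman target; the noise $\nu$ is the gap between the expectation in equation (1) and this one-sample estimate. The tabular arrays `P`, `R_mean`, `q_star`, and the Gaussian reward model are illustrative assumptions only.

```python
import numpy as np

def stochastic_bellman_sample(P, R_mean, q_star, h, s, a, gamma, reward_std, rng=None):
    """Draw one noisy Bellman target for (s, a) and the corresponding noise nu."""
    rng = rng or np.random.default_rng()
    # Sample a reward and a next state from the (here assumed known) MDP.
    r = rng.normal(R_mean[s, a], reward_std)
    s_next = rng.choice(P.shape[-1], p=P[s, a])
    # One-sample target from the stochastic Bellman equation (3).
    target = r + gamma * q_star[h + 1, s_next].max()
    # Exact expectation from the Bellman equation (1).
    expectation = R_mean[s, a] + gamma * P[s, a] @ q_star[h + 1].max(axis=1)
    nu = expectation - target        # zero-mean noise induced by the dynamics
    return target, nu
```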
4.2 GEKF Framework
Next, our estimated Q-values should not only follow the stochastic Bellman equation, but should also be the values most likely to generate the Boltzmann distribution that the expert follows. This objective is equivalent to maximizing the posterior probability that the optimal Q-values equal our estimates, conditioned on the expert information, while constraining the estimates to satisfy the stochastic Bellman equation.
To compute this posterior, we need a framework under which the posterior probability density function, or at least the posterior mode, has an analytical form. Our paper therefore leverages the generalized extended Kalman filter (GEKF) [Fahrmeir, 1992], which can analyze the time series of Q-values with extra expert information and provides an estimate of the posterior maximum. This framework requires the extra expert information to follow an exponential-family distribution, which is already satisfied by Assumption 1.
Furthermore, the framework requires the random noise $\nu$ in the stochastic Bellman equation (3) at each Q-value update step to be Gaussian, an assumption also made in Osband et al. [2019]. This assumption does little harm in the later stages of training, since the influence of the random noise gradually approaches zero as the number of samples increases and the learning rate decays. Note that the random noise has zero expectation and bounded variance when the rewards and the time horizon are bounded. Thus, our paper models $\nu$ as a Gaussian random variable with zero mean and a fixed variance $\lambda$, as stated in the following assumption.
Assumption 2. For every state-action pair $(s, a)$, the noise $\nu$ is modeled independently as a Gaussian random variable $\nu \sim \mathcal{N}(0, \lambda)$.
In addition, the GEKF framework approximates the posterior distribution by a Gaussian and linearizes the maximization operator in the Q-value update, as in Osband et al. [2019].