Under the assumptions of the GEKF framework, we derive an iterative forward procedure for computing Q-values, which consists of a Q-value-update step, as in the Q-learning algorithm, and an expert correction step. In the correction step, our update rule weights expert information according to the agent's uncertainty about its own learning result, measured by the estimated posterior variance of the learned Q-values. Thus, the larger the variance of the Q-values, the more the expert correction encourages expert behaviours by increasing the Q-values of expert actions and decreasing those of non-expert actions. This mechanism reduces the influence of uninformative expert data and avoids excessive guided exploration.
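As a rough tabular sketch of this two-step update (the scaling factor lam, the per-action variance weighting, and the tabular setting below are illustrative assumptions, not the exact GEKF-derived rule):

import numpy as np

def q_update_with_expert_correction(q, var, s, a, r, s_next, expert_action,
                                    alpha=0.1, gamma=0.99, lam=1.0):
    # Q-value-update step, as in standard tabular Q-learning.
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])

    # Expert correction step: the correction is weighted by the estimated
    # posterior variance var[s, b], so the more uncertain the agent is about
    # its own learned Q-values, the more they are nudged towards the expert.
    for b in range(q.shape[1]):
        if b == expert_action:
            q[s, b] += lam * var[s, b]   # raise the expert action's Q-value
        else:
            q[s, b] -= lam * var[s, b]   # lower non-expert actions' Q-values
    return q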
Furthermore, we propose a computationally efficient deep algorithm, Bayesian Q-learning from Demonstrations (BQfD), built on top of the Deep Q-Network (DQN) [Mnih et al., 2013]. The algorithm embeds reliable expert knowledge into the Q-values and leads to more efficient exploration, as shown on the sparse-reward chain environment DeepSea and on six randomly chosen Atari games. In most environments, our algorithm learns faster than Deep Q-learning from Demonstrations (DQfD) [Hester et al., 2018] and Prioritized Double Duelling (PDD) DQN [Wang et al., 2016].
2 Related Work
Reinforcement learning from demonstrations (RLfD) has drawn attention in recent years. These methods require only a small number of offline expert demonstrations and can noticeably improve performance. Deep Q-learning from Demonstrations (DQfD) [Hester et al., 2018] includes an additional margin classification loss to ensure that the $Q$-values of expert actions are higher than those of other actions.
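For reference, this large-margin loss has (up to notation) the form
$$J_E(Q) = \max_{a \in \mathcal{A}}\big[\,Q(s, a) + l(a_E, a)\,\big] - Q(s, a_E),$$
where $a_E$ denotes the demonstrated action and $l(a_E, a)$ is a margin function that is zero when $a = a_E$ and positive otherwise.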
Reinforcement learning from demonstrations through shaping [Brys et al., 2015] assigns a high potential to a state-action pair $(s, a)$ when the action $a$ is used by the expert in a neighbourhood of the state $s$. Wu et al. [2020] further leverage reward potentials computed by generative models; however, this adds the complexity of training generative models. Expert data has also been used implicitly, by first learning a reward function through inverse reinforcement learning [Abbeel and Ng, 2004, Brown et al., 2019].
However, these approaches do not consider the case where the expert is imperfect and often have difficulty exceeding expert performance. In contrast, Nair et al. [2018] handle suboptimal demonstrations by cloning expert actions only when their currently estimated Q-values are higher than those of other actions. Zhang et al. [2022] utilize expert information when the cumulative Q-values on expert state-action pairs are larger than the cumulative value functions. However, the learned Q-values are often unreliable and change frequently, which makes them a poor judge of expert data quality, whereas our measure, based on the posterior variance of the estimated Q-values, provides a more reliable signal. Jing et al. [2020] model the suboptimal expert policy as a local optimum of the expected return and then constrain the agent to learn within a region around the expert policy, measured by the KL divergence between occupancy measures. However, the local-optimality condition is hard to satisfy, and restricting learning to a region around the expert policy cannot cope with mistaken expert actions. In contrast, our assumption of a Boltzmann-distributed expert policy is much weaker.
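For concreteness, by a Boltzmann-distributed expert policy we mean the standard softmax form
$$\pi_E(a \mid s) = \frac{\exp\big(\beta\, q(s, a)\big)}{\sum_{a' \in \mathcal{A}} \exp\big(\beta\, q(s, a')\big)}, \qquad \beta > 0,$$
so that better actions are only more probable rather than always chosen; the inverse temperature $\beta$ is a generic parameter of this form and not a quantity defined in this section.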
3 Background
A Markov decision process [Sutton and Barto, 2018] is a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, \gamma, R, \rho \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the time-homogeneous transition probability matrix with elements $P(s' \mid s, a)$, $\gamma \in (0, 1)$ is the discount factor, $R$ is the reward function, with $R(s, a)$ denoting a random vector describing the rewards received after the state-action pair $(s, a)$, and $\rho$ is the distribution of the initial state. At each time step $h$, the agent samples an action $A_h$ from a policy $\mu$, and then transits to the next state $S_{h+1}$ and gains a reward $R(S_h, A_h)$. Our paper focuses on finite-horizon cases with horizon $H$, where an agent stops at time step $H$. Also, our paper works with finite action and state spaces.
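Purely as an illustration of these components (the container and field names below are ours, not part of any interface used in this paper), a finite-horizon MDP can be represented as:

from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteHorizonMDP:
    # Illustrative container for M = <S, A, P, gamma, R, rho> with horizon H.
    num_states: int      # |S|, the number of states
    num_actions: int     # |A|, the number of actions
    P: np.ndarray        # transition probabilities, shape (|S|, |A|, |S|)
    R_mean: np.ndarray   # expected rewards E[R(s, a)], shape (|S|, |A|)
    gamma: float         # discount factor in (0, 1)
    rho: np.ndarray      # initial-state distribution, shape (|S|,)
    horizon: int         # H, the time step at which the agent stops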
The expected discounted cumulative return starting from each state-action pair $(s, a)$ at time step $h$ and following the policy $\mu$ is called the Q-value, denoted by $q^{\mu}_h(s, a)$, and is defined as follows:
$$q^{\mu}_h(s, a) = \mathbb{E}_{\tau}\left[\sum_{t=h}^{H} \gamma^{t} R(S_t, A_t)\right],$$