Model-based Lifelong Reinforcement Learning
with Bayesian Exploration
Haotian Fu, Shangqun Yu, Michael Littman, George Konidaris
Department of Computer Science, Brown University
{hfu7,syu68,mlittman,gdk}@cs.brown.edu
Abstract
We propose a model-based lifelong reinforcement-learning approach that estimates
a hierarchical Bayesian posterior distilling the common structure shared across
different tasks. The learned posterior combined with a sample-based Bayesian
exploration procedure increases the sample efficiency of learning across a family
of related tasks. We first derive an analysis of the relationship between the sample
complexity and the initialization quality of the posterior in the finite MDP setting.
We next scale the approach to continuous-state domains by introducing a Varia-
tional Bayesian Lifelong Reinforcement Learning algorithm that can be combined
with recent model-based deep RL methods, and that exhibits backward transfer.
Experimental results on several challenging domains show that our algorithms achieve better forward and backward transfer performance than state-of-the-art lifelong RL methods.1
1 Introduction
Reinforcement learning (RL) [42; 26] has been successfully applied to solve challenging individual tasks such as learning robotic control [11] and playing Go [38]. However, the typical RL setting assumes that the agent solves exactly one task, which it has the opportunity to interact with repeatedly. In many real-world settings, an agent instead experiences a collection of distinct tasks that arrive sequentially throughout its operational lifetime; learning each new task from scratch is inefficient, but treating them all as a single task will fail. Therefore, recent research has focused on algorithms that enable agents to learn across multiple, sequentially posed tasks, leveraging knowledge from previous tasks to accelerate the learning of new tasks. This problem setting is known as lifelong reinforcement learning [7; 50; 25]. The key questions in lifelong RL research are: How can an algorithm exploit knowledge gained from past tasks to quickly adapt to new tasks (forward transfer), and how can data from new tasks help the agent perform better on previously learned tasks (backward transfer)?
We propose to address these problems by extracting the common structure existing in previously encountered tasks so that the agent can quickly learn the dynamics specific to the new tasks. We consider lifelong RL problems that can be modeled as hidden-parameter MDPs or HiP-MDPs [10; 27], where variations among the true task dynamics can be described by a set of hidden parameters. Our algorithm goes further than previous work in both lifelong learning and HiP-MDPs by 1) separately modeling epistemic and aleatory uncertainty over different levels of abstraction across the collection of tasks: the uncertainty captured by a world-model distribution describing the probability distribution over tasks, and the uncertainty captured by a task-specific model of the (stochastic) dynamics within a single task. To enable more accurate sequential knowledge transfer, we separate the learning process for these two quantities and maintain a hierarchical Bayesian posterior that approximates them. 2) Performing Bayesian exploration enabled by the hierarchical posterior: the method lets the agent act optimistically according to models sampled from the posterior, and thus increases sample efficiency.
1Code repository available at https://github.com/Minusadd/VBLRL.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Specifically, we propose a model-based lifelong RL approach with Bayesian exploration that estimates a Bayesian world-model posterior that distills the common structure of previous tasks, and then uses this between-task posterior as a within-task prior to learn a task-specific model in each subsequent task. The learned hierarchical posterior model combined with sample-based Bayesian exploration procedures can increase the sample efficiency of learning. We first derive an explicit performance bound showing that the task-specific model requires fewer samples to become accurate as the world-model posterior approaches the true underlying world-model distribution in the discrete case. We further develop Variational Bayesian exploration for Lifelong RL (VBLRL), a more scalable version that uses variational inference to approximate the distribution and leverages Bayesian Neural Networks (BNNs) [17; 3] to build the hierarchical Bayesian posterior. VBLRL provides a novel way to separately estimate different kinds of uncertainty in the HiP-MDP setting. Based on the same framework, we also propose a backward-transfer version of VBLRL that is able to provide improvements for previously encountered tasks. Our experimental results on a set of challenging domains show that our algorithms achieve better forward and backward transfer performance than state-of-the-art lifelong RL algorithms when given only limited interactions with each task.
2 Background
RL is the problem of maximizing the long-term expected reward of an agent interacting with an environment [42]. We usually model the environment as a Markov Decision Process or MDP [36], described by a five-tuple $\langle S, A, R, T, \gamma \rangle$, where $S$ is a finite set of states; $A$ is a finite set of actions; $R: S \times A \mapsto [0,1]$ is a reward function, with a lower and upper bound of $0$ and $1$; $T: S \times A \mapsto \Pr(S)$ is a transition function, with $T(s' \mid s, a)$ denoting the probability of arriving in state $s' \in S$ after executing action $a \in A$ in state $s$; and $\gamma \in [0,1)$ is a discount factor, expressing the agent's preference for delayed over immediate rewards.
An MDP is a suitable model for the task facing a single agent. In the lifelong RL setting, the agent instead faces a series of tasks $m_1, \ldots, m_n$, each of which can be modeled as an MDP $\langle S^{(i)}, A^{(i)}, R^{(i)}, T^{(i)}, \gamma^{(i)} \rangle$. For lifelong RL problems, the performance of a specific algorithm is usually evaluated based on both forward transfer and backward transfer results [30], illustrated in the sketch below:
Forward transfer: the influence that learning task $t$ has on the performance in a future task $k > t$.
Backward transfer: the influence that learning task $t$ has on the performance in earlier tasks $k < t$.
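As a concrete illustration of these two quantities, the sketch below computes forward- and backward-transfer scores from a matrix of per-task evaluation returns, following the definitions of Lopez-Paz and Ranzato [30] adapted to RL returns; the return matrix, baseline values, and evaluation protocol shown here are hypothetical placeholders rather than the setup used in our experiments.

```python
import numpy as np

def transfer_metrics(R, b):
    """Forward/backward transfer in the style of [30], adapted to RL returns.

    R[i, j]: return on task j after the agent has finished training on tasks 0..i.
    b[j]:    return of a randomly initialized agent on task j.
    """
    T = R.shape[0]
    # Backward transfer: change in performance on earlier tasks after learning all tasks,
    # relative to the performance obtained right after each task was first learned.
    bwt = np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])
    # Forward transfer: performance on task j before training on it, minus a random baseline.
    fwt = np.mean([R[j - 1, j] - b[j] for j in range(1, T)])
    return fwt, bwt

# Toy usage with a hypothetical 3-task return matrix.
R = np.array([[0.8, 0.2, 0.1],
              [0.7, 0.9, 0.3],
              [0.6, 0.8, 0.9]])
b = np.array([0.1, 0.1, 0.1])
print(transfer_metrics(R, b))  # positive values indicate positive transfer
```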
A key question in the lifelong setting is how the series of task MDPs are related; we model the collection of tasks as a HiP-MDP, where a family of tasks is generated by varying a latent task parameter $\omega$ drawn for each task according to the world-model distribution $P$. Each setting of $\omega$ specifies a unique MDP, but the agent neither observes $\omega$ nor has access to the function that generates the task family. The dynamics $T(s' \mid s, a; \omega_i)$ and reward function $R(r \mid s, a; \omega_i)$ for task $i$ then depend on $\omega_i$, which is fixed for the duration of the task. The tasks are sampled i.i.d. from a fixed distribution and arrive one at a time.
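To make the HiP-MDP generative process concrete, the following minimal sketch shows a task family in which a hidden parameter $\omega$, drawn once per task from a fixed distribution, scales the transition dynamics; the particular dynamics, reward, and parameter range are illustrative assumptions, not the benchmark domains used in our experiments.

```python
import numpy as np

class HiPMDPTask:
    """One task of a HiP-MDP: omega is fixed for the task's duration and never observed."""
    def __init__(self, omega, rng):
        self.omega = omega            # hidden task parameter (hypothetical, scalar)
        self.rng = rng
        self.state = np.zeros(2)

    def step(self, action):
        # Dynamics T(s'|s,a; omega): drift scaled by the hidden parameter, plus noise
        # (the noise is the within-task, aleatoric part of the uncertainty).
        next_state = self.state + self.omega * action + 0.01 * self.rng.normal(size=2)
        reward = -np.linalg.norm(next_state)   # R(r|s,a; omega)
        self.state = next_state
        return next_state, reward

def task_stream(n_tasks, seed=0):
    """Tasks arrive one at a time, sampled i.i.d. from the world-model distribution P."""
    rng = np.random.default_rng(seed)
    for _ in range(n_tasks):
        omega = rng.uniform(0.5, 1.5)  # hypothetical world-model distribution over omega
        yield HiPMDPTask(omega, rng)

for task in task_stream(3):
    next_state, reward = task.step(np.ones(2))
```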
3 Related work
The first category of lifelong RL algorithms learns a single model that encourages transfer across tasks by modifying objective functions. EWC [28] imposes a quadratic penalty that pulls each weight back towards its old values by an amount proportional to its importance for performance on previously learned tasks, to avoid forgetting. There are several extensions of this work based on the core idea of modifying the form of the penalty [29; 53; 33]. Another category of lifelong RL methods uses multiple models with shared parameters and task-specific parameters to avoid or alleviate the catastrophic-forgetting problem [5; 24; 31]. The drawback of these methods is that it is hard to incorporate the knowledge learned from previous tasks during initial training on a new task [31]. Nagabandi et al. [32] introduce a model-based continual learning framework based on MAML, but they focus on discovering when new tasks are encountered without access to task indicators.
Published HiP-MDP methods use Gaussian Processes [10] or Bayesian neural networks [27] to find a single model that works for all tasks, which may trigger catastrophic forgetting [5; 24]. Meta-RL [47; 12] and multi-task RL [35; 43] settings also attempt to accelerate learning by transferring knowledge from different tasks. Some work employs the MAML framework with Bayesian methods to learn a stochastic distribution over initial parameters [51; 16; 13]. Other work uses the collected trajectories to infer the hidden parameter, which is taken as an additional input when computing the policy [37; 56; 14]. Our method, however, focuses on problems where the tasks arrive sequentially instead of having a large number of tasks available at the beginning of training. This sequential setting makes it hard to accurately infer the hidden parameters, but opens the door for algorithms that support backward transfer.
Some prior work uses Bayesian methods in RL to quantify uncertainty over initial MDP models [15; 1; 18]. Several algorithms start from the idea of sampling from a posterior over MDPs for Bayesian RL, maintaining Bayesian posteriors and sampling one complete MDP [41; 49] or multiple MDPs [2]. Instead of focusing on single-task RL, our algorithm aims to find a posterior over the common structure among multiple tasks. Wilson et al. [49] use a hierarchical Bayesian infinite mixture model to learn a strong prior that allows the agent to rapidly infer the characteristics of a new environment based on previous tasks. However, their method only infers the category label of a new MDP and only works in discrete settings.
4 Model-based Lifelong Reinforcement Learning
Our approach is built upon two main intuitions. First, transferring the transition model instead of the policy or value function leads to more efficient use of the data when "finetuning" on a new task. As we show empirically in Section 5.1, although some model-free lifelong RL algorithms perform better than the proposed model-based method in single-task cases, in the lifelong RL setting on the same task type the model-based method still achieves comparable or better performance with only half the amount of data. Second, with a model that is able to capture the different levels of uncertainty within HiP-MDPs, an agent can employ sample-based Bayesian exploration to further improve sample efficiency.
The model underlying our approach is a hierarchical Bayesian posterior over task MDPs controlled by the hidden parameter $\omega$. Intuitively, we maintain probability distributions that separately capture two categories of uncertainty within lifelong learning tasks. The world-model posterior $P(\omega)$ captures the epistemic uncertainty of the world-model distribution over all future and past tasks $m_1, \cdots, m_n$ controlled by the hidden parameters $\omega_1, \cdots, \omega_n \sim P$. As the learner is exposed to more and more tasks, this posterior should converge to the world-model distribution $P$. The task-specific posterior $P(\omega_i)$ captures the epistemic uncertainty of the current task $m_i$ (throughout the paper we will often write $i$ for simplicity). As the learner is exposed to more and more transitions within the task, this posterior should approach the true distribution corresponding to $\omega_i$, i.e., peaking at the true $\omega_i$ for this specific task $i$, leaving only the aleatoric uncertainty of transitions within the task, which is independent of other tasks. Each time the agent encounters a new task, we initialize the task-specific model using the world-model posterior and further train it with data collected only from the new task. One of our key insights here is that the sample complexity of learning a new task will decrease as the initial prior of the task-specific model approaches the true underlying distribution of the transition function. Thus, the agent can learn new tasks faster by exploiting knowledge common to previous tasks, thereby exhibiting positive forward transfer.
Specifically, we model the task-specific posterior via the transition dynamics using $p(s_{t+1}, r_t \mid s_t, a_t; \omega_i)$. The task-specific posterior, given a new state–action pair from task $i$, can be rewritten via Bayes' rule:
$$P(\omega_i \mid D_i^t, a_t, s_{t+1}, r_t) = \frac{P(\omega_i \mid D_i^t)\, P(s_{t+1}, r_t \mid D_i^t, a_t; \omega_i)}{P(s_{t+1}, r_t \mid D_i^t, a_t)}, \tag{1}$$
where $D_i^t = \{s_1, a_1, \cdots, s_t\}$ is the agent's history of task $i$ up to time step $t$. The world-model posterior, given the new data from task $i$, can be rewritten as:
$$P(\omega \mid D_{1:i}) = \frac{P(\omega \mid D_{1:i-1})\, P(D_i \mid D_{1:i-1}; \omega)}{P(D_i \mid D_{1:i-1})}, \tag{2}$$
where $D_{1:i}$ denotes the agent's history over all experienced tasks $1, \ldots, i$ up to the current task $i$. In particular, each time the agent faces a new task $i$ and has not yet started updating its task-specific posterior (that is, $D_i^t = \emptyset$), we first use the world-model posterior to initialize the task-specific prior: $P(\omega_i \mid D_i^t) = P(\omega \mid D_{1:i})$. The world-model distribution aims to approximate the underlying $P$. The task-specific distribution aims to approximate the distribution that peaks at the true $\omega_i$ for this specific task $i$.
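For the finite-MDP case, a conjugate (Dirichlet) sketch of this hierarchical scheme is given below: pseudo-counts summarizing the previous tasks act as the world-model posterior and are copied in as the task-specific prior the moment a new task begins, after which only within-task data updates the task-specific counts. The count-based parameterization and the specific prior values are illustrative assumptions; the scalable version in Section 4.2 replaces them with Bayesian neural networks.

```python
import numpy as np

class DirichletDynamicsPosterior:
    """Per-(s, a) Dirichlet posterior over next-state distributions (finite-MDP sketch)."""
    def __init__(self, n_states, n_actions, prior_counts=None):
        self.counts = (0.1 * np.ones((n_states, n_actions, n_states))
                       if prior_counts is None else prior_counts.copy())

    def update(self, s, a, s_next):
        # Conjugate Bayesian update (Eq. 1 specialized to a Dirichlet-categorical model).
        self.counts[s, a, s_next] += 1.0

    def sample_model(self, rng):
        # Draw one complete transition model from the posterior.
        n_states, n_actions, _ = self.counts.shape
        return np.array([[rng.dirichlet(self.counts[s, a])
                          for a in range(n_actions)]
                         for s in range(n_states)])

# World-model posterior distilled from previous tasks (pseudo-counts are hypothetical here),
# copied as the task-specific prior when a new task starts (i.e., when D_i^t is empty).
world_posterior = DirichletDynamicsPosterior(n_states=5, n_actions=2)
task_posterior = DirichletDynamicsPosterior(5, 2, prior_counts=world_posterior.counts)
task_posterior.update(s=0, a=1, s_next=3)   # within-task data only
```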
In Section 4.1, we derive a sample complexity bound in the finite MDP case that explicitly shows how the distance between the distribution of a task's true transition model and our task-specific model's prior, initialized by the parameters of the world-model posterior, affects the learning efficiency on a new task. Then, in Sections 4.2 and 4.3 we extend our high-level idea to a scalable version that can be combined with recent model-based RL approaches and exhibits positive forward and backward transfer.
4.1 Sample Complexity Analysis
In this subsection, we consider the finite MDP setting and use a standard Bayesian exploration algorithm, BOSS [2], as a single-task baseline. Note that BOSS creates optimism in the face of uncertainty, as the agent can choose actions based on the highest-performing transition among the $K$ models sampled, which drives exploration. We include an explanation of the finite MDP version of our algorithm, BLRL (Bayesian Lifelong RL), based on BOSS in Appendix C and simple experiments on Gridworlds in Appendix J.
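As an illustration of this sample-based optimism, the sketch below (reusing the `sample_model` method from the Dirichlet sketch in Section 4) samples $K$ transition models from the posterior, merges them into an MDP whose action set ranges over (model, action) pairs, and acts greedily. Treating the reward function as known and re-sampling models at every step are simplifications relative to the actual BOSS procedure, which re-samples only after enough new transitions have been observed.

```python
import numpy as np

def boss_action(posterior, rewards, state, K=5, gamma=0.95, n_iters=200, rng=None):
    """BOSS-style optimistic action selection (sketch): the merged MDP lets the agent
    pick, per state, the best (sampled model, action) pair, which induces optimism."""
    rng = np.random.default_rng() if rng is None else rng
    models = [posterior.sample_model(rng) for _ in range(K)]       # K sampled MDPs
    S, A = rewards.shape                                           # rewards assumed known here
    V = np.zeros(S)
    for _ in range(n_iters):                                       # value iteration on merged MDP
        Q = np.stack([rewards + gamma * T @ V for T in models])    # shape (K, S, A)
        V = Q.max(axis=(0, 2))                                     # optimism over models and actions
    _, best_a = np.unravel_index(np.argmax(Q[:, state, :]), (K, A))
    return best_a
```

For example, with the `DirichletDynamicsPosterior` above and a reward table of shape (5, 2), `boss_action(task_posterior, rewards, state=0)` returns the greedy action of the merged optimistic MDP.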
BLRL uses the world-model posterior $P(\omega \mid D_{1:i-1})$ learned from the previous $i-1$ tasks to initialize (by copying the distribution) the task-specific prior $P(\omega_i)$ of the new task $i$, aiming to decrease the number of samples needed to learn an accurate task-specific posterior. Our analysis focuses on how the properties of the Bayesian prior affect the sample complexity of learning each specific task.
Let $\pi(\omega_i)$ denote the prior distribution on the parameter space $\Gamma$. We consider a set of transition-probability densities $p(\cdot \mid \omega_i) = p(s_{t+1}, r_t \mid s_t, a_t, \omega_i)$ indexed by $\omega_i$, and the true underlying density $q$. We also denote the model family $\{p(\cdot \mid \omega_i) : \omega_i \in \Gamma\}$ by the same symbol $\Gamma$.
Lemma 4.1. The task-specific posterior in Equation 1 can be regarded as the probability density $g(\omega_i)$ with respect to $\pi$ that attains the infimum of
$$R_n(g) = \mathbb{E}_{\pi}\!\left[ g(\omega_i) \sum_{t=1}^{T} \ln \frac{q(s_{t+1}, r_t \mid D_i^t, a_t)}{p(s_{t+1}, r_t \mid D_i^t, a_t; \omega_i)} \right] + D_{KL}\!\left(g\, d\pi \,\|\, d\pi\right). \tag{3}$$
$\inf_g R_n(g)$ controls the complexity of the density estimation process for $g(\omega_i)$. Intuitively, Lemma 4.1 converts the Bayesian posterior into an information-theoretic minimizer that allows us to further investigate the relationship between the properties of the Bayesian prior and the risk/complexity of attaining the posterior.
Proposition 4.2. Define the prior-mass radius of the transition-probability densities as
$$d_\pi = \inf\{\, d : d \geq -\ln \pi(\{p \in \Gamma : D_{KL}(q \,\|\, p) \leq d\}) \,\}. \tag{4}$$
Intuitively, this quantity measures the distance between the Bayesian prior of the task-specific model and the true underlying task-specific distribution. Then, we adopt the same setting as Zhang [55]: for $\rho \in (0,1)$ and $\eta \geq 1$, let
$$\varepsilon_n = \left(1 + \tfrac{1}{n}\right)\eta\, d_\pi + (\eta - \rho)\, \varepsilon_{\mathrm{upper},n}\!\big((\eta - 1)/(\eta - \rho)\big), \tag{5}$$
where $\varepsilon_{\mathrm{upper},n}$ is the critical upper-bracketing radius [55]. The decay rate of $\varepsilon_{\mathrm{upper},n}$ controls the consistency of the Bayesian posterior distribution [2]. Let $\rho = \tfrac{1}{2}$; then, for all states and actions, $h \geq 0$ and $\delta \in (0,1)$, with probability at least $1 - \delta$,
πnnpΓ : ||pq||2
1/22εn+ (4η2)h
δ/4o
X1
1 + enh ,(6)
where
πn(ωi)
denotes the posterior distribution over
ω
after collecting
n
samples,
X
denotes the
n
collected samples. It functions the same as Dt
iin Equation 1.
Similar to BOSS, for a new MDP $m \in M$ with hidden parameter $\omega_m$, we can define the Bayesian concentration sample complexity for the task-specific posterior, $f(s, a, \epsilon_0, \delta_0, \rho_0)$, as the minimum number $u$ such that, if $u$ i.i.d. transitions from $(s, a)$ are observed, then, with probability at least $1 - \delta_0$,
$$\Pr_{m' \sim \mathrm{posterior}}\big(\|T(\cdot \mid s, a, \omega_{m'}) - T(\cdot \mid s, a, \omega_m)\|_1 < \epsilon_0\big) \geq 1 - \rho_0. \tag{7}$$
Intuitively, the inequality means that, for a model whose hidden parameter is sampled from the learned posterior distribution, the probability that it is within $\epsilon_0$ of the true model is at least $1 - \rho_0$.
Lemma 4.3. Assume the posterior of each task is consistent (that is, $\varepsilon_{\mathrm{upper},n} = o(1)$) and set $\eta = 2$; then the Bayesian concentration sample complexity for the task-specific posterior satisfies
$$f(s, a, \epsilon, \delta, \rho) = O\!\left(\frac{d_\pi + \ln\frac{1}{\rho}}{\epsilon^2 \delta}\right) \propto d_\pi.$$
Proof (sketch). This bound can be derived by directly combining Proposition 4.2 and Equation 7.
The above lemma gives an upper bound on the Bayesian concentration sample complexity in terms of the prior-mass radius. We can further combine this result with PAC-MDP theory [39] and derive the sample complexity of the algorithm for each new task.
Proposition 4.4. For each new task, set the sample size $K = \Theta\!\left(\frac{S^2 A}{\delta}\ln\frac{SA}{\delta}\right)$ and the parameters $\epsilon_0 = \epsilon(1-\gamma)^2$, $\delta_0 = \frac{\delta}{SA}$, $\rho_0 = \frac{\delta}{S^2 A^2 K}$; then, with probability at least $1 - 4\delta$, $V_t(s_t) \geq V^*(s_t) - 4\epsilon_0$ in all but $\tilde{O}\!\left(\frac{S^2 A^2 d_\pi}{\epsilon^3 \delta^3 (1-\gamma)^6}\right)$ steps, where $\tilde{O}(\cdot)$ suppresses logarithmic dependence.
Proof (sketch). The central part of the proof is Proposition 4.2 (detailed proof in the appendix), and the remaining parts are exactly the same as those for BOSS. In general, the proof is based on the PAC-MDP theorem [40] combined with the new bound for the Bayesian concentration sample complexity derived in Lemma 4.3. For each new task, the main difference between BLRL and BOSS is that we use the world-model posterior to initialize the task-specific posterior, which results in a new sample complexity bound involving $d_\pi$.
The result formalizes how the sample complexity of the lifelong RL algorithm changes with the initialization quality of the posterior: if we put a larger prior mass at a density close to the true $q$, so that $d_\pi$ is small, the sample efficiency of the algorithm increases accordingly. In other words, the sample complexity of our algorithm drops proportionally to $d_\pi$, the distance between the Bayesian prior of the task-specific model (initialized by the parameters of the world-model posterior) and the true underlying task-specific distribution. We provide an illustrative example in Appendix O.
4.2 Variational Bayesian Lifelong RL
The intuition from the last section is that, if we initialize the task-specific distribution with a prior that is close to the true distribution, sample complexity will decrease accordingly. To scale our approach, we must find an efficient way to explicitly approximate these distributions. We propose a practical approximate algorithm, VBLRL, that uses neural networks and variational inference [21].
We choose Bayesian neural networks (BNNs) to approximate the posterior. The intuition is that, in the context of stochastic outputs, BNNs naturally approximate the hierarchical Bayesian model, since they also maintain a learnable distribution over their weights and biases [17; 23]. We use the uncertainty embedded in the weights and biases of the networks to capture the epistemic uncertainty introduced by the hidden parameters of different tasks, while we also set the outputs of the neural networks to be stochastic to capture the aleatory uncertainty within each specific task. In our case, the BNN weight-and-bias distribution $q(\omega; \phi)$ (a distribution over $\omega$ but parameterized by $\phi$) can be modeled as fully factorized Gaussian distributions [3]:
$$q(\omega; \phi) = \prod_{j=1}^{|\omega|} \mathcal{N}(\omega_j \mid \mu_j, \sigma_j^2), \tag{8}$$
where $\phi = \{\mu, \sigma\}$, $\mu$ is the Gaussian's mean vector, and $\sigma$ is the diagonal of the covariance matrix.
We maintain a world-model BNN shared across all the tasks and a task-specific BNN for each task. The input to all the BNNs is a state–action pair, and the outputs are the mean and variance of the prediction for the reward and next state. Then, the posterior distribution over the model parameters can be computed by leveraging variational lower bounds [22; 23]:
$$\phi_t = \arg\min_{\phi} \Big[ D_{KL}\big[q(\omega; \phi) \,\|\, p(\omega)\big] - \mathbb{E}_{\omega \sim q(\cdot; \phi)}\big[\log p(s_{t+1}, r_t \mid D_t, a_t; \omega)\big] \Big]. \tag{9}$$