
to learn a stochastic distribution over initial parameters [51; 16; 13]. Other work uses the collected trajectories to infer the hidden parameter, which is taken as an additional input when computing the policy [37; 56; 14]. Our method, however, focuses on problems where the tasks arrive sequentially instead of having a large number of tasks available at the beginning of training. This sequential setting makes it hard to accurately infer the hidden parameters, but opens the door for algorithms that support backward transfer.
Some prior work uses Bayesian methods in RL to quantify uncertainty over initial MDP models [15; 1; 18]. Several algorithms start from the idea of sampling from a posterior over MDPs for Bayesian RL, maintaining Bayesian posteriors and sampling one complete MDP [41; 49] or multiple MDPs [2].
Instead of focusing on single-task RL, our algorithm aims to find a posterior over the common
structure among multiple tasks. Wilson et al. [49] use a hierarchical Bayesian infinite mixture model to learn a strong prior that allows the agent to rapidly infer the characteristics of a new environment based on previous tasks. However, their approach only infers the category label of a new MDP and only works in discrete settings.
4 Model-based Lifelong Reinforcement Learning
Our approach is built upon two main intuitions. First, transferring the transition model instead of the policy or value function leads to more efficient use of data when “finetuning” on a new task. As we show empirically in Section 5.1, although some model-free lifelong RL algorithms outperform the proposed model-based method in the single-task case, in the lifelong RL setting with the same task type the model-based method still achieves comparable or better performance with only half the amount of data. Second, with a model that captures the different levels of uncertainty within HiP-MDPs, an agent can employ sample-based Bayesian exploration to further improve sample efficiency.
The model underlying our approach is a hierarchical Bayesian posterior over task MDPs controlled by the hidden parameter $\omega$. Intuitively, we maintain probability distributions that separately capture two categories of uncertainty within lifelong learning tasks. The world-model posterior $P(\omega)$ captures the epistemic uncertainty of the world-model distribution over all future and past tasks $m_1, \dots, m_n$ controlled by the hidden parameters $\omega_1, \dots, \omega_n \sim P_\Omega$. As the learner is exposed to more and more tasks, this posterior should converge to the world-model distribution $P_\Omega$. The task-specific posterior $P(\omega_i)$ captures the epistemic uncertainty of the current task $m_i$ (throughout the paper we often write $i$ for simplicity). As the learner is exposed to more and more transitions within the task, this posterior should approach the true distribution corresponding to $\omega_i$, i.e., peak at the true $\omega_i$ for this specific task $i$, leaving only the aleatoric uncertainty of transitions within the task, which is independent of other tasks. Each time the agent encounters a new task, we initialize the task-specific model using the world-model posterior and further train it with data collected only from the new task. One of our key insights is that the sample complexity of learning a new task decreases as the initial prior of the task-specific model approaches the true underlying distribution of the transition function. Thus, the agent can learn new tasks faster by exploiting knowledge common to previous tasks, thereby exhibiting positive forward transfer.
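To make the two levels of uncertainty concrete, the following is a minimal sketch (not the paper's implementation) that represents both posteriors as categorical distributions over a finite grid of candidate hidden parameters; the `candidate_omegas` grid is an illustrative assumption, and the updates in Eqs. (1) and (2) below then amount to reweighting these two vectors.

```python
import numpy as np

class HierarchicalPosterior:
    """Toy two-level posterior over a finite grid of candidate hidden parameters.

    world_post approximates the world-model posterior P(omega | D_{1:i}) across tasks;
    task_post approximates the task-specific posterior P(omega_i | D_i^t) for the
    current task. Both are categorical distributions over `candidate_omegas`.
    """

    def __init__(self, candidate_omegas):
        self.omegas = np.asarray(candidate_omegas)      # e.g. np.linspace(0.0, 1.0, 50)
        k = len(self.omegas)
        self.world_post = np.full(k, 1.0 / k)           # uniform prior before any task
        self.task_post = self.world_post.copy()

    def start_new_task(self):
        # When a new task arrives and D_i^t is still empty, the task-specific
        # prior is initialized from the world-model posterior.
        self.task_post = self.world_post.copy()
```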
Specifically, we model the task-specific posterior via the transition dynamics using $p(s_{t+1}, r_t \mid s_t, a_t; \omega_i)$. The task-specific posterior, given a new state–action pair from task $i$, can be rewritten via Bayes' rule:
$$P(\omega_i \mid D_i^t, a_t, s_{t+1}, r_t) = \frac{P(\omega_i \mid D_i^t)\, P(s_{t+1}, r_t \mid D_i^t, a_t; \omega_i)}{P(s_{t+1}, r_t \mid D_i^t, a_t)}, \tag{1}$$
where $D_i^t = \{s_1, a_1, \dots, s_t\}$ is the agent's history of task $i$ until time step $t$.
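As a hedged illustration, under the finite-grid assumption above Eq. (1) reduces to a single reweighting of the task-specific posterior. The `likelihood` callable is an assumed stand-in for the learned transition model $p(s_{t+1}, r_t \mid s_t, a_t; \omega_i)$, with the current state `s` playing the role of the conditioning on $D_i^t$ under the Markov assumption.

```python
import numpy as np

def update_task_posterior(task_post, omegas, likelihood, s, a, s_next, r):
    """One application of Eq. (1) over a discrete grid of candidate omegas.

    task_post[k] holds P(omega_k | D_i^t); the returned vector holds
    P(omega_k | D_i^t, a_t, s_{t+1}, r_t).
    """
    # Numerator: P(omega_i | D_i^t) * P(s_{t+1}, r_t | D_i^t, a_t; omega_i).
    weights = np.array([likelihood(s, a, s_next, r, w) for w in omegas])
    unnorm = task_post * weights
    # Denominator: the evidence P(s_{t+1}, r_t | D_i^t, a_t) is just the normalizer.
    return unnorm / unnorm.sum()
```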
The world-model posterior, given the new data from task $i$, can be rewritten as:
$$P(\omega \mid D_{1:i}) = \frac{P(\omega \mid D_{1:i-1})\, P(D_i \mid D_{1:i-1}; \omega)}{P(D_i \mid D_{1:i-1})}, \tag{2}$$
where $D_{1:i}$ denotes the agent's history over all experienced tasks $1, \dots, i$ up to the current task $i$.
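Similarly, a rough sketch of Eq. (2) under the same finite-grid assumption: given $\omega$, the data $D_i$ collected in task $i$ factors over its transitions, so the batch update is a product of per-transition likelihoods, computed here in log space for numerical stability.

```python
import numpy as np

def update_world_posterior(world_post, omegas, likelihood, task_data):
    """One application of Eq. (2): fold the data D_i from task i into P(omega | D_{1:i-1}).

    task_data is a list of (s, a, s_next, r) transitions collected in task i.
    """
    log_like = np.array([
        sum(np.log(likelihood(s, a, s_next, r, w)) for (s, a, s_next, r) in task_data)
        for w in omegas
    ])
    log_unnorm = np.log(world_post) + log_like
    log_unnorm -= log_unnorm.max()      # shift before exponentiating to avoid underflow
    unnorm = np.exp(log_unnorm)
    return unnorm / unnorm.sum()        # normalizer plays the role of P(D_i | D_{1:i-1})
```

In this toy form the evidence terms in both equations are just the normalizing sums over the grid; richer model classes would require approximate inference instead.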
In particular, each time the agent faces a new task $i$ and has not yet started updating its task-specific posterior (that is, $D_i^t = \emptyset$), we first use the world-model posterior to initialize the task-specific prior: $P(\omega_i \mid D_i^t) = P(\omega \mid D_{1:i})$. The world-model distribution aims to approximate the underlying