
to learn a stochastic distribution over initial parameters [51; 16; 13]. Other work uses the collected trajectories to infer the hidden parameter, which is taken as an additional input when computing the policy [37; 56; 14]. Our method, however, focuses on problems where the tasks arrive sequentially instead of having a large number of tasks available at the beginning of training. This sequential setting makes it hard to accurately infer the hidden parameters, but opens the door for algorithms that support backward transfer.
Some prior work uses Bayesian methods in RL to quantify uncertainty over initial MDP models [15; 1; 18]. Several algorithms start from the idea of sampling from a posterior over MDPs for Bayesian RL, maintaining Bayesian posteriors and sampling one complete MDP [41; 49] or multiple MDPs [2].
Instead of focusing on single-task RL, our algorithm aims to find a posterior over the common
structure among multiple tasks. Wilson et al. [49] use a hierarchical Bayesian infinite mixture model to learn a strong prior that allows the agent to rapidly infer the characteristics of a new environment based on previous tasks. However, their approach only infers the category label of a new MDP and only works in discrete settings.
4 Model-based Lifelong Reinforcement Learning
Our approach is built upon two main intuitions. First, transferring the transition model instead of the policy or value function leads to more efficient use of data when “finetuning” on a new task. As we show empirically in Section 5.1, although some model-free lifelong RL algorithms outperform the proposed model-based method in the single-task case, in the lifelong RL setting with the same task type the model-based method still achieves comparable or better performance with only half the amount of data. Second, with a model that captures the different levels of uncertainty within HiP-MDPs, an agent can employ sample-based Bayesian exploration to further improve sample efficiency.
The model underlying our approach is a hierarchical Bayesian posterior over task MDPs controlled by the hidden parameter $\omega$. Intuitively, we maintain probability distributions that separately capture two categories of uncertainty within lifelong learning tasks. The world-model posterior $P(\omega)$ captures the epistemic uncertainty of the world-model distribution over all future and past tasks $m_1, \dots, m_n$ controlled by the hidden parameters $\omega_1, \dots, \omega_n \sim P_\Omega$. As the learner is exposed to more and more tasks, this posterior should converge to the world-model distribution $P_\Omega$. The task-specific posterior $P(\omega_i)$ captures the epistemic uncertainty of the current task $m_i$ (throughout the paper we often write $i$ for simplicity). As the learner is exposed to more and more transitions within the task, this posterior should approach the true distribution corresponding to $\omega_i$, i.e., peak at the true $\omega_i$ for this specific task $i$, leaving only the aleatoric uncertainty of transitions within the task, which is independent of other tasks. Each time the agent encounters a new task, we initialize the task-specific model using the world-model posterior and further train it with data collected only from the new task. One of our key insights is that the sample complexity of learning a new task decreases as the initial prior of the task-specific model approaches the true underlying distribution of the transition function. Thus, the agent can learn new tasks faster by exploiting knowledge common to previous tasks, thereby exhibiting positive forward transfer.
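To make the two levels of uncertainty concrete, the following is a minimal sketch (not the paper's implementation) that represents both posteriors as categorical distributions over a finite grid of candidate hidden parameters; the `candidate_omegas` grid is an illustrative assumption, and the updates in Eqs. (1) and (2) below then amount to reweighting these two vectors.

```python
import numpy as np

class HierarchicalPosterior:
    """Toy two-level posterior over a finite grid of candidate hidden parameters.

    world_post approximates the world-model posterior P(omega | D_{1:i}) across tasks;
    task_post approximates the task-specific posterior P(omega_i | D_i^t) for the
    current task. Both are categorical distributions over `candidate_omegas`.
    """

    def __init__(self, candidate_omegas):
        self.omegas = np.asarray(candidate_omegas)      # e.g. np.linspace(0.0, 1.0, 50)
        k = len(self.omegas)
        self.world_post = np.full(k, 1.0 / k)           # uniform prior before any task
        self.task_post = self.world_post.copy()

    def start_new_task(self):
        # When a new task arrives and D_i^t is still empty, the task-specific
        # prior is initialized from the world-model posterior.
        self.task_post = self.world_post.copy()
```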
Specifically, we model the task-specific posterior via the transition dynamics using $p(s_{t+1}, r_t \mid s_t, a_t; \omega_i)$. The task-specific posterior, given a new state–action pair from task $i$, can be rewritten via Bayes' rule:
$$P(\omega_i \mid D_i^t, a_t, s_{t+1}, r_t) = \frac{P(\omega_i \mid D_i^t)\, P(s_{t+1}, r_t \mid D_i^t, a_t; \omega_i)}{P(s_{t+1}, r_t \mid D_i^t, a_t)}, \tag{1}$$
where $D_i^t = \{s_1, a_1, \dots, s_t\}$ is the agent's history of task $i$ until time step $t$.
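As a hedged illustration, under the finite-grid assumption above Eq. (1) reduces to a single reweighting of the task-specific posterior. The `likelihood` callable is an assumed stand-in for the learned transition model $p(s_{t+1}, r_t \mid s_t, a_t; \omega_i)$, with the current state `s` playing the role of the conditioning on $D_i^t$ under the Markov assumption.

```python
import numpy as np

def update_task_posterior(task_post, omegas, likelihood, s, a, s_next, r):
    """One application of Eq. (1) over a discrete grid of candidate omegas.

    task_post[k] holds P(omega_k | D_i^t); the returned vector holds
    P(omega_k | D_i^t, a_t, s_{t+1}, r_t).
    """
    # Numerator: P(omega_i | D_i^t) * P(s_{t+1}, r_t | D_i^t, a_t; omega_i).
    weights = np.array([likelihood(s, a, s_next, r, w) for w in omegas])
    unnorm = task_post * weights
    # Denominator: the evidence P(s_{t+1}, r_t | D_i^t, a_t) is just the normalizer.
    return unnorm / unnorm.sum()
```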
The world-model posterior, given the new data from task $i$, can be rewritten as:
$$P(\omega \mid D_{1:i}) = \frac{P(\omega \mid D_{1:i-1})\, P(D_i \mid D_{1:i-1}; \omega)}{P(D_i \mid D_{1:i-1})}, \tag{2}$$
where $D_{1:i}$ denotes the agent's history over all experienced tasks $1, \dots, i$ up to the current task $i$.
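Similarly, a rough sketch of Eq. (2) under the same finite-grid assumption: given $\omega$, the data $D_i$ collected in task $i$ factors over its transitions, so the batch update is a product of per-transition likelihoods, computed here in log space for numerical stability.

```python
import numpy as np

def update_world_posterior(world_post, omegas, likelihood, task_data):
    """One application of Eq. (2): fold the data D_i from task i into P(omega | D_{1:i-1}).

    task_data is a list of (s, a, s_next, r) transitions collected in task i.
    """
    log_like = np.array([
        sum(np.log(likelihood(s, a, s_next, r, w)) for (s, a, s_next, r) in task_data)
        for w in omegas
    ])
    log_unnorm = np.log(world_post) + log_like
    log_unnorm -= log_unnorm.max()      # shift before exponentiating to avoid underflow
    unnorm = np.exp(log_unnorm)
    return unnorm / unnorm.sum()        # normalizer plays the role of P(D_i | D_{1:i-1})
```

In this toy form the evidence terms in both equations are just the normalizing sums over the grid; richer model classes would require approximate inference instead.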
In particular, each time the agent faces a new task $i$ and has not yet started updating its task-specific posterior (that is, $D_i^t = \emptyset$), we first use the world-model posterior to initialize the task-specific prior: $P(\omega_i \mid D_i^t) = P(\omega \mid D_{1:i})$. The world-model distribution aims to approximate the underlying