
Multi-task RL. Learning a suite of tasks with an RL agent has been studied under different frameworks [3, 44], such as Latent MDP [45], Multi-model MDP [5], Contextual MDP [46], and Hidden Parameter MDP [47], among others [48]. Our proposed HLMDP builds on the Latent MDP [45], which contains a finite number of MDPs, each accompanied by a weight. In contrast to Latent MDP, which uses a flat structure to model each MDP's probability, HLMDP leverages a richer hierarchical model that clusters MDPs into a finite number of mixtures. In addition, HLMDP is a special yet important subclass of POMDP [49]: it treats the latent task mixture that the current environment belongs to as the unobservable variable. HLMDP resembles the recently proposed Hierarchical Bayesian Bandit model [50] but focuses on the more complex MDP setting.
3 Preliminary
This section introduces Latent MDP and the adaptive belief
setting, both serving as building blocks for our proposed
HLMDP (Section 4) and GDR-MDP (Section 5).
Latent MDP. An episodic Latent MDP [45] is specified by a tuple $(\mathcal{M}, T, \mathcal{S}, \mathcal{A}, \mu)$. $\mathcal{M}$ is a set of MDPs with cardinality $|\mathcal{M}| = M$. Here $T$, $\mathcal{S}$, and $\mathcal{A}$ are the shared episode length (planning horizon), state space, and action space, respectively. $\mu$ is a categorical distribution over MDPs with $\sum_{m=1}^{M} \mu(m) = 1$. Each MDP $\mathcal{M}_m \in \mathcal{M}$, $m \in [M]$, is a tuple $(T, \mathcal{S}, \mathcal{A}, P_m, R_m, \nu_m)$, where $P_m$ is the transition probability, $R_m$ is the reward function, and $\nu_m$ is the initial state distribution.
Latent MDP assumes that at the beginning of each episode, one MDP from the set $\mathcal{M}$ is sampled according to $\mu(m)$. It aims to find a policy $\pi$ that maximizes the expected cumulative return by solving $\max_{\pi} \sum_{m=1}^{M} \mu(m)\, \mathbb{E}^{\pi}_{m}\!\left[\sum_{t=1}^{T} r_t\right]$, where $\mathbb{E}_m[\cdot]$ denotes $\mathbb{E}_{P_m, R_m}[\cdot]$.
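For concreteness, the following is a minimal Python sketch of this objective for a fixed policy: one MDP is sampled per episode from $\mu$ and the return is estimated by Monte Carlo. It is only an illustration; the environment interface (`reset`/`step` returning `(s, r, done)`) and the `policy` callable are assumptions, not part of the formal framework.

```python
import numpy as np

def latent_mdp_return(envs, mu, policy, T, num_episodes=1000, rng=None):
    """Monte Carlo estimate of sum_m mu(m) * E_m^pi [ sum_{t=1}^T r_t ].

    envs   : list of M environments with reset() -> s and step(a) -> (s, r, done)
    mu     : length-M array of MDP weights, summing to 1
    policy : function mapping a state to an action
    """
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(num_episodes):
        m = rng.choice(len(envs), p=mu)      # sample one MDP per episode
        env = envs[m]
        s, ep_return = env.reset(), 0.0
        for _ in range(T):
            a = policy(s)
            s, r, done = env.step(a)
            ep_return += r
            if done:
                break
        total += ep_return
    return total / num_episodes
```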
The Adaptive Belief Setting. In general, a belief distribution assigns a probability to each possible MDP that the current environment may belong to. The adaptive belief setting [5] maintains a belief distribution that is dynamically updated with streaming observed interactions and prior knowledge about the MDPs. In practice, prior knowledge may be acquired by rule-based policies or data-driven learning methods. For example, it is possible to pre-train in simulated complete-information scenarios or to exploit unsupervised learning methods on data collected online [51]. There also exist multiple choices for updating the belief, such as applying Bayes' rule as in POMDPs [49] or representing beliefs with deep recurrent neural networks [52].
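As one illustration of the recurrent-network option, the sketch below (our own illustration, not the architecture of [52]) encodes the interaction history with a GRU and outputs a categorical belief through a softmax head; the dimensions and class name are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentBelief(nn.Module):
    """Maps a history of (state, action, reward) tuples to a belief over M MDPs."""

    def __init__(self, obs_dim, act_dim, num_mdps, hidden_dim=64):
        super().__init__()
        # Each timestep's input concatenates state, action, and scalar reward.
        self.gru = nn.GRU(obs_dim + act_dim + 1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_mdps)

    def forward(self, history):
        # history: tensor of shape (batch, time, obs_dim + act_dim + 1)
        _, h = self.gru(history)
        return torch.softmax(self.head(h[-1]), dim=-1)  # belief b(m), sums to 1
```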
4 Hierarchical Latent MDP
In realistic settings, tasks share similarities, and task subpopulations are common. Although different MDP formulations have been proposed to solve multi-task RL, task relationships are generally overlooked. To fill this gap, we first propose the Hierarchical Latent MDP (HLMDP), which utilizes a hierarchical mixture model to represent distributions over MDPs. Moreover, we consider the adaptive belief setting to leverage prior information about tasks.
Definition 1 (Hierarchical Latent MDPs). An episodic HLMDP is defined by a tuple $(\mathcal{Z}, \mathcal{M}, T, \mathcal{S}, \mathcal{A}, w)$. $\mathcal{Z}$ denotes a set of Latent MDPs with $|\mathcal{Z}| = Z$. $\mathcal{M}$ is a set of MDPs with cardinality $|\mathcal{M}| = M$ shared by the different Latent MDPs. $T$, $\mathcal{S}$, and $\mathcal{A}$ are the shared episode length (planning horizon), state space, and action space, respectively. Each Latent MDP $\mathcal{Z}_z \in \mathcal{Z}$, $z \in [Z]$, consists of the set of joint MDPs $\{\mathcal{M}_m\}_{m=1}^{M}$ and their weights $\mu_z$ satisfying $\sum_{m=1}^{M} \mu_z(m) = 1$. $w$ is the categorical distribution over Latent MDPs with $\sum_{z=1}^{Z} w(z) = 1$.
We provide a graphical model of HLMDP in Figure 1(c). HLMDP assumes that at the beginning of each episode, the environment first samples a Latent MDP $z \sim w(z)$ and then samples an MDP $m \sim \mu_z(m)$. HLMDP encodes task similarity information via the mixture model and thus contains richer task information than the Latent MDP proposed in [45]. For instance, we can always recover one corresponding Latent MDP from each HLMDP by marginalizing over the mixtures, whereas there may exist infinitely many corresponding HLMDPs for a given Latent MDP.
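A minimal sketch of this generative process and of the marginalization argument (our own illustration; the toy weights are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HLMDP with Z = 2 mixtures over M = 3 shared MDPs.
w = np.array([0.6, 0.4])                 # w(z): distribution over Latent MDPs
mu = np.array([[0.7, 0.2, 0.1],          # mu_z(m): per-mixture MDP weights
               [0.1, 0.3, 0.6]])

def sample_episode_task():
    """Hierarchical sampling at the start of each episode: z ~ w, then m ~ mu_z."""
    z = rng.choice(len(w), p=w)
    m = rng.choice(mu.shape[1], p=mu[z])
    return z, m

# Marginalizing over mixtures recovers an equivalent flat Latent MDP:
# mu_flat(m) = sum_z w(z) * mu_z(m)
mu_flat = w @ mu
print(mu_flat)  # [0.46 0.24 0.3 ]
```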
HLMDP in Adaptive Belief Setting. When solving multi-task RL problems, the adaptive setting has been shown to yield higher-performing policies than the non-adaptive one [5], since it leverages prior knowledge about the transition model as well as data collected online in the unseen environment. Hence we are motivated to formulate HLMDP in the adaptive belief setting. HLMDP maintains a belief distribution $b(z)$ over task groups to model the probability that the current environment belongs to each group $z$. At the beginning of each episode, we initialize the belief with a uniform distribution $b_0$. We use Bayes' rule to update beliefs based on interactions and a prior knowledge base. Note that the knowledge base may not be accurate enough and may therefore lead to inaccurate belief updates. At timestep $t$, we obtain the next belief estimate $b_{t+1}$ with the state estimation function $\mathrm{SE}$:
\begin{equation}
b_{t+1}(j) = \mathrm{SE}(b_t, s_t)(j) = \frac{b_t(j)\, L(j)}{\sum_{i \in [Z]} b_t(i)\, L(i)}, \quad \forall j \in [Z],
\end{equation}
where $L(j)$ denotes the likelihood of the new observation $s_t$ under Latent MDP $j$.
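A minimal NumPy sketch of this belief update (an illustration only; the likelihood values stand in for the prior knowledge base):

```python
import numpy as np

def belief_update(b_t, likelihoods):
    """One step of Eq. (1): b_{t+1}(j) is proportional to b_t(j) * L(j), renormalized.

    b_t         : length-Z array, current belief over Latent MDPs
    likelihoods : length-Z array, L(j) = likelihood of the new observation under group j
    """
    unnormalized = b_t * likelihoods
    return unnormalized / unnormalized.sum()

# Usage: uniform initial belief over Z = 3 groups, hypothetical likelihoods.
b0 = np.ones(3) / 3
print(belief_update(b0, np.array([0.5, 0.2, 0.1])))  # -> [0.625 0.25  0.125]
```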
Under the adaptive belief setting, HLMDP aims to find an optimal policy $\bar{\pi}^\star$ within a history-dependent policy class $\Pi$ that maximizes the discounted expected cumulative reward as in Equation 2. Following standard POMDP notation, we denote the history at time $t$ as $h_t = (s_0, a_0, s_1, \ldots, s_{t-1}, a_{t-1}, s_t) \in \mathcal{H}_t$, containing state-action pairs $(s, a)$. At timestep $t$, we use both the observed state $s_t$ and the inferred belief distribution $b_t(z)$ as the sufficient statistics for the history $h_t$.
\begin{equation}
\bar{V}^\star = \max_{\pi \in \Pi} \; \mathbb{E}_{b_{0:T}(z)}\, \mathbb{E}_{\mu_z(m)}\, \mathbb{E}^{\pi}_{m}\!\left[\sum_{t=1}^{T} \gamma^t r_t\right].
\end{equation}
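To tie Equations 1 and 2 together, the sketch below (our own illustration) rolls out a single episode in the adaptive belief setting: the policy conditions on $(s_t, b_t)$, the belief is updated with Eq. (1) at every step, and discounted rewards are accumulated as in Eq. (2). The `likelihood_fn` argument is an assumed stand-in for the prior knowledge base, and the environment interface is hypothetical.

```python
import numpy as np

def adaptive_rollout(env, policy, likelihood_fn, Z, T, gamma):
    """One episode in the adaptive belief setting.

    policy        : function (state, belief) -> action
    likelihood_fn : function (s, a, s_next) -> length-Z array of likelihoods L(j)
    """
    b = np.ones(Z) / Z                      # uniform initial belief b_0
    s, ep_return = env.reset(), 0.0
    for t in range(1, T + 1):
        a = policy(s, b)                    # history summarized by (s_t, b_t)
        s_next, r, done = env.step(a)
        ep_return += (gamma ** t) * r       # discounted return as in Eq. (2)
        L = likelihood_fn(s, a, s_next)     # likelihood of the new observation per group
        b = b * L / np.sum(b * L)           # belief update from Eq. (1)
        s = s_next
        if done:
            break
    return ep_return, b
```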