
2023; 2022) and linear MDPs (Li & Sun, 2023).
3. Problem Setup
In this section, we give a formal definition of Latent
Markov Decision Processes (Latent MDPs).
Notations. For any event $E$, we use $[E]$ to denote the indicator function, i.e., $[E] = 1$ if $E$ holds and $[E] = 0$ otherwise. For any set $X$, we use $\Delta(X)$ to denote the probability simplex over $X$. For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For any probability distribution $P$, we use $\mathrm{supp}(P) = \|P\|_0$ to denote the size of the support of $P$, i.e., $\sum_x [P(x) > 0]$. There are three ways to denote a $d$-dimensional vector (function): suppose $p$ is any parameter, $x(p) = (x_1(p), x_2(p), \ldots, x_d(p))$ if the indices are natural numbers, and $x(\cdot \mid p) = (x(i_1 \mid p), x(i_2 \mid p), \ldots, x(i_d \mid p))$ and $x_p(\cdot) = (x_p(i_1), x_p(i_2), \ldots, x_p(i_d))$ if the indices are from the set $I = \{i_1, i_2, \ldots, i_d\}$. For any number $q$, we use $x^q$ to denote the vector $(x_1^q, x_2^q, \ldots, x_d^q)$. For two $d$-dimensional vectors $x$ and $y$, we use $x^\top y = \sum_i x_i y_i$ to denote the inner product. If $x$ is a probability distribution, we use $\mathbb{V}(x, y) = \sum_i x_i (y_i - x^\top y)^2 = x^\top (y^2) - (x^\top y)^2$ to denote the empirical variance. We use $\iota = 2 \ln\left(\frac{2MSAHK}{\delta}\right)$ as a log term, where $\delta$ is the confidence parameter.
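As a concrete illustration, the following sketch (ours, not part of the paper; the function names are hypothetical) computes $\mathrm{supp}(P)$ and checks that the two forms of the empirical variance $\mathbb{V}(x, y)$ above agree on explicit NumPy arrays.

```python
import numpy as np

def support_size(p: np.ndarray) -> int:
    """supp(P) = ||P||_0 = sum_x [P(x) > 0]."""
    return int(np.sum(p > 0))

def empirical_variance(x: np.ndarray, y: np.ndarray) -> float:
    """V(x, y) = sum_i x_i (y_i - x^T y)^2 = x^T (y^2) - (x^T y)^2."""
    mean = x @ y
    return float(x @ (y ** 2) - mean ** 2)

if __name__ == "__main__":
    x = np.array([0.2, 0.5, 0.3, 0.0])   # a probability distribution
    y = np.array([1.0, 0.0, 2.0, 5.0])   # an arbitrary vector
    assert support_size(x) == 3
    # The two equivalent forms of V(x, y) agree up to floating-point error.
    direct = float(np.sum(x * (y - x @ y) ** 2))
    assert abs(empirical_variance(x, y) - direct) < 1e-12
```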
3.1. Latent Markov Decision Process
A Latent MDP (Kwon et al., 2021) is a collection of finitely many MDPs $\mathcal{M} = \{M_1, M_2, \ldots, M_M\}$, where $M = |\mathcal{M}|$. All the MDPs share the state set $\mathcal{S}$, the action set $\mathcal{A}$ and the horizon $H$. Each MDP $M_m = (\mathcal{S}, \mathcal{A}, H, \nu_m, P_m, R_m)$ has its own initial state distribution $\nu_m \in \Delta(\mathcal{S})$, transition model $P_m : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ and deterministic reward function $R_m : \mathcal{S} \times \mathcal{A} \to [0, 1]$. Let $w_1, w_2, \ldots, w_M$ be the mixing weights of the MDPs, such that $w_m > 0$ for every $m$ and $\sum_{m=1}^{M} w_m = 1$.
Denote $S = |\mathcal{S}|$, $A = |\mathcal{A}|$ and $\Gamma = \max_{m,s,a} \mathrm{supp}(P_m(\cdot \mid s, a))$. $\Gamma$ can be interpreted as the maximum degree of each transition, a quantity our regret bound depends on. Note that we always have $\Gamma \le S$. In previous work, Lattimore & Hutter (2012) assume $\Gamma = 2$, and Fruit et al. (2020) also give a regret bound that scales with $\Gamma$.
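To make these definitions concrete, here is a minimal sketch of a container for a Latent MDP together with the quantity $\Gamma$; the class, field names and array layout are hypothetical and do not come from the paper.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class LatentMDP:
    """Hypothetical container: M MDPs over shared S, A, H."""
    w: np.ndarray    # mixing weights, shape (M,), positive, sums to 1
    nu: np.ndarray   # initial state distributions, shape (M, S)
    P: np.ndarray    # transition models, shape (M, S, A, S)
    R: np.ndarray    # deterministic rewards in [0, 1], shape (M, S, A)

    @property
    def Gamma(self) -> int:
        """Gamma = max_{m,s,a} supp(P_m(.|s, a)), the maximum transition degree."""
        return int(np.max(np.sum(self.P > 0, axis=-1)))
```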
In the worst case, the optimal policy of an LMDP is history-dependent and is PSPACE-hard to find (Corollary 1 and Proposition 3 in Steimle et al. (2021)). Aside from the computational difficulty, storing a history-dependent policy requires exponentially large space, so it is generally impractical. In this paper, we seek to provide a result for any fixed policy class $\Pi$. For example, we can take $\Pi$ to be the set of all history-independent, deterministic policies to alleviate the space issue. Following previous work (Kwon et al., 2021), we assume access to oracles for planning and optimization. See Section 5 for the formal definitions.
We consider an episodic, finite-horizon, undiscounted reinforcement learning problem on LMDPs. The agent interacts with the environment for $K$ episodes. At the start of every episode, one MDP $M_m \in \mathcal{M}$ is randomly chosen with probability $w_m$. Throughout the episode, the true context is hidden, and the agent can only choose actions based on the history observed up to the current time. However, at the end of each episode (after $H$ steps), the true context $m$ is revealed to the agent, which permits unbiased model estimation for the LMDP. As in Cohen et al. (2020), where the central difficulty is estimating the transition model, we focus on learning $P$ only. For simplicity, we assume that $w$ and $\nu$ are known to the agent, since they can be estimated easily.
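The interaction protocol can be sketched as follows. This is our own illustration rather than the paper's algorithm; it assumes the hypothetical `LatentMDP` container above and a history-independent policy `policy(s, t)`.

```python
import numpy as np

def run_episode(lmdp, policy, H, rng=None):
    """One episode: sample a hidden context, interact for H steps, then reveal it."""
    rng = np.random.default_rng() if rng is None else rng
    m = rng.choice(len(lmdp.w), p=lmdp.w)            # hidden context for this episode
    s = rng.choice(lmdp.nu.shape[1], p=lmdp.nu[m])   # s_1 ~ nu_m
    history, total_reward = [], 0.0
    for t in range(1, H + 1):
        a = policy(s, t)                             # action based on observable information only
        r = lmdp.R[m, s, a]
        s_next = rng.choice(lmdp.P.shape[-1], p=lmdp.P[m, s, a])
        history.append((s, a, r))
        total_reward += r
        s = s_next
    return history, total_reward, m                  # the true context m is revealed at episode end
```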
Conditions on rewards. We assume that $R$ is deterministic and known to the agent. This assumption is only for simplicity, and our analysis can be extended to unknown reward distributions with bounded support. We study LMDPs in which the total reward within an episode is upper-bounded by $1$ almost surely for any policy. This condition poses more difficulty for the design of a horizon-free algorithm, because the ordinary case of uniformly bounded rewards can be converted to total-bounded rewards by multiplying by $1/H$. Under this condition, an algorithm also needs to handle a possible spike of reward at a single step.
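The following toy snippet (ours, for illustration only) contrasts the two reward regimes mentioned above: per-step rewards scaled by $1/H$ versus the same total-reward budget concentrated at a single step.

```python
import numpy as np

H = 100

# Uniformly bounded per-step rewards in [0, 1], rescaled by 1/H so that the
# episode total is at most 1.
uniform_rewards = np.full(H, 1.0) / H

# A "spike": the total-reward-bounded-by-1 condition also allows the entire
# reward to arrive at a single step, which a horizon-free algorithm must handle.
spike_rewards = np.zeros(H)
spike_rewards[H // 2] = 1.0

# Both reward sequences have (approximately) the same episode total of 1.
assert np.isclose(uniform_rewards.sum(), 1.0)
assert np.isclose(spike_rewards.sum(), 1.0)
```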
3.2. Value functions, Q-functions and alpha vectors
By convention, the expected reward of executing a policy on any MDP can be defined via the value function $V$ and the Q-function $Q$. Since for MDPs there always exists an optimal policy that is history-independent, $V$ and $Q$ only need the current state and action as parameters.
However, these notations fall short in the LMDP setting, where policies are not restricted to be history-independent: the full information is encoded in the history, so here we use a more general definition, the alpha vector (following the notation of Kwon et al. (2021)). For any time $t \ge 1$, let $h_t = (s, a, r)_{1:t-1} s_t$ be the history up until time $t$. Define $\mathcal{H}_t$ as the set of histories observable at time step $t$, and $\mathcal{H} := \cup_{t=1}^{H} \mathcal{H}_t$ as the set of all possible histories. We define the alpha vectors $\alpha^{\pi}_{m}(h)$ for $(m, h) \in [M] \times \mathcal{H}$ as follows:
$$\alpha^{\pi}_{m}(h) := \mathbb{E}_{\pi, M_m}\!\left[\, \sum_{t'=t}^{H} R_m(s_{t'}, a_{t'}) \,\middle|\, h_t = h \right],$$

$$\alpha^{\pi}_{m}(h, a) := \mathbb{E}_{\pi, M_m}\!\left[\, \sum_{t'=t}^{H} R_m(s_{t'}, a_{t'}) \,\middle|\, (h_t, a_t) = (h, a) \right].$$
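As a sanity check of this definition, the following Monte-Carlo sketch (ours, not the paper's method) estimates $\alpha^{\pi}_{m}(h)$ for a history-independent policy, in which case the remaining reward inside $M_m$ depends on the history only through the current state $s_t$ and time $t$. It reuses the hypothetical `LatentMDP` container from the earlier snippets.

```python
import numpy as np

def alpha_estimate(lmdp, policy, m, s_t, t, H, n_rollouts=1000, rng=None):
    """Monte-Carlo estimate of alpha^pi_m(h) when h ends in state s_t at time t."""
    rng = np.random.default_rng() if rng is None else rng
    totals = []
    for _ in range(n_rollouts):
        s, total = s_t, 0.0
        for t_prime in range(t, H + 1):      # sum_{t'=t}^{H} R_m(s_{t'}, a_{t'})
            a = policy(s, t_prime)
            total += lmdp.R[m, s, a]
            s = rng.choice(lmdp.P.shape[-1], p=lmdp.P[m, s, a])
        totals.append(total)
    return float(np.mean(totals))
```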