
icy for deterministic MDPs. However, their work differs from ours since they study a restricted setting under structural assumptions on the environments, such as all MDPs sharing a common optimal policy, and their algorithm requires access to a query model. Our paper is also related to recent works studying multi-task learning in RL (Brunskill & Li, 2013; Tirinzoni et al., 2020; Hu et al., 2021; Zhang & Wang, 2021; Lu et al., 2021), which study how to transfer the knowledge learned from previous tasks to new tasks. Their problem formulation is different from ours since they consider the multi-task setting where the MDP is selected from a given MDP set without an underlying probability distribution. In addition, they typically assume that all the tasks have similar transition dynamics or share common representations.
Provably Efficient Exploration in RL. Recent years have witnessed many theoretical results studying provably efficient exploration in RL (Osband et al., 2013; Azar et al., 2017; Osband & Van Roy, 2017; Jin et al., 2018; 2020b; Wang et al., 2020; Zhang et al., 2021), with the minimax regret for tabular MDPs with non-stationary transition being $\tilde{O}(\sqrt{HSAK})$. These results indicate that polynomial dependence on the whole state-action space is unavoidable without additional assumptions. Their formulation corresponds to the single-task setting where the agent only interacts with a single environment aiming to maximize its cumulative rewards without pre-training. The regret defined in the fine-tuning setting coincides with the concept of Bayesian regret in the previous literature (Osband et al., 2013; Osband & Van Roy, 2017; O'Donoghue, 2021). The best-known Bayesian regret for tabular RL is $\tilde{O}(\sqrt{HSAK})$ when applied to our setting (O'Donoghue, 2021).
Latent MDPs. This work is also related to previous results on latent MDPs (Kwon et al., 2021b;a), which are a special class of Partially Observable MDPs (Azizzadenesheli et al., 2016; Guo et al., 2016; Jin et al., 2020a). In latent MDPs, the MDP representing the dynamics of the environment is sampled from an unknown distribution at the start of each episode. Their work is different from ours since they focus on tackling the challenge of partial observability.
3. Preliminary and Framework
Notations. Throughout the paper, we use $[N]$ to denote the set $\{1, \cdots, N\}$ where $N \in \mathbb{N}_+$. For an event $E$, let $\mathbb{I}[E]$ be the indicator function of event $E$, i.e., $\mathbb{I}[E] = 1$ if and only if $E$ is true. For any domain $\Omega$, we use $C(\Omega)$ to denote the set of continuous functions on $\Omega$. We use $O(\cdot)$ to denote the standard big-$O$ notation, and $\tilde{O}(\cdot)$ to denote the big-$O$ notation with logarithmic factors omitted.
3.1. Episodic MDPs
An episodic MDP $\mathcal{M}$ is specified as a tuple $(\mathcal{S}, \mathcal{A}, P_{\mathcal{M}}, R_{\mathcal{M}}, H)$, where $\mathcal{S}, \mathcal{A}$ are the state and action space with cardinality $S$ and $A$ respectively, and $H$ is the number of steps in one episode. $P_{\mathcal{M},h} : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathcal{S})$ is the transition kernel such that $P_{\mathcal{M},h}(s' \mid s, a)$ denotes the probability of transiting to state $s'$ if action $a$ is taken in state $s$ at step $h$. $R_{\mathcal{M},h} : \mathcal{S} \times \mathcal{A} \mapsto \Delta(\mathbb{R})$ is the reward function such that $R_{\mathcal{M},h}(s, a)$ is the distribution of the reward, with non-negative mean $r_{\mathcal{M},h}(s, a)$, when action $a$ is taken in state $s$ at step $h$. In order to compare with traditional generalization, we make the following assumption:
Assumption 3.1. The total mean reward is bounded by 1, i.e., $\forall \mathcal{M} \in \Omega$, $\sum_{h=1}^{H} r_{\mathcal{M},h}(s_h, a_h) \le 1$ for all trajectories $(s_1, a_1, \cdots, s_H, a_H)$ with positive probability in $\mathcal{M}$; the reward mechanism $R_{\mathcal{M}}(s, a)$ is 1-subgaussian, i.e.,
$$\mathbb{E}_{X \sim R_{\mathcal{M},h}(s,a)}\big[\exp\big(\lambda [X - r_{\mathcal{M},h}(s, a)]\big)\big] \le \exp\Big(\frac{\lambda^2}{2}\Big)$$
for all $\lambda \in \mathbb{R}$.
The total reward assumption follows previous works on horizon-free RL (Ren et al., 2021; Zhang et al., 2021; Li et al., 2022) and covers the traditional setting where $r_{\mathcal{M},h}(s, a) \in [0, 1]$ by scaling with $H$; it is also more natural in environments with sparse rewards (Vecerik et al., 2017; Riedmiller et al., 2018). In addition, it allows us to compare with supervised learning bounds, where $H = 1$ and the loss is bounded in $[0, 1]$. The subgaussian assumption is common in practice and is widely used in bandits (Lattimore & Szepesvári, 2020). It also covers the traditional RL setting where $R_{\mathcal{M},h}(s, a) \in \Delta([0, 1])$, and allows us to study MDP environments with a wider range of rewards. For convenience of exposition, we assume the agent always starts from the same state $s_1$. It is straightforward to recover an initial state distribution $\mu$ from this setting by adding an initial state $s_0$ with transition $\mu$ (Du et al., 2019; Chen et al., 2021).
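To make the reduction concrete, the following short derivation is our own illustration (it is not part of the original argument and assumes the rewards of the traditional setting are rescaled by $1/H$, i.e., $\tilde{R}_{\mathcal{M},h} = R_{\mathcal{M},h}/H$). If $r_{\mathcal{M},h}(s, a) \in [0, 1]$, then
$$\sum_{h=1}^{H} \tilde{r}_{\mathcal{M},h}(s_h, a_h) = \frac{1}{H} \sum_{h=1}^{H} r_{\mathcal{M},h}(s_h, a_h) \le 1,$$
and by Hoeffding's lemma any random variable $X$ supported on $[0, 1]$ with mean $m$ satisfies
$$\mathbb{E}\big[\exp\big(\lambda (X - m)\big)\big] \le \exp\Big(\frac{\lambda^2}{8}\Big) \le \exp\Big(\frac{\lambda^2}{2}\Big) \quad \text{for all } \lambda \in \mathbb{R},$$
so every reward distribution in $\Delta([0, 1])$ is 1-subgaussian, matching the two conditions of Assumption 3.1.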
Policy and Value Function. A policy $\pi$ is a set of $H$ functions, each mapping a state to an action distribution, i.e., $\pi = \{\pi_h\}_{h=1}^{H}$ with $\pi_h : \mathcal{S} \mapsto \Delta(\mathcal{A})$, and $\pi$ can be stochastic. We denote the set of all policies described above as $\Pi$. We define $N^{\Pi}_{\epsilon}$ as the $\epsilon$-covering number of the policy space $\Pi$ w.r.t. the distance $d(\pi^1, \pi^2) = \max_{s \in \mathcal{S}, h \in [H]} \|\pi^1_h(\cdot \mid s) - \pi^2_h(\cdot \mid s)\|_1$. Given $\pi$ and $h \in [H]$, we define the Q-function $Q^{\pi}_{\mathcal{M},h} : \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}_+$, where
$$Q^{\pi}_{\mathcal{M},h}(s, a) = r_{\mathcal{M},h}(s, a) + \sum_{s' \in \mathcal{S}} P_{\mathcal{M},h}(s' \mid s, a) \, V^{\pi}_{\mathcal{M},h+1}(s'),$$
and the V-function $V^{\pi}_{\mathcal{M},h} : \mathcal{S} \mapsto \mathbb{R}_+$, where
$$V^{\pi}_{\mathcal{M},h}(s) = \mathbb{E}_{a \sim \pi_h(\cdot \mid s)} \, Q^{\pi}_{\mathcal{M},h}(s, a).$$
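For concreteness, $Q^{\pi}_{\mathcal{M},h}$ and $V^{\pi}_{\mathcal{M},h}$ can be computed exactly by backward induction in the tabular case. The following Python sketch is our illustration rather than part of the paper; the array layout and the names P, r, pi are assumptions made for this example, and steps are zero-indexed.

```python
import numpy as np

def policy_evaluation(P, r, pi):
    """Evaluate Q^pi and V^pi by backward induction in a tabular episodic MDP.

    Assumed (illustrative) array shapes:
      P  : (H, S, A, S), P[h, s, a, s'] = transition probability P_{M,h}(s'|s,a)
      r  : (H, S, A),    mean rewards r_{M,h}(s, a)
      pi : (H, S, A),    pi[h, s, a] = probability of action a in state s at step h
    """
    H, S, A = r.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))  # V[H] = 0: value after the final step is zero
    for h in reversed(range(H)):
        # Q^pi_h(s, a) = r_h(s, a) + sum_{s'} P_h(s'|s, a) V^pi_{h+1}(s')
        Q[h] = r[h] + P[h] @ V[h + 1]
        # V^pi_h(s) = E_{a ~ pi_h(.|s)} Q^pi_h(s, a)
        V[h] = np.sum(pi[h] * Q[h], axis=-1)
    return Q, V
```

Here the convention $V^{\pi}_{\mathcal{M},H+1} \equiv 0$ closes the recursion, matching the definitions above.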