Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

arXiv:2210.11604v3 [cs.LG] 21 May 2023
Runlong Zhou¹, Ruosong Wang¹, Simon S. Du¹
Abstract
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\widetilde{O}(\sqrt{\mathrm{Var} \cdot M \Gamma S A K})$ regret bound, where $\widetilde{O}$ hides logarithmic factors, $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, $\Gamma \le S$ is the maximum transition degree of any state-action pair, and $\mathrm{Var}$ is a variance quantity describing the determinism of the LMDP. The regret bound scales only logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. This is also the first problem-dependent regret bound for LMDPs. Key to our proof is an analysis of the total variance of alpha vectors (a generalization of value functions), which is handled with a truncation method. We complement our positive result with a novel $\Omega(\sqrt{\mathrm{Var} \cdot M S A K})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant, for the class of variance-bounded LMDPs. Our lower bound relies on new constructions of hard instances and an argument inspired by the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs, and thus may be of independent interest.
¹Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA. Correspondence to: Simon S. Du <ssdu@cs.washington.edu>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
1. Introduction
One of the most popular models for reinforcement learning (RL) is the Markov Decision Process (MDP), in which the transitions and rewards depend only on the current state and the agent's action. In standard MDPs, the agent fully observes the state, so the optimal policy also depends only on states (called a history-independent policy). There is a long line of research on MDPs, and minimax regret and sample complexity guarantees have been derived.
Another popular model is the Partially Observable MDP (POMDP), in which the agent only has partial observations of the states. Even though the underlying transition is still Markovian, the sample complexity lower bound has been proven to be exponential in the state and action sizes. This is in part because the optimal policies for POMDPs are history-dependent.
In this paper we focus on a middle ground between MDPs and POMDPs, namely the Latent MDP (LMDP). An LMDP can be viewed as a collection of MDPs sharing the same state and action spaces, but whose transitions and rewards may vary across them. One MDP is sampled with some probability at the beginning of each episode and does not change during the episode. The agent needs to find a policy which works well on these MDPs in an average sense. Empirically, LMDPs can be used for a wide variety of applications (Yu et al., 2020; Iakovleva et al., 2020; Finn et al., 2018; Ramamoorthy et al., 2013; Doshi-Velez & Konidaris, 2016; Yao et al., 2018). In general, there exists no policy that is optimal on every single MDP simultaneously, so this task is harder than learning a single MDP. On the other hand, an LMDP is a special case of a POMDP because the unobserved component (the identity of the MDP) is static within each episode and the observable state is just the state of the sampled MDP.
Unfortunately, for generic LMDPs, there exists an exponential sample complexity lower bound (Kwon et al., 2021), so additional assumptions are needed to make the problem tractable. In this paper, we consider the setting in which, after each episode ends, the agent receives the context, i.e., which MDP it interacted with. This is called context in hindsight.
Such information is often available. For example, in a maze
navigation task, the location of the goal state can be viewed
as the context.
In this setting, Kwon et al. (2021) obtained an $\widetilde{O}(\sqrt{M S^2 A H K})$ regret upper bound¹, where $H$ is the planning horizon. They did not study the regret lower bound. To benchmark this result, the only available bound is the $\Omega(\sqrt{SAK})$ lower bound from standard MDPs, obtained by viewing an MDP as a special case of an LMDP.

¹Their original bound is $\widetilde{O}(\sqrt{M S^2 A H^3 K})$ with the scaling that the reward from each step is bounded by 1. We rescale the reward to be bounded by $1/H$ in order to make the total reward from each episode bounded by 1, which is the setting we consider.
When we view an MDP as a special case of an LMDP, there are problem-dependent results (Zanette & Brunskill, 2019; Bartlett & Tewari, 2012; Fruit et al., 2018; Maillard et al., 2014; Jin et al., 2020; Wagenmaker et al., 2022) which draw our attention. RL algorithms often perform far better on MDPs with special structures than their worst-case guarantees would suggest. Algorithms with a problem-dependent regret guarantee should automatically adapt to the MDP instance without prior knowledge of problem-dependent quantities. Zanette & Brunskill (2019) provide an algorithm for MDPs whose regret scales with the maximum per-step conditional variance of the MDP instance. This quantity, $Q^\star$, is determined by the value function of the optimal policy. Their regret bound $\widetilde{O}(\sqrt{H Q^\star \cdot SAK} + H^{5/2} S^2 A)$ reduces to a constant $\widetilde{O}(H^{5/2} S^2 A)$ when the MDP is deterministic ($Q^\star = 0$). This work motivates us to further study how to obtain a variance-dependent regret.
Comparing these bounds, we find significant gaps. Is the dependency on $M$ in LMDPs necessary? The bound for MDPs is (nearly) horizon-free (no dependency on $H$); is the polynomial dependency on $H$ in LMDPs necessary? If the LMDP reduces to a deterministic MDP, can the algorithm automatically achieve a constant regret (up to logarithmic factors)? Further, is it possible to have variance-dependency in the regret bound for LMDPs? Also, the dependency on the number of states is $\sqrt{S}$ for MDPs, but the bound in (Kwon et al., 2021) for LMDPs scales with $S$.

In this paper, we resolve the first three questions and partially answer the fourth.
1.1. Main contributions and technical novelties
We obtain the following new results:
A suitable notion of variance for LMDPs. There is no easy analogue of $Q^\star$ under the LMDP setting, because different optimal policies may behave differently on a given MDP. The optimal alpha vector (the counterpart of the value function for LMDPs) is not unique, so it is hard to define a variance quantity depending only on the optimal alpha vector. We define a suitable notion of variance, $\mathrm{Var}$, which is the maximum variance of the total reward induced by any deterministic policy (see Section 4). By definition, $\mathrm{Var} \le O(1)$ because the total reward is bounded by 1, so a regret bound that depends on $\sqrt{\mathrm{Var}}$ automatically recovers the worst-case regret bound.
Near-optimal regret guarantee for LMDPs. We present an algorithmic framework for LMDPs with context in hindsight. This framework can be instantiated with a plug-in solver for planning problems. We consider two types of solvers, one model-optimistic and one value-optimistic, and prove their regret to be $\widetilde{O}(\sqrt{\mathrm{Var} \cdot M \Gamma S A K} + M S^2 A)$, where $\Gamma \le S$ is the maximum transition degree of any state-action pair. Compared with the result in (Kwon et al., 2021), ours only requires the total reward to be bounded, whereas they required a bounded reward at each step. We improve the $H$-dependency from $\sqrt{H}$ to logarithmic, making our bound (nearly) horizon-free. Furthermore, our bound scales with $\sqrt{S\Gamma}$, which is strictly better than the $S$ in their bound. Finally, our main order term is variance-dependent, which means that when the LMDP reduces to a deterministic MDP ($\mathrm{Var} = 0$), our regret is constant up to logarithmic factors.
The main technique of our model-optimistic algorithm is to use a Bernstein-type confidence set on each entry of the transition dynamics, leading to a small Bellman error. The main difference between our value-optimistic algorithm and that of Kwon et al. (2021) is that we use a bonus depending on the variance of next-step values according to Bennett's inequality, instead of using Bernstein's inequality. This helps propagate optimism from the last step to the first step, avoiding the $\sqrt{H}$-dependency. We analyze these two solvers in a unified way, as their Bellman errors are of the same order. We derive the variance-dependent regret by upper-bounding the total variance of estimated alpha vectors using martingale concentration inequalities together with a truncation method.
New regret lower bound for LMDPs. We obtain a novel $\Omega(\sqrt{V \cdot M S A K})$ minimax regret lower bound for the class of LMDPs with $\mathrm{Var} \le O(V)$. This regret lower bound shows that the dependency on $M$ is necessary for LMDPs. Notably, the lower bound also implies that the $\widetilde{O}(\sqrt{\mathrm{Var} \cdot M \Gamma S A K})$ upper bound is optimal up to a $\sqrt{\Gamma}$ factor. Furthermore, our lower bound holds even for $\Gamma = 2$, which shows that our upper bound is minimax optimal for the class of LMDPs with $\Gamma = O(1)$. Maze navigation problems are a typical class with $\Gamma = O(1)$: the agent has at most 4 next states to transition into (up, down, left, right).
Our proof relies on new constructions of hard instances,
different from existing ones for MDPs (Domingues et al., 2021). In particular, we use a two-phase structure to construct hard instances (cf. Figure 1). Furthermore, the previous approaches for proving lower bounds for MDPs do not work on LMDPs. For example, in the MDP instance of Domingues et al. (2021), the randomness comes from the algorithm and the last transition step before entering the good state or bad state. In an LMDP, the randomness of sampling the MDP from multiple MDPs must also be considered. Such randomness not only dilutes the value function by averaging over each MDP, but also divides the pushforward measure (see Page 3 of Domingues et al. (2021)) into $M$ parts. As a result, the $M$ terms in the KL divergence in Equation (2) of Domingues et al. (2021) and those in Equation (10) cancel out, so the final lower bound does not contain $M$. To overcome this, we adopt the core spirit of the symmetrization technique from theoretical computer science. We randomly give a single MDP of an $M$-MDP LMDP to the agent, while making it unable to distinguish which position this MDP occupies in the whole system. The idea of reducing $M$ components to one component shares the same spirit as symmetrization (Section 1.1.1 of Phillips et al. (2012)). This novel technique helps generalize the bound from a single-party result to a multi-party result, which gives rise to a tighter lower bound.
2. Related Works
LMDPs. As shown by Steimle et al. (2021), in the general case, optimal policies for LMDPs are history-dependent and PSPACE-hard to find. This is different from standard MDPs, where there always exists an optimal history-independent policy. However, even finding the optimal history-independent policy is NP-hard (Littman, 1994). Chades et al. (2012) provided heuristics for finding the optimal history-independent policy.
Kwon et al. (2021) investigated the sample complexity and regret bounds of LMDPs. Specifically, they presented an exponential lower bound for general LMDPs without context in hindsight, and then derived algorithms with polynomial sample complexity and sub-linear regret for two special cases (context in hindsight, or $\delta$-strongly separated MDPs).
LMDPs have been studied as a type of multi-task RL (Taylor & Stone, 2009; Brunskill & Li, 2013; Liu et al., 2016; Hallak et al., 2015). They have been applied to model combinatorial optimization problems (Zhou et al., 2022). There are also related studies on model transfer (Lazaric, 2012; Zhang & Wang, 2021) and contextual decision processes (Jiang et al., 2017). In empirical works, LMDPs have wide applications in multi-task RL (Yu et al., 2020), meta RL (Iakovleva et al., 2020; Finn et al., 2018), latent-variable MDPs (Ramamoorthy et al., 2013) and hidden-parameter MDPs (Doshi-Velez & Konidaris, 2016; Yao et al., 2018).
Regret Analysis for MDPs. LMDPs are generalizations of MDPs, so some previous approaches to solving MDPs can provide insights. There is a long line of work on regret analysis for MDPs (Azar et al., 2017; Dann et al., 2017; 2019; Zanette & Brunskill, 2019; Zhang et al., 2021a). In this paper, we focus on time-homogeneous, finite-horizon, undiscounted MDPs whose total reward is upper-bounded by 1. Recent work showed that in this setting the regret can be (nearly) horizon-free for tabular MDPs (Wang et al., 2020; Zhang et al., 2021a; 2022; 2020; Ren et al., 2021). Importantly, these results indicate that RL may not be more difficult than bandits in the minimax sense. More recent work generalized the horizon-free results to other MDP problems (Zhang et al., 2021b; Kim et al., 2021; Tarbouriech et al., 2021; Zhou & Gu, 2022). However, all existing work with horizon-free guarantees only considered single-environment problems. Ours is the first horizon-free guarantee that goes beyond MDPs.
Neu & Pike-Burke (2020) summarized the "optimism in the face of uncertainty" (OFU) principle in RL. They named two types of optimism: model-optimistic algorithms construct confidence sets around the empirical transitions and rewards, and select the policy with the highest value under the best possible model in these sets; value-optimistic algorithms construct upper bounds on the optimal value functions, and select the policy which maximizes this optimistic value function. Our paper follows their idea and provides one algorithm for each type of optimism.
Variance-dependent regrets for MDPs. Variance-dependent regrets have been studied in the MDP setting. Talebi & Maillard (2018) provide a regret bound that scales with the variance of the next-step value functions under strong ergodicity assumptions on the MDP. Namely, they define $V_{s,a}$ for each $(s,a)$ pair and derive a regret of $\widetilde{O}\big(\sqrt{S \sum_{s,a} V_{s,a} T}\big)$ in the infinite-horizon setting.
Simchowitz & Jamieson (2019) combine gap-dependent regret with variances. The standard notion $\mathrm{gap}(s,a)$ is the gap between the optimal value function and the optimal Q-function, and $\mathrm{gap}_{\min}$ is the minimum non-zero gap. Letting $\mathrm{Var}^\star_{h,s,a}$ be the variance of the optimal value function at the $(h,s,a)$ triple, their regret approximately scales as
$$\widetilde{O}\left(\sum_{s,a} \frac{H \max_h \mathrm{Var}^\star_{h,s,a}}{\max\{\mathrm{gap}(s,a),\ \mathrm{gap}_{\min}\}} \log(K)\right).$$
Variance-aware bounds also exist for bandits (Kim et al., 2021; Zhang et al., 2021b; Zhou et al., 2021; Zhao et al., 2023; 2022) and linear MDPs (Li & Sun, 2023).
3. Problem Setup
In this section, we give a formal definition of Latent
Markov Decision Processes (Latent MDPs).
Notations. For any event $E$, we use $[E]$ to denote the indicator function, i.e., $[E] = 1$ if $E$ holds and $[E] = 0$ otherwise. For any set $X$, we use $\Delta(X)$ to denote the probability simplex over $X$. For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For any probability distribution $P$, we use $\mathrm{supp}(P) = \|P\|_0$ to denote the size of the support of $P$, i.e., $\sum_x [P(x) > 0]$. There are three ways to denote a $d$-dimensional vector (function): suppose $p$ is any parameter, then $x(p) = (x_1(p), x_2(p), \ldots, x_d(p))$ if the indices are natural numbers, while $x(\cdot \mid p) = (x(i_1 \mid p), x(i_2 \mid p), \ldots, x(i_d \mid p))$ and $x_p(\cdot) = (x_p(i_1), x_p(i_2), \ldots, x_p(i_d))$ if the indices are from the set $I = \{i_1, i_2, \ldots, i_d\}$. For any number $q$, we use $x^q$ to denote the vector $(x_1^q, x_2^q, \ldots, x_d^q)$. For two $d$-dimensional vectors $x$ and $y$, we use $x^\top y = \sum_i x_i y_i$ to denote the inner product. If $x$ is a probability distribution, we use $\mathbb{V}(x, y) = \sum_i x_i (y_i - x^\top y)^2 = x^\top (y^2) - (x^\top y)^2$ to denote the empirical variance. We use $\iota = 2 \ln\big(\tfrac{2MSAHK}{\delta}\big)$ as a log term, where $\delta$ is the confidence parameter.
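As a quick illustration of the empirical-variance notation $\mathbb{V}(x, y)$, the following minimal Python sketch (our own, not from the paper) computes it both ways and checks that they agree.

```python
import numpy as np

def empirical_variance(x: np.ndarray, y: np.ndarray) -> float:
    """V(x, y) = sum_i x_i (y_i - x^T y)^2 = x^T (y^2) - (x^T y)^2,
    where x is a probability distribution over the indices of y."""
    mean = x @ y
    return float(x @ (y ** 2) - mean ** 2)

# Example: a fair coin over rewards 0 and 1 has variance 1/4.
x = np.array([0.5, 0.5])
y = np.array([0.0, 1.0])
assert np.isclose(empirical_variance(x, y), np.sum(x * (y - x @ y) ** 2))
print(empirical_variance(x, y))  # 0.25
```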
3.1. Latent Markov Decision Process
A Latent MDP (Kwon et al., 2021) is a collection of finitely many MDPs $\mathcal{M} = \{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_M\}$, where $M = |\mathcal{M}|$. All the MDPs share the state set $\mathcal{S}$, action set $\mathcal{A}$ and horizon $H$. Each MDP $\mathcal{M}_m = (\mathcal{S}, \mathcal{A}, H, \nu_m, P_m, R_m)$ has its own initial state distribution $\nu_m \in \Delta(\mathcal{S})$, transition model $P_m : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ and deterministic reward function $R_m : \mathcal{S} \times \mathcal{A} \to [0, 1]$. Let $w_1, w_2, \ldots, w_M$ be the mixing weights of the MDPs, such that $w_m > 0$ for any $m$ and $\sum_{m=1}^{M} w_m = 1$.

Denote $S = |\mathcal{S}|$, $A = |\mathcal{A}|$ and $\Gamma = \max_{m,s,a} \mathrm{supp}(P_m(\cdot \mid s, a))$. $\Gamma$ can be interpreted as the maximum degree of each transition, which is a quantity our regret bound depends on. Note that we always have $\Gamma \le S$. In previous work, Lattimore & Hutter (2012) assume $\Gamma = 2$, and Fruit et al. (2020) also have a regret bound that scales with $\Gamma$.
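To make these objects concrete, here is a minimal sketch of an LMDP as $M$ tabular MDPs stored in shared arrays, together with the computation of $\Gamma$ as the maximum support size of any transition distribution. The container and field names are our own choices for illustration, not the paper's.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class LatentMDP:
    nu: np.ndarray  # shape (M, S): initial state distribution nu_m of each MDP
    P: np.ndarray   # shape (M, S, A, S): transition model P_m(s'|s, a)
    R: np.ndarray   # shape (M, S, A): deterministic rewards R_m(s, a) in [0, 1]
    w: np.ndarray   # shape (M,): mixing weights, positive and summing to 1
    H: int          # shared planning horizon

    def gamma(self) -> int:
        """Gamma = max over (m, s, a) of |supp(P_m(.|s, a))|, at most S."""
        return int((self.P > 0).sum(axis=-1).max())
```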
In the worst case, the optimal policy of an LMDP is history-dependent and is PSPACE-hard to find (Corollary 1 and Proposition 3 in Steimle et al. (2021)). Aside from the computational difficulty, storing a history-dependent policy requires exponentially large space, so it is generally impractical. In this paper, we seek to provide a result for any fixed policy class $\Pi$. For example, we can take $\Pi$ to be the set of all history-independent, deterministic policies to alleviate the space issue. Following previous work (Kwon et al., 2021), we assume access to oracles for planning and optimization. See Section 5 for the formal definitions.
We consider an episodic, finite-horizon and undiscounted reinforcement learning problem on LMDPs. In this problem, the agent interacts with the environment for $K$ episodes. At the start of every episode, one MDP $\mathcal{M}_m \in \mathcal{M}$ is randomly chosen with probability $w_m$. Throughout the episode, the true context is hidden, and the agent can only choose actions based on the history information up to the current time. However, at the end of each episode (after $H$ steps), the true context $m$ is revealed to the agent. This permits an unbiased model estimation for the LMDP. As in Cohen et al. (2020), where the central difficulty is estimating the transitions, we also focus on learning $P$ only. For simplicity, we assume that $w$ and $\nu$ are known to the agent, because they can be estimated easily.
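A minimal sketch of this interaction protocol (our own code, reusing the hypothetical `LatentMDP` container sketched earlier and an arbitrary history-dependent policy): the latent index $m$ is drawn once, hidden during the episode, and only revealed after the $H$-th step.

```python
import numpy as np

def run_episode(lmdp, policy, rng):
    """One episode of the protocol: sample m ~ w, roll out H steps of MDP m
    while hiding m from the policy, then reveal m in hindsight."""
    m = rng.choice(len(lmdp.w), p=lmdp.w)            # latent MDP, hidden from the agent
    s = rng.choice(lmdp.nu.shape[1], p=lmdp.nu[m])   # initial state ~ nu_m
    history, total_reward = [], 0.0
    for _ in range(lmdp.H):
        a = policy(history, s)                       # may depend on the full history
        r = lmdp.R[m, s, a]
        s_next = rng.choice(lmdp.P.shape[-1], p=lmdp.P[m, s, a])
        history.append((s, a, r))
        total_reward += r
        s = s_next
    return m, history, total_reward                  # context m revealed here
```

The revealed context $m$ is what a model-based learner would use to update its empirical estimate of $P_m$.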
Conditions on rewards. We assume that $R$ is deterministic and known to the agent. This assumption is for simplicity, and our analysis can be extended to unknown and bounded-support reward distributions. We study LMDPs in which the total reward within an episode is upper-bounded by 1 almost surely for any policy. This condition poses more difficulty for the design of a horizon-free algorithm, because the ordinary case of uniformly bounded rewards can be converted to total-bounded rewards by multiplying the rewards by $1/H$. Under this condition, an algorithm needs to handle a possible spike of reward at a certain step.
3.2. Value functions, Q-functions and alpha vectors
By convention, the expected reward of executing a policy on any MDP can be described by the value function $V$ and the Q-function $Q$. Since for MDPs there is always an optimal policy which is history-independent, $V$ and $Q$ only need the current state and action as parameters.
However, these notations fall short under the LMDP setting. The full information is encoded in the history, so here we use a more general notion called the alpha vector (following the notation in Kwon et al. (2021)). For any time $t \ge 1$, let $h_t = (s, a, r)_{1:t-1} s_t$ be the history up to time $t$. Define $\mathcal{H}_t$ as the set of histories observable at time step $t$, and $\mathcal{H} := \cup_{t=1}^{H} \mathcal{H}_t$ as the set of all possible histories. We define the alpha vectors $\alpha^\pi_m(h)$ for $(m, h) \in [M] \times \mathcal{H}$ as follows:
$$\alpha^\pi_m(h) := \mathbb{E}_{\pi, \mathcal{M}_m}\left[\sum_{t'=t}^{H} R_m(s_{t'}, a_{t'}) \,\middle|\, h_t = h\right],$$
$$\alpha^\pi_m(h, a) := \mathbb{E}_{\pi, \mathcal{M}_m}\left[\sum_{t'=t}^{H} R_m(s_{t'}, a_{t'}) \,\middle|\, (h_t, a_t) = (h, a)\right].$$
The alpha vectors are indeed the value functions and Q-functions on each individual MDP.
Next, we introduce the concept of the belief state to show how to do planning in an LMDP. Let $b_m(h)$ denote the belief state over the $M$ MDPs corresponding to a history $h$, i.e., the probability of the true MDP being $\mathcal{M}_m$ conditioned on observing history $h$. We have the following recursion:
$$b_m(s) = \frac{w_m \nu_m(s)}{\sum_{m'=1}^{M} w_{m'} \nu_{m'}(s)}, \qquad b_m(hars') = \frac{b_m(h)\, P_m(s' \mid s, a)\, [r = R_m(s, a)]}{\sum_{m'=1}^{M} b_{m'}(h)\, P_{m'}(s' \mid s, a)\, [r = R_{m'}(s, a)]}.$$
The value functions and Q-functions for an LMDP are defined via belief states and alpha vectors:
$$V^\pi(h) := b(h)^\top \alpha^\pi(h) \qquad \text{and} \qquad Q^\pi(h, a) := b(h)^\top \alpha^\pi(h, a).$$
Direct computation (see Appendix B.1) gives
$$V^\pi(h) = \sum_{a \in \mathcal{A}} \pi(a \mid h)\, Q^\pi(h, a) = \sum_{a \in \mathcal{A}} \pi(a \mid h) \Bigg( b(h)^\top R(s, a) + \sum_{s' \in \mathcal{S},\, r} \sum_{m'=1}^{M} b_{m'}(h)\, P_{m'}(s' \mid s, a)\, [r = R_{m'}(s, a)]\, V^\pi(hars') \Bigg).$$
So planning in an LMDP can be viewed as planning over belief states. For the optimal history-dependent policy, we can select
$$\pi^\star(h) = \operatorname*{arg\,max}_{a \in \mathcal{A}} \Bigg( b(h)^\top R(s, a) + \sum_{s' \in \mathcal{S},\, r} \sum_{m'=1}^{M} b_{m'}(h)\, P_{m'}(s' \mid s, a)\, [r = R_{m'}(s, a)]\, V^{\pi^\star}(hars') \Bigg), \quad (1)$$
using dynamic programming in descending order of the length of $h$.
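Equation (1) can be turned into a simple (and, in general, exponential-time) recursion, since for planning purposes the history enters only through the belief, the current state and the step index. The sketch below is our own illustration of this recursion, not the planning oracle assumed by the algorithms.

```python
import numpy as np

def optimal_value(lmdp, b, s, t):
    """V*(h) for a history summarized by (belief b over the M MDPs, current
    state s, step t), following the recursion in Equation (1).  The recursion
    enumerates reachable beliefs, so it is exponential in H in general."""
    if t > lmdp.H:
        return 0.0
    A, S = lmdp.R.shape[2], lmdp.P.shape[-1]
    best = -np.inf
    for a in range(A):
        q = float(b @ lmdp.R[:, s, a])                  # b(h)^T R(s, a)
        for r in np.unique(lmdp.R[:, s, a]):            # branch on observed reward
            mask = (lmdp.R[:, s, a] == r)
            for s_next in range(S):                     # branch on next state
                weight = b * mask * lmdp.P[:, s, a, s_next]
                p = weight.sum()
                if p > 1e-12:
                    q += p * optimal_value(lmdp, weight / p, s_next, t + 1)
        best = max(best, q)
    return best

# Usage: start from the initial belief conditioned on the first state s_1,
# e.g. optimal_value(lmdp, initial_belief(lmdp.w, lmdp.nu, s1), s1, 1).
```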
3.3. Performance measure
We use the cumulative regret to measure the algorithm's performance. The optimal policy is $\pi^\star = \arg\max_{\pi \in \Pi} V^\pi$, which also does not know the context when interacting with the LMDP. Suppose the agent interacts with the environment for $K$ episodes, and for each episode $k$ a policy $\pi_k$ is played. The regret is defined as
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \left(V^\star - V^{\pi_k}\right).$$
Since the planning problem for LMDPs is time-consuming, we may assume access to an efficient planning oracle (e.g., a greedy algorithm or an approximation algorithm) with the following performance guarantee (Kwon et al., 2021): given $\mathcal{M}$, the policy $\pi$ it returns satisfies $V^\pi_{\mathcal{M}} \ge \rho_1 V^\star_{\mathcal{M}} - \rho_2$. Then we define the regret as
$$\widetilde{\mathrm{Regret}}(K) := \sum_{k=1}^{K} \left(\rho_1 V^\star - \rho_2 - V^{\pi_k}\right).$$
Notice that $\widetilde{\mathrm{Regret}}(K) \le \rho_1 \mathrm{Regret}(K)$, so we essentially study the upper bound of $\mathrm{Regret}(K)$.
4. Variance for LMDPs
We introduce the maximum policy-value variance, which is
novel in the literature.
Definition 1. For any policy $\pi \in \Pi$, its maximum value variance is defined as
$$\mathrm{Var}^\pi := \mathbb{V}\big(w \circ \nu,\ \alpha^\pi_{\cdot}(\cdot)\big) + \mathbb{E}_\pi\left[\sum_{t=1}^{H} \mathbb{V}\big(P_m(\cdot \mid s_t, a_t),\ \alpha^\pi_m(h_t a_t r_t\, \cdot)\big)\right].$$
The maximum policy-value variance for a particular LMDP is defined as
$$\mathrm{Var} := \max_{\pi \in \Pi} \mathrm{Var}^\pi.$$
$\mathrm{Var}^\pi$ is the variance of the total reward of $\pi$; the justification can be found in Appendix B.2.
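Since Appendix B.2 identifies $\mathrm{Var}^\pi$ with the variance of the total episode reward of $\pi$, it can be estimated by Monte Carlo; a minimal sketch (our own code, reusing `run_episode` from the earlier protocol sketch):

```python
import numpy as np

def estimate_var_pi(lmdp, policy, n_episodes=100_000, seed=0):
    """Monte-Carlo estimate of Var^pi: the variance of the total episode reward
    of policy pi on the LMDP (the latent MDP is re-drawn every episode).
    Var is then the maximum of this quantity over the deterministic policies in Pi."""
    rng = np.random.default_rng(seed)
    returns = [run_episode(lmdp, policy, rng)[2] for _ in range(n_episodes)]
    return float(np.var(returns))
```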
5. Main Algorithms and Results
In this section, we present two algorithms and show their minimax regret guarantee. The first uses a Bernstein confidence set on the transition probabilities, which was first applied to SSP in Cohen et al. (2020) to derive a horizon-free regret. This algorithm uses a bi-level optimization oracle: for the inner layer, an oracle is needed to find the optimal policy inside $\Pi$ under a given LMDP; for the outer layer, an oracle finds the best transition model inside the confidence set, i.e., the one maximizing the optimal expected reward.
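To illustrate, a Bernstein-type confidence set constrains each transition entry around its empirical estimate with a radius driven by the empirical variance of that Bernoulli entry. The sketch below is our own illustration with placeholder constants; it is not the exact radius used in Algorithm 2.

```python
import numpy as np

def bernstein_radius(p_hat: np.ndarray, n: int, iota: float) -> np.ndarray:
    """Elementwise Bernstein-style confidence radius for the empirical
    transition probabilities p_hat (a distribution over next states) estimated
    from n samples; iota is the log term.  Constants are illustrative only."""
    n = max(n, 1)
    return np.sqrt(2.0 * p_hat * (1.0 - p_hat) * iota / n) + iota / n

# The confidence set for P_m(.|s, a) would then contain all distributions p with
# |p - p_hat| <= bernstein_radius(p_hat, n, iota) elementwise.
```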
The second adapts the Monotonic Value Propagation (MVP) algorithm (Zhang et al., 2021a) to LMDPs. This algorithm requires an oracle to solve an LMDP with dynamic bonuses: the bonus depends on the variance of the next-step alpha vector. Both algorithms enjoy the following regret guarantee.
Theorem 2. For both the Bernstein confidence set for LMDPs (Algorithm 1 combined with Algorithm 2) and Monotonic Value Propagation for LMDPs (Algorithm 1 combined with Algorithm 3), with probability at least $1 - \delta$, we have
$$\mathrm{Regret}(K) \le \widetilde{O}\left(\sqrt{\mathrm{Var} \cdot M \Gamma S A K} + M S^2 A\right).$$
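For the value-optimistic instantiation, the bonus at a state-action pair is driven by the empirical variance of the next-step alpha vector plus a lower-order $1/n$ term, in the spirit of Bennett's inequality. The sketch below is our own illustration with placeholder constants (it reuses `empirical_variance` from the notation sketch); it is not the exact bonus of Algorithm 3.

```python
import numpy as np

def bennett_bonus(p_hat: np.ndarray, alpha_next: np.ndarray,
                  n: int, iota: float, c1: float = 2.0, c2: float = 14.0) -> float:
    """Bennett-style exploration bonus for one (m, s, a) pair: scales with
    sqrt(V(p_hat, alpha_next) * iota / n) plus an iota / n correction.
    c1 and c2 are placeholder constants, not the paper's."""
    n = max(n, 1)
    var = empirical_variance(p_hat, alpha_next)   # V(p_hat, alpha_next)
    return c1 * np.sqrt(var * iota / n) + c2 * iota / n
```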